In a previous blog post and paper we presented a benchmark for evaluating generative text-to-image models, based on a large-scale preference dataset consisting of more than 2 million responses from real humans. This dataset was acquired in just a few days using Rapidata's unique platform, and in this post we will show how you can easily set up and run the annotation process to collect a huge preference dataset yourself.
The Data
The preference dataset is made up of a large number of pairwise comparisons between images generated by different models. For this demo, you can download a small dataset of images generated using Flux.1 [pro] and Stable Diffusion 3 Medium, available on our Hugging Face page. It contains the relevant images (images.zip) as well as a CSV file defining the matchups (matchups.csv).
Configuring and Starting the Annotation Process
For the annotation setup we will use the Rapidata API, which can easily be accessed through our Python package. To install the package, run:
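```bash
pip install rapidata
```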
If you are interested in learning more about the package, take a closer look at the documentation. However, this is not needed to follow along with this guide.
As a first step, import the necessary packages and create the client object that will be used in configuring the rest of the setup.
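A minimal sketch of this step is shown below; the constructor arguments are assumptions based on the credentials described next, so double-check them against the documentation.

```python
# Illustrative client setup -- argument names are assumptions, see the docs.
from rapidata import RapidataClient

rapi = RapidataClient(
    client_id="your-client-id",          # obtained as described below
    client_secret="your-client-secret",
)
```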
To obtain the credentials required for the client, first sign up on the Rapidata page (the easiest option is to sign up with a Google account). Afterwards, request the API credentials on our website using the same email address you used to sign up.
Next, initiate the setup by creating an order that will hold the configuration. Supply a meaningful name, and set the criterion that the rating will be based on. In this example we will go with text-image alignment.
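A sketch of what this could look like; the method and class names (new_order, CompareWorkflow) are assumptions about the SDK, and the criterion text is just one way of phrasing text-image alignment:

```python
# Illustrative order configuration -- method and class names are assumptions.
from rapidata import CompareWorkflow  # assumed import; consult the docs

order_builder = rapi.new_order(
    name="Flux.1 [pro] vs. Stable Diffusion 3 Medium - alignment",
).workflow(
    CompareWorkflow(criteria="Which image matches the description better?")
)
```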
Now, let us import the data from the downloaded dataset. The CSV is neatly formatted with each image pair and its respective prompt, making this straightforward. The images and prompts are added to the order as media and metadata, respectively. For demonstration purposes, I sample a subset of the pairs; with the response settings defined in the next step, this should allow the necessary number of responses to be collected in less than ten minutes. If you do not mind waiting, feel free to include more pairs.
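The sketch below assumes that matchups.csv contains columns named image1, image2 and prompt (verify the actual column names), that images.zip has been extracted into an images/ folder, and that the SDK exposes asset and metadata wrappers along the lines of MediaAsset, MultiAsset and PromptMetadata:

```python
import pandas as pd

from rapidata import MediaAsset, MultiAsset, PromptMetadata  # assumed SDK names

# Assumed column names (image1, image2, prompt) -- adjust to the real CSV.
matchups = pd.read_csv("matchups.csv")
subset = matchups.sample(n=20, random_state=42)  # small subset for the demo

# Wrap each image pair as media and each prompt as metadata.
media = [
    MultiAsset([MediaAsset(f"images/{row.image1}"), MediaAsset(f"images/{row.image2}")])
    for row in subset.itertuples()
]
metadata = [PromptMetadata(prompt=row.prompt) for row in subset.itertuples()]

order_builder = order_builder.media(media, metadata=metadata)
```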
The last step is to define how many responses are desired per matchup (in this case 15) and to add a validation set, which ensures that the labelers understand the question. We have prepared a predefined validation set for this specific task; however, validation sets can also be customized if needed. Consult the documentation or reach out for more information.
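Sketched below with the same caveat that the method names (and the placeholder validation set ID) are assumptions:

```python
# 15 responses per matchup plus the predefined validation set.
order_builder = (
    order_builder
    .responses(15)
    .validation_set_id("<validation-set-id>")  # placeholder -- use the real ID
)
```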
By calling .create(), the order is submitted for review and will be launched as soon as possible.
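For example (the order_id attribute name is an assumption):

```python
order = order_builder.create()
print(order.order_id)  # keep this ID in case you need to retrieve the order later
```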
Fetching the Results
You can follow your order's progress through the dashboard, and you can also download the results from there once the order is finished. Alternatively, you can fetch the results directly from the order object:
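For example (method name assumed; check the documentation for the exact call):

```python
# Assumed method name -- the SDK may expose the results differently.
results = order.get_results()
```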
If the kernel has been restarted, you can retrieve the order object using the order ID printed when the order was created:
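A sketch of what that could look like; the retrieval method is an assumption:

```python
# Assumed retrieval method -- the actual SDK call may differ.
order = rapi.get_order("<order-id>")
results = order.get_results()
```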
Analyzing the Results
The raw results come as a JSON object; for analysis purposes, we can convert them into a pandas DataFrame using the utility function get_df_from_results() sketched below.
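A rough reimplementation is shown below. It assumes each entry in the results JSON contains the prompt plus a per-image vote count; the field names ("results", "context", "summedUserScores") are guesses about the schema, so adjust them to whatever your downloaded JSON actually contains.

```python
import pandas as pd

def get_df_from_results(results: dict) -> pd.DataFrame:
    """Flatten the raw results JSON into one row per matchup.

    The field names used here are assumptions about the JSON layout
    and will likely need adjusting to the actual results file.
    """
    rows = []
    for entry in results.get("results", []):
        scores = entry.get("summedUserScores", {})
        (image_a, votes_a), (image_b, votes_b) = list(scores.items())
        rows.append({
            "prompt": entry.get("context"),
            "image_a": image_a,
            "votes_a": votes_a,
            "image_b": image_b,
            "votes_b": votes_b,
        })
    return pd.DataFrame(rows)
```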
To find a winner between the two models, we can, for example, look at which model received the most votes.
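Using the columns from the sketch above, and assuming image_a is always the Flux.1 [pro] generation and image_b the Stable Diffusion 3 Medium one (verify this against matchups.csv), the vote totals could be computed like this:

```python
df = get_df_from_results(results)

# Assumes a fixed model-to-column mapping -- verify against matchups.csv.
print("Flux.1 [pro] votes:", df["votes_a"].sum())
print("SD3 Medium votes:  ", df["votes_b"].sum())
```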
Visualization
The following function provides a simple visualization of the individual matchups and the votes they received.
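A possible implementation with matplotlib is sketched below, again assuming the DataFrame columns from the earlier sketch and that the image filenames in the results resolve to files in the local images/ folder:

```python
import matplotlib.pyplot as plt
from PIL import Image

def plot_image_comparison(row, image_dir="images"):
    """Show one matchup: both images side by side with their vote counts.

    Assumes the column names produced by get_df_from_results() above and
    that the image filenames exist inside `image_dir`.
    """
    fig, axes = plt.subplots(1, 2, figsize=(8, 4))
    for ax, image_col, votes_col in [
        (axes[0], "image_a", "votes_a"),
        (axes[1], "image_b", "votes_b"),
    ]:
        ax.imshow(Image.open(f"{image_dir}/{row[image_col]}"))
        ax.set_title(f"{row[image_col]} - {row[votes_col]} votes")
        ax.axis("off")
    fig.suptitle(row["prompt"])
    plt.show()

# Example: visualize the first matchup
plot_image_comparison(df.iloc[0])
```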
Conclusion
In this blog post you have seen how easily you can start collecting preference data from real humans through the Rapidata API with just a few lines of code. This guide serves as a starting point; you are now ready to customize the setup to your specific needs. If you have any questions or need help, feel free to reach out to us at info@rapidata.ai.