On-Demand Human Preference Data for AI Training

In a previous blog post and paper, we presented a benchmark for evaluating generative text-to-image models based on a large-scale preference dataset of more than 2 million responses from real humans. This dataset was collected in just a few days using Rapidata's unique platform, and in this post we will show how you can easily set up and run the annotation process to collect a large preference dataset yourself.

The Data

The preference dataset consists of a large number of pairwise comparisons between images generated by different models. For this demo, you can download a small dataset of images generated using Flux.1 [pro] and Stable Diffusion 3 Medium, available on our Hugging Face page. The dataset contains the images (images.zip) as well as a CSV file defining the matchups (matchups.csv).
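To get a feel for the matchup file before running anything, you can load it with pandas. The snippet below is a sketch with made-up file names and prompts, but the column names (image1, image2, prompt) are the ones used throughout this guide:

```python
import io
import pandas as pd

# Illustrative stand-in for matchups.csv (file names and prompts are made up).
csv_text = """image1,image2,prompt
flux_0_1.png,stable_diffusion_0_2.png,A red fox in the snow
flux_1_1.png,stable_diffusion_1_2.png,A castle at sunset
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.columns.tolist())  # ['image1', 'image2', 'prompt']
print(len(df))              # 2
```

With the real file, simply pass the path to matchups.csv to pd.read_csv instead of the StringIO.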

Configuring and Starting the Annotation Process

For the annotation setup, we will use the Rapidata API through our Python package. To install the package, run

pip install rapidata

If you are interested in learning more about the package, take a closer look at the documentation. However, this is not needed to follow along with this guide.

  1. As a first step, import the necessary packages and create the client object that will be used to configure the rest of the setup.
    • To obtain the credentials required for the client, first sign up on the Rapidata page (the easiest option is to sign up with a Google account). Afterwards, request the API credentials on our website using the same email address you used to sign up.
import os
from dotenv import load_dotenv
from rapidata import RapidataClient, PromptMetadata
import pandas as pd

load_dotenv()
client_id = os.getenv('RAPIDATA_CLIENT_ID', "")
client_secret = os.getenv('RAPIDATA_CLIENT_SECRET', "")

client = RapidataClient(client_id=client_id, client_secret=client_secret)
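The client reads the credentials from environment variables via load_dotenv(). A minimal .env file in the working directory might look like this (placeholder values):

```
RAPIDATA_CLIENT_ID=your_client_id
RAPIDATA_CLIENT_SECRET=your_client_secret
```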
  2. Next, initiate the setup by creating an order that will hold the configuration. Supply a meaningful name, and set the criterion that the rating will be based on. In this example, we go with text-image alignment.
order_builder = client.create_compare_order("Benchmark Demo")

order_builder = order_builder.criteria("Which image fits the description better?")
  3. Now, let us import the data from the downloaded dataset. The CSV is neatly formatted, with each image pair and its respective prompt, making this straightforward. The images and prompts are added to the order as media and metadata, respectively. For demonstration purposes, we sample a subset of the pairs; based on the settings presented in step 4, this should allow the necessary number of responses to be collected in less than ten minutes. If you do not mind waiting, feel free to include more pairs.
example_csv_path = "/path/to/matchups.csv"
media_path = "/path/to/images"

df = pd.read_csv(example_csv_path)

media = []
prompts = []
# Sample 40 matchups; pair up the two image paths and attach the prompt as metadata.
for index, row in df.sample(40).iterrows():
    media.append([os.path.join(media_path, row["image1"]), os.path.join(media_path, row["image2"])])
    prompts.append(PromptMetadata(row["prompt"]))

order_builder = order_builder.media(media)
order_builder = order_builder.metadata(prompts)
  4. The last step is to define how many responses are desired per matchup (in this case 15) and to add a validation set, which ensures that the labelers understand the question. We have prepared a predefined validation set for this specific task; however, these can also be customized if needed. Consult the documentation or reach out for more information. Calling .create() submits the order for review, and it will be launched as soon as possible.
order_builder = order_builder.responses(15)
order_builder = order_builder.validation_set_id("66d0e7ea8c33f56b460ea91f")
order = order_builder.create()
print(order.order_id)

Fetching the Results

You can follow your order's progress through the dashboard and download the results from there once the order is finished. Alternatively, you can fetch the results from the order object:

results = order.get_results()

If the kernel has been restarted, you can retrieve the order object using the order ID printed when the order was created.

client = RapidataClient(client_id=client_id, client_secret=client_secret)
order_id = "your_order_id"  # as a string
order = client.get_order(order_id)
results = order.get_results()

Analyzing the Results

The raw results come as a JSON object; for analysis purposes, we can convert it into a pandas DataFrame using the following utility function.
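For reference, the raw results object is structured roughly as follows. The file names and vote counts here are illustrative assumptions, but the image-name convention (model name followed by two underscore-separated suffixes) matches what the parsing relies on:

```python
# Hypothetical minimal results object (values are illustrative).
example_results = {
    "results": [
        {
            "prompt": "A red fox in the snow",
            "aggregatedResults": [
                {"imageName": "flux_0_1.png", "votes": 9},
                {"imageName": "stable_diffusion_0_2.png", "votes": 6},
            ],
        }
    ]
}

# The model name is everything before the last two underscore-separated parts.
entry = example_results["results"][0]["aggregatedResults"][0]
print("_".join(entry["imageName"].split("_")[:-2]))  # flux
```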

The utility function, get_df_from_results():
def get_df_from_results(results):
    res = []
    for r in results["results"]:
        prompt = r["prompt"]
        agg_res = r["aggregatedResults"]
        votes = {}
        image_names = {}
        for i in agg_res:
            # The model name is everything before the last two underscore-separated parts of the file name.
            model = "_".join(i["imageName"].split("_")[:-2])
            votes[model] = i["votes"]
            image_names[model] = i["imageName"]
        res.append({"prompt": prompt,
                    "image_flux": image_names.get("flux", ""),
                    "image_stable_diffusion": image_names.get("stable_diffusion", ""),
                    "flux": votes.get("flux", 0),
                    "stable_diffusion": votes.get("stable_diffusion", 0)})

    df = pd.DataFrame(res)

    # Vote share of the winning model in each matchup (always >= 0.5, regardless of which model won).
    df["ratio"] = df[["flux", "stable_diffusion"]].max(axis=1) / df[["flux", "stable_diffusion"]].sum(axis=1)

    # Sort with the most one-sided matchups first.
    df = df.sort_values("ratio", ascending=False)

    return df

results_df = get_df_from_results(results)

To find a winner between the two models, we can, for example, look at which model received the most votes overall.

def get_votes_per_model(df):
    return df[['flux', 'stable_diffusion']].sum()

votes_per_model = get_votes_per_model(results_df)
print(votes_per_model)
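Summing raw votes can be dominated by a few one-sided matchups. As a complementary metric, you could also count matchup wins; the helper below is a hypothetical sketch, assuming the same column names as results_df:

```python
import pandas as pd

def get_wins_per_model(df):
    # Count matchups in which each model received strictly more votes than the other.
    return pd.Series({
        "flux": int((df["flux"] > df["stable_diffusion"]).sum()),
        "stable_diffusion": int((df["stable_diffusion"] > df["flux"]).sum()),
    })

# Illustrative data: flux wins two matchups, stable_diffusion wins one.
demo = pd.DataFrame({"flux": [10, 9, 4], "stable_diffusion": [5, 6, 11]})
print(get_wins_per_model(demo))
```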

Visualization

The following function provides a simple visualization of the individual matchups and the votes they received, similar to the image shown below.

The utility function, plot_image_comparison():
import matplotlib.pyplot as plt
import numpy as np

def plot_image_comparison(prompt, image1_path, image2_path, votes1, votes2):
    # Create figure and axes: images on top, a thin vote bar below
    fig, (ax_images, ax_bar) = plt.subplots(2, 1, gridspec_kw={'height_ratios': [24, 1], 'hspace': 0})

    # Load images
    img1 = plt.imread(image1_path)
    img2 = plt.imread(image2_path)

    # Display images side by side
    ax_images.imshow(np.hstack((img1, img2)))


    text_settings = {
        'horizontalalignment': 'center',
        'verticalalignment': 'bottom',
        'transform': ax_images.transAxes,
        'fontsize': 13,
        'wrap': True
    }
    bbox_settings = {
        'alpha': 0.75,
        'edgecolor': 'none',
        'boxstyle': 'round,pad=0.2'  # This adds rounded corners
    }
    # Label the images with the model names
    ax_images.text(0.25, 0.9, 'Flux.1', **text_settings, bbox=dict(facecolor='#00ecbb', **bbox_settings))
    ax_images.text(0.75, 0.9, 'Stable Diffusion', **text_settings, bbox=dict(facecolor='#803bff', **bbox_settings))

    txt = ax_images.text(0.5, 0.05, prompt, 
                   horizontalalignment='center',
                   verticalalignment='bottom',
                   transform=ax_images.transAxes,
                   fontsize=13,
                   bbox=dict(facecolor='white', alpha=0.8, edgecolor='none'),
                   wrap=True)


    # Force matplotlib to wrap the prompt text at a fixed pixel width
    txt._get_wrap_line_width = lambda: 525
    ax_images.axis('off')

    # Calculate vote percentages
    total_votes = votes1 + votes2
    percent1 = votes1 / total_votes * 100
    percent2 = votes2 / total_votes * 100

    # Create horizontal bar for votes
    ax_bar.barh(y=0, width=percent1, height=0.5, align='center', color='#00ecbb', alpha=0.6)
    ax_bar.barh(y=0, width=percent2, height=0.5, align='center', color='#6400f9', alpha=0.6, left=percent1)

    # Configure bar axis
    ax_bar.set_xlim(0, 100)
    ax_bar.set_ylim(-0.25, 0.25)
    ax_bar.axis('off')  # Remove all axes

    # Adjust layout and reduce space between subplots
    plt.tight_layout()
    plt.subplots_adjust(top=0.67, bottom=0.0)

    plt.show()
for i, row in results_df[:5].iterrows():
    if isinstance(row['image_flux'], str) and isinstance(row['image_stable_diffusion'], str):
        plot_image_comparison(row['prompt'],
                              os.path.join(media_path, 'flux', row['image_flux']),
                              os.path.join(media_path, 'stable_diffusion', row['image_stable_diffusion']),
                              row['flux'], row['stable_diffusion'])

Conclusion

Through this blog post, you have seen how easily you can start collecting preference data from real humans through the Rapidata API with just a few lines of code. This guide serves as a starting point; you are now ready to customize the setup to your specific needs. If you have any questions or need help, feel free to reach out to us at info@rapidata.ai.