Beyond Image Preferences - Rich Human Feedback

Introduction

The recent success of learning from human preferences, particularly in the domain of large language models (LLMs), has inspired similar methodologies for advancing text-to-image generation models. While simple preference annotations, indicating which of two images is better, are valuable, they often lack information for deeper analysis. For example, a preference annotation may not clarify whether an annotator chose one image over another because of its visually pleasing style, greater realism, or superior alignment with the text prompt. To address this gap, using the modalities presented in the work by Google Research, we present a dataset that captures detailed human feedback for text-to-image generation.

TL;DR: We collected 1.5 million annotations from more than 150 thousand individual annotators using Rapidata via the Python API to build a dataset of detailed human feedback for text-to-image generation. The dataset contains:

  • Image ratings on a Likert scale for each of three criteria: style, coherence, and prompt alignment
  • Annotations indicating which words in the prompt are most often considered misaligned or misrepresented
  • Heatmaps highlighting the image areas with high misalignment or incoherence

See how to easily collect your own rich feedback annotations at the end.

Annotation Setup

To provide richer feedback for training and evaluating text-to-image generation models, our dataset uses a multi-dimensional annotation approach based on the method described by Google Research in their paper. The core of our annotation process involves:

  1. Likert Scale Ratings: Annotators rate images on a 1-5 scale across three criteria:
    • Style: How visually appealing is the image? Does it exhibit a pleasing aesthetic? 5 is the most appealing and 1 the least.
    • Coherence: Does the image make sense? How prominent are common ‘AI artifacts’ such as implausible elements or distorted objects/limbs? We found that asking annotators to rate how erroneous the image is yielded the best results, i.e., 5 for the most errors (very incoherent) and 1 for no errors (perfectly coherent). For consistency with the other criteria, the scale is afterwards inverted so that 5 indicates high coherence and 1 low coherence.
    • Alignment: How well does the image match the given text prompt? A score of 5 indicates that the image matches the prompt very well and 1 indicates very poor alignment.
  2. Misaligned Word Selection: To address cases where images fail to fully reflect the prompt, annotators are asked to select words from the prompt that are misaligned or missing in the image, if any. The images below illustrate the interface annotators are shown.

  3. Image Area Selection: For images rated poorly on coherence or alignment, annotators pinpoint specific problem areas in the image. Each annotator can select up to three distinct points in an image.
    • For coherence, annotators highlight artifacts, malformed objects, or implausible elements.
    • For alignment, annotators highlight parts of the image that do not align with the text prompt.

By combining these modalities, the dataset provides insights not just into whether an image is good or bad, but also why it may fail to meet expectations.

Input images

All 13k images for this dataset are generated using recent text-to-image generation models. Roughly half of the images were originally generated for our text-to-image model benchmark. In brief, this includes ~6.5k different images generated based on 283 selected prompts using the following models:

  • Flux1.1-pro
  • Flux1-pro
  • DALL-E 2
  • Stable Diffusion 3
  • Imagen 3
  • Midjourney 5.2

More details are available in the datasets linked above and in our paper about the benchmark.

The other half of the images were generated with Stable Diffusion 3.5-Large from a selection of ~3k prompts from DiffusionDB.

We intentionally include each prompt at least twice, which allows analysis of how outputs differ for the same prompt (and even the same model).

Results

The dataset is based on 1.5 million annotations coming from 152,684 different annotators around the world, ensuring good diversity to provide broadly representative results and avoid annotator bias.

The final style, coherence, and alignment scores were computed as the average of the scores assigned by the annotators for each image. Each image was rated at least 20 times for each criterion by different annotators. The distributions of the scores are shown in the figure below. We additionally provide a normalised score between 0 and 1.
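The aggregation can be sketched as follows. This is a minimal illustration rather than our exact pipeline: the ratings are made up, the inversion `low + high - score` and the linear normalisation `(mean - 1) / 4` are assumptions about how the 1-5 scale maps to the published values, and the sketch uses a plain unweighted mean.

```python
def invert_likert(score, low=1, high=5):
    """Flip a Likert rating so that `high` becomes `low` and vice versa (assumed mapping)."""
    return low + high - score


def aggregate(ratings, low=1, high=5):
    """Return (mean rating, mean rescaled linearly to the range [0, 1])."""
    avg = sum(ratings) / len(ratings)
    return avg, (avg - low) / (high - low)


# Hypothetical raw ratings for a single image.
style_ratings = [4, 5, 3, 4, 4]
error_ratings = [1, 2, 1, 1, 2]  # coherence is collected as "how erroneous", 5 = most errors

style_mean, style_norm = aggregate(style_ratings)
coherence_mean, coherence_norm = aggregate([invert_likert(r) for r in error_ratings])
```

Inverting before averaging keeps all three criteria on the same orientation, so a higher number always means a better image.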

Based on these ratings, the images with low scores for coherence or alignment were annotated with the area selection described above, and heatmaps were created from the selected points. The heatmap for each image is the average of the responses from all its annotators. For each selected point, we place a Gaussian centered at that point with a standard deviation equal to 5% of the shorter side of the image. Example heatmaps are shown below.
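A minimal sketch of this heatmap construction is shown below, using NumPy. The function name `point_heatmap`, the pixel-coordinate input format, and the choice of unnormalised Gaussians with a peak value of 1 are assumptions made for illustration. The example points are made up.

```python
import numpy as np


def point_heatmap(points, width, height, sigma_frac=0.05):
    """Average one Gaussian bump per selected (x, y) pixel coordinate."""
    sigma = sigma_frac * min(width, height)  # 5% of the shorter image side
    ys, xs = np.mgrid[0:height, 0:width]     # pixel coordinate grids
    heatmap = np.zeros((height, width), dtype=np.float64)
    for px, py in points:
        squared_dist = (xs - px) ** 2 + (ys - py) ** 2
        heatmap += np.exp(-squared_dist / (2 * sigma ** 2))
    # Average over all selections; guard against an empty list.
    return heatmap / max(len(points), 1)


# Hypothetical selections pooled over all annotators of one 200x100 image.
points = [(50, 40), (52, 42), (150, 60)]
heatmap = point_heatmap(points, width=200, height=100)
```

Because the bumps are averaged, a region only approaches a value of 1 when most selections cluster there, so the map reads as a rough agreement measure.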

Similarly, the misaligned word selection can be visualized using a colormap to show which words are selected more often. Note that due to the relatively large number of annotators per image, the results can be interpreted as a distribution showing how misaligned each word is. E.g., words that may be slightly misaligned may be selected by some annotators, but not all, whereas words that are more obviously misaligned will be selected by significantly more. Example images are provided below.

The dataset can be previewed below and is freely available on Hugging Face. Note that for improved quality, the results we provide are weighted by trust scores that we have assigned to the annotators based on their previous performance. This is why, e.g., the selected word ‘counts’ are not integers.
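To illustrate why weighted counts are not integers, here is a small sketch. The actual trust-score scheme is not described here, so the weights, the response format, and the helper name `weighted_word_counts` are all hypothetical.

```python
def weighted_word_counts(words, responses):
    """Sum annotator trust weights for each selected word.

    `responses` is a list of (trust_weight, selected_word_indices) pairs.
    """
    counts = [0.0] * len(words)
    for weight, selected in responses:
        for idx in selected:
            counts[idx] += weight
    return list(zip(words, counts))


# Hypothetical prompt and responses; the weights are made-up trust scores.
words = "two cats and a dog".split()
responses = [
    (0.9, [0]),     # trusted annotator selects "two"
    (0.6, [0, 1]),  # less trusted annotator selects "two" and "cats"
    (0.8, []),      # this annotator found no mistakes
]
counts = weighted_word_counts(words, responses)
```

Dividing each count by the total weight of all responses would give a per-word selection rate, which is one way to read the result as a distribution.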

Discussion and future work

As the Google researchers also note, alignment heatmap annotations are inherently somewhat ambiguous. For example, in the leftmost image above, it is clear that the image has one too few cats and an extra dog. However, what to select in the image to convey that is not immediately clear. In this case, most annotators have selected the dog in the foreground.

Using the second image from the left as an example, it can also be argued that the point-highlighting approach used so far is somewhat limiting. In this image, one could argue that the whole toilet, including the seat, is misaligned due to the wrong colors. However, this is difficult to mark using just a few points. In the future, it could be interesting to experiment with letting annotators draw freely to highlight the erroneous areas.

We hope this dataset contributes to a deeper and more nuanced understanding of text-to-image models and their limitations. By shedding light on these models' strengths and weaknesses, we aim to inspire further innovation and drive their development.

If the reception of this dataset is positive, we plan to expand it continuously with new data to ensure it remains a valuable resource for the community.

Replicating the Annotation Setup

Researchers interested in producing their own rich preference dataset can use the Rapidata API directly through Python. The code snippets below show how to replicate the modalities used in the dataset. Additional information is available in the documentation.

Creating the Rapidata Client and Downloading the Dataset

First install the rapidata package, then create the RapidataClient(), which will be used to create and launch the annotation setup.

    pip install rapidata

    from rapidata import RapidataClient, LabelingSelection, ValidationSelection

    client = RapidataClient()

As example data we will just use images from the dataset. Make sure to set streaming=True as downloading the whole dataset might take a significant amount of time.

    from datasets import load_dataset

    ds = load_dataset("Rapidata/text-2-image-Rich-Human-Feedback", split="train", streaming=True)
    ds = ds.select_columns(["image", "prompt"])

Since we use streaming, we can extract the prompts and download the images we need like this:

    import os

    tmp_folder = "demo_images"

    # make folder if it doesn't exist
    os.makedirs(tmp_folder, exist_ok=True)

    prompts = []
    image_paths = []
    for i, row in enumerate(ds.take(10)):
        prompts.append(row["prompt"])
        # save image to disk
        save_path = os.path.join(tmp_folder, f"{i}.jpg")
        row["image"].save(save_path)
        image_paths.append(save_path)

Likert Scale Alignment Score

To launch a Likert scale annotation order, we make use of the classification annotation modality. Below we show the setup for the alignment criterion. The structure is the same for style and coherence, but the arguments have to be adjusted, i.e., different instructions, answer options, and validation set.

    # Alignment Example
    instruction = "How well does the image match the description?"
    answer_options = [
        "1: Not at all",
        "2: A little",
        "3: Moderately",
        "4: Very well",
        "5: Perfectly",
    ]

    order = client.order.create_classification_order(
        name="Alignment Example",
        instruction=instruction,
        answer_options=answer_options,
        datapoints=image_paths,
        contexts=prompts,  # for alignment, prompts are required as context for the annotators.
        responses_per_datapoint=10,
        # here we use a pre-defined validation set.
        # See https://docs.rapidata.ai/improve_order_quality/ for details
        selections=[ValidationSelection("676199a5ef7af86285630ea6"), LabelingSelection(1)],
    )

    order.run()  # This starts the order. Follow the printed link to see progress.

Alignment Heatmap

To produce heatmaps, we use the locate annotation modality. Below is the setup used for creating the alignment heatmaps.

    # alignment heatmap
    # Note that the selected images may not actually have severely misaligned elements,
    # but this is just for demonstration purposes.
    order = client.order.create_locate_order(
        name="Alignment Heatmap Example",
        instruction="What part of the image does not match with the description? Tap to select.",
        datapoints=image_paths,
        contexts=prompts,  # for alignment, prompts are required as context for the annotators.
        responses_per_datapoint=10,
        # here we use a pre-defined validation set for alignment.
        # See https://docs.rapidata.ai/improve_order_quality/ for details
        selections=[ValidationSelection("67689e58026456ec851f51f8"), LabelingSelection(1)],
    )

    order.run()  # This starts the order. Follow the printed link to see progress.

Select Misaligned Words

To launch the annotation setup for selecting misaligned words, we used the following setup:

    # Select words example
    from rapidata import LanguageFilter

    select_words_prompts = [p + " [No_Mistake]" for p in prompts]

    order = client.order.create_select_words_order(
        name="Select Words Example",
        instruction="The image is based on the text below. Select mistakes, i.e., words that are not aligned with the image.",
        datapoints=image_paths,
        sentences=select_words_prompts,
        responses_per_datapoint=10,
        # here we add a filter to ensure only english speaking annotators are selected
        filters=[LanguageFilter(["en"])],
        # here we use a pre-defined validation set.
        # See https://docs.rapidata.ai/improve_order_quality/ for details
        selections=[ValidationSelection("6761a86eef7af86285630ea8"), LabelingSelection(1)],
    )

    order.run()