Fixing public datasets — Re-annotating Animals-10

Public datasets are surely handy, but are they always dandy? We found that many large public datasets contain some amount of erroneous data. However, scrolling through thousands of images manually to find it is quite time-consuming. Luckily, we can parallelize this task by having thousands of people go through a dataset with Rapidata.

Dataset

For this post, we will have a look at Animals-10. The description says:

All the images have been collected from "google images" and have been checked by human. There is some erroneous data to simulate real conditions (eg. images taken by users of your app).

Well, I'd rather have a clean dataset and inject the noise only when I need it. Let's fix that.

Essentially, all we need is for people to look at an image and specify whether it shows one of the ten classes or something else. Rapidata's classify order is perfect for this task. Below we will show you how to set up a project to re-annotate the dataset. But first, let's look at the outcome.

Results

In total, out of the 23,999 images, 483 had an incorrect label. That's about 2% of the dataset! The errors fall into two buckets: images that should have been labeled as another of the 10 classes, and images that do not belong in the dataset at all.

Here are a few examples of image labels that were fixed:

- Understandable mistake (original: sheep, new: dog)
- Fluffy like a sheep (original: sheep, new: cow)
- C'mon, this one is easy (original: cat, new: dog)
- A doggo roll (original: squirrel, new: dog)
- Not a dogfight (original: dog, new: chicken)
- You can ride on more than horses (original: horse, new: cow)

Other categories

Here are some images that don't fit into any of the 10 classes, which was the much bigger bucket:

- Colorful as a butterfly, but actually a Eurasian Hoopoe (original: butterfly, new: something else)
- This might be one in the future (original: butterfly, new: something else)
- What's up? (original: butterfly, new: something else)
- Homo Sapiens (original: cow, new: something else)
- Fancy speakers (original: chicken, new: something else)
- Maybe it has a few horses under the hood (original: horse, new: something else)

Evidently, we successfully identified quite a few mislabelings. The corrected dataset is now available to be explored and downloaded on Hugging Face.
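If you want to explore the corrected labels programmatically, loading them with the datasets library would look something like this. Note that the dataset id below is a placeholder, not the actual name; check our Hugging Face page for the real one.

from datasets import load_dataset

# Placeholder dataset id; look up the actual name on Hugging Face.
dataset = load_dataset("Rapidata/animals-10-corrected")
print(dataset["train"][0])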

Setting up the project

With Rapidata, the labelers solve so-called Rapids, which are small tasks that can be solved in a few seconds. To get the highest quality, it is best practice to include some validation Rapids from the same dataset. A validation Rapid looks just like a normal Rapid, except that we know the true answer. If a labeler answers it incorrectly, they are penalized, and their internal trust score drops.

To select a few validation images, I quickly checked the first 20 images of each category. They all seem fine, except that cow_0012 is a bull, spider_0014 is some other kind of arthropod, and horse_0004 shows two donkeys copulating.

- cow_0012
- spider_0014
- horse_0004

All in all, we can use this to build a comprehensive validation set that won't be easy to learn by heart, even if a single person solves a lot of Rapids.

First of all, we have to create a client to interact with the Rapidata API. If you don't have credentials yet, quickly ping us at info@rapidata.ai.

import os

from rapidata import (
    ClassifyEarlyStoppingReferee,
    ClassifyWorkflow,
    LabelingSelection,
    MediaAsset,
    RapidataClient,
    ValidationSelection,
)

rapi = RapidataClient(
    client_id=os.environ["CLIENT_ID"],
    client_secret=os.environ["CLIENT_SECRET"],
)

Then we can define the categories and the path to the images. Presenting labelers with too many categories at once can be overwhelming, so it's best to use a maximum of 8 categories. Since we have 10 classes at hand, we first label 6 of them and add a 7th catch-all category that we call "Something else". Images from any of the remaining 4 classes should then end up labeled as "Something else". In a second step, we re-label those images with the remaining 4 categories.

MEDIA_PATH = "data/animals10"  # adjust to wherever your copy of the dataset lives

CATEGORIES_BATCH_1 = ["Cat", "Dog", "Cow", "Spider", "Butterfly", "Elephant"]
CATEGORIES_BATCH_2 = ["Horse", "Chicken", "Sheep", "Squirrel"]
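As a quick sanity check (our own addition, not required by the SDK), the two batches together should cover all ten classes with no duplicates:

# The two batches must partition the 10 classes.
assert len(set(CATEGORIES_BATCH_1 + CATEGORIES_BATCH_2)) == 10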

Creating a validation set

Now we can make a function that creates a validation set based on the existing dataset. For the first batch of categories, we map the categories of the second batch to "Something else". We also skip the images that we just found to be incorrectly labeled.

def make_validation_set(rapi, name, categories, other_categories, num_images_per_category=20):

    validation_set_builder = rapi.new_validation_set(name)

    category_selection = categories.copy()
    category_selection.append("Something else")

    # Images from the other batch's categories serve as "Something else" truths.
    for category in other_categories:
        for i in range(1, num_images_per_category // len(other_categories) + 1):
            validation_set_builder.add_classify_rapid(
                asset=MediaAsset(f"{MEDIA_PATH}/{category.lower()}/{category.lower()}_{i:04d}.jpeg"),
                question="What is shown in the image?",
                categories=category_selection,
                truths=["Something else"],
            )

    for category in categories:
        for i in range(1, num_images_per_category + 1):
            if category == "Horse" and i == 4:
                continue  # Skip horse_0004: it shows two donkeys
            if category == "Spider" and i == 14:
                continue  # Skip spider_0014: not actually a spider
            if category == "Cow" and i == 12:
                continue  # Skip cow_0012: it's a bull
            validation_set_builder.add_classify_rapid(
                asset=MediaAsset(
                    f"{MEDIA_PATH}/{category.lower()}/{category.lower()}_{i:04d}.jpeg"
                ),
                question="What is shown in the image?",
                categories=category_selection,
                truths=[category],
            )

    return validation_set_builder.create()

validation_set_batch_1 = make_validation_set(rapi, "Animal 10 validation for Batch 1", CATEGORIES_BATCH_1, CATEGORIES_BATCH_2)
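For the second batch, we can reuse the same helper with the two category lists swapped, so that images from the first six classes serve as the "Something else" truths:

validation_set_batch_2 = make_validation_set(rapi, "Animal 10 validation for Batch 2", CATEGORIES_BATCH_2, CATEGORIES_BATCH_1)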

Starting the order

Setting up the actual labeling order is now relatively simple. First we get all the image paths:

import glob
import random

image_paths = []

# Batch 1 goes over the full dataset: the six batch-1 classes plus all the
# images that should end up as "Something else".
for category in CATEGORIES_BATCH_1 + CATEGORIES_BATCH_2:
    category_path = f"{MEDIA_PATH}/{category.lower()}/*.jpeg"
    all_images = sorted(glob.glob(category_path))  # glob does not guarantee order

    # Exclude the first 20 images (index 0-19), which were used for validation
    available_images = all_images[20:]

    image_paths.extend(available_images)

random.shuffle(image_paths)

Then we can create the order:

order = (
    rapi.new_order(
        name="Animal 10 Full Batch 1",
    )
    .workflow(
        ClassifyWorkflow(
            question="What is shown in the image?",
            options=CATEGORIES_BATCH_1 + ["Something else"],
        )
    )
    .media([MediaAsset(path) for path in image_paths])
    .referee(ClassifyEarlyStoppingReferee(threshold=0.97, max_vote_count=20))
    .selections(
        [
            ValidationSelection(amount=1, validation_set_id=validation_set_batch_1.id),
            LabelingSelection(amount=2),
        ]
    )
    .priority(5)
    .create()
)

The ClassifyEarlyStoppingReferee is configured to stop as soon as the estimated probability that a certain category is correct exceeds 97%, which is often achieved with 3-4 human labels. If these labels are ambiguous, more labels will be collected, but never more than 20 for a single image.
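To build some intuition for these numbers, here is a toy Bayesian calculation (our own simplification, not Rapidata's actual referee): assume each labeler picks the true category with probability p = 0.9 and otherwise errs uniformly over the other options, with a uniform prior over the 7 categories.

# Toy model, not Rapidata's actual referee: posterior probability of the
# voted category after n unanimous votes, with k categories, labeler
# accuracy p, uniform errors, and a uniform prior (which cancels out).
def posterior_after_unanimous_votes(n: int, k: int = 7, p: float = 0.9) -> float:
    voted = p**n                                  # all n votes hit the true class
    others = (k - 1) * ((1 - p) / (k - 1)) ** n   # all n votes hit the same wrong class
    return voted / (voted + others)

for n in range(1, 5):
    print(n, round(posterior_after_unanimous_votes(n), 5))

In this idealized model, unanimous votes cross the 97% threshold after just two or three answers; real votes are noisier and not always unanimous, which is why it typically takes 3-4 labels in practice.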

The selections part is also interesting. We first show a validation Rapid, followed by two actual Rapids that we want to collect labels for.

Obtaining the results

We have two ways to monitor the progress of the order: either we head over to app.rapidata.ai, or we use the SDK to show a progress bar with order.display_progress_bar().

Once the order is finished, we can download the results, either using the UI or the SDK with order.get_results().
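For completeness, here is a sketch of what the second batch described earlier could look like. We assume something_else_paths holds the paths of the images that batch 1 labeled as "Something else"; how exactly you extract those from order.get_results() depends on the shape of the results payload, so inspect it and adapt.

# Sketch of the second batch, reusing the same building blocks as above.
# something_else_paths is assumed to contain the image paths that batch 1
# labeled "Something else" (extracted from order.get_results()).
order_batch_2 = (
    rapi.new_order(
        name="Animal 10 Full Batch 2",
    )
    .workflow(
        ClassifyWorkflow(
            question="What is shown in the image?",
            options=CATEGORIES_BATCH_2 + ["Something else"],
        )
    )
    .media([MediaAsset(path) for path in something_else_paths])
    .referee(ClassifyEarlyStoppingReferee(threshold=0.97, max_vote_count=20))
    .selections(
        [
            ValidationSelection(amount=1, validation_set_id=validation_set_batch_2.id),
            LabelingSelection(amount=2),
        ]
    )
    .priority(5)
    .create()
)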

That's it!