Solutions · Reinforcement learning
RLHF · DPO · preference optimization

Aligning models is a human problem.

RLHF and DPO need the same thing: a fast, diverse, trustworthy stream of human preferences. Rapidata is the stream.

100k annotations/hour · 34s · 32M+ curated annotators · 192 countries represented
[Figure: preference space, drawn by humans (live). Labels: human-preferred region · Rapidata · 1,247 human votes]

Every vote nudges the policy toward the region humans actually prefer.

01 the problem

The reward signal you actually want lives in human heads.

You can write a loss function for next-token accuracy. You cannot write one for “this image looks more cinematic” or “this answer is less condescending.”

That gap is what RLHF and DPO close by training on what humans actually prefer between paired model outputs. Both methods are extraordinarily effective. Both are starved for one input: real preference pairs, in volume, at speed, against the policy you have today.

Synthetic preferences hit a ceiling. LLM-as-judge bakes in the very biases you are trying to correct. Humans are the only ground truth that scales with the problem.

02 why humans

Auto-judges learn to judge like humans. We just go ask the humans.

Automated judges are built from human preference data. Each one is a compression of how humans judged in the past. The point of going direct is not to replace judges entirely, but to avoid the compression step where proxy signals drift away from real preference.

via an auto-judge (longer path · compression introduced)
  • Human preference data: Past pairwise labels.
  • Auto-judge: LLM-as-judge · VLM-as-judge · reward model.
  • Score new outputs: Fast, repeatable and cost effective.
  • Discrepancy: the auto-judge approximates how a human would have judged.

direct · via Rapidata flows (shorter path · no compression)
  • Human preference (live): Pairwise votes from curated evaluators.
  • No proxy step.
  • Score new outputs: Fast, repeatable and cost effective.

An auto-judge is a model trained to predict how a human would have judged. It works well on tasks where preference is stable and well-represented in the training data, such as factual QA, code correctness, and simple instruction-following.

On tasks where preference shifts across culture, taste, demographics, or evolving model behavior, proxy signals gradually drift away from the human judgments they were trained to approximate.

where going direct earns its keep
  • Final-mile checkpoints, where the outcome often depends on subtle preference differences that humans detect reliably.
  • Subjective, taste-led modalities: image, video, voice, anything an auto-judge can't easily ground.
  • Cross-cultural rollouts and audiences your training data under-represents.

03 failure modes we see in the wild

Three ways automated judges fail in production.

Reward model drift

Reward models are trained on older policy distributions. As the model evolves, the reward signal gradually drifts away from current human preference. Over time, you stop optimizing for humans and start optimizing for a stale proxy.

Auto-judge self-preference

Automated judges consistently favor outputs closer to the distributions they were trained on, reinforcing model-specific stylistic biases.

Cultural mono-alignment

Judge models usually flatten regional, linguistic, and demographic preference differences into a single global preference distribution. You ship a model that feels off in half your markets.
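
One way to watch for the drift described under "Reward model drift" is to score the same pairs with both the automated judge and fresh human votes, then track how often they agree per checkpoint. The sketch below is only an illustration: the function name, the input shapes, and the numbers are assumptions, not part of any Rapidata API.

```python
from typing import Sequence


def judge_human_agreement(
    judge_prefers_a: Sequence[bool],
    human_share_for_a: Sequence[float],
) -> float:
    """Fraction of pairs where the judge picks the human-majority winner.

    human_share_for_a[i] is the share of human votes (0.0 to 1.0) that
    preferred output A on pair i. Exact ties (0.5) have no majority and
    are skipped. Both input shapes are assumptions for this sketch.
    """
    if len(judge_prefers_a) != len(human_share_for_a):
        raise ValueError("inputs must have the same length")

    agree = 0
    counted = 0
    for judge_a, share_a in zip(judge_prefers_a, human_share_for_a):
        if share_a == 0.5:
            continue  # no human majority on this pair
        counted += 1
        if judge_a == (share_a > 0.5):
            agree += 1

    if counted == 0:
        raise ValueError("no pairs with a human majority")
    return agree / counted


# Hypothetical data: the same judge verdicts compared against human votes
# collected on an older checkpoint and on the current one.
judge_verdicts = [True, False, True, True]
old_checkpoint = judge_human_agreement(judge_verdicts, [0.9, 0.2, 0.7, 0.8])
new_checkpoint = judge_human_agreement(judge_verdicts, [0.4, 0.3, 0.6, 0.1])
print(old_checkpoint, new_checkpoint)
```

A falling agreement rate between checkpoints is a hint, not proof, that the proxy has gone stale; the threshold for re-collecting human preferences is a judgment call.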

04 how we deliver

We power both algorithms with the same primitive.

RLHF and DPO are different optimizers. They aren't different data problems. Whether you train a reward model and run PPO, or use DPO and skip the reward model entirely, the input is the same: pairwise preferences. We give you that stream: live, TTL-bounded, demographically targetable.
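
To make "same input, different optimizer" concrete, here is a minimal sketch of the two standard per-pair losses: the Bradley-Terry loss commonly used to fit an RLHF reward model, and the DPO loss. Both consume only a (chosen, rejected) pair. The function names and the scalar inputs are assumptions for illustration; a real training loop would compute these over batches of tensors.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_model_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry loss: push the reward of the chosen output above
    the reward of the rejected one."""
    return -log_sigmoid(reward_chosen - reward_rejected)


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """DPO loss on one pair: the implicit reward is beta times the
    log-probability ratio between the policy and a frozen reference."""
    margin = (policy_logp_chosen - ref_logp_chosen) - (
        policy_logp_rejected - ref_logp_rejected
    )
    return -log_sigmoid(beta * margin)


# The same human preference (chosen beats rejected) feeds either loss.
# All numbers are made up for the example.
rm = reward_model_loss(reward_chosen=2.0, reward_rejected=0.0)
dpo = dpo_loss(
    policy_logp_chosen=-1.0,
    policy_logp_rejected=-3.0,
    ref_logp_chosen=-2.0,
    ref_logp_rejected=-2.0,
)
print(rm, dpo)
```

In both cases the only human-supplied signal is which output of the pair won, which is why one preference stream can serve either method.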

01 — Signal

Pairs, rankings, matrices.

Use the same human preference signal across RLHF, DPO, and evaluation.

02 — Latency

Seconds to minutes per batch.

TTL-bounded collection keeps training loops moving without blocking.

03 — Audiences

Skill-qualified, country-targeted.

Swap global, curated, or custom audiences depending on the task.

04 — Modalities

Image, video, audio, text, interactive UIs.

Same primitive across every modality you optimize.

05 — Scale

32M+ annotators across 190+ countries.

Preference collection behaves like infrastructure, not manual ops.

06 — Integration

Everything via SDK.

Plug into any of your workflows and get structured results via API.

05 the api

One primitive. Drop it into your training loop.

Ranking flows are the API primitive. Open a flow at the start of training, add a batch of rollouts each step, and pull back a reward signal in time for the next gradient update.

the flow primitive
1 · create   flow = client.flow.create_ranking_flow(...)

Define the question, response budget, audience.

2 · add_batch   item = flow.create_new_flow_batch(rollouts, time_to_live=300)

Stream rollouts in as you produce them. Fire and forget.

3 · reward   r = item.get_win_loss_matrix()

Pull the win-loss matrix as a reward signal. Step.
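
As a rough illustration of step 3, the sketch below turns a win-loss matrix into one scalar reward per rollout. The page does not specify what get_win_loss_matrix() returns, so the layout here is an assumption: a square nested list where wins[i][j] counts how often rollout i was preferred over rollout j. The helper names are hypothetical, and win rate is only one of several reasonable ways to derive a reward.

```python
from typing import List, Sequence


def win_rates(wins: Sequence[Sequence[float]]) -> List[float]:
    """Per-rollout win rate from a square matrix of pairwise win counts.

    Assumed layout: wins[i][j] = number of votes preferring rollout i
    over rollout j. Rollouts with no votes get a neutral 0.5.
    """
    n = len(wins)
    rates = []
    for i in range(n):
        won = sum(wins[i][j] for j in range(n) if j != i)
        lost = sum(wins[j][i] for j in range(n) if j != i)
        total = won + lost
        rates.append(won / total if total else 0.5)
    return rates


def centered_rewards(wins: Sequence[Sequence[float]]) -> List[float]:
    """Win rates shifted to zero mean, a common baseline for policy updates."""
    rates = win_rates(wins)
    if not rates:
        return []
    mean = sum(rates) / len(rates)
    return [r - mean for r in rates]


# Made-up vote counts for three rollouts of the same prompt.
matrix = [
    [0, 8, 9],
    [2, 0, 6],
    [1, 4, 0],
]
rates = win_rates(matrix)
rewards = centered_rewards(matrix)
print(rates, rewards)
```

Centering keeps the reward scale comparable across batches that differ in how many votes arrived before the time-to-live expired.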

→ for the full API + code samples, see the Flows page