Aligning models is a human problem.
RLHF and DPO need the same thing: a fast, diverse, trustworthy stream of human preferences. Rapidata is the stream.
Every vote nudges the policy toward the region humans actually prefer.
The reward signal you actually want lives in human heads.
You can write a loss function for next-token accuracy. You cannot write one for “this image looks more cinematic” or “this answer is less condescending.”
That gap is what RLHF and DPO close by training on what humans actually prefer between paired model outputs. Both methods are extraordinarily effective. Both are starved for one input: real preference pairs, in volume, at speed, against the policy you have today.
Synthetic preferences hit a ceiling. LLM-as-judge bakes in the very biases you are trying to correct. Humans are the only ground truth that scales with the problem.
Auto-judges learn to judge like humans. We just go ask the humans.
Automated judges are built from human preference data. They are a compression of how humans judged in the past. The point of going direct is not to replace judges entirely, but to skip the compression step, where proxy signals drift away from real preference.
An auto-judge is a model trained to predict how a human would have judged. It works well on tasks where preference is stable and well represented in the training data, such as factual QA, code correctness, and simple instruction-following.
On tasks where preference shifts across culture, taste, demographics, or evolving model behavior, proxy signals gradually drift away from the human judgments they were trained to approximate.
- Final-mile checkpoints, where the decision often depends on subtle preference differences that humans detect reliably.
- Subjective, taste-led modalities: image, video, voice, anything an auto-judge can't easily ground.
- Cross-cultural rollouts and audiences your training data under-represents.
Three ways automated judges fail in production.
Reward model drift
Reward models are trained on older policy distributions. As the model evolves, the reward signal gradually drifts away from current human preference. Over time, you stop optimizing for humans and start optimizing for a stale proxy.
Auto-judge self-preference
Automated judges consistently favor outputs closer to the distributions they were trained on, reinforcing model-specific stylistic biases.
Cultural mono-alignment
Judge models usually flatten regional, linguistic, and demographic preference differences into a single global preference distribution. You ship a model that feels off in half your markets.
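To make the flattening concrete, here is a minimal sketch that compares a pooled win rate with per-region win rates for the same two outputs. The votes are made-up illustration data, and the function names are hypothetical helpers, not part of any SDK.

```python
from collections import defaultdict

# Made-up votes for illustration only: (region, winning_output).
votes = (
    [("DE", "A")] * 8 + [("DE", "B")] * 2 +
    [("JP", "A")] * 3 + [("JP", "B")] * 7
)


def win_rate_of_a(votes):
    """Fraction of votes in which output A was preferred."""
    return sum(1 for _, winner in votes if winner == "A") / len(votes)


def win_rate_by_region(votes):
    """Win rate of output A, computed separately for each region."""
    grouped = defaultdict(list)
    for region, winner in votes:
        grouped[region].append((region, winner))
    return {region: win_rate_of_a(v) for region, v in grouped.items()}


pooled = win_rate_of_a(votes)
per_region = win_rate_by_region(votes)
print(pooled)      # 0.55
print(per_region)  # {'DE': 0.8, 'JP': 0.3}
```

In this toy case the pooled number says A narrowly wins, while one of the two regions prefers B by a wide margin. A single global preference distribution hides exactly that split.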
We power both algorithms with the same primitive.
RLHF and DPO are different optimizers, not different data problems. Whether you train a reward model and run PPO, or use DPO and skip the reward model entirely, the input is the same: pairwise preferences. We give you that stream: live, TTL-bounded, demographically targetable.
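As a rough illustration of why the data problem is the same, the sketch below writes both per-pair losses in plain Python: the Bradley-Terry loss commonly used to train an RLHF reward model, and the standard DPO loss. Both consume one (chosen, rejected) pair. The function names and the default `beta` are illustrative assumptions, and a real training loop would compute these over batches in a tensor library.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry loss for a reward model on one preference pair."""
    return -log_sigmoid(r_chosen - r_rejected)


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss on one pair.

    Same functional form as the reward model loss, with the implicit
    reward beta * (log pi(y|x) - log pi_ref(y|x)) in place of r(x, y).
    """
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    return -log_sigmoid(beta * margin)
```

In both cases the only supervision is which output of the pair a human preferred; the optimizers differ in what they do with it.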
Pairs, rankings, matrices.
Use the same human preference signal across RLHF, DPO, and evaluation.
Seconds to minutes per batch.
TTL-bounded collection keeps training loops moving without blocking.
Skill-qualified, country-targeted.
Swap global, curated, or custom audiences depending on the task.
Image, video, audio, text, interactive UIs.
Same primitive across every modality you optimize.
32M+ annotators across 190+ countries.
Preference collection behaves like infrastructure, not manual ops.
Everything via SDK.
Plug into any of your workflows and get structured results via the API.
One primitive. Drop it into your training loop.
Ranking flows are the API primitive. Open a flow at the start of training, add a batch of rollouts each step, and pull back a reward signal in time for the next gradient update.
Define the question, response budget, audience.
Stream rollouts in as you produce them. Fire and forget.
Pull the win-loss matrix as a reward signal. Step.
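The last step, turning a win-loss matrix into per-rollout rewards, could look like the sketch below. It is one simple option (smoothed win rates, centered to zero mean), not the SDK's own method: the function name, the matrix layout, and the `prior` parameter are all assumptions made for illustration, and no SDK calls are shown.

```python
from typing import List, Sequence


def rewards_from_win_matrix(wins: Sequence[Sequence[int]],
                            prior: float = 1.0) -> List[float]:
    """Convert a win-loss matrix into one zero-mean reward per rollout.

    Assumed layout: wins[i][j] is the number of votes in which
    rollout i was preferred over rollout j.
    """
    n = len(wins)
    rates = []
    for i in range(n):
        won = sum(wins[i][j] for j in range(n) if j != i)
        lost = sum(wins[j][i] for j in range(n) if j != i)
        # Pseudo-votes pull rollouts with few comparisons toward 0.5.
        rates.append((won + prior) / (won + lost + 2 * prior))
    mean = sum(rates) / n
    return [r - mean for r in rates]


# Example: three rollouts compared pairwise by human voters.
wins = [
    [0, 8, 9],
    [2, 0, 6],
    [1, 4, 0],
]
rewards = rewards_from_win_matrix(wins)
print([round(r, 3) for r in rewards])  # [0.318, -0.091, -0.227]
```

Centering keeps the rewards comparable across steps when batch sizes vary. A Bradley-Terry fit is a common alternative when the comparison graph is sparse.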