continuous post-training · humans inside the loop

Online RLHF.
Feedback as fast as your model generates.

A continuous post-training system where human preferences are streamed directly into the optimizer. Every training step pushes a group of candidates as a Flow item, gets a ranking back in seconds, and turns it into a reward signal.

Example setup
Policy (GPU cluster) → Generation (n candidates / prompt) → Flow item (pairwise comparisons) → Global crowd (6K+ ann / min) → Elo (ranking + CI) → Reward signal (Elo → policy)
throughput: 6,320 ann/min · flow items live: 323

Generation, evaluation, and training run concurrently. Human feedback latency is comparable to image generation time, so people never become the bottleneck.

01 the shift

Static datasets cap out. Online loops don’t.

Traditional RLHF collects a preference dataset once, trains a reward model, and runs PPO against it for weeks. After a few hundred steps the policy has drifted outside the distribution the reward model was trained on, so its scores become unreliable, and the dataset no longer reflects what your current model is getting wrong. Online RLHF closes that gap by treating each training step as its own micro-experiment: a small group of candidates, ranked by humans, fed back to the optimizer within seconds.

same wall-clock window (→ time)

Traditional RLHF: static dataset · periodic retrain
collect 100k pairs (weeks) → reward model train → PPO run → eval → next collection

Online RLHF: continuous · feedback inside the step
one training step = one flow item · ~3-8s end-to-end · hundreds in parallel
feedback latency ≤ image generation time → humans stop being the bottleneck

02 how it works

One training step = one Flow item.

For each prompt, the model emits a group of candidates, typically 4 to 16. The group is submitted as a single Flow item, pairwise comparisons are shown to human annotators in parallel, and the responses are aggregated with Elo into a ranking.
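
To make the size of the comparison space concrete, here is a minimal sketch that enumerates every pair in a group and draws a subset of them. The helper names `all_pairs` and `sample_pairs` are hypothetical, and the uniform random draw is only a stand-in: it is not the adaptive, information-gain-driven sampling the platform uses.

```python
import itertools
import random


def all_pairs(candidates):
    """Return every unordered pair of candidates: C(n, 2) pairs in total."""
    return list(itertools.combinations(candidates, 2))


def sample_pairs(candidates, k, seed=0):
    """Draw up to k distinct pairs uniformly at random.

    Assumption: uniform sampling is a placeholder for an adaptive strategy.
    """
    pairs = all_pairs(candidates)
    rng = random.Random(seed)
    return rng.sample(pairs, min(k, len(pairs)))


group = [f"c{i}" for i in range(1, 9)]   # 8 candidates for one prompt
pairs = all_pairs(group)
print(len(pairs))                         # 28, i.e. C(8, 2)

subset = sample_pairs(group, k=10)        # 10 distinct pairs to show annotators
```

The pair count grows quadratically with group size (6 pairs for 4 candidates, 120 for 16), which is why showing only a sampled subset of pairs matters for larger groups.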

Example Flow item (step 12,408): 8 candidates · 1 prompt
prompt: “a serene mountain at sunset, cinematic”
candidates: c1 · c2 · c3 · c4 · c5 · c6 · c7 · c8
C(8, 2) = 28 possible pairs · sampled adaptively to maximize information gain

aggregated ranking (Elo) · 312 / 500 votes
  • 01 · 1342
  • 02 · 1308
  • 03 · 1284
  • 04 · 1267
  • 05 · 1219
  • 06 · 1198
  • 07 · 1175
  • 08 · 1156

preference signal → training
winner: c1 · 7 preference pairs · conf 0.94
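
The aggregation in this example can be sketched roughly as follows: pairwise votes update Elo ratings, the ratings give a ranking, and the top candidate is paired against each of the others. This is an illustrative sketch, not the platform's implementation. The function names, the K-factor of 16, the base rating of 1200, and the reading of "7 preference pairs" as winner-versus-each-other-candidate are all assumptions, and the votes below are synthetic.

```python
import itertools


def elo_ratings(votes, candidates, k=16.0, base=1200.0):
    """Apply sequential Elo updates over (winner, loser) votes."""
    ratings = {c: base for c in candidates}
    for winner, loser in votes:
        # Probability the winner was expected to win, given current ratings.
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        delta = k * (1.0 - expected)
        ratings[winner] += delta
        ratings[loser] -= delta
    return ratings


def winner_pairs(ratings):
    """Rank candidates by rating and pair the best one against every other."""
    ranking = sorted(ratings, key=ratings.get, reverse=True)
    best = ranking[0]
    return ranking, [(best, other) for other in ranking[1:]]


group = [f"c{i}" for i in range(1, 9)]
# Synthetic votes: the lower-numbered candidate wins each of the 28 pairs once.
votes = list(itertools.combinations(group, 2))

ratings = elo_ratings(votes, group)
ranking, pairs = winner_pairs(ratings)
print(ranking[0], len(pairs))   # c1 7
```

Sequential Elo depends on vote order, so a production system would more likely fit all votes jointly (for example with a Bradley-Terry model) and report uncertainty alongside the ranking.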
01 · Generate a group
Each training step samples a group of candidates per prompt (8 in this example) across your parallel GPUs.

02 · Open a Flow item
The group is submitted as one Flow item with min / desired / max response thresholds and a time-to-live (ttl).

03 · Aggregate at scale
Pairwise comparisons stream in from the global crowd; Elo / Bradley-Terry aggregation produces a ranking with confidence intervals (CI) in seconds.

04 · Train on the signal
The winner and the full set of preference pairs become reward-modeling or DPO targets for the next gradient step.
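
As an illustration of how one preference pair can become a training target, here is a minimal sketch of the standard DPO loss for a single (chosen, rejected) pair. It assumes you can obtain scalar log-likelihoods under the current policy and a frozen reference policy, or a surrogate for them, as is usually needed for image models. The function name `dpo_loss` and the default `beta=0.1` are illustrative choices, not values taken from the platform.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair: -log(sigmoid(margin)).

    margin = beta * [(log pi(chosen) - log pi_ref(chosen))
                     - (log pi(rejected) - log pi_ref(rejected))]
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin)).
    return math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)


# Policy identical to the reference: the loss is log(2).
print(dpo_loss(-2.0, -2.0, -2.0, -2.0))

# Policy already favors the human-preferred candidate: the loss drops below log(2).
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

In a real training step this would be computed over a batch with an autodiff framework so that gradients flow into the policy's log-likelihoods; the scalar version here only shows the shape of the objective.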

03 in code

Drop it into your training step.

One flow stays open for the entire run. Each step pushes a batch and polls for its ranking. Because the call is non-blocking and ttl-bounded, your training loop never waits on humans; items that are still incomplete return partial results.

online_rlhf.py
from rapidata import RapidataClient

client = RapidataClient()

# 1. Open one ranking flow for the entire training run
flow = client.flow.create_ranking_flow(
    name="online-rlhf · image-gen",
    instruction="Which image looks better?",
)

# 2. Inside the training loop: one flow item per step.
#    Assumes train_loop yields prompts; policy and optimizer are your own objects.
for step, prompt in enumerate(train_loop):
    candidates = policy.sample(prompt, n=8)        # 8 per prompt
    item = flow.create_new_flow_batch(
        datapoints=candidates,
        context=f"step {step}",
        time_to_live=300,                          # seconds, bounded
    )

    # Non-blocking inspection while votes accumulate
    status  = item.get_status()
    matrix  = item.get_win_loss_matrix()           # pandas DataFrame
    results = item.get_results()                   # rankings + scores

    # 3. Feed the win/loss matrix into your DPO / reward modeling
    optimizer.step(reward_signal=matrix)
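
How the win/loss matrix becomes a reward is up to your trainer. One simple option, sketched below, is to turn it into per-candidate win rates and center them within the group. The layout is an assumption: `wins[i][j]` is taken to be the number of times candidate `i` beat candidate `j`, which the snippet above does not specify. The helpers `win_rates` and `centered_rewards` are hypothetical, and plain lists are used to keep the sketch dependency-free (a pandas DataFrame can be converted with `.to_numpy().tolist()`).

```python
def win_rates(wins):
    """Per-candidate win rate from a square matrix of pairwise win counts.

    Assumption: wins[i][j] counts how often candidate i beat candidate j.
    Candidates with no votes yet get a neutral 0.5, so partial results still work.
    """
    n = len(wins)
    rates = []
    for i in range(n):
        won = sum(wins[i][j] for j in range(n) if j != i)
        lost = sum(wins[j][i] for j in range(n) if j != i)
        total = won + lost
        rates.append(won / total if total else 0.5)
    return rates


def centered_rewards(rates):
    """Subtract the group mean so rewards are relative within the group."""
    mean = sum(rates) / len(rates)
    return [r - mean for r in rates]


# Synthetic 3-candidate example.
wins = [
    [0, 3, 4],
    [1, 0, 2],
    [0, 2, 0],
]
rates = win_rates(wins)
print(rates)                      # [0.875, 0.375, 0.25]
rewards = centered_rewards(rates)
```

Centering within the group keeps the reward scale comparable across prompts of different difficulty, at the cost of discarding information about absolute quality.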