Online RLHF.
Feedback as fast as your model generates.
A continuous post-training system where human preferences are streamed directly into the optimizer. Every training step pushes a group of candidates as a Flow item, gets a ranking back in seconds, and turns it into a reward signal.
Generation, evaluation, and training run concurrently. Human feedback latency is comparable to image generation time, so people never become the bottleneck.
Static datasets cap out. Online loops don’t.
Traditional RLHF collects a preference dataset once, trains a reward model, and runs PPO against it for weeks. The reward model drifts away from the policy after a few hundred steps, and the dataset stops reflecting the things your current model is getting wrong. Online RLHF closes that gap by treating each training step as its own micro-experiment — a small group of candidates, ranked by humans, fed back to the optimizer within seconds.
One training step = one Flow item.
For each prompt, the model emits a group of candidates, usually 4 to 16. The group is submitted as a single Flow item, pairwise comparisons are shown to human annotators in parallel, and the responses are aggregated with Elo into a ranking.
- 01: 1342
- 02: 1308
- 03: 1284
- 04: 1267
- 05: 1219
- 06: 1198
- 07: 1175
- 08: 1156
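To make the Elo aggregation concrete, here is a minimal sketch of textbook sequential Elo applied to a list of pairwise votes. It is not the platform's actual aggregation, which the page does not spell out. The function names, the K-factor of 32, and the starting rating of 1200 are all assumptions chosen for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def elo_ratings(
    candidates: List[str],
    comparisons: Iterable[Tuple[str, str]],
    k: float = 32.0,      # assumed K-factor
    base: float = 1200.0,  # assumed starting rating
) -> Dict[str, float]:
    """Apply sequential Elo updates; each comparison is (winner, loser)."""
    ratings = {c: base for c in candidates}
    for winner, loser in comparisons:
        # Expected score of the winner under the logistic Elo model.
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        delta = k * (1.0 - expected)
        ratings[winner] += delta
        ratings[loser] -= delta
    return ratings


def rank(ratings: Dict[str, float]) -> List[str]:
    """Candidates sorted from highest to lowest rating."""
    return sorted(ratings, key=ratings.get, reverse=True)


# Four votes over three candidates, each given as (winner, loser).
votes = [("a", "b"), ("a", "c"), ("b", "c"), ("a", "b")]
scores = elo_ratings(["a", "b", "c"], votes)
print(rank(scores))  # ['a', 'b', 'c']
```

Sequential Elo depends on the order in which votes arrive. A Bradley-Terry fit over the full win/loss counts is order-independent, which may be preferable when confidence intervals are needed.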
Each training step samples 8 candidates per prompt across your parallel GPUs.
The group is submitted as one Flow item with min / desired / max response thresholds and a ttl.
Pairwise comparisons stream in from the global crowd; Elo / Bradley-Terry aggregation produces a ranking with confidence intervals in seconds.
The winner and full preference pairs become reward-modeling or DPO targets for the next gradient step.
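The last of these steps, turning votes into DPO targets, can be sketched as a filter over win counts. The layout below is an assumption: a nested dict where `wins[a][b]` is the number of times `a` beat `b`. The actual win/loss matrix may be shaped differently. The `preference_pairs` helper and its `min_votes` and `min_margin` thresholds are hypothetical names, not part of any SDK.

```python
from typing import Dict, List, Tuple

# Assumed layout: wins[a][b] = number of comparisons in which a beat b.
WinMatrix = Dict[str, Dict[str, int]]


def preference_pairs(
    wins: WinMatrix,
    min_votes: int = 3,       # skip pairs with too few comparisons
    min_margin: float = 0.6,  # required win rate for the chosen side
) -> List[Tuple[str, str, float]]:
    """Return (chosen, rejected, win_rate) for pairs with a clear majority."""
    names = sorted(wins)
    pairs: List[Tuple[str, str, float]] = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            a_wins = wins[a].get(b, 0)
            b_wins = wins[b].get(a, 0)
            total = a_wins + b_wins
            if total < min_votes:
                continue  # not enough evidence for this pair yet
            rate = a_wins / total
            if rate >= min_margin:
                pairs.append((a, b, rate))
            elif 1.0 - rate >= min_margin:
                pairs.append((b, a, 1.0 - rate))
    return pairs


wins = {
    "img_0": {"img_1": 4, "img_2": 5},
    "img_1": {"img_0": 1, "img_2": 2},
    "img_2": {"img_0": 0, "img_1": 2},
}
pairs = preference_pairs(wins)
# img_1 vs img_2 is a 2-2 tie, so it is dropped as uninformative.
```

Dropping ties and thinly voted pairs keeps noisy comparisons out of the gradient step. The thresholds are a judgment call and would need tuning per task.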
Drop it into your training step.
One flow stays open for the entire run. Each step pushes a batch and polls back a ranking. Because the call is non-blocking and ttl-bounded, your training loop never waits on humans — incomplete items still return partial results.
from rapidata import RapidataClient

client = RapidataClient()

# 1. Open one ranking flow for the entire training run
flow = client.flow.create_ranking_flow(
    name="online-rlhf · image-gen",
    instruction="Which image looks better?",
)

# 2. Inside the training loop: one flow item per step
for step in train_loop:
    candidates = policy.sample(prompt, n=8)  # 8 per prompt
    item = flow.create_new_flow_batch(
        datapoints=candidates,
        context=f"step {step}",
        time_to_live=300,  # seconds, bounded
    )

    # Non-blocking inspection while votes accumulate
    status = item.get_status()
    matrix = item.get_win_loss_matrix()  # pandas DataFrame
    results = item.get_results()  # rankings + scores

    # 3. Feed the win/loss matrix into your DPO / reward modeling
    optimizer.step(reward_signal=matrix)
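One way to keep the loop from waiting on humans is to track items in flight and, on each step, collect only those that are finished or past their ttl. The sketch below shows that pattern in isolation. `InFlight`, `Pending`, and the `is_done` callback are hypothetical names and not part of the Rapidata SDK. How completion is read from `get_status()` is not specified here, so the caller supplies that check.

```python
import time
from collections import deque
from dataclasses import dataclass
from typing import Any, Callable, Deque, List


@dataclass
class Pending:
    step: int        # training step that submitted the item
    item: Any        # whatever handle the feedback service returned
    deadline: float  # clock value after which we stop waiting


class InFlight:
    """Tracks submitted items and hands back those that are done or expired."""

    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self.queue: Deque[Pending] = deque()

    def add(self, step: int, item: Any) -> None:
        self.queue.append(Pending(step, item, self.clock() + self.ttl))

    def harvest(self, is_done: Callable[[Any], bool]) -> List[Pending]:
        """Return finished or timed-out entries without blocking."""
        now = self.clock()
        ready: List[Pending] = []
        waiting: Deque[Pending] = deque()
        for p in self.queue:
            if is_done(p.item) or now >= p.deadline:
                ready.append(p)
            else:
                waiting.append(p)
        self.queue = waiting
        return ready


# Demo with a fake clock and dict stand-ins for real item handles.
fake_now = [0.0]
tracker = InFlight(ttl=300.0, clock=lambda: fake_now[0])
tracker.add(0, {"done": False})
tracker.add(1, {"done": True})
first = tracker.harvest(lambda item: item["done"])   # only step 1 is ready
fake_now[0] = 301.0
second = tracker.harvest(lambda item: item["done"])  # step 0 has expired
```

With this pattern the gradient step at time t consumes feedback from whichever earlier steps have completed, so the reward signal lags generation by a step or more. Expired entries would be read for partial results rather than discarded.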