Rank | Model | Bradley-Terry | Elo | Win Rate | Wins | Matches
---|---|---|---|---|---|---
1 | Flux 1 pro | 1135.04 | 1057.30 | 0.54 | 841050 | 1557622
2 | Flux 1.1 pro | 1082.65 | 1040.13 | 0.52 | 496062 | 953979
3 | Aurora | 1035.38 | 1023.89 | 0.51 | 250286 | 491566
4 | Imagen 3 | 1019.21 | 1018.15 | 0.50 | 455736 | 918268
5 | Frames | 990.08 | 1007.60 | 0.50 | 206445 | 414628
6 | Lumina | 969.12 | 1000.42 | 0.51 | 102679 | 201610
7 | Dall-E 3 | 952.00 | 993.32 | 0.49 | 713967 | 1464775
8 | Stable Diffusion 3 | 939.14 | 988.35 | 0.49 | 592115 | 1219975
9 | Midjourney 5.2 | 893.75 | 970.32 | 0.47 | 642301 | 1371196
10 | Janus 7B | 733.80 | 900.53 | 0.43 | 23895 | 55453
The Bradley-Terry ranking model is a probabilistic model used to predict outcomes in pairwise comparisons. It assigns a strength parameter (the reported score) to each item, indicating its likelihood of winning against any other item. See the Wikipedia article on the Bradley-Terry model for mathematical details.
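To make this concrete, below is a minimal sketch (in Python) of how Bradley-Terry strengths can be estimated from pairwise win counts using the standard MM (minorization-maximization) iteration, together with the resulting win-probability formula. The head-to-head counts in the example are made up purely for illustration, and the exact scaling used to produce the scores in the table above is not shown here.

```python
import math

def fit_bradley_terry(wins, num_iters=200):
    """Fit Bradley-Terry strengths with the classic MM iteration.

    wins[i][j] = number of comparisons in which item i beat item j.
    """
    n = len(wins)
    p = [1.0] * n  # start every item at equal strength
    for _ in range(num_iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i])
            # Sum over opponents: n_ij / (p_i + p_j), where n_ij is the
            # total number of matches played between i and j.
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        # Normalize so the geometric mean of the strengths is 1.
        g = math.prod(new_p) ** (1.0 / n)
        p = [x / g for x in new_p]
    return p

def win_probability(p_i, p_j):
    """P(i beats j) = p_i / (p_i + p_j) under the Bradley-Terry model."""
    return p_i / (p_i + p_j)

# Toy head-to-head counts for three hypothetical models (not leaderboard data).
wins = [
    [0, 60, 70],  # model A beat B 60 times and beat C 70 times
    [40, 0, 55],  # model B
    [30, 45, 0],  # model C
]
strengths = fit_bradley_terry(wins)
print(strengths)                                     # relative strengths
print(win_probability(strengths[0], strengths[1]))   # chance A beats B
```

A higher strength means a higher probability of winning any given pairwise comparison; the ranking order does not depend on the normalization chosen.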
Here we evaluate the models across all criteria and determine which one has the best overall performance.
All results are based directly on feedback from real human raters. The process by which we arrived at these results is described in detail in our blog post.
Visual examples of the annotators’ preferences
Annotators were shown image pairs and asked questions such as:
- "Which image looks better overall?"
- "Which image feels less weird or unnatural for its style when you look closely? (i.e., fewer odd or strange-looking objects or elements)"

Example prompt: "A black and white picture of a white man singing a song"