Coding category: computer programming, software development, debugging, algorithms
Models compared: GPT, Claude, Gemini, Grok, DeepSeek, Qwen, Llama, Kimi K2.5, Mistral, Phi-4, Command R+, IBM Granite
[Charts: Performance Trend (top 3 models over the last 4 weeks); Score Comparison (all models in Coding); Recent Growth Analysis (performance changes in the last 7 days)]
Category Performance
Average accuracy: 83.1%
How Dynamic Rankings Work
The system learns and improves over time
The Learning Process
Every time Roundtable runs, the system tracks which models contribute most to the best answers. Over time, this builds a real-world performance database that shows which models are genuinely best at different types of tasks.
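The tracking step above can be sketched as a simple per-category win counter. The class and method names here are hypothetical (the source doesn't show Roundtable's actual data model); the point is that each round records which model contributed most to the best answer, and win rates fall out of those counts.

```python
from collections import defaultdict

# Hypothetical tracker; Roundtable's real schema isn't shown in the source.
class WinTracker:
    def __init__(self):
        # category -> model -> number of rounds where that model
        # contributed most to the best answer
        self.wins = defaultdict(lambda: defaultdict(int))
        self.rounds = defaultdict(int)  # category -> total rounds observed

    def record_round(self, category: str, winning_model: str) -> None:
        self.wins[category][winning_model] += 1
        self.rounds[category] += 1

    def win_rate(self, category: str, model: str) -> float:
        total = self.rounds[category]
        return self.wins[category][model] / total if total else 0.0

tracker = WinTracker()
tracker.record_round("Coding", "Claude")
tracker.record_round("Coding", "GPT")
tracker.record_round("Coding", "Claude")
print(tracker.win_rate("Coding", "Claude"))  # won 2 of 3 rounds
```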
How Rankings Are Calculated
1. We begin with benchmark scores from standardized tests (much as students have SAT scores).
2. As Roundtable runs, we track which models win (contribute most to the best answers) in each category.
3. The final ranking blends benchmark scores (weighted heavily early on) with real-world performance (weighted more heavily as data accumulates). After 50+ rounds in a category, rankings are based entirely on actual performance.
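The blend described above can be illustrated with a weighting function. The linear ramp and the 50-round cutoff mirror the description, but the exact blend formula is an assumption for illustration, not Roundtable's actual code.

```python
def blended_score(benchmark: float, win_rate: float, rounds: int,
                  full_confidence_rounds: int = 50) -> float:
    """Shift weight from the benchmark prior to observed performance.

    Assumed linear ramp: at 0 rounds the ranking is pure benchmark;
    at 50+ rounds it is pure real-world win rate.
    """
    w = min(rounds / full_confidence_rounds, 1.0)  # 0.0 -> 1.0 over 50 rounds
    return (1.0 - w) * benchmark + w * win_rate

print(blended_score(0.85, 0.60, rounds=0))   # pure benchmark: 0.85
print(blended_score(0.85, 0.60, rounds=50))  # pure win rate: 0.60
```

With 25 rounds observed, the weights split evenly and the score falls halfway between the two inputs.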
Why This Matters
Static benchmarks tell you how models perform on tests. Dynamic rankings tell you how they perform on real questions from real users. This means the system gets smarter over time, always routing your questions to the model that's proven to be best for that specific type of task.
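The routing decision described above reduces to picking the top-ranked model for the question's category. This is a minimal sketch under that assumption; the ranking values are made up for illustration.

```python
def route(category: str, rankings: dict[str, dict[str, float]]) -> str:
    """Return the highest-ranked model for the given task category."""
    return max(rankings[category], key=rankings[category].get)

# Hypothetical current rankings for one category
rankings = {"Coding": {"GPT": 0.81, "Claude": 0.86, "Gemini": 0.79}}
print(route("Coding", rankings))  # Claude
```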