About Long-Context-Code-Bench
Long-Context-Code-Bench (LCB) evaluates AI coding agents on enterprise-scale repositories with 40,000+ files. Unlike existing benchmarks that focus on small codebases, LCB measures what matters most for enterprise adoption: the ability to understand, modify, and integrate changes across massive real-world repositories.
This leaderboard shows results from head-to-head agent-as-judge evaluation, where each AI agent judges every other agent's solution to the same real GitHub PRs. Each agent is asked to recreate the actual PR changes given only the PR description, with no access to the original solution or its git history. Each agent then acts as a judge, comparing pairs of solutions and deciding which is better based on correctness, completeness, code quality, and integration.
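A minimal sketch of this evaluation loop, in Python. The `solve(agent, pr)` and `judge(judge_agent, pr, diff_a, diff_b)` calls are hypothetical stand-ins for the real harness, and the choice to exclude a judge from pairs containing its own solution is an assumption of the sketch, not a documented rule of the benchmark:

```python
from itertools import combinations

def evaluate_pr(pr, agents, solve, judge):
    """Collect one solution per agent, then gather pairwise judgments.

    `solve` and `judge` are hypothetical harness callables, not LCB's API.
    """
    # Each agent recreates the PR change from the description alone.
    solutions = {agent: solve(agent, pr) for agent in agents}

    decisions = []
    for a, b in combinations(agents, 2):
        for judge_agent in agents:
            # Assumption: a judge skips pairs that include its own solution.
            if judge_agent in (a, b):
                continue
            # `verdict` is expected to be "A", "B", or "tie", based on
            # correctness, completeness, code quality, and integration.
            verdict = judge(judge_agent, pr, solutions[a], solutions[b])
            decisions.append((a, b, judge_agent, verdict))
    return decisions
```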
🔍 Most Openly Verifiable Benchmark: For each task, you can view each agent's judgment rationale, side-by-side diff comparisons, and complete agent execution logs—making this the most transparent and verifiable coding agent benchmark available. The entire benchmark is reproducible and fully open source at github.com/AugmentedAJ/Long-Context-Code-Bench.
Agent Leaderboard
Ranked by win rate from head-to-head agent-as-judge evaluation. Each agent judges every other agent's solution, and pairwise decisions determine win/loss/tie records.
| Rank | Agent | Win Rate | Wins | Losses | Ties |
|---|---|---|---|---|---|
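As a rough illustration of how pairwise verdicts could be aggregated into the win/loss/tie records and win rates shown above, here is a sketch that consumes the `decisions` list from the earlier example. The verdict format and the win-rate formula (wins divided by decided comparisons, with ties excluded from the denominator) are assumptions, not the benchmark's documented scoring:

```python
from collections import Counter

def leaderboard(decisions):
    """Turn (agent_a, agent_b, judge, verdict) tuples into ranked records.

    Assumes verdicts are "A", "B", or "tie"; tie handling in the real
    leaderboard may differ.
    """
    records = Counter()
    for a, b, _judge, verdict in decisions:
        if verdict == "A":
            records[(a, "wins")] += 1
            records[(b, "losses")] += 1
        elif verdict == "B":
            records[(b, "wins")] += 1
            records[(a, "losses")] += 1
        else:
            records[(a, "ties")] += 1
            records[(b, "ties")] += 1

    rows = []
    for agent in {key[0] for key in records}:
        wins = records[(agent, "wins")]
        losses = records[(agent, "losses")]
        ties = records[(agent, "ties")]
        decided = wins + losses
        win_rate = wins / decided if decided else 0.0
        rows.append((agent, win_rate, wins, losses, ties))
    # Rank by win rate, highest first.
    return sorted(rows, key=lambda row: row[1], reverse=True)
```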
Head-to-Head Details by PR
View pairwise agent judgments and results for each PR