Long-Context-Code-Bench

About Long-Context-Code-Bench

Long-Context-Code-Bench (LCB) evaluates AI coding agents on enterprise-scale repositories with 40,000+ files. Unlike existing benchmarks that focus on small codebases, LCB measures what matters most for enterprise adoption: the ability to understand, modify, and integrate changes across massive real-world repositories.

This leaderboard shows results from head-to-head agent-as-judge evaluation, where each AI agent judges every other agent's solution to the same real GitHub PRs. Each agent is tested on recreating actual PR changes given only the PR description—no access to the solution or git history. Then, each agent acts as a judge, comparing pairs of solutions and deciding which is better based on correctness, completeness, code quality, and integration.
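
The sketch below illustrates that evaluation loop in code. It is a minimal illustration rather than the benchmark's actual harness: run_agent and judge_pair are hypothetical placeholders (generate a diff from the PR description; compare two diffs and return a verdict), and the rule that an agent never judges a pair containing its own solution is an assumption.

```python
from itertools import combinations
from typing import Callable, Dict, List

def pairwise_judgments(
    agents: List[str],
    pr_tasks: List[str],
    run_agent: Callable[[str, str], str],            # (agent, pr_task) -> solution diff
    judge_pair: Callable[[str, str, str, str], str], # (judge, pr_task, diff_a, diff_b) -> "a" | "b" | "tie"
) -> List[dict]:
    """Every agent solves every PR task; every other agent judges each pair of solutions."""
    records = []
    for task in pr_tasks:
        # Each agent recreates the PR from its description alone (no solution, no git history).
        solutions: Dict[str, str] = {agent: run_agent(agent, task) for agent in agents}
        for a, b in combinations(agents, 2):
            for judge in agents:
                if judge in (a, b):
                    continue  # assumption: an agent does not judge pairs that include its own solution
                verdict = judge_pair(judge, task, solutions[a], solutions[b])
                records.append({"task": task, "judge": judge, "a": a, "b": b, "verdict": verdict})
    return records
```

Under that assumption, an n-agent run produces n*(n-1)*(n-2)/2 judgments per PR: one verdict from each of the n-2 eligible judges for each of the n*(n-1)/2 solution pairs.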

🔍 Most Openly Verifiable Benchmark: For each task, you can view every agent's judgment rationale, side-by-side diff comparisons, and complete agent execution logs, making the evaluation fully transparent and auditable. The entire benchmark is reproducible and fully open source at github.com/AugmentedAJ/Long-Context-Code-Bench.

📊 Version v0 — This is our initial release featuring 40 PRs from the Elasticsearch repository (~40,000 files). Future versions will expand to include diverse codebases across multiple languages (Java, TypeScript, Go, Rust) and even larger repository sizes to comprehensively evaluate context engines and retrieval systems at scale.

Agent Leaderboard

Ranked by win rate from head-to-head agent-as-judge evaluation. Each agent judges every other agent's solution, and pairwise decisions determine win/loss/tie records.
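
The sketch below shows one plausible way to turn pairwise verdicts (such as the records produced by pairwise_judgments above) into per-agent win/loss/tie records and a win-rate ranking. The formula win_rate = wins / (wins + losses + ties) is an assumption; the leaderboard's exact definition is not stated here.

```python
from collections import defaultdict

def tally(records):
    """Aggregate pairwise verdicts into per-agent win/loss/tie records and rank by win rate."""
    stats = defaultdict(lambda: {"wins": 0, "losses": 0, "ties": 0})
    for r in records:
        if r["verdict"] == "tie":
            stats[r["a"]]["ties"] += 1
            stats[r["b"]]["ties"] += 1
        else:
            winner, loser = (r["a"], r["b"]) if r["verdict"] == "a" else (r["b"], r["a"])
            stats[winner]["wins"] += 1
            stats[loser]["losses"] += 1
    leaderboard = []
    for agent, s in stats.items():
        total = s["wins"] + s["losses"] + s["ties"]
        # Assumed definition: ties count in the denominator but not as wins.
        win_rate = s["wins"] / total if total else 0.0
        leaderboard.append((agent, win_rate, s["wins"], s["losses"], s["ties"]))
    # Highest win rate first.
    return sorted(leaderboard, key=lambda row: row[1], reverse=True)
```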

Leaderboard columns: Rank, Agent, Win Rate, Wins, Losses, Ties.

Head-to-Head Details by PR

View pairwise agent judgments and results for each PR
