OP here.
We recently open-sourced a small tool we built internally to tackle a question we couldn't find a good existing answer to: How do you evaluate AI coding agents on a real production codebase?
Like most teams, we had lots of opinions about which agents and models "felt" best, but no hard data. The missing piece wasn’t analysis; it was attribution. We needed to know which lines of code were written by which agent/model, without changing how engineers work.
The key insight was that Git already gives us most of what we need.
By reverse-engineering how tools like Cursor and Claude Code modify files, we attach attribution metadata directly to Git whenever an AI agent edits code. Engineers don’t have to opt in or change their workflows.
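To give a flavor of what that looks like concretely, here's a simplified sketch of one way to attach attribution to commits, using git notes. The notes ref and JSON fields below are illustrative, not our exact schema:

```python
# Illustrative sketch: record which agent/model produced a commit by
# attaching a JSON note under a dedicated notes ref. The "ai-attribution"
# ref name and the field names are made up for this example.
import json
import subprocess

def attach_attribution(commit_sha: str, agent: str, model: str, files: list[str]) -> None:
    note = json.dumps({"agent": agent, "model": model, "files": files})
    subprocess.run(
        ["git", "notes", "--ref=ai-attribution", "add", "-f", "-m", note, commit_sha],
        check=True,
    )

def read_attribution(commit_sha: str) -> dict | None:
    result = subprocess.run(
        ["git", "notes", "--ref=ai-attribution", "show", commit_sha],
        capture_output=True, text=True,
    )
    return json.loads(result.stdout) if result.returncode == 0 else None
```

Keeping the metadata in Git itself means it rides along with normal history, though notes refs do have to be pushed and fetched explicitly.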
Once that data exists, we can run fairly simple queries to answer questions like:
- merged lines per dollar by agent/model
- whether AI-generated code correlates with higher bug rates
- how different developers actually use AI in practice
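As a concrete example, the first query boils down to a group-by once the attribution data is exported as plain rows; the {agent, model, merged_lines, cost_usd} row shape here is illustrative:

```python
# Illustrative "merged lines per dollar by agent/model" aggregation over
# exported attribution rows (the row schema is made up for this example).
from collections import defaultdict

def lines_per_dollar(rows: list[dict]) -> dict[tuple[str, str], float]:
    merged = defaultdict(int)
    cost = defaultdict(float)
    for row in rows:
        key = (row["agent"], row["model"])
        merged[key] += row["merged_lines"]
        cost[key] += row["cost_usd"]
    # Skip agent/model pairs with no recorded spend to avoid dividing by zero.
    return {key: merged[key] / cost[key] for key in merged if cost[key] > 0}
```

The other questions are mostly the same shape: different group-bys and joins (e.g. against bug-tracker data) over the same attribution rows.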
An unexpected side benefit showed up in code review: once we surfaced AI attribution in pull requests, reviews got faster because reviewers could focus on AI-generated code in sensitive areas.
We've open-sourced the data capture layer and code review extension so other teams can experiment with this approach. For us, the most valuable part wasn't which agent "won," but finally having a way to measure it at all.
Happy to answer questions or hear critiques.