How to Measure Software Teams’ AI Usage and Expertise

Written by: Mark Hull, Co-Founder and CEO, Exceeds AI | Last updated: April 22, 2026

Key Takeaways

  • AI now generates 41% of global code, yet traditional metrics blur AI and human work, hiding real ROI and risk.
  • Use a 7-step hybrid framework that combines code-level analysis, baselines, multi-tool tracking, expertise levels, and long-term outcomes.
  • Track AI-specific metrics such as AI code percentage, cycle time changes, incident rates, and tool effectiveness to show productivity gains without sacrificing quality.
  • Avoid survey-only and metadata-only tools. Use multi-signal detection for tool-agnostic visibility across Cursor, Copilot, Claude, and other assistants.
  • Exceeds AI delivers repository insights in hours with executive-ready ROI evidence and coaching recommendations. Connect your repo for a free pilot today.

Measure AI Impact on Engineering with a 7-Step Hybrid Framework

Measuring AI coding ROI requires metrics that capture AI-specific signals, not just traditional DORA indicators. The 2025 DORA Report found that high AI adoption amplifies both strengths and weaknesses, which means standard productivity metrics alone miss critical nuances of AI-assisted development. A hybrid framework combines code-level analysis with traditional outcomes so you can see how AI actually changes delivery, quality, and risk.

Step 1: Establish Pre-AI Baselines

Start by documenting your team’s performance before AI adoption using traditional metrics such as cycle time, deployment frequency, and defect rates. These indicators create your starting point. However, DORA metrics are increasingly misleading in AI-augmented contexts because they measure outcomes without decomposing AI impact. For more precise baselines, use AI-native tools that analyze historical code directly instead of relying only on metadata.
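A baseline calculation along these lines can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the record fields (`opened`, `merged`, `caused_defect`), the sample data, and the `compute_baseline` helper are all hypothetical and do not come from any specific tool.

```python
from datetime import date, datetime
from statistics import median

# Hypothetical pre-AI records; field names and values are assumptions for illustration.
pull_requests = [
    {"opened": datetime(2024, 1, 2, 9), "merged": datetime(2024, 1, 3, 9), "caused_defect": False},
    {"opened": datetime(2024, 1, 4, 9), "merged": datetime(2024, 1, 6, 9), "caused_defect": True},
    {"opened": datetime(2024, 1, 8, 9), "merged": datetime(2024, 1, 11, 9), "caused_defect": False},
    {"opened": datetime(2024, 1, 9, 9), "merged": datetime(2024, 1, 13, 9), "caused_defect": False},
]
deployments = [date(2024, 1, 3), date(2024, 1, 6), date(2024, 1, 11), date(2024, 1, 13)]


def compute_baseline(prs, deploy_dates, period_days):
    """Summarize cycle time, deployment frequency, and defect rate for one period."""
    cycle_hours = [(pr["merged"] - pr["opened"]).total_seconds() / 3600 for pr in prs]
    return {
        "median_cycle_time_hours": median(cycle_hours),
        "deploys_per_week": len(deploy_dates) / (period_days / 7),
        "defect_rate": sum(pr["caused_defect"] for pr in prs) / len(prs),
    }


baseline = compute_baseline(pull_requests, deployments, period_days=14)
print(baseline)
```

Storing the result per team and per period gives later steps a fixed reference point to compare against.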

Step 2: Implement Quantitative Usage Tracking

With your baseline in place, the next step is tracking how AI usage changes day-to-day work. Deploy code-level AI observability to capture actual usage patterns. Analyze commit diffs and pull request content to identify AI-generated code, not just API telemetry. Companies like Zapier track employees’ AI token usage via dashboards and investigate cases where usage is five times higher than peers to separate efficient patterns from wasteful ones. Use similar approaches with tools that provide code-aware tracking without enterprise-level pricing.
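One way to surface usage outliers for review is sketched below. The five-times multiple mirrors the Zapier example above, but comparing each engineer against the median of their peers is an assumption made for this sketch, as are the names, token counts, and the `flag_usage_outliers` helper.

```python
from statistics import median

# Hypothetical monthly token counts per engineer; names and numbers are made up.
token_usage = {
    "ana": 1_200_000,
    "ben": 900_000,
    "chen": 1_000_000,
    "dee": 6_500_000,
    "eli": 1_100_000,
}


def flag_usage_outliers(usage, multiple=5.0):
    """Return engineers whose usage is at least `multiple` times their peers' median."""
    flagged = {}
    for engineer, tokens in usage.items():
        peers = [t for name, t in usage.items() if name != engineer]
        if not peers:
            continue  # nothing to compare a single engineer against
        peer_median = median(peers)
        if peer_median > 0 and tokens >= multiple * peer_median:
            flagged[engineer] = tokens / peer_median  # ratio to the peer median
    return flagged


outliers = flag_usage_outliers(token_usage)
print(outliers)
```

A flag here is only a prompt to investigate. High usage can reflect an efficient workflow as easily as a wasteful one, so the result should feed a conversation rather than a verdict.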

Step 3: Map AI Usage Across Multiple Tools

Modern teams rely on several AI tools across the development lifecycle. Engineers might use Cursor for feature work, Claude Code for refactoring, GitHub Copilot for autocomplete, and other assistants for testing or documentation. Implement tool-agnostic detection that flags AI-generated code regardless of which assistant produced it. MetaCTO’s 2026 analysis reports 92% adoption for AI tools in the Development and Coding phase, with a 55% coding productivity lift, but that level of insight requires visibility across the entire AI toolchain. Prioritize solutions that support multiple tools without separate integrations for each one.
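As a rough sketch of tool-agnostic attribution, the snippet below groups commits by the assistant named in the commit message. The keyword patterns, the fallback labels, and the `attribute_tool` helper are assumptions for illustration. Real teams would tune them to their own tagging conventions.

```python
import re
from collections import Counter

# Hypothetical keyword patterns; these are assumptions, not an official list.
TOOL_PATTERNS = {
    "Cursor": re.compile(r"\bcursor\b", re.IGNORECASE),
    "GitHub Copilot": re.compile(r"\bcopilot\b", re.IGNORECASE),
    "Claude Code": re.compile(r"\bclaude\b", re.IGNORECASE),
}
GENERIC_AI = re.compile(r"\bai-(generated|assisted)\b", re.IGNORECASE)


def attribute_tool(message):
    """Return the first tool whose keyword appears in the commit message."""
    for tool, pattern in TOOL_PATTERNS.items():
        if pattern.search(message):
            return tool
    if GENERIC_AI.search(message):
        return "Unspecified AI tool"
    return "No AI signal"


commit_messages = [
    "Add billing webhook (cursor)",
    "Refactor auth module with Claude Code",
    "Fix typo in README",
    "Generate test fixtures [ai-generated]",
    "Autocomplete helper via Copilot",
]
usage_by_tool = Counter(attribute_tool(m) for m in commit_messages)
print(usage_by_tool)
```

Keyword matching alone is noisy. A commit about a database cursor would be misattributed here, which is one reason commit messages work better as one signal among several than as the only detector.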

Step 4: Define AI Expertise Levels from Outcomes

Define AI expertise based on observable outcomes instead of self-reported skill. Track patterns such as code quality, review iterations, and long-term maintainability of AI-assisted contributions. The following table shows how these patterns map to three distinct expertise levels, which helps you segment your team for targeted coaching and enablement:

Exceeds AI Repo Leaderboard shows top contributing engineers with trends for AI lift and quality
Expertise Level | Usage Pattern | Quality Indicators | Productivity Signals
Low | Sporadic AI use, high rework | Above-average review iterations | Minimal cycle time improvement
Medium | Regular AI use, stable quality | Standard review patterns | 10-20% productivity gains
High | Strategic AI use, quality maintained | Reduced review iterations | 20%+ productivity gains
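The segmentation in the table can be expressed as a small classifier. This is a sketch under stated assumptions: the `EngineerOutcomes` fields and the `classify_expertise` helper are hypothetical, review iterations are compared against a team average, and a gain of exactly 20% is treated as High.

```python
from dataclasses import dataclass


@dataclass
class EngineerOutcomes:
    name: str
    productivity_gain_pct: float  # improvement versus the pre-AI baseline
    review_iterations: float      # average review rounds per AI-assisted PR


def classify_expertise(outcomes, team_avg_review_iterations):
    """Map observed outcomes to the Low / Medium / High levels from the table."""
    reduced_reviews = outcomes.review_iterations < team_avg_review_iterations
    above_avg_reviews = outcomes.review_iterations > team_avg_review_iterations
    if outcomes.productivity_gain_pct >= 20 and reduced_reviews:
        return "High"
    if outcomes.productivity_gain_pct >= 10 and not above_avg_reviews:
        return "Medium"
    return "Low"


team = [
    EngineerOutcomes("ana", productivity_gain_pct=25, review_iterations=1.4),
    EngineerOutcomes("ben", productivity_gain_pct=15, review_iterations=2.0),
    EngineerOutcomes("chen", productivity_gain_pct=3, review_iterations=3.1),
]
levels = {e.name: classify_expertise(e, team_avg_review_iterations=2.0) for e in team}
print(levels)
```

Because the inputs are observed outcomes rather than self-assessments, the same function can be rerun each quarter to see who has moved between levels.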

Step 5: Add Qualitative Context to the Numbers

Combine quantitative data with targeted surveys and interviews to understand how developers experience AI tools. Use qualitative input to explain patterns you see in the code, but avoid relying only on self-reported data. DX’s Q4 2025 impact report found that daily AI users merge about 60% more pull requests than non-users, yet that correlation does not establish causation without code-level analysis. Platforms that join survey data with repository signals give a more complete picture of AI’s impact.

Step 6: Track Longitudinal Outcomes for AI-Touched Code

Follow AI-touched code for at least 30 days to see how it behaves in production. Monitor incidents, rework, and technical debt linked to these changes. Cortex 2026 data reveals that incidents per pull request increased by 23.5%, which highlights the need to track long-term outcomes rather than stopping at initial review approval. This step connects AI usage to reliability and maintainability.
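A 30-day follow-up of this kind might look like the sketch below. It assumes each incident has already been attributed to a pull request, which is the hard part in practice. The record fields, sample data, and the `incident_rate` helper are hypothetical.

```python
from datetime import date, timedelta

# Hypothetical merged PRs and incidents; the attribution via "pr_id" is assumed.
merged_prs = [
    {"id": 101, "merged": date(2026, 1, 5), "ai_touched": True},
    {"id": 102, "merged": date(2026, 1, 6), "ai_touched": False},
    {"id": 103, "merged": date(2026, 1, 9), "ai_touched": True},
    {"id": 104, "merged": date(2026, 1, 12), "ai_touched": False},
]
incidents = [
    {"pr_id": 101, "opened": date(2026, 1, 20)},  # inside the 30-day window
    {"pr_id": 103, "opened": date(2026, 3, 1)},   # outside the 30-day window
]


def incident_rate(prs, incident_log, ai_touched, window_days=30):
    """Incidents per PR within `window_days` of merge, for one cohort."""
    cohort = {pr["id"]: pr["merged"] for pr in prs if pr["ai_touched"] == ai_touched}
    if not cohort:
        return 0.0
    window = timedelta(days=window_days)
    hits = sum(
        1
        for inc in incident_log
        if inc["pr_id"] in cohort
        and timedelta(0) <= inc["opened"] - cohort[inc["pr_id"]] <= window
    )
    return hits / len(cohort)


ai_rate = incident_rate(merged_prs, incidents, ai_touched=True)
human_rate = incident_rate(merged_prs, incidents, ai_touched=False)
print(ai_rate, human_rate)
```

Widening `window_days` to 60 or 90 shows whether problems surface later than the initial window suggests.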

Step 7: Turn AI Data into Coaching and Process Changes

The longitudinal data from Step 6 reveals patterns that require action. Transform these findings into specific coaching recommendations and process improvements. Identify which teams use AI effectively and which ones struggle, then share practices from high performers and support teams that show higher rework or incident rates. This final step closes the loop by turning measurement into behavior change.

Actionable insights to improve AI impact in a team.

AI Developer Productivity Metrics to Prioritize

Focus on metrics that capture AI’s unique impact on the development process. Jellyfish’s 2025 data shows that increased AI adoption can reduce median cycle time, but you must measure this alongside quality indicators to avoid chasing speed at the cost of maintainability. The following four metrics provide a balanced view of AI impact by combining usage signals with quality proxies:

Exceeds AI Impact Report shows AI code contributions, productivity lift, and AI code quality
Metric | How to Measure | AI Usage Signal | Quality Proxy
AI Code Percentage | Commit and PR diff analysis | Direct usage indicator | Review iteration count
Cycle Time Delta | AI versus human task comparison | Productivity impact | Rework rate tracking
Incident Rate (30-day) | Production issue attribution | Long-term quality | Technical debt accumulation
Tool Effectiveness | Cross-tool outcome comparison | Adoption optimization | Context-specific performance
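The first two metrics in the table reduce to simple arithmetic once per-PR data exists. The sketch below assumes hypothetical records in which AI-generated lines have already been identified. The field names and both helper functions are illustrative only.

```python
from statistics import median

# Hypothetical per-PR records; "ai_lines_added" assumes detection has already run.
prs = [
    {"lines_added": 200, "ai_lines_added": 120, "cycle_hours": 20, "ai_assisted": True},
    {"lines_added": 100, "ai_lines_added": 0, "cycle_hours": 30, "ai_assisted": False},
    {"lines_added": 300, "ai_lines_added": 180, "cycle_hours": 24, "ai_assisted": True},
    {"lines_added": 400, "ai_lines_added": 0, "cycle_hours": 34, "ai_assisted": False},
]


def ai_code_percentage(records):
    """Share of added lines attributed to AI, as a percentage."""
    total = sum(pr["lines_added"] for pr in records)
    ai = sum(pr["ai_lines_added"] for pr in records)
    return 100 * ai / total if total else 0.0


def cycle_time_delta_pct(records):
    """Percent difference in median cycle time, AI-assisted versus human-only PRs."""
    ai = [pr["cycle_hours"] for pr in records if pr["ai_assisted"]]
    human = [pr["cycle_hours"] for pr in records if not pr["ai_assisted"]]
    if not ai or not human:
        return None  # no comparison possible without both cohorts
    return 100 * (median(ai) - median(human)) / median(human)


print(ai_code_percentage(prs))    # 30.0
print(cycle_time_delta_pct(prs))  # -31.25 (negative means AI-assisted PRs were faster)
```

A negative delta on its own says nothing about quality, so it should be read next to the review iteration and rework proxies in the same table.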

Prove GitHub Copilot Impact with Outcomes

GitHub Copilot’s built-in analytics show usage statistics but do not connect that usage to business results. To demonstrate real impact, track how Copilot-touched code performs compared to human-only contributions. CodeRabbit’s December 2025 report found that AI-coauthored pull requests have about 1.7 times more issues than human-only pull requests, which underscores the need for outcome-based measurement beyond adoption metrics. Consider platforms that extend this analysis across all AI tools, not just Copilot.

AI Code Quality Analytics for Risk Management

Set up systematic quality tracking for AI-generated code so you can manage risk proactively. Exceeds AI’s 2026 benchmarks show that AI-generated code can produce higher defect rates than human-written code. Continuous monitoring helps you catch these patterns early and adjust guidelines, review practices, or training before issues reach customers.

Engineering AI Adoption Metrics for Coaching

Track AI adoption patterns across teams and individuals to uncover coaching opportunities. Greptile’s State of AI Coding 2025 report highlights the value of injecting organizational context into AI coding agents. Adoption metrics paired with context-aware performance data show where better prompts, training, or guardrails can unlock more value.

Gaps and Pitfalls in Traditional AI Measurement

Where Current Industry Approaches Fall Short

Most AI measurement approaches focus on single-tool adoption or rely on developer surveys that provide subjective data instead of objective proof. METR’s original July 2025 randomized controlled trial showed AI use increased task completion time by 19%, despite developers’ pre-task estimate of 24% savings. This gap between perception and reality illustrates why survey data alone cannot guide AI strategy.

Traditional developer analytics platforms such as Jellyfish, LinearB, and Swarmia track metadata but cannot distinguish AI-generated code from human-written code at the commit level. That limitation makes it impossible to prove causation between AI adoption and productivity outcomes. AI-native code analysis platforms address this by examining the code itself rather than only pipeline events.

Common Pitfalls: AI Bloat and Complexity

Teams often fall into the trap of measuring activity instead of outcomes. GitClear’s 2025 research found that code cloning increased fourfold during 2024 as AI code generation rose, which shows how volume metrics can mislead leaders about real progress.

Reduce false positives by using multi-signal detection that combines code pattern analysis, commit message parsing, and optional telemetry integration. This approach accurately identifies AI contributions across tools and contexts. Cheaper AI-focused solutions now make this level of detection accessible without custom internal development.
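A multi-signal detector can be sketched as a weighted score over the three signals. The weights, the 0.5 threshold, and the `CommitSignals` structure below are assumptions chosen for illustration, and the code-pattern score is taken as given from some upstream classifier that this sketch does not implement.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical weights and threshold; real values would need calibration.
WEIGHTS = {"code_pattern": 0.5, "commit_message": 0.3, "telemetry": 0.2}
THRESHOLD = 0.5


@dataclass
class CommitSignals:
    code_pattern_score: float         # 0.0 to 1.0, from an assumed upstream classifier
    commit_message_match: bool        # commit message mentions an AI tool
    telemetry_match: Optional[bool]   # None when telemetry is not integrated


def ai_likelihood(signals):
    """Weighted average of the available signals, skipping missing telemetry."""
    weighted = [
        (WEIGHTS["code_pattern"], signals.code_pattern_score),
        (WEIGHTS["commit_message"], 1.0 if signals.commit_message_match else 0.0),
    ]
    if signals.telemetry_match is not None:
        weighted.append((WEIGHTS["telemetry"], 1.0 if signals.telemetry_match else 0.0))
    total_weight = sum(w for w, _ in weighted)
    return sum(w * v for w, v in weighted) / total_weight


def is_ai_generated(signals):
    return ai_likelihood(signals) >= THRESHOLD


strong = CommitSignals(code_pattern_score=0.9, commit_message_match=True, telemetry_match=None)
keyword_only = CommitSignals(code_pattern_score=0.2, commit_message_match=True, telemetry_match=False)
print(is_ai_generated(strong), is_ai_generated(keyword_only))
```

The second example shows how the combination reduces false positives: a commit message match alone does not cross the threshold when the code pattern and telemetry signals disagree.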

The Best Way: Exceeds AI for Proven AI Coding ROI

Implementing this multi-signal, code-level approach manually requires significant engineering effort. Exceeds AI provides tool-agnostic repository visibility that connects AI adoption directly to business outcomes. Unlike metadata-only platforms, Exceeds analyzes code diffs at the commit and pull request level to distinguish AI-generated contributions from human work across Cursor, Claude Code, GitHub Copilot, Windsurf, and other tools.

Exceeds AI Impact Report with Exceeds Assistant providing custom insights
Exceeds AI Impact Report with PR and commit-level insights

The platform delivers actionable insights rather than static dashboards. Engineering leaders receive executive-ready ROI evidence, while managers get specific coaching recommendations to scale effective AI adoption patterns across teams. Mark Hull, founder of Exceeds AI, used Anthropic’s Claude Code to develop three workflow tools totaling around 300,000 lines of code at a token cost of about $2,000, which demonstrates how the platform supports real-world, multi-tool AI coding analytics.

Simple GitHub authorization delivers insights within 60 minutes, with complete historical analysis finished within 4 hours. This rapid setup contrasts sharply with traditional platforms that often require many months before they can show ROI.

Connect your repo and start a free pilot to see your AI impact map in hours.

Case Study: Results from a 300-Engineer Organization

A mid-market software company with 300 engineers across several product teams implemented Exceeds AI to prove ROI on its AI tool investments. Within the first hour, the company discovered that GitHub Copilot contributed to 58% of all commits and that overall team productivity had risen 18% in correlation with AI usage.

View comprehensive engineering metrics and analytics over time

However, deeper analysis revealed increasing rework rates that reduced contribution stability. To understand the root cause, leadership used Exceeds Assistant to examine commit patterns. The analysis showed that a high percentage of commits were AI-driven and spiky, which indicated context switches that disrupted coding flow. With this diagnosis, leaders delivered targeted coaching to struggling teams and scaled practices from high-performing groups.

The company now has concrete proof of AI ROI with specific metrics and can make data-driven decisions about AI tool strategy and team-level coaching. This evidence allows them to justify continued AI investment with confidence.

Conclusion

Measuring real AI usage and expertise in software teams requires code-level visibility and AI-specific signals. The 7-step hybrid framework combines baselines, multi-tool tracking, expertise segmentation, and longitudinal outcomes so engineering leaders can prove ROI, surface risks, and scale effective adoption patterns.

Success depends on joining quantitative code analysis with qualitative insights while avoiding activity-only metrics inflated by AI-generated volume. Tool-agnostic detection, long-term outcome tracking, and clear coaching recommendations turn raw AI data into better engineering performance.

Connect your repo and start your free pilot to get code-level visibility into AI adoption patterns and present confident ROI evidence to your leadership.

Frequently Asked Questions

How is measuring AI usage different from traditional developer productivity metrics?

Traditional developer productivity metrics such as DORA indicators track outcomes for the entire development process without separating AI contributions from human work. Measuring AI usage requires code-level analysis that identifies which specific lines, commits, and pull requests involve AI assistance. This distinction matters because AI can inflate activity metrics while introducing quality risks that surface later. Effective AI measurement combines traditional productivity indicators with AI-specific signals such as tool usage patterns, quality comparisons between AI and human contributions, and longitudinal tracking of AI-touched code in production.

What is the most reliable way to identify AI-generated code across multiple tools?

The most reliable approach uses multi-signal detection that blends code pattern analysis, commit message parsing, and optional telemetry integration. AI-generated code often shows distinctive formatting, variable naming, and comment styles that algorithms can detect. Many developers also tag AI usage in commit messages with terms like “cursor,” “copilot,” or “ai-generated.” When available, official tool telemetry adds another layer of validation. This combined method reduces false positives and works across tools such as Cursor, Claude Code, GitHub Copilot, and others, giving you tool-agnostic visibility into your AI stack.

How do you measure AI expertise levels without relying on self-reported data?

Measure AI expertise through observable outcomes instead of self-reported proficiency. High-expertise users show consistent patterns: their AI-assisted code needs fewer review iterations, maintains stable quality metrics, and delivers measurable productivity gains without higher rework rates. Medium-expertise users show regular AI adoption with standard quality patterns. Low-expertise users display sporadic usage, higher rework, and above-average review cycles. Track these signals over time to see who needs coaching and who should share best practices across the organization.

What are the biggest risks of AI-generated code that traditional metrics miss?

The biggest risk comes from AI-generated code that passes initial review but creates problems 30 to 90 days later in production. These issues include subtle bugs, architectural misalignments, and maintainability problems that appear only under specific conditions or as the codebase evolves. Traditional metrics miss this because they focus on immediate outcomes such as merge success and initial cycle times. Longitudinal tracking reveals patterns where AI-touched code has higher incident rates, requires more follow-on edits, or accumulates technical debt. These hidden risks can significantly affect long-term stability and maintenance costs.

How quickly can engineering teams start measuring AI impact effectively?

With the right platform and approach, teams can see meaningful AI impact data within hours. Tools that provide repository-level access and analyze historical commit data can start producing insights as soon as they connect to your source control. Simple GitHub authorization often delivers first findings within 60 minutes, with complete historical analysis finished within 4 hours. This rapid time-to-value contrasts with traditional developer analytics platforms that may require weeks or months of new data before they become useful. The speed advantage comes from analyzing existing code history instead of waiting for future activity.
