How to Track AI Technical Debt: A 30-Day Workflow

Written by: Mark Hull, Co-Founder and CEO, Exceeds AI

Key Takeaways for Your 30-Day AI Debt Plan

  • AI-generated code now accounts for 42% of committed code. Metadata-only dashboards cannot separate AI from human work, so you need line-level provenance tracking.
  • AI technical debt shows up as extra rework, incidents, and architecture drift 30–90 days after code passes review. Early detection lets managers correct course before quality erodes.
  • Five core metrics – AI Code Churn Rate, SATD Detection Rate, Architecture Drift Score, 30-Day AI-Touched Incident Rate, and Rework Velocity – give a practical 30-day view of AI technical debt.
  • Client-level capture tools like Exceeds Ink outperform heuristic detection by tying specific lines to their originating tool, model, session, and interaction mode.
  • Engineering teams can start tracking AI technical debt with Exceeds AI, establish baselines, and produce executive-ready reports within weeks.

Before You Begin: Access, Baselines, and Alignment

Three prerequisites determine whether a 30-day tracking workflow produces actionable data or noise.

Repo access. Code-level attribution requires read access to your repositories. Without it, every attribution claim is a heuristic guess. Exceeds AI uses scoped read-only access. Code exists on servers for seconds and is never stored permanently.

Baseline data. Collect 30 days of incident history, churn history, and PR rework rates before Day 1. This pre-measurement baseline is what you will compare your 30-day outcomes against to see whether AI technical debt is rising or falling. If your current tooling cannot produce per-line churn broken out by AI versus human authorship, that gap is the primary limitation this workflow addresses, because aggregate metrics hide the AI-specific signal.

Stakeholder alignment. Agree with engineering leadership on three success metrics before instrumentation begins. Common choices include 30-day AI code churn rate, AI-touched incident rate, and rework velocity. Locking these in advance prevents post-hoc metric selection that undermines credibility with the board.

With access, baselines, and alignment in place, you can define AI technical debt in terms that match your organization’s goals.

Step 1: Define AI Technical Debt in 30–90 Day Outcomes

AI adoption contributes to technical debt through three mechanisms: context debt (AI-generated code that lacks full system nuance), consistency debt (divergence in patterns and naming conventions across AI-generated modules), and verification debt (plausible-looking code that passes inattentive review without full evaluation of edge cases or security implications). All three show up as measurable costs over time.

A manager-centric definition anchored to 30-day outcomes captures those costs. AI technical debt is the additional rework, incident response, and architectural remediation cost incurred when AI-generated code that passed initial review degrades quality metrics within a 30-to-90-day window. It is distinct from traditional technical debt because its source is attributable to a specific tool, model, and interaction mode. That provenance makes the debt measurable and addressable at that level.

AI-generated code can cost more to maintain than traditionally developed code, and its bugs can cost more to fix than bugs in human-written code, because the maintainer must first reverse-engineer what the AI intended. Defining debt in dollar terms at the outset makes board reporting straightforward.
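One way to express the definition above in dollar terms is a simple additive cost model. The sketch below is illustrative only: the class name, the function name, the hour counts, and the blended hourly rate are all assumptions, not figures or APIs from Exceeds AI.

```python
from dataclasses import dataclass


@dataclass
class DebtInputs:
    """Hours spent on AI-touched code in the 30-to-90-day window (hypothetical inputs)."""
    rework_hours: float        # rewriting AI-touched lines after review
    incident_hours: float      # responding to incidents traced to AI-touched lines
    remediation_hours: float   # architectural cleanup of AI-introduced drift


def ai_debt_cost(inputs: DebtInputs, hourly_rate: float) -> float:
    """Estimated dollar cost: total hours multiplied by a blended hourly rate."""
    total_hours = (
        inputs.rework_hours + inputs.incident_hours + inputs.remediation_hours
    )
    return total_hours * hourly_rate


if __name__ == "__main__":
    # Made-up numbers, purely to show the shape of the calculation.
    example = DebtInputs(rework_hours=40, incident_hours=12, remediation_hours=8)
    print(ai_debt_cost(example, hourly_rate=100.0))
```

The three inputs mirror the three cost components in the definition (rework, incident response, architectural remediation), so each can later be fed from the corresponding tracked metric.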

Step 2: Select and Calculate Five Core AI Debt Metrics

The following five metrics, each anchored to a provenance source, form the measurement core of the 30-day workflow. Together, they show both near-term code quality and downstream impact by tracking churn, SATD, incidents, rework, and architecture drift.

Actionable insights to improve AI impact in a team.

1. AI Code Churn Rate. Formula: (AI-touched lines rewritten or deleted within 30 days) ÷ (total AI-touched lines committed) × 100. Healthy threshold: below 12% at the 30-day window. A rate significantly above your human-authored baseline should be treated as critical. Provenance source: Exceeds Ink Git Notes at refs/notes/exceeds-ink, which records the tool, model, and session for every attributed line. This line-level provenance lets you segment churn by AI tool instead of blending everything into a single aggregate.

2. Self-Admitted Technical Debt (SATD) Detection Rate. Formula: (AI-attributed commits containing TODO, FIXME, HACK, or equivalent markers) ÷ (total AI-attributed commits) × 100. GitClear’s analysis of 211 million lines of code found that duplicate code-block frequency rose approximately 8x year-over-year in 2024, a separate finding from the 42% AI code share and one that correlates with elevated SATD. Healthy threshold: below 8% of AI commits. Critical: above 20%. Provenance source: Ink commit-level attestation cross-referenced against diff content, which ties each SATD marker to its AI or human origin.

3. Architecture Drift Score. This metric captures how often AI-generated commits introduce structural divergence. Measure it as the rate of new modules or files introduced by AI-generated commits that deviate from established naming, layering, or dependency conventions. A practical proxy is: (AI-attributed files violating architectural lint rules) ÷ (total AI-attributed files introduced in the window) × 100. AI lacks inherent mechanisms for enforcing system-wide architectural discipline, which leads to divergence in code structure, naming conventions, and architectural patterns over time. Healthy threshold: below 10%. Critical: above 30%. Provenance source: Ink tool and interaction-mode classification, which highlights agent-mode sessions most likely to introduce structural drift.

4. 30-Day AI-Touched Incident Rate. Formula: (production incidents traceable to AI-attributed lines within 30 days of merge) ÷ (total AI-attributed PRs merged in the window) × 100. A January 2026 SmartBear survey of 273 software testing and quality decision-makers found that 70% said application quality had already degraded as AI accelerated development, with 60% reporting quality issues in the past year specifically because development outpaced testing. Healthy threshold: AI-touched incident rate within 1.2x of the human-only rate. Critical: above 2x. Provenance source: Ink per-commit attestation joined to incident tracking data such as JIRA or Linear, which connects specific incidents to AI-touched lines.

5. Rework Velocity. Formula: (follow-on commits modifying AI-attributed lines within 30 days) ÷ (total AI-attributed commits) × 100. The 30-day code turnover rate for AI-generated code is typically higher than for human-written code, with AI-to-human turnover ratios of 1.8–2.5x signaling potential quality erosion. Healthy threshold: AI rework velocity below 1.5x human baseline. Critical: above 2.5x. Provenance source: Ink longitudinal outcome tracking, which follows AI-touched code over 30-plus days using per-commit attestation.
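The five formulas above all reduce to a numerator, a denominator, and a multiplication by 100. A minimal sketch, assuming you already have the raw counts from your attribution source (the function and parameter names here are illustrative, not part of any Exceeds AI API):

```python
def pct(numerator: float, denominator: float) -> float:
    """Return numerator / denominator as a percentage; 0.0 if the denominator is zero."""
    if denominator == 0:
        return 0.0
    return numerator * 100 / denominator


def ai_code_churn_rate(ai_lines_rewritten_or_deleted: int, ai_lines_committed: int) -> float:
    """Metric 1: share of AI-touched lines rewritten or deleted within 30 days."""
    return pct(ai_lines_rewritten_or_deleted, ai_lines_committed)


def satd_detection_rate(ai_commits_with_markers: int, ai_commits: int) -> float:
    """Metric 2: share of AI-attributed commits containing TODO/FIXME/HACK markers."""
    return pct(ai_commits_with_markers, ai_commits)


def architecture_drift_score(ai_files_violating_lint: int, ai_files_introduced: int) -> float:
    """Metric 3 (proxy): share of AI-attributed files violating architectural lint rules."""
    return pct(ai_files_violating_lint, ai_files_introduced)


def ai_incident_rate(incidents_traced_to_ai_lines: int, ai_prs_merged: int) -> float:
    """Metric 4: incidents traced to AI-attributed lines per 100 AI-attributed PRs."""
    return pct(incidents_traced_to_ai_lines, ai_prs_merged)


def rework_velocity(follow_on_commits: int, ai_commits: int) -> float:
    """Metric 5: follow-on commits modifying AI-attributed lines per 100 AI commits."""
    return pct(follow_on_commits, ai_commits)


def ratio_to_human(ai_value: float, human_value: float) -> float:
    """Express an AI metric as a multiple of the human-only baseline."""
    if human_value == 0:
        raise ValueError("human baseline must be non-zero to compute a ratio")
    return ai_value / human_value


if __name__ == "__main__":
    print(ai_code_churn_rate(120, 1000))
    print(ratio_to_human(rework_velocity(45, 100), 30.0))
```

The thresholds for metrics 4 and 5 are stated relative to the human baseline, so `ratio_to_human` is the value you would compare against 1.2x, 1.5x, 2x, or 2.5x.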

Exceeds AI Impact Report shows AI code contributions, productivity lift, and AI code quality

Pro Tip: Heuristic and watermark-based AI detection tops out around 20–25% accuracy by Exceeds AI’s own assessment. Every metric above is only as reliable as its attribution source. Line-level Git Notes attestation from Exceeds Ink is the only non-heuristic method that connects a specific line to its tool, model, session, and interaction mode at commit finalization.

Step 3: Choose an Attribution Architecture That Matches Your Stack

Two architectural approaches exist for AI code attribution: heuristic detection and client-level capture.

Heuristic detection infers AI authorship from code patterns, commit message keywords, and timing signals. It requires no on-machine installation and produces results immediately. The accuracy ceiling is low, because pattern-matching cannot distinguish a developer who typed quickly from one who accepted a Cursor suggestion, and it is blind to interaction mode entirely.

The alternative architecture solves these limitations. Client-level capture observes what the AI tool actually does on the developer’s machine at the moment work is produced. Exceeds Ink uses a hook-direct model with per-tool checkpoint materializers for Claude Code, Cursor, and Codex. It fires from standard Git hooks (prepare-commit-msg, post-commit, post-rewrite), writes a structured attestation as a Git Note at refs/notes/exceeds-ink, and exits. No long-lived daemon runs on developer machines. No PATH-shimmed git binary is installed. No global git config is mutated. The attestation is portable, machine-readable JSON that lives in your own repository and travels across forks and mirrors.
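Because the attestation is a Git Note, standard Git tooling can read it. The sketch below shells out to `git notes --ref=exceeds-ink show`, which resolves to refs/notes/exceeds-ink, and parses the result as JSON. The field names (`tool`, `model`, `interaction_mode`) are hypothetical placeholders; the actual attestation schema is not described in this article.

```python
import json
import subprocess


def read_ink_note(commit: str = "HEAD", repo_dir: str = ".") -> str:
    """Return the raw note attached to `commit` under refs/notes/exceeds-ink.

    Raises subprocess.CalledProcessError if the commit has no note.
    """
    result = subprocess.run(
        ["git", "notes", "--ref=exceeds-ink", "show", commit],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def summarize_attestation(raw: str) -> dict:
    """Parse the note as JSON and pick out a few fields.

    The keys below are assumed for illustration; adjust to the real schema.
    """
    data = json.loads(raw)
    return {
        "tool": data.get("tool"),
        "model": data.get("model"),
        "interaction_mode": data.get("interaction_mode"),
    }


if __name__ == "__main__":
    sample = '{"tool": "cursor", "model": "example-model", "interaction_mode": "agent"}'
    print(summarize_attestation(sample))
```

Git does not fetch notes refs by default, so a fresh clone may need an explicit fetch of refs/notes/exceeds-ink before `read_ink_note` returns anything.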

Watch-out: Tools that run always-on daemons or replace the git binary create operational drag and CISO friction that slows deployment. Exceeds Ink’s short-lived hook processes avoid both problems. The Git Note is written before the commit is reachable for push, which removes the race window between commit and attribution that async daemon-based approaches introduce.

For multi-tool environments, where teams use Claude Code for large refactors, Cursor for feature work, Codex for batch transforms, GitHub Copilot for autocomplete, and Windsurf for specialized workflows, client-level capture with per-tool adapters is the only architecture that produces tool-by-tool outcome comparison instead of a single undifferentiated AI aggregate.

Step 4: Run a 30-Day Longitudinal Tracking Workflow

This 30-day sequence instruments your repos, validates coverage, and ends with a concrete report that compares AI-enabled work to your baseline.

Days 1–3: Instrument and baseline. Authorize repo access (GitHub, GitLab, or Azure DevOps). Install Exceeds Ink on developer machines via per-repo opt-in. Wire up adapters for each AI tool in use. Exceeds AI delivers first insights within 60 minutes and completes historical analysis within 4 hours. Record baseline values for all five metrics from the prior 30-day window.

Days 4–7: Validate attribution coverage. Review the Machine Integration Health dashboard to confirm hooks are installed, adapters are wired, and deliveries are succeeding across the fleet. Lines that cannot be confidently attributed are recorded as unknown_lines, not silently rolled into human or AI buckets. A high unknown rate at this stage indicates incomplete adapter coverage, not a measurement failure.
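A quick coverage check is the share of lines that landed in the unknown bucket. This is a sketch under assumptions: `unknown_lines` is the name used above, while `ai_lines` and `human_lines` are hypothetical field names for the other two buckets.

```python
def unknown_line_rate(attestations: list[dict]) -> float:
    """Percentage of all attributed lines that were recorded as unknown.

    Each attestation is assumed to carry integer counts under the keys
    'ai_lines', 'human_lines' (both hypothetical) and 'unknown_lines'.
    """
    total = sum(
        a["ai_lines"] + a["human_lines"] + a["unknown_lines"] for a in attestations
    )
    unknown = sum(a["unknown_lines"] for a in attestations)
    if total == 0:
        return 0.0
    return unknown * 100 / total


if __name__ == "__main__":
    week_one = [
        {"ai_lines": 60, "human_lines": 30, "unknown_lines": 10},
        {"ai_lines": 20, "human_lines": 70, "unknown_lines": 10},
    ]
    print(unknown_line_rate(week_one))
```

Running this per developer machine or per tool, rather than fleet-wide, makes it easier to see which adapter is missing.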

Week 2: First churn signal. AI Code Churn Rate and Rework Velocity become meaningful once the first cohort of AI-attributed commits has aged 7–10 days. Flag any PR where AI-touched lines are already being rewritten and review the interaction-mode classification. Agent-mode sessions without a preceding plan phase are the most common source of early churn.
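The "agent mode without a preceding plan phase" check can be sketched as a simple filter. The session shape here (an `id` plus an ordered `modes` list) is an assumption made for illustration; how interaction modes are actually recorded per session is not specified in this article.

```python
def agent_sessions_without_plan(sessions: list[dict]) -> list[str]:
    """Return IDs of sessions that entered agent mode with no earlier plan phase.

    Each session is assumed to look like:
        {"id": "s1", "modes": ["plan", "agent", "edit"]}
    where 'modes' lists interaction modes in the order they occurred.
    """
    flagged = []
    for session in sessions:
        modes = session["modes"]
        if "agent" not in modes:
            continue
        first_agent = modes.index("agent")
        # Only modes that happened before the first agent phase count.
        if "plan" not in modes[:first_agent]:
            flagged.append(session["id"])
    return flagged


if __name__ == "__main__":
    sessions = [
        {"id": "s1", "modes": ["plan", "agent"]},
        {"id": "s2", "modes": ["agent", "edit"]},
        {"id": "s3", "modes": ["ask"]},
    ]
    print(agent_sessions_without_plan(sessions))
```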

Week 3: Incident correlation. Join Ink attestation data to your incident tracker. Identify whether any open incidents trace to AI-attributed lines from Weeks 1–2. A single data point does not define a trend, but it establishes the join path needed for the 30-day report.
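The join path itself is a lookup from an incident's suspected commit to the set of AI-attributed commits. A minimal sketch, assuming your incident export includes a commit SHA field (called `suspect_commit` here, a hypothetical name; JIRA and Linear do not provide it out of the box):

```python
from collections import defaultdict


def join_incidents_to_ai_commits(
    incidents: list[dict], ai_commit_shas: set[str]
) -> dict[str, list[str]]:
    """Map each AI-attributed commit SHA to the incident IDs that point at it.

    Incidents are assumed to be dicts with 'id' and 'suspect_commit' keys.
    Incidents whose suspect commit is not AI-attributed are ignored.
    """
    matches: dict[str, list[str]] = defaultdict(list)
    for incident in incidents:
        sha = incident.get("suspect_commit")
        if sha in ai_commit_shas:
            matches[sha].append(incident["id"])
    return dict(matches)


if __name__ == "__main__":
    incidents = [
        {"id": "INC-1", "suspect_commit": "abc123"},
        {"id": "INC-2", "suspect_commit": "zzz999"},
    ]
    print(join_incidents_to_ai_commits(incidents, {"abc123", "def456"}))
```

The count of matched incidents becomes the numerator of the 30-Day AI-Touched Incident Rate once the window closes.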

Week 4: 30-day report assembly. Pull all five metrics against their baselines. The Exceeds AI platform produces executive-ready ROI reports that connect AI adoption to productivity and quality outcomes across every tool. Distribute the ink-prompting-coach skill to teams where agent-mode churn exceeds the critical threshold. It installs directly into Claude Code or Cursor as a SKILL.md and slash command, so coaching appears where the work happens.

Exceeds AI Impact Report with Exceeds Assistant providing custom insights
Exceeds AI Impact Report with PR and commit-level insights

Troubleshooting: If AI Code Churn Rate is elevated but the 30-day incident rate is within range, the likely cause is over-reliance on agent mode for tasks better suited to edit or ask mode. Ink’s interaction-mode classification (plan, ask, agent, edit, headless) surfaces this pattern directly. If both metrics are elevated, prioritize architectural lint rule enforcement before the next sprint cycle.

Step 5: Avoid Common AI Technical Debt Traps

False-positive attribution. Multi-edit Cursor sessions can attribute human-typed lines to Cursor if checkpoint materializers are not resolving against the actual working tree at commit finalization. Exceeds Ink’s per-tool materializers handle this by design. Human-typed lines within a Cursor session are retained as human, not overwritten by the AI attribution. Verify this behavior by inspecting the unknown_lines count in the first week’s attestations.

Multi-tool blind spots. Stack Overflow’s 2025 Developer Survey found that 66% of developers cited AI solutions that are almost right but not quite as their biggest frustration, and 45% said debugging AI-generated code is more time-consuming than writing it themselves. Teams that use multiple tools without per-tool attribution cannot identify which tool is the source of elevated churn. Aggregate AI churn rates mask tool-specific quality differences that require tool-specific interventions.

Metric gaming. Once churn rate becomes a tracked metric, developers may delay follow-on edits past the 30-day window rather than fix issues promptly. Pairing the 30-day window with a 90-day window exposes this behavior. If developers game the metric by delaying fixes, the 90-day churn rate will climb even as the 30-day rate stays flat. A gap between 30-day and 90-day rates that widens over successive measurement cycles signals delayed rework, not improved quality.
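The widening-gap signal can be checked mechanically. The sketch below assumes you have one 30-day and one 90-day churn rate per measurement cycle, oldest first; the function name and the strict "every cycle" rule are illustrative choices, not a prescribed method.

```python
def churn_gap_widening(rates_30d: list[float], rates_90d: list[float]) -> bool:
    """Return True if the (90-day minus 30-day) churn gap grows in every successive cycle.

    Both lists hold one rate per measurement cycle, in chronological order.
    At least two cycles are needed to say anything about a trend.
    """
    if len(rates_30d) != len(rates_90d):
        raise ValueError("need one 30-day and one 90-day rate per cycle")
    gaps = [r90 - r30 for r30, r90 in zip(rates_30d, rates_90d)]
    if len(gaps) < 2:
        return False
    return all(later > earlier for earlier, later in zip(gaps, gaps[1:]))


if __name__ == "__main__":
    # 30-day rate stays flat while the 90-day rate climbs: likely delayed rework.
    print(churn_gap_widening([10, 10, 10], [12, 15, 19]))
```

A looser rule, such as comparing only the first and last cycle, may suit noisy data better than requiring growth in every cycle.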

Mistake to avoid: Treating AI Code Churn Rate as the sole quality signal. A 2025 METR randomized controlled trial found that developers using AI tools believed they were working 20% faster while objective measurements showed a 19% slowdown. Self-reported productivity and churn rate can both look acceptable while incident rate and architecture drift accumulate. All five metrics must be tracked together.

Validate Success: What You Can See After 30 Days

After 30 days of longitudinal tracking with line-level provenance, the following outcomes are measurable:

  • AI Code Churn Rate segmented by tool (Claude Code, Cursor, Codex, Copilot, Windsurf) with trend direction established
  • 30-day AI-touched incident rate compared to human-only baseline, with specific PRs identified as outliers
  • Interaction-mode distribution across the team, with agent-mode-without-plan sessions flagged for coaching
  • SATD detection rate by tool and team, enabling targeted architectural review prioritization
  • An executive-ready report that connects AI adoption to productivity and quality outcomes with commit-level citations

Organizations with structured measurement programs capture more value from AI tools than those without. The 30-day workflow above provides that structure.

Connect your repo, start your free pilot, and have your first executive-ready AI technical debt report within weeks.

Advanced Considerations: Scaling and Governance

Policy enforcement. Because Exceeds Ink’s attestation is structured JSON in the repository, it is a natural input to policy engines. Engineering organizations can express rules such as blocking deploys when AI authorship exceeds a defined threshold in sensitive paths, or requiring additional review on commits where agent-mode produced more than a specified percentage of the diff. These policies are expressible because the attestation is auditable and machine-readable, not a proprietary cloud-only record.
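A deploy gate of the first kind could look like the sketch below. It assumes you have already reduced the attestations to a per-file AI-authored share between 0 and 1; the function name, the glob patterns, and the threshold are all hypothetical, and a real policy engine would replace this hand-rolled check.

```python
from fnmatch import fnmatchcase


def sensitive_paths_over_threshold(
    file_ai_share: dict[str, float],
    sensitive_globs: list[str],
    max_ai_share: float,
) -> list[str]:
    """Return sensitive paths whose AI-authored share exceeds max_ai_share.

    A non-empty result means the deploy should be blocked under this policy.
    """
    return sorted(
        path
        for path, share in file_ai_share.items()
        if share > max_ai_share
        and any(fnmatchcase(path, pattern) for pattern in sensitive_globs)
    )


if __name__ == "__main__":
    shares = {
        "payments/ledger.py": 0.8,
        "payments/utils.py": 0.2,
        "docs/readme.md": 0.9,
    }
    violations = sensitive_paths_over_threshold(shares, ["payments/*"], 0.5)
    print("BLOCK" if violations else "ALLOW", violations)
```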

Skill transfer. When one team’s AI-coding patterns produce churn rates below the healthy threshold, those patterns are distributable. Exceeds AI’s Best Practices Insights pipeline distills the top three skills worth scaling across the organization, sorted by confidence. The ink-prompting-coach skill delivers those patterns into the developer’s own Claude Code or Cursor agent as a versioned SKILL.md, with rollback available if adoption does not land as expected.

Board reporting. McKinsey Digital’s synthesis of survey data from 50 CIOs estimated that technical debt amounts to 20–40% of the value of entire technology estates before depreciation, and companies in the 80th percentile for Tech Debt Score had 20% higher revenue growth than those in the bottom 20th percentile. Framing AI technical debt in revenue-impact terms, using the five metrics above as inputs to a maintenance cost model, converts a technical measurement exercise into a board-level risk and investment narrative.

Exceeds AI Repo Leaderboard shows top contributing engineers with trends for AI lift and quality

Governance for regulated environments. Exceeds Ink’s HMAC-SHA256-signed remote ingest, LLM-based prompt redaction, aggregate-only mode, and self-host option address the four most common enterprise security objections. Privacy is configurable along four rungs – local only, aggregate only, abstracted replay, and full identified replay – with different teams in the same organization able to run at different rungs. SOC 2 Type II compliance is in progress.
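For readers unfamiliar with HMAC-SHA256 signing, the general technique looks like the sketch below, written with Python's standard library. This is not Exceeds Ink's implementation: how the secret is provisioned, what bytes are signed, and where the signature is carried are not described in this article and are assumptions here.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, payload: bytes) -> str:
    """Return the hex-encoded HMAC-SHA256 of `payload` under `secret`."""
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


def verify_payload(secret: bytes, payload: bytes, signature: str) -> bool:
    """Recompute the signature and compare in constant time."""
    expected = sign_payload(secret, payload)
    return hmac.compare_digest(expected, signature)


if __name__ == "__main__":
    secret = b"example-shared-secret"      # hypothetical; never hard-code real secrets
    body = b'{"commit": "abc123"}'         # hypothetical payload
    sig = sign_payload(secret, body)
    print(verify_payload(secret, body, sig))
    print(verify_payload(secret, body + b" ", sig))
```

The receiving side rejects any payload whose recomputed signature does not match, which guards against tampering in transit by anyone who lacks the shared secret.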

Frequently Asked Questions

What is AI technical debt and how is it different from traditional technical debt?

Traditional technical debt accumulates from deliberate shortcuts, deferred refactoring, and aging dependencies. AI technical debt is the measurable cost of rework and incidents that emerge 30–90 days after AI-generated code passes review. Unlike traditional technical debt, it is attributable to specific tools and interaction modes, which makes it both measurable and correctable. See Step 1 for the full definition and how it differs from traditional debt.

Why can't existing developer analytics tools like Jellyfish, LinearB, or Swarmia track AI technical debt?

These platforms were built for the pre-AI era and operate on metadata such as PR cycle times, commit volumes, review latency, and deployment frequency. They cannot distinguish which specific lines are AI-generated versus human-authored, which means they cannot attribute outcomes such as incident rates, rework, and architecture drift to AI usage. A tool that sees only that PR #1523 merged in four hours with 847 lines changed cannot tell you that 623 of those lines were Cursor-generated, that those lines required additional review iterations, or that the AI-touched module had a 2x higher incident rate 30 days later. That connection requires repo access and line-level provenance. Without it, AI technical debt remains invisible in the dashboard even as it accumulates in production.

How does multi-tool AI usage complicate technical debt tracking, and what does a solution need to handle it?

Engineering teams in 2026 routinely use Claude Code for large-scale refactoring, Cursor for feature development, Codex for batch transforms, GitHub Copilot for autocomplete, and Windsurf for specialized workflows, often within the same sprint. Aggregate AI churn rates that do not segment by tool mask the fact that one tool may be producing healthy churn rates while another is producing critical ones. A solution must have per-tool adapters that capture attribution at the moment of commit, not after the fact via heuristics. It must also handle interaction-mode differences, because an agent-mode Claude Code session that rewrites an entire module carries different debt risk than a Copilot autocomplete acceptance. Exceeds Ink’s per-tool checkpoint materializers for Claude Code, Cursor, and Codex, combined with interaction-mode classification, provide the architecture that makes tool-by-tool debt segmentation possible.

What does a healthy AI technical debt baseline look like after 30 days of tracking?

After 30 days of line-level longitudinal tracking, a healthy baseline shows AI Code Churn Rate below 12% at the 30-day window, AI-touched incident rate within 1.2x of the human-only baseline, rework velocity below 1.5x the human baseline, SATD detection rate below 8% of AI-attributed commits, and architecture drift score below 10% of AI-attributed files introduced. Teams that exceed the critical threshold on two or more metrics simultaneously are accumulating AI technical debt faster than they are shipping durable value. The corrective action is interaction-mode coaching, specifically reducing agent-mode sessions that lack a preceding plan phase, rather than reducing AI adoption overall. The goal is higher-quality AI usage, not less of it.
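The "two or more critical metrics" rule can be sketched as a threshold lookup. The critical values below are the ones stated in Step 2; AI Code Churn Rate is left out because this article gives it a healthy threshold but no numeric critical one. The dictionary keys and function names are illustrative.

```python
# Critical thresholds as stated in Step 2. Ratios are AI value / human baseline.
CRITICAL_ABOVE = {
    "satd_rate_pct": 20.0,      # SATD Detection Rate, percent of AI commits
    "drift_score_pct": 30.0,    # Architecture Drift Score, percent of AI files
    "incident_ratio": 2.0,      # AI-touched incident rate vs human-only rate
    "rework_ratio": 2.5,        # AI rework velocity vs human baseline
}


def critical_metrics(values: dict[str, float]) -> list[str]:
    """Return the names of metrics whose value exceeds its critical threshold."""
    return sorted(
        name
        for name, threshold in CRITICAL_ABOVE.items()
        if values.get(name, 0.0) > threshold
    )


def accumulating_debt(values: dict[str, float]) -> bool:
    """True when two or more metrics are past their critical threshold at once."""
    return len(critical_metrics(values)) >= 2


if __name__ == "__main__":
    team = {
        "satd_rate_pct": 5.0,
        "drift_score_pct": 9.0,
        "incident_ratio": 2.3,
        "rework_ratio": 2.6,
    }
    print(critical_metrics(team), accumulating_debt(team))
```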

Conclusion: Turn AI Technical Debt Tracking into Advantage

AI technical debt is not a reason to slow AI adoption. It is a measurement problem. Teams that instrument line-level provenance across their full AI toolchain, track the five core metrics over a 30-day longitudinal window, and distribute coaching based on interaction-mode data compound the productivity gains from AI while preventing the quality erosion that undermines them.

The workflow above is repeatable. The metrics are calculable. The attribution is auditable. Higher AI enablement can correlate with better code maintainability, change confidence, and reduced time loss. Those outcomes require measurement infrastructure to achieve and sustain.

Exceeds AI, powered by Exceeds Ink, is the only platform that connects every AI-touched line to its tool, model, session, and long-term outcomes across Claude Code, Cursor, Codex, GitHub Copilot, and Windsurf. It uses lightweight Git hooks instead of always-on processes or binary replacements, and it does not lock your provenance data inside a proprietary cloud. Setup takes hours. First insights arrive in minutes. Executive-ready reports follow within weeks.

Start your free pilot and gain the competitive advantage before the next sprint cycle ends.
