Can AI triage handle project-specific jargon accurately?

AI triage models require fine-tuning on project-specific data to capture local severity conventions, component naming, and ownership structures. Studies report that domain-adapted variants like seBERT outperform general-purpose BERT on software engineering tasks, so a single pretrained model rarely serves every project in a heterogeneous organization without per-project adaptation.

What accuracy should teams expect from automated bug assignment?

Accuracy varies with the number of potential assignees. The Apache study shows a mean of 62% (46% for large assignee pools, up to 80% for smaller ones), while Microsoft's TRIANGLE reached 90% for one team through team-specialized agents. Ranked top-5 lists score higher than single-assignment accuracy because the correct assignee only needs to appear among five candidates; COMET, for example, reports ACC@1 of 0.61 and ACC@5 of 0.88.

How does AI triage handle class imbalance in bug databases?

Real bug trackers contain far more low-severity reports than critical defects, biasing classifiers toward majority classes. Reviews commonly recommend SMOTE or other oversampling, class-weighted loss functions, and ensemble methods; without them, AI triage underperforms on the high-severity minority cases that matter most.
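
As a rough illustration of class-weighted loss, the sketch below computes inverse-frequency weights with the formula n_samples / (n_classes * class_count), the same heuristic scikit-learn applies for class_weight="balanced". The severity counts are hypothetical, not taken from any study cited here.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical severity labels from a tracker export (assumption for illustration).
labels = ["Minor"] * 80 + ["Major"] * 15 + ["Critical"] * 5
weights = balanced_class_weights(labels)

for cls, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{cls}: {weight:.2f}")
```

Rare classes receive the largest weights, so a misclassified Critical report costs the model more during training than a misclassified Minor one.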

Does Cosmos replace Jira or Linear for bug tracking?

No. Cosmos operates on top of existing issue trackers, so teams keep their current ticketing system. MCP connects it to established workflows, which cuts rewiring work rather than forcing a new tool.

What is the minimum data needed to start AI-assisted triage?

GitHub's native AI triage analyzes new or existing issue content to provide suggestions. Teams with sparse or inconsistently labeled issue history face a data quality problem before a model quality problem. Audit label consistency and define a clear taxonomy (bug, feature, question, severity tiers, component labels) before training any model.
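
A label audit can start as a small script. The sketch below is one possible approach, assuming issues have been exported as dictionaries; the TAXONOMY values and field names are hypothetical placeholders for a project's own conventions.

```python
from collections import Counter

# Hypothetical taxonomy; a real project would define its own tiers and fields.
TAXONOMY = {
    "type": {"bug", "feature", "question"},
    "severity": {"critical", "major", "minor"},
}


def audit_labels(issues):
    """Count issues with missing or unrecognized labels per taxonomy field."""
    problems = Counter()
    for issue in issues:
        for field, allowed in TAXONOMY.items():
            value = issue.get(field)
            if value is None:
                problems[f"{field}:missing"] += 1
            elif value.strip().lower() not in allowed:
                problems[f"{field}:unknown"] += 1
    return problems


issues = [
    {"type": "bug", "severity": "Critical"},
    {"type": "Bug", "severity": None},
    {"type": "defect", "severity": "minor"},
    {"type": "question"},
]
report = audit_labels(issues)
print(dict(report))
```

High counts of missing or unknown values suggest fixing the issue templates before training anything on the history.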

AI-Assisted Bug Triage: Sort Defects in Minutes

AI-assisted bug triage classifies, prioritizes, deduplicates, and routes incoming defect reports faster than manual triage. Machine learning models apply those steps to ticket text, metadata, and historical issue data.

TL;DR

Manual bug triage consumes engineering hours, misroutes tickets, and scales poorly. AI classifiers trained on historical bug data can reduce that intake burden, with independent studies reporting F1 scores near 78% on bug classification and higher accuracy on narrower tasks like severity prediction.

How Modern Teams Speed Up Bug Triage

A senior engineer opens the bug tracker Monday morning to find 47 new reports, including twelve duplicates, eight feature requests miscategorized as bugs, and three Critical defects hidden under Minor labels. Two hours later, the engineer has finished triage, but the best coding hours of the day are gone.

Microsoft Research has studied high-severity production incidents, including their triage and resolution stages. The operational problem looks the same across engineering organizations: triage mistakes do not stay at intake. They cascade into slower mitigation, reassignment cycles, and avoidable engineering overhead.

Tooling addresses the same intake problem. Augment Cosmos is a unified cloud agents platform that runs agents across the software development lifecycle, with shared context and memory that compound across a team. Applied to bug triage, that means coordinating intake through specialized Experts, event-driven triggers, and human review on a single platform rather than a patchwork of scripts. Cosmos runs on Augment Code's Context Engine, which processes entire codebases across 400,000+ files through semantic dependency graph analysis, so triage agents reason over real codebase relationships rather than isolated ticket text.

Why Manual Bug Triage Breaks Down at Scale

Manual bug triage breaks down at scale because misrouting, duplicates, inconsistent issue data, and staffing overhead compound as teams and codebases grow.

Misrouting and Reassignment Cycles

Bug misrouting creates reassignment cycles because each newly assigned team repeats the same investigative steps. The Vista study describes bugs that bounce between teams through repeated reassignments. Teams sometimes send bugs back to the opener or opener's team for more information. Every reassignment wastes the full investigation time of the previous assignee.

Duplicate Report Processing

Duplicate report processing increases manual triage load because multiple reports can describe the same defect with different vocabulary. Engineers then inspect redundant issues before they can focus on unique defects. A 2025 review describes issue intake and preprocessing as including data deduplication aimed at preventing redundant investigations. The University of Alberta describes issue trackers as "frequently full of duplicate issues and bugs." One way to shrink the duplicate load is to catch defects earlier, before they become separate tickets, using context-aware detection tools that flag related issues at the source.

Inconsistent Incoming Data Quality

Inconsistent incoming bug data slows triage because minimal or uneven issue templates leave classifiers and human reviewers with incomplete signals. A 2025 study identifies a structural upstream problem: issue templates and required metadata are minimal or inconsistently used. Consequences include duplicate reports, slower resolution, and higher manual effort. The problem scales directly with contributor count.

Dedicated Staffing Requirements

Dedicated staffing becomes necessary when manual triage volume exceeds what ad hoc engineer attention can absorb. The Atlassian handbook describes bug triage as a process for identifying, categorizing, prioritizing, assigning, and resolving bugs. Atlassian assigns engineers to bug fixing on a rotating basis, and SLOs govern bug fix work.

Failure Mode	Source	Scale Effect
Reassignment cycles	Microsoft Research (Vista study)	Bugs bounce between teams through repeated reassignments
Inaccurate triage	Microsoft Research, ISSRE 2024	Repeated escalations and delays attributed to triage failures
Duplicate redundancy	Empirical literature	Grows with user base
Dedicated staffing	Atlassian	Engineers rotate through bug-fixing duty as standard operational practice

These patterns explain why teams look for automation at intake rather than waiting until defects reach the wrong queue.

The Four AI Techniques That Replace Manual Triage Steps

AI-assisted bug triage replaces manual intake work with four techniques. Classification sorts incoming reports, severity prediction assigns priority, duplicate detection removes redundant tickets, and automated assignment routes each report to an owner. Each one improves routing accuracy and triage speed when teams train and deploy it against the right workflow. An ACM review screened 1,825 papers on learning from software bug reports. The review extracted 204 papers as most relevant, which points to a large research area.

Technique 1: NLP Classification for Bug Categorization

NLP classification automates bug categorization by converting report text into model-readable features. Classifiers determine whether an incoming report is a valid bug, feature request, or user error, then identify the component or subsystem involved. They use TF-IDF sparse vectors or transformer-based contextual embeddings such as BERT, seBERT, and CodeBERT to feed models like SVM, Decision Trees, or Random Forests.

A 2025 classification study found TF-IDF combined with Decision Tree on bug titles achieved an F1 score of 78%. The seBERT model, pretrained on 119.7 GB of software engineering text, scored 77% with SVM. Empirical studies report that using bug report titles or full descriptions yields no significant difference in classification performance.

Technique 2: Severity and Priority Prediction

Severity prediction automates priority assignment by learning a project's severity patterns from historical bugs, then classifying incoming reports as Critical, Major, or Minor. Triagers no longer assess every report from scratch. Mashhadi et al. (2023) showed fine-tuned CodeBERT improves severity prediction results by 29-140% across evaluation metrics. The comparison used classic ML models as the baseline. An industrial study on severity classification found SVM at 91.45% accuracy, with ensemble methods improving further.

Severity prediction models need project-specific fine-tuning because severity conventions and jargon do not transfer cleanly between projects.

Technique 3: Duplicate Bug Report Detection

Duplicate bug detection reduces redundant investigation by matching semantically similar reports before separate teams analyze the same defect independently. Duplicate bug detection techniques fall into three generations.

IR-based methods such as TF-IDF plus cosine similarity are fast, but they fail on paraphrasing.
Siamese networks train paired neural networks to recognize semantic similarity regardless of vocabulary.
Transformer-based approaches include SBERT and SiameseQAT.

These methods reflect a shift from surface text matching toward semantic matching.
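
To make the limitation of the first generation concrete, here is a minimal pure-Python sketch of TF-IDF plus cosine similarity. The three report titles are invented for illustration, and the smoothed IDF formula is one common variant rather than the method of any cited study.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (term -> weight) for a small corpus."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF keeps weights positive even for very common terms.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(a, b):
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Invented reports: the second shares vocabulary with the first, the third paraphrases it.
reports = [
    "app crashes when saving file",
    "app crashes when saving a file to disk",
    "program terminates unexpectedly on document export",
]
vectors = tfidf_vectors(reports)
near_duplicate = cosine(vectors[0], vectors[1])
paraphrase = cosine(vectors[0], vectors[2])
print(f"shared wording: {near_duplicate:.2f}, paraphrase: {paraphrase:.2f}")
```

The paraphrased report shares no tokens with the first one, so its similarity is zero even though a human might read both as the same defect. Embedding-based methods target exactly that gap.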

On the GitBugs benchmark, reported Recall@10 was 0.61, meaning 61% of known duplicates appeared in the top-10 candidates. Many duplicates are "linguistically subtle" and describe the same defect with different vocabulary, which is why surface similarity approaches miss them. Some teams cut duplicate volume upstream by screening changes during review, a workflow several open-source code review tools now automate.

Technique 4: Automated Bug Assignment

Automated bug assignment speeds routing by recommending the developer or team most likely to resolve a report, which reduces manual ownership lookup and reassignment. It commonly uses two paradigms.

Learning-based classification treats each developer or team as a class label trained on historical assignment patterns.
IR-based expertise profiling builds each developer's profile from past resolved bugs and matches that profile against new reports.

Both approaches depend on historical assignment data, so teams need confidence thresholds when the candidate pool grows.
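
The sketch below illustrates the expertise-profiling paradigm with a confidence threshold. It is a deliberately simple term-overlap scorer, not the approach of any system named in this article; the developer names, history, and min_score value are all hypothetical.

```python
from collections import Counter


def build_profiles(resolved):
    """Map each developer to term counts from the bugs they resolved."""
    profiles = {}
    for developer, text in resolved:
        profiles.setdefault(developer, Counter()).update(text.lower().split())
    return profiles


def rank_assignees(profiles, report, min_score=0.3):
    """Rank developers by the share of report terms found in their profile."""
    terms = set(report.lower().split())
    if not terms:
        return []
    scores = {
        dev: sum(1 for term in terms if term in profile) / len(terms)
        for dev, profile in profiles.items()
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    # Candidates below the threshold are dropped; an empty list means "ask a human".
    return [(dev, score) for dev, score in ranked if score >= min_score]


# Hypothetical resolution history (assumption for illustration).
history = [
    ("alice", "login token expired error"),
    ("alice", "oauth login redirect loop"),
    ("bob", "invoice pdf rendering error"),
]
profiles = build_profiles(history)
suggestions = rank_assignees(profiles, "login redirect error")
unknown = rank_assignees(profiles, "kernel panic on boot")
print(suggestions, unknown)
```

Returning a ranked list rather than a single name leaves room for a triager to pick among candidates, and the empty result for an unfamiliar report keeps low-confidence guesses out of the queue.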

Because Cosmos runs on the Context Engine, its triage agents can weigh code ownership and dependency signals across hundreds of thousands of files when suggesting an assignee, going beyond the words in a single ticket.

Technique	Triage Step	Best Documented Accuracy	Key Constraint
NLP classification	Bug vs. feature filtering	F1: 78% (2025 study, TF-IDF + Decision Tree on titles)	Bug titles alone were sufficient in one study; full descriptions showed no significant difference
Severity prediction	Priority assignment	91.45% accuracy in one bug-severity classification study using SVM	Severity conventions vary by project, so accuracy may not transfer across organizations
Duplicate detection	Deduplication	SBERT is used as a strong retrieval baseline in duplicate bug detection, but subtle duplicates remain hard	Surface similarity misses paraphrased duplicates
Auto-routing	Developer/team assignment	roughly 46-80% reported in prior studies, with some studies reaching 50-90% using large training sets	Accuracy may depend on factors such as training data and project context

Together, these four techniques cover the manual decisions that slow intake. They classify each report, gauge its urgency, flag duplicates, and route it to an owner.

Production Deployments With Measured Outcomes

Production AI bug triage deployments use models to suggest owners, shorten triage, and route incidents faster. Deployment examples document measured outcomes across assignment accuracy, triage time, mitigation speed, and investigation workflow.

Microsoft COMET and TRIANGLE

Microsoft's COMET and TRIANGLE systems show how production triage improves assignment accuracy and mitigation speed through ranked resolver suggestions and team-specialized agents. IEEE ISSRE 2024 published Microsoft's COMET system, one of the more thoroughly quantified production triage deployments in the literature. COMET achieved ACC@1 (top-1 resolver accuracy) of 0.61 and ACC@5 of 0.88, with a 35% reduction in Time to Mitigation (TTM). Because the correct resolver appears in the top-5 candidates 88% of the time, engineers can select from ranked suggestions instead of relying on perfect single-assignment accuracy.
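
ACC@k is simple to compute once ranked suggestions and the eventual resolver are logged. The following is a minimal sketch of the metric with made-up team names and data; it is not COMET's evaluation code.

```python
def acc_at_k(ranked_lists, truths, k):
    """Fraction of incidents whose true resolver appears in the top-k suggestions."""
    hits = sum(
        1 for ranked, truth in zip(ranked_lists, truths) if truth in ranked[:k]
    )
    return hits / len(truths)


# Hypothetical ranked suggestions and the teams that actually resolved each incident.
predictions = [
    ["team-a", "team-b", "team-c"],
    ["team-b", "team-a", "team-c"],
    ["team-c", "team-b", "team-a"],
    ["team-a", "team-c", "team-b"],
]
actual = ["team-a", "team-a", "team-a", "team-b"]

for k in (1, 2, 3):
    print(f"ACC@{k} = {acc_at_k(predictions, actual, k):.2f}")
```

Because a top-k list always contains the top-1 suggestion, ACC@k can only stay flat or rise as k grows, which is why ranked lists look stronger than single assignments on the same model.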

Microsoft's TRIANGLE system is a multi-LLM-agent architecture for incident triage in production Azure incident-management workflows. Azure reported that TRIANGLE reached 90% accuracy in incident assignment and a 38% TTM reduction for one team, results it attributes to team-specialized Local Triage agents. As of January 2025, 6 teams were in production and 15+ were onboarding.

Meta's Investigation Compression

Meta automates investigation so engineers spend less time reconstructing what went wrong. Its HawkEye system applies AI to debugging, using structured decision trees that let non-experts triage complex issues with little help from senior engineers. The same approach reaches beyond triage. Meta's newer capacity-efficiency agents compress roughly 10 hours of manual investigation into about 30 minutes for performance regressions, generating structured diagnoses engineers can review quickly. That compression of expert effort also appears in AI code review tools, which summarize large changes so reviewers spend less time reconstructing context.

Organization	System	Outcome	Source
Microsoft	COMET	35% Time to Mitigation (TTM) reduction	IEEE ISSRE 2024
Microsoft/Azure	TRIANGLE	90% accuracy, 38% TTM reduction	Azure AIOps blog
Meta	HawkEye	Non-experts triage complex issues with minimal coordination and assistance	Meta Engineering

Microsoft and Meta built these triage systems in-house. Cosmos brings the same patterns, ranked resolver suggestions and team-specialized triage Experts, to your existing trackers without the multi-year build.

Implementing AI Triage: A Phased Rollout

AI-assisted triage implementation succeeds when teams roll out automation in phases, using human review, confidence thresholds, and feedback loops before full automation. Official documentation from GitHub and Atlassian discusses automation setup and incremental rule-building.

A phased rollout works best when teams progress through this sequence:

Deploy the AI classifier in shadow mode and measure acceptance and override rates.
Pre-fill severity, component, and owner fields during assisted triage while engineers confirm or override each suggestion.
Auto-apply classification and routing only above validated confidence thresholds, and send uncertain reports to human review.
Expand autonomy only after audit trails, feedback loops, and monitoring are in place.

The phases below keep automation limited until the team has evidence that suggestions are accurate enough to act on.

Phase 1: Shadow Mode

Shadow mode establishes a safe baseline. The AI classifier runs in read-only mode, surfacing suggestions to engineers without auto-applying any action, while override and acceptance tracking runs from day one. The acceptance rate it produces becomes the gate for advancing to Phase 2.
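
One way to express the Phase 2 gate is as a small check over logged suggestions. This sketch assumes each shadow-mode record stores the suggested label and the label the human finally applied; the ShadowRecord type and the min_rate and min_samples defaults are hypothetical and should be set per team.

```python
from dataclasses import dataclass


@dataclass
class ShadowRecord:
    suggested: str  # label the classifier proposed
    final: str      # label the human triager actually applied


def acceptance_rate(records):
    """Share of suggestions that matched the human decision."""
    if not records:
        return 0.0
    accepted = sum(1 for record in records if record.suggested == record.final)
    return accepted / len(records)


def ready_for_phase_2(records, min_rate=0.8, min_samples=200):
    """Gate on both enough evidence and a high enough acceptance rate."""
    return len(records) >= min_samples and acceptance_rate(records) >= min_rate


records = [
    ShadowRecord("bug", "bug"),
    ShadowRecord("feature", "feature"),
    ShadowRecord("bug", "question"),  # an override
    ShadowRecord("bug", "bug"),
]
print(acceptance_rate(records), ready_for_phase_2(records))
```

Requiring a minimum sample count keeps a lucky first week from promoting the classifier too early.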

When using Augment Cosmos, teams can review and replay Sessions before enabling automated action. Those auditable runs create promotion criteria for suggestion-first triage.

Phase 2: Assisted Triage

Assisted triage keeps humans in the loop. AI pre-fills severity, component, and owner fields, and engineers confirm or override each suggestion. Every override becomes labeled training data, feeding a retraining cadence that sharpens later predictions.

Augment Cosmos keeps reviewer corrections in tenant memory, so a fix applied once carries into later runs instead of staying in one engineer's local process.

Phase 3: Selective Automation

Selective automation acts only above a validated confidence threshold. Reports above it receive auto-applied classification and routing, while reports below it go to a human review queue with pre-loaded AI suggestions and drift-detection alerts. Ericsson's confidence-gated system shows the trade-off. The system abstains rather than guesses, accepting lower automation coverage in exchange for higher accuracy on the cases it does handle.
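
The coverage-versus-accuracy trade-off can be measured on a held-out set before choosing a threshold. The sketch below uses invented confidence scores and is not a description of Ericsson's system; the 0.9 default threshold is an assumption.

```python
def route(prediction, confidence, threshold=0.9):
    """Auto-apply confident predictions; send the rest to human review."""
    if confidence >= threshold:
        return {"action": "auto_apply", "label": prediction}
    return {"action": "human_review", "suggestion": prediction}


def coverage_and_accuracy(scored, threshold):
    """scored holds (confidence, was_correct) pairs from a validation set."""
    handled = [correct for confidence, correct in scored if confidence >= threshold]
    coverage = len(handled) / len(scored)
    accuracy = sum(handled) / len(handled) if handled else None
    return coverage, accuracy


# Invented validation results (assumption for illustration).
scored = [
    (0.95, True), (0.92, True), (0.85, True),
    (0.70, False), (0.60, True), (0.55, False),
]
strict = coverage_and_accuracy(scored, threshold=0.9)
loose = coverage_and_accuracy(scored, threshold=0.5)
print("strict:", strict, "loose:", loose)
print(route("Critical", 0.72))
```

In this toy data the strict threshold automates a third of reports with no errors, while the loose threshold automates everything at lower accuracy. Sweeping the threshold on real validation data shows where the team's acceptable error rate sits.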

In Augment Cosmos, policy-controlled Experts and human-in-the-loop controls route uncertain cases to human review rather than auto-applying low-confidence actions.

Phase 4: Expanded Autonomy

Expanded autonomy broadens automation only after monitoring, evaluation, and user feedback are in place. Auditability remains necessary as model scope grows: broader automation should include full audit trails, while feedback loops continuously capture overrides as training data. Anthropic's guidance on evals recommends four feedback mechanisms at this stage.

Automated evaluations in CI/CD
Production monitoring for drift
A/B testing for model changes
Ongoing user feedback with transcript review

When using Augment Cosmos for repeated intake workflows, shared Sessions, reusable Experts, and event-driven execution keep repeated triage runs visible across the software development lifecycle.

Avoiding Common Implementation Pitfalls

Common AI triage implementation pitfalls reduce trust and accuracy when teams automate too early, ignore drift, hide decision logic, or treat deployment as a one-time setup. Each pitfall weakens the feedback loops that make assisted triage reliable in production.

Automating before validating reduces trust. Enabling auto-action before measuring suggestion accuracy removes the only signal that reveals when the model is wrong, which is consistent with GitHub and Atlassian shipping their native AI triage as suggestion-based tools rather than auto-applying systems.

Ignoring model drift reduces triage accuracy because product architecture changes alter issue vocabulary and team structure changes alter historical assignment patterns. MLOps principles commonly present continuous retraining as an operational practice, and significant changes in data or business context should trigger explicit model re-evaluation.
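
Vocabulary drift can be approximated cheaply by tracking how many tokens in recent reports the model never saw during training. This is one hypothetical signal among many, not a standard named in the MLOps literature cited here; the vocabulary, reports, and 0.25 alert level are illustrative assumptions.

```python
def oov_rate(training_vocab, recent_reports):
    """Share of tokens in recent reports that are absent from the training vocabulary."""
    tokens = [token for report in recent_reports for token in report.lower().split()]
    if not tokens:
        return 0.0
    unseen = sum(1 for token in tokens if token not in training_vocab)
    return unseen / len(tokens)


# Hypothetical training vocabulary and recent issue titles.
training_vocab = {"login", "error", "timeout", "api"}
recent_reports = ["login error", "checkout webhook timeout"]

rate = oov_rate(training_vocab, recent_reports)
needs_reevaluation = rate > 0.25  # alert level is an assumption, tune per project
print(f"out-of-vocabulary rate: {rate:.2f}, re-evaluate: {needs_reevaluation}")
```

A rising rate after a product launch or reorganization is a prompt to re-evaluate the model on fresh labels, not proof on its own that accuracy has dropped.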

Black-box triage decisions reduce trust because engineers who cannot see why an issue was classified or routed a certain way cannot correct the model meaningfully and will stop trusting the system. In Augment Cosmos, risk-triage workflows auto-approve low-risk changes and route higher-risk ones to collaborative review, keeping the decision logic visible. Pairing learned triage with deterministic checks is its own design question, the same trade-off that comparisons of AI review versus static analysis weigh.

A one-time setup mindset degrades long-term triage quality because models need ongoing ownership and performance management. Assign explicit ownership of the triage model's ongoing performance. Treat the model as a service with availability and accuracy expectations.

How Triage Quality Connects to DORA Metrics

Triage quality affects DORA metrics because bug intake decisions show up downstream in delivery stability. Bugs misclassified during intake create unplanned work later in the delivery process.

DORA introduced deployment rework rate in its 2024 expansion from four to five metrics. The metric measures the percentage of deployments that are unplanned work to fix bugs. Bugs with the wrong classification, priority, owner, or sprint target can generate unplanned rework, which DORA's stability factor now measures explicitly.

Triage quality signal	DORA connection	Downstream effect
Wrong priority	Deployment rework rate	Unplanned work to fix bugs
Incorrect owner	Delivery stability	Reassignment cycles
Duplicate handling	Delivery stability	Redundant investigation
Triage time	Delivery process	Slower intake and resolution

The 2025 DORA report surveyed nearly 5,000 technology professionals. It found a notable tension: AI adoption improves software delivery throughput while maintaining a negative relationship with software delivery stability.

Teams evaluating AI triage programs should baseline deployment rework rate before implementation and track it as the primary outcome metric, alongside the broader engineering velocity metrics that frame overall delivery health.
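
A baseline can be computed from deployment records once each one is tagged as planned or unplanned bug-fix work. The sketch below follows the definition given above; the unplanned_bug_fix field name and the sample data are hypothetical.

```python
def deployment_rework_rate(deployments):
    """Percentage of deployments that were unplanned work to fix bugs."""
    if not deployments:
        return 0.0
    rework = sum(1 for deployment in deployments if deployment["unplanned_bug_fix"])
    return 100.0 * rework / len(deployments)


# Hypothetical quarter: 20 deployments, 3 of them unplanned bug fixes.
baseline = [{"unplanned_bug_fix": i < 3} for i in range(20)]
print(f"baseline rework rate: {deployment_rework_rate(baseline):.1f}%")
```

Recomputing the same figure after each rollout phase gives a like-for-like comparison against the pre-automation baseline.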

Deploy AI Triage Before Your Next Sprint Planning Cycle

Bug triage consumes dedicated engineering hours and generates reassignment cycles across team boundaries. Production systems such as COMET and TRIANGLE address both through ranked candidate lists and team-specialized agents.

Start with shadow mode, continue to assisted triage with human confirmation, and then move to selective automation gated by confidence thresholds. Human review adds the most value when teams use it to validate low-confidence cases, correct drift, and refine project-specific severity and ownership patterns. Teams that baseline reassignment rates, triage time, and deployment rework rate can measure whether the system reduces intake friction instead of moving it.

Written by

Ani Galstian
