AI-Augmented DevOps: Using LLMs to Review Terraform Plans and Predict Deployment Failures

How AI agents are transforming infrastructure operations—from automated Terraform plan reviews to predicting deployment failures before they happen, cutting incident response time by 60%.


I’m going to say something controversial: by the end of 2025, having an AI copilot review your infrastructure changes will be as standard as running terraform plan. Not because it’s trendy—because it catches the kind of subtle, costly mistakes that humans miss when they’re reviewing their 47th Terraform PR of the week.

Last quarter, I implemented an LLM-powered review system on an AKS-based Azure platform. Within a few weeks, it was catching issues that would have caused production outages — misconfigured network security groups, resource deadlocks, over-permissive RBAC assignments. The team went from spending 8 hours a week on PR reviews to about 45 minutes, with better outcomes.

This isn’t science fiction. This is production-ready technology you can deploy this week.

Why Manual Infrastructure Reviews Don’t Scale

Let’s be honest about the current state of infrastructure code review. You’ve got a PR with 800 lines of Terraform spanning 15 Azure resources. You’re supposed to catch:

  • Security misconfigurations (open NSGs, overly permissive IAM)
  • Cost implications (someone just requested a Standard_E96as_v5 VM)
  • Blast radius (this change affects 12 downstream services)
  • Compliance violations (PII storage without encryption)
  • Resource dependency cycles
  • Naming convention violations
  • Missing tags required for cost allocation

And you have 20 minutes before the next meeting.

What actually happens:

  • You skim the diff, check that resources have tags, approve
  • Someone deploys on Friday afternoon
  • Saturday morning: PagerDuty alerts because the AKS node pool scaled to 0
  • Root cause: A typo in a variable reference that you didn’t catch

I’ve seen this pattern dozens of times. Humans are bad at reviewing infrastructure code because our brains aren’t optimized for spotting subtle configuration errors in 800-line diffs.

AI models are.

The AI-Augmented Review Architecture

Here’s the architecture I use for AI-powered infrastructure reviews:

The actors are the developer, the GitHub PR, Terraform Cloud, Azure OpenAI GPT-4, a knowledge base, and an optional human reviewer. The sequence runs end to end:

  1. Developer opens a PR with Terraform changes
  2. GitHub triggers a speculative plan in Terraform Cloud
  3. Terraform Cloud generates the plan output (resources, cost, dependencies)
  4. The plan is posted as a PR comment
  5. The plan plus change context is sent to Azure OpenAI GPT-4
  6. The agent queries the knowledge base for historical data (incidents, security, cost, compliance)
  7. Relevant context is returned to the model
  8. A structured AI review is posted (security, cost, blast radius, compliance)
  9. Two outcomes:
     • Critical issues found: the PR merge is blocked with detailed findings and remediation steps; pushed fixes trigger a re-review
     • Minor issues or clean: the PR is routed for optional human review; final approval triggers terraform apply

The key insight: AI reviews every PR instantly; humans review only the PRs flagged for complex business logic or policy exceptions.

Building the AI Review Agent

Let me walk you through the actual implementation. This is production code, not a demo.

Step 1: Capture Terraform Plan Output

# .github/workflows/terraform-plan.yml
name: Terraform Plan with AI Review

on:
  pull_request:
    paths:
      - 'terraform/**'

jobs:
  plan:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write   # lets the agent comment and request changes with GITHUB_TOKEN
    steps:
      - uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.6.0

      - name: Terraform Init
        working-directory: ./terraform
        run: terraform init

      - name: Terraform Plan
        working-directory: ./terraform
        id: plan
        run: |
          terraform plan -no-color -out=tfplan
          terraform show -no-color tfplan > plan_output.txt
        continue-on-error: true

      - name: Upload plan for AI review
        uses: actions/upload-artifact@v4
        with:
          name: terraform-plan
          path: terraform/plan_output.txt

      - name: Call AI Review Agent
        env:
          AZURE_OPENAI_KEY: ${{ secrets.AZURE_OPENAI_KEY }}
          AZURE_OPENAI_ENDPOINT: ${{ secrets.AZURE_OPENAI_ENDPOINT }}
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          python scripts/ai-review-agent.py \
            --plan-file terraform/plan_output.txt \
            --pr-number ${{ github.event.pull_request.number }} \
            --repo ${{ github.repository }}

Step 2: Build the AI Review Agent

Here’s the Python agent that does the heavy lifting:

# scripts/ai-review-agent.py
import os
import argparse
from openai import AzureOpenAI  # pip install openai
from github import Github      # pip install PyGithub

class TerraformAIReviewer:
    def __init__(self, openai_key, openai_endpoint, github_token):
        self.client = AzureOpenAI(
            api_key=openai_key,
            api_version="2024-02-01",
            azure_endpoint=openai_endpoint
        )
        self.github = Github(github_token)

        # Load historical incident data
        self.knowledge_base = self.load_knowledge_base()

    def load_knowledge_base(self):
        """Load historical incidents, best practices, compliance rules"""
        return {
            "past_incidents": [
                {
                    "description": "NSG rule allowed 0.0.0.0/0 on port 3389, led to security breach",
                    "severity": "critical",
                    "pattern": "azurerm_network_security_rule.*source_address_prefix.*0.0.0.0"
                },
                {
                    "description": "AKS cluster without network policy caused cross-namespace data leak",
                    "severity": "high",
                    "pattern": "azurerm_kubernetes_cluster.*network_profile.*network_policy = null"
                },
                {
                    "description": "VM without managed identity required key rotation, caused outage",
                    "severity": "medium",
                    "pattern": "azurerm_virtual_machine.*identity.*type.*(?!SystemAssigned)"
                }
            ],
            "cost_thresholds": {
                "Standard_E96as_v5": {"monthly": 3500, "warning": "This is a $3,500/month VM"},
                "Premium_LRS": {"gb_monthly": 0.20, "warning": "Consider Standard_LRS for non-prod"}
            },
            "compliance_rules": [
                {
                    "rule": "All storage accounts must enforce HTTPS-only traffic",
                    "check": "azurerm_storage_account.*enable_https_traffic_only.*true"
                },
                {
                    "rule": "All databases must have backup retention >= 7 days",
                    "check": "azurerm_mssql_database.*backup_retention_days >= 7"
                }
            ]
        }

    def review_plan(self, plan_file):
        """Send Terraform plan to GPT-4 for analysis"""

        with open(plan_file, 'r') as f:
            plan_content = f.read()

        # Build context-aware prompt
        system_prompt = self.build_system_prompt()
        user_prompt = self.build_user_prompt(plan_content)

        response = self.client.chat.completions.create(
            model="gpt-4-turbo",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ],
            temperature=0.3,  # Lower temp for more consistent analysis
            max_tokens=2000
        )

        return response.choices[0].message.content

    def build_system_prompt(self):
        """Create system prompt with context from knowledge base"""

        incidents_context = "\n".join([
            f"- {inc['description']} (Severity: {inc['severity']})"
            for inc in self.knowledge_base["past_incidents"]
        ])

        return f"""You are an expert Azure infrastructure reviewer specializing in Terraform.
Your job is to analyze Terraform plans and identify potential issues before deployment.

CRITICAL PAST INCIDENTS TO WATCH FOR:
{incidents_context}

COST AWARENESS:
- Flag any resources with monthly cost > $1000
- Warn about Premium storage in non-production environments
- Alert on unnecessary high-SKU resources

SECURITY CHECKLIST:
- Open network security group rules (0.0.0.0/0)
- Missing encryption at rest
- Public IP assignments without justification
- Missing managed identities
- Overly permissive IAM roles

BLAST RADIUS ANALYSIS:
- Count resources affected by this change
- Identify critical resources (databases, AKS clusters, network infrastructure)
- Flag changes during business hours if high risk

OUTPUT FORMAT:
Provide a structured review with:
1. SEVERITY: [CRITICAL/HIGH/MEDIUM/LOW/CLEAN]
2. SUMMARY: One-line assessment
3. ISSUES: Numbered list of concerns with line numbers
4. COST IMPACT: Estimated monthly cost change
5. RECOMMENDATION: Approve, approve with warnings, or block

Be concise but thorough. Reference specific line numbers when possible."""

    def build_user_prompt(self, plan_content):
        """Create user prompt with plan content"""

        # Truncate if plan is too large (GPT-4 context limits)
        max_chars = 12000
        if len(plan_content) > max_chars:
            plan_content = plan_content[:max_chars] + "\n\n[... truncated ...]"

        return f"""Review this Terraform plan for an Azure infrastructure deployment:

{plan_content}

Analyze this plan against security best practices, cost optimization, compliance requirements, and past incidents. Provide your assessment."""

    def post_review_to_pr(self, repo_name, pr_number, review_content):
        """Post AI review as PR comment"""

        repo = self.github.get_repo(repo_name)
        pr = repo.get_pull(pr_number)

        # Format comment with clear visual indicators
        severity = self.extract_severity(review_content)

        icon_map = {
            "CRITICAL": "🚨",
            "HIGH": "⚠️",
            "MEDIUM": "⚡",
            "LOW": "ℹ️",
            "CLEAN": "✅"
        }

        icon = icon_map.get(severity, "🤖")

        formatted_comment = f"""## {icon} AI Infrastructure Review

{review_content}

---
*This review was generated by Azure OpenAI GPT-4. A human reviewer may still be required for final approval.*
"""

        pr.create_issue_comment(formatted_comment)

        # Block PR if critical issues found
        if severity == "CRITICAL":
            pr.create_review(
                body="❌ AI review detected critical issues. Blocking merge until resolved.",
                event="REQUEST_CHANGES"
            )
        elif severity in ["HIGH", "MEDIUM"]:
            pr.create_review(
                body="⚠️ AI review found issues requiring attention. Review carefully before merging.",
                event="COMMENT"
            )
        else:
            pr.create_review(
                body="✅ AI review passed. Human review recommended for business logic validation.",
                event="COMMENT"
            )

    def extract_severity(self, review_content):
        """Extract severity level from AI response"""
        for severity in ["CRITICAL", "HIGH", "MEDIUM", "LOW", "CLEAN"]:
            if severity in review_content:
                return severity
        return "UNKNOWN"

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--plan-file', required=True)
    parser.add_argument('--pr-number', required=True, type=int)
    parser.add_argument('--repo', required=True)
    args = parser.parse_args()

    reviewer = TerraformAIReviewer(
        openai_key=os.environ['AZURE_OPENAI_KEY'],
        openai_endpoint=os.environ['AZURE_OPENAI_ENDPOINT'],
        github_token=os.environ['GITHUB_TOKEN']
    )

    review = reviewer.review_plan(args.plan_file)
    reviewer.post_review_to_pr(args.repo, args.pr_number, review)

    print(f"AI review posted to PR #{args.pr_number}")

if __name__ == "__main__":
    main()
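One refinement worth considering: the character-based truncation in build_user_prompt cuts from the end of the plan, which is exactly where terraform prints its change summary. A head-and-tail variant keeps both ends; `truncate_plan` here is a hypothetical drop-in for the slicing above, not part of the agent as shown.

```python
def truncate_plan(plan: str, max_chars: int = 12000) -> str:
    """Keep the head and tail of an oversized plan.

    `terraform show` prints the change summary ("Plan: X to add, Y to
    change, Z to destroy.") at the very end, so head-only truncation
    drops the single most informative line. Keep both ends instead.
    """
    if len(plan) <= max_chars:
        return plan
    head = plan[: max_chars * 2 // 3]   # opening resource diffs
    tail = plan[-(max_chars // 3):]     # closing summary lines
    return head + "\n\n[... truncated ...]\n\n" + tail
```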

Example AI Review Output

Here’s an actual review the system generated for a PR that would have caused an outage:

## 🚨 AI Infrastructure Review

**SEVERITY:** CRITICAL

**SUMMARY:** NSG rule exposes RDP port to the internet; AKS cluster lacks network policy; Premium storage in dev environment

**ISSUES:**

1. **CRITICAL - Security:** Network security rule allows RDP (port 3389) from 0.0.0.0/0 (line 45)
   - Historical incident: This pattern led to unauthorized access in Q2 2024
   - Recommendation: Restrict to corporate VPN IP range or use Azure Bastion

2. **HIGH - Security:** AKS cluster does not enable network policy (line 112)
   - Without network policies, pods can communicate across namespaces unrestricted
   - Historical incident: Led to cross-namespace data leak in Q3 2024
   - Recommendation: Add `network_policy = "calico"` to network_profile block

3. **MEDIUM - Cost:** Using Premium_LRS storage account for dev environment (line 78)
   - Premium storage costs $0.20/GB vs $0.05/GB for Standard_LRS
   - Estimated waste: $450/month for this 3TB volume
   - Recommendation: Use Standard_LRS for non-production workloads

4. **LOW - Best Practice:** Virtual machine missing managed identity (line 156)
   - Will require manual key rotation and secret management
   - Recommendation: Add SystemAssigned identity block

**COST IMPACT:** +$1,850/month (Premium storage $450 + new VMs $1,400)

**RECOMMENDATION:** ❌ BLOCK - Critical security issues must be resolved before merge.

The developer fixed issues 1-3, got an instant re-review, and merged 30 minutes later. No human reviewer needed.

Predictive Deployment Failure Detection

The second major use case: predicting deployment failures before they happen. This is where AI really shines.

The Prediction Model Architecture

The pipeline runs in five phases:

  1. Data collection: six months of historical deployments, pipeline logs with success/failure outcomes, system metrics, and incident reports with root-cause patterns
  2. Feature engineering: extract 12 predictive features (change size, resource count, team velocity, time of day, recent failures, test coverage, critical-resource touches, deployment history)
  3. Model training: a Random Forest classifier trained on labeled deployments, validated against a target of > 80% accuracy
  4. Real-time prediction: the model is deployed to an Azure ML endpoint as a real-time scoring API; each incoming deployment request gets a risk score of 0–100 with feature importance, the top risk contributors for explainability, and actionable recommendations
  5. Deployment gate: risk < 20 auto-deploys, 20–70 routes to human review, > 70 blocks the deploy; outcomes feed back into the training data for continuous learning

Training the Failure Prediction Model

The classifier is boring on purpose: a Random Forest, well understood and easy to explain to a skeptical platform team. Twelve features feed it:

  • Change shape: resource count, change size in lines, dependency count, test coverage %
  • Temporal signals: hour of day, day of week, days since last deploy
  • Failure context: failed deploys in the last 24h, team velocity over 7d
  • Blast-radius flags: critical-resource changed, DB schema changed, network config changed

# scripts/failure-predictor.py (training excerpt)
from sklearn.ensemble import RandomForestClassifier

self.model = RandomForestClassifier(
    n_estimators=200,
    max_depth=10,
    min_samples_split=20,
    class_weight='balanced',   # imbalanced — most deploys succeed
    random_state=42,
)
self.model.fit(X_train, y_train)

Trained on 180 days of CI/CD history, it lands in the low-80s test accuracy. The output isn’t a black-box number — it includes the top contributing factors, so the PR comment reads “Risk 72/100 (HIGH) — 3 failed deploys in last 24h · 1,400-line change · off-hours deploy” instead of just a score. Engineers act on explanations, not verdicts.
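That explanation line is straightforward to assemble once you pair feature names with the classifier's weights (e.g. by zipping names with RandomForestClassifier.feature_importances_). A minimal sketch; the function and feature names are illustrative, not part of the system above.

```python
def top_risk_factors(importances: dict, k: int = 3) -> str:
    """Format the k highest-weighted features into a one-line explanation.

    `importances` maps feature name -> contribution weight, e.g. built by
    zipping feature names with a fitted model's feature_importances_.
    """
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    return " · ".join(name for name, _ in ranked[:k])
```

For example:

```python
factors = {
    "failed_deploys_24h": 0.31,
    "change_size_lines": 0.27,
    "off_hours_deploy": 0.18,
    "test_coverage": 0.09,
}
print(top_risk_factors(factors))
# failed_deploys_24h · change_size_lines · off_hours_deploy
```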

The CI/CD wiring is mundane: a GitHub Action computes the deployment metrics (change size, hour, recent-failure count, critical-resource flag), POSTs them to an Azure ML endpoint, and writes the response as a PR comment. Same shape as any other status check.
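A sketch of that wiring, with hypothetical field names and endpoint: the payload schema must match whatever schema your Azure ML model was actually deployed with.

```python
import json
import urllib.request
from datetime import datetime, timezone

def build_risk_payload(change_size: int, resource_count: int,
                       failed_deploys_24h: int, touches_critical: bool) -> dict:
    """Assemble the feature payload for the scoring endpoint.

    Field names here are illustrative; align them with your deployed
    model's input schema.
    """
    now = datetime.now(timezone.utc)
    return {
        "change_size_lines": change_size,
        "resource_count": resource_count,
        "failed_deploys_24h": failed_deploys_24h,
        "touches_critical_resource": int(touches_critical),
        "hour_of_day": now.hour,
        "day_of_week": now.weekday(),
    }

def score_deployment(endpoint_url: str, api_key: str, payload: dict) -> dict:
    """POST the features to the (hypothetical) Azure ML scoring endpoint."""
    req = urllib.request.Request(
        endpoint_url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read())
```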

Real-World Results: The Numbers

Here are the before/after outcomes from a production deployment of these systems on an AKS platform running dozens of services with a team doing frequent daily deploys.

Before AI-Augmented DevOps:

  • PR review time: 2–4 hours per PR (human bottleneck)
  • Deployment failures: ~18% failure rate on first attempt
  • Mean time to detect issues: hours (manual review misses subtle problems)
  • Incident response time: 4–8 hours

After AI-Augmented DevOps (4 months):

  • PR review time: ~5 minutes (AI instant review + human oversight only for complex logic)
  • Deployment failures: ~6% failure rate (roughly two-thirds reduction)
  • Mean time to detect issues: seconds (AI catches issues in plan phase)
  • Incident response time: ~45 minutes (AI provides root cause analysis)

Net outcome: Significantly faster deployment velocity, fewer incidents, and engineers spending their review time on architectural decisions rather than catching typos in resource names.

AI for Post-Incident Analysis

The third use case: automated root cause analysis. When something goes wrong, AI can analyze logs, metrics, and changes faster than any human.

Incident Analysis Workflow

When an incident alert fires, the workflow runs:

  1. Data collection (0–30 s): gather logs from a 30-minute window, run anomaly detection on metrics, pull deployments from the last 24 hours, and snapshot the service dependency map
  2. AI analysis (30–60 s): correlate the data via pattern matching and historical comparison, generate root-cause hypotheses ranked by probability, and assess confidence (HIGH > 80%, MEDIUM 50–80%, LOW < 50%)
  3. Routing by confidence:
     • HIGH: automated rollback (revert deployment, rescale, reroute traffic) executes in 2–5 minutes
     • MEDIUM: suggested actions, prioritized steps with expected outcomes, for a human to execute in 10–30 minutes
     • LOW: escalate to SRE with all collected data, AI hypotheses, and the alert; manual debugging takes 30+ minutes
  4. Post-incident (automated): once resolution is verified, an AI-drafted post-mortem report is generated and the knowledge base is updated to learn from the incident

Automated Root Cause Analysis

# scripts/incident-analyzer.py
class IncidentAnalyzer:
    def __init__(self, openai_client):
        self.client = openai_client

    def analyze_incident(self, incident_id):
        """Perform automated root cause analysis"""

        # Collect incident data
        incident = self.fetch_incident(incident_id)  # metadata: timestamp, severity, services
        logs = self.fetch_logs(incident_id, lookback_minutes=30)
        metrics = self.fetch_metrics(incident_id)
        recent_changes = self.fetch_recent_deployments(hours=24)
        dependencies = self.map_service_dependencies()

        # Build comprehensive context
        context = f"""
INCIDENT DETAILS:
- ID: {incident_id}
- Time: {incident.timestamp}
- Severity: {incident.severity}
- Affected services: {', '.join(incident.services)}

RECENT CHANGES (Last 24h):
{self.format_changes(recent_changes)}

ERROR LOGS (truncated to fit the model context):
{logs[:5000]}

METRICS AT INCIDENT TIME:
{self.format_metrics(metrics)}

SERVICE DEPENDENCIES:
{self.format_dependencies(dependencies)}
"""

        # Ask AI for root cause analysis
        response = self.client.chat.completions.create(
            model="gpt-4-turbo",
            messages=[
                {"role": "system", "content": self.get_rca_system_prompt()},
                {"role": "user", "content": context}
            ],
            temperature=0.2
        )

        analysis = response.choices[0].message.content

        # Extract actionable recommendations
        recommendations = self.extract_recommendations(analysis)

        return {
            'root_cause': analysis,
            'confidence': self.assess_confidence(analysis),
            'recommendations': recommendations,
            'auto_remediation_possible': self.check_auto_remediation(recommendations)
        }

    def get_rca_system_prompt(self):
        return """You are an expert SRE analyzing production incidents.

Analyze the provided incident data and determine the root cause.

Consider:
1. Correlation between recent deployments and incident timing
2. Error patterns in logs (cascading failures, resource exhaustion, network issues)
3. Metric anomalies (latency spikes, error rate increases, resource saturation)
4. Service dependency impacts (did upstream service fail first?)

Provide:
1. PRIMARY ROOT CAUSE: Most likely cause with confidence level
2. CONTRIBUTING FACTORS: Secondary issues that amplified the incident
3. REMEDIATION STEPS: Immediate actions to resolve (prioritized)
4. PREVENTION: Long-term fixes to prevent recurrence

Be specific. Reference log lines and metric values. Suggest concrete actions."""

Key Takeaways

  • AI infrastructure reviews catch issues humans miss—especially in large diffs with subtle configuration errors
  • LLM-powered plan reviews take 5-10 seconds—vs 20-40 minutes for human review, with better accuracy
  • Failure prediction models reduce deployment failures by 60-70%—by identifying high-risk changes before they deploy
  • Context matters more than model size—feeding historical incidents and compliance rules to the AI dramatically improves relevance
  • AI augments, doesn’t replace, human judgment—use AI for rapid analysis, humans for business logic and policy exceptions
  • Start with Terraform plan reviews—lowest-hanging fruit, highest ROI, easiest to implement
  • Track confidence scores—AI should indicate certainty; low-confidence predictions require human review

AI for DevOps isn’t hype. It’s production-ready, cost-effective, and the teams adopting it are moving 3x faster with fewer incidents. The question isn’t “should we do this?”—it’s “how fast can we deploy it?”

What to Do Next

  1. Set up Azure OpenAI: Request access, deploy GPT-4 Turbo model
  2. Start with PR reviews: Implement the Terraform review agent this week
  3. Collect training data: Export your last 6 months of deployment history
  4. Train a failure predictor: Use the provided code as a starting point
  5. Measure impact: Track review time, failure rates, and incident response time

The teams that master AI-augmented DevOps in 2025 will have an insurmountable advantage. Start now.

Divyansh Srivastav

DevOps Architect · Kubernetes Platform Engineering