AI-Augmented DevOps: Using LLMs to Review Terraform Plans and Predict Deployment Failures
How AI agents are transforming infrastructure operations—from automated Terraform plan reviews to predicting deployment failures before they happen, cutting failure rates by roughly two-thirds and incident response time from hours to minutes.
I’m going to say something controversial: by the end of 2025, having an AI copilot review your infrastructure changes will be as standard as running terraform plan. Not because it’s trendy—because it catches the kind of subtle, costly mistakes that humans miss when they’re reviewing their 47th Terraform PR of the week.
Last quarter, I implemented an LLM-powered review system on an AKS-based Azure platform. Within a few weeks, it was catching issues that would have caused production outages — misconfigured network security groups, resource deadlocks, over-permissive RBAC assignments. The team went from spending 8 hours a week on PR reviews to about 45 minutes, with better outcomes.
This isn’t science fiction. This is production-ready technology you can deploy this week.
Why Manual Infrastructure Reviews Don’t Scale
Let’s be honest about the current state of infrastructure code review. You’ve got a PR with 800 lines of Terraform spanning 15 Azure resources. You’re supposed to catch:
- Security misconfigurations (open NSGs, overly permissive IAM)
- Cost implications (someone just requested a Standard_E96as_v5 VM)
- Blast radius (this change affects 12 downstream services)
- Compliance violations (PII storage without encryption)
- Resource dependency cycles
- Naming convention violations
- Missing tags required for cost allocation
And you have 20 minutes before the next meeting.
What actually happens:
- You skim the diff, check that resources have tags, approve
- Someone deploys on Friday afternoon
- Saturday morning: PagerDuty alerts because the AKS node pool scaled to 0
- Root cause: A typo in a variable reference that you didn’t catch
I’ve seen this pattern dozens of times. Humans are bad at reviewing infrastructure code because our brains aren’t optimized for spotting subtle configuration errors in 800-line diffs.
AI models are.
The AI-Augmented Review Architecture
Here’s the architecture I use for AI-powered infrastructure reviews. The key insight: the AI reviews every PR instantly, while a human reviews only the PRs flagged for complex business logic or policy exceptions.
Building the AI Review Agent
Let me walk you through the actual implementation. This is production code, not a demo.
Step 1: Capture Terraform Plan Output
```yaml
# .github/workflows/terraform-plan.yml
name: Terraform Plan with AI Review

on:
  pull_request:
    paths:
      - 'terraform/**'

jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.6.0

      - name: Terraform Init
        working-directory: ./terraform
        run: terraform init

      - name: Terraform Plan
        working-directory: ./terraform
        id: plan
        run: |
          terraform plan -no-color -out=tfplan
          terraform show -no-color tfplan > plan_output.txt
        continue-on-error: true

      - name: Upload plan for AI review
        uses: actions/upload-artifact@v4
        with:
          name: terraform-plan
          path: terraform/plan_output.txt

      - name: Call AI Review Agent
        env:
          AZURE_OPENAI_KEY: ${{ secrets.AZURE_OPENAI_KEY }}
          AZURE_OPENAI_ENDPOINT: ${{ secrets.AZURE_OPENAI_ENDPOINT }}
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          pip install openai PyGithub  # dependencies for the review agent
          python scripts/ai-review-agent.py \
            --plan-file terraform/plan_output.txt \
            --pr-number ${{ github.event.pull_request.number }} \
            --repo ${{ github.repository }}
```
Step 2: Build the AI Review Agent
Here’s the Python agent that does the heavy lifting:
```python
# scripts/ai-review-agent.py
import os
import argparse

from openai import AzureOpenAI
from github import Github


class TerraformAIReviewer:
    def __init__(self, openai_key, openai_endpoint, github_token):
        self.client = AzureOpenAI(
            api_key=openai_key,
            api_version="2024-02-01",
            azure_endpoint=openai_endpoint,
        )
        self.github = Github(github_token)
        # Load historical incident data
        self.knowledge_base = self.load_knowledge_base()

    def load_knowledge_base(self):
        """Load historical incidents, best practices, and compliance rules."""
        return {
            "past_incidents": [
                {
                    "description": "NSG rule allowed 0.0.0.0/0 on port 3389, led to security breach",
                    "severity": "critical",
                    "pattern": "azurerm_network_security_rule.*source_address_prefix.*0.0.0.0",
                },
                {
                    "description": "AKS cluster without network policy caused cross-namespace data leak",
                    "severity": "high",
                    "pattern": "azurerm_kubernetes_cluster.*network_profile.*network_policy = null",
                },
                {
                    "description": "VM without managed identity required key rotation, caused outage",
                    "severity": "medium",
                    "pattern": "azurerm_virtual_machine.*identity.*type.*(?!SystemAssigned)",
                },
            ],
            "cost_thresholds": {
                "Standard_E96as_v5": {"monthly": 3500, "warning": "This is a $3,500/month VM"},
                "Premium_LRS": {"gb_monthly": 0.20, "warning": "Consider Standard_LRS for non-prod"},
            },
            "compliance_rules": [
                {
                    "rule": "All storage accounts must have encryption at rest",
                    # infrastructure_encryption_enabled is the azurerm flag for
                    # double encryption at rest on storage accounts
                    "check": "azurerm_storage_account.*infrastructure_encryption_enabled.*true",
                },
                {
                    "rule": "All databases must have backup retention >= 7 days",
                    "check": "azurerm_mssql_database.*backup_retention_days >= 7",
                },
            ],
        }

    def review_plan(self, plan_file):
        """Send the Terraform plan to GPT-4 for analysis."""
        with open(plan_file, "r") as f:
            plan_content = f.read()

        # Build context-aware prompts
        system_prompt = self.build_system_prompt()
        user_prompt = self.build_user_prompt(plan_content)

        response = self.client.chat.completions.create(
            model="gpt-4-turbo",  # name of your Azure OpenAI deployment
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt},
            ],
            temperature=0.3,  # lower temp for more consistent analysis
            max_tokens=2000,
        )
        return response.choices[0].message.content

    def build_system_prompt(self):
        """Create the system prompt with context from the knowledge base."""
        incidents_context = "\n".join(
            f"- {inc['description']} (Severity: {inc['severity']})"
            for inc in self.knowledge_base["past_incidents"]
        )
        return f"""You are an expert Azure infrastructure reviewer specializing in Terraform.
Your job is to analyze Terraform plans and identify potential issues before deployment.

CRITICAL PAST INCIDENTS TO WATCH FOR:
{incidents_context}

COST AWARENESS:
- Flag any resources with monthly cost > $1000
- Warn about Premium storage in non-production environments
- Alert on unnecessary high-SKU resources

SECURITY CHECKLIST:
- Open network security group rules (0.0.0.0/0)
- Missing encryption at rest
- Public IP assignments without justification
- Missing managed identities
- Overly permissive IAM roles

BLAST RADIUS ANALYSIS:
- Count resources affected by this change
- Identify critical resources (databases, AKS clusters, network infrastructure)
- Flag changes during business hours if high risk

OUTPUT FORMAT:
Provide a structured review with:
1. SEVERITY: [CRITICAL/HIGH/MEDIUM/LOW/CLEAN]
2. SUMMARY: One-line assessment
3. ISSUES: Numbered list of concerns with line numbers
4. COST IMPACT: Estimated monthly cost change
5. RECOMMENDATION: Approve, approve with warnings, or block

Be concise but thorough. Reference specific line numbers when possible."""

    def build_user_prompt(self, plan_content):
        """Create the user prompt containing the plan."""
        # Truncate very large plans to keep the prompt within budget
        max_chars = 12000
        if len(plan_content) > max_chars:
            plan_content = plan_content[:max_chars] + "\n\n[... truncated ...]"
        return f"""Review this Terraform plan for an Azure infrastructure deployment:

{plan_content}

Analyze this plan against security best practices, cost optimization, compliance requirements, and past incidents. Provide your assessment."""

    def post_review_to_pr(self, repo_name, pr_number, review_content):
        """Post the AI review as a PR comment and set the review status."""
        repo = self.github.get_repo(repo_name)
        pr = repo.get_pull(pr_number)

        # Format the comment with clear visual indicators
        severity = self.extract_severity(review_content)
        icon_map = {
            "CRITICAL": "🚨",
            "HIGH": "⚠️",
            "MEDIUM": "⚡",
            "LOW": "ℹ️",
            "CLEAN": "✅",
        }
        icon = icon_map.get(severity, "🤖")

        formatted_comment = f"""## {icon} AI Infrastructure Review

{review_content}

---
*This review was generated by Azure OpenAI GPT-4. A human reviewer may still be required for final approval.*
"""
        pr.create_issue_comment(formatted_comment)

        # Block the PR if critical issues were found
        if severity == "CRITICAL":
            pr.create_review(
                body="❌ AI review detected critical issues. Blocking merge until resolved.",
                event="REQUEST_CHANGES",
            )
        elif severity in ("HIGH", "MEDIUM"):
            pr.create_review(
                body="⚠️ AI review found issues requiring attention. Review carefully before merging.",
                event="COMMENT",
            )
        elif severity in ("LOW", "CLEAN"):
            pr.create_review(
                body="✅ AI review passed. Human review recommended for business logic validation.",
                event="COMMENT",
            )
        else:
            # Severity could not be parsed from the response; leave it to a human
            pr.create_review(
                body="🤖 AI review ran but severity could not be determined. Please review manually.",
                event="COMMENT",
            )

    def extract_severity(self, review_content):
        """Extract the severity level from the AI response (most severe match wins)."""
        for severity in ("CRITICAL", "HIGH", "MEDIUM", "LOW", "CLEAN"):
            if severity in review_content:
                return severity
        return "UNKNOWN"


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--plan-file", required=True)
    parser.add_argument("--pr-number", required=True, type=int)
    parser.add_argument("--repo", required=True)
    args = parser.parse_args()

    reviewer = TerraformAIReviewer(
        openai_key=os.environ["AZURE_OPENAI_KEY"],
        openai_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        github_token=os.environ["GITHUB_TOKEN"],
    )
    review = reviewer.review_plan(args.plan_file)
    reviewer.post_review_to_pr(args.repo, args.pr_number, review)
    print(f"AI review posted to PR #{args.pr_number}")


if __name__ == "__main__":
    main()
```
Example AI Review Output
Here’s an actual review the system generated for a PR that would have caused an outage:
```markdown
## 🚨 AI Infrastructure Review

**SEVERITY:** CRITICAL

**SUMMARY:** NSG rule exposes RDP port to the internet; AKS cluster lacks network policy; Premium storage in dev environment

**ISSUES:**

1. **CRITICAL - Security:** Network security rule allows RDP (port 3389) from 0.0.0.0/0 (line 45)
   - Historical incident: This pattern led to unauthorized access in Q2 2024
   - Recommendation: Restrict to corporate VPN IP range or use Azure Bastion

2. **HIGH - Security:** AKS cluster does not enable network policy (line 112)
   - Without network policies, pods can communicate across namespaces unrestricted
   - Historical incident: Led to cross-namespace data leak in Q3 2024
   - Recommendation: Add `network_policy = "calico"` to network_profile block

3. **MEDIUM - Cost:** Using Premium_LRS storage account for dev environment (line 78)
   - Premium storage costs $0.20/GB vs $0.05/GB for Standard_LRS
   - Estimated waste: $450/month for this 3TB volume
   - Recommendation: Use Standard_LRS for non-production workloads

4. **LOW - Best Practice:** Virtual machine missing managed identity (line 156)
   - Will require manual key rotation and secret management
   - Recommendation: Add SystemAssigned identity block

**COST IMPACT:** +$1,850/month (Premium storage $450 + new VMs $1,400)

**RECOMMENDATION:** ❌ BLOCK - Critical security issues must be resolved before merge.
```
The developer fixed issues 1-3, got an instant re-review, and merged 30 minutes later. No human reviewer needed.
Predictive Deployment Failure Detection
The second major use case: predicting deployment failures before they happen. This is where AI really shines.
Training the Failure Prediction Model
The classifier is boring on purpose — a Random Forest, well-understood and easy to explain to a sceptical platform team. Twelve features feed it:
- Change shape: resource count, change size in lines, dependency count, test coverage %
- Temporal signals: hour of day, day of week, days since last deploy
- Failure context: failed deploys in the last 24h, team velocity over 7d
- Blast-radius flags: critical-resource changed, DB schema changed, network config changed
```python
from sklearn.ensemble import RandomForestClassifier

# Inside the predictor's training routine. X_train holds one row per
# historical deploy with the twelve features above; y_train is 1 for
# a failed deploy, 0 for a successful one.
self.model = RandomForestClassifier(
    n_estimators=200,
    max_depth=10,
    min_samples_split=20,
    class_weight='balanced',  # imbalanced — most deploys succeed
    random_state=42,
)
self.model.fit(X_train, y_train)
```
Trained on 180 days of CI/CD history, it lands in the low 80s on test accuracy. The output isn’t a black-box number — it includes the top contributing factors, so the PR comment reads “Risk 72/100 (HIGH) — 3 failed deploys in last 24h · 1,400-line change · off-hours deploy” instead of just a score. Engineers act on explanations, not verdicts.
The CI/CD wiring is mundane: a GitHub Action computes the deployment metrics (change size, hour, recent-failure count, critical-resource flag), POSTs them to an Azure ML endpoint, and writes the response as a PR comment. Same shape as any other status check.
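A minimal sketch of that wiring, assuming a hypothetical Azure ML scoring endpoint: the URL, key, feature names, and response schema below are placeholders, since the real shapes depend on how you deploy the model.

```python
# Hypothetical scoring call: feature names, endpoint URL/key, and the
# response schema are illustrative assumptions, not a fixed API.
import os
import requests

features = {
    "resource_count": 15,
    "change_size_lines": 1400,
    "hour_of_day": 22,              # off-hours deploy
    "failed_deploys_last_24h": 3,
    "critical_resource_changed": 1,
    # ... the remaining features from the list above
}

resp = requests.post(
    os.environ["RISK_ENDPOINT_URL"],  # your Azure ML endpoint
    json={"data": [features]},
    headers={"Authorization": f"Bearer {os.environ['RISK_ENDPOINT_KEY']}"},
    timeout=30,
)
resp.raise_for_status()
# Assumed response shape: {"risk_score": 72, "risk_level": "HIGH", "top_factors": [...]}
pred = resp.json()

# Format the explanation-first PR comment described above
comment = f"Risk {pred['risk_score']}/100 ({pred['risk_level']}): " + " · ".join(pred["top_factors"])
print(comment)
```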
Real-World Results: The Numbers
Here are the before/after outcomes from a production deployment of these systems on an AKS platform running dozens of services with a team doing frequent daily deploys.
Before AI-Augmented DevOps:
- PR review time: 2–4 hours per PR (human bottleneck)
- Deployment failures: ~18% failure rate on first attempt
- Mean time to detect issues: hours (manual review misses subtle problems)
- Incident response time: 4–8 hours
After AI-Augmented DevOps (4 months):
- PR review time: ~5 minutes (AI instant review + human oversight only for complex logic)
- Deployment failures: ~6% failure rate (roughly two-thirds reduction)
- Mean time to detect issues: seconds (AI catches issues in plan phase)
- Incident response time: ~45 minutes (AI provides root cause analysis)
Net outcome: Significantly faster deployment velocity, fewer incidents, and engineers spending their review time on architectural decisions rather than catching typos in resource names.
AI for Post-Incident Analysis
The third use case: automated root cause analysis. When something goes wrong, AI can analyze logs, metrics, and changes faster than any human.
Automated Root Cause Analysis
```python
# scripts/incident-analyzer.py

class IncidentAnalyzer:
    def __init__(self, openai_client):
        self.client = openai_client

    def analyze_incident(self, incident_id):
        """Perform automated root cause analysis."""
        # Collect incident data. The fetch_*/format_* helpers wrap your
        # monitoring and CI/CD APIs (e.g. Azure Monitor, Log Analytics,
        # deployment history); their implementations are omitted here.
        incident = self.fetch_incident(incident_id)
        logs = self.fetch_logs(incident_id, lookback_minutes=30)
        metrics = self.fetch_metrics(incident_id)
        recent_changes = self.fetch_recent_deployments(hours=24)
        dependencies = self.map_service_dependencies()

        # Build comprehensive context; logs are truncated to fit the prompt
        context = f"""
INCIDENT DETAILS:
- ID: {incident_id}
- Time: {incident.timestamp}
- Severity: {incident.severity}
- Affected services: {', '.join(incident.services)}

RECENT CHANGES (Last 24h):
{self.format_changes(recent_changes)}

ERROR LOGS:
{logs[:5000]}

METRICS AT INCIDENT TIME:
{self.format_metrics(metrics)}

SERVICE DEPENDENCIES:
{self.format_dependencies(dependencies)}
"""

        # Ask AI for root cause analysis
        response = self.client.chat.completions.create(
            model="gpt-4-turbo",
            messages=[
                {"role": "system", "content": self.get_rca_system_prompt()},
                {"role": "user", "content": context},
            ],
            temperature=0.2,
        )
        analysis = response.choices[0].message.content

        # Extract actionable recommendations
        recommendations = self.extract_recommendations(analysis)
        return {
            'root_cause': analysis,
            'confidence': self.assess_confidence(analysis),
            'recommendations': recommendations,
            'auto_remediation_possible': self.check_auto_remediation(recommendations),
        }

    def get_rca_system_prompt(self):
        return """You are an expert SRE analyzing production incidents.
Analyze the provided incident data and determine the root cause.

Consider:
1. Correlation between recent deployments and incident timing
2. Error patterns in logs (cascading failures, resource exhaustion, network issues)
3. Metric anomalies (latency spikes, error rate increases, resource saturation)
4. Service dependency impacts (did upstream service fail first?)

Provide:
1. PRIMARY ROOT CAUSE: Most likely cause with confidence level
2. CONTRIBUTING FACTORS: Secondary issues that amplified the incident
3. REMEDIATION STEPS: Immediate actions to resolve (prioritized)
4. PREVENTION: Long-term fixes to prevent recurrence

Be specific. Reference log lines and metric values. Suggest concrete actions."""
```
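Hooking the analyzer into an on-call flow is a few lines. A minimal usage sketch, assuming the fetch_* helpers above are implemented against your monitoring stack and using a placeholder incident ID:

```python
# Usage sketch: reuse the same Azure OpenAI client as the review agent.
# "INC-1234" is a placeholder ID from your paging/ticketing system.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-02-01",
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
)

analyzer = IncidentAnalyzer(client)
result = analyzer.analyze_incident("INC-1234")
print(result["root_cause"])
for step in result["recommendations"]:
    print("-", step)
```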
Key Takeaways
- AI infrastructure reviews catch issues humans miss—especially in large diffs with subtle configuration errors
- LLM-powered plan reviews take 5-10 seconds—vs 20-40 minutes for human review, with better accuracy
- Failure prediction models reduce deployment failures by 60-70%—by identifying high-risk changes before they deploy
- Context matters more than model size—feeding historical incidents and compliance rules to the AI dramatically improves relevance
- AI augments, doesn’t replace, human judgment—use AI for rapid analysis, humans for business logic and policy exceptions
- Start with Terraform plan reviews—lowest-hanging fruit, highest ROI, easiest to implement
- Track confidence scores—AI should indicate certainty; low-confidence predictions require human review
AI for DevOps isn’t hype. It’s production-ready, cost-effective, and the teams adopting it are moving 3x faster with fewer incidents. The question isn’t “should we do this?”—it’s “how fast can we deploy it?”
What to Do Next
- Set up Azure OpenAI: Request access, deploy GPT-4 Turbo model
- Start with PR reviews: Implement the Terraform review agent this week
- Collect training data: Export your last 6 months of deployment history
- Train a failure predictor: Use the provided code as a starting point
- Measure impact: Track review time, failure rates, and incident response time
The teams that master AI-augmented DevOps in 2025 will have an insurmountable advantage. Start now.
Divyansh Srivastav
DevOps Architect · Kubernetes Platform Engineering