When AI coding assistants almost destroyed production: the AWS CDK disaster I prevented

An AI coding assistant suggested "fixing" a deployment error by modifying a production stack, which would have wiped all production data on the next destroy. Here's why trusting AI with infrastructure code without understanding the implications is extremely dangerous.

Last week, I was reviewing a client’s AWS CDK infrastructure setup when I discovered something that made my blood run cold. Their development team had been working with AI coding assistants to solve a deployment issue, and the AI had suggested a “quick fix” that would have resulted in the complete destruction of their production database and all associated data.

The terrifying part? The developer was ready to implement the suggestion without understanding the catastrophic implications.

This wasn’t a case of malicious AI or a particularly complex scenario. It was a textbook example of how AI coding assistants optimize for making code work without understanding the broader consequences of infrastructure decisions. The AI saw an error message and suggested a logical-seeming solution that would have turned a minor deployment issue into a business-ending disaster.

Here’s what happened, why it was so dangerous, and what every technical leader needs to know about AI-assisted infrastructure development.

The scenario: a “simple” CDK multi-stack deployment

The team was building a data migration system using AWS CDK, split across multiple stacks for logical separation:

  1. Production Stack: The main application infrastructure with RDS database, Lambda functions, and core business logic
  2. Migration Stack: A temporary stack designed to import data from their legacy system
  3. Cleanup Stack: Another temporary stack for post-migration cleanup tasks

This is a reasonable architectural approach. You want your core production infrastructure separate from temporary migration components so you can destroy the temporary pieces after the migration completes.

The migration stack needed to access certain resources created by the production stack - specifically, it needed IAM permissions to read from the production database and write to specific S3 buckets.

In CDK, there are several ways to handle cross-stack resource access. The team initially tried to use CDK’s cross-stack references, but ran into a common issue: you can’t directly attach policies to resources that are imported from other stacks.

This is where the AI assistant entered the picture.

The AI’s “logical” but catastrophic suggestion

When the deployment failed with an error about being unable to attach IAM policies to imported resources, the developer fed the error message to their AI coding assistant. The AI analyzed the error and suggested what seemed like a reasonable solution:

“Instead of trying to attach policies to imported resources, extend the production stack to include the IAM roles and policies needed by the migration stack. This will give you direct access to the resources without the cross-stack import limitations.”

The AI even provided clean, well-structured CDK code that would implement this approach. The code looked professional, followed CDK best practices, and would have solved the immediate deployment issue.

There was just one problem: implementing this suggestion would have created a dependency relationship that would destroy the entire production stack when the temporary migration stack was cleaned up.

The hidden danger: CDK dependency chains and cascade destruction

Here’s the critical concept that the AI assistant completely missed: In AWS CDK, when you create dependencies between stacks, you create potential cascade destruction scenarios.

The AI’s suggested approach would have created this dependency chain:

Migration Stack → Production Stack (extended with migration IAM)

This means the production stack would become dependent on resources defined in the migration stack. In CDK’s dependency model, this creates a deletion order requirement: you cannot delete the production stack while the migration stack still references its resources.

But here’s where it gets catastrophic: CDK resolves dependencies in both directions. If the migration stack is deleted, CDK evaluates whether any dependent resources also need to be cleaned up. With the AI’s suggested architecture, deleting the migration stack would trigger a cascade deletion of the production stack.

The practical impact: When the migration was complete and the team ran cdk destroy MigrationStack, CDK would have:

  1. Deleted the migration stack
  2. Evaluated dependency chains
  3. Determined that the production stack components added for migration support were no longer needed
  4. Initiated deletion of the production stack
  5. Wiped out the production database, all application data, and the entire business-critical infrastructure
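The lifecycle coupling behind those steps can be illustrated with a toy model. This is not real CDK or CloudFormation semantics, just the shape of the dependency evaluation described above: once the production stack holds resources that exist only for the migration stack, destroying the migration stack puts the production stack inside the blast radius.

```typescript
// Toy model of stack coupling (illustrative only, not CDK itself).
// Maps each stack to the stacks that define resources purely for it.
const provides: Record<string, string[]> = {
  // Under the AI's suggestion, ProductionStack carries IAM resources
  // that exist solely to serve MigrationStack:
  MigrationStack: ["ProductionStack"],
};

// Everything a destroy of `stack` would touch, transitively.
function blastRadius(stack: string, seen = new Set<string>()): string[] {
  seen.add(stack);
  for (const dep of provides[stack] ?? []) {
    if (!seen.has(dep)) blastRadius(dep, seen);
  }
  return [...seen];
}

const touched = blastRadius("MigrationStack");
// With the safe design (no entry in `provides`), the blast radius of
// MigrationStack would be MigrationStack alone.
```

The point of the model: the danger was introduced the moment one stack's resources existed only for the other's benefit, not at destroy time.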

The AI was optimizing for making the deployment work, not for understanding the catastrophic business implications of the architectural decision.

Facing a leadership challenge right now?

Don't wait for the next fire to burn you out. In a 30-minute discovery call we'll map your blockers and outline next steps you can use immediately with your team.

Why AI assistants are particularly dangerous for infrastructure

This scenario highlights fundamental problems with using AI assistants for infrastructure decisions:

AI optimizes for code functionality, not business consequences

AI coding assistants are trained to solve immediate technical problems. When they see a deployment error, they suggest solutions that make the code work. They don’t understand:

  • Business criticality of different infrastructure components
  • Long-term maintenance implications of architectural decisions
  • Cascade effects in infrastructure dependency chains
  • The difference between “working code” and “safe infrastructure”

Infrastructure code isn’t just code

This is the key insight that AI assistants consistently miss: infrastructure code directly controls business-critical systems and data. Unlike application code, where bugs might cause user experience issues, infrastructure mistakes can destroy businesses.

Every infrastructure decision has implications for:

  • Data persistence and backup strategies
  • Security boundaries and access controls
  • Disaster recovery and business continuity
  • Cost optimization and resource management
  • Compliance and audit requirements

AI assistants treat infrastructure code like any other programming problem, without understanding these broader implications.

Context matters more than syntax

The AI assistant understood CDK syntax perfectly and generated syntactically correct code. But it completely missed the operational context:

  • This was production infrastructure with live business data
  • The migration was temporary and would be destroyed after completion
  • The dependency relationship would create unexpected deletion behaviors
  • Alternative approaches existed that wouldn’t create these risks

Infrastructure decisions require understanding the full operational lifecycle, not just the immediate technical implementation.

The correct solution: isolated IAM resources

The safe solution to the original problem is straightforward once you understand the constraints:

Instead of extending the production stack or trying to attach policies to imported resources, create independent IAM resources in the migration stack that reference (but don’t depend on) production resources.

// At the top of the stack file:
import * as iam from 'aws-cdk-lib/aws-iam';

// Inside the MigrationStack constructor — safe approach: an independent
// IAM role defined entirely in the migration stack
const migrationRole = new iam.Role(this, 'MigrationRole', {
  assumedBy: new iam.ServicePrincipal('lambda.amazonaws.com'),
  inlinePolicies: {
    DatabaseAccess: new iam.PolicyDocument({
      statements: [
        new iam.PolicyStatement({
          effect: iam.Effect.ALLOW,
          actions: ['rds:DescribeDBInstances'],
          // Reference the production database by ARN rather than by CDK
          // construct, so no cross-stack dependency is created
          resources: [`arn:aws:rds:${this.region}:${this.account}:db:production-db`]
        }),
        new iam.PolicyStatement({
          effect: iam.Effect.ALLOW,
          // IAM database authentication uses the rds-db service prefix;
          // 'migration-user' is an example database user name
          actions: ['rds-db:connect'],
          resources: [`arn:aws:rds-db:${this.region}:${this.account}:dbuser:*/migration-user`]
        })
      ]
    })
  }
});

This approach:

  • Creates no dependency between stacks
  • Allows safe deletion of the migration stack
  • Provides the necessary permissions
  • Maintains clear architectural boundaries

The AI assistant could have suggested this approach, but it was fixated on solving the immediate error rather than understanding the architectural requirements.

Coaching for Tech Leads & CTOs

Ongoing 1:1 coaching for startup leaders who want accountability, proven frameworks, and a partner to help them succeed under pressure.

Real-world scenarios where AI infrastructure suggestions go wrong

This CDK scenario isn’t unique. I’ve seen AI assistants suggest dangerous infrastructure changes across multiple domains:

Security group modifications

The Scenario: Application can’t connect to database. AI suggests opening security group rules.

The AI Suggestion: “Add a security group rule allowing all traffic from 0.0.0.0/0 to port 5432 to resolve connectivity issues.”

The Reality: This opens the production database to the entire internet. The correct solution was fixing the application’s subnet configuration.
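One way to catch this class of suggestion before it reaches production is an automated review check over proposed security group rules. A minimal sketch, with a hypothetical rule shape and an illustrative list of database ports:

```typescript
// Flags ingress rules that open common database ports to the internet.
// The rule shape and port list are illustrative, not a real AWS API.
type IngressRule = { cidr: string; port: number };

const DB_PORTS = new Set([3306, 5432, 1433]); // MySQL, Postgres, SQL Server

function flagDangerousRules(rules: IngressRule[]): IngressRule[] {
  return rules.filter((r) => r.cidr === "0.0.0.0/0" && DB_PORTS.has(r.port));
}

// The AI's suggestion trips the check; a VPC-scoped rule does not:
const flagged = flagDangerousRules([
  { cidr: "0.0.0.0/0", port: 5432 },   // world-open Postgres
  { cidr: "10.0.0.0/16", port: 5432 }, // scoped to the VPC
]);
```

A check like this belongs in CI or a pre-deployment policy gate, so the world-open rule never depends on a reviewer happening to notice it.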

Auto-scaling configuration

The Scenario: Application experiencing high load during traffic spikes.

The AI Suggestion: “Increase maximum instance count to 1000 and set aggressive scaling policies to handle any load.”

The Reality: No cost controls or resource limits. A traffic spike or DDoS attack could generate $50K+ in AWS bills in hours. The correct solution involved traffic analysis and gradual scaling with cost monitoring.
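The arithmetic behind that figure is worth making explicit. A back-of-envelope sketch, using an assumed hourly rate (actual pricing varies by instance type and region):

```typescript
// Back-of-envelope cost of uncapped scaling. The $1.00/hour rate is an
// assumed example for a larger instance type, not a quoted AWS price.
function scalingCostPerHour(instances: number, hourlyRate: number): number {
  return instances * hourlyRate;
}

const perHour = scalingCostPerHour(1000, 1.0); // $1,000 per hour
const twoDays = perHour * 48;                  // $48,000 over a weekend
```

Even at modest rates, a four-digit instance cap with no billing alarm turns a traffic anomaly into a five-figure invoice before anyone looks at a dashboard.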

Database migration scripts

The Scenario: Need to update database schema in production.

The AI Suggestion: “Run this ALTER TABLE command directly on production to add the new column.”

The Reality: The suggestion would lock the table for hours during the migration, causing complete application downtime. The correct approach required careful planning with read replicas and phased deployment.
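A lightweight guardrail for this class of suggestion is a pre-flight check that flags DDL statements known to take long locks on large tables. A minimal sketch; the denylist is illustrative and actual lock behavior depends on the database engine and version:

```typescript
// Flags SQL that should go through a planned migration (read replicas,
// phased rollout) instead of running directly against production.
// The operation list is an illustrative starting point, not exhaustive.
const LOCKING_OPERATIONS = ["ALTER TABLE", "CREATE INDEX", "VACUUM FULL"];

function needsMigrationPlan(sql: string): boolean {
  const upper = sql.toUpperCase();
  return LOCKING_OPERATIONS.some((op) => upper.includes(op));
}

// The AI's suggestion gets flagged; ordinary reads pass through:
const flaggedDdl = needsMigrationPlan("ALTER TABLE orders ADD COLUMN note text");
const plainRead = needsMigrationPlan("SELECT * FROM orders");
```

A crude string check like this is deliberately conservative: it forces a human conversation about any statement that might lock a production table.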

Backup and recovery modifications

The Scenario: Need to modify backup retention policies to reduce costs.

The AI Suggestion: “Set backup retention to 1 day and disable cross-region replication to minimize storage costs.”

The Reality: This would have eliminated the ability to recover from any incident older than 24 hours and removed geographic disaster recovery capability. The correct approach involved analyzing actual recovery requirements and optimizing backup strategies accordingly.
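That analysis step can be expressed as a simple check: does the proposed policy still cover the recovery window and geographic redundancy the business actually requires? A sketch with illustrative numbers:

```typescript
// Validates a proposed backup policy against stated recovery
// requirements. The policy shape and thresholds are illustrative.
type BackupPolicy = { retentionDays: number; crossRegion: boolean };

function meetsRecoveryRequirements(
  policy: BackupPolicy,
  requiredRecoveryWindowDays: number,
  requiresGeoRedundancy: boolean
): boolean {
  return (
    policy.retentionDays >= requiredRecoveryWindowDays &&
    (policy.crossRegion || !requiresGeoRedundancy)
  );
}

// The AI's cost "optimization" fails a 30-day recovery requirement
// with geographic redundancy:
const aiSuggestion = { retentionDays: 1, crossRegion: false };
const ok = meetsRecoveryRequirements(aiSuggestion, 30, true);
```

The useful part is not the function itself but the forcing question it encodes: cost changes to backups must be evaluated against recovery requirements, not storage bills.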

The pattern: AI tools lack business context

In every case, the AI assistant provided technically correct solutions that would have caused business disasters. The pattern is consistent:

  1. Focus on immediate technical problem: AI sees error or inefficiency and optimizes for fixing it
  2. Ignore operational implications: No understanding of business impact, risk management, or long-term consequences
  3. Miss critical constraints: Don’t consider security, compliance, cost, or reliability requirements
  4. Provide dangerous simplifications: Suggest approaches that work in isolated scenarios but fail in production environments

This is why infrastructure decisions require human expertise, business context, and operational understanding that current AI tools simply don’t possess.

What technical leaders need to know about AI infrastructure risks

Establish AI usage policies for infrastructure

Not all code is equally risky. Infrastructure code that controls production systems, data persistence, security boundaries, and network access requires different handling than application logic.

Create distinct policies:

  • Application Code: AI assistance with human review
  • Infrastructure Code: AI research only, human-driven implementation
  • Security Configuration: No AI assistance for production systems
  • Database Operations: Senior engineer oversight required
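A policy table like this is most useful when it is enforceable, for example by a PR bot or CI gate. A sketch of the same categories as an executable lookup; the category names and the enforcement hook are assumptions:

```typescript
// The policy table above as a lookup, defaulting to the most
// restrictive policy for anything unclassified.
type AiPolicy =
  | "ai-with-review"      // AI assistance with human review
  | "ai-research-only"    // AI research only, human-driven implementation
  | "no-ai"               // no AI assistance for production systems
  | "senior-oversight";   // senior engineer oversight required

const AI_POLICY: Record<string, AiPolicy> = {
  "application-code": "ai-with-review",
  "infrastructure-code": "ai-research-only",
  "security-configuration": "no-ai",
  "database-operations": "senior-oversight",
};

function policyFor(category: string): AiPolicy {
  // Fail closed: unknown categories get the strictest treatment.
  return AI_POLICY[category] ?? "no-ai";
}
```

The fail-closed default matters more than the table contents: new categories of change should require an explicit decision before AI assistance is allowed.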

Implement infrastructure-specific review processes

Standard code review processes aren’t sufficient for infrastructure code. You need reviews that specifically evaluate:

  • Dependency relationships and deletion ordering
  • Security implications and access boundaries
  • Cost impact and resource scaling behavior
  • Disaster recovery and backup implications
  • Compliance and audit trail requirements

Train teams on infrastructure risk assessment

Developers using AI tools for infrastructure need training on:

  • How cloud dependencies work and cascade deletion scenarios
  • Security implications of different architectural choices
  • Cost optimization and resource limit strategies
  • Operational impact of infrastructure changes
  • When to escalate infrastructure decisions to senior engineers

Use AI for research, not implementation

AI tools are excellent for researching infrastructure patterns, understanding error messages, and exploring different approaches. But the actual implementation decisions should always involve human expertise that understands the business and operational context.

Got a leadership question?

Share your toughest challenge and I might feature it in an upcoming episode. It's free, anonymous, and you'll get extra resources in return.

Case study: building safe AI-assisted infrastructure practices

After the near-disaster I described, I worked with this client to establish better practices for AI-assisted infrastructure development:

Step 1: Infrastructure classification system

We classified all infrastructure code into risk categories:

  • High Risk: Production data, security controls, network boundaries
  • Medium Risk: Application infrastructure, non-production environments
  • Low Risk: Development tools, temporary resources, documentation

AI assistance policies were tailored to each risk level.

Step 2: Review gate implementation

All infrastructure changes required two types of review:

  • Technical Review: Does the code work correctly and follow best practices?
  • Operational Review: Does the change align with business requirements, security policies, and operational constraints?

For changes involving AI assistance, we added a third review: AI Impact Assessment to evaluate whether the AI understood the full context of the suggestion.

Step 3: Training program

We implemented training that covered:

  • Infrastructure fundamentals: How cloud dependencies, networking, and security actually work
  • Risk assessment: How to evaluate the business impact of infrastructure changes
  • AI tool limitations: When AI suggestions are helpful vs. dangerous
  • Escalation protocols: When to involve senior engineers or architects

Step 4: Monitoring and learning

We tracked AI-assisted infrastructure decisions and their outcomes:

  • What AI suggestions were implemented and their long-term impact
  • What AI suggestions were rejected and why
  • What problems emerged from AI-assisted infrastructure changes
  • How to improve our review processes and training

The results

After six months with these practices:

  • Zero infrastructure incidents related to AI-assisted changes
  • Faster development for appropriate use cases where AI assistance was safe
  • Better team skills in evaluating infrastructure decisions
  • Improved risk management across all infrastructure changes

The key insight: AI tools became more valuable when we established clear boundaries around their use.

The broader implications: AI in enterprise infrastructure

This CDK incident represents a broader challenge as AI tools become more prevalent in enterprise infrastructure management:

The skills gap is widening

As AI tools make it easier to write infrastructure code, the gap between writing code and understanding infrastructure is growing. Developers can generate complex infrastructure configurations without understanding how they actually work.

Traditional review processes are insufficient

Code review processes designed for application code don’t catch infrastructure-specific risks. Organizations need new review frameworks that understand the operational implications of infrastructure changes.

Documentation and training become critical

As AI abstracts away infrastructure complexity, teams need better training on fundamentals. Understanding how cloud services actually work becomes more important, not less.

Risk management needs updating

Traditional risk management frameworks don’t account for AI-generated infrastructure decisions. Organizations need new approaches that consider the unique risks of AI-assisted infrastructure development.

Building infrastructure teams that use AI safely

For startups and growing companies

Start with human expertise: Before scaling AI-assisted infrastructure development, ensure you have senior engineers who understand cloud architecture, security, and operational implications.

Establish safety boundaries: Create clear policies about when AI tools are appropriate for infrastructure decisions and when human expertise is required.

Implement proper review processes: Design review workflows that catch infrastructure-specific risks that AI tools commonly miss.

Invest in training: Ensure team members understand the fundamentals that AI tools abstract away.

For established organizations

Audit existing AI usage: Review infrastructure changes that involved AI assistance for potential hidden risks or dependency issues.

Update governance frameworks: Modify existing infrastructure governance to account for AI-assisted development patterns.

Develop AI-specific risk assessments: Create frameworks for evaluating the risks of AI-generated infrastructure suggestions.

Create learning programs: Train teams on safe AI usage for infrastructure while maintaining fundamental understanding.

The technical leader’s role in AI infrastructure safety

As technical leaders, we have a responsibility to ensure AI tools enhance rather than endanger our infrastructure:

Setting boundaries and expectations

Technical leaders need to establish clear guidelines about when and how AI tools should be used for infrastructure decisions. This isn’t about restricting useful tools - it’s about ensuring they’re used safely.

Building review processes that catch AI-specific risks

Traditional infrastructure review processes need updating to catch the specific types of problems that AI tools commonly create: over-optimization for immediate problems while missing broader implications.

Developing team capabilities

Teams need training not just on how to use AI tools, but on how to evaluate AI suggestions critically and understand when human expertise is required.

Balancing innovation with safety

The goal isn’t to eliminate AI assistance, but to harness its benefits while maintaining the operational safety and business continuity that infrastructure decisions require.

Conclusion: infrastructure requires human judgment

AI coding assistants are powerful tools that can accelerate infrastructure development when used appropriately. But infrastructure decisions have consequences that extend far beyond whether the code compiles and deploys successfully.

The CDK scenario I described illustrates a fundamental limitation of current AI tools: they optimize for immediate technical problems without understanding broader business, operational, or risk management context.

Infrastructure code isn’t just code - it’s your business continuity plan, your security boundary, and your disaster recovery strategy all rolled into one. These decisions require human judgment, business context, and operational expertise that current AI tools simply don’t possess.

Key takeaways for technical leaders

AI tools are research assistants, not infrastructure architects: Use AI to explore options and understand error messages, but make implementation decisions with full human understanding of the implications.

Infrastructure reviews need AI-specific considerations: Traditional code review processes don’t catch the types of risks that AI-generated infrastructure commonly introduces.

Team training becomes more critical, not less: As AI abstracts away complexity, teams need deeper understanding of fundamentals to use AI safely.

Risk management frameworks need updating: Organizations need new approaches to evaluate and manage the risks of AI-assisted infrastructure development.

The path forward

The companies that thrive with AI-assisted infrastructure development will be those that:

  • Establish clear boundaries around AI usage for different types of infrastructure decisions
  • Implement review processes that understand operational and business implications
  • Invest in team training that builds fundamental understanding alongside AI tool usage
  • Create risk management frameworks that account for AI-specific challenges

The companies that struggle will be those that treat infrastructure code like any other code and trust AI suggestions without understanding their broader implications.

The investment in safety

Implementing proper AI safety practices for infrastructure requires investment in training, review processes, and senior technical expertise. But this investment is minimal compared to the cost of infrastructure disasters.

I’ve seen companies spend months and hundreds of thousands of dollars recovering from infrastructure mistakes - lost data, security breaches, compliance violations, and extended downtime. The cost of proper oversight and review processes is always less than the cost of catastrophic infrastructure failures.

Trusting AI with infrastructure decisions you don’t understand is like giving someone the keys to your data center without knowing their qualifications. The potential consequences are simply too severe to rely on tools that don’t understand the business context of their suggestions.

The future of infrastructure development will be AI-assisted, but it must be human-guided. The teams that master this balance will build faster while maintaining the safety and reliability that businesses depend on.

I’ve helped numerous organizations develop safe practices for AI-assisted infrastructure development, from establishing governance frameworks to training teams on effective AI usage while maintaining operational safety. If you’re looking to harness AI tools for infrastructure development while avoiding the common pitfalls and hidden risks, I’d be happy to discuss how fractional CTO support can help your specific situation.
