
Why yesterday's AWS outage proves your disaster recovery plan isn't good enough
The October 20, 2025 AWS outage in us-east-1 took down Snapchat, Roblox, and thousands of services. I've helped multiple companies survive similar failures. Here's what most startups get catastrophically wrong about disaster recovery and multi-region architecture.
Yesterday’s AWS outage in us-east-1 took down Snapchat, Roblox, Robinhood, and thousands of other services for hours. I got three panicked calls from founders before 7 AM. Their startups were completely offline. No revenue. No access to customer data. Just watching their businesses burn while AWS engineers worked on fixes they had zero control over.
The worst part? Two of these companies had been operating for over two years. They’d raised millions. They had paying enterprise customers. And they were still running 100% in us-east-1 with zero disaster recovery plan.
After helping dozens of companies survive infrastructure failures over 15+ years, from database corruptions to complete datacenter failures, I’ve learned a hard truth: the AWS outage didn’t break your business. Your architecture choices did.
What actually happened (and what AWS won’t tell you plainly)
Let me walk you through what happened yesterday, translating AWS’s technical post-mortem into language that explains why your services were down and what it means for your architecture decisions.
According to AWS’s incident report, the problems began at 11:49 PM Pacific Time on October 19 and weren’t fully resolved until 3:01 PM Pacific Time on October 20. That’s over 15 hours of problems affecting us-east-1, one of the most critical AWS regions in the world.
Here’s what AWS said happened:
“Between 11:49 PM PDT on October 19 and 2:24 AM PDT on October 20, we experienced increased error rates and latencies for AWS Services in the US-EAST-1 Region. Additionally, services or features that rely on US-EAST-1 endpoints such as IAM and DynamoDB Global Tables also experienced issues during this time.”
Translation for non-technical founders: For roughly two and a half hours, services weren’t completely down but were failing frequently and running slowly. Even worse, global services that companies thought were independent of us-east-1 (like IAM, which controls who can access what, and DynamoDB Global Tables, which are supposed to work across regions) also broke. This is the hidden dependency problem I’ll explain later.
At 12:26 AM, AWS identified the root trigger:
“AWS identified the trigger of the event as DNS resolution issues for the regional DynamoDB service endpoints.”
Translation: DNS is like the phone book of the internet. It translates service names into addresses that computers can actually connect to. When DNS breaks, your applications can’t find the databases they need to talk to. It’s like having someone’s phone number but the phone company’s directory system is down, so you can’t actually make the call.
DynamoDB is one of AWS’s most popular database services. Thousands of applications depend on it. When those applications couldn’t resolve DynamoDB’s DNS names, they couldn’t access their data. No data access means your application stops working.
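To make the DNS failure mode concrete, here’s a minimal Python sketch (standard library only, no AWS SDK) of what “can’t resolve the endpoint” looks like from inside an application. It simply tries to look up the regional DynamoDB endpoint name, which is the step that was failing during the outage.

```python
import socket

# Illustrative only: how a DNS outage surfaces inside an application.
# The hostname is the real regional DynamoDB endpoint for us-east-1.
try:
    socket.getaddrinfo("dynamodb.us-east-1.amazonaws.com", 443)
    print("DNS resolved: the application can at least locate the service")
except socket.gaierror as exc:
    # During the outage, lookups like this one failed, so applications
    # could not even open a connection to a service that was still running.
    print(f"DNS resolution failed: {exc}")
```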
AWS fixed the DNS issue by 2:24 AM. You’d think this meant recovery, right? Wrong. Here’s where it gets worse:
“After resolving the DynamoDB DNS issue at 2:24 AM, services began recovering but we had a subsequent impairment in the internal subsystem of EC2 that is responsible for launching EC2 instances due to its dependency on DynamoDB.”
Translation: Fixing the first problem exposed a second problem. EC2 (the service that runs virtual servers) depends on DynamoDB under the hood for its own internal systems. When DynamoDB had issues, EC2 couldn’t launch new server instances. Companies trying to scale up to handle traffic after the recovery couldn’t do it. Companies with auto-scaling configurations couldn’t replace failed instances.
The cascading failures continued:
“As AWS continued to work through EC2 instance launch impairments, Network Load Balancer health checks also became impaired, resulting in network connectivity issues in multiple services such as Lambda, DynamoDB, and CloudWatch.”
Translation: The problems cascaded. Load balancers couldn’t check server health, which caused networking problems, which broke serverless functions (Lambda), databases (DynamoDB again), and monitoring systems (CloudWatch). Each service that broke caused other services that depend on it to break.
This is called a cascading failure. It’s the worst-case scenario in distributed systems. One problem triggers another, which triggers another, creating a cascade that’s extremely difficult to stop.
AWS didn’t recover the Network Load Balancer health checks until 9:38 AM. During recovery, they had to throttle operations to prevent making things worse:
“As part of the recovery effort, we temporarily throttled some operations such as EC2 instance launches, processing of SQS queues via Lambda Event Source Mappings, and asynchronous Lambda invocations.”
Translation: To prevent overwhelming their already-struggling systems, AWS deliberately slowed down certain operations. If you were trying to launch new servers or process message queues during this time, your requests were intentionally delayed. This is like emergency rooms going on “diversion” during a crisis: you can still get care, but it’s going to take longer.
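On the client side, the counterpart to provider throttling is retrying with backoff instead of failing hard on the first throttle error. A minimal sketch, assuming you use boto3 (the retry mode and attempt count here are illustrative settings, not AWS’s recommendation for this incident):

```python
import boto3
from botocore.config import Config

# Illustrative: let the SDK back off and retry automatically when AWS
# throttles or returns transient errors during an incident, instead of
# surfacing every ThrottlingException to your application code.
retry_config = Config(
    retries={
        "max_attempts": 10,   # illustrative ceiling; tune for your workload
        "mode": "adaptive",   # client-side rate limiting plus exponential backoff
    }
)

lambda_client = boto3.client("lambda", region_name="us-west-2", config=retry_config)

# Calls made with this client are retried with backoff under the hood.
response = lambda_client.list_functions()
```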
By 3:01 PM Pacific Time, AWS reported all services returned to normal operations. That’s over 15 hours from start to finish. Some services like AWS Config, Redshift, and Connect had backlogs that took additional hours to process.
The scope was staggering. AWS’s status page lists 142 affected services. Not 10 or 20. One hundred and forty-two different services had problems because of cascading failures in us-east-1.
The real impact: what actually broke
Let me make this concrete. Here are just some of the services that were affected:
Core infrastructure: EC2 (virtual servers), Lambda (serverless functions), ECS (containers), EKS (Kubernetes), Elastic Load Balancing
Databases: DynamoDB, RDS, ElastiCache, Redshift, Aurora, Neptune, DocumentDB
Storage: S3, EBS, EFS, FSx
Networking: VPC, CloudFront, Route 53, Direct Connect, Transit Gateway, VPN
Security & Identity: IAM, Cognito, Secrets Manager, GuardDuty, Security Token Service
Development tools: CodeBuild, CloudFormation, Systems Manager, CloudWatch
Business applications: Connect (contact center), WorkSpaces, Chime, SES (email)
If you use AWS, you use many of these services. When us-east-1 has cascading failures like this, huge swaths of the AWS ecosystem break simultaneously.
This is what I mean by blast radius. A DNS issue in DynamoDB shouldn’t break your EC2 instances. Your EC2 problems shouldn’t break your load balancers. Your load balancer problems shouldn’t break Lambda. But in us-east-1, with nearly two decades of accumulated technical dependencies and service interdependencies, that’s exactly what happens.
Why us-east-1 is different (and more dangerous) than other regions
Here’s what most CTOs don’t understand about us-east-1: it’s not just another AWS region. It’s fundamentally different in ways that make it far more dangerous as a single point of failure.
The historical accident that created dependency hell
us-east-1 was AWS’s first region. It launched before AWS had really figured out how to architect truly independent regions. Because of this legacy, certain AWS services and features only exist in us-east-1 or have critical components that run exclusively there.
Yesterday’s outage proved this in the worst way. AWS explicitly stated:
“Services or features that rely on US-EAST-1 endpoints such as IAM and DynamoDB Global Tables also experienced issues during this time.”
Translation: Even if you think you’re not using us-east-1, you probably are. IAM (Identity and Access Management) is supposed to be a global service. DynamoDB Global Tables are literally designed to work across multiple regions. But both had problems because they have dependencies on us-east-1 endpoints.
This means companies running entirely in us-west-2 or eu-west-1 still had authentication problems and database replication issues because of the us-east-1 outage. Your multi-region architecture didn’t protect you if you depended on these “global” services.
The blast radius problem
I worked with a fintech startup last year that learned this lesson the hard way. They’d architected everything in eu-west-1 (Ireland) because their customers were primarily European. They felt smart about avoiding us-east-1.
Then us-east-1 had an outage. Their application went down anyway.
Why? Their authentication system used AWS Cognito, which at the time had control plane operations that depended on us-east-1. Their monitoring used CloudWatch with cross-region dashboards that aggregated data through us-east-1 endpoints. Their CI/CD pipeline used CodePipeline with artifacts stored in S3 buckets that had replication configurations managed through us-east-1.
They weren’t running their application in us-east-1, but they had a dozen invisible dependencies on it. The outage exposed every single one.
The default trap
The other problem with us-east-1 is that it’s the default region in almost every AWS SDK, CLI tool, and tutorial. When developers write code or follow documentation, they end up in us-east-1 unless they explicitly choose otherwise.
I’ve reviewed infrastructure for dozens of startups over the years. In nearly every case, I find resources that were deployed to us-east-1 accidentally. An IAM role created by a developer following a tutorial. An S3 bucket created by the deployment pipeline. A Lambda function deployed during an experiment that never got cleaned up.
These accidental us-east-1 resources create hidden dependencies. When us-east-1 goes down, services break for reasons that aren’t immediately obvious because nobody remembered that dependency even existed.
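One cheap guardrail against the default trap is making the region an explicit parameter everywhere instead of relying on SDK defaults. A minimal sketch, assuming you use boto3 (the region is just an example):

```python
import boto3

# Better: make the region an explicit, reviewable decision rather than
# relying on SDK defaults, environment variables, or whatever region a
# tutorial happened to use (very often us-east-1).
s3_west = boto3.client("s3", region_name="us-west-2")
print("explicit region:", s3_west.meta.region_name)  # -> us-west-2

# Anti-pattern for comparison: boto3.client("s3") with no region_name will
# silently pick up AWS_DEFAULT_REGION or the shared config file, and many
# environments and tutorials set that to us-east-1 without anyone noticing.
```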
The architecture choices that actually matter
Let me be direct: I don’t care if you use AWS, Google Cloud, Azure, or carrier pigeons to run your infrastructure. What matters is that you understand blast radius and have a strategy that matches your risk tolerance.
Here’s what I tell every CTO I work with: yes, some AWS services are us-east-1 only or us-east-1 dependent. But there are three architecture choices that dramatically reduce your risk without requiring you to rebuild your entire infrastructure.
Choice 1: Pick a different primary region
The simplest and most effective choice: stop using us-east-1 as your primary region.
Choose us-west-2 (Oregon), us-west-1 (California), or if you’re serving European customers, eu-west-1 (Ireland). These regions are mature, have nearly all the same services as us-east-1, and don’t carry the same legacy baggage and dependency hell.
When us-east-1 goes down (and it will again), your primary infrastructure keeps running. You might have some degraded functionality if you depend on global services with us-east-1 components, but your core application stays up.
I helped a SaaS company make this switch two years ago. They moved their primary region from us-east-1 to us-west-2 over a three-month period. It wasn’t trivial, but it wasn’t impossibly hard either. They did it gradually, moving one service at a time, validating everything worked, then moving the next service.
Yesterday, while their competitors running in us-east-1 were completely down, they had one minor issue with their CloudFront distribution that caused slightly slower page loads for about an hour. Everything else worked perfectly. Their customers barely noticed while their competitors were posting status page updates about complete outages.
The blast radius difference is dramatic.
Choice 2: Replicate your critical data to a second region
Here’s where most startups fail: they think disaster recovery means “we can rebuild from backups if everything burns down.” That’s not disaster recovery. That’s disaster archeology.
Real disaster recovery means you can actually serve customers from a different region when your primary region becomes unavailable.
The minimum viable disaster recovery architecture:
Primary region (us-west-2):
- All your application infrastructure
- Your primary databases
- Your primary cache layers
- Real-time replication sending data to secondary region
Secondary region (us-east-2):
- Scaled-down application infrastructure (or infrastructure-as-code ready to deploy)
- Read replicas of your databases receiving replicated data
- Ability to promote read replicas to primary if needed
You don’t need active-active. You don’t need perfect consistency across regions. You need the ability to fail over to the secondary region and serve customers with slightly stale data while you figure out what went wrong in the primary region.
For most startups, asynchronous replication with 5-30 seconds of lag is perfectly acceptable. Yes, if your primary region explodes, you might lose 30 seconds of data. But you’ll be serving customers again in minutes instead of hours.
Choice 3: Map and eliminate your hidden dependencies
This is the work nobody wants to do because it’s tedious and doesn’t ship features. But it’s essential.
Sit down and map every service you use. For each service, answer these questions (a rough inventory script follows the list):
- Which AWS region does it run in?
- Does it have dependencies on other regions?
- If this region becomes unavailable, what breaks?
- How long can we operate without this service?
- What’s our recovery process if this service is unavailable?
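For the region questions above, a rough inventory script is a good starting point. This is a minimal boto3 sketch, assuming your credentials can list resources; it covers only S3 and Lambda as examples, so extend it with the services you actually use.

```python
import boto3

# Rough inventory sketch: find resources sitting in regions you didn't
# expect, especially us-east-1. Extend the per-region section with the
# services you actually depend on (RDS, ElastiCache, SES, ...).

# S3 buckets are listed globally, but each bucket lives in one region.
s3 = boto3.client("s3", region_name="us-west-2")
for bucket in s3.list_buckets()["Buckets"]:
    location = s3.get_bucket_location(Bucket=bucket["Name"])["LocationConstraint"]
    # A None LocationConstraint means the bucket is in us-east-1.
    print(f"S3 bucket {bucket['Name']}: {location or 'us-east-1'}")

# Walk every enabled region and list Lambda functions as an example;
# accidental deployments tend to show up here.
ec2 = boto3.client("ec2", region_name="us-west-2")
for region in ec2.describe_regions()["Regions"]:
    name = region["RegionName"]
    functions = boto3.client("lambda", region_name=name).list_functions()["Functions"]
    if functions:
        print(f"{name}: {len(functions)} Lambda function(s)")
```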
I ran this exercise with an e-commerce startup last year. It took two full days with their entire engineering team. They identified 47 services and dependencies they didn’t know existed.
Some examples that surprised them:
- Their session management used ElastiCache in us-east-1 because that’s where the original developer had set it up three years ago
- Their email service used SES configured in us-east-1 even though their application ran in us-west-2
- Their mobile app had analytics that sent data to S3 buckets in us-east-1
- Their Terraform state files were stored in us-east-1 S3 buckets, meaning they couldn’t deploy infrastructure changes if us-east-1 was down
They spent the next month systematically eliminating or relocating these dependencies. When us-east-1 had an issue six months later, they had zero impact while their competitors scrambled.
The mapping exercise feels like a waste of time until the outage happens. Then it’s the difference between 5 hours of downtime and 5 minutes of degraded performance.
The disaster recovery ladder (and where you probably are)
Let me give you a framework for thinking about disaster recovery maturity. Most startups are on the bottom rungs and need to climb up.
Level 0: Hope and prayer
What it looks like: Everything in one region, no backups or minimal backups, no tested recovery procedures.
What happens during regional outage: Complete business failure. You’re offline until AWS fixes the problem. You have zero control over when you come back online.
Who should be here: Literally nobody. Even side projects should have backups.
Why startups stay here: “We’re moving fast and will fix it later.” Later never comes because it’s not urgent until it’s catastrophic.
Level 1: We have backups
What it looks like: Regular database backups to S3, maybe stored in a different region, some documentation about how to restore.
What happens during regional outage: Still completely offline, but you can theoretically rebuild in a different region from backups. Recovery time measured in hours or days depending on data size and team capability.
Who should be here: Very early stage startups (pre-revenue, no customers yet) can defensibly be here temporarily.
Why startups stay here: It feels like you’ve solved the problem because you have backups. You haven’t. You’ve only solved data loss, not service availability.
Level 2: Multi-AZ with backups
What it looks like: Infrastructure spread across multiple availability zones in one region, automatic failover within the region, regular backups to different region.
What happens during regional outage: Complete outage. Multi-AZ protects against datacenter failures but does nothing for regional failures like yesterday’s AWS outage.
Who should be here: Startups with some revenue and customers, working toward multi-region.
Why startups stay here: AWS documentation heavily promotes multi-AZ as “high availability” and many teams think this is good enough. It’s not.
Level 3: Multi-region with manual failover
What it looks like: Primary region with full infrastructure, secondary region with data replication and either scaled-down infrastructure or infrastructure-as-code ready to deploy quickly.
What happens during regional outage: Service interruption while you execute manual failover procedures, potentially 15-60 minutes of downtime depending on preparation and team capability. Once failed over, service runs from secondary region.
Who should be here: Most established startups and any company with paying enterprise customers.
Why startups stay here: This is the sweet spot for cost versus resilience. Going beyond this gets expensive and complex quickly.
Level 4: Multi-region with automatic failover
What it looks like: Application and data running in multiple regions simultaneously, automatic health checks and traffic routing, seamless failover when one region has problems.
What happens during regional outage: Possible brief degradation while traffic reroutes, but service stays available. Customers might not even notice.
Who should be here: Companies with serious uptime requirements, financial services, healthcare applications, companies with aggressive SLAs.
Why teams resist this level: Significantly more complex and expensive. Data consistency across regions becomes harder. Not all AWS services support this model easily.
Level 5: Active-active multi-region
What it looks like: Application actively serving traffic from multiple regions simultaneously, data synchronized across regions in near-real-time, can lose entire regions without customer impact.
What happens during regional outage: Seamless operation. Customers experience no disruption. Traffic automatically rebalances to healthy regions.
Who should be here: Large enterprises, global applications, companies where downtime has extreme business impact.
Why few companies reach this level: Very expensive, very complex, requires significant engineering investment. Only justified when downtime costs exceed the implementation and operational costs.
Most startups should target Level 3 within their first year of having paying customers. It’s achievable without massive engineering effort, provides meaningful protection, and dramatically reduces blast radius when problems happen.
If you have enterprise customers with SLAs, you need Level 3 minimum. If your SLAs promise 99.9% uptime or better, you probably need Level 4.
How to actually build multi-region DR (the practical version)
Let me walk through how to implement Level 3 disaster recovery without rebuilding your entire infrastructure. This is what I’ve done with multiple companies, and it works without requiring a six-month engineering project.
Phase 1: Choose your regions (Week 1)
Pick your primary and secondary regions based on three factors:
Geographic distribution: Choose regions far apart geographically. If your primary is us-west-2, your secondary should be us-east-2, not us-west-1. You want to avoid correlated failures like regional power grid problems or natural disasters.
Service availability: Verify that both regions support the AWS services you actually use. Most services are available in all major regions now, but check to avoid surprises.
Latency requirements: If you’re serving customers globally, consider whether you need a presence in multiple geographic areas anyway. A US company with European customers might want us-west-2 and eu-west-1.
Phase 2: Implement data replication (Weeks 2-4)
This is the foundation. You need your data synchronized to the secondary region before anything else matters.
For RDS (relational databases):
- Enable cross-region read replicas (a minimal sketch follows this list)
- Configure automatic backups to both regions
- Test promoting a read replica to primary (do this in a test environment first)
- Document the exact steps to promote and update application configuration
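For the RDS bullets above, here’s a minimal boto3 sketch of creating a cross-region read replica and, during an actual failover, promoting it. All identifiers, ARNs, and the instance class are placeholders, and encrypted sources additionally need a KMS key in the destination region.

```python
import boto3

# Sketch: create a cross-region read replica in the secondary region,
# then (during a real failover) promote it to a standalone primary.
# All identifiers and ARNs below are placeholders.
rds_secondary = boto3.client("rds", region_name="us-east-2")

rds_secondary.create_db_instance_read_replica(
    DBInstanceIdentifier="app-db-replica-use2",
    SourceDBInstanceIdentifier="arn:aws:rds:us-west-2:123456789012:db:app-db-primary",
    SourceRegion="us-west-2",          # lets boto3 handle the cross-region presigned URL
    DBInstanceClass="db.r6g.large",    # can be smaller than the primary for a warm standby
    # KmsKeyId="arn:aws:kms:us-east-2:123456789012:key/...",  # required if the source is encrypted
)

# Failover step: promotion is one call, but updating application config to
# point at the promoted endpoint is the part your runbook has to cover.
rds_secondary.promote_read_replica(DBInstanceIdentifier="app-db-replica-use2")
```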
For DynamoDB:
- Enable DynamoDB Global Tables for critical tables (sketched after this list)
- Understand the consistency model (eventual consistency across regions)
- Test reading from and writing to both regions
- Document failover procedures
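A minimal sketch of adding a replica region to an existing DynamoDB table, assuming the current Global Tables version (2019.11.21) that is managed through ReplicaUpdates; the table and region names are placeholders.

```python
import boto3

# Sketch: add a replica in the secondary region to an existing table,
# turning it into a Global Table (version 2019.11.21 and later).
dynamodb = boto3.client("dynamodb", region_name="us-west-2")

dynamodb.update_table(
    TableName="orders",  # placeholder table name
    ReplicaUpdates=[
        {"Create": {"RegionName": "us-east-2"}},
    ],
)

# Reads in the secondary region are eventually consistent with writes made
# in the primary region; design your failover expectations around that.
```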
For S3:
- Enable cross-region replication for critical buckets (sketched after this list)
- Consider versioning to protect against accidental deletions
- Test accessing data from both regions
- Document any region-specific bucket policies or access controls
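A minimal sketch of the S3 replication setup. The bucket names, account ID, and IAM role are placeholders; versioning must be enabled on both buckets, and the role needs the standard S3 replication permissions.

```python
import boto3

s3 = boto3.client("s3", region_name="us-west-2")

# Replication requires versioning on both the source and destination buckets.
s3.put_bucket_versioning(
    Bucket="app-critical-data",  # placeholder source bucket
    VersioningConfiguration={"Status": "Enabled"},
)

# Replicate everything in the source bucket to the secondary-region bucket.
s3.put_bucket_replication(
    Bucket="app-critical-data",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/s3-replication-role",  # placeholder role
        "Rules": [
            {
                "ID": "replicate-all-to-secondary",
                "Status": "Enabled",
                "Priority": 1,
                "Filter": {},  # empty filter = whole bucket
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": "arn:aws:s3:::app-critical-data-use2"},
            }
        ],
    },
)
```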
For other data stores:
- Document the replication strategy for each service you use
- Test replication lag and understand what data loss window you’re accepting
- Verify replication is actually working (I’ve seen multiple cases where teams thought they had replication configured but it was broken)
Phase 3: Prepare secondary infrastructure (Weeks 4-6)
You have two options here depending on your risk tolerance and budget.
Option A: Warm standby (recommended for most startups):
- Deploy minimal infrastructure in secondary region (smaller instance sizes, fewer replicas)
- Keep it running 24/7 but not serving production traffic
- Regularly test that it can handle your production load if scaled up
- Cost is typically 20-40% of primary region costs
- Failover time: 10-30 minutes to scale up and switch traffic
Option B: Cold standby:
- Maintain infrastructure-as-code for secondary region but don’t run it constantly
- Regularly test deploying the infrastructure from scratch
- Keep monitoring and alerting infrastructure running even if application infrastructure isn’t
- Cost is minimal (just data replication and monitoring)
- Failover time: 30-90 minutes to deploy infrastructure and switch traffic
I generally recommend Option A for any company with revenue. The extra cost is minimal compared to the reduction in recovery time, and you can test your failover procedures much more easily.
Phase 4: Build failover procedures (Weeks 6-8)
Write detailed runbooks for executing failover. These need to be clear enough that any senior engineer can execute them under pressure at 3 AM.
Your runbook should cover:
- How to verify that failover is actually necessary (decision criteria)
- How to communicate the outage and failover plan to stakeholders
- Step-by-step technical procedure for failing over each service
- How to update DNS to point to secondary region (the exact call is sketched after this list)
- How to verify that failover was successful
- How to fail back to primary region once it’s healthy
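The DNS update is the step teams most often fumble at 3 AM, so it’s worth having the exact call in the runbook. A minimal boto3 sketch, where the hosted zone ID, record name, and secondary load balancer hostname are placeholders:

```python
import boto3

route53 = boto3.client("route53")

# Sketch: repoint the application record at the secondary region's load
# balancer. A low TTL set ahead of time is what makes this take effect quickly.
route53.change_resource_record_sets(
    HostedZoneId="Z0123456789ABCDEFGHIJ",  # placeholder hosted zone
    ChangeBatch={
        "Comment": "Fail over app.example.com to the secondary region",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "app.example.com.",
                    "Type": "CNAME",
                    "TTL": 60,
                    "ResourceRecords": [
                        {"Value": "secondary-alb-1234567890.us-east-2.elb.amazonaws.com"}
                    ],
                },
            }
        ],
    },
)
```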
Test this runbook by actually executing it. I recommend quarterly DR drills where you deliberately fail over to your secondary region during a planned maintenance window. You’ll discover all the things you forgot to document.
Phase 5: Implement monitoring and alerting (Weeks 8-10)
You need monitoring that can alert you to problems in your primary region even when that region is having issues. This means:
Multi-region health checks:
- Use Route 53 health checks or a third-party service like Pingdom (a minimal health check is sketched after this list)
- Monitor from multiple geographic locations
- Alert when your primary region becomes unreachable
- Separate alerts for degraded performance versus complete failure
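A minimal sketch of the Route 53 health check mentioned above, assuming your application exposes a public health endpoint; the domain, path, and thresholds are placeholders. You would typically attach the resulting health check to a failover DNS record or a CloudWatch alarm.

```python
import boto3
import uuid

route53 = boto3.client("route53")

# Sketch: an HTTPS health check against the primary region's /health endpoint,
# evaluated from Route 53's own distributed checkers (i.e., not from inside
# the region you're trying to monitor).
health_check = route53.create_health_check(
    CallerReference=str(uuid.uuid4()),  # must be unique per request
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": "app.example.com",  # placeholder
        "ResourcePath": "/health",
        "Port": 443,
        "RequestInterval": 30,   # seconds between checks
        "FailureThreshold": 3,   # consecutive failures before marking unhealthy
    },
)
print("health check id:", health_check["HealthCheck"]["Id"])
```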
Cross-region monitoring infrastructure:
- Don’t run all your monitoring in the same region you’re monitoring
- Consider running monitoring infrastructure in a third region or using a monitoring SaaS
- Ensure you’ll receive alerts even if your primary region is completely down
Dependency monitoring:
- Monitor the health of AWS services you depend on using the AWS Health Dashboard API
- Set up alerts for us-east-1 service issues even if you don’t run there (because you probably have hidden dependencies)
- Track replication lag between regions and alert if it exceeds acceptable thresholds (an alarm sketch follows this list)
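For the replication-lag bullet, a minimal CloudWatch alarm sketch. The instance identifier, SNS topic, and 30-second threshold are placeholders; pick a threshold that matches the data-loss window you decided you can accept.

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-2")

# Sketch: alert when the cross-region read replica falls more than 30 seconds
# behind the primary for five consecutive minutes.
cloudwatch.put_metric_alarm(
    AlarmName="dr-replica-lag-high",
    Namespace="AWS/RDS",
    MetricName="ReplicaLag",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "app-db-replica-use2"}],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=5,
    Threshold=30.0,  # seconds of acceptable lag; tune to your RPO
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="breaching",  # no lag data at all probably means replication is broken
    AlarmActions=["arn:aws:sns:us-east-2:123456789012:dr-alerts"],  # placeholder SNS topic
)
```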
Phase 6: Test everything (Ongoing)
Here’s the hard truth: your disaster recovery plan doesn’t work unless you’ve actually tested it. I’ve reviewed DR plans for dozens of companies over the years. The ones that have never tested their procedures discover during actual disasters that their plans are 30-50% wrong.
Monthly tests:
- Verify data replication is working
- Check replication lag and data consistency
- Confirm backup restoration works
- Test accessing resources in secondary region
Quarterly tests:
- Full failover drill during maintenance window
- Execute complete runbook from start to finish
- Measure actual recovery time
- Document what didn’t work as expected
- Update runbooks based on lessons learned
Annual tests:
- Disaster scenario testing with entire company
- Practice communication procedures with stakeholders
- Test failover with real production traffic (if possible)
- Review and update disaster recovery strategy
Testing is the difference between having a disaster recovery plan and having a disaster recovery plan that actually works.
The real cost of not having DR (and it’s not what you think)
Most founders think about disaster recovery in terms of the cost of an outage. Lost revenue during downtime, angry customers, support ticket volume. That’s real, but it’s not the biggest cost.
The biggest cost is what happens after the outage.
The enterprise customer aftermath
I watched a company lose their largest customer after a 6-hour outage caused by a regional AWS failure. The customer’s contract was worth $400,000 annually. The direct revenue loss during the outage was maybe $5,000.
But after the outage, the customer demanded:
- Complete infrastructure review and audit (80 hours of engineering time)
- Contractual SLA with financial penalties (reduced margins by 15% on that contract)
- Monthly infrastructure status reports (ongoing overhead)
- Approval rights on major infrastructure changes (massive operational drag)
- Quarterly disaster recovery tests with their team present (significant coordination overhead)
The company spent six months under this microscope before the customer finally left anyway, deciding they needed a vendor with “more mature infrastructure practices.”
The outage cost $5,000 in lost revenue. The aftermath cost a $400,000 customer and probably 500+ hours of engineering time that should have been spent building product.
If they’d had a Level 3 disaster recovery setup, the outage would have been 20 minutes instead of 6 hours. The customer would have been annoyed but not panicked. No aftermath, no microscope, no contract loss.
The DR infrastructure they needed would have cost maybe $3,000-5,000 per month. They were penny-wise and pound-foolish.
The fundraising impact
Another company I worked with had a complete infrastructure failure three weeks before their Series A was supposed to close. The failure itself was unrelated to AWS: a bad database migration corrupted their primary database, and their backups were incomplete.
They recovered, but it took 18 hours of complete downtime plus another week of degraded service while they rebuilt data from application logs and third-party sources.
Their lead investor didn’t pull out, but they did renegotiate terms. The valuation dropped 15% because of “infrastructure and operational risk.” The company ended up raising $1.2M less than they’d expected at a lower valuation.
That bad database migration cost them over a million dollars in extra dilution and a smaller raise. Proper DR infrastructure with tested recovery procedures would have cost maybe $30,000 to implement.
The talent cost
Here’s one nobody talks about: the impact of disasters on your engineering team.
I’ve seen multiple cases where companies lost senior engineers after major outages. Not immediately, but within 3-6 months. The pattern is always the same:
The outage happens. The team scrambles. They work around the clock. They get it fixed. Management promises “we’ll invest in infrastructure and reliability.” Everyone’s exhausted but they believe it will get better.
Then nothing changes. The promises turn into “after we ship this next feature” or “when we have more engineering resources.” The team realizes leadership doesn’t actually prioritize reliability.
The senior engineers, who have options and know what good engineering organizations look like, start looking for new jobs. You lose your best people because you demonstrated that you don’t care about the operational foundation they have to support.
I worked with a company that lost 3 of their 7 senior engineers in the six months following a major outage where leadership refused to invest in proper disaster recovery. The cost of replacing those engineers (recruiting, hiring, onboarding, lost productivity) was easily $500,000+. The cost of the DR infrastructure they needed was under $100,000 annually.
You can’t treat your infrastructure like garbage and expect talented engineers to stick around.
What to do right now (even if you can’t do everything)
If you’re reading this and realizing your disaster recovery plan is “hope AWS doesn’t have problems,” here’s what to do immediately.
This week
Map your dependencies:
- Which AWS region is your primary infrastructure in?
- What services do you use and which regions do they run in?
- Which services have hidden dependencies on us-east-1?
- What breaks if your primary region becomes unavailable?
Check your backups:
- Do you have database backups?
- Are they stored in a different region than your primary infrastructure?
- Have you tested restoring from these backups?
- How long would it take to rebuild from backups?
Document current state:
- Write down your current disaster recovery capability honestly
- Document your recovery time objective (how long can you be down?)
- Document your recovery point objective (how much data can you afford to lose?)
- Share this with your leadership team so everyone understands the risk
This month
If you’re in us-east-1:
- Make a plan to migrate to us-west-2 or another region
- Start with new services launching in the new region
- Create a timeline for migrating existing critical services
- Don’t try to migrate everything at once, do it systematically
If you’re not in us-east-1 but have no DR:
- Choose a secondary region
- Enable cross-region replication for your primary database
- Set up cross-region backups for all critical data
- Test restoring from these backups in the secondary region
For everyone:
- Write a basic runbook for what to do during a regional AWS outage
- Set up monitoring that can alert you to problems even if your primary region is down
- Schedule a quarterly calendar reminder to test your DR procedures
- Review your customer SLAs and verify you can actually meet them if your primary region fails
This quarter
Implement Level 3 DR:
- Follow the phased approach I outlined earlier
- Deploy warm standby infrastructure in secondary region
- Set up automated data replication
- Create and test failover procedures
Update your communication:
- Add infrastructure reliability and disaster recovery to your engineering roadmap
- Communicate your DR capabilities to enterprise customers and prospects
- Include uptime and DR capabilities in your security questionnaires and vendor assessments
- Be honest about your current capabilities and your timeline for improvements
Build the culture:
- Make reliability and operational excellence a visible engineering priority
- Celebrate teams that improve reliability and reduce blast radius
- Include DR testing in your engineering rituals (quarterly drill, anyone?)
- Stop treating infrastructure work as something to do “later”
A final thought on hope versus strategy
Yesterday’s AWS outage proved something I’ve been saying for 20 years: hope is not a strategy.
Hoping AWS doesn’t have outages isn’t a strategy. Hoping your single region stays healthy isn’t a strategy. Hoping you’ll implement disaster recovery “when you’re bigger” isn’t a strategy.
Every minute your business runs without proper disaster recovery, you’re making a bet. You’re betting that the cost of building DR exceeds the expected cost of an outage: the cost of the outage multiplied by the probability it happens.
For most startups, that’s a bad bet. The cost of implementing Level 3 DR is measured in thousands of dollars per month and a few weeks of engineering time. The cost of a major outage at the wrong moment can be a lost enterprise customer, a damaged fundraise, or even business failure.
I’ve helped companies survive catastrophic infrastructure failures. I’ve also watched companies die because an outage happened at exactly the wrong time during a critical enterprise proof of concept or right before a fundraise closed.
The companies that survive aren’t the ones that never have problems. They’re the ones that built systems to handle problems gracefully before those problems occurred.
Yesterday, us-east-1 had a bad day. Some companies barely noticed because they’d built for resilience. Other companies watched their businesses burn for 5+ hours with zero control over when they’d come back online.
Which company do you want to be next time AWS has problems?
Because there will be a next time.
I’ve spent 15+ years helping companies build resilient infrastructure and survive the disasters that inevitably happen. If you’re running in us-east-1 without proper disaster recovery, or if you need help building multi-region architecture that actually works under pressure, let’s talk about how to protect your business before the next AWS outage proves your current approach isn’t good enough.