Following the massive service outage that occurred on October 20 in its US-EAST-1 (Northern Virginia) region, AWS has formally released the findings of its internal investigation. As initially suspected, the root cause was traced back to its core database service, DynamoDB—specifically, a design flaw in the service’s DNS automation module, which triggered a catastrophic chain reaction.
The incident had an extensive impact, disrupting 142 AWS services and thousands of customers, and required 15 hours for full recovery.
According to AWS, the DNS operations for DynamoDB are managed by two automated modules: the DNS Planner, responsible for generating new DNS plans, and the DNS Enactor, which deploys those plans to Amazon Route 53. To ensure high availability, AWS maintained three independent Enactor instances, each operating in a separate Availability Zone (AZ).
Under normal conditions, the Enactor validates plan versions before deployment, updates endpoints sequentially, retries on conflict, and finally cleans up outdated plans once the process is complete.
However, the failure sequence unfolded as follows:
- Enactor A began deploying a plan but experienced severe delays while updating multiple DNS endpoints, repeatedly retrying the operation. Meanwhile, the DNS Planner continued to generate new versions of the plan.
- Enactor B, operating independently, obtained the latest plan and rapidly completed all endpoint updates, subsequently initiating its cleanup process.
- The critical conflict occurred when the lagging Enactor A attempted to deploy its outdated plan to the same US-EAST-1 service nodes that Enactor B had just updated.
- In turn, Enactor B’s cleanup routine mistakenly identified Enactor A’s ongoing deployment as obsolete and deleted it.
The result was devastating: all IP addresses associated with US-EAST-1 service nodes were erased, leaving DNS records completely empty and unresolved, effectively halting the ability to deploy any new plans.
Although AWS managed to resolve the core DNS malfunction within roughly three hours, the cascading effects persisted for over a dozen hours.
One of the main reasons was the heavy interdependence of AWS’s core services on DynamoDB. For instance, the Droplet Workflow Manager (DWFM)—a tool responsible for managing the state of EC2 instances—experienced widespread lease expirations during the DynamoDB outage. When DNS service was restored, DWFM attempted to re-establish hundreds of thousands of leases simultaneously, overwhelming the system and causing severe congestion and crashes.
The DWFM failure in turn prevented new EC2 instances from launching properly, introduced delays in network configurations, and disrupted downstream services such as Network Load Balancers (NLB) and AWS Lambda, both of which rely on EC2. This chain of dependencies significantly prolonged the overall recovery time.
The incident underscored a crucial vulnerability of large-scale cloud infrastructures: while automation enhances efficiency, the intricate web of dependencies and potential race conditions can precipitate system-wide failures. Ironically, the very redundancy and automation mechanisms designed to bolster reliability became sources of conflict under extreme circumstances.
As a corrective measure, AWS announced the temporary suspension of all DynamoDB DNS Planner and DNS Enactor automation modules globally, pending comprehensive safety audits, race-condition fixes, and the implementation of more robust control mechanisms.
Given that US-EAST-1 is AWS’s oldest, largest, and most critical region—hosting numerous global control planes and management backends—its stability is paramount. And since DynamoDB underpins both internal Amazon systems (such as Amazon.com and Alexa) and major external clients (including Netflix), even a brief DNS disruption was sufficient to trigger such a far-reaching cascade of failures, serving as a stark reminder of the inherent risks in overreliance on critical infrastructure.
Related Posts:
- The Great Blackout: AWS Outage Cripples Half the Internet
- Double Trouble: DDoS and Internal Errors Cause Major Microsoft Azure Outage
- The Day the Internet Broke: Unpacking the June 12th Global Cloud Meltdown
- New Phishing Campaign Targets AWS Accounts: Security Experts Warn
Support Our Threat Intelligence
If you find our CVE report and cybersecurity news helpful, consider supporting our work.