OpenAI Services Hit by Major Outage Due to Telemetry Service Deployment

OpenAI experienced a significant service disruption on December 11, 2024, impacting all its services, including ChatGPT, the API, and Sora. The outage, lasting over four hours, was caused by a new telemetry service deployment that inadvertently overwhelmed the Kubernetes control plane, according to a detailed post-mortem report released by OpenAI.

The incident began at 3:16 PM PST, with services experiencing “significant degradation or complete unavailability,” as stated in the report. The disruption was not attributed to a security breach or a recent product launch, but rather to an internal change aimed at improving system reliability.

The report explains that OpenAI relies on hundreds of Kubernetes clusters globally. A key component of these clusters is the control plane, responsible for cluster administration. The new telemetry service, intended to enhance system observability, triggered a cascade of failures. “This new service’s configuration unintentionally caused every node in each cluster to execute resource-intensive Kubernetes API operations whose cost scaled with the size of the cluster,” the report reveals. With thousands of nodes simultaneously performing these operations, the Kubernetes API servers were overwhelmed, crippling the control plane in most of OpenAI’s large clusters.
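
The report does not name the specific API operations involved, but the failure mode it describes is a familiar one: a per-node agent issuing unscoped, cluster-wide requests whose cost grows with the size of the cluster. The Python sketch below is a hypothetical illustration of that pattern, not OpenAI's actual telemetry service; it uses the Kubernetes Python client and a full pod LIST as a stand-in for the resource-intensive operations the post-mortem mentions.

```python
# Hypothetical sketch of the failure mode the report describes: a telemetry
# agent running on every node issues a cluster-wide LIST against the
# Kubernetes API server. The real API operations are not named in the
# post-mortem; list_pod_for_all_namespaces is used purely as an example of a
# request whose response (and server-side cost) grows with cluster size.
from kubernetes import client, config


def collect_cluster_metadata():
    config.load_incluster_config()  # the agent runs as a pod on each node
    core = client.CoreV1Api()

    # An unfiltered LIST returns every pod in the cluster. One such call is
    # cheap; thousands of nodes making it at once multiplies the API server's
    # work by the node count, so total load scales roughly with
    # (number of nodes) x (objects in the cluster).
    pods = core.list_pod_for_all_namespaces(watch=False)
    return [(p.metadata.namespace, p.metadata.name) for p in pods.items]
```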

The impact was felt across OpenAI’s product suite: ChatGPT saw substantial recovery at 5:45 PM PST, with full recovery at 7:01 PM PST. The API experienced substantial recovery at 5:36 PM PST, achieving full recovery across all models at 7:38 PM PST. Sora also fully recovered at 7:01 PM PST.

A critical factor in the incident was the testing and deployment process. The change was tested in a staging cluster, but that testing failed to reveal the issue. The report acknowledges that “the impact was specific to clusters exceeding a certain size, and our DNS cache on each node delayed visible failures long enough for the rollout to continue fleet-wide.” Furthermore, pre-deployment testing focused primarily on the resource consumption of the telemetry service itself, with “no precautions…taken to assess Kubernetes API server load.”
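
A rough back-of-envelope estimate (ours, not the report's) shows why a small staging cluster would not surface this class of problem: if every node lists every object in the cluster, aggregate API server work grows roughly with the square of the node count, so the load at production scale dwarfs anything a modest test cluster generates.

```python
# Back-of-envelope illustration (not from the report): if every node performs
# a LIST whose cost is proportional to the number of objects in the cluster,
# aggregate API server work grows roughly quadratically with cluster size.
def relative_api_load(nodes: int, pods_per_node: int = 30) -> int:
    """Each node lists all pods, so work ~ nodes * (nodes * pods_per_node)."""
    total_pods = nodes * pods_per_node
    return nodes * total_pods


staging = relative_api_load(nodes=50)       # small test cluster
production = relative_api_load(nodes=3000)  # large production cluster
print(f"Production load is roughly {production // staging}x the staging load.")
# -> Production load is roughly 3600x the staging load.
```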

The report also highlights the role of DNS caching in exacerbating the problem. “DNS caching mitigated the impact temporarily by providing stale but functional DNS records. However, as cached records expired over the following 20 minutes, services began failing due to their reliance on real-time DNS resolution,” the report explains. This delay masked the true extent of the problem, allowing the rollout to proceed further than it should have.
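
The dynamic is easy to reproduce in miniature. The sketch below is a simplified model, not OpenAI's DNS setup: a TTL-based cache keeps answering from stale records after the upstream resolver has stopped working, so lookups only start failing once the cached entries age out.

```python
import time


class DnsCache:
    """Minimal model of a node-local DNS cache with a fixed TTL."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.records: dict[str, tuple[str, float]] = {}

    def resolve(self, name: str, upstream_ok: bool) -> str:
        entry = self.records.get(name)
        if entry and time.monotonic() - entry[1] < self.ttl:
            return entry[0]  # stale but still functional record
        if not upstream_ok:
            raise RuntimeError(f"DNS resolution failed for {name}")
        address = "10.0.0.1"  # placeholder answer from the upstream resolver
        self.records[name] = (address, time.monotonic())
        return address


cache = DnsCache(ttl_seconds=1200)                # ~20 minutes
cache.resolve("api.internal", upstream_ok=True)   # cached before the outage
cache.resolve("api.internal", upstream_ok=False)  # still succeeds from cache
# Once the record is older than the TTL, the same call raises instead.
```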

Remediation efforts were hampered by the very problem engineers were trying to fix. “In this case, our detection tools worked and we detected the issue a few minutes before customers started seeing impact. But the fix for this issue required us to remove the offending service. In order to make that fix, we needed to access the Kubernetes control plane – which we could not do due to the increased load to the Kubernetes API servers.” Engineers ultimately pursued a multi-pronged approach, including scaling down cluster size, blocking network access to Kubernetes admin APIs, and scaling up Kubernetes API servers.
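
That circular dependency is worth spelling out. In the hypothetical sketch below, the “fix” of deleting the offending workload (the DaemonSet name and namespace are assumptions made for illustration) is itself a request to the Kubernetes API server, so it times out or errors while the control plane is saturated.

```python
# Hypothetical illustration of the circular dependency described in the
# report: removing the offending telemetry workload requires the Kubernetes
# API server, the very component that was overloaded. The DaemonSet name
# "telemetry-agent" and namespace "monitoring" are assumed for this sketch.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

try:
    # With the control plane saturated, even this small request can time out.
    apps.delete_namespaced_daemon_set(
        name="telemetry-agent",
        namespace="monitoring",
        _request_timeout=10,
    )
except Exception as err:  # API error or timeout once the servers are overloaded
    # Hence the multi-pronged recovery: shrink clusters, block admin API
    # traffic at the network layer, and scale the API servers back up.
    print(f"Could not reach the control plane: {err}")
```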

OpenAI has outlined several preventative measures to avoid similar incidents in the future. The report concludes with an apology: “We apologize for the impact that this incident caused to all of our customers…We’ve fallen short of our own expectations.” The company emphasizes its commitment to providing reliable service and promises to prioritize the outlined preventative measures.
