OpenAI suffered one of its longest outages in recent history on Wednesday, caused by problems with a newly deployed telemetry service. The disruption affected several platforms, including ChatGPT, the video generator Sora, and OpenAI’s developer API. The outage began around 3 p.m. Pacific time and lasted three hours. OpenAI acknowledged the problem shortly after it started.
In a postmortem published Thursday, OpenAI explained the cause of the disruption. It was not a security breach or a recent product launch. The issue came from a new telemetry service meant to collect metrics from Kubernetes, an open-source system that manages containers, which run software in isolated environments. The new service’s configuration unintentionally triggered resource-heavy Kubernetes API operations. The strain crashed the Kubernetes control plane in some of OpenAI’s large clusters, which in turn affected services that depend on it, including DNS resolution.
DNS resolution, which converts domain names like “google.com” into IP addresses, was one of the key systems affected. OpenAI’s use of DNS caching, which holds the results of recent lookups, complicated the issue. Because cached records kept working for a while, the caching delayed the company’s ability to identify the problem’s full scope. OpenAI detected the issue minutes before it affected customers, but the overwhelmed Kubernetes servers prevented a quick fix.
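How a cache can hide a resolver failure can be shown with a small Python sketch. This is a toy model, not OpenAI’s setup: the `CachingResolver` class, the `upstream` callable, and the `ttl_seconds` parameter are all hypothetical names chosen for illustration.

```python
import time


class CachingResolver:
    """Toy DNS cache: serves cached answers until their TTL expires."""

    def __init__(self, upstream, ttl_seconds, clock=time.monotonic):
        self.upstream = upstream        # hypothetical callable: hostname -> IP string
        self.ttl_seconds = ttl_seconds  # how long a cached answer stays valid
        self.clock = clock              # injectable so the cache can be tested
        self._cache = {}                # hostname -> (ip, expiry time)

    def resolve(self, hostname):
        now = self.clock()
        entry = self._cache.get(hostname)
        if entry is not None and now < entry[1]:
            # Cache hit: the upstream resolver is not contacted at all,
            # so an upstream outage stays invisible to this caller.
            return entry[0]
        # Cache miss or expired entry: this call raises if upstream is down.
        ip = self.upstream(hostname)
        self._cache[hostname] = (ip, now + self.ttl_seconds)
        return ip
```

In this model, a lookup made while the entry is still fresh succeeds even if the upstream resolver has already failed. Errors only surface once entries expire, which is one way a failure can look smaller at first than it really is.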
The company has since restored services. But the outage shows the risk of deploying new systems before their impact is fully understood, and how complex managing infrastructure can be.
OpenAI acknowledged that the incident was a major infrastructure failure in which multiple systems failed at once. The company’s tests had found no impact from the recent changes on the Kubernetes control plane, so the problem was not caught before deployment. Remediation was then delayed by a “locked-out effect” that restricted engineers’ access to the systems they needed.
In response, OpenAI says it is taking the following steps to keep such problems from happening again:
- Improve phased rollouts.
- Enhance change monitoring.
- Ensure engineers can access Kubernetes API servers in emergencies.
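The first of these measures, a phased rollout, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not OpenAI’s actual process: the `phased_rollout` function and its `deploy`, `is_healthy`, and `rollback` callbacks are hypothetical, and the cluster names in the usage note are invented.

```python
def phased_rollout(clusters, deploy, is_healthy, rollback):
    """Deploy to clusters one at a time and halt at the first unhealthy one.

    clusters   -- ordered list of cluster names (assumed: lowest risk first)
    deploy     -- hypothetical callable that applies the change to a cluster
    is_healthy -- hypothetical callable that checks the cluster after the change
    rollback   -- hypothetical callable that reverts the change on a cluster

    Returns (completed, failed): the clusters that passed, and the cluster
    that failed its health check, or None if every phase succeeded.
    """
    completed = []
    for cluster in clusters:
        deploy(cluster)
        if not is_healthy(cluster):
            # Revert the failing cluster and stop, so later (larger)
            # clusters never receive the change.
            rollback(cluster)
            return completed, cluster
        completed.append(cluster)
    return completed, None
```

Called with a list such as `["staging", "small-prod", "large-prod"]`, the function stops as soon as a health check fails, so a change that only misbehaves in some clusters is caught before it reaches the rest of the fleet.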
OpenAI apologized for the incident, which disrupted ChatGPT users, developers, and businesses that rely on its products. The company admitted it fell short of its own goals and said it is committed to improving its systems to prevent future outages.