Cloudflare’s November 18 Outage: What Went Wrong?

On 18 November 2025, at approximately 11:20 UTC, Cloudflare experienced a major service disruption that caused core traffic across its global network to fail. For millions of users, this appeared as error pages when attempting to load websites protected or powered by Cloudflare.

Although outages of this scale often raise immediate concerns about cyberattacks, Cloudflare confirmed that no malicious activity was involved. Instead, the incident was triggered by an unexpected internal change — a permissions update in one of Cloudflare’s database systems. This seemingly small modification produced a cascade of failures across the network, eventually bringing critical systems to a halt.

This blog breaks down why the outage happened, how it escalated, and the steps Cloudflare is taking to prevent such failures in the future.

What Triggered the Outage?

The root of the problem was a permissions change in a ClickHouse database cluster. The change altered how database metadata was exposed, which unintentionally caused duplicate entries to be included in a “feature file” used by Cloudflare’s Bot Management system.

Why this mattered

The feature file is essential because it feeds Cloudflare’s machine-learning model with data about bot patterns. Every few minutes, this file is refreshed and distributed to Cloudflare’s entire network so it can identify emerging automated threats in real time.

But the permissions update caused:

  • The feature file to double in size, containing many duplicated rows.
  • The oversized file to be propagated across all Cloudflare servers.
  • The systems reading this file to hit a built-in memory constraint and crash.

Because this ML-driven feature file is tightly integrated into Cloudflare’s core proxy, the failure disrupted not only bot detection but traffic routing across the entire platform.

Why the System Kept Failing and Recovering

One of the most confusing aspects of the outage — even for Cloudflare engineers — was the intermittent failure pattern. Network traffic would spike with errors, then stabilize briefly, then spike again.

This happened because:

  • The feature file was regenerated every five minutes.
  • Only some parts of the cluster had the updated permissions.
  • Depending on which node produced the file, either a “good” or “bad” version would spread to the network.

This created a cycle:

  1. Good file → systems recovered

  2. Bad file → systems crashed again

  3. Repeat

Initially, the repeated waves of errors made the issue appear similar to a large-scale DDoS attack. It took time for Cloudflare to isolate the source of the problem and recognize that the failures aligned with how the feature file was being generated.

Once every node began producing the incorrect file, the fluctuation stopped — and the system remained in a failing state.
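The flapping behaviour described above can be sketched as a small simulation. All names, counts, and timings here are illustrative assumptions, not Cloudflare's actual code; the only figure taken from the post-mortem is the feature limit itself.

```python
import random

FEATURE_LIMIT = 200          # hard cap enforced by the consumers (per the post-mortem)
NORMAL_FEATURES = 120        # illustrative size of a "good" feature file
DUPLICATED_FEATURES = 240    # duplicated rows push the file past the limit

def generate_feature_file(node_has_new_permissions: bool) -> int:
    """Each cycle, one node builds the file; a node with the updated
    permissions sees duplicate metadata and emits an oversized file."""
    return DUPLICATED_FEATURES if node_has_new_permissions else NORMAL_FEATURES

def network_state(feature_count: int) -> str:
    # Consumers crash when the file exceeds their preallocated capacity.
    return "failing" if feature_count > FEATURE_LIMIT else "healthy"

# Simulate several 5-minute cycles while the permissions change is only
# partially rolled out: the network flaps between healthy and failing.
random.seed(42)
rollout_fraction = 0.5
for cycle in range(6):
    updated = random.random() < rollout_fraction
    state = network_state(generate_feature_file(updated))
    print(f"cycle {cycle}: {'updated' if updated else 'old'} node -> {state}")
```

Once `rollout_fraction` reaches 1.0, every cycle produces the bad file and the simulated network stays in the failing state, matching the behaviour described above.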

When Cloudflare Identified the Root Cause

By 14:30 UTC, the team had traced the outage back to the oversized feature file. Engineers immediately stopped the distribution of new files and manually injected a previously known stable file into the system.

In addition, they restarted the core proxy (FL), allowing traffic to start flowing again. From that point, Cloudflare spent several hours clearing backlogs, stabilizing internal components, and restarting services that had entered bad states.

By 17:06 UTC, Cloudflare confirmed that all systems were operating normally again.

Why the Failure Was So Severe

To understand why a single oversized file could take down such a large infrastructure, it helps to look at how Cloudflare processes requests.

Every incoming request — website visits, API calls, mobile app requests — flows through:

  1. TLS/HTTP termination

  2. Core proxy system (FL / FL2)

  3. Pingora, which handles caching and contacting origin servers

At the proxy stage, Cloudflare runs multiple security and performance tools — including Bot Management. The bot detection module failed after encountering the enlarged feature file because it exceeded a limit of 200 features, a threshold designed to prevent memory overuse.

When the module crashed:

  • HTTP 5xx errors were returned across the network.
  • Bot scoring failed, giving all traffic a score of zero on older proxies.
  • Workers KV, Access, and the Dashboard — all dependent on the core proxy — experienced errors.

Why Cloudflare’s Status Page Went Down

At the same time this internal failure was unfolding, Cloudflare’s external status page — hosted entirely outside its infrastructure — also went offline. This coincidence made engineers initially suspect an external attack. Later, it was confirmed to be unrelated, but it added confusion during early diagnosis.

How the Database Query Change Led to Duplicated Features

The source of the issue lay deeper in how ClickHouse handles distributed queries.

Before the change

  • Users only saw metadata from the default database.

After the change

  • Permissions were updated to give users explicit access to underlying r0 database tables.
  • Queries that previously returned a single set of metadata now returned duplicates — one set per database.
  • The bot feature file generator interpreted the duplicate metadata as valid new features.

This caused the file to balloon to more than twice its normal size.
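The duplication mechanism can be mimicked with a toy version of a ClickHouse-style `system.columns` table. The table, column, and feature names are invented for illustration; the sketch only shows how a metadata query that does not filter on the database name returns one copy of each row per visible database.

```python
# Illustrative rows from a system.columns-style metadata table.
# After the permissions change, users could see both the "default"
# views and the underlying "r0" tables.
system_columns = [
    {"database": "default", "table": "http_requests", "name": "bot_feature_a"},
    {"database": "default", "table": "http_requests", "name": "bot_feature_b"},
    {"database": "r0",      "table": "http_requests", "name": "bot_feature_a"},
    {"database": "r0",      "table": "http_requests", "name": "bot_feature_b"},
]

def feature_names(rows, visible_databases):
    """Mimic a query like SELECT name FROM system.columns WHERE
    table = 'http_requests' that filters on table but not database:
    every database the user can see contributes its rows."""
    return [r["name"] for r in rows
            if r["database"] in visible_databases
            and r["table"] == "http_requests"]

before = feature_names(system_columns, {"default"})        # 2 features
after = feature_names(system_columns, {"default", "r0"})   # 4 features: duplicated
print(before)
print(after)
```

A generator that trusts this result as a list of distinct features ends up with a file roughly twice its normal size, which is exactly what the outage report describes.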

Why the System Crashed Instead of Handling the Error Gracefully

Cloudflare’s bot system has a hard limit on the number of features to ensure:

  • predictable memory allocation
  • stable performance
  • protection from runaway resources

When the bad file exceeded this limit, the proxy’s Bot Management module triggered a panic, causing the system to fail fast rather than continue in an unsafe state. This unhandled error led to widespread 5xx errors.

Both FL and the newer FL2 proxy system were affected, although FL2 returned explicit 5xx errors while FL continued operating but with incorrect bot scoring.
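A more graceful failure mode would have been to validate the new file and keep serving the last known-good one when validation fails. The sketch below is an assumed defensive pattern, not Cloudflare's actual fix; the feature names and limit handling are illustrative.

```python
FEATURE_LIMIT = 200
last_good_features = ["f1", "f2"]   # previously validated file (illustrative)

class FeatureFileError(Exception):
    pass

def validate(features):
    if len(features) > FEATURE_LIMIT:
        raise FeatureFileError(
            f"{len(features)} features exceeds limit {FEATURE_LIMIT}")
    return features

def load_feature_file(new_features):
    """Fail safe: reject a bad file and fall back to the last good one,
    instead of letting the error propagate and crash the proxy."""
    global last_good_features
    try:
        last_good_features = validate(new_features)
    except FeatureFileError:
        # Keep serving the previous file; alert operators out-of-band.
        pass
    return last_good_features

oversized = [f"f{i}" for i in range(240)]
print(load_feature_file(oversized))  # oversized file -> falls back to last good
```

The trade-off is that a silently rejected file can mask a real problem, so a pattern like this needs loud monitoring alongside the fallback.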

Other Systems Affected During the Outage

Because the core proxy is a foundational service at Cloudflare, several products were incidentally affected, including:

Workers KV

  • Dependent on the proxy pipeline
  • High error rates until a bypass was implemented at 13:04 UTC

Cloudflare Access

  • Also dependent on Workers KV
  • Experienced authentication failures

Cloudflare Dashboard & Turnstile

  • Login systems, which use KV and Turnstile challenges, faced disruptions.
  • Two outage windows were recorded: 11:30–13:10 UTC and 14:40–15:30 UTC.

The second window resulted from backlog traffic overwhelming the login system after fixes were applied.

Why This Outage Was Significant

Cloudflare called this its worst outage since 2019. While the company has had isolated feature failures and dashboard outages in the past, this event interrupted core traffic routing, affecting millions of websites globally.

Cloudflare emphasized that any outage of this scale is unacceptable, given the company’s central role in Internet infrastructure.

How Cloudflare Plans to Prevent This in the Future

In response, Cloudflare committed to several remediation steps focused on improving system resilience and reducing the risk of similar failures.

1. Hardening Config File Ingestion

Making automated config files go through stricter validation — similar to user-generated data.

2. More Global Kill Switches

Allowing engineers to rapidly disable problematic features without waiting for them to propagate.

3. Resource Protection

Preventing error reports, core dumps, or debugging tools from overwhelming system resources during failures.

4. Reviewing Module Failure Modes

Ensuring that individual modules — like Bot Management — can fail safely without crashing the entire proxy system.

Closing Note

Cloudflare issued a formal apology, acknowledging the widespread impact the outage had on the Internet. As a critical part of global web infrastructure, the company emphasized its responsibility to ensure reliability and stated that rebuilding system resilience is already underway.
