On 18 November 2025, at approximately 11:20 UTC, Cloudflare experienced a major service disruption that caused core traffic across its global network to fail. For millions of users, this appeared as error pages when attempting to load websites protected or powered by Cloudflare.
Although outages of this scale often raise immediate concerns about cyberattacks, Cloudflare confirmed that no malicious activity was involved. Instead, the incident was triggered by an unexpected internal change — a permissions update in one of Cloudflare’s database systems. This seemingly small modification produced a cascade of failures across the network, eventually bringing critical systems to a halt.
This blog breaks down why the outage happened, how it escalated, and the steps Cloudflare is taking to prevent such failures in the future.
The root of the problem was a permissions change in a ClickHouse database cluster. The change altered how database metadata was exposed, which unintentionally caused duplicate entries to be included in a “feature file” used by Cloudflare’s Bot Management system.
The feature file is essential because it feeds Cloudflare’s machine-learning model with data about bot patterns. Every few minutes, this file is refreshed and distributed to Cloudflare’s entire network so it can identify emerging automated threats in real time.
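At a high level, that refresh cycle is a periodic build-and-publish loop. The sketch below uses entirely hypothetical names (the real pipeline is internal to Cloudflare); it only illustrates the shape of the process described above:

```python
import json
import time

REFRESH_INTERVAL_SECONDS = 300  # "every few minutes" per the post

def query_feature_metadata():
    """Stand-in for the metadata query that lists the ML model's features."""
    return [{"name": "ua_entropy", "type": "Float64"},
            {"name": "req_rate", "type": "Float64"}]

def build_feature_file(rows):
    """Serialize the feature list into the file shipped to every machine."""
    return json.dumps({"version": int(time.time()), "features": rows})

def publish(feature_file):
    """Stand-in for distribution to the global network."""
    print(f"published {len(feature_file)} bytes")

def refresh_once():
    publish(build_feature_file(query_feature_metadata()))

# In production this would run on a timer, e.g.:
#   while True: refresh_once(); time.sleep(REFRESH_INTERVAL_SECONDS)
refresh_once()
```

The key property, relevant to what follows, is that whatever the query returns is repackaged and pushed everywhere within minutes.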
But the permissions update caused:

Duplicate rows to appear in the metadata query that builds the file
The feature file to roughly double in size, exceeding a hard internal limit
The oversized file to be distributed across the entire network on the next refresh cycle
Because this ML-driven feature file is tightly integrated into Cloudflare’s core proxy, the failure disrupted not only bot detection but traffic routing across the entire platform.
One of the most confusing aspects of the outage — even for Cloudflare engineers — was the intermittent failure pattern. Network traffic would spike with errors, then stabilize briefly, then spike again.
This happened because the permissions change was still rolling out across the ClickHouse cluster:

Nodes that had received the change produced the oversized, duplicated file
Nodes that had not yet received it still produced a correct file
The file was regenerated and redistributed every few minutes, so each cycle the network got whichever version the answering node happened to produce
This created a cycle:
Good file → systems recovered
Bad file → systems crashed again
Repeat
Initially, the repeated waves of errors made the issue appear similar to a large-scale DDoS attack. It took time for Cloudflare to isolate the source of the problem and recognize that the failures aligned with how the feature file was being generated.
Once every node began producing the incorrect file, the fluctuation stopped — and the system remained in a failing state.
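A toy simulation makes the flapping-then-permanent-failure pattern concrete. All names and numbers here are illustrative assumptions, not Cloudflare's actual values, except the 200-feature limit mentioned later in the post:

```python
FEATURE_LIMIT = 200

def generate_file(node_has_new_permissions: bool) -> int:
    """An updated node sees each metadata row twice, doubling the feature count."""
    base_features = 150
    return base_features * 2 if node_has_new_permissions else base_features

def proxy_survives(feature_count: int) -> bool:
    """The core proxy crashed on files over the hard limit."""
    return feature_count <= FEATURE_LIMIT

# Each refresh is served by whichever node runs the query. Early in the
# rollout the mix of updated/not-updated nodes makes the network flap;
# once every node is updated, the failure becomes permanent.
rollout = [False, True, False, True, True, True]
history = [proxy_survives(generate_file(updated)) for updated in rollout]
print(history)  # [True, False, True, False, False, False]
```

The alternating True/False entries correspond to the recover/crash waves engineers observed; the trailing run of False is the steady failing state once the rollout completed.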
By 14:30 UTC, the team had traced the outage back to the oversized feature file. Engineers immediately stopped the distribution of new files and manually pushed a known-good earlier version of the file into the system.
In addition, they restarted the core proxy (FL), allowing traffic to start flowing again. From that point, Cloudflare spent several hours clearing backlogs, stabilizing internal components, and restarting services that had entered bad states.
By 17:06 UTC, Cloudflare confirmed that all systems were operating normally again.
To understand why a single oversized file could take down such a large infrastructure, it helps to look at how Cloudflare processes requests.
Every incoming request — website visits, API calls, mobile app requests — flows through:
TLS/HTTP termination
Core proxy system (FL / FL2)
Pingora, which handles caching and contacting origin servers
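The three stages above can be modeled as a simple pipeline. The stage names come from the post; the implementation below is a sketch with hypothetical fields:

```python
def terminate_tls(request):
    # Stage 1: TLS/HTTP termination decodes the raw connection.
    request["decrypted"] = True
    return request

def core_proxy(request):
    # Stage 2: FL/FL2 runs security and performance modules,
    # including Bot Management (hypothetical score below).
    request["bot_score"] = 30
    return request

def pingora(request):
    # Stage 3: Pingora serves from cache or contacts the origin server.
    request["response"] = "200 OK (from cache)"
    return request

def handle(request):
    for stage in (terminate_tls, core_proxy, pingora):
        request = stage(request)
    return request

result = handle({"url": "https://example.com/"})
print(result["response"])
```

Because every request traverses stage 2, a crash in any module running there takes the whole path down, which is why a bot-detection failure surfaced as sitewide 5xx errors.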
At the proxy stage, Cloudflare runs multiple security and performance tools, including Bot Management. The bot detection module failed because the enlarged feature file exceeded its limit of 200 features, a threshold designed to prevent excessive memory use.
When the module crashed:

Requests passing through the newer FL2 proxy received HTTP 5xx errors
The older FL proxy kept serving traffic, but with incorrect bot scores
Customer rules that acted on those scores could misfire
At the same time this internal failure was unfolding, Cloudflare’s external status page — hosted entirely outside its infrastructure — also went offline. This coincidence made engineers initially suspect an external attack. Later, it was confirmed to be unrelated, but it added confusion during early diagnosis.
The source of the issue lay deeper in how ClickHouse handles distributed queries.
A query used to build the feature file selected column metadata without filtering on the database name. After the permissions change, that query could see the same tables through both the default database and the underlying r0 database, so every row was returned twice. This caused the file to balloon to more than twice its normal size.
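The duplication can be illustrated in miniature. The rows, table, and column names below are hypothetical; only the default/r0 double-visibility is taken from the incident:

```python
# Metadata rows as seen AFTER the permissions change: the same underlying
# tables are now visible through both the `default` and `r0` databases.
rows = [
    {"database": "default", "table": "http_features", "column": "ua_entropy"},
    {"database": "default", "table": "http_features", "column": "req_rate"},
    {"database": "r0",      "table": "http_features", "column": "ua_entropy"},
    {"database": "r0",      "table": "http_features", "column": "req_rate"},
]

def build_features_buggy(rows):
    """No database filter: every feature appears twice, doubling the file."""
    return [r["column"] for r in rows]

def build_features_fixed(rows):
    """Constrain the query to one database, as the corrected version should."""
    return [r["column"] for r in rows if r["database"] == "default"]

print(len(build_features_buggy(rows)))  # 4 -> oversized file
print(len(build_features_fixed(rows)))  # 2 -> normal file
```

The fix is a one-line filter, which is exactly why the original change looked harmless: nothing about the query itself changed, only what it could see.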
Cloudflare’s bot system has a hard limit on the number of features to ensure:

Memory allocation stays bounded and predictable on every machine
The bot-scoring hot path keeps consistent performance
When the bad file exceeded this limit, the proxy’s Bot Management module triggered a panic, causing the system to fail rather than continue riskily. This unhandled error led to widespread 5xx errors.
Both FL and the newer FL2 proxy system were affected, although FL2 returned explicit 5xx errors while FL continued operating but with incorrect bot scoring.
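One lesson here is module isolation: a submodule hitting an unexpected input should degrade, not panic the whole proxy. The sketch below is not Cloudflare's code (their proxy is written in Rust); it only illustrates failing open, under the assumption that serving traffic without a bot score beats serving no traffic at all:

```python
FEATURE_LIMIT = 200

class FeatureFileTooLarge(Exception):
    pass

def load_bot_features(feature_count: int):
    if feature_count > FEATURE_LIMIT:
        raise FeatureFileTooLarge(f"{feature_count} features > {FEATURE_LIMIT}")
    return {"features": feature_count}

def score_request(feature_count: int):
    """Fail open: on a bad feature file, serve the request without a bot
    score rather than returning a 5xx for all traffic."""
    try:
        load_bot_features(feature_count)
        return {"status": 200, "bot_score": 30}   # hypothetical score
    except FeatureFileTooLarge:
        return {"status": 200, "bot_score": None}  # degraded but serving

print(score_request(150))  # normal scoring
print(score_request(300))  # oversized file: degraded, not a 5xx
```

Whether failing open is acceptable is a product decision (it weakens bot protection while degraded), which is presumably why Cloudflare's remediation talks about letting modules "fail safely" rather than simply ignoring errors.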
Because the core proxy is a foundational service at Cloudflare, several products were incidentally affected, including:

Core CDN and security services
Turnstile (Cloudflare’s CAPTCHA alternative)
Workers KV
The Cloudflare Dashboard (login failures)
Email Security
Cloudflare Access
The Dashboard saw a second window of login failures after the initial fix: backlog traffic overwhelmed the login system once services came back online.
Cloudflare called this their worst outage since 2019. While the company has had isolated feature failures and dashboard outages in the past, this event interrupted core traffic routing, affecting millions of websites globally.
Cloudflare emphasized that any outage of this scale is unacceptable, given the company’s central role in Internet infrastructure.
In response, Cloudflare committed to several remediation steps focused on improving system resilience and reducing the risk of similar failures.
Making automated config files go through stricter validation — similar to user-generated data.
Allowing engineers to rapidly disable problematic features without waiting for them to propagate.
Preventing error reports, core dumps, or debugging tools from overwhelming system resources during failures.
Ensuring that individual modules — like Bot Management — can fail safely without crashing the entire proxy system.
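The first of those steps, treating internally generated files like untrusted input, might look like a pre-publish check. The limits and field names below are assumptions for illustration, not Cloudflare's actual schema:

```python
import json

FEATURE_LIMIT = 200

def validate_feature_file(raw: str):
    """Reject a bad feature file before it is distributed, instead of
    letting every machine in the network discover the problem at load time."""
    features = json.loads(raw)["features"]
    names = [f["name"] for f in features]
    if len(features) > FEATURE_LIMIT:
        raise ValueError(f"too many features: {len(features)} > {FEATURE_LIMIT}")
    if len(names) != len(set(names)):
        raise ValueError("duplicate feature names detected")
    return features

good = json.dumps({"features": [{"name": "ua_entropy"}, {"name": "req_rate"}]})
bad = json.dumps({"features": [{"name": "ua_entropy"}, {"name": "ua_entropy"}]})

validate_feature_file(good)      # passes
try:
    validate_feature_file(bad)   # duplicates rejected before distribution
except ValueError as e:
    print("rejected:", e)
```

Notably, a duplicate-name check alone would have caught this specific outage at the generation step, hours before any proxy saw the file.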
Cloudflare issued a formal apology, acknowledging the widespread impact the outage had on the Internet. As a critical part of global web infrastructure, the company emphasized its responsibility to ensure reliability and stated that rebuilding system resilience is already underway.

