Bleeding Llama: the critical memory leak exposing 300,000 local AI servers

Bleeding Llama: when your local AI server becomes an open window into your memory

On May 5th of this year, researchers at security firm Cyera published the technical details of a critical vulnerability in Ollama, the most widely used open-source platform for running large language models (LLMs) locally. Dubbed Bleeding Llama and tracked as CVE-2026-7482 with a CVSS score of 9.1 (Critical), the flaw allows any remote attacker to extract arbitrary fragments of server memory —including user conversations, API keys, environment variables, and proprietary code— using just three HTTP requests, with no credentials required.

At the time of disclosure, approximately 300,000 Ollama instances were directly exposed on the public internet, with no authentication layer whatsoever.


What is Ollama?

Before diving into the vulnerability, some context is in order. Ollama is an open-source tool that lets anyone run large language models —such as Llama 3, Mistral, Gemma, or Phi— directly on their own hardware, without relying on external services like OpenAI or Anthropic. With over 170,000 GitHub stars and 100 million Docker Hub pulls, its adoption spans individual developers all the way to large enterprises that deploy it as an internal AI assistant for thousands of employees.

The value proposition is straightforward: privacy and full control. Data never leaves your infrastructure, there are no per-call API fees, and you don’t depend on a third party’s uptime. Ollama was designed around a “zero friction” principle: install with a single command, API available immediately, no additional configuration steps. That approach is perfectly suited for a developer using it on their personal laptop. The problem begins when that same configuration —built for localhost— gets scaled unchanged into corporate networks or internet-facing servers.


What is an “out-of-bounds heap read” and why is it dangerous?

To understand Bleeding Llama, you need to grasp a fundamental low-level concept: how programs manage memory.

When a program like Ollama runs, the operating system allocates a block of memory for it to work with. Part of that memory is called the heap, and it’s where the program stores dynamic data during execution: active user conversations, environment variables loaded at startup, authentication tokens from integrated services, source code submitted for analysis, and much more.

Think of the heap as a shared notebook where the program jots down everything it needs to remember while working. The program can write and read from any section of that notebook, but it should only access the pages assigned to it.

An out-of-bounds read occurs when the program, due to a logic error, reads beyond the boundary of its allocated block. In notebook terms: it starts reading pages that don’t belong to it —pages containing the private notes of other parts of the system.
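
To make the idea concrete, here is a toy Go sketch (purely illustrative, not Ollama's code): a parser is handed a small view into a larger heap buffer, and by trusting a length taken from untrusted input it ends up reading its neighbour's pages.

```go
package main

import "fmt"

func main() {
	// One large heap allocation shared by the whole program ("the notebook").
	heap := make([]byte, 1024)
	copy(heap[16:], []byte("API_KEY=sk-secret-123 ...private data belonging to another component..."))

	// The parser is only supposed to see the first 16 bytes.
	parserView := heap[:16]

	// A malicious input claims the data is 64 bytes long. Re-slicing succeeds
	// because the underlying buffer is large enough, so no error is raised.
	attackerClaimedLen := 64
	leak := parserView[:attackerClaimedLen]

	fmt.Printf("%q\n", leak) // the neighbouring secret is now part of the output
}
```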

This vulnerability class is well known in security —it's a close cousin of the famous Heartbleed bug from 2014, which affected millions of servers running OpenSSL— and it keeps being exploited successfully, year after year, because memory management errors are extraordinarily difficult to eliminate from code entirely.


The incident: a silent flaw at the heart of the inference pipeline

The problem lies in how Ollama processes model files in GGUF format —the standard for storing LLM weights— specifically in the WriteTo() and ConvertToF32() functions found in fs/ggml/gguf.go and server/quantization.go.

What is a GGUF file?

A language model is nothing more than an enormous collection of numbers —called weights— organized into mathematical structures known as tensors. A GGUF file is the standard container that packages all those tensors together with metadata describing their shape: how many dimensions each tensor has, how many elements it contains along each dimension, what data type it stores, and so on.
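
In simplified terms, each tensor entry in a GGUF file carries metadata along these lines (the Go struct below is illustrative, not the exact GGUF specification):

```go
package gguf // illustrative package name

// tensorInfo is a simplified, illustrative view of per-tensor GGUF metadata.
type tensorInfo struct {
	Name   string   // e.g. "blk.0.attn_q.weight"
	NDims  uint32   // how many dimensions the tensor has
	Dims   []uint64 // number of elements along each dimension
	Kind   uint32   // element type: F32, F16, or a quantized block format
	Offset uint64   // byte offset of the tensor's data within the file
}
```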

The critical part: Ollama blindly trusts that metadata. When it receives a GGUF file to create a model instance, it reads the tensor dimensions declared in the file itself and assumes they are accurate. The flaw: it never verifies that those dimensions match the actual size of the data contained in the file.
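
The missing check looks, in essence, like the following sketch (hypothetical names, not Ollama's actual code): before touching the data, reconcile the size the metadata claims with the bytes that are actually there.

```go
package gguf // illustrative package name

import "fmt"

// validateTensor rejects metadata whose declared size cannot fit in the data
// actually present in the file. elemSize is the size of one element in bytes.
func validateTensor(declaredElems, elemSize uint64, blob []byte) error {
	if elemSize == 0 {
		return fmt.Errorf("invalid element size")
	}
	need := declaredElems * elemSize
	if need/elemSize != declaredElems {
		return fmt.Errorf("declared tensor size overflows")
	}
	if need > uint64(len(blob)) {
		return fmt.Errorf("metadata declares %d bytes but the file only provides %d", need, len(blob))
	}
	return nil
}
```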

How the exploit is built

An attacker can craft a malicious GGUF file where the tensor dimension field points to an arbitrarily large value —for example, declaring that a tensor contains 10 million elements when it actually holds only 100. When Ollama processes this file, the quantization loop inside ConvertToF32() iterates beyond the boundary of the heap-allocated buffer, reading memory it has no right to access.
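
Heavily simplified, and with illustrative names rather than Ollama's exact code, the vulnerable pattern looks roughly like this: the loop bound comes from the file's metadata, not from the size of the buffer that actually holds the tensor.

```go
package gguf // illustrative package name

import "unsafe"

// convertTensor walks as many half-precision (2-byte) values as the metadata
// declares. The unchecked view built with unsafe.Slice means nothing stops the
// loop once it runs past the real end of tensorData.
func convertTensor(tensorData []byte, declaredElems int) []uint16 {
	halves := unsafe.Slice((*uint16)(unsafe.Pointer(&tensorData[0])), declaredElems)

	out := make([]uint16, declaredElems)
	for i := 0; i < declaredElems; i++ {
		out[i] = halves[i] // beyond len(tensorData)/2, this copies neighbouring heap memory
	}
	return out
}
```

With a tensor that really holds 100 elements but declares 10 million, a loop like this copies roughly 20 MB of adjacent heap into the output.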

What makes this bug especially severe is the conversion step involved: the F16→F32 transformation is lossless, preserving every value it reads exactly, with no alteration or loss. The illegally read content is therefore embedded cleanly into the resulting model file, ready to be exfiltrated.

The attacker then simply uses Ollama’s native /api/push endpoint to send that poisoned model —heap memory included— to a registry server under their control. From there, they can analyze the memory dump at leisure and extract any valuable data they find.

The attack executes completely silently: Ollama generates no errors, does not crash, and leaves no obvious traces in the logs. From the system’s perspective, everything worked normally.


The problem: three HTTP requests are enough to drain your AI server’s memory

The exploitation chain is alarmingly simple:

  1. POST /api/blobs/sha256:<hash> to upload the malicious GGUF file with inflated tensor dimensions.
  2. POST /api/create to trigger quantization; the out-of-bounds read embeds heap contents into the model.
  3. POST /api/push to send the poisoned model to the attacker's registry.

None of these endpoints require authentication in a standard Ollama deployment. And if the instance is configured with OLLAMA_HOST=0.0.0.0 —something documented and widely used for network deployments— any actor with access to port 11434 can run this sequence from anywhere in the world.
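
If you run an instance yourself, a quick way to see what an outsider sees is to call one of the standard endpoints from a machine that should not have access; a minimal sketch (TARGET-HOST is a placeholder to replace):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	client := &http.Client{Timeout: 5 * time.Second}

	// /api/version is a standard Ollama endpoint; 11434 is the default port.
	resp, err := client.Get("http://TARGET-HOST:11434/api/version")
	if err != nil {
		fmt.Println("not reachable from here:", err)
		return
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Printf("reachable with no credentials (HTTP %d): %s\n", resp.StatusCode, body)
}
```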

How easy is it to find vulnerable instances?

Tools like Shodan and Censys are specialized search engines for internet-exposed devices and services. Just as Google indexes web pages, these platforms index open ports and services. A simple search for port 11434 —Ollama’s default— returns tens of thousands of results: fully exposed instances, no authentication, ready to be queried by anyone.

No sophisticated hacking skills are needed to find targets. Anyone with basic familiarity with these tools can identify vulnerable servers worldwide within minutes.

The data at risk

What ends up in process memory at the time of an attack depends on the deployment context. In a corporate environment, the inventory is devastating: employees' conversations with the assistant, API keys and authentication tokens for integrated services, environment variables and credentials loaded at startup, and proprietary source code submitted for analysis.


Analysis: the paradox of moving to local AI for privacy

The central irony of Bleeding Llama is that many organizations adopt Ollama precisely to protect their privacy: they don’t want their prompts, their code, or their customer data leaving for external services. That logic is sound. The mistake is assuming that “data doesn’t go to the cloud” is equivalent to “data is secure.”

Moving to local inference solves the problem of trusting an external provider, but transfers security responsibility to the organization itself. And that has implications that go well beyond installing a piece of software: it means taking on the operation of server infrastructure, with all the controls that entails.

The most dangerous scenario documented by the researchers is precisely the most common one in enterprise environments: a shared Ollama instance acting as an AI assistant for hundreds or thousands of employees. That single process concentrates the entire organization’s conversation history in its heap. An attacker who compromises it doesn’t get one session —they get the company’s complete AI activity profile.

The threat model many overlook

In cybersecurity, a threat model is the exercise of asking: who could attack us, how, and with what goal? Many organizations adopting local AI never perform this exercise for their Ollama infrastructure. They assume the risk is the external provider (OpenAI, Anthropic) and neglect the security of the local server.

Bleeding Llama proves that threat model is incomplete. The adversary isn’t only the AI provider; it’s also any actor who can reach port 11434 on your server, whether from the internet or from within your own internal network.


The disclosure chain: a problem in its own right

Cyera researcher Dor Attias reported the vulnerability to the Ollama team on February 2, 2026. A patch was available in version 0.17.1 by late February, but the release notes did not indicate it contained a critical security fix. To an administrator reviewing the changelog, it looked like a routine update.

The CVE request submitted in March to MITRE —the organization that coordinates the official vulnerability registry— went unanswered for weeks. This forced the team to turn to Echo, an alternative CVE Numbering Authority (CNA), which assigned the identifier CVE-2026-7482 on April 28th and published it on May 1st.

The result is a disclosure chain with multiple failure points:

  1. The patch shipped without urgency signaling, which drastically reduced the update adoption rate.
  2. MITRE’s CVE assignment process experienced delays, fragmenting the capacity for coordinated response.
  3. Internet-exposed instances have no automatic notification mechanism; their administrators receive no alert when a critical vulnerability surfaces.

The outcome was a window of approximately three months during which the patch existed but thousands of operators never knew they needed to prioritize it. It’s the same pattern that repeats across other critical vulnerabilities in infrastructure software: the most dangerous gap isn’t always technical —it’s communicational.


Security recommendations

For administrators and operations teams

  1. Update every instance to Ollama 0.17.1 or later; the fix exists, it simply was not flagged as critical in the release notes.
  2. Do not expose port 11434 directly to the internet, and treat OLLAMA_HOST=0.0.0.0 as a deliberate decision that requires compensating controls, not as a convenience default.
  3. Keep instances bound to localhost where possible; for anything that must be reachable over the network, add an authentication layer in front (a minimal reverse-proxy sketch follows this list) and proper network segmentation.
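
As a sketch of that "authentication layer in front" (hypothetical code, assuming Ollama stays bound to 127.0.0.1:11434 and that a shared bearer token is acceptable in your environment), a minimal reverse proxy in Go could look like this:

```go
package main

import (
	"crypto/subtle"
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"os"
)

func main() {
	token := os.Getenv("PROXY_TOKEN") // shared secret handed out to legitimate clients
	upstream, err := url.Parse("http://127.0.0.1:11434")
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(upstream)

	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		want := "Bearer " + token
		got := r.Header.Get("Authorization")
		// Reject anything without the expected token before it reaches Ollama.
		if token == "" || subtle.ConstantTimeCompare([]byte(got), []byte(want)) != 1 {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		proxy.ServeHTTP(w, r)
	})

	// Only this authenticated front end is exposed to the network.
	log.Fatal(http.ListenAndServe(":8443", handler))
}
```
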
For security teams and SOCs

  1. Inventory your exposure the way an attacker would: search Shodan or Censys for your own address ranges on port 11434.
  2. Monitor outbound traffic from hosts running Ollama, in particular model pushes toward unknown registries, since the attack leaves no obvious traces in local logs.
  3. Include local AI servers in the threat model as systems that concentrate sensitive data in memory, not as harmless developer tooling.

For individual users

  1. Update to the patched version and keep Ollama listening on localhost, which is the default behaviour.
  2. If you changed OLLAMA_HOST to reach the server from other devices, make sure port 11434 is not reachable from the internet, for example by checking it from an external network.


Wrapping up

Bleeding Llama exposes a real tension in the AI ecosystem: the very features that make Ollama easy to adopt —no authentication, no configuration, ready to use out of the box— are what make it critically dangerous when deployed outside the environment it was designed for.

The technical flaw is an out-of-bounds read in the GGUF parser’s memory handling; the systemic problem runs deeper. The security of an AI model cannot be separated from the security of the host running it, nor from the operational processes surrounding its deployment: timely patching, clear vulnerability communication, and proper network segmentation.

The lesson isn’t to stop using local inference tools —that path has its own risks— but to understand that moving to local AI isn’t just a privacy decision relative to external providers: it also means assuming responsibility for operating server infrastructure with the security controls that entails. A local LLM without hardening is no more private than a cloud LLM; it simply has a different attacker.

And in the case of Bleeding Llama, that attacker only needed three HTTP requests.

