PagedAttention Revolutionizes Large Language Model Efficiency
Discover how PagedAttention, inspired by OS virtual memory, drastically improves LLM performance by minimizing memory fragmentation and enabling flexible sharing.

Large language models (LLMs), such as those powering services like GPT and PaLM, have become integral to modern technology, driving innovations from intelligent programming assistants to advanced conversational agents. Despite their transformative potential, operating these models as hosted services remains exceptionally expensive: processing a single request can cost as much as ten times more than a traditional keyword search query, with a significant portion of that cost attributable to inefficient memory handling during LLM inference.
The core challenge lies in how these powerful models manage their internal memory while generating text. As LLMs produce output one token at a time, they rely on a component known as the Key-Value (KV) cache. This cache acts as the model’s short-term memory, holding the attention keys and values computed for previously processed tokens. However, the KV cache presents a considerable memory challenge due to its dynamic and often substantial size. Traditional systems struggle to manage this variability because they typically allocate the KV cache as a single, contiguous block of memory, an approach that leads to severe inefficiencies and impedes overall performance.
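To make this concrete, here is a minimal sketch in plain Python of how a decoder’s KV cache grows by one entry per generated token. The layer, head, and dimension counts are arbitrary assumptions chosen for illustration, not the configuration of any particular model.

```python
import numpy as np

# Illustrative dimensions only; not tied to any particular model.
NUM_LAYERS, NUM_HEADS, HEAD_DIM = 4, 8, 64

def new_kv_cache():
    # One growing list of keys and values per transformer layer.
    return [{"keys": [], "values": []} for _ in range(NUM_LAYERS)]

def decode_step(kv_cache):
    """Append the current token's keys/values to every layer's cache.

    In a real model these vectors come from the attention projections;
    random stand-ins are enough to show how the cache grows.
    """
    for layer in kv_cache:
        layer["keys"].append(np.random.randn(NUM_HEADS, HEAD_DIM))
        layer["values"].append(np.random.randn(NUM_HEADS, HEAD_DIM))
    # Attention for the next token reads *all* cached keys and values.

cache = new_kv_cache()
for _ in range(16):               # generate 16 tokens
    decode_step(cache)

print(len(cache[0]["keys"]))      # 16: the cache grows by one entry per token
```

The cache therefore grows with every token and shrinks to nothing only when the request finishes, which is exactly the variability that contiguous allocation handles poorly.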
The KV Cache: An Overlooked Performance Bottleneck
The architecture of LLMs, primarily based on the Transformer model, necessitates the KV cache for efficient text generation. Each token generated requires access to the context established by preceding tokens, which is precisely what the KV cache stores. The dynamic nature of this cache, growing and shrinking with each request, is the root cause of many performance issues.
Current methods of storing the KV cache, which involve reserving a contiguous block of memory for each request, introduce two major forms of fragmentation. These fragmentation issues significantly reduce the effective utilization of GPU memory, thereby limiting the number of requests an LLM can process concurrently. Addressing these inefficiencies is crucial for making LLMs more accessible and cost-effective.
Memory Fragmentation Explained
Memory fragmentation occurs in two primary ways within LLM serving systems. Both types contribute to substantial waste and hinder the system’s ability to maximize throughput. Understanding these mechanisms is key to appreciating the innovative solutions proposed by new research.
Internal Fragmentation: This occurs when systems pre-allocate a large block of memory for each request, often sized to the maximum possible output length an LLM might generate (e.g., 2048 tokens). If a request produces a shorter output than anticipated, a significant portion of that reserved memory remains unused. This over-allocation leads to substantial waste, as memory is tied up without serving any productive purpose. The inability to size memory allocations to actual needs is a core inefficiency; a rough calculation after the next item puts numbers on this waste.
External Fragmentation: This issue arises from the varying sizes of memory blocks reserved by different requests. As requests complete and release their memory, the GPU’s memory space becomes dotted with small, unusable gaps. Even if the total amount of free memory is sufficient for a new request, these scattered fragments are too small to hold it contiguously. This effectively makes large portions of memory inaccessible and prevents optimal resource utilization. Reports indicate that in existing systems only 20.4% to 38.2% of the KV cache memory is actively used for token states, with the remainder wasted due to these fragmentation issues.
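Here is that back-of-the-envelope calculation. Assuming a 13B-parameter-class model with 40 layers, 40 attention heads of dimension 128, and 16-bit values (figures chosen purely for illustration), each token’s KV entry occupies roughly 800 KB, so pre-allocating 2048 token slots reserves about 1.5 GiB per request even when only a few hundred tokens are actually generated:

```python
# Back-of-the-envelope internal-fragmentation estimate.
# The model dimensions below are illustrative assumptions for a 13B-class model.
num_layers = 40
num_heads = 40
head_dim = 128
bytes_per_value = 2          # fp16

# Keys and values are both cached, hence the leading factor of 2.
kv_bytes_per_token = 2 * num_layers * num_heads * head_dim * bytes_per_value
print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KB")   # ~800 KB

max_len = 2048               # slots pre-allocated per request
actual_len = 200             # hypothetical actual output length

reserved = max_len * kv_bytes_per_token
used = actual_len * kv_bytes_per_token
print(f"Reserved: {reserved / 2**30:.2f} GiB, used: {used / 2**30:.2f} GiB")
print(f"Wasted by internal fragmentation: {100 * (1 - used / reserved):.0f}%")
```

Under these assumptions roughly 90% of the reserved memory goes unused for this request, which is consistent with the low utilization figures reported above.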
Absence of Memory Sharing
Beyond fragmentation, existing LLM serving systems also suffer from a lack of flexible memory sharing. Advanced decoding strategies, such as parallel sampling or beam search, frequently generate multiple output sequences from a single initial prompt. In an ideal scenario, these multiple outputs could share common parts of the KV cache, particularly the context derived from the initial prompt.
However, current implementations struggle with this because each generated sequence’s KV cache is housed in its own isolated, contiguous memory block. This isolation prevents efficient sharing, forcing redundant storage of identical contextual information across different branches of a decoding process. The cumulative effect of these inefficiencies—both fragmentation and the inability to share memory—severely constrains the number of simultaneous requests an LLM can handle, directly impacting the system’s overall throughput and increasing operational costs.
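The short calculation below, reusing the illustrative per-token figure from the earlier sketch together with a hypothetical prompt length and sample count, shows how quickly this redundancy adds up for parallel sampling:

```python
# Cost of *not* sharing the prompt's KV cache across parallel samples.
# All figures are illustrative assumptions.
kv_bytes_per_token = 800 * 1024   # ~800 KB per token, from the earlier estimate
prompt_len = 1024                 # tokens in the shared prompt
num_samples = 4                   # parallel samples generated from that prompt

duplicated = num_samples * prompt_len * kv_bytes_per_token  # one contiguous cache per sequence
shared = prompt_len * kv_bytes_per_token                    # a single shared copy

print(f"Without sharing: {duplicated / 2**30:.2f} GiB for the prompt alone")
print(f"With sharing:    {shared / 2**30:.2f} GiB")
```

Even at these modest settings, storing the same prompt context once per sequence multiplies its memory footprint by the number of samples.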
PagedAttention: A Paradigm Shift in LLM Memory Management
To overcome the persistent memory challenges plaguing LLM serving systems, researchers have developed PagedAttention. This groundbreaking approach draws inspiration from a foundational concept in operating systems (OS): virtual memory and paging. By adapting these established principles, PagedAttention redefines how LLM memory is managed, leading to dramatic improvements in efficiency and performance.
The fundamental insight behind PagedAttention is to break away from the traditional contiguous memory allocation. Instead, it introduces a block-based memory management scheme that mirrors how modern operating systems handle physical and virtual memory. This conceptual leap allows for more dynamic and flexible resource utilization, directly addressing the core problems of fragmentation and limited sharing.
The Operating System Analogy
PagedAttention’s design is elegantly explained through an analogy to operating system concepts:
KV Blocks as Pages: In PagedAttention, the KV cache of each sequence is divided into small, fixed-size units known as KV blocks. These blocks are analogous to “pages” in an operating system’s virtual memory system. Each KV block is designed to store the keys and values for a predetermined number of tokens. Crucially, these blocks do not need to be physically contiguous in memory.
Tokens as Bytes: Within the PagedAttention framework, individual tokens residing in the KV cache are conceptually similar to bytes contained within an OS page. This granular division allows for precise control over memory allocation and deallocation.
Requests as Processes: Each LLM request is managed much like a “process” within an operating system. Just as a process’s logical memory addresses are mapped to physical memory pages, an LLM request’s “logical” KV blocks are mapped to “physical” KV blocks scattered across the GPU memory. This abstraction allows the system to treat memory as a unified pool, rather than fragmented, isolated segments.
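The sketch below illustrates the idea with a toy allocator: each request keeps a block table mapping logical block indices to physical block IDs, and a new physical block is taken from a shared free pool only when the current one fills up. The class and variable names are illustrative, not vLLM’s actual data structures.

```python
BLOCK_SIZE = 16  # tokens per KV block (illustrative)

class BlockManager:
    """Toy allocator: hands out fixed-size physical KV blocks from a free pool."""

    def __init__(self, num_physical_blocks):
        self.free_blocks = list(range(num_physical_blocks))

    def allocate(self):
        return self.free_blocks.pop()   # any free block will do: no contiguity required

    def free(self, block_id):
        self.free_blocks.append(block_id)

class Request:
    """Each request keeps a block table mapping logical -> physical blocks."""

    def __init__(self, manager):
        self.manager = manager
        self.block_table = []   # index i = logical block i, value = physical block id
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.manager.allocate())
        self.num_tokens += 1

manager = BlockManager(num_physical_blocks=1024)
req = Request(manager)
for _ in range(40):            # 40 generated tokens
    req.append_token()

print(req.block_table)         # three physical blocks, allocated on demand
```

Because every block has the same size and can live anywhere in GPU memory, a finished request simply returns its blocks to the pool, leaving no awkward gaps behind.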
Resolving Memory Woes with PagedAttention
The adoption of PagedAttention’s block-level memory management delivers substantial benefits that directly counter the inefficiencies of older systems. Its ability to dynamically allocate and share memory fundamentally changes the economics and performance profile of LLM deployments.
Near-Zero Fragmentation: Because KV blocks are not required to be physically contiguous in memory, PagedAttention can allocate blocks on demand as tokens are generated. This dynamic allocation virtually eliminates internal fragmentation, as memory is only consumed when genuinely needed. Furthermore, external fragmentation is effectively eradicated because all KV blocks are of a uniform size, preventing the creation of unusable small gaps in memory. This efficient use of memory ensures that nearly all allocated space is actively utilized.
Flexible Memory Sharing: PagedAttention introduces the capability to share KV blocks among different sequences, extending even across distinct requests. This feature is particularly valuable for advanced decoding techniques like parallel sampling or beam search. In these scenarios, multiple output paths often share a common initial prompt. PagedAttention allows these paths to share the KV cache for the initial prompt, leading to significant memory savings. It also employs a copy-on-write mechanism, another concept borrowed from operating systems, which ensures efficient sharing. If different sequences need to modify a shared block, a copy is made only for the modified portion, preventing unnecessary duplication while maintaining data integrity.
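The toy bookkeeping below sketches how such sharing could be tracked: each physical block carries a reference count, and a sequence that wants to write into a block still referenced by others first takes a private copy. This is an illustrative simplification of the copy-on-write idea, not vLLM’s implementation.

```python
class SharedBlockPool:
    """Toy copy-on-write bookkeeping for fixed-size KV blocks."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.ref_count = {}              # physical block id -> number of sequences using it

    def allocate(self):
        block = self.free.pop()
        self.ref_count[block] = 1
        return block

    def share(self, block):
        # Another sequence starts referencing an existing block (e.g. a prompt block).
        self.ref_count[block] += 1
        return block

    def write(self, block):
        """Return a block safe to write into, copying first if it is shared."""
        if self.ref_count[block] == 1:
            return block                 # sole owner: write in place
        # Copy-on-write: drop our reference and take a private copy.
        self.ref_count[block] -= 1
        new_block = self.allocate()
        # (A real system would also copy the block's key/value contents here.)
        return new_block

pool = SharedBlockPool(num_blocks=8)
prompt_block = pool.allocate()           # sequence A stores the prompt's last block
pool.share(prompt_block)                 # sequence B reuses it instead of duplicating it

block_for_b = pool.write(prompt_block)   # B wants to append: triggers a private copy
print(block_for_b != prompt_block)       # True: B received its own block
print(pool.ref_count[prompt_block])      # 1: A still owns the original
```

Only the block being modified is copied; every untouched prompt block stays shared, which is where the memory savings for parallel sampling and beam search come from.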
Introducing vLLM: The High-Throughput LLM Engine
Building upon the revolutionary PagedAttention mechanism, vLLM emerges as an advanced LLM serving system meticulously engineered for high throughput. vLLM integrates block-level memory management with a sophisticated scheduler that works in tandem with PagedAttention to optimize resource utilization and enhance performance significantly.
The primary advantages offered by vLLM are twofold: virtually eliminating waste in KV cache memory and providing flexible sharing of the KV cache both within and across different requests. These improvements translate directly into tangible performance gains, making LLM inference far more efficient and scalable.
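For readers who want to try it, the snippet below shows a minimal offline-inference example using vLLM’s Python API; the model name and sampling settings are arbitrary choices, and the vLLM documentation should be consulted for current options.

```python
# pip install vllm
from vllm import LLM, SamplingParams

# Any Hugging Face model identifier supported by vLLM; this one is just an example.
llm = LLM(model="facebook/opt-13b")

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

prompts = [
    "Explain why contiguous KV cache allocation wastes GPU memory.",
    "Summarize the idea behind paged memory management.",
]

# PagedAttention and the block-based KV cache manager work behind the scenes;
# callers simply submit prompts and read back the generated text.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```

The memory management described above is entirely transparent to the caller, which is part of what makes the system practical to adopt.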
Performance Metrics and Future Outlook
The impact of vLLM on LLM serving performance is profound. It has been shown to improve the throughput of popular LLMs by a factor of 2 to 4 when compared to leading systems like FasterTransformer and Orca, all while maintaining comparable latency. This performance boost becomes even more pronounced with longer sequences, larger models, and more complex decoding algorithms.
For instance, when serving a 13-billion-parameter LLM, vLLM can concurrently process 2.2 times more requests than an “oracle” version of Orca (which assumes perfect knowledge of output lengths) and an impressive 4.3 times more than standard Orca (Max). Beyond throughput, vLLM also demonstrates substantial memory savings, reducing memory consumption by 6.1% to 9.8% for parallel sampling and a remarkable 37.6% to 55.2% for beam search.
By intelligently adapting principles from operating systems, PagedAttention and vLLM are poised to make LLM serving dramatically more efficient. This innovation promises to lower operational costs for cloud providers and deliver faster, more responsive LLM applications for users globally. This development represents a significant breakthrough, addressing a critical bottleneck in LLM deployment and paving the way for the next generation of AI-powered services.