
vLLM
High-throughput LLM inference and serving engine
The Lens
vLLM is one of the fastest engines for serving open-weight LLMs. It takes open-weight models and serves them over an OpenAI-compatible API, squeezing maximum throughput out of your GPUs.
What's free: Everything. Apache 2.0 license. The entire inference engine, all optimizations (PagedAttention, continuous batching, tensor parallelism), the OpenAI-compatible API server. All free.
vLLM's key innovation is PagedAttention, which manages the KV cache the way operating systems manage virtual memory: in fixed-size pages rather than one contiguous block per request. The result: 2-4x more throughput than naive serving. It has become the default serving engine for self-hosted LLMs.
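To make the paging analogy concrete, here is a toy sketch of page-based KV-cache allocation. This is illustrative only, not vLLM's actual block manager: each sequence grows in fixed-size pages drawn from a shared free list, so no memory is reserved for tokens a sequence hasn't generated yet.

```python
PAGE_SIZE = 16  # tokens per page (analogous to vLLM's block_size)

class PagedKVCache:
    def __init__(self, total_pages):
        self.free_pages = list(range(total_pages))
        self.page_table = {}   # seq_id -> list of physical page ids
        self.token_count = {}  # seq_id -> tokens stored so far

    def append(self, seq_id, n_tokens):
        """Grow a sequence by n_tokens, allocating pages only as needed."""
        have = self.token_count.get(seq_id, 0)
        need_pages = -(-(have + n_tokens) // PAGE_SIZE)  # ceiling division
        pages = self.page_table.setdefault(seq_id, [])
        while len(pages) < need_pages:
            pages.append(self.free_pages.pop())  # raises IndexError if out of memory
        self.token_count[seq_id] = have + n_tokens

    def free(self, seq_id):
        """Return a finished sequence's pages to the shared pool."""
        self.free_pages.extend(self.page_table.pop(seq_id, []))
        self.token_count.pop(seq_id, None)

cache = PagedKVCache(total_pages=8)
cache.append("req-1", 20)  # 20 tokens -> 2 pages allocated
cache.append("req-2", 5)   # 5 tokens -> 1 page allocated
cache.free("req-1")        # pages return to the pool immediately
```

With contiguous allocation, each request would have to reserve space for its maximum possible length up front; paging lets many more concurrent requests share the same GPU memory.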
The catch: you need serious GPUs. Running a 70B-parameter model requires 2-4 A100 GPUs ($1-2/hr each on cloud, or $10K+ each to buy). Even a 7B-8B model needs a decent GPU with 16GB+ VRAM. vLLM is free, but the hardware emphatically is not. And it's optimized for NVIDIA GPUs: AMD ROCm support exists but is second-class.
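As a sketch of the OpenAI-compatible workflow (model name and port are illustrative; assumes vLLM is installed and a suitable GPU is available):

```shell
# Start an OpenAI-compatible server for an open-weight model
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000

# Query it with the standard /v1/chat/completions endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```

Because the endpoint mirrors OpenAI's API, existing OpenAI client libraries can usually be pointed at a vLLM server by changing only the base URL.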
Free vs Self-Hosted vs Paid
### What's Free
Everything. Apache 2.0 license. All features, all optimizations, no restrictions.
### The Hardware Bill (This Is Your Real Cost)
- **8B model (Llama 3.1 8B)**: 1x GPU with 16GB+ VRAM. Cloud: ~$0.50-1.00/hr. Buy: RTX 4090 ~$1,600.
- **70B model (Llama 3.1 70B)**: 2-4x A100 80GB GPUs. Cloud: $4-8/hr (~$3,000-6,000/mo 24/7). Buy: ~$40K-80K.
- **405B model (Llama 3.1 405B)**: 8x A100 or H100. Cloud: $16-32/hr (~$12K-24K/mo). Buy: you don't want to know.
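The sizing tiers above follow from a back-of-envelope rule: weights alone take roughly 2 bytes per parameter at fp16, plus headroom for the KV cache and activations. The 20% overhead factor here is a crude guess, not a measurement, and quantization can cut these numbers substantially:

```python
def min_vram_gb(params_billions, bytes_per_param=2, overhead=1.2):
    """Rough VRAM floor: fp16 weights plus ~20% headroom for
    KV cache and activations (a crude illustrative estimate)."""
    return params_billions * bytes_per_param * overhead

print(round(min_vram_gb(8)))    # ~19 GB  -> one 24GB card
print(round(min_vram_gb(70)))   # ~168 GB -> multiple 80GB A100s
print(round(min_vram_gb(405)))  # ~972 GB -> an 8x H100 node
```

This is why a 70B model at fp16 cannot fit on a single 80GB A100 and needs tensor parallelism across 2-4 cards.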
### Cloud GPU Options
- **RunPod**: A100 80GB at ~$1.64/hr. Good for experimentation.
- **Lambda Labs**: A100 at ~$1.10/hr. Better for sustained use.
- **AWS (p4d/p5)**: $12-40/hr. Enterprise-grade, enterprise-priced.
### vs Paying for API Access
- OpenAI GPT-4o: $2.50-10 per 1M tokens. No hardware to manage.
- Self-hosted Llama 70B via vLLM: ~$0.20-0.50 per 1M tokens at scale. But you're managing infrastructure.
### When Self-Hosting Makes Sense
- **When**: data privacy is non-negotiable, you're processing millions of tokens/day (the cost crossover), or you need custom model fine-tuning.
- **When not**: you're processing <100K tokens/day (the API is cheaper), or you don't have GPU expertise.
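The crossover claim can be sanity-checked with simple arithmetic, using the rough per-hour and per-token figures quoted on this page (illustrative, not measured, and assuming the GPU cluster runs 24/7):

```python
# Back-of-envelope: at what daily volume does an always-on self-hosted
# cluster beat per-token API pricing?

api_cost_per_m_tokens = 2.50   # $/1M tokens, cheap end of GPT-4o pricing
gpu_cost_per_hr = 6.00         # mid-range for 2-4x A100 serving a 70B model
gpu_cost_per_day = gpu_cost_per_hr * 24  # $144/day regardless of usage

# Self-hosting wins once the API bill for the same volume exceeds $144/day:
crossover_tokens = gpu_cost_per_day / api_cost_per_m_tokens * 1_000_000
print(f"{crossover_tokens / 1e6:.1f}M tokens/day")  # 57.6M tokens/day
```

The fixed GPU cost is the key design consideration: below the crossover you pay for idle hardware, while above it each additional token is nearly free, which is how self-hosting reaches the ~$0.20-0.50 per 1M tokens figure at high utilization.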
Software is free. Hardware costs $0.50-32/hr in the cloud. Self-hosting beats API pricing only at massive scale or when data privacy is non-negotiable.
About
- Stars: 75,275
- Forks: 15,174