
llama.cpp
LLM inference in C/C++
The Lens
A server without a GPU. llama.cpp makes it possible. It runs quantized versions of open models (Llama, Mistral, Phi, Qwen, and dozens more) in pure C/C++ with optional GPU acceleration. No Python, no PyTorch, no CUDA dependency hell.
Everything is free under MIT. No paid tier, no cloud, no account. Download a model file (GGUF format), point llama.cpp at it, and you're running inference. It includes a built-in HTTP server that exposes an OpenAI-compatible API, so your existing code that talks to GPT can talk to a local model with one URL change.
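To illustrate the "one URL change" claim, here is a minimal sketch of a client for the built-in server using only the standard library. It assumes `llama-server` is running on its default port 8080 (the endpoint path follows the OpenAI chat-completions convention; the `model` name is a placeholder, since the server serves whatever GGUF file you loaded):

```python
import json
import urllib.request

# llama-server's default address; adjust host/port to your setup (assumption).
BASE_URL = "http://localhost:8080/v1/chat/completions"

def build_request(prompt: str, model: str = "local") -> dict:
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": model,  # the server serves the loaded GGUF regardless of this name
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }

def chat(prompt: str) -> str:
    """POST the payload to the local server and return the assistant's reply."""
    req = urllib.request.Request(
        BASE_URL,
        data=json.dumps(build_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Existing code built on an OpenAI client library works the same way: point its base URL at the local server and leave the rest untouched.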
The catch: you need hardware. A 7B parameter model needs ~4GB RAM (quantized). A 70B model needs ~40GB. Quality depends entirely on the model and quantization level; a heavily quantized model on a laptop won't match GPT-4. But for privacy-sensitive workloads, offline use, or just not wanting to pay per token, nothing else comes close.
Free vs Self-Hosted vs Paid
**Fully free.** Fully open source under MIT. The software is free; your cost is hardware.
**Hardware math:**
- 7B model (good for coding, chat): runs on a modern laptop with 8GB RAM. Free if you already own one.
- 13B model (better quality): needs 8-10GB RAM. Still laptop-friendly.
- 70B model (approaching GPT-4 quality): needs ~40GB RAM or a GPU with 24GB+ VRAM. A used RTX 3090 runs ~$700.
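The figures above follow a simple rule of thumb: bytes per weight times parameter count, plus runtime overhead for the KV cache and buffers. A rough sketch (the ~4.5 bits/weight and 20% overhead values are ballpark assumptions for a Q4-style quantization, not exact numbers):

```python
def approx_ram_gb(n_params_billion: float,
                  bits_per_weight: float = 4.5,
                  overhead: float = 1.2) -> float:
    """Rough RAM estimate for a quantized model:
    parameters * (bits / 8), plus ~20% for KV cache and runtime buffers.
    Both the bits/weight and overhead factors are ballpark assumptions."""
    bytes_per_param = bits_per_weight / 8
    return n_params_billion * bytes_per_param * overhead

# 7B at ~4.5 bits/weight -> roughly 4-5 GB
# 70B at ~4.5 bits/weight -> roughly 40-48 GB
```

Longer contexts grow the KV cache, so treat the overhead factor as a floor, not a ceiling.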
**Compared to API costs:** If you make 100K API calls/month to Claude or GPT-4, you're spending $500+/mo. A one-time $700 GPU investment pays for itself in 6 weeks for high-volume inference.
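The break-even claim is straightforward arithmetic; a quick sketch using the figures above (a $700 one-time GPU purchase against $500/mo of API spend):

```python
WEEKS_PER_MONTH = 52 / 12  # ~4.33

def payback_weeks(hardware_cost: float, monthly_api_spend: float) -> float:
    """Weeks until a one-time hardware purchase beats recurring API spend.
    Ignores electricity and setup time (see the hidden-cost caveat below)."""
    return hardware_cost / monthly_api_spend * WEEKS_PER_MONTH

# payback_weeks(700, 500) -> ~6 weeks
```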
**The hidden cost:** Your time. Setting up, choosing the right model, tuning quantization, and debugging performance issues is hours of work that an API call handles in milliseconds.
Software is free. Cost is hardware: $0 on a laptop for small models, $700+ for serious inference.
About
- Stars: 101,650
- Forks: 16,414

