10 open source tools compared. Sorted by stars — scroll down for our analysis.
| Tool | Description | Stars | Velocity | Score |
|---|---|---|---|---|
| ollama | Get up and running with Kimi-K2.5, GLM-5, MiniMax, DeepSeek, gpt-oss, Qwen, Gemma and other models | 167.3k | +907/wk | 100 |
| Open WebUI | Self-hosted AI interface for LLMs | 130.2k | +1007/wk | 82 |
| llama.cpp | LLM inference in C/C++ | 101.7k | +1733/wk | 85 |
| vLLM | High-throughput LLM inference and serving engine | 75.3k | +609/wk | 82 |
| text-generation-webui | Local LLM interface with text, vision, and training | 46.4k | +35/wk | 71 |
| LocalAI | Open-source AI engine, run any model locally | 44.9k | +365/wk | 79 |
| CLIProxyAPI | Wraps Gemini CLI, Antigravity, ChatGPT Codex, Claude Code, Qwen Code, and iFlow as an OpenAI/Gemini/Claude/Codex-compatible API service for free access to Gemini 2.5 Pro, GPT-5, Claude, and Qwen models | 23.4k | +2234/wk | 87 |
| TensorRT-LLM | Easy-to-use Python API for defining LLMs, with state-of-the-art optimizations for efficient inference on NVIDIA GPUs | 13.3k | — | 77 |
| omlx | LLM inference server with continuous batching and SSD caching for Apple Silicon, managed from the macOS menu bar | 8.6k | +1162/wk | 73 |
| flash-moe | Running a big model on a small laptop | 3.3k | +1102/wk | 62 |
Ollama makes it dead simple. Download it, run `ollama run llama3` in your terminal, and you're chatting with an LLM locally. That's it. No Python environments, no CUDA configuration, no Docker. The most popular local LLM tool by a wide margin. Supports dozens of models: Llama, Mistral, Gemma, DeepSeek, Qwen, and more. Works on Mac, Linux, and Windows. The API is OpenAI-compatible, so any app that works with the OpenAI API can point at Ollama instead. Your data never leaves your machine. The catch: you need hardware. A Mac with 16GB RAM runs 7B parameter models fine. For 70B+ models, you need serious GPU power. Ollama makes the software easy. It can't make your laptop a data center. Response quality depends entirely on which model you run, and local models are still behind cloud models like GPT-4 and Claude for complex tasks.
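Because the API speaks the OpenAI wire format, any HTTP client can talk to Ollama's default local endpoint (port 11434). A minimal standard-library sketch, assuming `ollama serve` is running and `llama3` has been pulled; the helper names are ours:

```python
import json
from urllib import request

# Ollama's OpenAI-compatible endpoint; 11434 is its default port.
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def build_chat_request(model, prompt):
    """Standard OpenAI-style chat-completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def ask(prompt, model="llama3"):
    """POST to the local Ollama server and return the reply text."""
    payload = json.dumps(build_chat_request(model, prompt)).encode()
    req = request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

Swapping `OLLAMA_URL` for a cloud endpoint (plus an auth header) is the whole migration in reverse, which is the point of the compatible API.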
Open WebUI gives you a ChatGPT-like interface for your own AI models, whether they're running locally with Ollama, through OpenAI's API, or any compatible endpoint. Chat with models, upload documents for RAG (retrieval-augmented generation, meaning the AI can read your files and answer questions about them), manage conversations, and share prompts. All running on your own server, maintained by the community. The UI is polished; it feels like a commercial product. Multi-user support, conversation history, model management, function calling, web search integration, and image generation. It's the most feature-rich self-hosted LLM frontend available. Everything is free for self-hosting: no premium features, no gated functionality. They recently launched a cloud-hosted version, but the self-hosted version is the full product. The catch: the license is technically "Other." It uses a custom license that's permissive for personal and organizational use but restricts commercial redistribution. Read it before building a product on top of it. Also, running LLMs locally requires serious hardware: a 7B model needs 8GB+ RAM (or a decent GPU). Open WebUI itself is lightweight, but the models it talks to are not. And updates ship fast, which means occasional breaking changes.
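RAG in miniature: retrieval is just "find the document chunks most similar to the question and paste them into the prompt." A toy sketch with bag-of-words vectors standing in for the neural embeddings a real deployment would use:

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding' -- real RAG uses a neural embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "The invoice is due on March 3rd.",
    "Our refund policy allows returns within 30 days.",
    "The office is closed on public holidays.",
]
# The retrieved chunk gets prepended to the LLM prompt as context.
top = retrieve("When is the invoice due?", chunks, k=1)
```

The pipeline in a tool like Open WebUI adds chunking, a vector database, and a proper embedding model, but the shape is the same: embed, rank, stuff into the prompt.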
Want to run an LLM on a server without a GPU? llama.cpp makes it possible. It runs quantized versions of open models (Llama, Mistral, Phi, Qwen, and dozens more) in pure C/C++ with optional GPU acceleration. No Python, no PyTorch, no CUDA dependency hell. Everything is free under MIT: no paid tier, no cloud, no account. Download a model file (GGUF format), point llama.cpp at it, and you're running inference. It includes a built-in HTTP server that exposes an OpenAI-compatible API, so your existing code that talks to GPT can talk to a local model with one URL change. The catch: you need hardware. A 7B parameter model needs ~4GB RAM (quantized). A 70B model needs ~40GB. Quality depends entirely on the model and quantization level; a heavily quantized model on a laptop won't match GPT-4. But for privacy-sensitive workloads, offline use, or just not wanting to pay per token, nothing else comes close.
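Those RAM figures follow from simple arithmetic: the weights dominate, at parameters × bits ÷ 8 bytes, plus some runtime overhead. A back-of-envelope helper (the 20% overhead factor is our rule of thumb for KV cache and buffers, not a llama.cpp constant):

```python
def quantized_model_ram_gb(n_params, bits_per_weight, overhead=1.2):
    """Back-of-envelope RAM for a quantized model: weight bytes plus ~20%
    for KV cache and runtime buffers (the 1.2 factor is a rule of thumb)."""
    weight_bytes = n_params * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

print(round(quantized_model_ram_gb(7e9, 4), 1))    # 7B at 4-bit: ~4.2 GB
print(round(quantized_model_ram_gb(70e9, 4), 1))   # 70B at 4-bit: ~42 GB
```

The same formula explains why halving the bit width (say, Q8 to Q4) roughly halves the memory footprint, at some cost in quality.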
Open-weight models need a serving engine, and vLLM is the fastest. It takes open-weight models and serves them over an OpenAI-compatible API, squeezing maximum throughput out of your GPUs. What's free: everything. Apache 2.0 license. The entire inference engine, all optimizations (PagedAttention, continuous batching, tensor parallelism), the OpenAI-compatible API server. All free. vLLM's key innovation is PagedAttention, which manages GPU memory the way operating systems manage RAM: in pages instead of contiguous blocks. The result: 2-4x more throughput than naive inference. It's become the default serving engine for self-hosted LLMs. The catch: you need serious GPUs. Running a 70B parameter model requires 2-4 A100 GPUs ($1-2/hr on cloud, or $10K+ each to buy). Even a 7B model needs a decent GPU with 16GB+ VRAM. vLLM is free but the hardware is emphatically not. And it's optimized for NVIDIA GPUs; AMD ROCm support exists but is second-class.
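The OS-paging analogy can be made concrete: a toy allocator in the spirit of PagedAttention, where each sequence's KV cache grows one fixed-size block at a time from a shared pool, instead of reserving a contiguous max-length slab up front. This is a sketch of the idea, not vLLM's actual code:

```python
class PagedKVCache:
    """Toy PagedAttention-style allocator: fixed-size blocks from a shared pool."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # sequence id -> list of block ids
        self.lengths = {}        # sequence id -> tokens stored

    def append(self, seq_id):
        """Reserve space for one more token; grab a new block only when full."""
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:   # current block full (or first token)
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        """Sequence finished: its blocks return to the pool immediately."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because memory is handed out in small blocks and reclaimed the moment a sequence finishes, far more concurrent sequences fit in the same VRAM, which is where the throughput gain comes from.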
Text-generation-webui gives you a browser-based interface for running LLMs locally. Load a model, chat with it, fine-tune it, generate images. It's the Swiss Army knife for local AI. The entire project is free under AGPL-3.0. Every feature (chat, notebook mode, model loading, LoRA training, multimodal/vision support, extensions) ships at $0. The developer sells some extension packs on Gumroad, but those are optional add-ons, not core features. Self-hosting is the only option, and the setup complexity depends on your GPU situation. If you have an NVIDIA card with 8GB+ VRAM, the one-click installers work well. AMD and Apple Silicon support exists but can be finicky. Expect 30-60 minutes for first-time setup including downloading a model. Solo developers: this is your playground. Run models locally, experiment with fine-tuning, keep your data private. Small teams: share a beefy GPU server running the API mode. Beyond that, look at dedicated inference servers like vLLM. The catch: GPU hardware requirements are real. You need a decent GPU to run anything useful. A 7B parameter model needs ~6GB VRAM. Anything bigger needs proportionally more. No GPU, no party.
LocalAI runs your own AI models locally and exposes them through an OpenAI-compatible API. LLMs, image generation, speech-to-text: all from a single server. No cloud, no API keys, no data leaving your machine. MIT-licensed, free. Docker-based setup handles most of the complexity. A config file defines which models to load and which backends to use (llama.cpp, whisper, stable diffusion, and more). CPU inference is supported, which means any machine can run it. GPU acceleration is faster but not required. Models download at first startup. Developers who want to swap out OpenAI API calls with local models point their existing code at LocalAI's endpoint and change nothing else. Good for privacy-sensitive applications, air-gapped environments, and teams that want to control costs without changing application code. The catch: local inference is slower than cloud for most hardware setups. Model selection lags the frontier. You get privacy and cost control; you give up raw performance and convenience.
CLIProxyAPI wraps existing AI coding CLIs (Gemini CLI, Claude Code, ChatGPT Codex, and others) and exposes them as OpenAI/Gemini/Claude-compatible API endpoints. The pitch is that you get access to models like Gemini 2.5 Pro and GPT-5 through their free CLI tiers, served as a standard API you can plug into any app. Let me be direct: this is a proxy that routes around pricing by using free CLI tools as backends, and it's exploding because free model access is irresistible. The homepage points to a subscription service at z.ai. The catch: this sits in a gray area. You're wrapping free CLI tools and serving them as APIs, which likely violates the terms of service for most of those CLIs. The sustainability of this approach depends entirely on providers not shutting it down. The MIT license covers the code, but the underlying model access is not yours to redistribute. Use at your own risk.
TensorRT-LLM squeezes maximum inference performance out of NVIDIA GPUs for large language models. It handles quantization (FP8, FP4, INT4), custom attention kernels, paged KV caching, and multi-GPU deployment through a Python API. If you are serving LLMs at scale on NVIDIA hardware, this is the optimization layer that makes the economics work. Running it yourself means you need NVIDIA GPUs, full stop. No AMD, no Apple Silicon, no CPU fallback. You will also need CUDA installed and compatible driver versions. The setup is not trivial, but NVIDIA provides containers and Docker images that smooth out the worst of it. Once running, the performance gains over naive PyTorch inference are substantial, often 2-4x throughput improvements. For teams already committed to NVIDIA hardware, TensorRT-LLM is the right call over vLLM when you need every last token per second. vLLM is easier to set up and supports more hardware. llama.cpp is better for local, single-GPU experimentation. TensorRT-LLM is for production serving where GPU cost is a real line item. The catch: you are locked to NVIDIA forever. The library only works on their GPUs, and if your cloud costs push you toward AMD or custom silicon, you are rewriting your inference stack from scratch.
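Quantization, the heart of those FP8/INT4 savings, is easy to sketch: map floats onto a small integer range with a scale factor. A toy symmetric int4 round-trip (real engines quantize per-channel or per-group and fuse the math into kernels; this only shows the principle):

```python
def quantize_int4(values):
    """Symmetric per-tensor int4: one float scale, integers in [-8, 7]."""
    scale = max(abs(v) for v in values) / 7 or 1.0
    q = [max(-8, min(7, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.12, -0.53, 0.91, -0.07, 0.33]
q, scale = quantize_int4(weights)
restored = dequantize(q, scale)
# Storage drops 8x vs float32; reconstruction error stays within half a step.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

Shrinking weights 4-8x means more of the model fits in VRAM and less memory bandwidth per token, which is most of where the 2-4x throughput figure comes from.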
omlx puts an LLM inference server in your macOS menu bar. Click the icon, pick a model, and you have a local AI API running. It uses continuous batching (handles multiple requests efficiently) and SSD caching (models load faster after the first time), optimized specifically for Apple Silicon. This is the easiest way to run local LLMs on a Mac right now. No Docker, no Python environments, no config files. Menu bar app, one click, done. The API is OpenAI-compatible, so any tool that talks to OpenAI can point at your local omlx instead. Apache 2.0 licensed, written in Python. The catch: Mac only. Apple Silicon specifically; Intel Macs are either unsupported or severely limited. Performance depends on your Mac's unified memory: 8GB will run small models; you need 32GB+ for anything serious. And 'menu bar simplicity' means less control over advanced settings like quantization, context length, and memory allocation.
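Continuous batching is easy to picture as a scheduler: sequences join and leave the decode loop on every step, instead of new requests waiting for the whole batch to drain. A toy simulation (our own illustration of the technique, not omlx's scheduler):

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Each step decodes one token for every active sequence; finished
    sequences leave and queued ones join immediately, instead of the
    whole batch draining first (static batching)."""
    queue = deque(requests)            # (request id, tokens left to generate)
    active, trace = [], []
    while queue or active:
        while queue and len(active) < max_batch:
            active.append(list(queue.popleft()))
        trace.append([rid for rid, _ in active])   # who shared this step
        for req in active:
            req[1] -= 1
        active = [r for r in active if r[1] > 0]
    return trace

steps = continuous_batching([("a", 2), ("b", 5), ("c", 1), ("d", 3), ("e", 2)])
# "e" starts as soon as "c" finishes -- no waiting for the batch to drain.
```

With static batching, request "e" would idle until "b" (the longest sequence) finished; here it slots in one step after "c" completes, which is why continuous batching keeps throughput high under mixed-length workloads.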
Running a big model on a small laptop sounds impossible; flash-moe makes it possible. It exploits the Mixture of Experts (MoE) architecture, running only the parts of the model that matter for each request and dramatically cutting the memory and compute needed. The pitch is simple: big-model intelligence on small hardware. Models that normally need 32GB+ of VRAM can run on a laptop with 8-16GB of regular RAM. It's slower than running on a GPU, but it works. The catch: growing explosively but very early. The 'runs on a laptop' promise depends heavily on the model and your hardware. And MoE optimization is an active research area. Expect the approach to evolve fast.
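The MoE trick in one function: a gate scores every expert, but only the top-k actually run. A toy router showing the general idea (an illustration, not flash-moe's code):

```python
import math

def route_top_k(gate_scores, k=2):
    """Pick the k highest-scoring experts and softmax their scores into
    mixing weights. Only those k experts run a forward pass, so compute
    and memory traffic scale with k, not with the total expert count."""
    chosen = sorted(range(len(gate_scores)),
                    key=lambda i: gate_scores[i], reverse=True)[:k]
    exps = [math.exp(gate_scores[i]) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

# 8 experts defined, but only 2 do any work for this token:
experts = route_top_k([0.1, 2.3, -0.5, 1.7, 0.0, 0.2, -1.1, 0.9], k=2)
```

This sparsity is what a tool in this space can exploit: the inactive experts' weights never need to be in fast memory for this token, so a model far larger than RAM can still be served, at the cost of latency when experts are swapped in.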