
tokenspeed
TokenSpeed is a speed-of-light LLM inference engine.
The Lens
TokenSpeed is an LLM inference engine aimed at agentic workloads. The claim is TensorRT-LLM-level performance with vLLM-level usability, which is bold positioning if it holds up. The architecture uses a local-SPMD modeling layer with static compilation and a C++ control plane with type-safe KV cache management. The team has shipped benchmarks against TensorRT-LLM on Kimi K2.5 that look favorable.
Hardware target is NVIDIA Blackwell (B200) right now, with Hopper and AMD MI350 optimization in progress. Setup involves the usual NVIDIA stack: CUDA, drivers, the lightseek.org/tokenspeed getting-started guide, and Blackwell-class hardware you almost certainly don't own personally. Currently it runs Kimi K2.5; Qwen, DeepSeek, and MiniMax support is in progress.
If you're standing up an inference service for agent workloads on Blackwell GPUs, this is worth evaluating against vLLM and TensorRT-LLM. Solo developers and small teams: stick with vLLM until TokenSpeed matures. Large teams running serious agent workloads on B200s: benchmark it on your own traffic; the agentic optimizations look real.
The catch: this is explicitly a preview/beta release. The README says "do not use this preview release for production deployments." Model coverage is thin, and the runtime is still gaining features like KV store and VLM support. Watch it, but don't bet your inference layer on it yet.
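For teams that do want to benchmark it against vLLM or TensorRT-LLM, a minimal comparison harness might look like the sketch below. It is an illustration only: `measure_throughput` and `compare` are hypothetical helper names, and each engine is represented by a caller-supplied function that sends one prompt and returns the number of output tokens produced. Nothing here reflects TokenSpeed's actual API; how you call each engine is left to you.

```python
import time
from statistics import median
from typing import Callable, Dict, List

# Hypothetical adapter type: takes a prompt, runs it on one engine,
# and returns how many output tokens were generated.
GenerateFn = Callable[[str], int]


def measure_throughput(
    generate: GenerateFn,
    prompts: List[str],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Return output tokens per second over one sequential pass of prompts."""
    start = clock()
    total_tokens = sum(generate(prompt) for prompt in prompts)
    elapsed = clock() - start
    # Guard against a zero-length interval on very fast stand-in functions.
    return total_tokens / elapsed if elapsed > 0 else float("inf")


def compare(
    engines: Dict[str, GenerateFn],
    prompts: List[str],
    runs: int = 3,
) -> Dict[str, float]:
    """Return the median tokens/sec per engine across several runs."""
    return {
        name: median(measure_throughput(generate, prompts) for _ in range(runs))
        for name, generate in engines.items()
    }


if __name__ == "__main__":
    # Stand-in engines that only sleep; replace with real client calls.
    def slow_engine(prompt: str) -> int:
        time.sleep(0.02)
        return 64

    def fast_engine(prompt: str) -> int:
        time.sleep(0.01)
        return 64

    results = compare(
        {"engine_a": slow_engine, "engine_b": fast_engine},
        prompts=["hello"] * 5,
    )
    for name, rate in sorted(results.items(), key=lambda kv: -kv[1]):
        print(f"{name}: {rate:.1f} tokens/sec")
```

This measures sequential throughput only. Agent workloads are usually concurrent and long-context, so a serious evaluation would also track time to first token and behavior under parallel load.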
Free vs Self-Hosted vs Paid
Free: MIT license, full source. No usage limits, no telemetry tier.
Self-hosted: Designed to run on your own NVIDIA Blackwell hardware. B200s start around $30,000 per card, so the GPU cost dwarfs everything else.
Paid: None. It's an open source inference engine; you pay for the hardware it runs on.
Free MIT inference engine if you can afford Blackwell GPUs and accept the beta label.
License: MIT License
Use freely, including commercial. Just keep the license.
Commercial use: ✓ Yes
About
- Owner: LightSeek Foundation (Organization)
- Stars: 1,087
- Forks: 105
Explore Further
More tools in the directory
everything-claude-code (189.4k ★)
The agent harness performance optimization system. Skills, instincts, memory, security, and research-first development for Claude Code, Codex, Opencode, Cursor and beyond.
ollama (172.1k ★)
Get up and running with Kimi-K2.5, GLM-5, MiniMax, DeepSeek, gpt-oss, Qwen, Gemma and other models.
hermes-agent (164.7k ★)
The agent that grows with you