The Lens

TokenSpeed is an LLM inference engine aimed at agentic workloads. The claim is TensorRT-LLM-level performance with vLLM-level usability, which is bold positioning if it holds up. The architecture uses a local-SPMD modeling layer with static compilation and a C++ control plane with type-safe KV cache management. The team has shipped benchmarks against TensorRT-LLM on Kimi K2.5 that look favorable.

The hardware target is NVIDIA Blackwell (B200) right now, with Hopper and AMD MI350 optimization in progress. Setup involves the usual NVIDIA stack: CUDA, drivers, the lightseek.org/tokenspeed getting-started guide, and Blackwell-class hardware you almost certainly don't own personally. It currently runs Kimi K2.5; Qwen, DeepSeek, and MiniMax support is in progress.

If you're standing up an inference service for agent workloads on Blackwell GPUs, this is worth evaluating against vLLM and TensorRT-LLM. Solo developers and small teams: stick with vLLM until TokenSpeed matures. Large teams running serious agent workloads on B200s: benchmark it; the agentic optimizations look real.
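For the "benchmark it" advice, a minimal sketch of a comparison harness might look like the following. It makes no assumptions about TokenSpeed's actual API: `generate` is a hypothetical adapter you would write per engine (TokenSpeed, vLLM, TensorRT-LLM) that sends one prompt and returns the number of output tokens, and `fake_engine` is a stand-in so the sketch runs without a GPU. Requests are issued sequentially, which is a simplification; real agent traffic is concurrent.

```python
import statistics
import time
from typing import Callable, Dict, List


def benchmark(generate: Callable[[str], int], prompts: List[str]) -> Dict[str, float]:
    """Time each call to `generate` and summarize latency and throughput.

    `generate` is a hypothetical adapter (not a TokenSpeed API): it sends one
    prompt to an engine and returns the number of output tokens produced.
    """
    latencies: List[float] = []
    total_tokens = 0
    start = time.perf_counter()
    for prompt in prompts:
        t0 = time.perf_counter()
        total_tokens += generate(prompt)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start

    latencies.sort()
    # Nearest-rank style index for the 95th percentile, clamped to the last item.
    p95_index = min(len(latencies) - 1, int(0.95 * len(latencies)))
    return {
        "requests": float(len(latencies)),
        "p50_s": statistics.median(latencies),
        "p95_s": latencies[p95_index],
        "tokens_per_s": total_tokens / elapsed if elapsed > 0 else 0.0,
    }


def fake_engine(prompt: str) -> int:
    """Stand-in backend so the sketch runs without a GPU or server."""
    time.sleep(0.001)
    return len(prompt.split())


if __name__ == "__main__":
    prompts = ["plan the next tool call"] * 20
    print(benchmark(fake_engine, prompts))
```

Running the same prompt set through one adapter per engine gives numbers that are at least comparable with each other, which matters more here than any single vendor-published figure.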

The catch: this is explicitly a preview/beta release. The README says "do not use this preview release for production deployments." Model coverage is thin, and the runtime is still gaining features like KV store and VLM support. Watch it, but don't bet your inference layer on it yet.

Explore Further

GitHub Repository: Source code, issues, README

Reddit Discussions: Community opinions and use cases

Hacker News: HN threads and discussions

Dev.to Articles: Tutorials and write-ups

Tutorials & Guides: Getting started resources

Official Website: Docs, blog, and more


License: MIT License
