H Company releases Holotron-12B, a 12B hybrid SSM-Attention computer-use model delivering 2x throughput over its predecessor on a single H100.
H Company released Holotron-12B, a multimodal computer-use model post-trained from NVIDIA's Nemotron-Nano-2 VL on proprietary data. The model uses a hybrid State-Space Model and attention architecture that eliminates KV cache growth, achieving 8.9k tokens/s at concurrency 100 on a single H100 GPU via vLLM v0.14.1. It delivers more than 2x the throughput of the prior Holo2-8B on the WebVoyager benchmark. The model is available now on Hugging Face under NVIDIA's Open Model License, with a next-gen Nemotron 3 Omni variant already in preparation.
Holotron-12B's hybrid SSM-Attention architecture is a direct answer to the KV cache bottleneck that kills throughput in long-context agentic workloads. At 8.9k tokens/s on a single H100 with 100 concurrent workers, this is a meaningful infrastructure unlock rather than a benchmark trick. The model runs on vLLM v0.14.1 today, so your existing serving stack likely needs only a version bump to deploy it.
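To see why eliminating KV cache growth matters at concurrency 100, here is a back-of-envelope memory comparison. All model dimensions below (layer count, head count, state size) are hypothetical illustration values, not Holotron-12B's actual config:

```python
# Rough per-sequence decoder memory: transformer KV cache vs. fixed SSM state.
# All dimensions are hypothetical, chosen only to show the scaling behavior.

def kv_cache_bytes(context_len, n_layers=40, n_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    """Standard transformer KV cache: grows linearly with context length.
    Factor of 2 covers keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * context_len

def ssm_state_bytes(n_layers=40, d_state=128, d_model=4096, bytes_per_elem=2):
    """Recurrent SSM state: constant size regardless of context length."""
    return n_layers * d_state * d_model * bytes_per_elem

for ctx in (1_000, 32_000, 128_000):
    print(f"context {ctx:>7}: KV cache ~ {kv_cache_bytes(ctx) / 1e9:.2f} GB/sequence")
print(f"SSM state (any context) ~ {ssm_state_bytes() / 1e9:.3f} GB/sequence")
```

Under these assumed dimensions, a 128k-token trace costs ~21 GB of KV cache per sequence on a pure transformer, while the SSM state stays around 0.04 GB; multiplied by 100 concurrent workers, that gap is the whole story behind the throughput claim.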
Pull Holotron-12B from Hugging Face and run a head-to-head throughput test against your current vision-language model on a 10-screenshot agentic trace using vLLM's benchmark tooling; if tokens/s doubles, you have roughly halved your GPU bill for computer-use workloads.
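A minimal sketch of that head-to-head, assuming both models are served behind OpenAI-compatible vLLM endpoints; the URLs and model names below are placeholders, and the network calls are left commented out:

```python
# Tokens/s comparison harness for two OpenAI-compatible endpoints (stdlib only).
import concurrent.futures
import json
import time
import urllib.request

def throughput(completion_token_counts, elapsed_seconds):
    """Aggregate decode throughput across a batch of concurrent requests."""
    return sum(completion_token_counts) / elapsed_seconds

def one_request(base_url, model, prompt, max_tokens=256):
    """Send one chat completion and return its completion token count."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode()
    req = urllib.request.Request(f"{base_url}/chat/completions", data=body,
                                 headers={"Content-Type": "application/json"})
    resp = json.load(urllib.request.urlopen(req))
    return resp["usage"]["completion_tokens"]

def measure(base_url, model, concurrency=8):
    """Fire `concurrency` requests at once and report tokens/s."""
    prompt = "Describe the next UI action for submitting this form."
    start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(concurrency) as pool:
        counts = list(pool.map(lambda _: one_request(base_url, model, prompt),
                               range(concurrency)))
    return throughput(counts, time.perf_counter() - start)

# Compare the two servers (placeholder URLs/names):
# print(measure("http://localhost:8000/v1", "h-company/holotron-12b"))
# print(measure("http://localhost:8001/v1", "your-current-vlm"))
```

Crank `concurrency` up toward your production load; the hybrid architecture's advantage should widen as concurrent context grows.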
Run this in your terminal: `pip install vllm==0.14.1 && python -m vllm.entrypoints.openai.api_server --model h-company/holotron-12b`, then send a multimodal prompt with 3 screenshots via the OpenAI-compatible endpoint and compare response latency against your current model.
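The multimodal request can be sketched as follows, assuming the server started above and using OpenAI-style base64 `image_url` content parts; the screenshot paths are placeholders and the request itself is left commented out:

```python
# Build and time a multimodal chat request against the local vLLM server
# (stdlib only; screenshot paths are placeholders).
import base64
import json
import time
import urllib.request

def image_part(path):
    """Encode a screenshot as an OpenAI-style image_url content part."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"}}

def build_payload(model, text, screenshot_paths, max_tokens=256):
    """Assemble a chat-completions body: one text part plus N image parts."""
    parts = [{"type": "text", "text": text}]
    parts += [image_part(p) for p in screenshot_paths]
    return {"model": model,
            "messages": [{"role": "user", "content": parts}],
            "max_tokens": max_tokens}

# payload = build_payload("h-company/holotron-12b",
#                         "What should I click next?",
#                         ["shot1.png", "shot2.png", "shot3.png"])
# req = urllib.request.Request("http://localhost:8000/v1/chat/completions",
#                              data=json.dumps(payload).encode(),
#                              headers={"Content-Type": "application/json"})
# t0 = time.perf_counter()
# resp = json.load(urllib.request.urlopen(req))
# print(f"latency: {time.perf_counter() - t0:.2f}s")
```

Run the same three-screenshot prompt against your incumbent model's endpoint and compare wall-clock latency under identical `max_tokens`.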