Vincony

Speed Benchmarks

Compare inference speed (tokens/sec) across AI providers. Speed-optimized providers like Groq use custom silicon for dramatically faster output.

Fastest: Gemma 2 9B (Groq) at 950 tok/s

Output Speed (tokens/sec)

| Model | Provider | Tokens/sec | TTFT (ms) | Category |
|---|---|---|---|---|
| 🥇 Gemma 2 9B (Groq) | Groq | 950 | 30 | Speed-Optimized |
| 🥈 Llama 3.3 70B (Groq) | Groq | 820 | 45 | Speed-Optimized |
| 🥉 Llama 4 Maverick (Groq) | Groq | 710 | 55 | Speed-Optimized |
| Mixtral 8x7B (Groq) | Groq | 580 | 60 | Speed-Optimized |
| Gemini 2.5 Flash Lite | Google | 480 | 50 | Standard |
| Gemini 2.5 Flash | Google | 350 | 80 | Standard |
| GPT-5 Nano | OpenAI | 260 | 95 | Standard |
| GPT-5 Mini | OpenAI | 155 | 180 | Standard |
| Grok 3 | xAI | 140 | 200 | Standard |
| Devstral 2 | Mistral | 130 | 170 | Open Source |
| Claude Sonnet 4 | Anthropic | 125 | 230 | Standard |
| Gemini 2.5 Pro | Google | 120 | 250 | Standard |
| Gemini 3 Pro Preview | Google | 110 | 280 | Standard |
| Llama 4 Maverick | Meta | 100 | 260 | Open Source |
| DeepSeek V3 | DeepSeek | 95 | 280 | Open Source |
| Mistral Large | Mistral | 90 | 240 | Open Source |
| GPT-5 | OpenAI | 85 | 320 | Standard |
| GPT-5.2 | OpenAI | 75 | 380 | Standard |
| DeepSeek R1 | DeepSeek | 55 | 400 | Open Source |
| Claude Opus 4 | Anthropic | 42 | 450 | Standard |
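If you want to work with these figures programmatically, a minimal sketch (the `BENCHMARKS` list and `fastest` helper are illustrative names, with a few rows transcribed from the table above; the numbers are the same approximate community figures):

```python
# A few benchmark rows transcribed from the table above (approximate figures).
BENCHMARKS = [
    # (model, provider, tokens_per_sec, ttft_ms, category)
    ("Gemma 2 9B (Groq)", "Groq", 950, 30, "Speed-Optimized"),
    ("Llama 3.3 70B (Groq)", "Groq", 820, 45, "Speed-Optimized"),
    ("Gemini 2.5 Flash Lite", "Google", 480, 50, "Standard"),
    ("Claude Opus 4", "Anthropic", 42, 450, "Standard"),
]

def fastest(category: str):
    """Return rows in `category`, highest output speed first."""
    rows = [b for b in BENCHMARKS if b[4] == category]
    return sorted(rows, key=lambda b: b[2], reverse=True)

fastest("Speed-Optimized")[0][0]  # "Gemma 2 9B (Groq)"
```

Sorting on tokens/sec alone ignores TTFT, which dominates for very short responses; see the note on TTFT below.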

Custom Silicon

Groq's LPU (Language Processing Unit) is purpose-built for LLM inference, delivering 5–10× faster output than GPU-based providers.

TTFT Matters

Time-to-first-token (TTFT) is the delay before any output appears, and it strongly affects perceived speed. Speed-optimized providers often respond in under 60 ms, which feels instant to users.
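For streamed output, total response time is roughly TTFT plus output length divided by throughput. A quick sketch using figures from the table above (`total_latency_s` is an illustrative helper, not a provider API):

```python
def total_latency_s(ttft_ms: float, tokens: int, tok_per_s: float) -> float:
    """Approximate end-to-end time to stream `tokens` output tokens:
    time-to-first-token plus steady-state generation time."""
    return ttft_ms / 1000 + tokens / tok_per_s

# 500 output tokens at Groq-class speed (950 tok/s, 30 ms TTFT)
fast = total_latency_s(30, 500, 950)   # ~0.56 s
# the same 500 tokens from Claude Opus 4 (42 tok/s, 450 ms TTFT)
slow = total_latency_s(450, 500, 42)   # ~12.4 s
```

Note that for very short replies the TTFT term dominates, so a low-TTFT provider can feel faster even at a modest tokens/sec rate.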

Trade-offs

Speed-optimized providers currently support fewer models. Frontier models (GPT-5, Claude Opus) prioritize quality over raw throughput.

Benchmarks are approximate figures based on publicly available data and community testing. Actual performance varies by prompt length, concurrency, region, and model configuration.
