"Which model should I run?" is the most common question I get. The answer is always "it depends," but not in a useless way — it depends on specific, enumerable things. Here's the framework.
Start with the task, not the model
The three task categories that matter for local inference:
- Chat + casual reasoning: summarization, rewriting, basic Q&A. A 7B–8B model is usually enough.
- Code generation and refactoring: needs at least 13B to be useful, and 32B-class models are noticeably better. Smaller models hallucinate APIs.
- Long-context reasoning: multi-file analysis, agentic workflows, document synthesis. 32B+ is the floor.
Most people over-spec. If you're using a 70B model to summarize emails, you're burning VRAM for nothing.
Then check the VRAM budget
Rough working numbers for 4-bit quantization, which is the sweet spot for most local use:
| Model size | VRAM needed | Fits on |
|---|---|---|
| 7B–8B | 6–8 GB | Any 8 GB consumer GPU |
| 13B | 10–12 GB | 3060 12GB, 4070, 4080 |
| 32B | 20–24 GB | 3090, 4090, 7900 XTX |
| 70B | 40+ GB | Dual-GPU (e.g. 2× 3090) or 48 GB workstation cards |
If the model doesn't fit in VRAM, you'll offload to system RAM and tokens/sec falls off a cliff. Pick a size that fits comfortably with headroom for context.
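The table is easy to sanity-check for models it doesn't cover, because the arithmetic is simple enough to script. Here's a minimal sketch: quantized weights plus an fp16 KV cache, assuming Llama-class architecture defaults (32 layers, 8 KV heads, head dim 128) and a flat ~10% runtime overhead. Those defaults are my assumptions, not universal; plug in the real config for your model.

```python
def est_vram_gb(params_b: float, quant_bits: int = 4, n_layers: int = 32,
                n_kv_heads: int = 8, head_dim: int = 128,
                ctx_len: int = 8192) -> float:
    """Rough footprint: quantized weights plus an fp16 KV cache.

    Architecture defaults approximate an 8B Llama-class model
    (assumed values); swap in the real layers/KV-heads/head-dim.
    """
    weight_bytes = params_b * 1e9 * quant_bits / 8
    # K + V, per layer, per KV head, fp16 (2 bytes per element)
    kv_bytes = 2 * n_layers * n_kv_heads * head_dim * ctx_len * 2
    return (weight_bytes * 1.1 + kv_bytes) / 1e9  # ~10% runtime overhead

print(f"{est_vram_gb(8):.1f} GB")  # ~5.5 GB: inside the 6-8 GB row
```

An 8B model at 4-bit with an 8k context lands around 5.5 GB, which is why the 6–8 GB row holds with room to spare for a longer context.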
Latency budget is the underrated constraint
For a chat-style use case, roughly 30 tokens/sec is comfortable, about as fast as people read. For an inline code completion tool, anything under 80 tokens/sec feels sluggish. For a background summarization job running on a cron, 10 tokens/sec is fine.
The model you pick has to deliver the latency your use case requires on the hardware you have. This is the number to measure before committing.
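Measuring it takes a few lines if you serve the model through Ollama: the `/api/generate` response includes `eval_count` and `eval_duration` (in nanoseconds) when streaming is off. A quick sketch; the model tag and prompt are placeholders:

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def tokens_per_sec(model: str, prompt: str) -> float:
    """Decode throughput for one generation against a local Ollama server."""
    r = requests.post(OLLAMA_URL,
                      json={"model": model, "prompt": prompt, "stream": False},
                      timeout=600)
    r.raise_for_status()
    stats = r.json()
    # eval_count = tokens generated; eval_duration is in nanoseconds
    return stats["eval_count"] / (stats["eval_duration"] / 1e9)

print(f"{tokens_per_sec('llama3.1:8b', 'Summarize TCP briefly.'):.1f} tok/s")
```

Run it with your actual prompts, not a toy one: long prompts change prefill time, and long generations are where throughput settles.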
The decision tree, compressed
- What's the task? → picks minimum model size.
- What's the VRAM budget? → filters what will actually run.
- What's the latency target? → eliminates anything too slow on your hardware.
- Pick the smallest model that passes all three.
Anything bigger is waste. Anything smaller doesn't do the job.
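The whole tree fits in a few lines of code. A sketch, with hypothetical candidates and made-up throughput numbers; the size floors come from the task categories above, and the only real inputs are measurements from your own machine:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    size_b: int     # parameters, in billions
    vram_gb: float  # 4-bit footprint including context
    tok_s: float    # measured on YOUR hardware, not a spec sheet

# Minimum useful size per task, per the categories above
MIN_SIZE_B = {"chat": 7, "code": 13, "long_context": 32}

def pick(task: str, vram_budget_gb: float, min_tok_s: float,
         candidates: list[Candidate]) -> Candidate | None:
    viable = [c for c in candidates
              if c.size_b >= MIN_SIZE_B[task]     # 1. task sets the floor
              and c.vram_gb <= vram_budget_gb     # 2. VRAM filters what runs
              and c.tok_s >= min_tok_s]           # 3. latency cuts the slow
    # 4. smallest model that passes all three
    return min(viable, key=lambda c: c.size_b, default=None)

# Hypothetical numbers; measure your own
models = [Candidate("llama3.1:8b", 8, 6.0, 55.0),
          Candidate("qwen2.5-coder:14b", 14, 11.0, 32.0),
          Candidate("qwen2.5:32b", 32, 21.0, 14.0)]
print(pick("code", vram_budget_gb=12, min_tok_s=25, candidates=models))
```

With those inputs it returns the 14B coder: the 8B is below the task floor, the 32B doesn't fit the VRAM budget, and the 14B clears all three gates.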
Specific picks as of publication
- General chat, 8GB GPU: Llama 3.1 8B or Qwen 2.5 7B.
- Code, 12–16GB GPU: Qwen 2.5 Coder 14B.
- Reasoning + long context, 24GB GPU: Qwen 2.5 32B or DeepSeek R1 Distill 32B.
- Heavy lifting, 40GB+: Llama 3.3 70B or DeepSeek R1 70B distill.
These will date. The framework won't.