"Which model should I run?" is the most common question I get. The answer is always "it depends," but not in a useless way — it depends on specific, enumerable things. Here's the framework.
Start with the task, not the model
The three task categories that matter for local inference:
- Chat + casual reasoning: summarization, rewriting, basic Q&A. A 7B–8B model is usually enough.
- Code generation and refactoring: needs at least 13B to be useful, and 32B-class models are noticeably better. Smaller models hallucinate APIs.
- Long-context reasoning: multi-file analysis, agentic workflows, document synthesis. 32B+ is the floor.
Most people over-spec. If you're using a 70B model to summarize emails, you're burning VRAM for nothing.
Then check the VRAM budget
Rough working numbers for 4-bit quantization, which is the sweet spot for most local use:
| Model size | VRAM needed | Fits on |
|---|---|---|
| 7B–8B | 6–8 GB | Any 8 GB consumer GPU |
| 13B | 10–12 GB | 3060 12GB, 4070, 4080 |
| 32B | 20–24 GB | 3090, 4090, 7900 XTX |
| 70B | 40+ GB | Dual-GPU (e.g. 2× 3090) or 48 GB workstation cards |
If the model doesn't fit in VRAM, you'll offload to system RAM and tokens/sec falls off a cliff. Pick a size that fits comfortably with headroom for context.
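The table is easy to sanity-check for models it doesn't cover, because the arithmetic is simple enough to script. Here's a minimal sketch: quantized weights plus an fp16 KV cache, assuming Llama-class architecture defaults (32 layers, 8 KV heads, head dim 128) and a flat ~10% runtime overhead. Those defaults are my assumptions, not universal; plug in the real config for your model.

```python
def est_vram_gb(params_b: float, quant_bits: int = 4, n_layers: int = 32,
                n_kv_heads: int = 8, head_dim: int = 128,
                ctx_len: int = 8192) -> float:
    """Rough footprint: quantized weights plus an fp16 KV cache.

    Architecture defaults approximate an 8B Llama-class model
    (assumed values); swap in the real layers/KV-heads/head-dim.
    """
    weight_bytes = params_b * 1e9 * quant_bits / 8
    # K + V, per layer, per KV head, fp16 (2 bytes per element)
    kv_bytes = 2 * n_layers * n_kv_heads * head_dim * ctx_len * 2
    return (weight_bytes * 1.1 + kv_bytes) / 1e9  # ~10% runtime overhead

print(f"{est_vram_gb(8):.1f} GB")  # ~5.5 GB: inside the 6-8 GB row
```

An 8B model at 4-bit with an 8k context lands around 5.5 GB, which is why the 6–8 GB row holds with room to spare for a longer context.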
Latency budget is the underrated constraint
For a chat-style use case, roughly 30 tokens/sec is comfortable, about as fast as people read. For an inline code completion tool, anything under 80 tokens/sec feels sluggish. For a background summarization job running on a cron, 10 tokens/sec is fine.
The model you pick has to deliver the latency your use case requires on the hardware you have. This is the number to measure before committing.
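Measuring it takes a few lines if you serve the model through Ollama: the `/api/generate` response includes `eval_count` and `eval_duration` (in nanoseconds) when streaming is off. A quick sketch; the model tag and prompt are placeholders:

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def tokens_per_sec(model: str, prompt: str) -> float:
    """Decode throughput for one generation against a local Ollama server."""
    r = requests.post(OLLAMA_URL,
                      json={"model": model, "prompt": prompt, "stream": False},
                      timeout=600)
    r.raise_for_status()
    stats = r.json()
    # eval_count = tokens generated; eval_duration is in nanoseconds
    return stats["eval_count"] / (stats["eval_duration"] / 1e9)

print(f"{tokens_per_sec('llama3.1:8b', 'Summarize TCP briefly.'):.1f} tok/s")
```

Run it with your actual prompts, not a toy one: long prompts change prefill time, and long generations are where throughput settles.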
The decision tree, compressed
- What's the task? → picks minimum model size.
- What's the VRAM budget? → filters what will actually run.
- What's the latency target? → eliminates anything too slow on your hardware.
- Pick the smallest model that passes all three.
Anything bigger is waste. Anything smaller doesn't do the job.
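The whole tree fits in a few lines of code. A sketch, with hypothetical candidates and made-up throughput numbers; the size floors come from the task categories above, and the only real inputs are measurements from your own machine:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    size_b: int     # parameters, in billions
    vram_gb: float  # 4-bit footprint including context
    tok_s: float    # measured on YOUR hardware, not a spec sheet

# Minimum useful size per task, per the categories above
MIN_SIZE_B = {"chat": 7, "code": 13, "long_context": 32}

def pick(task: str, vram_budget_gb: float, min_tok_s: float,
         candidates: list[Candidate]) -> Candidate | None:
    viable = [c for c in candidates
              if c.size_b >= MIN_SIZE_B[task]     # 1. task sets the floor
              and c.vram_gb <= vram_budget_gb     # 2. VRAM filters what runs
              and c.tok_s >= min_tok_s]           # 3. latency cuts the slow
    # 4. smallest model that passes all three
    return min(viable, key=lambda c: c.size_b, default=None)

# Hypothetical numbers; measure your own
models = [Candidate("llama3.1:8b", 8, 6.0, 55.0),
          Candidate("qwen2.5-coder:14b", 14, 11.0, 32.0),
          Candidate("qwen2.5:32b", 32, 21.0, 14.0)]
print(pick("code", vram_budget_gb=12, min_tok_s=25, candidates=models))
```

With those inputs it returns the 14B coder: the 8B is below the task floor, the 32B doesn't fit the VRAM budget, and the 14B clears all three gates.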
Specific picks as of publication
- General chat, 8GB GPU: Llama 3.1 8B or Qwen 2.5 7B.
- Code, 12–16GB GPU: Qwen 2.5 Coder 14B.
- Reasoning + long context, 24GB GPU: Qwen 2.5 32B or DeepSeek R1 Distill 32B.
- Heavy lifting, 40GB+: Llama 3.3 70B or DeepSeek R1 70B distill.
These will date. The framework won't.