AI & LLMs · Setup Guide

Picking the Right Local Model for the Job

A decision framework for choosing between 7B, 13B, 32B, and 70B local models based on task, latency budget, and VRAM. No hype, just tradeoffs.

April 8, 2026 · 2 min read
Local LLMs · Model Selection · Ollama · Inference

"Which model should I run?" is the most common question I get. The answer is always "it depends," but not in a useless way — it depends on specific, enumerable things. Here's the framework.

Start with the task, not the model

The three task categories that matter for local inference:

  • Chat + casual reasoning: summarization, rewriting, basic Q&A. A 7B–8B model is usually enough.
  • Code generation and refactoring: needs at least 13B, and often 32B, to be useful. Smaller models hallucinate APIs.
  • Long-context reasoning: multi-file analysis, agentic workflows, document synthesis. 32B+ is the floor.

Most people over-spec. If you're using a 70B model to summarize emails, you're burning VRAM for nothing.

Then check the VRAM budget

Rough working numbers for 4-bit quantization, which is the sweet spot for most local use:

Model size | VRAM needed | Fits on
7B–8B      | 6–8 GB      | Most modern GPUs
13B        | 10–12 GB    | 3060 12GB, 4070, 4080
32B        | 20–24 GB    | 3090, 4090, 7900 XTX
70B        | 40+ GB      | Dual-GPU or 5090

If the model doesn't fit in VRAM, you'll offload to system RAM and tokens/sec falls off a cliff. Pick a size that fits comfortably with headroom for context.
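If you want to sanity-check a size not in the table, here's a back-of-the-envelope sketch. The numbers are rough assumptions on my part, not measurements: 4-bit weights at ~0.5 bytes per parameter, plus a couple of GB for runtime overhead and KV cache at a typical context length.

```python
def estimate_vram_gb(params_billion: float, bits: int = 4,
                     overhead_gb: float = 1.5, ctx_headroom_gb: float = 2.0) -> float:
    """Rough VRAM estimate for a quantized model.

    params_billion  : model size in billions of parameters (e.g. 8, 32, 70)
    bits            : quantization width (4-bit is the sweet spot discussed above)
    overhead_gb     : rough allowance for the runtime and activations (assumption)
    ctx_headroom_gb : rough allowance for KV cache at a typical context (assumption)
    """
    weights_gb = params_billion * bits / 8  # 1e9 params * (bits/8) bytes ~= GB
    return weights_gb + overhead_gb + ctx_headroom_gb

for size in (8, 13, 32, 70):
    print(f"{size}B @ 4-bit ~ {estimate_vram_gb(size):.0f} GB")
```

The output lands close to the table above; treat it as a floor, not a guarantee, since longer contexts and fatter quantizations eat more.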

Latency budget is the underrated constraint

For a chat-style use case, roughly 30 tokens/sec and up feels responsive. For an inline code completion tool, anything under 80 tokens/sec feels sluggish. For a background summarization job running on a cron, 10 tokens/sec is fine.

The model you pick has to deliver the latency your use case requires on the hardware you have. This is the number to measure before committing.
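Measuring it takes a few lines. A minimal sketch, assuming a local Ollama server on the default port; `eval_count` (generated tokens) and `eval_duration` (nanoseconds) come back in Ollama's /api/generate response, and the model tag is just an example.

```python
import requests

def tokens_per_sec(model: str, prompt: str,
                   host: str = "http://localhost:11434") -> float:
    """Measure generation speed for one prompt against a local Ollama server."""
    resp = requests.post(
        f"{host}/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    data = resp.json()
    # eval_count = tokens generated; eval_duration = generation time in nanoseconds
    return data["eval_count"] / (data["eval_duration"] / 1e9)

print(tokens_per_sec("qwen2.5:7b", "Summarize the plot of Hamlet in three sentences."))
```

Run it with a prompt that resembles your real workload; short prompts flatter the numbers.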

The decision tree, compressed

  1. What's the task? → picks minimum model size.
  2. What's the VRAM budget? → filters what will actually run.
  3. What's the latency target? → eliminates anything too slow on your hardware.
  4. Pick the smallest model that passes all three.

Anything bigger is waste. Anything smaller doesn't do the job.
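The same tree as code, if you want to make the filtering explicit. The catalog entries and tags here are illustrative placeholders, and `measured_tps` is whatever you got from the benchmark snippet above on your own hardware.

```python
# Hypothetical catalog; names, sizes, and "handles" labels are illustrative.
CATALOG = [
    {"name": "llama3.1:8b",       "vram_gb": 8,  "handles": "chat"},
    {"name": "qwen2.5-coder:14b", "vram_gb": 12, "handles": "code"},
    {"name": "qwen2.5:32b",       "vram_gb": 24, "handles": "long-context"},
    {"name": "llama3.3:70b",      "vram_gb": 40, "handles": "long-context"},
]

# Higher rank covers everything below it.
TASK_RANK = {"chat": 0, "code": 1, "long-context": 2}

def pick_model(task, vram_budget_gb, latency_target_tps, measured_tps):
    """Smallest model that (1) covers the task, (2) fits the VRAM budget,
    and (3) hits the latency target. measured_tps maps name -> tokens/sec
    measured on your own hardware."""
    candidates = [
        m for m in CATALOG
        if TASK_RANK[m["handles"]] >= TASK_RANK[task]               # 1. task sets the floor
        and m["vram_gb"] <= vram_budget_gb                          # 2. VRAM filters what runs
        and measured_tps.get(m["name"], 0) >= latency_target_tps    # 3. too slow = out
    ]
    return min(candidates, key=lambda m: m["vram_gb"], default=None)

# Example: code task, 16 GB card, 40 tok/s target, made-up measurements.
print(pick_model("code", 16, 40, {"llama3.1:8b": 70, "qwen2.5-coder:14b": 45}))
```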

Specific picks as of publication

  • General chat, 8GB GPU: Llama 3.1 8B or Qwen 2.5 7B.
  • Code, 12–16GB GPU: Qwen 2.5 Coder 14B.
  • Reasoning + long context, 24GB GPU: Qwen 2.5 32B or DeepSeek R1 Distill 32B.
  • Heavy lifting, 40GB+: Llama 3.3 70B or DeepSeek R1 70B distill.

These will date. The framework won't.
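If you're running the picks through Ollama, a minimal non-streaming chat call looks roughly like this. The model tag is one of the picks above; tags in the Ollama library change over time, so verify before pulling.

```python
import requests

# Minimal chat call to a local Ollama server (default port 11434).
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.1:8b",  # example tag; check the Ollama library
        "messages": [{"role": "user", "content": "Summarize this email thread: ..."}],
        "stream": False,
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["message"]["content"])
```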

Follow Code_Racoon

New guides, benchmarks, and tools.