Running large language models locally has gone from a niche experiment to a legitimate workflow option. DeepSeek R1 is one of the most capable open-weight models available, and with the right hardware, you can run it on your own machine with zero API costs and full privacy.
This guide covers everything from initial setup through optimization, with real benchmark numbers from my own testing.
Why Run DeepSeek R1 Locally?
There are three practical reasons to run inference locally rather than hitting an API:
- Privacy. Your prompts and data never leave your machine.
- Cost. After the hardware investment, inference is effectively free.
- Latency. No network round-trips, no rate limits, no queue wait times.
The tradeoff is upfront hardware cost and setup time. Whether that tradeoff makes sense depends on your use case.
Hardware Requirements
DeepSeek R1 comes in several quantization levels. Here is what you actually need:
| Quantization | VRAM Required | Recommended GPU |
|---|---|---|
| Q4_K_M | ~24 GB | RTX 4090 / RTX 5090 |
| Q5_K_M | ~32 GB | RTX 5090 / 2x RTX 3090 |
| Q8_0 | ~48 GB | A6000 / 2x RTX 4090 |
| FP16 | ~96 GB | Multi-GPU or A100 |
For most developers, Q4_K_M on a single RTX 4090 or 5090 is the sweet spot. You get 90%+ of the full model quality at a fraction of the memory cost.
Step 1: Install Ollama
Ollama is the fastest way to get a local model running. It handles model downloads and quantization management, and serves a local API.
```bash
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows
# Download from https://ollama.com/download
```

Verify the installation:

```bash
ollama --version
# ollama version 0.6.2
```
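Ollama also serves a local API on port 11434 (the same one used in Step 4 below). If you want to confirm the server is reachable, and not just the CLI, a minimal sketch assuming the default port:

```javascript
// Confirm the local Ollama server is up (default port 11434 assumed)
const res = await fetch("http://localhost:11434/api/version");
console.log(await res.json()); // e.g. { version: "0.6.2" }
```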
Step 2: Pull the Model

```bash
ollama pull deepseek-r1:latest
```

This downloads the default quantization (Q4_K_M, ~24 GB). The download takes 10-30 minutes depending on your connection.
For a specific quantization:
```bash
ollama pull deepseek-r1:q5_k_m
```
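To confirm the model is actually available locally, run ollama list, or query the tags endpoint of the local API. A minimal sketch, assuming the default port:

```javascript
// List locally downloaded models (default port 11434 assumed)
const res = await fetch("http://localhost:11434/api/tags");
const { models } = await res.json();
console.log(models.map((m) => m.name)); // should include a deepseek-r1 tag
```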
ollama run deepseek-r1 "Explain the difference between async and parallel execution in Python"You should see tokens streaming within a few seconds. If your GPU has enough VRAM, the model loads entirely into GPU memory and inference is fast.
Step 4: Use the API
Ollama exposes a local REST API on port 11434. You can integrate it into any application:
```javascript
const response = await fetch("http://localhost:11434/api/generate", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "deepseek-r1",
    prompt: "Write a Python function that implements binary search",
    stream: false,
  }),
});

const data = await response.json();
console.log(data.response);
```

For streaming responses, set stream: true and process the response as newline-delimited JSON.
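Here is a minimal sketch of that streaming path, assuming Node 18+ (or another runtime whose fetch response body is async-iterable). Each line of the body is one JSON object carrying a response fragment; the final object has done: true.

```javascript
// Stream tokens as they arrive: each line of the body is a JSON object
const response = await fetch("http://localhost:11434/api/generate", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "deepseek-r1",
    prompt: "Write a Python function that implements binary search",
    stream: true,
  }),
});

const decoder = new TextDecoder();
let buffer = "";
for await (const chunk of response.body) {
  buffer += decoder.decode(chunk, { stream: true });
  const lines = buffer.split("\n");
  buffer = lines.pop(); // keep any partial object for the next chunk
  for (const line of lines) {
    if (!line.trim()) continue;
    const data = JSON.parse(line);
    process.stdout.write(data.response ?? "");
    if (data.done) console.log(); // the final object also carries timing stats
  }
}
```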
Performance Benchmarks
I tested DeepSeek R1 Q4_K_M on three different GPU configurations using a standardized prompt set of 50 coding and reasoning tasks:
| GPU | Tokens/sec | Time to First Token | VRAM Used |
|---|---|---|---|
| RTX 4090 (24 GB) | 42 t/s | 180 ms | 22.1 GB |
| RTX 5090 (32 GB) | 68 t/s | 120 ms | 22.1 GB |
| M4 Max (48 GB unified) | 28 t/s | 240 ms | 23.8 GB |
The RTX 5090 is the clear performance winner. The M4 Max is competitive for its form factor but cannot match a dedicated NVIDIA GPU on raw inference speed.
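If you want to sanity-check these numbers on your own hardware, the /api/generate response includes eval_count (generated tokens) and eval_duration (in nanoseconds), which is enough for a rough tokens-per-second figure. A minimal sketch, not a full benchmark harness:

```javascript
// Rough throughput check using the eval stats Ollama returns with each response
const res = await fetch("http://localhost:11434/api/generate", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "deepseek-r1",
    prompt: "Explain the difference between a mutex and a semaphore",
    stream: false,
  }),
});
const data = await res.json();
// eval_count = generated tokens, eval_duration = generation time in nanoseconds
console.log(`${(data.eval_count / (data.eval_duration / 1e9)).toFixed(1)} tokens/sec`);
```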
Optimization Tips
1. Pin the model in memory. By default, Ollama unloads models after 5 minutes of inactivity. For development use, keep it loaded:
```bash
ollama run deepseek-r1 --keepalive -1
```
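If you drive the model through the API rather than the CLI, the same effect is available per request via the keep_alive parameter; a negative value keeps the model loaded indefinitely. A minimal sketch:

```javascript
// Keep the model resident between requests by setting keep_alive on the call itself
const response = await fetch("http://localhost:11434/api/generate", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "deepseek-r1",
    prompt: "...",
    keep_alive: -1, // negative value keeps the model loaded until you unload it
  }),
});
```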
2. Use the right context window. DeepSeek R1 supports up to 128K context, but larger contexts use more VRAM and slow down inference. For most coding tasks, 8K-16K is sufficient:

```javascript
const response = await fetch("http://localhost:11434/api/generate", {
  method: "POST",
  body: JSON.stringify({
    model: "deepseek-r1",
    prompt: "...",
    options: { num_ctx: 8192 },
  }),
});
```

3. Monitor GPU utilization. Use nvidia-smi to verify the model is running on GPU, not falling back to CPU:
```bash
watch -n 1 nvidia-smi
```

If VRAM usage is near zero during inference, something is wrong with your CUDA setup.
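You can also ask Ollama directly what is loaded and where: ollama ps on the CLI, or the /api/ps endpoint. A minimal sketch that reports how much of the model is resident in VRAM:

```javascript
// Check which models are loaded and how much of each sits in GPU memory
const res = await fetch("http://localhost:11434/api/ps");
const { models } = await res.json();
for (const m of models) {
  // size is the total model footprint; size_vram is the portion resident on the GPU
  console.log(`${m.name}: ${(m.size_vram / 1e9).toFixed(1)} GB of ${(m.size / 1e9).toFixed(1)} GB in VRAM`);
}
```

If size_vram is well below the model's total size, part of it has been offloaded to CPU memory and inference will be slow.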
Key Takeaways
- DeepSeek R1 runs well on consumer GPUs with Q4_K_M quantization
- Ollama is the simplest way to get started — install, pull, run
- RTX 5090 delivers ~68 tokens/sec, making it genuinely usable for interactive coding
- Local inference means zero API costs and full data privacy
- The setup takes under 30 minutes from scratch
Tools and Resources
- Ollama — local model runner
- DeepSeek R1 on Hugging Face — model weights and documentation
- Open WebUI — browser-based chat interface for local models