You pull a 9B coding model in Ollama, ask it to write a simple function, and your laptop sounds like it’s trying to take flight. Three minutes pass. The fan doesn’t slow down. The output finally arrives — at roughly the pace of someone typing with one finger. You check nvidia-smi. The GPU is maxed out. The model is mostly running on your CPU.
That’s a VRAM problem, and it has a specific, fixable cause.
The 8GB Rule: What Your Machine Can Actually Run
How Model Size Translates to Memory
A 9-billion parameter model (like the popular Qwen3.5 9B, though smaller alternatives like Gemma 4 E4B exist) stored in FP16 (the default, uncompressed format) requires approximately 18GB of VRAM — 2 bytes per parameter. That immediately rules out any standard consumer GPU.
Quantization changes the math. At Q4_K_M (4-bit medium precision), a 9B-class model weighs approximately 5.0GB to 5.4GB. Your OS and display drivers consume another 1-2GB. That leaves a workable budget: an 8GB GPU can run a 9B model, with room for the context window — provided you manage that context aggressively.
The 8GB threshold is where “possible” becomes “practical.” You’re not barely running the model — you have enough headroom to maintain a useful conversation window and keep generation above 40 tokens/sec.
Dedicated VRAM vs. Apple Unified Memory
Nvidia and AMD GPUs use dedicated GDDR6/X VRAM — high-bandwidth memory physically attached to the card, isolated from system RAM. When a model fits in VRAM, inference is fast. When it doesn’t, Ollama offloads layers over the PCIe bus to system RAM, and generation speed collapses from 50 tokens/sec to 3.
Apple Silicon works differently. The M-series chips use a Unified Memory Architecture (UMA) where CPU and GPU share the same memory pool. A MacBook Pro with 32GB of unified memory can load a 9B or 12B model and maintain a large context window without any offloading penalty. The trade-off is raw bandwidth — M3/M4 Pro memory bandwidth runs around 150GB/s, roughly half of what a high-end Nvidia card delivers, but still far above what PCIe-routed system RAM can provide.
Check your memory budget before pulling a model:
# Windows/Linux (Nvidia)
nvidia-smi --query-gpu=memory.total,memory.free --format=csv
# macOS (Apple Silicon unified memory check)
sysctl hw.memsize
(Note for advanced Mac users: run sysctl iogpu.wired_limit_mb to see your exact GPU memory ceiling).
The Magic of Quantization
What Quantization Does
A model weight in FP16 takes 2 bytes. The same weight in 4-bit integer takes 0.5 bytes. Quantization converts those high-precision floats to low-precision integers — trading a small amount of accuracy for a 75% reduction in memory footprint.
GGUF (GPT-Generated Unified Format) is the standard container format for quantized models. It’s what Ollama uses internally. When you run ollama pull qwen3.5:9b-instruct-q4_K_M, you’re downloading a GGUF file with 4-bit weights in the K_M (K-means, medium) quantization scheme.
Q4_K_M applies more precision to layers that matter more for output quality. It’s not uniform — the attention layers get more bits than the feedforward layers. The result is a model that behaves nearly identically to the FP16 version for code generation tasks, at a quarter of the memory cost.
Why Q4_K_M Is the Right Choice
| Quantization | 9B Model Size | Fits 8GB VRAM? | Code Quality | Tokens/sec (RTX 3070) |
|---|---|---|---|---|
| FP16 | 18GB | No | Baseline | — |
| Q8_0 | 9.5GB | Borderline | 99% | — |
| Q4_K_M | 5.2GB | Yes | 97% | 40–45 |
| Q3_K_M | 4.2GB | Yes | 90% | 50–55 |
| Q2_K | 3.0GB | Yes | 70% | 60+ |
Q4_K_M dominates: fits comfortably in 8GB, runs at good speed, and maintains near-baseline accuracy. Q3 and Q2 save memory, but the logical reasoning degradation shows up fast on anything more complex than autocomplete.
Pull the specific quantization explicitly — while Ollama often defaults to Q4_K_M, being explicit ensures predictability across environments:
# Explicit tag ensures you get exactly the quantization you expect
ollama pull qwen3.5:9b-instruct-q4_K_M
The Hidden Cost: Context Window and KV Cache
How Context Eats Your Memory
VRAM usage doesn’t stop when the model loads. As you send prompts and receive responses, Ollama grows a Key-Value (KV) cache — a block of VRAM that stores the attention vectors for every token in your conversation history.
The KV cache grows linearly with context length. For Qwen3.5 at Q4_K_M on an 8GB card:
- Model weights: ~5.2GB
- OS overhead: ~1.5GB
- Remaining for KV cache: ~1.3GB
At num_ctx 8192 with standard KV precision, that 1.3GB is more than sufficient. Because Qwen 3.5 uses Grouped Query Attention (GQA), its KV cache is highly efficient—1.3GB can actually hold upwards of 16,000 to 32,000 tokens. This gives an 8GB card massive headroom for deep coding sessions without offloading.
When the KV cache exceeds available VRAM, Ollama doesn’t error out gracefully. It starts offloading layers to system RAM, and your 45 tokens/sec drops to 3. Generation appears stuck. The fan spins up. You sit wondering if the model crashed.
The model isn’t crashed. You ran out of memory.
Diagram
Visualizes the VRAM budget of an 8GB GPU running a Q4_K_M quantized 9B model. When the KV Cache (context window) grows beyond available VRAM, Ollama offloads data across the PCIe bus to slower system RAM, causing a severe performance drop.
Visual Notes:
- Represents the hard limit of VRAM before offloading occurs.
- The PCIe bottleneck is the primary cause of token generation speed dropping to single digits.
Managing Context in Ollama
Set a hard cap on num_ctx in your Modelfile to prevent the cache from ever overflowing:
FROM qwen3.5:9b-instruct-q4_K_M
# Safe ceiling for 8GB GPU; drop to 4096 if you're seeing lag spikes
PARAMETER num_ctx 8192
For 8GB cards, 8,192 tokens was traditionally the practical ceiling with standard KV precision on older architectures. However, with GQA models like Qwen 3.5, you can often push this to 16,384 or beyond without spilling out of VRAM.
You can also reduce the KV cache’s own memory footprint with KV cache quantization. This requires enabling Flash Attention as well:
# Enables Flash Attention and shrinks the KV cache to 4-bit, roughly halving cache VRAM usage
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q4_0
ollama serve
# Windows PowerShell
$env:OLLAMA_FLASH_ATTENTION = "1"
$env:OLLAMA_KV_CACHE_TYPE = "q4_0"
ollama serve
(Note: The commands above run ollama serve in the foreground. If Ollama runs as a background system service or taskbar app, you need to apply these environment variables in your system settings and restart the service.)
With Flash Attention and OLLAMA_KV_CACHE_TYPE=q4_0, an 8GB GPU can sustain context windows of 32,000+ tokens for GQA models while keeping all layers in VRAM.
Hands-On Example: Benchmarking Your Setup
Here’s how to verify your configuration before committing to a full workflow.
Step 1: Pull the model with explicit quantization
ollama pull qwen3.5:9b-instruct-q4_K_M
Step 2: Create a Modelfile with a locked context window
# Modelfile.bench
FROM qwen3.5:9b-instruct-q4_K_M
PARAMETER num_ctx 4096
SYSTEM "You are a coding assistant. Return only code."
Step 3: Build and run the benchmark model
ollama create bench-coder -f Modelfile.bench
ollama run bench-coder "Write a Python function that parses a CSV file and returns a list of dicts."
Step 4: Monitor VRAM while it generates
# Open a second terminal
watch -n 1 nvidia-smi
# macOS: install and run asitop
sudo asitop
You’re looking for two things: GPU utilization at 90–100%, and generation speed above 30 tokens/sec. If you see CPU offloading in nvidia-smi or generation drops below 10 tokens/sec, the context window is too large or the quantization doesn’t fit.
A healthy result looks like 40–55 tokens/sec with GPU at 95% and no CPU activity on the AI workload. That’s a setup you can actually work with daily.
Best Practices
Always specify the quantization tag. While Ollama’s defaults are generally sensible, pulling qwen3.5:9b-instruct-q4_K_M is explicit and predictable.
Close VRAM-heavy applications before a long coding session on 8GB hardware. Browsers with many tabs, video editors, and Electron apps all claim VRAM. On Windows, Task Manager’s GPU tab shows exactly which processes own your VRAM.
Don’t use context windows larger than 8,192 tokens on older architectures with 8GB VRAM unless you’ve set OLLAMA_FLASH_ATTENTION=1 and OLLAMA_KV_CACHE_TYPE=q4_0. For GQA models, you have more flexibility, but keeping it at 8,192 ensures you never hit the PCIe wall.
Don’t assume system RAM substitutes for VRAM on Windows/Linux. A machine with 64GB of DDR5 and an 8GB GPU still runs at 2–3 tokens/sec when the model overflows VRAM — the PCIe bus is the bottleneck, not total system memory. Apple Silicon is the exception: its unified memory pool doesn’t incur this penalty.
Troubleshooting
Problem: Token generation is 1–5 tokens/sec.
Cause: Model layers offloaded to system RAM due to VRAM overflow. Run nvidia-smi dmon during generation — watch the fb (framebuffer) column reach the maximum MB limit of your GPU’s total VRAM.
Fix: Switch to q4_K_M if you’re on a larger format, or reduce num_ctx to 4096 in your Modelfile. Rebuild with ollama create and restart.
Problem: Ollama crashes instantly when you send a large prompt.
Cause: Your prompt plus existing KV cache exceeded available VRAM. Ollama ran out of space.
Fix: Add PARAMETER num_ctx 4096 to your Modelfile and set both OLLAMA_FLASH_ATTENTION=1 and OLLAMA_KV_CACHE_TYPE=q4_0 in your environment. Restart Ollama.
Problem: Slow generation on Apple Silicon despite ample unified memory.
Cause: Ollama may be falling back to CPU inference instead of using the Metal GPU backend.
Fix: Check Ollama’s startup logs for Metal acceleration confirmation. If it says CPU, ensure your Ollama version supports your Mac’s chip generation and that you haven’t set OLLAMA_NUM_GPU=0 accidentally.
Key Takeaways
8GB of VRAM is the practical minimum for running a 9B-class coding model at useful speed. Q4_K_M quantization cuts the memory footprint from 18GB to under 5.4GB with negligible code quality impact. The KV cache is the variable most people overlook: it grows with every token in your conversation, and when it overflows VRAM, generation speed falls off a cliff.
Manage these three variables — quantization level, context window size, and KV cache type — and local AI coding becomes genuinely faster than waiting for a cloud API.
