You install Ollama, pull a model, and type a prompt into your coding agent. Nothing happens. The cursor blinks. After 45 seconds, the agent returns connection refused. Ollama is running — you can see the icon in the system tray — but the default install was not configured for a coding workflow.
The default Ollama context window is 2,048 tokens (you’ll hit the limit mid-function). The default keep-alive timer unloads the model after 5 minutes of inactivity (a 10–15 second cold-start greets you every time you switch back to it). The API endpoint format your TUI expects differs from Ollama’s native API. None of this is obvious from the install docs.
Installing Ollama
macOS and Linux:
curl -fsSL https://ollama.com/install.sh | sh
On Linux, this installs Ollama as a systemd service that starts automatically on boot. On macOS, it installs as a menu bar application.
Windows: Download the installer from ollama.com and run it. Ollama installs as a background process and adds itself to startup.
After installation, verify the API is reachable:
curl http://localhost:11434/api/tags
The response should be a JSON object with an empty models array. If you get connection refused, the Ollama service isn’t running — start it manually with ollama serve.
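The same check can be scripted so a setup script fails early when the server is unreachable. This is a minimal sketch using only the standard library; the helper names model_names and fetch_model_names are illustrative, not part of Ollama, and the parsing assumes the /api/tags body has a top-level models array whose entries carry a name field.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address used throughout this guide


def model_names(tags_json: str) -> list[str]:
    """Extract model names from the JSON body returned by /api/tags."""
    payload = json.loads(tags_json)
    # A fresh install reports an empty list; tolerate a missing or null field too.
    return [entry["name"] for entry in (payload.get("models") or [])]


def fetch_model_names(base_url: str = OLLAMA_URL, timeout: float = 5.0) -> list[str]:
    """Query a running server. Raises urllib.error.URLError if nothing is listening."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return model_names(resp.read().decode("utf-8"))
```

Calling fetch_model_names() against a fresh install should return an empty list; a URLError corresponds to the connection refused case described above.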
Pulling the Right Code Models
Generic instruction-tuned models are trained to be conversational. When a coding agent sends a file context and asks for a function refactor, a conversational model often responds with explanation prose before the code block. TUI parsers that expect a clean code response choke on that preamble.
Use a model trained specifically on code:
# Qwen2.5-Coder: strong multi-language support, fast generation
ollama pull qwen2.5-coder:7b
# OpenCoder: open training data, predictable structure
ollama pull opencoder:8b
Ollama defaults to pulling the q4_K_M quantization for 7B and 8B models. At roughly 5GB of weights, that leaves room for the context cache within an 8GB VRAM budget. You only need explicit tags (like qwen2.5-coder:7b-instruct-q8_0) if you want a higher-precision quantization of the same model and have the hardware to run it.
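A back-of-the-envelope estimate shows why the quantization choice matters for an 8GB card. The sketch below multiplies parameter count by average bits per weight; the bits-per-weight figures (about 4.85 for q4_K_M, 8.5 for q8_0) and the 7.6 billion parameter count are rough assumptions for illustration, and real files carry some extra overhead.

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (10^9 bytes)."""
    return params_billions * bits_per_weight / 8


# Assumed figures, for illustration only:
Q4_K_M_BITS = 4.85   # approximate average for q4_K_M
Q8_0_BITS = 8.5      # approximate average for q8_0
PARAMS_7B = 7.6      # assumed parameter count, in billions

q4_size = weights_gb(PARAMS_7B, Q4_K_M_BITS)
q8_size = weights_gb(PARAMS_7B, Q8_0_BITS)
print(f"q4_K_M: ~{q4_size:.1f} GB, q8_0: ~{q8_size:.1f} GB")
```

Under these assumptions the q8_0 weights alone land at about 8GB, which leaves nothing for the context cache on an 8GB GPU.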
Test the model directly before wiring it to a TUI:
ollama run qwen2.5-coder:7b "Write a Python function that returns the Fibonacci sequence up to n."
If the model returns clean code without conversational filler, you’re set. If it starts with “Sure! Here is a Python function that…”, the system prompt in your Modelfile will fix that.
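If you test several models, a crude check for conversational openers saves reading every response by hand. This is a heuristic sketch, not a parser: the list of openers is an assumption you would extend for your own models, and starts_with_filler is a hypothetical helper name.

```python
# Assumed list of common conversational openers; extend as needed.
FILLER_OPENERS = ("sure", "of course", "certainly", "here is", "here's")


def starts_with_filler(output: str) -> bool:
    """Return True if the first non-blank line begins with a known opener."""
    for line in output.splitlines():
        stripped = line.strip()
        if stripped:
            return stripped.lower().startswith(FILLER_OPENERS)
    return False  # empty output has no preamble
```

Feed it the captured output of ollama run; a True result means the model needs the stricter system prompt described in the next section.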
Mastering the Modelfile
What a Modelfile Is
The Modelfile is Ollama’s configuration layer — analogous to a Dockerfile. It specifies the base model, inference parameters, system prompt, and template overrides. You create a custom model from a Modelfile, and Ollama registers it as a separate named model.
This is the right place to lock down coding behavior: context limits, temperature, and the system prompt that prevents the model from generating conversational filler. You set it once, not every time you run a prompt.
Building an Optimized Coding Modelfile
Create Modelfile.coder in your working directory:
FROM qwen2.5-coder:7b
# 8192 tokens fits within VRAM budget on an 8GB GPU
PARAMETER num_ctx 8192
# Low temperature for deterministic, consistent code output
PARAMETER temperature 0.2
# Prevent runaway generation past function end
PARAMETER stop "<|im_end|>"
PARAMETER stop "<|endoftext|>"
SYSTEM """You are an elite developer. Return ONLY valid, executable code.
Do NOT explain the code unless asked. Do NOT include markdown fences unless asked.
Do NOT start your response with 'Sure', 'Of course', or any affirmation."""
Build the custom model:
ollama create my-coder -f Modelfile.coder
Test it:
ollama run my-coder "def parse_csv(filepath: str) -> list[dict]:"
The model should complete the function body without preamble. The qwen2.5-coder model natively supports system prompts in Ollama, so it usually respects this immediately. If you are using a different base model and it still generates conversational text, the model may have a strict chat template overriding your system prompt. Use the TEMPLATE instruction in the Modelfile to override it; see the Modelfile reference for the format your specific model expects.
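If you maintain several variants (one per language or task), generating the Modelfile text from a script keeps them consistent. The sketch below covers only the FROM, PARAMETER, and SYSTEM lines used in this section; render_modelfile is a hypothetical helper, not an Ollama API.

```python
def render_modelfile(base: str, system: str, num_ctx: int = 8192,
                     temperature: float = 0.2,
                     stops: tuple[str, ...] = ()) -> str:
    """Build Modelfile text with the same layout as Modelfile.coder above."""
    if '"""' in system:
        # Triple quotes would terminate the SYSTEM block early.
        raise ValueError("system prompt must not contain triple quotes")
    lines = [
        f"FROM {base}",
        f"PARAMETER num_ctx {num_ctx}",
        f"PARAMETER temperature {temperature}",
    ]
    lines += [f'PARAMETER stop "{stop}"' for stop in stops]
    lines.append(f'SYSTEM """{system}"""')
    return "\n".join(lines) + "\n"
```

Write the returned text to a file and pass that file to ollama create with -f, exactly as with the hand-written version.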
Eliminating Cold-Start Delays
By default, Ollama unloads models from VRAM after 5 minutes of inactivity. The next request triggers a cold-start: loading 5GB of weights from disk back into VRAM. On an NVMe drive, that’s 3–8 seconds. On a slower drive, up to 30 seconds.
Set OLLAMA_KEEP_ALIVE=-1 to keep the model in VRAM until you manually unload it or restart Ollama.
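Linux/macOS: A shell profile only affects a server you start yourself with ollama serve; it does not reach the systemd service on Linux or the menu bar app on macOS. For the systemd service, add the variable as an Environment= line through sudo systemctl edit ollama. For the macOS app, run launchctl setenv OLLAMA_KEEP_ALIVE -1. For a manually started server, add this to your shell profile: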
export OLLAMA_KEEP_ALIVE=-1
Windows: Set it for the current session in PowerShell, or globally via System Properties → Environment Variables.
$env:OLLAMA_KEEP_ALIVE = "-1"
Restart Ollama after changing this. The first request after startup is still a cold-start, but every subsequent request in the session skips the reload and starts generating right away.
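The keep-alive can also be set per request, which is handy for warming a model from a script. The sketch below relies on two behaviors of the /api/generate endpoint: a request without a prompt loads the model, and a keep_alive field of -1 keeps it loaded. The helper names preload_request and preload are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address used in this guide


def preload_request(model: str, keep_alive: int = -1,
                    base_url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build a prompt-less generate request that loads the model and pins it."""
    body = json.dumps({"model": model, "keep_alive": keep_alive}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def preload(model: str, timeout: float = 120.0) -> None:
    """Send the request and wait until the server reports the model is loaded."""
    with urllib.request.urlopen(preload_request(model), timeout=timeout) as resp:
        resp.read()
```

Calling preload("my-coder") before opening your TUI pays the cold-start cost up front.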
Exposing the Ollama API
Default Binding and When to Change It
Ollama binds to 127.0.0.1:11434 by default. Only processes on the same machine can reach the API. For local development, this is the correct and secure default.
Two scenarios require changing it:
WSL2 on Windows: Windows 10/11 automatically forwards localhost requests to WSL2, so if Ollama runs in WSL2, Windows tools can usually reach it at localhost:11434. However, if your editor runs inside WSL2 but Ollama runs natively on Windows, the loopback address 127.0.0.1 doesn’t cross the boundary. You must set OLLAMA_HOST=0.0.0.0:11434 in Windows, and access it from WSL2 via the host IP address (find it with cat /etc/resolv.conf or ip route inside WSL2).
Diagram: the network boundary between Windows and WSL2, showing why the loopback address (127.0.0.1) fails from within WSL2 and why Ollama must bind to 0.0.0.0 to be reachable via the host IP.
Visual Notes:
- The loopback boundary prevents WSL2 from communicating with native Windows processes over localhost.
- Exposing the service to 0.0.0.0 allows WSL2 (and other network clients) to route requests via the internal host IP.
Shared development machine: A Linux workstation running Ollama serving a MacBook on the same local network. Set OLLAMA_HOST=0.0.0.0:11434 on the Linux machine. Restrict access to your specific client IP with a firewall rule.
To configure this on Linux via systemd override:
sudo systemctl edit ollama
Add these lines in the editor that opens, then save and run sudo systemctl restart ollama so the new binding takes effect:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Never expose Ollama to the public internet. The API accepts arbitrary model requests with no authentication — a publicly exposed port is a compute resource available to anyone who finds it.
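For the WSL2 scenario above, the host IP lookup can be scripted rather than read by eye. This sketch parses the text of /etc/resolv.conf and assumes WSL2's default networking, where the nameserver entry points at the Windows host; other networking modes may need the ip route approach instead. Both function names are hypothetical.

```python
from typing import Optional


def first_nameserver(resolv_conf: str) -> Optional[str]:
    """Return the address on the first 'nameserver' line, if any."""
    for line in resolv_conf.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            return parts[1]
    return None


def ollama_url_from_wsl(resolv_conf: str, port: int = 11434) -> Optional[str]:
    """Build the base URL a WSL2 client would use to reach Ollama on Windows."""
    host = first_nameserver(resolv_conf)
    return f"http://{host}:{port}" if host else None
```

Inside WSL2, pass it the contents of /etc/resolv.conf (for example via pathlib.Path("/etc/resolv.conf").read_text()) and point your editor at the returned URL.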
Verifying the API
Test that the model responds to a generation request:
macOS/Linux (bash):
curl http://localhost:11434/api/generate \
  -d '{
    "model": "my-coder",
    "prompt": "def fibonacci(n):",
    "stream": false
  }'
Windows (PowerShell):
Invoke-RestMethod -Uri http://localhost:11434/api/generate -Method Post -Body (@{
    model = "my-coder"
    prompt = "def fibonacci(n):"
    stream = $false
} | ConvertTo-Json)
The response JSON includes a response field with the generated completion. Using stream: false waits for the full completion before returning — good for testing. In production, set stream: true so your TUI displays tokens as they arrive rather than waiting for the full response.
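With streaming enabled, the endpoint returns one JSON object per line, each carrying a fragment in its response field and a done flag on the last one. A client has to stitch those fragments together. The following is a minimal sketch of that step only; join_stream is an illustrative name, and a real TUI would print each fragment as it arrives instead of collecting them.

```python
import json
from typing import Iterable


def join_stream(lines: Iterable[str]) -> str:
    """Concatenate the 'response' fragments of a streamed generate reply."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank keep-alive lines
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final object marks the end of the completion
    return "".join(parts)
```

When reading from urllib, decode each raw line first, for example join_stream(raw.decode("utf-8") for raw in resp).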
Hands-On Example: Building a Focused Coding Model
Scenario: You need a model tuned specifically for writing pytest unit tests — terse output, no explanations, constrained to the pytest framework.
Step 1: Create Modelfile.tests
FROM opencoder:8b
PARAMETER num_ctx 8192
PARAMETER temperature 0.1
SYSTEM """You are a senior developer writing pytest unit tests.
Generate ONLY valid pytest code. Do NOT explain what the tests do.
Use descriptive test function names. Cover edge cases and failure conditions.
Import any required modules at the top of the output."""
Step 2: Build and test
ollama create pytest-coder -f Modelfile.tests
ollama run pytest-coder "Write tests for: def divide(a: float, b: float) -> float"
Expected output — clean pytest code, no prose:
import pytest
from mymodule import divide

def test_divide_positive_numbers():
    assert divide(10, 2) == 5.0

def test_divide_negative_numbers():
    assert divide(-10, 2) == -5.0

def test_divide_by_zero_raises():
    with pytest.raises(ZeroDivisionError):
        divide(10, 0)

def test_divide_float_result():
    # Compare against 1 / 3 itself; 0.333 with rel=1e-3 falls just outside tolerance
    assert divide(1, 3) == pytest.approx(1 / 3)
Step 3: Wire it to OpenCode
Add pytest-coder to your opencode.json models list. In OpenCode, use /model pytest-coder when working on test files, and switch back to your general coding model for implementation files.
Best Practices
Pin temperature to 0.1–0.2. Higher values make the model more variable, which for code means inconsistent function signatures and variable names. Code generation benefits from determinism.
Pre-load models before long sessions. Run ollama run my-coder in a background terminal and send a throwaway prompt to trigger the cold-start load. Your TUI connects to an already-warm model and responds immediately.
Avoid OLLAMA_HOST=0.0.0.0 on untrusted networks. Coffee shop Wi-Fi, conference networks, and shared corporate Wi-Fi all qualify as untrusted. The Ollama API has no authentication — binding to 0.0.0.0 on a public network exposes it to other machines on the subnet.
Troubleshooting
Problem: Connection refused when your TUI tries to reach Ollama.
Cause: Ollama isn’t running, or it’s bound to the wrong address. Common in WSL2 setups where you need to access Windows from WSL2.
Fix: Run ollama serve manually in a terminal. Confirm it shows Listening on 127.0.0.1:11434. For accessing Windows from WSL2, set OLLAMA_HOST=0.0.0.0:11434 on Windows and connect from WSL2 using the Windows host IP.
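To tell "not running" apart from "bound to the wrong address", it helps to probe the port directly from the machine where the TUI runs. This is a small sketch with the standard socket module; is_listening is a hypothetical helper, and the defaults match the address shown above.

```python
import socket


def is_listening(host: str = "127.0.0.1", port: int = 11434,
                 timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

If is_listening() is False on the machine running Ollama, the server is down. If it is True there but False from WSL2 with the Windows host IP, the binding is the problem.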
Problem: The model ignores your system prompt and keeps generating prose.
Cause: Some model families ship chat templates that place or drop the system prompt in ways that override what you set with SYSTEM.
Fix: Add a TEMPLATE instruction to your Modelfile with the exact prompt format the model expects. Qwen models use a <|im_start|>system format that must be declared correctly; see the model’s Hugging Face page for the canonical chat template.
Problem: Error: model requires more system memory (X MiB) than is available
Cause: Your num_ctx exceeds what your VRAM plus system memory can support for this model size.
Fix: Lower num_ctx to 4096, rebuild with ollama create, and retry. To reclaim additional headroom for the context window, set OLLAMA_KV_CACHE_TYPE=q4_0; this requires a recent Ollama release and flash attention enabled with OLLAMA_FLASH_ATTENTION=1.
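The memory cost of num_ctx can be estimated from the model architecture: the cache stores one key and one value vector per layer, per KV head, per token. The sketch below applies that formula. The architecture figures (28 layers, 4 KV heads, head dimension 128) are assumptions for a Qwen2.5-style 7B model and should be checked against the model card; the result ignores runtime overhead.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   num_ctx: int, bytes_per_value: float = 2.0) -> float:
    """Estimate KV cache size: 2 (key + value) x layers x heads x dim x tokens."""
    return 2 * layers * kv_heads * head_dim * num_ctx * bytes_per_value


# Assumed architecture figures, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 28, 4, 128

for ctx in (4096, 8192):
    mib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, ctx) / 2**20
    print(f"num_ctx={ctx}: ~{mib:.0f} MiB at 16-bit precision")
```

Under these assumptions, halving num_ctx halves the cache, and a quantized cache type shrinks bytes_per_value further, which is the headroom the fix above refers to.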
Key Takeaways
Ollama’s defaults are designed for getting started, not for production coding workflows. Three changes make it production-ready: set num_ctx in a Modelfile to match your VRAM budget, set OLLAMA_KEEP_ALIVE=-1 to eliminate cold-start delays, and pin temperature to 0.2 with a strict coding system prompt. With those in place, your local model is consistent, fast, and ready for integration with a terminal coding agent.
