The Ultimate Guide to Local AI Coding with OpenCode and Ollama

Jul 1, 2026 min read

GitHub Copilot goes down on a Tuesday afternoon. You’re mid-feature, context loaded, and suddenly the autocomplete suggestions stop. You refresh. You wait. You check the status page. Partial outage. No ETA.

Or maybe it’s a different kind of Wednesday — the one where HR sends a company-wide memo reminding everyone that pasting source code into external AI tools violates the data handling policy you agreed to when you joined. The authentication module you were about to paste into ChatGPT stays un-pasted.

These are the two failure modes of cloud AI coding assistants: unavailable or off-limits. The fix isn’t a different cloud provider. The fix is moving the model off the cloud entirely.

Why Self-Host Your AI Coding Assistant?

The Cost and Privacy Problem

Every request to GitHub Copilot or any cloud-based coding API transmits your code context to a third-party server. Not just a line — the surrounding function, the import block, the file header. The model needs that context to generate useful completions, and providing it means sending your code off your machine over HTTPS.

GitHub’s zero-data-retention policy covers training data, but the payload still traverses their infrastructure. For proprietary algorithms, internal tool code, or anything touching PII, that transmission is exactly what your legal and security teams are flagging. This isn’t theoretical — it’s a documented compliance gap that GDPR and HIPAA auditors are increasingly citing.

Beyond privacy, there’s the subscription fatigue problem. A $19/month Copilot plan stacked with a $20/month Claude subscription stacked with API overage charges adds up. Local AI has a marginal cost of zero after the one-time model download.

The 8GB Rule

Consumer hardware has crossed a threshold. In 2026, an 8GB VRAM GPU — the kind that ships in a standard gaming laptop — can run a 9B parameter coding model (or smaller, faster options like Gemma 4 E4B) well enough to replace a cloud subscription for targeted, file-specific tasks.

The math: a 9B parameter model quantized to Q4_K_M (4-bit medium precision) takes approximately 5.0GB to 5.4GB of VRAM. Your operating system and active applications consume another 1.5-2GB. An 8GB card sits at the edge of viable — tight but functional if you manage your context window. A 12GB or 16GB card removes the anxiety entirely.

Quantization is what makes this work. Converting model weights from 16-bit floating-point to 4-bit integers reduces the VRAM footprint by roughly 75%, with negligible impact on code quality for the tasks these models handle well: refactoring, unit test generation, single-file logic problems.

Setting Up Your Local Infrastructure

Installing Ollama

Ollama is the runtime that pulls, manages, and serves local models. Think of it as Docker for LLMs — it handles the download, storage, and API serving so you don’t have to wrestle with raw GGUF files and llama.cpp launch flags.

macOS/Linux:

curl -fsSL https://ollama.com/install.sh | sh

Windows: Download the installer from ollama.com and run it. Ollama installs as a background service.

Verify it’s running:

curl http://localhost:11434/api/tags

A JSON response listing available models confirms the service is up. An empty models array is fine — you haven’t pulled anything yet.

Pulling the Right Model

While instruction-tuned models like gemma4:e4b-instruct are highly capable, they are often tuned to be helpful conversationalists. That’s a different optimization target than strict code generation. A conversational model asked to complete a function tends to start with “Great question! Here’s an approach you might consider…” before the actual code. TUI parsers that expect a clean code response choke on that preamble.

Use a model variant trained specifically for clean code output:

# Specify the quantization tag explicitly — never omit it
ollama pull qwen3.5:9b-instruct-q4_K_M

This downloads about 5.2GB. On the first run, Ollama loads the weights into VRAM and starts accepting requests. Subsequent requests within the default 5-minute keep-alive window respond instantly.

Optimizing the Modelfile

Ollama’s default num_ctx (context window) is 2,048 tokens — uselessly small for code generation, where a single file can easily exceed that limit. While setting it to 32,768 tokens on an 8GB card used to cause KV cache overflow on older architectures, modern Grouped Query Attention (GQA) models like Qwen 3.5 are highly efficient and can often handle massive contexts. Still, a perfectly safe sweet spot for an 8GB GPU without pushing limits is 8,192 tokens. Create a Modelfile to lock this in:

FROM qwen3.5:9b-instruct-q4_K_M

# 8192 tokens fits within VRAM budget; prevents KV cache spill on 8GB GPUs
PARAMETER num_ctx 8192
PARAMETER temperature 0.2
PARAMETER stop "<|endoftext|>"

SYSTEM """You are an expert software engineer. Return ONLY valid, executable code.
Do not wrap code in conversational filler or markdown explanations unless explicitly asked."""

Build and register the custom model:

ollama create my-coder -f Modelfile

Test it:

ollama run my-coder "def parse_csv(filepath: str) -> list[dict]:"

The model should complete the function without preamble.

Integrating OpenCode

Installing and Connecting OpenCode

OpenCode is a terminal-native AI coding agent. It sits in your project directory, reads your files, sends prompts to Ollama’s local API, and applies generated code diffs directly to your filesystem. No copy-paste, no context switching to a browser.

Diagram

This diagram visualizes the fully local workflow between the filesystem, the OpenCode terminal interface, and the Ollama API serving the quantized model from GPU VRAM.

ProjectFilesQ(wLOeolnaRl3dea.eam5ddasMi/LonWodrceGialPtlUes+AVPRCIAPoMrn)otmepxttOpenCodeTUIGCeondeeraDtiUefsdferPrompt

Visual Notes:

  • The entire process executes on the local machine without external network calls.
  • The KV cache and model weights reside strictly within the isolated GPU VRAM block.

Install via npm:

npm install -g opencode-ai

Configure OpenCode to point to your local Ollama instance. Create or edit ~/.config/opencode/opencode.json:

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "ollama": {
      "name": "Ollama",
      "options": {
        "baseURL": "http://localhost:11434/v1"
      },
      "models": {
        "qwen3.5:9b-instruct-q4_K_M": {
          "name": "qwen3.5:9b-instruct-q4_K_M",
          "tools": true
        }
      }
    }
  }
}

The /v1 suffix is required. Ollama exposes an OpenAI-compatible endpoint there, and OpenCode expects the OpenAI API format. The "tools": true flag enables function calling, which is how OpenCode applies file diffs rather than just printing them. The model config may recongize only specifying qwen3.5 for the model, results may vary.

Launch OpenCode from your project directory:

cd /path/to/your/project
opencode

The TUI opens with a chat input at the bottom and conversation history above. The commands you’ll use constantly:

  • /add filename.py — injects the file into the model’s context window
  • /drop filename.py — removes the file, freeing KV cache memory
  • /new — clears the conversation and resets context
  • /model — switch to a different model mid-session

Context curation is the most important skill. Feed the model only what it needs: the file you’re editing and the interface or type definitions it references. Drop files you’ve finished with. Keeping irrelevant context in the window degrades suggestion quality and, on 8GB VRAM setups, slows generation noticeably.

Hands-On Example: Refactoring a Script Offline

You’re on a plane, working on a data pipeline with a messy loop structure. No internet. No Copilot. Just your laptop with Ollama running locally.

The existing function:

# process_data.py
def process_records(data):
    results = []
    for item in data:
        if item['active'] == True:
            record = {}
            record['id'] = item['id']
            record['name'] = item['name'].upper()
            record['score'] = item['value'] * 1.5
            results.append(record)
    return results

Start OpenCode, add the file, and prompt:

/add process_data.py
Refactor process_records to use a list comprehension and remove the redundant boolean comparison.

OpenCode sends the file content plus your prompt to Ollama. The model generates the refactored function and presents it as a diff in the TUI:

def process_records(data):
    return [
        {
            'id': item['id'],
            'name': item['name'].upper(),
            'score': item['value'] * 1.5,
        }
        for item in data
        if item['active']
    ]

Review the diff, accept it, and OpenCode writes the change directly to process_data.py. Run the tests to confirm:

python -m pytest tests/test_process_data.py -v

The full workflow — prompt to applied, tested change — takes about 90 seconds with zero network activity.

Best Practices

Use Q4_K_M quantization. On 8GB VRAM, this is non-negotiable. The Q8 version of the same 9B model needs ~9.5GB and will spill into system RAM, cutting generation speed by 90%.

Cap your context window. Set num_ctx to 8192 for 8GB GPUs, 16384 for 12GB. Going higher without sufficient VRAM silently degrades performance — Ollama won’t error out, it will just get slow.

Enable KV cache quantization. Set OLLAMA_KV_CACHE_TYPE=q4_0 before starting Ollama to shrink the memory footprint of the context window itself:

Linux/macOS:

export OLLAMA_KV_CACHE_TYPE=q4_0
ollama serve

Windows PowerShell:

$env:OLLAMA_KV_CACHE_TYPE = "q4_0"
ollama serve

Restrict Ollama to localhost. By default, Ollama binds to 127.0.0.1. Don’t change this unless you have a specific need to share the API on a trusted local network. The API has no authentication — anything that can reach the port can use it.

Monitor VRAM usage during generation:

# Windows/Linux (Nvidia)
nvidia-smi

# macOS: Activity Monitor → GPU tab, or install asitop

If the GPU percentage stays at 100% during generation, the model is fully in VRAM and running at full speed. If you see CPU climbing instead, you’re spilling into system RAM.

Troubleshooting

Problem: Generation drops to 2-3 tokens/sec.

Cause: The model weights plus KV cache overflow VRAM. Ollama offloads inference layers to the CPU over the PCIe bus.

Fix: Lower num_ctx in your Modelfile (try 4096), rebuild the custom model, and restart. Verify with ollama ps that 100% of layers are GPU-loaded.

Problem: ECONNREFUSED when OpenCode tries to connect.

Cause: Ollama isn’t running, or OpenCode is pointed at the wrong URL.

Fix: Verify Ollama is active (curl http://localhost:11434/api/tags). Confirm opencode.json uses http://localhost:11434/v1 with the /v1 suffix — OpenCode expects the OpenAI-compatible endpoint.

Problem: The model ignores your system prompt.

Cause: Some base models have chat templates that override custom system prompts unless you use an instruct-tuned variant.

Fix: Pull the instruct version of the model (e.g., qwen3.5:9b-instruct) and define the TEMPLATE parameter explicitly in your Modelfile if the problem persists.

Key Takeaways

Local AI coding on consumer hardware is real and usable. The three pieces that make it work: Ollama for model management and serving, a Q4_K_M quantized coding model that fits in VRAM, and OpenCode as the terminal interface that keeps your code context local and your workflow in the terminal.

The two variables that matter most are quantization level and context window size. Get those right and generation runs at 40+ tokens/sec. The next article in this series covers the hardware specifics: exactly how VRAM math works, why the KV cache is the variable that trips people up, and how to benchmark your setup before committing to a full workflow integration.

Sources