Ollama Local Inference Setup Guide
Complete guide to deploying GGUF models locally with Ollama. Covers installation, model loading, API configuration, and performance optimization for production use.
Getting Started with Ollama
Ollama provides the simplest path to running LLMs locally. This guide covers everything from installation to production API configuration, based on Blue Note Logic's deployment experience.
Installation
Linux (recommended for production):
curl -fsSL https://ollama.com/install.sh | sh
macOS:
brew install ollama
Windows: Download the installer from ollama.com/download
Loading a Custom GGUF Model
To load a custom model like dobetter-norge-v2:
- Create a Modelfile:

FROM ./dobetter-norge-v2-q5_k_m.gguf
PARAMETER temperature 0.3
PARAMETER num_ctx 4096
SYSTEM """You are a Norwegian legal expert assistant."""

- Build the model:

ollama create dobetter-norge-v2 -f Modelfile

- Run inference:

ollama run dobetter-norge-v2 "Hva sier arbeidsmiljoloven om overtid?"

("What does the Working Environment Act say about overtime?")
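After building, it is worth confirming that the model is registered and that the Modelfile parameters were applied. Both commands below are standard Ollama CLI; the model name is the one used above:

```shell
# List installed models; dobetter-norge-v2 should appear in the output
ollama list

# Print the model's details, including its parameters and system prompt
ollama show dobetter-norge-v2
```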
API Configuration
Ollama exposes a REST API on port 11434 by default:
curl http://localhost:11434/api/generate -d '{"model": "dobetter-norge-v2", "prompt": "Your question here"}'
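By default, /api/generate streams the response as newline-delimited JSON chunks. For scripting, a single response object is often easier to handle; a minimal sketch, assuming the server is running locally and the dobetter-norge-v2 model from above has been built:

```shell
# Non-streaming request: "stream": false returns one JSON object
# whose "response" field holds the full completion.
curl -s http://localhost:11434/api/generate -d '{
  "model": "dobetter-norge-v2",
  "prompt": "Hva sier arbeidsmiljoloven om overtid?",
  "stream": false,
  "options": {"temperature": 0.3, "num_ctx": 4096}
}'
```

Per-request values passed under "options" override the defaults baked into the Modelfile.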
Performance Tuning
- GPU Layers: Set OLLAMA_NUM_GPU=999 to offload all layers to GPU
- Context Size: Match num_ctx to your model's training context (4096 for dobetter-norge-v2)
- Concurrent Requests: Set OLLAMA_NUM_PARALLEL=4 for multi-user deployments
- Keep Alive: Use OLLAMA_KEEP_ALIVE=24h to keep models loaded in memory
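The settings above are read by the server process at startup, so they must be exported before launching it. A minimal config sketch for a foreground session (on Linux, the install script runs Ollama as a systemd service instead, in which case the same variables can be added via systemctl edit ollama):

```shell
export OLLAMA_NUM_PARALLEL=4    # serve up to 4 requests concurrently
export OLLAMA_KEEP_ALIVE=24h    # keep loaded models resident for 24 hours
ollama serve
```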
Related Resources
QLoRA Fine-Tuning Workflow
Blue Note Logic's documented workflow for fine-tuning language models with QLoRA 4-bit quantization. Covers dataset preparation, hyperparameter selection, training execution, and evaluation.
GGUF Quantization Reference
Comprehensive comparison of GGUF quantization levels from Q2_K through Q8_0. File sizes, quality tradeoffs, VRAM requirements, and inference speed benchmarks.
Claude Code CLI Developer Guide
Setup and advanced usage of Anthropic's Claude Code CLI for AI-assisted development workflows. Configuration tips and productivity patterns from the Gilligan.TECH development stack.