Ollama Local Inference Setup Guide
Complete guide to deploying GGUF models locally with Ollama. Covers installation, model loading, API configuration, and performance optimization for production use.
Getting Started with Ollama
Ollama provides the simplest path to running LLMs locally. This guide covers everything from installation to production API configuration, based on Blue Note Logic's deployment experience.
Installation
Linux (recommended for production):
curl -fsSL https://ollama.com/install.sh | sh
macOS:
brew install ollama
Windows: Download the installer from ollama.com/download
Loading a Custom GGUF Model
To load a custom model like dobetter-norge-v2:
- Create a Modelfile:

FROM ./dobetter-norge-v2-q5_k_m.gguf
PARAMETER temperature 0.3
PARAMETER num_ctx 4096
SYSTEM """You are a Norwegian legal expert assistant."""

- Build the model:

ollama create dobetter-norge-v2 -f Modelfile

- Run inference:

ollama run dobetter-norge-v2 "Hva sier arbeidsmiljoloven om overtid?"

("What does the Working Environment Act say about overtime?")
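After building, it is worth confirming that the model is registered and that the Modelfile parameters were applied. Both commands below are standard Ollama CLI; the model name is the one used above:

```shell
# List installed models; dobetter-norge-v2 should appear in the output
ollama list

# Print the model's details, including its parameters and system prompt
ollama show dobetter-norge-v2
```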
API Configuration
Ollama exposes a REST API on port 11434 by default:
curl http://localhost:11434/api/generate -d '{"model": "dobetter-norge-v2", "prompt": "Your question here"}'
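By default, /api/generate streams the response as newline-delimited JSON chunks. For scripting, a single response object is often easier to handle; a minimal sketch, assuming the server is running locally and the dobetter-norge-v2 model from above has been built:

```shell
# Non-streaming request: "stream": false returns one JSON object
# whose "response" field holds the full completion.
curl -s http://localhost:11434/api/generate -d '{
  "model": "dobetter-norge-v2",
  "prompt": "Hva sier arbeidsmiljoloven om overtid?",
  "stream": false,
  "options": {"temperature": 0.3, "num_ctx": 4096}
}'
```

Per-request values passed under "options" override the defaults baked into the Modelfile.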
Performance Tuning
- GPU Layers: Set OLLAMA_NUM_GPU=999 to offload all layers to GPU
- Context Size: Match num_ctx to your model's training context (4096 for dobetter-norge-v2)
- Concurrent Requests: Set OLLAMA_NUM_PARALLEL=4 for multi-user deployments
- Keep Alive: Use OLLAMA_KEEP_ALIVE=24h to keep models loaded in memory
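The settings above are read by the server process at startup, so they must be exported before launching it. A minimal config sketch for a foreground session (on Linux, the install script runs Ollama as a systemd service instead, in which case the same variables can be added via systemctl edit ollama):

```shell
export OLLAMA_NUM_PARALLEL=4    # serve up to 4 requests concurrently
export OLLAMA_KEEP_ALIVE=24h    # keep loaded models resident for 24 hours
ollama serve
```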
Related Resources
QLoRA Fine-Tuning Workflow
Blue Note Logic's documented workflow for fine-tuning language models with QLoRA 4-bit quantization. Covers dataset preparation, hyperparameter selection, training execution, and evaluation.
GGUF Quantization Reference
Comprehensive comparison of GGUF quantization levels from Q2_K through Q8_0. File sizes, quality tradeoffs, VRAM requirements, and inference speed benchmarks.
Claude Code CLI Developer Guide
Setup and advanced usage of Anthropic's Claude Code CLI for AI-assisted development workflows. Configuration tips and productivity patterns from the Gilligan.TECH development stack.