beginner · 10 min read

How to Run Your First Local LLM — One Path, One Command

Install Ollama, run Llama 3.1 8B, and have a private offline model running on your laptop in under ten minutes. One path, one recommendation, no decision matrix.

ollama · local-llm · llama · setup · beginner
Mar 28, 2026
prerequisites
  • A terminal (macOS, Linux, or Windows PowerShell)
  • 16GB RAM recommended (8GB minimum for CPU-only inference)
  • No Python environment or Docker required
tools used
  • Ollama
last tested
2026-03-28

Run one command and you will have Llama 3.1 8B chatting on your laptop in under ten minutes — no API key, no usage bill, no data leaving your machine.

TL;DR:

  • Install Ollama with one command (macOS/Linux) or the official installer at ollama.com (Windows)
  • Run ollama run llama3.1 — it downloads the model, loads it, drops you into chat
  • Ollama exposes an OpenAI-compatible REST API on port 11434, no API keys required
  • Llama 3.1 8B at Q4_K_M is the right starting point for most hardware

Most developers I talk to overestimate what they need for local LLMs and underestimate what local models can do today. A Llama 3.1 8B at Q4_K_M quantization scores 69.4 on MMLU — roughly GPT-3.5-level performance, running entirely on your hardware with zero API costs. What was a three-day project requiring deep system knowledge in 2024 is now a ten-minute setup.

What Is a Local LLM?

A local LLM is a language model that runs entirely on your own hardware — no API calls, no data leaving your machine, no usage costs. The model weights live on your disk, inference happens on your CPU or GPU, and latency depends only on your own hardware.

Hardware Reality Check

You probably do not need to buy anything.

Apple Silicon Mac: 16GB of unified memory handles 7B–8B models comfortably at 25–35 tokens per second. 32GB opens up 13B models and the occasional 34B at reduced quantization. Apple’s M-series chips use unified memory shared between CPU and GPU — a MacBook Pro with 36GB can fit models that would require a dedicated high-end GPU on a PC, a practical advantage that benchmark comparisons tend to understate.

Windows / Linux with NVIDIA GPU: An RTX 4060 Ti or better gives you comfortable headroom for 7B–8B models at Q4_K_M. An RTX 3090 with 24GB VRAM pushes 80–110 tokens per second on Llama 3.1 8B. GPU inference is dramatically faster than CPU-only. VRAM is a hard boundary, not a soft limit: if the model does not fit entirely in VRAM, Ollama splits layers between GPU and system RAM — and performance drops 5–20x.

CPU-only (any platform): It works. A modern CPU with 16GB RAM runs a 7B Q4_K_M model at roughly 10 tokens per second — usable for non-interactive tasks, slower for chat. Not a long-term setup, but it gets you started today.

The memory math is straightforward: at Q4_K_M quantization, plan for approximately 0.6–0.7 GB per billion parameters. An 8B model needs 5–6GB of RAM. Your current laptop almost certainly handles this.
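That rule of thumb is easy to turn into a quick calculation. A minimal sketch (the function name est_ram_gb and the 0.65 midpoint are my own choices; real usage adds context-window overhead on top):

```python
def est_ram_gb(params_billion: float, gb_per_billion: float = 0.65) -> float:
    """Rough RAM needed for a model at Q4_K_M quantization.

    Uses the 0.6-0.7 GB per billion parameters rule of thumb
    (0.65 is the midpoint of that range).
    """
    return params_billion * gb_per_billion

print(round(est_ram_gb(8), 1))   # 8B model  → 5.2
print(round(est_ram_gb(14), 1))  # 14B model → 9.1
```

The 8B result lands squarely in the 5–6GB range quoted above, and the 14B estimate matches the roughly 10GB figure mentioned later in this guide.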

Install Ollama

Ollama wraps the underlying inference engine (llama.cpp) in a single binary, handles model downloads, selects the appropriate quantization automatically, and exposes an OpenAI-compatible REST API out of the box. LM Studio, Jan, and LocalAI are valid alternatives if you prefer a graphical interface, but they add overhead and are harder to script. Ollama wins for CLI workflows, API integration, and headless server deployments — which is why it has become the industry standard for local model serving.

macOS (Homebrew):

brew install ollama

Linux (official install script):

curl -fsSL https://ollama.com/install.sh | sh

Windows:

Download the official Windows installer from https://ollama.com (the “Download for Windows” button on the homepage) and run the .exe. GPU acceleration on Windows requires WSL2 plus the NVIDIA CUDA drivers for WSL. To enable WSL2, open PowerShell as administrator and run:

wsl --install

Then restart your machine. After WSL2 is active, install the NVIDIA CUDA/WDDM driver for WSL from the NVIDIA CUDA on WSL page — this is a separate driver from your standard Windows GPU driver. Ollama’s installer links to this page during setup. If you are running CPU-only and have no NVIDIA GPU, skip both steps; the installer works without WSL2 and the model will run on CPU. Once complete, Ollama runs in the system tray and the REST API is available on port 11434.

That is the entire installation. No Python environment, no Docker, no dependency resolution.

Verify the Installation

After installation, confirm the daemon is active by listing your models:

ollama list

On a fresh install you will see an empty table — correct, you have not downloaded any models yet:

NAME    ID    SIZE    MODIFIED

If ollama list returns a connection error, Ollama’s background service is not running. Start it manually:

ollama serve

On macOS, Ollama typically runs as a menu bar app and starts automatically at login. On Linux, it registers as a systemd service — check its status and enable autostart with:

systemctl status ollama
systemctl enable ollama

On Windows, it runs in the system tray after the installer completes. If something does not start automatically, ollama serve always works as a fallback.

For a more specific API check, confirm the server responds on the models endpoint:

curl http://localhost:11434/v1/models

A JSON response listing available models (empty array on a fresh install) confirms Ollama is running correctly. Note that the root endpoint response (curl http://localhost:11434) can vary between Ollama releases — use ollama list or the /v1/models endpoint for reliable verification.
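If you want to script that check rather than eyeball curl output, here is a minimal standard-library sketch (the function name check_ollama is my own; the /v1/models path is the endpoint shown above, which returns a JSON object with a "data" list):

```python
import json
import urllib.error
import urllib.request

def check_ollama(base_url: str = "http://localhost:11434") -> bool:
    """Return True if an Ollama server answers on /v1/models."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=3) as resp:
            # A healthy server returns JSON with a "data" list of models
            # (empty on a fresh install).
            return "data" in json.load(resp)
    except (urllib.error.URLError, TimeoutError, ValueError):
        # Connection refused, timeout, or non-JSON response: not healthy.
        return False

if __name__ == "__main__":
    print("Ollama is up" if check_ollama() else "Ollama is not reachable")
```

Useful as a readiness gate in scripts: loop on check_ollama() after ollama serve starts, then begin sending requests.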

Your First Model

Run this:

ollama run llama3.1

The first time, Ollama downloads the model weights (~5.2 GB for Llama 3.1 8B at Q4_K_M), loads them into memory, and drops you into an interactive chat session. Subsequent runs skip the download entirely and start in seconds.

Here is what the terminal looks like during first run:

pulling manifest
pulling 00e22de8b2ba... 100% ▕████████████████▏ 5.2 GB
pulling 966de95ca8a6...  100% ▕████████████████▏ 1.4 KB
verifying sha256 digest
writing manifest
success
>>> Send a message (/? for help)

Once you see >>>, the model is loaded and waiting. Type a message and press Enter. Type /bye or press Ctrl+D to exit.

Download First, Run Later

If you want to download the model before running it — useful for offline environments, CI pipelines, or air-gapped machines — use ollama pull separately:

ollama pull llama3.1

Then run it later without needing a network connection:

ollama run llama3.1

For scripted or production deployments where you need certainty about exactly which quantization variant is on disk, specify the tag explicitly:

ollama pull llama3.1:8b-q4_K_M
ollama run llama3.1:8b-q4_K_M

For interactive use on your laptop, ollama run llama3.1 does the download and launch in one step — the default is Q4_K_M for Llama 3.1 8B, so both approaches land you on the same model. Browse all available tags at ollama.com/library.

Why Llama 3.1 8B? It scores 69.4 on MMLU at Q4_K_M quantization and handles coding and general reasoning well. It also has the largest open-weight ecosystem of any model family — which means fine-tunes for specific domains are abundant on Hugging Face.

Alternative Models

If your hardware has less than 8GB RAM:

ollama run gemma3:4b

Gemma 3 4B needs as little as 3GB RAM and supports basic vision tasks that Llama does not. Pick this if you are on an older laptop or running CPU-only.

If you need stronger coding or multilingual performance:

ollama run qwen3:8b

Qwen 3 8B covers 119 languages and dialects and includes a “thinking mode” for structured reasoning tasks. The RAM footprint is roughly the same as Llama 3.1 8B (~6GB at Q4_K_M). Pick this if your primary use case is code generation or non-English tasks.

Start with Llama 3.1 8B unless one of those two scenarios applies to you.

Quantization: What Q4_K_M Means

When you download a model, you are not downloading all 8 billion parameters at full 16-bit precision. Quantization compresses those values — to 8-bit, 4-bit, or lower — which shrinks the file size and speeds up inference at the cost of some accuracy.

Q4_K_M breaks down as follows: Q4 means 4-bit quantization, reducing each parameter to one-quarter of its original storage — roughly a 75% reduction in file size compared to 16-bit. K refers to grouped quantization, which uses per-group scale factors to preserve important weight distributions rather than applying a single scale across the entire model. M indicates a medium-sized grouping — more careful than the smallest K variants, less memory-hungry than the largest. The result is a model that runs in 5–6GB instead of 16GB, with quality loss that is nearly imperceptible for daily development tasks.

The naming convention you will encounter when browsing models:

  • Q8_0 — 8-bit, highest quality, largest file, most RAM required
  • Q5_K_M — 5-bit, marginally better quality than Q4, but larger and slower
  • Q4_K_M — 4-bit, the sweet spot for daily use
  • Q3_K_M and below — noticeably degraded on complex reasoning, only worth it on very constrained hardware

Ollama selects Q4_K_M by default for Llama 3.1 8B on typical consumer hardware. You do not need to think about this unless you are deliberately trading quality for RAM headroom.
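The bits-per-parameter arithmetic behind that list can be sketched in a few lines (a toy estimate; real GGUF files run somewhat larger because scale factors and some tensors are kept at higher precision, which is why the Llama 3.1 8B download is ~5.2 GB rather than 4 GB):

```python
def est_file_gb(params_billion: float, bits_per_param: float) -> float:
    """Rough model file size: parameter count times bits each, in gigabytes."""
    return params_billion * bits_per_param / 8  # 8 bits per byte

# Estimated sizes for an 8B-parameter model at common quantization levels.
for name, bits in [("FP16", 16), ("Q8_0", 8), ("Q5_K_M", 5), ("Q4_K_M", 4)]:
    print(f"{name:7s} ~{est_file_gb(8, bits):.1f} GB")
```

The FP16 row reproduces the 16GB figure quoted above, and the Q4_K_M row shows why 4-bit quantization is roughly a 75% reduction from it.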

Verify the API and Connect Your Tools

Ollama exposes two REST endpoints once the server is running. Knowing the difference matters when you connect third-party tools.

Ollama’s native endpoint (/api/generate):

curl http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1",
    "prompt": "Explain async/await in three sentences.",
    "stream": false
  }'

This is Ollama’s own API format. Some tools target it directly.

OpenAI-compatible endpoint (/v1/chat/completions):

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1",
    "messages": [{"role": "user", "content": "Explain async/await in three sentences."}]
  }'

This is the compatibility layer — the response format is identical to the OpenAI Chat Completions API. Most developer tools (Continue.dev, Aider, LiteLLM) use this endpoint. When a tool asks for an OpenAI base URL, point it at http://localhost:11434/v1.

Python with the OpenAI SDK:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the SDK, value is ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3.1",
    messages=[
        {"role": "user", "content": "Explain JWT in plain English."}
    ]
)

print(response.choices[0].message.content)

No forked library, no wrapper. Existing scripts that use the OpenAI Python SDK migrate to local inference by changing base_url and api_key. That is the entire diff.

Connect to VS Code: Install the Continue extension, open its configuration, and set the provider to Ollama at http://localhost:11434. You get inline code completions and a chat panel with no API key and no usage cost. This is the fastest path from “model running” to “useful in my daily workflow.”

Connect to CLI coding tools: Aider supports Ollama directly via the OpenAI-compatible endpoint. If you use LiteLLM as a proxy layer, it handles routing to Ollama automatically alongside any cloud APIs you still use.

Troubleshooting

The model is loading but inference is extremely slow

Cause: The model is running on CPU instead of GPU. This happens on Windows/Linux if the NVIDIA drivers are missing or the CUDA version does not match what Ollama expects.

Fix: Check where the model is actually running. While an ollama run session is open, run this in a second terminal:

ollama ps

The PROCESSOR column shows the CPU/GPU split: "100% CPU" means nothing is offloaded to the GPU. On Linux, reinstall the NVIDIA driver and CUDA toolkit, then restart Ollama. On Windows, ensure WSL2 is enabled and that you have installed the NVIDIA CUDA driver for WSL (not the standard Windows GPU driver) from the NVIDIA CUDA on WSL page.


ollama run exits immediately or shows a memory error

Cause: Insufficient RAM to load the model. A 7B–8B model at Q4_K_M needs 5–6GB of free memory. If your system RAM is nearly full, the load fails.

Fix: Close other applications to free memory, then retry. Alternatively, switch to a smaller model:

ollama run gemma3:4b

Gemma 3 4B needs roughly 3GB and is the recommended fallback on memory-constrained machines.
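For scripts that should degrade gracefully instead of failing, the fallback logic can live in a small helper. A sketch (the function choose_model and its thresholds are my own, derived from the footprints quoted in this guide: 5–6GB for Llama 3.1 8B, roughly 3GB for Gemma 3 4B):

```python
def choose_model(free_ram_gb: float) -> str:
    """Pick a model tag that fits in the given free RAM.

    Thresholds follow this guide: Llama 3.1 8B wants 5-6 GB free,
    Gemma 3 4B runs in roughly 3 GB.
    """
    if free_ram_gb >= 6:
        return "llama3.1"
    if free_ram_gb >= 3:
        return "gemma3:4b"
    raise ValueError("Under 3 GB free: close applications before loading a model")

print(choose_model(16))  # → llama3.1
print(choose_model(4))   # → gemma3:4b
```

Pair it with whatever memory probe your platform offers, then pass the result to ollama run.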


ollama list returns a connection error

Cause: The Ollama daemon is not running.

Fix: Start it manually:

ollama serve

On Linux, check systemd status and enable autostart:

systemctl status ollama
systemctl enable ollama

On macOS, relaunch the Ollama app from your Applications folder. On Windows, start it from the Start menu — it will appear in the system tray when active.


Firewall is blocking port 11434

Cause: A local firewall rule is preventing connections to Ollama’s port. Common on corporate machines or hardened Linux setups.

Fix: Allow the port explicitly. On Linux with ufw:

sudo ufw allow 11434/tcp

Note: port 11434 only needs to be open if you are connecting from another machine on your network. For local-only use (localhost), firewall rules do not apply.

Where the Ceiling Is

Be honest with yourself about what an 8B local model is and is not.

It handles 80% of daily development tasks well: code completions, explaining unfamiliar code, simple refactors, writing tests for straightforward functions, summarizing documentation, answering questions about a specific codebase. For this category of work, the quality gap between a local 8B and a frontier API is smaller than most developers expect — and the privacy and latency advantages of running locally are immediate and concrete.

The remaining 20% still needs the cloud. Complex multi-file refactors, long-context reasoning across a large codebase, nuanced architecture decisions, frontier-level code generation for unfamiliar frameworks — these tasks consistently favor Claude Sonnet or GPT-4o. A model that scores 69.4 on MMLU is not the same as one scoring 88.0, and the difference shows up exactly when the task is hard. The benchmark gap is real and it maps to real task failure, not just leaderboard aesthetics.

The practical approach: use Ollama locally for repetitive, privacy-sensitive, or high-volume tasks. Keep a cloud API available for the genuinely difficult ones. You are not choosing between local and cloud — you are layering them.
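Because both backends speak the same Chat Completions format, that layering can be as simple as choosing a base URL per task. A sketch (the task categories, the pick_base_url helper, and the cloud URL are my own illustration; any OpenAI-compatible API works as the cloud side):

```python
# Route repetitive, high-volume tasks to local Ollama; send the
# genuinely difficult categories to a cloud API.
LOCAL_URL = "http://localhost:11434/v1"
CLOUD_URL = "https://api.openai.com/v1"  # placeholder cloud endpoint

HARD_TASKS = {"multi-file-refactor", "architecture", "long-context"}

def pick_base_url(task: str) -> str:
    """Return the backend base URL for a given task category."""
    return CLOUD_URL if task in HARD_TASKS else LOCAL_URL

print(pick_base_url("code-completion"))      # local Ollama
print(pick_base_url("multi-file-refactor"))  # cloud API
```

With the OpenAI SDK, the only per-request difference is which base_url the client was constructed with; everything else in your code stays identical.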

Next Steps

If your hardware has headroom, try a larger model next:

ollama run qwen3:14b

Qwen 3 14B requires roughly 10GB of RAM at Q4_K_M, and the quality jump on reasoning tasks is meaningful.

If you have 32GB+ RAM, Llama 4 Scout is worth exploring — a mixture-of-experts architecture with multimodal support:

ollama run llama4:scout

Explore GGUF models on Hugging Face for fine-tunes targeting specific domains: TypeScript generation, legal text, medical Q&A. Ollama can import GGUF files directly with ollama create.

For the strategic question of when local models are the right choice versus proprietary APIs, the Open Source vs Proprietary AI Models guide covers the decision framework in depth.

If you are evaluating what cloud-side tools like Claude Code offer that local models cannot yet match, Claude Code: The 5 Features Most Developers Ignore is the honest side-by-side.

And if API costs are part of why local inference is appealing in the first place, Context Mode — Stop Wasting 70% of Your Token Budget covers the token efficiency side of that equation.