Run AI Locally with Ollama and LM Studio
A practical, privacy-first tutorial for running open-source AI models like Llama, Qwen, and Gemma on your own laptop.
AI Snapshot
- ✓ Ollama gives you a fast command-line workflow for pulling, running, and serving open-source models; LM Studio wraps the same idea in a polished GUI with an OpenAI-compatible API at localhost:1234.
- ✓ A mid-range laptop (16 GB RAM, Apple Silicon M2 or a laptop RTX 3060 with 6 GB VRAM) can comfortably run Q4 quantised 7B to 14B models such as Llama 3.1 8B, Qwen 2.5 14B, Gemma 3, and Phi-4 offline.
- ✓ Local inference keeps regulated data on-device, which matters for professionals working under Singapore PDPA, Japan APPI, and India DPDPA, and it eliminates per-token costs for high-volume drafting, coding, and summarisation.
Why This Matters
Running AI locally used to mean renting a cloud GPU or wrestling with Python. That changed quickly. Ollama and LM Studio turned local inference into a three-command install. On 2 April 2026, NVIDIA announced co-optimisation of Google DeepMind's Gemma 4 models for RTX edge deployment, and Apple Silicon has quietly become one of the best platforms in the world for running medium-sized models because unified memory lets a 13B model sit comfortably in RAM. The hardware you already own is very likely enough.
There is a productivity argument as well. Once a model is on your machine, there is no rate limit, no monthly subscription, and no network round-trip. Long coding sessions, bulk document summarisation, and experimentation with prompts all become dramatically cheaper. You still use cloud models for the hard reasoning work, but a lot of the day-to-day drafting, classification, and cleanup can happen locally, in private, for free.
How to Do It
Install Ollama first. On Linux, the one-line install is curl -fsSL https://ollama.com/install.sh | sh; on macOS and Windows, use the installer from ollama.com. Once installed, open a terminal and run ollama pull llama3.1:8b to download the 8B Llama 3.1 model (about 4.7 GB). Then run ollama run llama3.1:8b and you will drop into an interactive chat. Press Ctrl-D to exit. Other useful commands: ollama list shows what you have downloaded, ollama rm <model> deletes one, and ollama serve exposes a local API at http://localhost:11434 that tools like Open WebUI and coding plugins can talk to.
Next, install LM Studio from lmstudio.ai. Its built-in catalogue searches Hugging Face by model name (for example, qwen2.5), and LM Studio filters versions it thinks your hardware can actually run based on your RAM and VRAM. Download one, click the Chat tab, and you have a working assistant in about five minutes. LM Studio also runs an OpenAI-compatible server at http://localhost:1234/v1, which means any tool that talks to the OpenAI API (including Cursor, LibreChat, or a custom Python script) can be pointed at your local machine without code changes. Run Ollama and LM Studio side by side; they can share downloaded GGUF models through a tool like Golama if disk space matters.
Then raise the context window, which most runners default to a modest 2048 tokens. In Ollama, create a Modelfile with PARAMETER num_ctx 8192 or higher; in LM Studio, raise the context slider in the model settings before loading. Longer context uses more RAM, so pick a value that matches your work. Also touch three sampling parameters: temperature (0.2 for extraction, 0.7 for drafting), top_p (0.9 is a safe default), and repeat_penalty (set to 1.1 to 1.3 to stop the model looping on itself). These settings make an enormous difference, and most early frustration with local models is actually frustration with defaults.
Finally, point your existing tools at the local server. In Python, two lines are enough: from openai import OpenAI; client = OpenAI(base_url='http://localhost:1234/v1', api_key='lm-studio'). Drop a local model into n8n or Make (already covered in our earlier automation guide) to batch-classify inbound emails or summarise meeting notes. The same API shape works everywhere, which is the whole point: you can prototype with Claude, then flip a single URL and run the same workflow on your laptop for free, with no data leaving the machine.
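Expanded into a runnable script, that client call looks like the sketch below. It assumes LM Studio's server is running with a model already loaded; the model name is a placeholder for whatever identifier LM Studio shows you, and the same code works against Ollama's OpenAI-compatible endpoint if you swap the base URL to http://localhost:11434/v1.

# Minimal sketch: chat with a local model through the OpenAI-compatible API.
# Assumes LM Studio is serving at localhost:1234 with a model already loaded.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="qwen2.5-7b-instruct",   # placeholder: match the model you loaded
    temperature=0.7,               # drafting; drop to 0.2 for extraction work
    top_p=0.9,
    messages=[
        {"role": "system", "content": "You are a concise drafting assistant."},
        {"role": "user", "content": "Draft a two-sentence status update on the weekly report."},
    ],
)
print(response.choices[0].message.content)

Nothing here is LM Studio-specific except the port, which is exactly why the flip between cloud and local is a one-line change.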
Prompt Templates
Summarising sensitive documents locally without uploading them to any cloud service.
You are a careful reader. I will paste a document below. Do not give legal or medical advice. Summarise the document in three sections: (1) what this document is, in one sentence; (2) the three most important obligations or recommendations; (3) any dates, deadlines, or dollar amounts. Be literal; do not infer anything not in the text. Document: [paste document here]
Code review on laptops without network access, e.g. on a plane or in a secure facility.
You are reviewing the following code file. Identify up to five concrete issues in order of severity. For each issue, give: (1) the line or function it affects; (2) a one-sentence explanation of the problem; (3) a minimal code change that would fix it. If the code is fine, say so and stop. Do not rewrite the whole file. Code: [paste code here]
Batch-classifying inbound messages across Bahasa, Vietnamese, Thai, Japanese, and Chinese on-device.
You are a customer service triage assistant. Classify each of the following messages into exactly one of these categories: URGENT_REFUND, SHIPPING_ISSUE, PRODUCT_QUESTION, SPAM, OTHER. Return a JSON array with fields id, language, category. Do not include anything else in your response. Messages: [paste messages here, any language]
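To run that triage template as a batch job rather than a chat, a few lines of Python against the same local server are enough. This is a minimal sketch: the model name and the two sample messages are placeholders, and small local models occasionally wrap JSON in extra prose, so validate the output before acting on it.

# Minimal sketch: batch-classify messages with a local model and parse the JSON reply.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

messages_to_classify = [
    {"id": 1, "text": "Mình muốn hoàn tiền gấp, đơn hàng bị hỏng."},
    {"id": 2, "text": "配送はいつ届きますか?"},
]

prompt = (
    "You are a customer service triage assistant. Classify each of the following "
    "messages into exactly one of these categories: URGENT_REFUND, SHIPPING_ISSUE, "
    "PRODUCT_QUESTION, SPAM, OTHER. Return a JSON array with fields id, language, "
    "category. Do not include anything else in your response. Messages: "
    + json.dumps(messages_to_classify, ensure_ascii=False)
)

response = client.chat.completions.create(
    model="qwen2.5-7b-instruct",   # placeholder: match the model you loaded
    temperature=0.2,               # keep low so the JSON stays well-formed
    messages=[{"role": "user", "content": prompt}],
)

print(json.loads(response.choices[0].message.content))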
Common Mistakes
⚠ Downloading the full-precision version of a model
A 7B FP16 model is about 14 GB; the same model in Q4_K_M quantisation is about 4 GB with almost identical quality for everyday use. Always start with Q4_K_M, and only step up to Q6 or Q8 if you are comparing outputs side by side and can see a real difference.
⚠ Ignoring the default 2048 token context window
If you paste in a long document and the model's answer is suspiciously short or contradicts the start of the document, you have almost certainly blown past the context limit. In Ollama, set num_ctx in your Modelfile (or per request through the API, as in the sketch after this list); in LM Studio, raise the context slider before loading the model.
⚠ Running Ollama and LM Studio with separate model copies
Both tools use the GGUF format, so the same 5 GB file does not need to live in two places. Use a shared model directory or a tool like Golama so a single download is visible to both, which saves disk space quickly as your collection grows.
⚠ Expecting local models to match GPT-5 or Claude Opus
They will not, and that is not the point. Open-source 7B to 13B models are outstanding at drafting, classification, summarisation, and routine coding. For hard multi-step reasoning, novel research, or long-form writing at a senior level, keep using frontier cloud models. The right mental model is 'local for high volume and private data, cloud for hard thinking'.
⚠ Leaving the local server exposed to the network
Ollama listens on 11434 and LM Studio on 1234 by default, usually bound to localhost. If you ever set OLLAMA_HOST=0.0.0.0 to share with teammates or run on a VPS, put it behind a firewall or reverse proxy with authentication. A wide-open local model is a wide-open model, and people will find it.
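One follow-up on the context-window mistake above: the fix does not have to live in a Modelfile. A minimal sketch, assuming Ollama is serving its default API at http://localhost:11434 and the llama3.1:8b tag from earlier is already pulled, passes num_ctx and the sampling parameters per request instead.

# Minimal sketch: raise the context window per request through Ollama's native API.
import requests

long_document = "..."  # load or paste the long document here

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.1:8b",
        "stream": False,
        "options": {
            "num_ctx": 8192,         # lift the 2048-token default for this request
            "temperature": 0.2,
            "top_p": 0.9,
            "repeat_penalty": 1.1,
        },
        "messages": [
            {"role": "user", "content": "List every date, deadline, and dollar amount in: " + long_document},
        ],
    },
)
print(resp.json()["message"]["content"])

A bigger num_ctx still costs RAM while the request runs, so size it to the documents you actually feed the model.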
Recommended Tools
Ollama
Command-line runner for open-source models. Best for developers, scripts, and headless servers. Free and open source.
LM Studio
GUI for running local models with an OpenAI-compatible API server. Best for exploring models and non-technical users.
Open WebUI
Self-hosted ChatGPT-style interface that sits on top of Ollama. Adds accounts, RAG, and document upload.
Continue
Free open-source AI coding assistant for VS Code and JetBrains. Plugs directly into Ollama or LM Studio.
Hugging Face
The largest catalogue of open-source models and quantised GGUF files. Both Ollama and LM Studio pull from here.
Gemma
Google DeepMind's open-weights family, recently co-optimised by NVIDIA for RTX. Strong reasoning for its size.