Run AI Locally with Ollama and LM Studio
A practical, privacy-first tutorial for running open-source AI models like Llama, Qwen, and Gemma on your own laptop.

Ollama gives you a fast command-line workflow for pulling, running, and serving open-source models; LM Studio wraps the same idea in a polished GUI with an OpenAI-compatible API at localhost:1234.
A mid-range laptop (16 GB RAM, Apple Silicon M2, or an RTX 3060 laptop GPU with 6 GB VRAM) can comfortably run Q4-quantised 7B to 14B models such as Llama 3.1 8B, Qwen 2.5 14B, Gemma 3, and Phi-4 offline.
Local inference keeps regulated data on-device, which matters for professionals working under Singapore PDPA, Japan APPI, and India DPDPA, and it eliminates per-token costs for high-volume drafting, coding, and summarisation.
Why This Matters
Running AI locally used to mean renting a cloud GPU or wrestling with Python. That changed quickly. Ollama and LM Studio turned local inference into a three-command install. On 2 April 2026, NVIDIA announced co-optimisation of Google DeepMind's Gemma 4 models for RTX edge deployment, and Apple Silicon has quietly become one of the best platforms in the world for running medium-sized models because unified memory lets a 13B model sit comfortably in RAM. The hardware you already own is very likely enough.
There is a productivity argument as well. Once a model is on your machine, there is no rate limit, no monthly subscription, and no network round-trip. Long coding sessions, bulk document summarisation, and experimentation with prompts all become dramatically cheaper. You still use cloud models for the hard reasoning work, but a lot of the day-to-day drafting, classification, and cleanup can happen locally, in private, for free.
How to Do It
Check your hardware before you install anything
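A quick way to sanity-check your hardware: as a rough rule of thumb (an assumption for illustration, not an exact figure), a Q4-quantised model needs about 0.6 GB of RAM or VRAM per billion parameters, plus a couple of GB of overhead for the context window and runtime. A back-of-envelope check in Python:

```python
# Back-of-envelope check: will a Q4-quantised model fit in memory?
# The 0.6 GB-per-billion-parameters figure and the 2 GB overhead
# are rough assumptions, not measurements.
def fits_in_memory(params_billion: float, mem_gb: float,
                   gb_per_billion: float = 0.6, overhead_gb: float = 2.0) -> bool:
    """Return True if a Q4-quantised model of this size should fit."""
    needed = params_billion * gb_per_billion + overhead_gb
    return needed <= mem_gb

# An 8B model on a 16 GB laptop needs roughly 6.8 GB -> fits.
print(fits_in_memory(8, 16))   # True
# A 70B model needs roughly 44 GB -> does not fit in 16 GB.
print(fits_in_memory(70, 16))  # False
```

If the check fails for the model you want, drop to a smaller parameter count or a more aggressive quantisation rather than fighting swap.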
Install Ollama for the command-line workflow
curl -fsSL https://ollama.com/install.sh | sh; on Windows, use the installer. Once installed, open a terminal and run ollama pull llama3.1:8b to download the 8B Llama 3.1 model (about 4.7 GB). Then run ollama run llama3.1:8b and you will drop into an interactive chat. Press Ctrl-D to exit. Other useful commands: ollama list shows what you have downloaded, ollama rm <model> deletes one, and ollama serve exposes a local API at http://localhost:11434 that tools like Open WebUI and coding plugins can talk to.
Install LM Studio for a GUI and model discovery
qwen2.5), and LM Studio filters versions it thinks your hardware can actually run based on your RAM and VRAM. Download one, click the Chat tab, and you have a working assistant in about five minutes. LM Studio also runs an OpenAI-compatible server at http://localhost:1234/v1, which means any tool that talks to the OpenAI API (including Cursor, LibreChat, or a custom Python script) can be pointed at your local machine without code changes. Run Ollama and LM Studio side by side; they can share downloaded GGUF models through a tool like Golama if disk space matters.
Pick the right model for your job, not the biggest one
Set a realistic context length and sampling settings
PARAMETER num_ctx 8192 or higher; in LM Studio, raise the context slider in the model settings before loading. Longer context uses more RAM, so pick a value that matches your work. Also touch three sampling parameters: temperature (0.2 for extraction, 0.7 for drafting), top_p (0.9 is a safe default), and repeat_penalty (set to 1.1 to 1.3 to stop the model looping on itself). These settings make an enormous difference, and most early frustration with local models is actually frustration with defaults.
Connect a model to real work through an API
from openai import OpenAI; client = OpenAI(base_url='http://localhost:1234/v1', api_key='lm-studio'). Drop a local model into n8n or Make (already covered in our earlier automation guide) to batch-classify inbound emails or summarise meeting notes. The same API shape works everywhere, which is the whole point: you can prototype with Claude, then flip a single URL and run the same workflow on your laptop for free, with no data leaving the machine.
What This Actually Looks Like
The Prompt
You are a careful legal summariser. Summarise the following contract clause in three bullet points aimed at a non-lawyer. Do not give legal advice. Do not speculate about anything not in the text. Clause: The Service Provider shall retain Customer Data for a period of thirty-six (36) months from the date of termination, after which Customer Data shall be permanently deleted within ninety (90) days unless longer retention is required by applicable Asian jurisdictions including but not limited to Singapore, Japan, and India.
Example output — your results will vary based on your inputs
- The provider keeps your data for 36 months after the contract ends.
- After that window, they have 90 days to permanently delete it.
- Local laws in Singapore, Japan, and India may force a longer retention period, so timelines can vary by country.
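A summary like the one above can be produced programmatically against the local OpenAI-compatible endpoint, using only the standard library. This is a minimal sketch, assuming LM Studio is serving at localhost:1234; the model name "llama-3.1-8b" is a placeholder you would swap for whatever model you actually have loaded:

```python
# Minimal sketch: send the contract-clause prompt to a local
# OpenAI-compatible chat endpoint (LM Studio defaults to port 1234).
import json
import urllib.request

def build_payload(clause: str, model: str = "llama-3.1-8b") -> dict:
    """Assemble an OpenAI-style chat-completions request body."""
    return {
        "model": model,
        "temperature": 0.2,  # low temperature for faithful extraction
        "messages": [
            {"role": "system",
             "content": "You are a careful legal summariser. Summarise the "
                        "following contract clause in three bullet points "
                        "aimed at a non-lawyer. Do not give legal advice. "
                        "Do not speculate about anything not in the text."},
            {"role": "user", "content": clause},
        ],
    }

def summarise(clause: str, base: str = "http://localhost:1234/v1") -> str:
    """POST the request and return the model's reply text."""
    req = urllib.request.Request(
        base + "/chat/completions",
        data=json.dumps(build_payload(clause)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(summarise("The Service Provider shall retain Customer Data ..."))
```

Because the request shape is standard, pointing `base` at http://localhost:11434/v1 runs the same code against Ollama's OpenAI-compatible endpoint instead.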
How to Edit This
Common Mistakes
Downloading the full-precision version of a model
Ignoring the default 2048 token context window
Running Ollama and LM Studio with separate model copies
Expecting local models to match GPT-5 or Claude Opus
Leaving the local server exposed to the network
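Two of these mistakes, the default context window and looping output, can be fixed once in an Ollama Modelfile instead of per session. A sketch, assuming llama3.1:8b as the base model; the name "mylocal" is a placeholder:

```
# Modelfile: bake a larger context and saner sampling into a custom model.
FROM llama3.1:8b
PARAMETER num_ctx 8192
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.1
```

Build and run it with ollama create mylocal -f Modelfile, then ollama run mylocal; every session inherits these settings.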
Tools That Work for This
Ollama: Command-line runner for open-source models. Best for developers, scripts, and headless servers. Free and open source.
LM Studio: GUI for running local models with an OpenAI-compatible API server. Best for exploring models and non-technical users.
Open WebUI: Self-hosted ChatGPT-style interface that sits on top of Ollama. Adds accounts, RAG, and document upload.
Continue: Free open-source AI coding assistant for VS Code and JetBrains. Plugs directly into Ollama or LM Studio.
Hugging Face: The largest catalogue of open-source models and quantised GGUF files. Both Ollama and LM Studio pull from here.
Gemma: Google DeepMind's open-weights family, recently co-optimised by NVIDIA for RTX. Strong reasoning for its size.