AI in Asia
Intermediate Tutorial

Run AI Locally with Ollama and LM Studio

A practical, privacy-first tutorial for running open-source AI models like Llama, Qwen, and Gemma on your own laptop.

AI Snapshot

  • Ollama gives you a fast command-line workflow for pulling, running, and serving open-source models; LM Studio wraps the same idea in a polished GUI with an OpenAI-compatible API at localhost:1234.
  • A mid-range laptop (16 GB RAM, Apple Silicon M2 or an RTX 3060 with 6 GB VRAM) can comfortably run Q4 quantised 7B to 14B models such as Llama 3.1 8B, Qwen 2.5 14B, Gemma 3, and Phi-4 offline.
  • Local inference keeps regulated data on-device, which matters for professionals working under Singapore PDPA, Japan APPI, and India DPDPA, and it eliminates per-token costs for high-volume drafting, coding, and summarisation.

Why This Matters

Cloud AI is fast and powerful, but every prompt you send to ChatGPT or Claude leaves your device. For a lot of the work done across Asia, that is a problem. Lawyers reviewing client files, product teams poking at customer records, finance professionals analysing transactions, and researchers handling unpublished data often cannot paste sensitive text into a cloud model without breaking a policy or a law. Singapore's PDPA, Japan's APPI, and India's DPDPA all restrict how personal and sensitive data can move across borders, and the safest answer is often to not move it at all.

Running AI locally used to mean renting a cloud GPU or wrestling with Python. That changed quickly. Ollama and LM Studio turned local inference into a three-command install. On 2 April 2026, NVIDIA announced co-optimisation of Google DeepMind's Gemma 4 models for RTX edge deployment, and Apple Silicon has quietly become one of the best platforms in the world for running medium-sized models because unified memory lets a 13B model sit comfortably in RAM. The hardware you already own is very likely enough.

There is a productivity argument as well. Once a model is on your machine, there is no rate limit, no monthly subscription, and no network round-trip. Long coding sessions, bulk document summarisation, and experimentation with prompts all become dramatically cheaper. You still use cloud models for the hard reasoning work, but a lot of the day-to-day drafting, classification, and cleanup can happen locally, in private, for free.

How to Do It

1. Local inference is bottlenecked by memory, not CPU. On macOS, open About This Mac and check unified memory. On Windows, open Task Manager and check both RAM and dedicated GPU VRAM. As a rough guide, 8 GB of RAM is enough for small 3B to 7B Q4 quantised models, 16 GB comfortably runs 7B to 13B, and 32 GB or more opens up 30B-class models. On Windows, an NVIDIA GPU with at least 6 GB of VRAM (an RTX 3060 or better) gives you a big speed boost through CUDA. If you are on Apple Silicon (M1 through M4), you are already set; unified memory means the GPU can use all of it. If you only have 8 GB, plan to run 7B Q4 models and nothing heavier.
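If you want to sanity-check a specific model before downloading it, the arithmetic is simple enough to script. A minimal Python sketch, assuming Q4_K_M costs roughly 0.56 bytes per weight and about 2 GB of runtime and context overhead; both figures are approximations, not exact values:

def q4_size_gb(params_billions):
    # Q4_K_M is roughly 4.5 bits per weight, i.e. ~0.56 bytes per parameter.
    return params_billions * 0.56

def fits(params_billions, memory_gb, overhead_gb=2.0):
    # Leave headroom for the KV cache, the runtime, and your OS.
    return q4_size_gb(params_billions) + overhead_gb <= memory_gb

for size in (7, 13, 32):
    print(f"{size}B at Q4 ~ {q4_size_gb(size):.1f} GB; fits in 16 GB: {fits(size, 16)}")

Run it and the rule of thumb above falls out: 7B and 13B fit comfortably in 16 GB, 32B does not.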
2. Go to ollama.com and install the native app for macOS, Windows, or Linux. On macOS and Linux you can run curl -fsSL https://ollama.com/install.sh | sh; on Windows, use the installer. Once installed, open a terminal and run ollama pull llama3.1:8b to download the 8B Llama 3.1 model (about 4.9 GB). Then run ollama run llama3.1:8b and you will drop into an interactive chat. Press Ctrl-D to exit. Other useful commands: ollama list shows what you have downloaded, ollama rm <model> deletes one, and ollama serve exposes a local API at http://localhost:11434 that tools like Open WebUI and coding plugins can talk to.
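Once the server is running (the desktop app starts it in the background), any script can talk to it over HTTP. A minimal Python sketch against Ollama's /api/generate endpoint; the model name assumes you pulled llama3.1:8b as above:

import requests

# Ask a running Ollama server (default port 11434) for a single completion.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1:8b",   # any model shown by `ollama list`
        "prompt": "Summarise Singapore's PDPA in one sentence.",
        "stream": False,          # one JSON object instead of a token stream
    },
)
print(resp.json()["response"])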
3. Download LM Studio for macOS or Windows. The GUI is genuinely useful for browsing the Hugging Face model catalogue without leaving the app. Search for a model (try qwen2.5), and LM Studio filters to the versions it thinks your hardware can actually run based on your RAM and VRAM. Download one, click the Chat tab, and you have a working assistant in about five minutes. LM Studio also runs an OpenAI-compatible server at http://localhost:1234/v1, which means any tool that talks to the OpenAI API (including Cursor, LibreChat, or a custom Python script) can be pointed at your local machine without code changes. You can run Ollama and LM Studio side by side; if disk space matters, they can share downloaded GGUF models through a tool like Gollama.
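A quick way to confirm the local server is actually up is to list what it is serving through the standard OpenAI-style models route. A sketch, assuming you have started the server in LM Studio and loaded at least one model:

import requests

# The OpenAI-compatible server exposes a /v1/models listing (default port 1234).
models = requests.get("http://localhost:1234/v1/models").json()
for m in models["data"]:
    print(m["id"])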
4. More parameters do not always mean better output; the best local model is the one that fits your hardware with room to spare. For general writing, email drafting, and summarisation, Llama 3.1 8B or Qwen 2.5 7B are the sweet spot. For coding, DeepSeek Coder 6.7B or Qwen 2.5 Coder 7B are excellent and often beat closed models on Python. For multilingual work across Asian languages, Qwen 2.5 is outstanding; it was trained with heavy Chinese, Japanese, Korean, and Southeast Asian coverage. For anything reasoning-heavy, step up to Gemma 3 12B or Qwen 2.5 32B if your machine can handle it. Always start with a Q4_K_M quantised version, which cuts memory use to less than a third of full precision with minimal quality loss.
5. By default, Ollama and LM Studio cap context at 2048 or 4096 tokens, which will silently truncate long documents. In Ollama, create a Modelfile with PARAMETER num_ctx 8192 or higher; in LM Studio, raise the context slider in the model settings before loading. Longer context uses more RAM, so pick a value that matches your work. Also tune three sampling parameters: temperature (0.2 for extraction, 0.7 for drafting), top_p (0.9 is a safe default), and repeat_penalty (1.1 to 1.3 stops the model looping on itself). These settings make an enormous difference, and most early frustration with local models is actually frustration with defaults.
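In scripts you can also set all of these per request rather than baking them into a Modelfile. A sketch against Ollama's /api/chat endpoint; the option names are Ollama's, the values follow the guidance above, and the model is again the one pulled in step 2:

import requests

# Per-request overrides: a larger context window for a long document,
# low temperature for literal extraction, and a repeat penalty to stop loops.
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.1:8b",
        "messages": [{"role": "user", "content": "List every date in this contract: ..."}],
        "options": {
            "num_ctx": 8192,       # context length in tokens; costs RAM
            "temperature": 0.2,    # 0.2 for extraction, ~0.7 for drafting
            "top_p": 0.9,
            "repeat_penalty": 1.1,
        },
        "stream": False,
    },
)
print(resp.json()["message"]["content"])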
6. The moment local AI gets useful is when you stop chatting in a window and start plugging it into tools. Point Continue at your Ollama server to get free code completion in VS Code. Use LM Studio's OpenAI-compatible endpoint from Python, as in the sketch below. Drop a local model into n8n or Make (covered in our earlier automation guide) to batch-classify inbound emails or summarise meeting notes. The same API shape works everywhere, which is the whole point: you can prototype with Claude, then flip a single URL and run the same workflow on your laptop for free, with no data leaving the machine.
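Spelled out, the LM Studio version looks like this. A sketch using the official openai Python package; the model name is a placeholder for whichever model you loaded, and the api_key can be any non-empty string because the local server ignores it:

from openai import OpenAI

# Point the standard OpenAI client at the local server instead of the cloud.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

reply = client.chat.completions.create(
    model="qwen2.5-7b-instruct",   # placeholder: use an id from /v1/models
    messages=[{"role": "user", "content": "Draft a polite follow-up email about an unpaid invoice."}],
    temperature=0.7,
)
print(reply.choices[0].message.content)

Swap base_url for a cloud provider's endpoint and the same script runs against a frontier model; that one-line change is the portability this step describes.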

Prompt Templates

Summarising sensitive documents locally without uploading them to any cloud service.

You are a careful reader. I will paste a document below. Do not give legal or medical advice. Summarise the document in three sections: (1) what this document is, in one sentence; (2) the three most important obligations or recommendations; (3) any dates, deadlines, or dollar amounts. Be literal; do not infer anything not in the text.

Document: [paste document here]

Code review on laptops without network access, e.g. on a plane or in a secure facility.

You are reviewing the following code file. Identify up to five concrete issues in order of severity. For each issue, give: (1) the line or function it affects; (2) a one-sentence explanation of the problem; (3) a minimal code change that would fix it. If the code is fine, say so and stop. Do not rewrite the whole file.

Code:
[paste code here]

Batch-classifying inbound messages across Bahasa Indonesia, Vietnamese, Thai, Japanese, and Chinese on-device.

You are a customer service triage assistant. Classify each of the following messages into exactly one of these categories: URGENT_REFUND, SHIPPING_ISSUE, PRODUCT_QUESTION, SPAM, OTHER. Return a JSON array with fields id, language, category. Do not include anything else in your response.

Messages:
[paste messages here, any language]
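As a sketch of how this triage template plugs into a local server for batch work: the example below uses the OpenAI-compatible endpoint, placeholder messages, and a placeholder model id, and it assumes the model obeys the "JSON array only" instruction, which smaller models occasionally ignore, so wrap json.loads in error handling for real workloads.

import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

TEMPLATE = """You are a customer service triage assistant. Classify each of the
following messages into exactly one of these categories: URGENT_REFUND,
SHIPPING_ISSUE, PRODUCT_QUESTION, SPAM, OTHER. Return a JSON array with fields
id, language, category. Do not include anything else in your response.

Messages:
{messages}"""

inbox = [  # placeholder data; in practice, read from your helpdesk export
    {"id": 1, "text": "Paket saya belum sampai, sudah dua minggu."},   # Indonesian: parcel not arrived
    {"id": 2, "text": "この商品は日本でも使えますか？"},                  # Japanese: does this work in Japan?
]

payload = "\n".join(f"{m['id']}: {m['text']}" for m in inbox)
reply = client.chat.completions.create(
    model="qwen2.5-7b-instruct",   # placeholder: any multilingual local model
    messages=[{"role": "user", "content": TEMPLATE.format(messages=payload)}],
    temperature=0.2,
)
print(json.loads(reply.choices[0].message.content))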

Common Mistakes

⚠ Downloading the full-precision version of a model

A 7B FP16 model is about 14 GB; the same model in Q4_K_M quantisation is about 4 GB with almost identical quality for everyday use. Always start with Q4_K_M, and only step up to Q6 or Q8 if you are comparing outputs side by side and can see a real difference.

⚠ Ignoring the default 2048 token context window

If you paste in a long document and the model's answer is suspiciously short or contradicts the start of the document, you have almost certainly blown past the context limit. In Ollama, set num_ctx in your Modelfile; in LM Studio, raise the context slider before loading the model.

⚠ Running Ollama and LM Studio with separate model copies

Both tools use the GGUF format, so the same 5 GB file does not need to live in two places. Use a shared model directory or a tool like Gollama so a single download is visible to both; the savings add up quickly as your collection grows.

⚠ Expecting local models to match GPT-5 or Claude Opus

They will not, and that is not the point. Open-source 7B to 13B models are outstanding at drafting, classification, summarisation, and routine coding. For hard multi-step reasoning, novel research, or long-form writing at a senior level, keep using frontier cloud models. The right mental model is 'local for high volume and private data, cloud for hard thinking'.

⚠ Leaving the local server exposed to the network

Ollama listens on 11434 and LM Studio on 1234 by default, usually bound to localhost. If you ever set OLLAMA_HOST=0.0.0.0 to share with teammates or run on a VPS, put it behind a firewall or reverse proxy with authentication. A wide-open local model is a wide-open model, and people will find it.

Recommended Tools

Ollama

Command-line runner for open-source models. Best for developers, scripts, and headless servers. Free and open source.


LM Studio

GUI for running local models with an OpenAI-compatible API server. Best for exploring models and non-technical users.


Open WebUI

Self-hosted ChatGPT-style interface that sits on top of Ollama. Adds accounts, RAG, and document upload.


Continue

Free open-source AI coding assistant for VS Code and JetBrains. Plugs directly into Ollama or LM Studio.


Hugging Face

The largest catalogue of open-source models and quantised GGUF files. Both Ollama and LM Studio pull from here.


Gemma

Google DeepMind's open-weights family, recently co-optimised by NVIDIA for RTX. Strong reasoning for its size.


FAQ

Do I need a gaming PC or a Mac Studio to do this?
No. A 16 GB MacBook Air or a mid-range Windows laptop with an RTX 3060 runs Q4 quantised 7B to 13B models smoothly. You only need heavy hardware if you want to run 30B or 70B models locally, and for most daily work the smaller ones are more than enough.
Is running local AI actually legal for my work?
In most cases yes, and it is often more compliant than cloud AI because the data never leaves your device. That said, if you work in a regulated role (legal, medical, finance, public sector) check your organisation's acceptable use policy. Running inference locally with open-weights models under permissive licences such as Apache 2.0, MIT, or Llama 3 Community Licence is generally fine; double-check the licence page on Hugging Face before deploying anything commercially.
How does local AI compare to the new agentic AI tools coming out?
Agentic systems often call a model over and over; that makes them expensive and slow against cloud APIs. Running the underlying model locally is one of the big levers for bringing agent cost and latency down. Expect agentic workflows built on Ollama to become a major pattern in 2026, especially for enterprises that cannot send task data to third parties.
Which local model is best for Asian languages?
Qwen 2.5 from Alibaba is currently the strongest all-round choice for Chinese, Japanese, Korean, and Southeast Asian languages, and it has a coder variant too. SEA-LION from AI Singapore is built specifically for Southeast Asian languages and is worth trying for Bahasa Indonesia, Thai, and Tagalog work.
Will I still need ChatGPT or Claude after this?
Almost certainly yes, but you will use them differently. The practical pattern is: draft, classify, and clean up with a local model; escalate to a frontier model for reasoning-heavy work, novel problems, or anything that genuinely needs world-class output quality. Many professionals end up paying for one cloud subscription and using local models for the other 80 per cent of their AI usage.

Next Steps

If you want more context on where this fits, read our guide on AI data privacy laws across Asia and the RAG explainer to see how to feed a local model your own documents. For a no-code way to wire it into real work, our n8n, Make, and Zapier guide shows how local inference plugs into automation platforms.