
Run AI Locally with Ollama and LM Studio

A practical, privacy-first tutorial for running open-source AI models like Llama, Qwen, and Gemma on your own laptop.

9 min read · 22 April 2026
local AI · privacy · on-device · Ollama · LM Studio · open source · edge AI · Asia
[Image: editorial still-life of a warmly-lit desk with a laptop, a small glowing circuit board, and a stack of papers in a library, evoking private on-device AI work.]

Ollama gives you a fast command-line workflow for pulling, running, and serving open-source models; LM Studio wraps the same idea in a polished GUI with an OpenAI-compatible API at localhost:1234.

A mid-range laptop (16 GB RAM, Apple Silicon M2 or a laptop RTX 3060 with 6 GB VRAM) can comfortably run Q4 quantised 7B to 13B models such as Llama 3.1 8B, Qwen 2.5 14B, Gemma 3, and Phi-4 offline.

Local inference keeps regulated data on-device, which matters for professionals working under Singapore PDPA, Japan APPI, and India DPDPA, and it eliminates per-token costs for high-volume drafting, coding, and summarisation.

Why This Matters

Cloud AI is fast and powerful, but every prompt you send to ChatGPT or Claude leaves your device. For a lot of the work done across Asia, that is a problem. Lawyers reviewing client files, product teams poking at customer records, finance professionals analysing transactions, and researchers handling unpublished data often cannot paste sensitive text into a cloud model without breaking a policy or a law. Singapore's PDPA, Japan's APPI, and India's DPDPA all restrict how personal and sensitive data can move across borders, and the safest answer is often to not move it at all.

Running AI locally used to mean renting a cloud GPU or wrestling with Python. That changed quickly. Ollama and LM Studio turned local inference into a three-command install. On 2 April 2026, NVIDIA announced co-optimisation of Google DeepMind's Gemma 4 models for RTX edge deployment, and Apple Silicon has quietly become one of the best platforms in the world for running medium-sized models because unified memory lets a 13B model sit comfortably in RAM. The hardware you already own is very likely enough.

There is a productivity argument as well. Once a model is on your machine, there is no rate limit, no monthly subscription, and no network round-trip. Long coding sessions, bulk document summarisation, and experimentation with prompts all become dramatically cheaper. You still use cloud models for the hard reasoning work, but a lot of the day-to-day drafting, classification, and cleanup can happen locally, in private, for free.

How to Do It

1. Check your hardware before you install anything

Local inference is bottlenecked by memory, not CPU. On macOS, open About This Mac and check unified memory. On Windows, open Task Manager and check both RAM and dedicated GPU VRAM. As a rough guide, 8 GB of RAM is enough for small 3B to 7B Q4 quantised models, 16 GB comfortably runs 7B to 13B, and 32 GB or more opens up 30B class models. On Windows, an NVIDIA GPU with at least 6 GB of VRAM (an RTX 3060 or better) gives you a big speed boost through CUDA. If you are on Apple Silicon (M1 through M4), you are already set; unified memory means the GPU can use all of it. If you only have 8 GB, plan to run 7B Q4 models and nothing heavier.
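If you want to sanity-check those rules of thumb, the arithmetic is simple: multiply the parameter count by bits per weight, then add headroom for the KV cache and runtime buffers. A minimal Python sketch; the 4.5 bits-per-weight average for Q4_K_M and the headroom factors are rough assumptions, not exact figures:

```python
def quantised_size_gb(params_billion: float, bits_per_weight: float = 4.5) -> float:
    """Rough RAM footprint of a quantised model.

    Q4_K_M averages about 4.5 bits per weight; the 1.2 factor adds
    headroom for the KV cache and runtime buffers at modest contexts.
    """
    weights_gb = params_billion * bits_per_weight / 8  # 1B params at 8 bits ~ 1 GB
    return round(weights_gb * 1.2, 1)


def fits_in_ram(params_billion: float, ram_gb: int) -> bool:
    # Keep roughly a quarter of system RAM free for the OS and other apps.
    return quantised_size_gb(params_billion) <= ram_gb * 0.75


# Matches the rule of thumb: 7B fits in 8 GB, 13B wants 16 GB, 30B wants 32 GB.
print(quantised_size_gb(7), quantised_size_gb(13), quantised_size_gb(30))
```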
2. Install Ollama for the command-line workflow

Go to ollama.com and install the native app for macOS, Windows, or Linux. On macOS and Linux you can run curl -fsSL https://ollama.com/install.sh | sh; on Windows, use the installer. Once installed, open a terminal and run ollama pull llama3.1:8b to download the 8B Llama 3.1 model (about 4.9 GB). Then run ollama run llama3.1:8b to drop into an interactive chat; press Ctrl-D to exit. Other useful commands: ollama list shows what you have downloaded, ollama rm <model> deletes one, and ollama serve exposes a local API at http://localhost:11434 that tools like Open WebUI and coding plugins can talk to.
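Once the server is up, that local API speaks plain JSON, so you do not even need an SDK. A minimal sketch using only Python's standard library; the model tag and prompt are examples, not requirements:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default ollama serve endpoint


def build_request(model: str, prompt: str, num_ctx: int = 8192) -> dict:
    """Request body for Ollama's /api/generate endpoint (non-streaming)."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx, "temperature": 0.2},
    }


def generate(model: str, prompt: str) -> str:
    """POST the prompt and return the model's reply (needs a running server)."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_request(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# With ollama serve running:
# generate("llama3.1:8b", "Summarise this paragraph in one sentence: ...")
```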
3. Install LM Studio for a GUI and model discovery

Download LM Studio for macOS or Windows. The GUI is genuinely useful for browsing the Hugging Face model catalogue without leaving the app. Search for a model (try qwen2.5), and LM Studio filters the versions your hardware can actually run based on your RAM and VRAM. Download one, click the Chat tab, and you have a working assistant in about five minutes. LM Studio also runs an OpenAI-compatible server at http://localhost:1234/v1, which means any tool that talks to the OpenAI API (including Cursor, LibreChat, or a custom Python script) can be pointed at your local machine without code changes. Run Ollama and LM Studio side by side; they can share downloaded GGUF models through a tool like Gollama if disk space matters.
4. Pick the right model for your job, not the biggest one

More parameters are not always better; the best local model is the one that fits your hardware with room to spare. For general writing, email drafting, and summarisation, Llama 3.1 8B or Qwen 2.5 7B are the sweet spot. For coding, DeepSeek Coder 6.7B or Qwen 2.5 Coder 7B are excellent and often beat closed models on Python. For multilingual work across Asian languages, Qwen 2.5 is outstanding; it was trained with heavy Chinese, Japanese, Korean, and Southeast Asian coverage. For anything reasoning-heavy, step up to Gemma 3 12B or Qwen 2.5 32B if your machine can handle it. Always start with a Q4_K_M quantised version, which shrinks memory use to roughly a quarter of the full-precision footprint with minimal quality loss.
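That decision table fits in a few lines of Python. This is just the paragraph above restated as code, with a hypothetical fallback for machines that cannot hold a 32B model; the Ollama tags shown were current at the time of writing, so check the model library before pulling:

```python
# Ollama model tags for each task, distilled from the recommendations above.
RECOMMENDED = {
    "writing": "llama3.1:8b",
    "coding": "qwen2.5-coder:7b",
    "multilingual": "qwen2.5:7b",
    "reasoning": "qwen2.5:32b",
}


def pick_model(task: str, ram_gb: int) -> str:
    """Return a model tag for the task, downgrading if RAM is tight."""
    model = RECOMMENDED.get(task, "llama3.1:8b")
    # A 32B model at Q4 wants around 20 GB of RAM; fall back to Gemma 3 12B.
    if model.endswith(":32b") and ram_gb < 32:
        return "gemma3:12b"
    return model
```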
5. Set a realistic context length and sampling settings

By default, Ollama and LM Studio cap context at 2048 or 4096 tokens, which will silently truncate long documents. In Ollama, create a Modelfile with PARAMETER num_ctx 8192 or higher; in LM Studio, raise the context slider in the model settings before loading. Longer context uses more RAM, so pick a value that matches your work. Then tune three sampling parameters: temperature (0.2 for extraction, 0.7 for drafting), top_p (0.9 is a safe default), and repeat_penalty (set to 1.1 to 1.3 to stop the model looping on itself). These settings make an enormous difference, and most early frustration with local models is actually frustration with defaults.
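In Ollama those settings live together in a Modelfile. A sketch tuned for extraction work using the values above; the base model tag and the custom name are examples, not requirements:

```
# Build with: ollama create llama3-extract -f Modelfile
FROM llama3.1:8b

PARAMETER num_ctx 8192
PARAMETER temperature 0.2
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.1
```

After ollama create, the tuned model shows up in ollama list and runs like any other tag.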
6. Connect a model to real work through an API

The moment local AI gets useful is when you stop chatting in a window and start plugging it into tools. Point Continue at your Ollama server to get free code completion in VS Code. Use LM Studio's OpenAI-compatible endpoint with Python: from openai import OpenAI; client = OpenAI(base_url='http://localhost:1234/v1', api_key='lm-studio'). Drop a local model into n8n or Make (already covered in our earlier automation guide) to batch-classify inbound emails or summarise meeting notes. The same API shape works everywhere, which is the whole point: you can prototype with Claude, then flip a single URL and run the same workflow on your laptop for free, with no data leaving the machine.
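A minimal sketch of that OpenAI-compatible call using only the standard library, so the identical code runs against LM Studio at localhost:1234, Ollama's /v1 endpoint, or the real OpenAI API by changing one URL. The model name is a placeholder for whatever you have loaded locally:

```python
import json
import urllib.request

# LM Studio's local server; Ollama serves the same shape at localhost:11434/v1,
# and api.openai.com is the same shape again, so only this URL changes.
BASE_URL = "http://localhost:1234/v1"


def chat_payload(prompt: str, model: str = "qwen2.5-7b-instruct",
                 temperature: float = 0.2) -> dict:
    """OpenAI-style chat-completions request body."""
    return {
        "model": model,
        "temperature": temperature,
        "messages": [{"role": "user", "content": prompt}],
    }


def chat(prompt: str) -> str:
    """Send one user message and return the assistant's reply."""
    req = urllib.request.Request(
        BASE_URL + "/chat/completions",
        data=json.dumps(chat_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
    return body["choices"][0]["message"]["content"]

# With LM Studio's server running:
# chat("Classify this email as urgent or routine: 'The server is down.'")
```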

What This Actually Looks Like

The Prompt

You are a careful legal summariser. Summarise the following contract clause in three bullet points aimed at a non-lawyer. Do not give legal advice. Do not speculate about anything not in the text. Clause: The Service Provider shall retain Customer Data for a period of thirty-six (36) months from the date of termination, after which Customer Data shall be permanently deleted within ninety (90) days unless longer retention is required by applicable Asian jurisdictions including but not limited to Singapore, Japan, and India.

Example output — your results will vary based on your inputs

- The provider keeps your data for 36 months after the contract ends.
- After that window, they have 90 days to permanently delete it.
- Local laws in Singapore, Japan, and India may force a longer retention period, so timelines can vary by country.

How This Was Tested

We ran this prompt against Llama 3.1 8B Q4_K_M on a MacBook Pro M2 with 16 GB RAM using Ollama. The response came back in under 4 seconds with temperature set to 0.2 and num_ctx 8192. The output is clean and does not hallucinate any legal advice, which is what you want for a first pass. For actual client work, a larger model such as Qwen 2.5 32B would be more conservative, but for bulk triage the 8B version is perfectly adequate and keeps every word of the contract on your own hard drive.

Common Mistakes

Downloading the full-precision version of a model. FP16 weights for a 7B model need about 14 GB before you even load a context window; grab the Q4 quantised GGUF instead.

Ignoring the default 2048 token context window. Long documents get silently truncated, and the model answers confidently from the fragment it actually saw.

Running Ollama and LM Studio with separate model copies. Each GGUF file is several gigabytes; link the two stores with a tool like Gollama rather than downloading everything twice.

Expecting local models to match GPT-5 or Claude Opus. A 7B to 13B model is a strong drafter and classifier, not a frontier reasoner; escalate the hard problems to the cloud.

Leaving the local server exposed to the network. Ollama and LM Studio bind to localhost by default; rebind to 0.0.0.0 and anyone on your network can query an unauthenticated API.

Tools That Work for This

Ollama — Terminal users, API integrations, and production deployments.

Command-line runner for open-source models. Best for developers, scripts, and headless servers. Free and open source.

LM Studio — Model discovery, chat UI, and drop-in replacement for OpenAI in existing apps.

GUI for running local models with an OpenAI-compatible API server. Best for exploring models and non-technical users.

Open WebUI — Teams who want a shared private chatbot inside their own network.

Self-hosted ChatGPT-style interface that sits on top of Ollama. Adds accounts, RAG, and document upload.

Continue — Code completion and in-editor chat using local models.

Free open-source AI coding assistant for VS Code and JetBrains. Plugs directly into Ollama or LM Studio.

Hugging Face — Finding specific models, reading model cards, and checking benchmarks.

The largest catalogue of open-source models and quantised GGUF files. Both Ollama and LM Studio pull from here.

Gemma — Reasoning tasks and compact on-device deployment on NVIDIA hardware.

Google DeepMind's open-weights family, recently co-optimised by NVIDIA for RTX. Strong reasoning for its size.


Frequently Asked Questions

Do I need expensive hardware to run AI locally?

No. A 16 GB MacBook Air or a mid-range Windows laptop with an RTX 3060 runs Q4 quantised 7B to 13B models smoothly. You only need heavy hardware if you want to run 30B or 70B models locally, and for most daily work the smaller ones are more than enough.

Is it compliant to use local AI for regulated work?

In most cases yes, and it is often more compliant than cloud AI because the data never leaves your device. That said, if you work in a regulated role (legal, medical, finance, public sector) check your organisation's acceptable use policy. Running inference locally with open-weights models under permissive licences such as Apache 2.0, MIT, or the Llama Community Licence is generally fine; double-check the licence page on Hugging Face before deploying anything commercially.

Do local models work for agentic workflows?

Agentic systems often call a model over and over; that makes them expensive and slow against cloud APIs. Running the underlying model locally is one of the big levers for bringing agent cost and latency down. Expect agentic workflows built on Ollama to become a major pattern in 2026, especially for enterprises that cannot send task data to third parties.

Which local models are best for Asian languages?

Qwen 2.5 from Alibaba is currently the strongest all-round choice for Chinese, Japanese, Korean, and Southeast Asian languages, and it has a coder variant too. SEA-LION from AI Singapore is built specifically for Southeast Asian languages and is worth trying for Bahasa Indonesia, Thai, and Tagalog work.

Will I still need ChatGPT or Claude?

Almost certainly yes, but you will use them differently. The practical pattern is: draft, classify, and clean up with a local model; escalate to a frontier model for reasoning-heavy work, novel problems, or anything that genuinely needs world-class output quality. Many professionals end up paying for one cloud subscription and using local models for the other 80 per cent of their AI usage.

Next Steps

If you want more context on where this fits, read our guide on AI data privacy laws across Asia and the RAG explainer to see how to feed a local model your own documents. For a no-code way to wire it into real work, our n8n, Make, and Zapier guide shows how local inference plugs into automation platforms.
