How to Run AI Models Locally with Ollama and LM Studio - Complete Guide
Introduction: Why Run AI Models on Your Own PC?
Every time you use ChatGPT, Claude, or Gemini, your prompts travel across the internet to a data center, get processed, and the response comes back. That works fine for most people—until it doesn’t. Maybe you’re handling sensitive company data that can’t leave your network. Maybe you’re tired of monthly subscription fees stacking up. Or maybe your internet went down and you realized you had zero access to AI tools.
Local AI changes that equation entirely. With tools like Ollama and LM Studio, you can download open-source large language models (LLMs) and run them directly on your computer. No internet required. No data leaving your machine. No subscription fees after the initial setup.
This guide walks you through everything: what local AI actually means, how to set up both Ollama (command-line focused) and LM Studio (GUI-based), which models to pick for your hardware, and how to get real work done with locally-running models. Whether you’re a developer who wants a local coding assistant, a privacy-conscious professional, or just someone curious about how LLMs actually work under the hood, this guide has you covered.
What you’ll have by the end: At least one local AI model running on your PC, capable of answering questions, writing code, summarizing documents, and more—completely offline.
Difficulty: Beginner to Intermediate
Time required: 30–60 minutes (depending on download speeds)
Cost: $0 (all tools and models are free and open-source)
Prerequisites: What You Need Before Starting
Hardware Requirements
Local AI models run on your hardware, so your specs matter. Here’s the realistic breakdown:
- Minimum (for 7B parameter models): 8 GB RAM, any modern CPU (Intel i5/AMD Ryzen 5 or better), 10 GB free disk space
- Recommended (for 13B–14B models): 16 GB RAM, a dedicated GPU with 8+ GB VRAM (NVIDIA RTX 3060 or better), 30 GB free disk space
- Ideal (for 70B models): 32+ GB RAM, NVIDIA RTX 4090 or dual GPUs with 24+ GB VRAM total, 100+ GB free disk space
Important note on GPUs: NVIDIA GPUs work best due to CUDA support. AMD GPUs work with Ollama on Linux (ROCm) and partially on Windows. Apple Silicon Macs (M1/M2/M3/M4) are excellent for local AI thanks to unified memory—a MacBook Pro with 32 GB unified memory can run 30B+ parameter models smoothly.
Software Requirements
- Operating System: Windows 10/11, macOS 12+, or Linux (Ubuntu 22.04+ recommended)
- For Ollama: Terminal/command line familiarity helps, but isn’t strictly required
- For LM Studio: No technical background needed—it’s fully GUI-based
- Internet connection: Only needed for the initial download of tools and models (not for running them afterward)
Understanding Model Sizes
Before you download anything, understand what the numbers mean:
| Model Size | RAM Needed | VRAM Needed | Quality Level | Example Models |
|---|---|---|---|---|
| 1B–3B | 4 GB | 2–3 GB | Basic tasks, simple Q&A | Llama 3.2 1B, Phi-3 Mini |
| 7B–8B | 8 GB | 5–6 GB | Good for most tasks | Llama 3.1 8B, Mistral 7B, Gemma 2 9B |
| 13B–14B | 16 GB | 10–12 GB | Near-GPT-3.5 quality | Qwen 2.5 14B, Phi-3 Medium 14B |
| 30B–34B | 32 GB | 20+ GB | Strong reasoning | CodeLlama 34B, Yi 34B |
| 70B | 48+ GB | 40+ GB | Near-GPT-4 for some tasks | Llama 3.1 70B |
Step-by-Step Instructions
Step 1: Install Ollama (The Command-Line Option)
Ollama is the fastest way to get a local model running. It handles model management, quantization, and serving through a simple CLI.
On Windows: Download the installer from the official Ollama website. Run the .exe file and follow the installation wizard. Ollama will install and run as a background service automatically.
On macOS: Download the .dmg from the Ollama website, or install via Homebrew:
brew install ollama
**On Linux:** Run the one-line installer in your terminal:
curl -fsSL https://ollama.com/install.sh | sh
After installation, verify it works by opening a terminal and typing:
ollama --version
You should see a version number like ollama version 0.6.x. If you get a "command not found" error on Windows, restart your terminal or reboot.
Tip: Ollama runs a local API server on port 11434 by default. This becomes important later when connecting other tools to your local models.
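If you want to confirm that the API server is listening before wiring anything up, a quick check from Python works. This is a minimal sketch; it assumes the requests package is installed and that Ollama is running on its default port:
import requests

resp = requests.get("http://localhost:11434")
print(resp.status_code, resp.text)  # a healthy server typically answers 200 with a short status message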
Step 2: Download and Run Your First Model with Ollama
With Ollama installed, pulling and running a model is a single command:
ollama run llama3.1
This downloads the Llama 3.1 8B model (about 4.7 GB in its default quantized form) and immediately starts an interactive chat session. The first run will take several minutes depending on your internet speed. Subsequent launches are instant since the model is cached locally.
Try asking it something:
>>> Explain how a transformer neural network works in simple terms
>>> Write a Python function that reads a CSV file and returns the top 5 rows by a given column
>>> Summarize the pros and cons of microservices architecture
To exit the chat, type /bye or press Ctrl+D.
Other popular models to try:
ollama run mistral # Mistral 7B - fast and capable
ollama run codellama # Optimized for code generation
ollama run gemma2 # Google’s Gemma 2 9B
ollama run phi3 # Microsoft’s Phi-3 - small but smart
ollama run llama3.1:70b # 70B version (needs serious hardware)
**Tip:** Run ollama list to see all downloaded models and their sizes. Run ollama rm model-name to delete models you no longer need and reclaim disk space.
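If you prefer to script this, the same information is available over the local API. Here is a small sketch that queries Ollama's model-listing endpoint (GET /api/tags); the field names reflect the current API and may change between versions:
import requests

models = requests.get("http://localhost:11434/api/tags").json().get("models", [])
for m in models:
    print(f"{m['name']:<30} {m['size'] / 1e9:.1f} GB")  # name and on-disk size of each cached model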
Step 3: Install LM Studio (The GUI Option)
If you prefer a visual interface over the command line, LM Studio is your tool. It provides a polished desktop application for browsing, downloading, and chatting with local models.
Download LM Studio from the official website. It’s available for Windows, macOS, and Linux. The installer is straightforward—run it and follow the prompts.
When you first open LM Studio, you’ll see a clean interface with a model search bar at the top. The application auto-detects your hardware (CPU, GPU, RAM) and shows compatibility indicators next to each model.
Tip: LM Studio uses the GGUF model format. It connects to Hugging Face’s model repository, so you have access to thousands of community-uploaded models. Stick to models from verified uploaders for safety.
Step 4: Download and Chat with a Model in LM Studio
- Click the search icon (magnifying glass) in the left sidebar
- Search for “Llama 3.1 8B” or “Mistral 7B”
- You’ll see multiple versions—look for Q4_K_M quantization as a balanced starting point (good quality, reasonable size)
- Click the download button next to your chosen model
- Once downloaded, click the chat icon in the left sidebar
- Select your model from the dropdown at the top
- Start chatting in the text field at the bottom
LM Studio shows real-time performance stats: tokens per second, memory usage, and GPU utilization. This helps you gauge whether a model is running efficiently on your hardware.
Performance tuning in LM Studio: In the right sidebar, you can adjust parameters like:
- GPU Offload Layers: Higher = more GPU usage = faster. Set to maximum if your VRAM allows.
- Context Length: Default is usually 4096 tokens. Increase for longer conversations, but this uses more memory.
- Temperature: Lower (0.1–0.4) for factual/coding tasks, higher (0.7–1.0) for creative writing.
Step 5: Use the Local API for Development
Both Ollama and LM Studio expose an OpenAI-compatible API, which means any tool or script built for the OpenAI API can work with your local models with minimal changes.
Ollama API (runs automatically on port 11434):
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1",
  "messages": [{"role": "user", "content": "What is quantum computing?"}]
}'
**Using with Python (via the openai library):**
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama
    api_key="not-needed"  # Local models don't need API keys
)

response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Write a haiku about programming"}]
)

print(response.choices[0].message.content)
**LM Studio API:** Go to the "Local Server" tab in LM Studio, load a model, and click "Start Server." It runs on port 1234 by default. Replace the base_url above with http://localhost:1234/v1.
Tip: This OpenAI-compatible API is what makes local models truly powerful. You can connect them to VS Code extensions like Continue, use them with LangChain or LlamaIndex, or build your own applications—all running locally.
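Because the endpoint is OpenAI-compatible, streaming and sampling parameters work the same way as against the hosted API. A short sketch, assuming the same local Ollama server as above (whether every option is honored can vary by server version):
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Explain Big-O notation briefly"}],
    temperature=0.3,  # lower temperature suits factual answers
    stream=True,      # receive partial chunks as they are generated
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()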
Step 6: Choose the Right Model for Your Use Case
Not all models are created equal. Here’s a practical guide based on what you actually want to do:
| Use Case | Best Model Choice | Why |
|---|---|---|
| General Q&A / Chat | Llama 3.1 8B | Best overall quality at 8B size |
| Code generation | CodeLlama 13B or DeepSeek Coder V2 | Trained specifically on code |
| Creative writing | Mistral 7B or Llama 3.1 8B | Good instruction following, natural prose |
| Summarization | Gemma 2 9B | Strong at understanding and condensing text |
| Math / Reasoning | Phi-3 Medium (14B) or Qwen 2.5 | Strong logical reasoning capabilities |
| Multilingual tasks | Qwen 2.5 or Llama 3.1 | Trained on diverse language data |
| Low-resource machines | Phi-3 Mini (3.8B) or Gemma 2 2B | Surprisingly capable for their size |
Step 7: Set Up a Local AI Coding Assistant
One of the most practical uses of local AI is as a coding assistant inside your editor. Here’s how to set it up with VS Code:
- Install the Continue extension from the VS Code marketplace
- Open Continue’s settings (click the gear icon in the Continue sidebar)
- Configure it to use your local Ollama instance:
{
  "models": [{
    "title": "Local Llama 3.1",
    "provider": "ollama",
    "model": "llama3.1"
  }],
  "tabAutocompleteModel": {
    "title": "Local Autocomplete",
    "provider": "ollama",
    "model": "codellama:7b-code"
  }
}
Now you have AI code completion and chat running entirely on your machine. Highlight code, press Ctrl+L, and ask questions about it. Use Ctrl+I to generate code inline.
Tip: For autocomplete specifically, use a smaller, faster model like codellama:7b-code or starcoder2:3b. Autocomplete needs to be fast (under 500ms), so smaller models work better here than larger ones.
Step 8: Create Custom Model Configurations
Ollama lets you create custom model configurations using a Modelfile—think of it as a Dockerfile for AI models:
# Save as Modelfile
FROM llama3.1
# Set the system prompt
SYSTEM """You are a senior Python developer. You write clean, well-documented code following PEP 8. You always include type hints and docstrings."""

# Adjust generation parameters
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 8192
Create the custom model:
ollama create python-expert -f Modelfile
Now run it:
ollama run python-expert
This gives you a specialized model pre-configured for your specific use case. You can create as many custom models as you want—a writing assistant, a code reviewer, a data analyst—all based on the same underlying model but with different system prompts and parameters.
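To sanity-check that the baked-in system prompt is being applied, you can call the custom model through the same OpenAI-compatible API from Step 5. A minimal sketch:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="python-expert",  # the name given to `ollama create`
    messages=[{"role": "user", "content": "Write a function that merges two sorted lists"}],
)
print(response.choices[0].message.content)  # should come back with type hints and a docstring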
Common Mistakes and How to Avoid Them
1. Downloading Models That Are Too Large for Your Hardware
A 70B model sounds impressive, but if you only have 16 GB RAM, it will either crash or run at 0.5 tokens per second (unusably slow). Instead: Start with a 7B or 8B model. Check that the quantized model file size is less than 75% of your available RAM. Scale up only after confirming smooth performance.
2. Ignoring GPU Offloading
Running a model entirely on CPU when you have a capable GPU is leaving massive performance on the table. A model that generates 5 tokens/second on CPU might hit 40+ tokens/second with GPU offloading. Instead: In Ollama, GPU offloading is automatic if CUDA drivers are installed. In LM Studio, manually increase the “GPU Offload Layers” slider to maximum. Watch VRAM usage—if it exceeds your GPU memory, reduce layers by 5–10 until stable.
3. Using the Wrong Quantization for Your Needs
Some people download the smallest quantization (Q2/Q3) thinking they’re saving resources wisely. But heavily quantized models produce noticeably worse output—garbled reasoning, factual errors, repetitive text. Instead: Use Q4_K_M as your baseline. If that doesn’t fit in memory, try a smaller model at Q4_K_M rather than a bigger model at Q2. A well-quantized 7B model outperforms a badly-quantized 13B model in practice.
4. Setting Context Length Too High
Cranking the context window to 32K or 128K because the model supports it will eat your RAM alive. Context length has a roughly linear relationship with memory usage. Instead: Use 4096 tokens for most conversations. Only increase context length when you actually need it (long document analysis, extended coding sessions). Monitor memory usage when you increase it.
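A rough back-of-the-envelope calculation shows why. This sketch assumes Llama 3.1 8B's published architecture (32 layers, 8 KV heads, head dimension 128) and an fp16 KV cache; actual usage varies by runtime and cache quantization:
layers, kv_heads, head_dim, bytes_per_value = 32, 8, 128, 2  # assumed Llama 3.1 8B shape, fp16 cache
bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # keys and values per token
for ctx in (4_096, 32_768, 131_072):
    print(f"{ctx:>7} tokens -> {ctx * bytes_per_token / 2**30:.1f} GiB of KV cache")
# 4096 -> ~0.5 GiB, 32768 -> ~4 GiB, 131072 -> ~16 GiB: linear growth, on top of the model weights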
5. Expecting Cloud-Level Performance
Local 7B models are not GPT-4. They’re remarkably capable for their size, but they have limitations: they can hallucinate more, struggle with very complex reasoning chains, and may not know about events after their training cutoff. Instead: Use local models for tasks they excel at—drafting, code completion, summarization, brainstorming, data formatting. For tasks requiring the highest accuracy, consider a hybrid approach: use local models for drafts and cloud models for critical final reviews.
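The OpenAI-compatible API makes this hybrid pattern cheap to implement: the same client code can target the local model for drafting and a cloud model for the final pass, switched only by base_url and model name. A minimal sketch (the cloud model name and environment variable are illustrative placeholders for whatever provider you actually use):
import os
from openai import OpenAI

local = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")
cloud = OpenAI(api_key=os.environ["OPENAI_API_KEY"])  # assumes an OpenAI account; swap in your provider

draft = local.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Draft a release note for version 2.3"}],
).choices[0].message.content

review = cloud.chat.completions.create(
    model="gpt-4o",  # illustrative cloud model name
    messages=[{"role": "user", "content": f"Tighten and fact-check this draft:\n\n{draft}"}],
).choices[0].message.content
print(review)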
Frequently Asked Questions
Is running AI locally actually private? Can it still phone home?
Yes, local AI is genuinely private. Ollama and LM Studio do not send your prompts or responses anywhere. The models run entirely in your computer’s memory. You can verify this by disconnecting from the internet and confirming the model still works perfectly. The only network activity happens when you initially download the tool and models. For maximum security, you can even firewall the applications after installation to block all outbound connections.
How does local AI quality compare to ChatGPT or Claude?
It depends heavily on the model size and the task. A local Llama 3.1 8B model performs roughly on par with GPT-3.5 for most tasks and can even beat it on certain benchmarks. For coding, specialized models like DeepSeek Coder V2 are competitive with much larger cloud models. However, for complex multi-step reasoning, nuanced creative writing, or broad knowledge questions, GPT-4 and Claude still have a significant edge. The gap is closing rapidly—models released in early 2025 are dramatically better than those from a year earlier.
Can I run local AI on a laptop without a dedicated GPU?
Absolutely. CPU-only inference works fine for 7B models and smaller. Expect speeds of 3–10 tokens per second on a modern laptop CPU, which is readable but slower than cloud services. Apple Silicon MacBooks are particularly good at this because their unified memory architecture lets the GPU and CPU share the same memory pool efficiently. A MacBook Air M2 with 16 GB RAM can run 7B models at roughly 15–20 tokens per second, which is comfortably fast for interactive chat.
How much disk space do I need for local AI models?
A single 7B model in Q4 quantization takes about 4–5 GB. A 13B model takes 7–8 GB. A 70B model takes 35–40 GB. If you plan to keep multiple models downloaded, budget 50–100 GB of free space. Both Ollama and LM Studio store models in specific directories that you can change if your primary drive is small. Ollama stores models in ~/.ollama/models and you can change this with the OLLAMA_MODELS environment variable.
Can I fine-tune or train my own local model?
You can fine-tune existing models locally using tools like Unsloth, Axolotl, or the Hugging Face Transformers library. Fine-tuning adjusts a pre-trained model using your own data—for example, training it on your company’s coding standards or your writing style. This requires more technical knowledge and a GPU with at least 12 GB VRAM for QLoRA fine-tuning of 7B models. Full fine-tuning of larger models typically requires multiple high-end GPUs or cloud compute.
Summary and Next Steps
Key Takeaways
- Ollama is best for developers and command-line users—fast setup, scriptable, great API compatibility
- LM Studio is best for visual users who want a polished GUI experience with built-in model browsing
- Start with 7B/8B models in Q4_K_M quantization—best quality-to-resource ratio for most hardware
- GPU offloading is the single biggest performance lever—use it if you have a dedicated GPU
- Local AI is truly private—no data leaves your machine once models are downloaded
- The OpenAI-compatible API lets you plug local models into existing tools and workflows with minimal changes
- Custom Modelfiles in Ollama let you create specialized assistants for different tasks
What to Do Next
- Experiment with different models: Try at least 3 different models to find which one works best for your primary use case
- Set up a coding assistant: Connect Ollama to VS Code via the Continue extension for local code completion
- Build something: Use the local API to create a simple Python script that processes your documents through a local LLM (a starter sketch follows this list)
- Explore RAG: Look into Retrieval-Augmented Generation (RAG) to let your local model answer questions about your own documents using tools like PrivateGPT or AnythingLLM
- Join the community: The r/LocalLLaMA subreddit and Ollama Discord are excellent resources for staying up to date with new models and techniques
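As a starting point for the "Build something" item, here is a sketch that summarizes every .txt file in a folder with the local model. The folder path, model name, and truncation limit are arbitrary choices for illustration:
from pathlib import Path
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

for doc in Path("./docs").glob("*.txt"):
    text = doc.read_text(encoding="utf-8")[:8000]  # crude truncation to stay within the context window
    summary = client.chat.completions.create(
        model="llama3.1",
        messages=[
            {"role": "system", "content": "Summarize documents in three bullet points."},
            {"role": "user", "content": text},
        ],
    ).choices[0].message.content
    print(f"--- {doc.name} ---\n{summary}\n")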