How to Run AI Models Locally with Ollama and LM Studio - Complete Guide

Introduction: Why Run AI Models on Your Own PC?

Every time you use ChatGPT, Claude, or Gemini, your prompts travel across the internet to a data center, get processed, and the response comes back. That works fine for most people—until it doesn’t. Maybe you’re handling sensitive company data that can’t leave your network. Maybe you’re tired of monthly subscription fees stacking up. Or maybe your internet went down and you realized you had zero access to AI tools.

Local AI changes that equation entirely. With tools like Ollama and LM Studio, you can download open-source large language models (LLMs) and run them directly on your computer. No internet required. No data leaving your machine. No subscription fees after the initial setup.

This guide walks you through everything: what local AI actually means, how to set up both Ollama (command-line focused) and LM Studio (GUI-based), which models to pick for your hardware, and how to get real work done with locally-running models. Whether you’re a developer who wants a local coding assistant, a privacy-conscious professional, or just someone curious about how LLMs actually work under the hood, this guide has you covered.

What you’ll have by the end: At least one local AI model running on your PC, capable of answering questions, writing code, summarizing documents, and more—completely offline.

Difficulty: Beginner to Intermediate
Time required: 30–60 minutes (depending on download speeds)
Cost: $0 (all tools and models are free and open-source)

Prerequisites: What You Need Before Starting

Hardware Requirements

Local AI models run on your hardware, so your specs matter. Here’s the realistic breakdown:

  • Minimum (for 7B parameter models): 8 GB RAM, any modern CPU (Intel i5/AMD Ryzen 5 or better), 10 GB free disk space
  • Recommended (for 13B–14B models): 16 GB RAM, a dedicated GPU with 8+ GB VRAM (NVIDIA RTX 3060 or better), 30 GB free disk space
  • Ideal (for 70B models): 32+ GB RAM, NVIDIA RTX 4090 or dual GPUs with 24+ GB VRAM total, 100+ GB free disk space
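As a rough sanity check, the tiers above can be encoded in a small helper. This is a sketch based only on the numbers in this guide (the function name and thresholds are mine, not from any official sizing tool); real performance also depends on quantization and GPU offloading.

```python
def suggest_model_tier(ram_gb: float, vram_gb: float = 0) -> str:
    """Map available RAM/VRAM to the largest model tier from the list above.

    Rule of thumb only -- actual requirements vary with quantization.
    """
    if ram_gb >= 32 and vram_gb >= 24:
        return "70B"
    if ram_gb >= 16 and vram_gb >= 8:
        return "13B-14B"
    if ram_gb >= 8:
        return "7B-8B"
    return "1B-3B"

print(suggest_model_tier(16, 8))   # a mid-range gaming PC
print(suggest_model_tier(8))       # a typical office laptop
```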

Important note on GPUs: NVIDIA GPUs work best due to CUDA support. AMD GPUs work with Ollama on Linux (ROCm) and partially on Windows. Apple Silicon Macs (M1/M2/M3/M4) are excellent for local AI thanks to unified memory—a MacBook Pro with 32 GB unified memory can run 30B+ parameter models smoothly.

Software Requirements

  • Operating System: Windows 10/11, macOS 12+, or Linux (Ubuntu 22.04+ recommended)
  • For Ollama: Terminal/command line familiarity helps, but isn’t strictly required
  • For LM Studio: No technical background needed—it’s fully GUI-based
  • Internet connection: Only needed for the initial download of tools and models (not for running them afterward)

Understanding Model Sizes

Before you download anything, understand what the numbers mean:

| Model Size | RAM Needed | VRAM Needed | Quality Level | Example Models |
|---|---|---|---|---|
| 1B–3B | 4 GB | 2–3 GB | Basic tasks, simple Q&A | Llama 3.2 1B, Phi-3 Mini |
| 7B–8B | 8 GB | 5–6 GB | Good for most tasks | Llama 3.1 8B, Mistral 7B, Gemma 2 9B |
| 13B–14B | 16 GB | 10–12 GB | Near-GPT-3.5 quality | Qwen 2.5 14B, Phi-3 Medium (quantized) |
| 30B–34B | 32 GB | 20+ GB | Strong reasoning | CodeLlama 34B, Yi 34B |
| 70B | 48+ GB | 40+ GB | Near-GPT-4 for some tasks | Llama 3.1 70B |
Most people should start with a 7B or 8B model. They offer the best balance between quality and hardware requirements, and they run comfortably on machines with 16 GB RAM.

Step-by-Step Instructions

Step 1: Install Ollama (The Command-Line Option)

Ollama is the fastest way to get a local model running. It handles model management, quantization, and serving through a simple CLI.

On Windows: Download the installer from the official Ollama website. Run the .exe file and follow the installation wizard. Ollama will install and run as a background service automatically.

On macOS: Download the .dmg from the Ollama website, or install via Homebrew:

brew install ollama

On Linux: Run the one-line installer in your terminal:

curl -fsSL https://ollama.com/install.sh | sh

After installation, verify it works by opening a terminal and typing:

ollama --version

You should see a version number like ollama version 0.6.x. If you get a "command not found" error on Windows, restart your terminal or reboot.

Tip: Ollama runs a local API server on port 11434 by default. This becomes important later when connecting other tools to your local models.
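If you want to confirm the background server is actually listening, a few lines of Python will do it. This is a hypothetical helper, not part of Ollama; it simply opens the root URL, which the Ollama server answers with a short status message.

```python
import urllib.request
import urllib.error

def ollama_running(base_url: str = "http://localhost:11434") -> bool:
    """Return True if something answers HTTP on Ollama's default port."""
    try:
        with urllib.request.urlopen(base_url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

print(ollama_running())  # prints True only if the Ollama service is up
```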

Step 2: Download and Run Your First Model with Ollama

With Ollama installed, pulling and running a model is a single command:

ollama run llama3.1

This downloads the Llama 3.1 8B model (about 4.7 GB in its default quantized form) and immediately starts an interactive chat session. The first run will take several minutes depending on your internet speed. Subsequent launches are instant since the model is cached locally.

Try asking it something:

>>> Explain how a transformer neural network works in simple terms

>>> Write a Python function that reads a CSV file and returns the top 5 rows by a given column

>>> Summarize the pros and cons of microservices architecture

To exit the chat, type /bye or press Ctrl+D.

Other popular models to try:

ollama run mistral        # Mistral 7B - fast and capable
ollama run codellama      # Optimized for code generation
ollama run gemma2         # Google's Gemma 2 9B
ollama run phi3           # Microsoft's Phi-3 - small but smart
ollama run llama3.1:70b   # 70B version (needs serious hardware)

Tip: Run ollama list to see all downloaded models and their sizes. Run ollama rm model-name to delete models you no longer need and reclaim disk space.
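If you script around Ollama, the tabular output of ollama list can be parsed in a few lines. The sketch below assumes the whitespace-separated NAME / ID / SIZE / MODIFIED layout that current releases print; check your version's output before relying on it.

```python
def parse_ollama_list(output: str) -> dict[str, str]:
    """Parse `ollama list` output into {model_name: size_string}."""
    models = {}
    for line in output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split()
        if len(parts) >= 4:
            name, size, unit = parts[0], parts[2], parts[3]
            models[name] = f"{size} {unit}"
    return models

# Example with captured output; in practice feed it the stdout of
# subprocess.run(["ollama", "list"], capture_output=True, text=True)
sample = """NAME            ID            SIZE    MODIFIED
llama3.1:latest 42182419e950  4.7 GB  2 days ago
mistral:latest  2ae6f6dd7a3d  4.1 GB  5 days ago"""
print(parse_ollama_list(sample))
```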

Step 3: Install LM Studio (The GUI Option)

If you prefer a visual interface over the command line, LM Studio is your tool. It provides a polished desktop application for browsing, downloading, and chatting with local models.

Download LM Studio from the official website. It’s available for Windows, macOS, and Linux. The installer is straightforward—run it and follow the prompts.

When you first open LM Studio, you’ll see a clean interface with a model search bar at the top. The application auto-detects your hardware (CPU, GPU, RAM) and shows compatibility indicators next to each model.

Tip: LM Studio uses the GGUF model format. It connects to Hugging Face’s model repository, so you have access to thousands of community-uploaded models. Stick to models from verified uploaders for safety.

Step 4: Download and Chat with a Model in LM Studio

  • Click the search icon (magnifying glass) in the left sidebar
  • Search for “Llama 3.1 8B” or “Mistral 7B”
  • You’ll see multiple versions—look for Q4_K_M quantization as a balanced starting point (good quality, reasonable size)
  • Click the download button next to your chosen model
  • Once downloaded, click the chat icon in the left sidebar
  • Select your model from the dropdown at the top
  • Start chatting in the text field at the bottom

LM Studio shows real-time performance stats: tokens per second, memory usage, and GPU utilization. This helps you gauge whether a model is running efficiently on your hardware.

Performance tuning in LM Studio: In the right sidebar, you can adjust parameters like:

  • GPU Offload Layers: Higher = more GPU usage = faster. Set to maximum if your VRAM allows.
  • Context Length: Default is usually 4096 tokens. Increase for longer conversations, but this uses more memory.
  • Temperature: Lower (0.1–0.4) for factual/coding tasks, higher (0.7–1.0) for creative writing.
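Temperature is just a rescaling of the model's output scores before sampling. The toy example below (plain Python, no model involved) shows why a low temperature makes output more deterministic: it sharpens the probability distribution so the top token dominates.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities, sharpened or flattened by temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                        # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]                   # fake scores for three candidate tokens
low = softmax_with_temperature(logits, 0.2)
high = softmax_with_temperature(logits, 1.0)
print([round(p, 3) for p in low])          # top token takes almost all the mass
print([round(p, 3) for p in high])         # probability spread across tokens
```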

Step 5: Use the Local API for Development

Both Ollama and LM Studio expose an OpenAI-compatible API, which means any tool or script built for the OpenAI API can work with your local models with minimal changes.

Ollama API (runs automatically on port 11434):

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1",
  "messages": [{"role": "user", "content": "What is quantum computing?"}]
}'

Using with Python (via the openai library):

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama
    api_key="not-needed"  # Local models don't need API keys
)

response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Write a haiku about programming"}]
)

print(response.choices[0].message.content)

LM Studio API: Go to the "Local Server" tab in LM Studio, load a model, and click "Start Server." It runs on port 1234 by default. Replace the base_url above with http://localhost:1234/v1.

Tip: This OpenAI-compatible API is what makes local models truly powerful. You can connect them to VS Code extensions like Continue, use them with LangChain or LlamaIndex, or build your own applications—all running locally.

Step 6: Choose the Right Model for Your Use Case

Not all models are created equal. Here’s a practical guide based on what you actually want to do:

| Use Case | Best Model Choice | Why |
|---|---|---|
| General Q&A / Chat | Llama 3.1 8B | Best overall quality at 8B size |
| Code generation | CodeLlama 13B or DeepSeek Coder V2 | Trained specifically on code |
| Creative writing | Mistral 7B or Llama 3.1 8B | Good instruction following, natural prose |
| Summarization | Gemma 2 9B | Strong at understanding and condensing text |
| Math / Reasoning | Phi-3 Medium (14B) or Qwen 2.5 | Strong logical reasoning capabilities |
| Multilingual tasks | Qwen 2.5 or Llama 3.1 | Trained on diverse language data |
| Low-resource machines | Phi-3 Mini (3.8B) or Gemma 2 2B | Surprisingly capable for their size |
About quantization: Models come in different quantization levels (Q2, Q3, Q4, Q5, Q6, Q8, F16). Lower numbers mean a smaller file and faster inference, but lower output quality. Q4_K_M is the sweet spot for most users: roughly a third the size of the full-precision (F16) model with minimal quality loss. Only go lower than Q4 if you're severely RAM-constrained.
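Back-of-the-envelope file sizes follow from bits per weight: size ≈ parameters × bits ÷ 8. The bits-per-weight figures below are approximations I've chosen for illustration (K-quants carry metadata overhead, so effective bits exceed the nominal number); treat the results as estimates, not exact download sizes.

```python
BITS_PER_WEIGHT = {  # approximate effective bits per weight, incl. overhead
    "Q2_K": 2.6,
    "Q4_K_M": 4.8,
    "Q8_0": 8.5,
    "F16": 16.0,
}

def model_size_gb(params_billion: float, quant: str) -> float:
    """Estimate on-disk size in GB for a parameter count and quantization."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billion * 1e9 * bits / 8 / 1e9  # weights -> bytes -> GB

for quant in ("Q2_K", "Q4_K_M", "Q8_0", "F16"):
    print(f"8B model at {quant}: {model_size_gb(8, quant):.1f} GB")
```

This lines up with the ~4.7 GB download you saw for Llama 3.1 8B in its default quantization.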

Step 7: Set Up a Local AI Coding Assistant

One of the most practical uses of local AI is as a coding assistant inside your editor. Here’s how to set it up with VS Code:

  • Install the Continue extension from the VS Code marketplace
  • Open Continue’s settings (click the gear icon in the Continue sidebar)
  • Configure it to use your local Ollama instance:

{
  "models": [{
    "title": "Local Llama 3.1",
    "provider": "ollama",
    "model": "llama3.1"
  }],
  "tabAutocompleteModel": {
    "title": "Local Autocomplete",
    "provider": "ollama",
    "model": "codellama:7b-code"
  }
}

Now you have AI code completion and chat running entirely on your machine. Highlight code, press Ctrl+L, and ask questions about it. Use Ctrl+I to generate code inline.

Tip: For autocomplete specifically, use a smaller, faster model like codellama:7b-code or starcoder2:3b. Autocomplete needs to be fast (under 500ms), so smaller models work better here than larger ones.
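That 500 ms budget translates directly into how many tokens a completion can contain. A quick calculation with illustrative speeds (the numbers are assumptions, not benchmarks):

```python
def max_tokens_in_budget(tokens_per_second: float, budget_ms: float) -> int:
    """How many tokens fit in a latency budget at a given generation speed."""
    return int(tokens_per_second * budget_ms / 1000)

# A small 3B model at ~60 tok/s vs a 13B model at ~15 tok/s on the same GPU
print(max_tokens_in_budget(60, 500))   # room for a useful completion
print(max_tokens_in_budget(15, 500))   # too few tokens for good autocomplete
```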

Step 8: Create Custom Model Configurations

Ollama lets you create custom model configurations using a Modelfile—think of it as a Dockerfile for AI models:

# Save as Modelfile
FROM llama3.1

# Set system prompt
SYSTEM You are a senior Python developer. You write clean, well-documented code following PEP 8. You always include type hints and docstrings.

# Adjust parameters
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 8192

Create the custom model:

ollama create python-expert -f Modelfile

Now run it:

ollama run python-expert

This gives you a specialized model pre-configured for your specific use case. You can create as many custom models as you want—a writing assistant, a code reviewer, a data analyst—all based on the same underlying model but with different system prompts and parameters.
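If you maintain several such assistants, generating Modelfiles from a template keeps them consistent. A minimal sketch, assuming the FROM/SYSTEM/PARAMETER directives shown above (the helper name and defaults are mine, not part of Ollama):

```python
def make_modelfile(base: str, system_prompt: str, temperature: float = 0.3,
                   num_ctx: int = 8192) -> str:
    """Render a Modelfile string to write to disk for `ollama create`."""
    return (
        f"FROM {base}\n"
        f'SYSTEM """{system_prompt}"""\n'
        f"PARAMETER temperature {temperature}\n"
        f"PARAMETER num_ctx {num_ctx}\n"
    )

modelfile = make_modelfile(
    "llama3.1",
    "You are a meticulous code reviewer. Point out bugs before style issues.",
)
print(modelfile)
# Write it to a file named Modelfile, then:
#   ollama create code-reviewer -f Modelfile
```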

Common Mistakes and How to Avoid Them

1. Downloading Models That Are Too Large for Your Hardware

A 70B model sounds impressive, but if you only have 16 GB RAM, it will either crash or run at 0.5 tokens per second (unusably slow). Instead: Start with a 7B or 8B model. Check that the quantized model file size is less than 75% of your available RAM. Scale up only after confirming smooth performance.
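The 75% rule above is easy to encode as a pre-download check. A hypothetical helper (the name and threshold are mine, taken from the rule of thumb in this section):

```python
def fits_comfortably(model_file_gb: float, available_ram_gb: float) -> bool:
    """True if the quantized model file is under ~75% of available RAM,
    leaving headroom for the OS, context cache, and other processes."""
    return model_file_gb <= 0.75 * available_ram_gb

print(fits_comfortably(4.7, 16))    # Llama 3.1 8B Q4 on a 16 GB machine
print(fits_comfortably(40.0, 16))   # a 70B model on the same machine
```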

2. Ignoring GPU Offloading

Running a model entirely on CPU when you have a capable GPU is leaving massive performance on the table. A model that generates 5 tokens/second on CPU might hit 40+ tokens/second with GPU offloading. Instead: In Ollama, GPU offloading is automatic if CUDA drivers are installed. In LM Studio, manually increase the “GPU Offload Layers” slider to maximum. Watch VRAM usage—if it exceeds your GPU memory, reduce layers by 5–10 until stable.

3. Using the Wrong Quantization for Your Needs

Some people download the smallest quantization (Q2/Q3) thinking they’re saving resources wisely. But heavily quantized models produce noticeably worse output—garbled reasoning, factual errors, repetitive text. Instead: Use Q4_K_M as your baseline. If that doesn’t fit in memory, try a smaller model at Q4_K_M rather than a bigger model at Q2. A well-quantized 7B model outperforms a badly-quantized 13B model in practice.

4. Setting Context Length Too High

Cranking the context window to 32K or 128K because the model supports it will eat your RAM alive. Context length has a roughly linear relationship with memory usage. Instead: Use 4096 tokens for most conversations. Only increase context length when you actually need it (long document analysis, extended coding sessions). Monitor memory usage when you increase it.
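The linear relationship comes from the KV cache: every token kept in context stores a key and a value vector per layer. The sketch below uses Llama-8B-like architecture numbers (32 layers, 8 KV heads of dimension 128, 16-bit cache values), which are assumptions for illustration rather than exact figures for any specific build:

```python
def kv_cache_gb(context_tokens: int, layers: int = 32, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """Estimate KV-cache memory: 2 (key + value) * layers * heads * dim * tokens."""
    total_bytes = (2 * layers * kv_heads * head_dim
                   * context_tokens * bytes_per_value)
    return total_bytes / 1e9

print(f"{kv_cache_gb(4096):.2f} GB at 4K context")
print(f"{kv_cache_gb(32768):.2f} GB at 32K context")  # 8x the cache memory
```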

5. Expecting Cloud-Level Performance

Local 7B models are not GPT-4. They’re remarkably capable for their size, but they have limitations: they can hallucinate more, struggle with very complex reasoning chains, and may not know about events after their training cutoff. Instead: Use local models for tasks they excel at—drafting, code completion, summarization, brainstorming, data formatting. For tasks requiring the highest accuracy, consider a hybrid approach: use local models for drafts and cloud models for critical final reviews.

Frequently Asked Questions

Is running AI locally actually private? Can it still phone home?

Yes, local AI is genuinely private. Ollama and LM Studio do not send your prompts or responses anywhere. The models run entirely in your computer’s memory. You can verify this by disconnecting from the internet and confirming the model still works perfectly. The only network activity happens when you initially download the tool and models. For maximum security, you can even firewall the applications after installation to block all outbound connections.

How does local AI quality compare to ChatGPT or Claude?

It depends heavily on the model size and the task. A local Llama 3.1 8B model performs roughly on par with GPT-3.5 for most tasks and can even beat it on certain benchmarks. For coding, specialized models like DeepSeek Coder V2 are competitive with much larger cloud models. However, for complex multi-step reasoning, nuanced creative writing, or broad knowledge questions, GPT-4 and Claude still have a significant edge. The gap is closing rapidly—models released in early 2025 are dramatically better than those from a year earlier.

Can I run local AI on a laptop without a dedicated GPU?

Absolutely. CPU-only inference works fine for 7B models and smaller. Expect speeds of 3–10 tokens per second on a modern laptop CPU, which is readable but slower than cloud services. Apple Silicon MacBooks are particularly good at this because their unified memory architecture lets the GPU and CPU share the same memory pool efficiently. A MacBook Air M2 with 16 GB RAM can run 7B models at 15–20 tokens per second—faster than many desktop GPUs.

How much disk space do I need for local AI models?

A single 7B model in Q4 quantization takes about 4–5 GB. A 13B model takes 7–8 GB. A 70B model takes 35–40 GB. If you plan to keep multiple models downloaded, budget 50–100 GB of free space. Both Ollama and LM Studio store models in specific directories that you can change if your primary drive is small. Ollama stores models in ~/.ollama/models and you can change this with the OLLAMA_MODELS environment variable.

Can I fine-tune or train my own local model?

You can fine-tune existing models locally using tools like Unsloth, Axolotl, or the Hugging Face Transformers library. Fine-tuning adjusts a pre-trained model using your own data—for example, training it on your company’s coding standards or your writing style. This requires more technical knowledge and a GPU with at least 12 GB VRAM for QLoRA fine-tuning of 7B models. Full fine-tuning of larger models typically requires multiple high-end GPUs or cloud compute.

Summary and Next Steps

Key Takeaways

  • Ollama is best for developers and command-line users—fast setup, scriptable, great API compatibility
  • LM Studio is best for visual users who want a polished GUI experience with built-in model browsing
  • Start with 7B/8B models in Q4_K_M quantization—best quality-to-resource ratio for most hardware
  • GPU offloading is the single biggest performance lever—use it if you have a dedicated GPU
  • Local AI is truly private—no data leaves your machine once models are downloaded
  • The OpenAI-compatible API lets you plug local models into existing tools and workflows with minimal changes
  • Custom Modelfiles in Ollama let you create specialized assistants for different tasks

What to Do Next

  • Experiment with different models: Try at least 3 different models to find which one works best for your primary use case
  • Set up a coding assistant: Connect Ollama to VS Code via the Continue extension for local code completion
  • Build something: Use the local API to create a simple Python script that processes your documents through a local LLM
  • Explore RAG: Look into Retrieval-Augmented Generation (RAG) to let your local model answer questions about your own documents using tools like PrivateGPT or AnythingLLM
  • Join the community: The r/LocalLLaMA subreddit and Ollama Discord are excellent resources for staying up to date with new models and techniques
