How-to · Beginner · 3 min read

How to run AI locally on your computer

Quick answer
To run AI locally, use open-source tools like llama.cpp or Ollama, which run large language models (LLMs) entirely on your machine with no internet connection needed at inference time. Install the dependencies, download model weights, and run inference from the command line or from Python.

PREREQUISITES

  • Python 3.8+
  • Git
  • Basic command line knowledge
  • Sufficient disk space (10+ GB for models)
  • NumPy (pip install numpy), only needed if you use Python bindings

Setup

Install llama.cpp or Ollama to run AI locally. For llama.cpp, clone the repository and build it from source. For Ollama, download the app from the official site. Make sure Python 3.8+ is installed if you plan to run the Python examples below.

bash
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make
pip install numpy
output
Cloning into 'llama.cpp'...
[...]
[compiler output as the project builds]
Requirement already satisfied: numpy in ./venv/lib/python3.8/site-packages (1.24.0)

Step by step

Run a local LLM using llama.cpp with a quantized model. Download a compatible model file, then run inference with the CLI or Python bindings.

python
# Python example driving the llama.cpp CLI via subprocess
import subprocess

# Path to quantized model file
model_path = "./models/ggml-model-q4_0.bin"

# Run llama.cpp inference via subprocess
result = subprocess.run([
    "./main", "-m", model_path, "-p", "Hello, AI locally!"
], capture_output=True, text=True)

print(result.stdout)
output
Hello, AI locally! [generated text continues]
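The subprocess call above can be wrapped in a small helper that fails early when the model file is missing, which makes the "model file not found" case much easier to diagnose. This is a minimal sketch: the run_llama name and the default binary path are illustrative assumptions matching this article's examples, not part of llama.cpp itself.

```python
import os
import subprocess


def run_llama(model_path, prompt, binary="./main"):
    """Run llama.cpp inference, failing early if the model file is missing.

    `binary` defaults to ./main as in the example above; adjust to your build.
    """
    # Check the model path up front so a typo gives a clear error,
    # not a cryptic message from the llama.cpp binary.
    if not os.path.isfile(model_path):
        raise FileNotFoundError(f"Model not found: {model_path}")
    cmd = [binary, "-m", model_path, "-p", prompt]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.stdout
```

Calling run_llama with a bad path now raises FileNotFoundError immediately instead of surfacing an opaque subprocess failure.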

Common variations

  • Use Ollama app for an easy GUI-based local AI experience.
  • Run llama.cpp asynchronously with Python asyncio for integration in apps.
  • Try different models such as Llama 2 or GPT4All with compatible local runtimes.

python
import asyncio

async def run_llama_async(prompt):
    proc = await asyncio.create_subprocess_exec(
        './main', '-m', './models/ggml-model-q4_0.bin', '-p', prompt,
        stdout=asyncio.subprocess.PIPE
    )
    stdout, _ = await proc.communicate()
    print(stdout.decode())

asyncio.run(run_llama_async('Async local AI test'))
output
Async local AI test [generated text continues]
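If you use the Ollama app instead, it exposes a local HTTP API (by default at http://localhost:11434) that you can call from Python with only the standard library. The sketch below assumes Ollama is installed, running, and has a model pulled (e.g. via ollama pull); build_payload and generate are illustrative helper names, not part of Ollama.

```python
import json
import urllib.request

# Ollama's default local endpoint for one-shot text generation.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(model, prompt):
    # stream=False asks for a single JSON object instead of streamed chunks.
    return {"model": model, "prompt": prompt, "stream": False}


def generate(model, prompt):
    """Send a prompt to a locally running Ollama server and return its reply."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(model, prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Usage: generate("llama2", "Hello, AI locally!") returns the model's text once the server responds; no subprocess management is needed.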

Troubleshooting

  • If you see "model file not found", verify the model path and download the correct quantized model.
  • For "missing dependencies" errors, ensure you have built llama.cpp and installed Python packages.
  • If performance is slow, try smaller or quantized models and check CPU/GPU compatibility.
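The checks behind these bullets can be automated in a small preflight script that runs before inference: is the binary built and executable, does the model file exist, and are the Python packages importable. preflight is a hypothetical helper; the default paths match this article's examples.

```python
import importlib.util
import os


def preflight(binary="./main",
              model="./models/ggml-model-q4_0.bin",
              packages=("numpy",)):
    """Return a list of human-readable setup problems (empty means ready)."""
    problems = []
    # Binary must exist and be executable, i.e. the build step succeeded.
    if not (os.path.isfile(binary) and os.access(binary, os.X_OK)):
        problems.append(f"binary missing or not executable: {binary} (did you run make?)")
    # Model file must be present at the expected path.
    if not os.path.isfile(model):
        problems.append(f"model file not found: {model}")
    # Each required Python package must be importable.
    for pkg in packages:
        if importlib.util.find_spec(pkg) is None:
            problems.append(f"missing Python package: {pkg}")
    return problems
```

Run preflight() and print the returned list before your first inference; an empty list means the common failure points above are covered.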

Key Takeaways

  • Use open-source tools like llama.cpp or Ollama to run AI locally without internet.
  • Download compatible quantized models to reduce resource usage and improve speed.
  • Run inference via CLI or Python bindings for flexible integration.
  • Async execution enables better app responsiveness when running local AI.
  • Check model paths and dependencies carefully to avoid common setup errors.
Verified 2026-04 · llama.cpp, Ollama, llama-2, GPT4All