
How to use LM Studio for local RAG

Quick answer
Use LM Studio alongside Ollama to run Retrieval-Augmented Generation (RAG) entirely on your machine: load your documents into a local vector store such as FAISS, retrieve the most relevant passages, and pass them as context to a locally served LLM. The whole workflow stays private and offline, with no external API calls.

Prerequisites

  • Python 3.8+
  • pip install ollama faiss-cpu langchain langchain_community
  • LM Studio installed and configured locally
  • Ollama CLI installed and configured
  • Local LLM model downloaded in LM Studio

Setup

Install the necessary Python packages and make sure LM Studio and the Ollama CLI are installed and configured on your machine. Download a local LLM model in LM Studio for offline use.

bash
pip install ollama faiss-cpu langchain langchain_community

Step by step

This example shows how to load documents, create embeddings, store them in a local FAISS vector store, and query a local LLM model via Ollama for RAG.

python
import os
import ollama
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.document_loaders import TextLoader

# Load documents from local text files
loader = TextLoader("./docs/sample.txt")
docs = loader.load()

# Create embeddings using a local Ollama model
# (the name must match a model you have pulled, e.g. `ollama pull llama2:7b`)
embeddings = OllamaEmbeddings(model="llama2:7b")

# Create FAISS vector store from documents
vectorstore = FAISS.from_documents(docs, embeddings)

# The question to answer over the loaded documents
query = "Explain the main idea of the document."

# Retrieve relevant docs
retrieved_docs = vectorstore.similarity_search(query, k=3)

# Prepare prompt with retrieved context
context = "\n\n".join([doc.page_content for doc in retrieved_docs])
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:" 

# Generate an answer with the local LLM via the Ollama Python SDK
response = ollama.chat(
    model="llama2:7b",
    messages=[{"role": "user", "content": prompt}]
)

# ollama.chat returns the reply under response['message']['content']
print("Answer:", response['message']['content'])
output
Answer: The main idea of the document is ...
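
TextLoader returns the whole file as a single document, so for anything longer than a page it helps to split the text into overlapping chunks before embedding; retrieval then surfaces passages instead of entire files. A minimal sliding-window chunker as a plain-Python sketch (LangChain's text splitters do the same job with smarter boundary handling):

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into fixed-size chunks that overlap by `overlap` characters."""
    chunks = []
    start = 0
    step = chunk_size - overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks

# Each chunk is embedded and stored individually, so
# similarity_search can return focused passages.
print(len(chunk_text("x" * 1200)))  # 3 chunks of at most 500 characters
```

The overlap keeps sentences that straddle a chunk boundary retrievable from at least one side.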

Common variations

  • Use a different local LLM by changing the model parameter to any model you have pulled.
  • Make calls asynchronously with the Ollama Python SDK to overlap multiple requests.
  • Use other vector stores, such as Chroma or Weaviate, for larger datasets.
python
import asyncio
import ollama

async def async_query():
    # The async API lives on ollama.AsyncClient (there is no ollama.chat.acreate)
    client = ollama.AsyncClient()
    # `prompt` is the context-plus-question string built in the step-by-step example
    response = await client.chat(
        model="llama2:7b",
        messages=[{"role": "user", "content": prompt}]
    )
    print("Async answer:", response['message']['content'])

asyncio.run(async_query())
output
Async answer: The main idea of the document is ...
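
Whichever store you choose, the retrieval step is the same idea underneath: rank the stored embeddings by similarity to the query embedding and keep the best k. A pure-Python sketch of top-k cosine retrieval, just to make the mechanics concrete (real stores index the vectors so the scan isn't linear):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def top_k(query_vec, doc_vecs, k=3):
    """Indices of the k stored vectors most similar to the query."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]

# Toy 2-d vectors standing in for real embeddings
stored = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
print(top_k([1.0, 0.0], stored, k=2))  # [0, 2]
```

This is exactly what `vectorstore.similarity_search(query, k=3)` does for you, plus indexing and the mapping back from vectors to document text.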

Troubleshooting

  • If you see connection errors, verify that the Ollama server (and LM Studio, if you use it) is running locally.
  • Ensure your local LLM model is downloaded and accessible by LM Studio.
  • For embedding errors, confirm the OllamaEmbeddings model name matches your local setup.
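
A quick way to rule out the first failure mode is to probe the server directly. Ollama's HTTP API listens on port 11434 by default (adjust the URL if you changed it, or point it at LM Studio's local server port instead); this stdlib-only check returns True when something answers there:

```python
import urllib.request
import urllib.error

def server_running(url="http://localhost:11434", timeout=2):
    """Return True if a local LLM server answers at `url`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

print("Server reachable:", server_running())
```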

Key Takeaways

  • Use LM Studio with Ollama for fully local, private RAG workflows without external API calls.
  • Combine local vector stores like FAISS with Ollama embeddings for efficient document retrieval.
  • Customize your RAG pipeline by swapping local LLM models and vector stores as needed.
Verified 2026-04 · llama2-7b