
How to use LM Studio for local RAG

Quick answer
Use LM Studio alongside Ollama to run Retrieval-Augmented Generation (RAG) entirely on your machine: load your documents into a local vector store such as FAISS, retrieve the most relevant passages, and pass them as context to a locally served LLM. The whole workflow stays private and offline, with no external API calls.

Prerequisites

  • Python 3.8+
  • pip install ollama faiss-cpu langchain langchain_community
  • LM Studio installed and configured locally
  • Ollama CLI installed and configured
  • Local LLM model downloaded in LM Studio

Setup

Install the necessary Python packages and make sure LM Studio and the Ollama CLI are installed and configured on your machine. Download a local LLM model in LM Studio for offline use.

bash
pip install ollama faiss-cpu langchain langchain_community

Step by step

This example shows how to load documents, create embeddings, store them in a local FAISS vector store, and query a local LLM model via Ollama for RAG.

python
import os
import ollama
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.document_loaders import TextLoader

# Load documents from local text files
loader = TextLoader("./docs/sample.txt")
docs = loader.load()

# Create embeddings using a local Ollama model
# (the name must match a model you have pulled, e.g. `ollama pull llama2:7b`)
embeddings = OllamaEmbeddings(model="llama2:7b")

# Create FAISS vector store from documents
vectorstore = FAISS.from_documents(docs, embeddings)

# The question to answer over the loaded documents
query = "Explain the main idea of the document."

# Retrieve relevant docs
retrieved_docs = vectorstore.similarity_search(query, k=3)

# Prepare prompt with retrieved context
context = "\n\n".join([doc.page_content for doc in retrieved_docs])
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:" 

# Generate an answer with the local LLM via the Ollama Python SDK
response = ollama.chat(
    model="llama2:7b",
    messages=[{"role": "user", "content": prompt}]
)

# ollama.chat returns the reply under response['message']['content']
print("Answer:", response['message']['content'])
output
Answer: The main idea of the document is ...
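
TextLoader returns the whole file as a single document, so for anything longer than a page it helps to split the text into overlapping chunks before embedding; retrieval then surfaces passages instead of entire files. A minimal sliding-window chunker as a plain-Python sketch (LangChain's text splitters do the same job with smarter boundary handling):

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into fixed-size chunks that overlap by `overlap` characters."""
    chunks = []
    start = 0
    step = chunk_size - overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks

# Each chunk is embedded and stored individually, so
# similarity_search can return focused passages.
print(len(chunk_text("x" * 1200)))  # 3 chunks of at most 500 characters
```

The overlap keeps sentences that straddle a chunk boundary retrievable from at least one side.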

Common variations

  • Use a different local LLM by changing the model parameter to any model you have pulled.
  • Make calls asynchronously with the Ollama Python SDK to overlap multiple requests.
  • Use other vector stores, such as Chroma or Weaviate, for larger datasets.
python
import asyncio
import ollama

async def async_query():
    # The async API lives on ollama.AsyncClient (there is no ollama.chat.acreate)
    client = ollama.AsyncClient()
    # `prompt` is the context-plus-question string built in the step-by-step example
    response = await client.chat(
        model="llama2:7b",
        messages=[{"role": "user", "content": prompt}]
    )
    print("Async answer:", response['message']['content'])

asyncio.run(async_query())
output
Async answer: The main idea of the document is ...
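
Whichever store you choose, the retrieval step is the same idea underneath: rank the stored embeddings by similarity to the query embedding and keep the best k. A pure-Python sketch of top-k cosine retrieval, just to make the mechanics concrete (real stores index the vectors so the scan isn't linear):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def top_k(query_vec, doc_vecs, k=3):
    """Indices of the k stored vectors most similar to the query."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]

# Toy 2-d vectors standing in for real embeddings
stored = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
print(top_k([1.0, 0.0], stored, k=2))  # [0, 2]
```

This is exactly what `vectorstore.similarity_search(query, k=3)` does for you, plus indexing and the mapping back from vectors to document text.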

Troubleshooting

  • If you see connection errors, verify that the Ollama server (and LM Studio, if you use it) is running locally.
  • Ensure your local LLM model is downloaded and accessible by LM Studio.
  • For embedding errors, confirm the OllamaEmbeddings model name matches your local setup.
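
A quick way to rule out the first failure mode is to probe the server directly. Ollama's HTTP API listens on port 11434 by default (adjust the URL if you changed it, or point it at LM Studio's local server port instead); this stdlib-only check returns True when something answers there:

```python
import urllib.request
import urllib.error

def server_running(url="http://localhost:11434", timeout=2):
    """Return True if a local LLM server answers at `url`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

print("Server reachable:", server_running())
```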

Key Takeaways

  • Use LM Studio with Ollama for fully local, private RAG workflows without external API calls.
  • Combine local vector stores like FAISS with Ollama embeddings for efficient document retrieval.
  • Customize your RAG pipeline by swapping local LLM models and vector stores as needed.
Verified 2026-04 · llama2-7b