How to use LM Studio for local RAG
Quick answer
Use LM Studio together with Ollama to run Retrieval-Augmented Generation (RAG) entirely on your machine: load your documents into a vector store such as FAISS and query a local LLM. This setup enables private, offline AI workflows with no external API calls.
Prerequisites
- Python 3.8+
- pip install ollama faiss-cpu langchain langchain_community
- LM Studio installed and configured locally
- Ollama CLI installed and configured
- A local LLM model downloaded in LM Studio
Setup
Install the necessary Python packages and ensure LM Studio and Ollama CLI are installed and configured on your machine. Download a local LLM model in LM Studio for offline use.
pip install ollama faiss-cpu langchain langchain_community
Step by step
This example shows how to load documents, create embeddings, store them in a local FAISS vector store, and query a local LLM model via Ollama for RAG.
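The loader below reads whole files; for longer documents you would normally split the text into overlapping chunks before embedding, so each chunk fits the model's context and retrieval stays precise. A minimal pure-Python sketch of that step (the `chunk_text` helper is our own, not a LangChain API; LangChain itself provides text splitters for this):

```python
def chunk_text(text, size=200, overlap=50):
    """Split text into overlapping character chunks for embedding."""
    chunks = []
    step = size - overlap
    for start in range(0, max(len(text) - overlap, 1), step):
        chunks.append(text[start:start + size])
    return chunks

# Each chunk shares `overlap` characters with its neighbor, so a sentence
# cut at a boundary is still fully contained in at least one chunk.
pieces = chunk_text("word " * 100, size=100, overlap=20)
print(len(pieces), len(pieces[0]))  # 6 chunks of up to 100 characters
```

You would then embed each chunk (rather than each whole file) before building the vector store.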
import os
import ollama
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.document_loaders import TextLoader
# Load documents from local text files
loader = TextLoader("./docs/sample.txt")
docs = loader.load()
# Create embeddings using Ollama local model
embeddings = OllamaEmbeddings(model="llama2")  # Ollama tags use ":" for sizes, e.g. "llama2:7b"
# Create FAISS vector store from documents
vectorstore = FAISS.from_documents(docs, embeddings)
# Query function for local RAG
query = "Explain the main idea of the document."
# Retrieve relevant docs
retrieved_docs = vectorstore.similarity_search(query, k=3)
# Prepare prompt with retrieved context
context = "\n\n".join([doc.page_content for doc in retrieved_docs])
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
# Generate answer using Ollama local LLM
response = ollama.chat(
    model="llama2",
    messages=[{"role": "user", "content": prompt}]
)
print("Answer:", response['message']['content'])
output
Answer: The main idea of the document is ...
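Under the hood, `similarity_search` simply ranks the stored vectors against the query embedding and keeps the top-k. A dependency-free sketch of that retrieve-then-prompt core, with hand-made toy vectors standing in for real embeddings (the texts and numbers here are illustrative, not output from any model):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy "embeddings": in the real pipeline these come from OllamaEmbeddings.
docs = {
    "Cats are small pets.": [1.0, 0.1, 0.0],
    "Python is a language.": [0.0, 1.0, 0.2],
    "Dogs are loyal pets.": [0.9, 0.2, 0.1],
}
query = "tell me about pets"
query_vec = [1.0, 0.0, 0.1]  # pretend embedding of the query

# similarity_search(query, k=2) boils down to: rank by similarity, keep top-k
ranked = sorted(docs, key=lambda d: cosine(docs[d], query_vec), reverse=True)
context = "\n\n".join(ranked[:2])
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
print(ranked[:2])  # the two pet sentences outrank the unrelated one
```

FAISS does the same ranking, just with optimized index structures instead of a Python sort.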
Common variations
- Use different local LLM models available in LM Studio by changing the model parameter.
- Implement async calls with the Ollama Python SDK for improved performance.
- Use other vector stores like Chroma or Weaviate for larger datasets.
import asyncio
import ollama

async def async_query():
    client = ollama.AsyncClient()
    response = await client.chat(
        model="llama2",
        messages=[{"role": "user", "content": prompt}]
    )
    print("Async answer:", response['message']['content'])

asyncio.run(async_query())
output
Async answer: The main idea of the document is ...
Troubleshooting
- If you see connection errors, verify LM Studio and Ollama CLI are running locally.
- Ensure your local LLM model is downloaded and accessible by LM Studio.
- For embedding errors, confirm the OllamaEmbeddings model name matches your local setup.
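A quick way to check the first point is to probe the Ollama server's default port (11434), which answers plain HTTP when the server is up. The helper below is our own sketch, not part of the Ollama SDK:

```python
import urllib.request
import urllib.error

def ollama_reachable(host="http://127.0.0.1:11434", timeout=2):
    """Return True if an Ollama server answers on `host`."""
    try:
        with urllib.request.urlopen(host, timeout=timeout) as resp:
            return resp.status == 200  # root endpoint replies when running
    except (urllib.error.URLError, OSError):
        return False

print("Ollama up:", ollama_reachable())
```

If this prints `False`, start the server (e.g. `ollama serve`) before rerunning the pipeline.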
Key Takeaways
- Use LM Studio with Ollama for fully local, private RAG workflows without external API calls.
- Combine local vector stores like FAISS with Ollama embeddings for efficient document retrieval.
- Customize your RAG pipeline by swapping local LLM models and vector stores as needed.