How-to · Beginner · 3 min read

How to run embeddings locally

Quick answer
To run embeddings locally, use an open-source library such as sentence-transformers, which provides pre-trained models for generating vector embeddings without sending your data to an external API. Install the library via pip and load a model such as all-MiniLM-L6-v2 to encode text into embeddings on your machine (the model weights are downloaded once; after that, everything runs offline).

PREREQUISITES

  • Python 3.8+
  • pip install sentence-transformers
  • Basic Python programming knowledge

Setup

Install the sentence-transformers library, which provides easy access to many pre-trained embedding models that run locally without API calls.

Run the following command in your terminal:

bash
pip install sentence-transformers

Step by step

Here is a complete Python example to generate embeddings locally using sentence-transformers. It loads the all-MiniLM-L6-v2 model and encodes a list of texts into vectors.

python
from sentence_transformers import SentenceTransformer

# Load the pre-trained embedding model locally
model = SentenceTransformer('all-MiniLM-L6-v2')

# Sample texts to embed
texts = [
    'OpenAI develops advanced AI models.',
    'Running embeddings locally improves privacy.',
    'Sentence-transformers is an open-source library.'
]

# Generate embeddings
embeddings = model.encode(texts)

# Print the shape and first embedding vector
print(f'Generated {len(embeddings)} embeddings with dimension {len(embeddings[0])}')
print('First embedding vector:', embeddings[0])

output
Generated 3 embeddings with dimension 384
First embedding vector: [ 0.01234567 -0.02345678 ... 0.03456789]
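Once you have embeddings, the most common next step is comparing them. Below is a minimal sketch of cosine similarity using NumPy; the toy 3-dimensional vectors are stand-ins for real model output (embeddings from all-MiniLM-L6-v2 would be 384-dimensional):

python
import numpy as np

# Toy stand-ins for model embeddings (real ones would be 384-dim)
a = np.array([0.1, 0.3, 0.9])
b = np.array([0.2, 0.25, 0.85])

# Cosine similarity: dot product of the L2-normalized vectors
def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(f'similarity: {cosine_similarity(a, b):.3f}')
print(f'self-similarity: {cosine_similarity(a, a):.3f}')  # always 1.000

sentence-transformers also ships a built-in helper, util.cos_sim, which computes the same quantity over whole batches of vectors at once.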

Common variations

For large inputs, encode() already processes texts in batches; tune its batch_size argument to trade memory for throughput. Other models such as all-mpnet-base-v2 offer higher accuracy at the cost of a larger download and slower inference. For GPU acceleration, install torch with CUDA support and load the model on the GPU.

python
import torch
from sentence_transformers import SentenceTransformer

# Load model on GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = SentenceTransformer('all-MiniLM-L6-v2', device=device)

texts = ['Example text for GPU embedding']
embeddings = model.encode(texts)
print('Embedding on device:', device)
print(embeddings[0])

output
Embedding on device: cuda   (prints cpu instead if no CUDA GPU is available)
[ 0.01234567 -0.02345678 ... 0.03456789]
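When embedding a very large corpus, you may also want to chunk the input yourself, for example to stream texts from disk or checkpoint progress between batches. A minimal chunking helper in plain Python; the commented line marks where the model.encode() call from the examples above would go (a placeholder keeps this sketch runnable on its own):

python
def chunked(items, size):
    """Yield successive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

texts = [f'document {i}' for i in range(10)]

all_embeddings = []
for batch in chunked(texts, 4):
    # In a real pipeline: all_embeddings.extend(model.encode(batch))
    all_embeddings.extend(batch)  # placeholder so the sketch runs standalone

print(f'{len(list(chunked(texts, 4)))} batches, {len(all_embeddings)} items total')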

Troubleshooting

  • If you get ImportError, ensure sentence-transformers is installed correctly.
  • If embeddings are slow, check if you can enable GPU acceleration by installing torch with CUDA.
  • For memory errors, reduce the batch_size argument to encode() or switch to a lighter model such as all-MiniLM-L6-v2.
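For the ImportError case, you can check which packages are actually installed without importing anything heavy. A small diagnostic sketch using the standard library's importlib (the package names checked are just the two this guide relies on):

python
import importlib.util

# Check for required packages without importing them
for pkg in ('sentence_transformers', 'torch'):
    spec = importlib.util.find_spec(pkg)
    status = 'installed' if spec else 'missing (pip install ' + pkg.replace('_', '-') + ')'
    print(f'{pkg}: {status}')
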

Key Takeaways

  • Use the open-source sentence-transformers library to run embeddings locally without API calls.
  • Choose models like all-MiniLM-L6-v2 for a good balance of speed and accuracy.
  • Enable GPU support by installing torch with CUDA and loading the model on the GPU device.
  • Local embeddings improve privacy and reduce dependency on external services.
  • Troubleshoot by verifying installations and adjusting batch sizes or model choice.
Verified 2026-04 · all-MiniLM-L6-v2, all-mpnet-base-v2