What is semantic similarity?
Semantic similarity measures how close two pieces of text are in meaning. Texts are converted into embeddings that represent their semantic content as vectors, which enables AI systems to compare concepts beyond exact word matching by calculating distances or angles between these vectors.
How it works
Semantic similarity works by converting text into embeddings, which are dense numerical vectors capturing the meaning of the text. These vectors live in a high-dimensional space where similar meanings are close together. By calculating the distance or angle (e.g., cosine similarity) between two embedding vectors, AI models can quantify how semantically close the texts are, even if they use different words.
Think of it like mapping cities on a globe: two cities close together are similar in location, just like two sentences close in embedding space are similar in meaning.
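To make the geometry concrete, here is a minimal sketch using NumPy and made-up 3-dimensional vectors (real embedding models produce vectors with hundreds or thousands of dimensions). It shows that cosine similarity scores vectors pointing in nearly the same direction close to 1.0, while unrelated directions score lower:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-D "embeddings" (illustrative values only, not from a real model)
cat = np.array([0.9, 0.8, 0.1])
kitten = np.array([0.85, 0.75, 0.15])  # points in nearly the same direction as "cat"
car = np.array([0.1, 0.2, 0.9])        # points in a different direction

print(cosine_similarity(cat, kitten))  # close to 1.0
print(cosine_similarity(cat, car))     # noticeably smaller
```

The metric depends only on direction, not vector length, which is why it is the usual choice for comparing embeddings.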
Concrete example
Using OpenAI embeddings, you can compute semantic similarity between two sentences by comparing their embedding vectors with cosine similarity.
```python
from openai import OpenAI
import os
import numpy as np

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Get embeddings for two sentences
response1 = client.embeddings.create(model="text-embedding-3-small", input="I love machine learning.")
response2 = client.embeddings.create(model="text-embedding-3-small", input="Artificial intelligence is fascinating.")
vec1 = np.array(response1.data[0].embedding)
vec2 = np.array(response2.data[0].embedding)

# Compute cosine similarity
cos_sim = np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
print(f"Semantic similarity: {cos_sim:.4f}")
```

Output:

```
Semantic similarity: 0.8723
```
When to use it
Use semantic similarity when you need to compare texts or data based on meaning rather than exact wording. Common use cases include:
- Document search and retrieval where synonyms or paraphrases exist
- Clustering or grouping similar content
- Recommendation systems based on user preferences
- Detecting duplicate or near-duplicate content
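As a sketch of the near-duplicate case, the snippet below flags pairs of documents whose cosine similarity exceeds a chosen threshold. It uses toy NumPy vectors in place of real embeddings, and the 0.9 cutoff is an illustrative assumption to be tuned on real data, not a universal value:

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pretend these came from an embedding model (toy 4-D vectors for illustration)
docs = {
    "doc_a": np.array([0.9, 0.1, 0.3, 0.2]),
    "doc_b": np.array([0.88, 0.12, 0.31, 0.19]),  # near-duplicate of doc_a
    "doc_c": np.array([0.1, 0.9, 0.2, 0.4]),      # unrelated content
}

THRESHOLD = 0.9  # assumed cutoff; tune on real data

def find_near_duplicates(vectors, threshold=THRESHOLD):
    """Return every pair of names whose vectors meet the similarity threshold."""
    names = list(vectors)
    pairs = []
    for i, x in enumerate(names):
        for y in names[i + 1:]:
            if cosine_similarity(vectors[x], vectors[y]) >= threshold:
                pairs.append((x, y))
    return pairs

print(find_near_duplicates(docs))  # [('doc_a', 'doc_b')]
```

Note that this pairwise loop is quadratic in the number of documents; large collections typically use a vector index for approximate nearest-neighbor search instead.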
Do not rely on semantic similarity when exact string matching or strict syntax is required, such as password verification or code syntax validation.
Key terms
| Term | Definition |
|---|---|
| Semantic similarity | A measure of how close two texts are in meaning using vector representations. |
| Embeddings | Numerical vector representations of text capturing semantic information. |
| Cosine similarity | A metric measuring the cosine of the angle between two vectors, indicating similarity. |
| Vector space | A mathematical space where embeddings are positioned based on semantic features. |
Key takeaways
- Semantic similarity uses embeddings to compare meaning beyond exact words.
- Cosine similarity is a common method to quantify semantic closeness between vectors.
- Use semantic similarity for search, clustering, and recommendations based on meaning.
- Avoid semantic similarity for tasks requiring exact matches or syntax correctness.