AI for plagiarism detection

Quick answer
Detect plagiarism in two stages: use an embedding model with vector search to find text that is semantically close to a reference, then ask a large language model (LLM) such as gpt-4o-mini to judge whether the candidate is a paraphrase. Together these catch both copied and closely reworded content efficiently.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0"

Setup

Install the openai Python package and set your API key as an environment variable for secure access.

bash
pip install "openai>=1.0"
output
Collecting openai
  Downloading openai-1.x.x-py3-none-any.whl (xx kB)
Installing collected packages: openai
Successfully installed openai-1.x.x
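
Before creating the client, it can help to fail fast when the key is missing. The following is a minimal sketch, assuming the key is stored in OPENAI_API_KEY as in the rest of this guide. The helper name require_api_key is hypothetical and not part of the openai package.

```python
import os


def require_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key from the environment, or raise a clear error.

    `env` defaults to os.environ; passing a plain dict makes this easy to test.
    (Hypothetical helper for illustration, not an openai API.)
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the examples.")
    return key
```

It could then be used as OpenAI(api_key=require_api_key()) in place of reading os.environ directly.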

Step by step

This example checks whether an input text plagiarizes a reference text. It first embeds both texts and measures their semantic similarity; only if the similarity reaches a threshold does it prompt gpt-4o-mini to decide whether the input is a paraphrase.

python
import math
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Reference text to compare against
reference_text = "Artificial intelligence is the simulation of human intelligence processes by machines."

# Text to check for plagiarism
input_text = "AI involves machines mimicking human intelligence processes."

# Step 1: Get embeddings for both texts
embedding_model = "text-embedding-3-small"

ref_embedding_resp = client.embeddings.create(model=embedding_model, input=reference_text)
input_embedding_resp = client.embeddings.create(model=embedding_model, input=input_text)

ref_embedding = ref_embedding_resp.data[0].embedding
input_embedding = input_embedding_resp.data[0].embedding

# Step 2: Compute cosine similarity
def cosine_similarity(vec1, vec2):
    dot = sum(a*b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a*a for a in vec1))
    norm2 = math.sqrt(sum(b*b for b in vec2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # avoid dividing by zero for an all-zero vector
    return dot / (norm1 * norm2)

similarity = cosine_similarity(ref_embedding, input_embedding)

# Step 3: Use GPT to detect paraphrasing if similarity is high
threshold = 0.85
if similarity >= threshold:
    prompt = f"Determine if the following text is a paraphrase of the reference.\nReference: {reference_text}\nInput: {input_text}\nAnswer with 'Yes' or 'No' and a brief explanation."
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    result = response.choices[0].message.content
else:
    result = "Texts are not similar enough to be considered plagiarized."

print(f"Cosine similarity: {similarity:.3f}")
print("Plagiarism check result:", result)
output
Cosine similarity: 0.912
Plagiarism check result: Yes. The input text is a paraphrase of the reference, conveying the same meaning with different wording.
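
The prompt above asks the model to answer with 'Yes' or 'No' plus an explanation, so downstream code usually needs to turn that free text into a boolean. The following is a rough sketch of one way to do that; parse_verdict is a hypothetical helper, and it assumes the model puts the verdict at the very start of its reply, which is not guaranteed.

```python
import re


def parse_verdict(answer):
    """Map a model reply to True (paraphrase), False (not), or None if unclear.

    Assumes the reply begins with 'Yes' or 'No', optionally quoted.
    """
    match = re.match(r"\s*['\"]?(yes|no)\b", answer, flags=re.IGNORECASE)
    if match is None:
        return None  # the model did not follow the requested format
    return match.group(1).lower() == "yes"
```

Returning None for unparseable replies lets the caller retry or flag the pair for manual review instead of silently treating it as a negative.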

Common variations

  • Use the AsyncOpenAI client with asyncio for non-blocking requests; its methods have the same names as the synchronous ones (for example, await client.chat.completions.create()).
  • Try different embedding models like text-embedding-3-large for higher accuracy.
  • Integrate vector databases like FAISS or Chroma for large-scale plagiarism detection across many documents.
python
import asyncio
import math
import os

from openai import AsyncOpenAI


def cosine_similarity(vec1, vec2):
    dot = sum(a*b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a*a for a in vec1))
    norm2 = math.sqrt(sum(b*b for b in vec2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # avoid dividing by zero for an all-zero vector
    return dot / (norm1 * norm2)


async def async_plagiarism_check():
    # AsyncOpenAI mirrors the OpenAI client, but its methods are awaitable
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
    reference_text = "Artificial intelligence is the simulation of human intelligence processes by machines."
    input_text = "AI involves machines mimicking human intelligence processes."

    # Request both embeddings concurrently instead of one after the other
    embedding_model = "text-embedding-3-small"
    ref_embedding_resp, input_embedding_resp = await asyncio.gather(
        client.embeddings.create(model=embedding_model, input=reference_text),
        client.embeddings.create(model=embedding_model, input=input_text),
    )

    ref_embedding = ref_embedding_resp.data[0].embedding
    input_embedding = input_embedding_resp.data[0].embedding

    similarity = cosine_similarity(ref_embedding, input_embedding)

    # Only call the chat model when the texts are semantically close
    if similarity >= 0.85:
        prompt = f"Determine if the following text is a paraphrase of the reference.\nReference: {reference_text}\nInput: {input_text}\nAnswer with 'Yes' or 'No' and a brief explanation."
        response = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}]
        )
        result = response.choices[0].message.content
    else:
        result = "Texts are not similar enough to be considered plagiarized."

    print(f"Cosine similarity: {similarity:.3f}")
    print("Plagiarism check result:", result)

asyncio.run(async_plagiarism_check())
output
Cosine similarity: 0.912
Plagiarism check result: Yes. The input text is a paraphrase of the reference, conveying the same meaning with different wording.
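
For the many-documents variation listed above, the core operation is a nearest-neighbor search over stored embeddings. The sketch below shows that idea with a plain in-memory dictionary and a linear scan; it is a stand-in for illustration, not FAISS or Chroma usage, and the names top_matches and corpus are hypothetical. The toy two-dimensional vectors in the usage note are placeholders for real embeddings.

```python
import math


def cosine_similarity(vec1, vec2):
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


def top_matches(query_vec, corpus, k=3, threshold=0.85):
    """Return up to k (doc_id, score) pairs with score >= threshold, best first.

    corpus maps a document id to its precomputed embedding.
    This is a linear scan; a vector database replaces it at scale.
    """
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in corpus.items()]
    hits = [pair for pair in scored if pair[1] >= threshold]
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return hits[:k]
```

For example, with corpus = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 0.1]}, calling top_matches([1.0, 0.0], corpus, k=2) returns documents "a" and "c" in that order and drops "b". Only the returned candidates would then need the more expensive LLM paraphrase check.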

Troubleshooting

  • If you get 401 Unauthorized, verify your OPENAI_API_KEY environment variable is set correctly.
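  • Low similarity scores for texts you know are paraphrases may mean the 0.85 threshold is too strict for your embedding model; lower the threshold using known examples, or try a larger embedding model.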
  • For large document sets, use vector databases to avoid performance bottlenecks.
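
If scores for known paraphrases fall below the 0.85 cutoff used in the examples, one option is to calibrate the cutoff on a small set of labeled pairs. The following is a minimal sketch under that assumption; best_threshold is a hypothetical helper, it optimizes plain accuracy (other metrics may suit your data better), and any scores you feed it should come from your own embedding model.

```python
def best_threshold(labeled_scores):
    """Pick the similarity cutoff that classifies the labeled pairs most accurately.

    labeled_scores: list of (similarity, is_plagiarized) pairs.
    Returns (threshold, accuracy); ties keep the lowest threshold.
    """
    candidates = sorted({score for score, _ in labeled_scores})
    best, best_acc = None, -1.0
    for t in candidates:
        # A pair is predicted "plagiarized" when its score reaches the cutoff
        correct = sum((score >= t) == label for score, label in labeled_scores)
        acc = correct / len(labeled_scores)
        if acc > best_acc:
            best, best_acc = t, acc
    return best, best_acc
```

The returned threshold can replace the hard-coded 0.85 before the LLM paraphrase check runs.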

Key Takeaways

  • Use embedding models to measure semantic similarity for plagiarism detection.
  • Combine embeddings with LLM prompts to detect paraphrasing effectively.
  • Async API calls improve throughput in large-scale or real-time systems by letting many requests run concurrently.
  • Vector databases scale plagiarism detection across many documents.
  • Always secure your API key via environment variables.
Verified 2026-04 · gpt-4o-mini, text-embedding-3-small, text-embedding-3-large