How to translate entire documents with AI
Quick answer
Use the OpenAI Python SDK to read your document, split it into manageable chunks, and send each chunk to a translation model such as gpt-4o. Combine the translated chunks to reconstruct the full document.
Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install openai>=1.0
Setup
Install the openai Python package and set your API key as an environment variable for secure access.
pip install openai
output
Collecting openai
  Downloading openai-1.x.x-py3-none-any.whl (xx kB)
Installing collected packages: openai
Successfully installed openai-1.x.x
Step by step
This example reads a text document, splits it into chunks, translates each chunk using gpt-4o, and combines the results.
import os
from openai import OpenAI

# Initialize client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Split text into chunks (word count is used as a rough proxy for tokens)
def chunk_text(text, max_tokens=1000):
    words = text.split()
    chunks = []
    current_chunk = []
    current_len = 0
    for word in words:
        current_chunk.append(word)
        current_len += 1
        if current_len >= max_tokens:
            chunks.append(" ".join(current_chunk))
            current_chunk = []
            current_len = 0
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks

# Load your document
with open("document.txt", "r", encoding="utf-8") as f:
    text = f.read()

chunks = chunk_text(text, max_tokens=500)  # Adjust chunk size as needed

translated_chunks = []
for i, chunk in enumerate(chunks):
    prompt = f"Translate the following text to French:\n\n{chunk}"
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    translated_text = response.choices[0].message.content
    translated_chunks.append(translated_text)
    print(f"Chunk {i+1}/{len(chunks)} translated.")

# Combine translated chunks
full_translation = "\n\n".join(translated_chunks)

# Save to file
with open("translated_document.txt", "w", encoding="utf-8") as f:
    f.write(full_translation)

print("Translation complete. Saved to translated_document.txt")
output
Chunk 1/3 translated.
Chunk 2/3 translated.
Chunk 3/3 translated.
Translation complete. Saved to translated_document.txt
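The chunk_text function above splits on a raw word count, which can cut a sentence or paragraph in half and degrade translation quality at chunk boundaries. One alternative is to pack whole paragraphs into each chunk instead; a minimal sketch (the chunk_by_paragraph name and the word budget are illustrative, and word count remains only a rough proxy for tokens):

```python
def chunk_by_paragraph(text, max_words=500):
    """Pack whole paragraphs into chunks of at most max_words words.

    A paragraph longer than max_words becomes its own chunk, so no
    paragraph is ever split in the middle.
    """
    chunks = []
    current = []        # paragraphs in the chunk being built
    current_words = 0
    for para in text.split("\n\n"):
        n = len(para.split())
        # Flush the current chunk before this paragraph would overflow it
        if current and current_words + n > max_words:
            chunks.append("\n\n".join(current))
            current = []
            current_words = 0
        current.append(para)
        current_words += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

# Example: three short paragraphs packed under a tight 6-word budget
doc = "One two three.\n\nFour five six.\n\nSeven eight nine ten."
print(chunk_by_paragraph(doc, max_words=6))
# → ['One two three.\n\nFour five six.', 'Seven eight nine ten.']
```

Because chunk boundaries now fall between paragraphs, the model never sees half a sentence, which tends to produce more coherent translations at the seams.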
Common variations
- Use asynchronous calls with asyncio and the AsyncOpenAI client's chat.completions.create for faster batch translation.
- Switch to other models such as claude-3-5-sonnet-20241022 or gemini-2.5-pro (via their own SDKs) for different translation styles or languages.
- Implement streaming to process large documents chunk-by-chunk with partial outputs.
import asyncio
import os
from openai import AsyncOpenAI

# The async client is required here: the synchronous OpenAI client's
# methods are not awaitable.
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def translate_chunk(chunk):
    prompt = f"Translate the following text to Spanish:\n\n{chunk}"
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

async def main():
    with open("document.txt", "r", encoding="utf-8") as f:
        text = f.read()
    chunks = text.split("\n\n")  # Simple split by paragraphs
    tasks = [translate_chunk(chunk) for chunk in chunks]
    translated_chunks = await asyncio.gather(*tasks)
    full_translation = "\n\n".join(translated_chunks)
    with open("translated_document_async.txt", "w", encoding="utf-8") as f:
        f.write(full_translation)
    print("Async translation complete.")

if __name__ == "__main__":
    asyncio.run(main())
output
Async translation complete.
Troubleshooting
- If you hit token limits, reduce chunk size or use models with larger context windows.
- For rate limit errors, add retry logic with exponential backoff.
- Ensure your document encoding is UTF-8 to avoid decoding errors.
- If translations are inaccurate, try adding more detailed instructions in the prompt.
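The retry advice above can be implemented as a small wrapper around the API call. A minimal sketch of exponential backoff (the with_backoff name, delay values, and the broad except clause are illustrative; with the OpenAI SDK you would typically catch openai.RateLimitError specifically):

```python
import time

def with_backoff(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on failure with exponentially growing delays.

    Delays follow base_delay * 2**attempt: 1s, 2s, 4s, ... by default.
    The sleep parameter is injectable so the logic is easy to test.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the original error
            sleep(base_delay * 2 ** attempt)

# Usage sketch: wrap the translation call from the main example
# translated = with_backoff(lambda: client.chat.completions.create(...))
```

Doubling the delay after each failure gives the rate limiter time to reset while keeping the first retry fast; the final re-raise ensures genuine errors are not silently swallowed.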
Key Takeaways
- Split large documents into smaller chunks to avoid token limits during translation.
- Use the OpenAI Python SDK with gpt-4o or other advanced models for high-quality translations.
- Async calls can speed up batch document translation significantly.
- Adjust prompts to specify target language and style for better translation accuracy.
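Pinning down target language, style, and formatting in one reusable helper keeps prompts consistent across chunks. A minimal sketch (the build_translation_prompt name and the exact instruction wording are illustrative, not a canonical prompt):

```python
def build_translation_prompt(text, target_lang, style=None):
    """Compose a translation prompt that fixes language, style, and formatting."""
    lines = [f"Translate the following text to {target_lang}."]
    if style:
        lines.append(f"Use a {style} style.")
    # Asking the model to preserve structure helps chunks rejoin cleanly
    lines.append("Preserve paragraph breaks and any markup exactly.")
    return "\n".join(lines) + f"\n\n{text}"

print(build_translation_prompt("Hello, world.", "French", style="formal"))
```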