Training data format for Gemini
Why this matters
Developers migrating from OpenAI's fine-tuning often assume that uploading data to Gemini adapts the model in an equivalent way. Understanding what Gemini actually does with your data prevents wasted development effort and helps you choose the right pattern (file attachment vs. few-shot examples vs. system prompts) for your use case.
Explanation
The workflow covered here does not fine-tune anything: neither the File API nor prompting updates model weights. Supervised tuning of Gemini models is offered separately through Vertex AI and is outside the scope of this section. Instead, the Gemini API gives you two mechanisms for supplying your own data: the File API, for uploading documents that you can reference from later requests, and in-context learning (few-shot examples in the system instruction or prompt). The File API accepts MIME types including text/plain and application/pdf, as well as image, audio, and video types. Uploaded files are stored server-side for 48 hours and can be referenced in multiple API calls, in the same session or a later one, as long as you retain the file name and the file has not expired.
Under the hood, the File API is storage, not a retrieval index: uploading a file does not chunk, embed, or semantically search it. When you pass the file to generate_content, its content is supplied to the model as part of the prompt, so the whole file counts toward the context window and the prompt token count. The model doesn't learn from this data; it reads it at inference time. This is fundamentally different from fine-tuning, where model parameters are updated. To steer model behavior, put few-shot examples in the system_instruction field or in the user prompt. These examples are sent with each request rather than stored in the model, and they must fit within the context window.
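As a minimal sketch of the few-shot approach, the helper below renders input/output pairs into a single instruction string. The function names, the "Input:/Output:" layout, and the ticket-classification task are illustrative assumptions, not a format Gemini requires. The SDK import is deferred so the formatting helper runs without the library installed.

```python
from typing import List, Tuple


def build_few_shot_instruction(task: str, examples: List[Tuple[str, str]]) -> str:
    """Render a task description plus input/output pairs as one instruction string."""
    lines = [task.strip(), "", "Examples:"]
    for user_text, ideal_reply in examples:
        lines.append(f"Input: {user_text}")
        lines.append(f"Output: {ideal_reply}")
        lines.append("")  # blank line between examples
    return "\n".join(lines).rstrip()


def make_model(instruction: str, model_name: str = "gemini-2.0-flash"):
    """Create a model that sends the few-shot instruction with every request.

    Assumes genai.configure() has already been called elsewhere.
    """
    import google.generativeai as genai  # imported lazily; not needed for formatting
    return genai.GenerativeModel(model_name, system_instruction=instruction)


# Hypothetical task and examples, used only to show the rendered layout.
instruction = build_few_shot_instruction(
    "Classify each support ticket as 'billing' or 'technical'.",
    [
        ("I was charged twice this month.", "billing"),
        ("The app crashes when I open settings.", "technical"),
    ],
)
print(instruction)
```

Calling make_model(instruction) would then give a model object whose generate_content requests all carry these examples; nothing about the model itself changes.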
Use the File API when your reference data is large (Google's guidance is to use it whenever the total request would exceed 20 MB), when it is a media file, or when you want to reuse it across multiple prompts within 48 hours without re-uploading. Use in-context examples or inline text when your data is small or changes per request. Never expect either mechanism to adapt Gemini's weights: from your point of view the model is frozen, and your data influences only the prompt context, not the model's parameters.
Request code
import os
from pathlib import Path

import google.generativeai as genai

# Read the API key from the environment rather than hard-coding it.
genai.configure(api_key=os.environ['GOOGLE_API_KEY'])

# Create a small sample document to upload.
file_path = Path('sample_doc.txt')
with open(file_path, 'w') as f:
    f.write('Acme Corp raised $50M in Series B. Founded 2020. Headquarters: San Francisco.\nKey products: Widget X, Widget Y.')

# Upload the document; the returned File object carries its name and URI.
upload_response = genai.upload_file(
    path=file_path,
    mime_type='text/plain'
)
print(f'File uploaded with URI: {upload_response.uri}')
print(f'File name: {upload_response.name}')

# Pass the uploaded file alongside the question in a single request.
model = genai.GenerativeModel('gemini-2.0-flash')
response = model.generate_content([
    'Based on the uploaded document, what was Acme Corp\'s funding amount?',
    upload_response
])
print(f'Model response: {response.text}')

# Delete the file explicitly instead of waiting for it to expire.
genai.delete_file(name=upload_response.name)
print('File deleted.')

Authentication
Set your Google API key in the environment: export GOOGLE_API_KEY='your-key-here'. The example reads it from os.environ and passes it to genai.configure(). Note that configure() only stores the key; it makes no network call, so it cannot verify access. To confirm the key works, make a small API call, such as iterating over genai.list_files(), before uploading files.
Response shape
| Field | Description |
|---|---|
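| upload_response | File object returned by genai.upload_file(); the fields discussed in this section are name, uri, and expiration_time. |
| generate_response | Response returned by model.generate_content(); the fields discussed in this section are text and usage_metadata. |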
Field guide
name: Use this to reference the file in subsequent API calls, such as genai.get_file() or genai.delete_file(), within 48 hours. Storing this value in your database lets other processes reuse the file during that window, but it does not extend the file's lifetime; after 48 hours you must re-upload.
expiration_time: Files are deleted automatically 48 hours after upload. The File API documentation describes no way to extend this, so do not rely on an update call to keep a file alive; re-upload instead.
usage_metadata: Track prompt_token_count carefully: the full content of every attached file is counted as prompt tokens on each request. As a rough guide, 100KB of English text is about 25K tokens at roughly 4 characters per token; call model.count_tokens() for an exact figure.
uri: This is a storage identifier, not a download URL. You cannot share it with users or embed it in client code; the URI is tied to your API key's permissions.
Setup trap
genai.configure() does not contact the server, so it succeeds even when the key is missing a character or has been revoked. The failure surfaces only on the first real API call, which may be an upload deep inside your pipeline. Always validate with a small API call immediately after configure(), for example by iterating over genai.list_files(), to catch credential issues early.
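The early check described above can be sketched as a small helper. The name check_credentials and the fake callables are hypothetical and exist only so the sketch runs offline; the helper takes the listing function as a parameter and iterates its result, on the assumption that a lazy listing would not reach the server until its first item is requested.

```python
from typing import Callable, Iterable, Tuple


def check_credentials(list_files: Callable[[], Iterable]) -> Tuple[bool, str]:
    """Force one real API round trip and report whether it succeeded.

    `list_files` is any zero-argument callable that lists files,
    for example genai.list_files.
    """
    try:
        next(iter(list_files()), None)  # an empty listing still counts as success
    except Exception as exc:  # broad on purpose: exception types vary by SDK version
        return False, f"{type(exc).__name__}: {exc}"
    return True, ""


# Stand-ins for the real call so the sketch runs without network access.
def fake_ok():
    return iter([])


def fake_bad_key():
    raise PermissionError("API key not valid")


print(check_credentials(fake_ok))
print(check_credentials(fake_bad_key))
```

With the real library the call would be check_credentials(genai.list_files), placed right after genai.configure().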
Cost
File uploads and storage are free, but the tokens in an attached file are charged as input tokens. Because the whole file is supplied to the model, the cost depends on the file's size rather than on the query, and every request that references the file pays for it again. Measure the real figure with model.count_tokens() instead of guessing. Budget accordingly: a file that tokenizes to 50K tokens and is referenced in 20 requests per day consumes about 1M input tokens per day.
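A rough budgeting sketch, assuming the whole attached file is counted as input tokens on every request. The function names are illustrative, and the price per million tokens is a placeholder, not a published rate; substitute the current figure for your model.

```python
def daily_input_tokens(file_tokens: int, prompt_tokens: int, requests_per_day: int) -> int:
    """Input tokens per day when every request re-sends the same attached file."""
    return (file_tokens + prompt_tokens) * requests_per_day


def daily_cost_usd(tokens: int, usd_per_million_tokens: float) -> float:
    """Convert a token count into dollars at a given per-million-token price."""
    return tokens / 1_000_000 * usd_per_million_tokens


# Hypothetical workload: a 50K-token file, 200-token questions, 20 requests a day.
tokens = daily_input_tokens(file_tokens=50_000, prompt_tokens=200, requests_per_day=20)
cost = daily_cost_usd(tokens, usd_per_million_tokens=0.10)  # placeholder price
print(f"{tokens:,} input tokens/day, about ${cost:.4f}/day")
```

The file_tokens value is best taken from model.count_tokens() on the actual file rather than estimated from its size.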
Rate limits
File uploads are rate limited per API key; this guide assumes a ceiling of 100 files per 60 seconds, so check the current quota for your tier. If you bulk-upload reference documents, upload in batches of 5 with a 1-second pause between batches, and retry failed uploads with exponential backoff. Stored files also count against a per-project storage quota (20 GB in total, 2 GB per file), so deleting files with delete_file once you are done with them frees room for new uploads.
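The batching and backoff advice above might look like the following sketch. The upload function is passed in as a parameter (with the real SDK it could wrap genai.upload_file), and sleep is injectable so the example runs instantly; upload_in_batches, flaky_upload, and the file names are hypothetical.

```python
import time
from typing import Callable, Dict, Iterable, List, TypeVar

T = TypeVar("T")


def upload_in_batches(
    paths: Iterable[str],
    upload: Callable[[str], T],
    batch_size: int = 5,
    pause_s: float = 1.0,
    max_retries: int = 4,
    sleep: Callable[[float], None] = time.sleep,
) -> List[T]:
    """Upload files in groups, pausing between groups and backing off on errors."""
    results: List[T] = []
    for index, path in enumerate(paths):
        if index and index % batch_size == 0:
            sleep(pause_s)  # fixed pause between batches
        delay = 1.0
        for attempt in range(max_retries + 1):
            try:
                results.append(upload(path))
                break
            except Exception:
                # Real code should catch only rate-limit errors here.
                if attempt == max_retries:
                    raise
                sleep(delay)
                delay *= 2  # exponential backoff: 1s, 2s, 4s, ...
    return results


# Fake uploader that fails once for one file, standing in for a rate-limit error.
calls: Dict[str, int] = {}


def flaky_upload(path: str) -> str:
    calls[path] = calls.get(path, 0) + 1
    if path == "b.txt" and calls[path] == 1:
        raise RuntimeError("429 rate limited")
    return f"files/{path}"


sleeps: List[float] = []
names = ["a.txt", "b.txt", "c.txt", "d.txt", "e.txt", "f.txt"]
uploaded = upload_in_batches(names, flaky_upload, sleep=sleeps.append)
print(uploaded)
print(sleeps)
```

Catching every exception keeps the sketch short; in practice a non-retryable failure such as an invalid key should be raised immediately rather than retried.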
Common gotcha
Developers often assume that file uploads persist indefinitely, or that a file uploaded in one session will still be available days later without re-upload. Files expire after 48 hours. If you need persistent reference data, either re-upload before use or keep the data in your own store (for example a vector database such as Pinecone, or Firestore) and send only the relevant pieces to Gemini. Additionally, an uploaded file is not always usable immediately: large media files, video in particular, remain in a PROCESSING state until server-side processing completes, and should be referenced only once their state is ACTIVE.
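One way to avoid referencing an expired file is to record the upload time next to the file name and check it before each use. The sketch below assumes the 48-hour lifetime stated above; needs_reupload and the 30-minute safety margin are illustrative choices.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

FILE_TTL = timedelta(hours=48)  # lifetime of an uploaded file, per this guide


def needs_reupload(
    uploaded_at: datetime,
    now: Optional[datetime] = None,
    safety_margin: timedelta = timedelta(minutes=30),
) -> bool:
    """Return True when the file has expired or is about to expire."""
    now = now or datetime.now(timezone.utc)
    return now >= uploaded_at + FILE_TTL - safety_margin


uploaded_at = datetime(2025, 1, 1, 0, 0, tzinfo=timezone.utc)
print(needs_reupload(uploaded_at, now=uploaded_at + timedelta(hours=47)))
print(needs_reupload(uploaded_at, now=uploaded_at + timedelta(hours=47, minutes=45)))
```

If you keep the File object returned by the upload, its expiration_time field can replace the uploaded_at plus FILE_TTL calculation.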
Error recovery
API key invalid (typically raised as google.api_core.exceptions.InvalidArgument with a 400 status)
File size exceeds limit (413 Payload Too Large)
File not found (404) on second session
Empty response (no candidates returned)
Token count unexpectedly high
Experienced dev note
Gemini's File API is a way to attach context, not a training interface: it is fundamentally different from OpenAI fine-tuning, and closer to placing a long document in the prompt. What matters most is not the 48-hour expiration or the file size limits, but the fact that token cost scales with what you attach to each request. Storage is free, yet every attached file is tokenized in full each time it is referenced, and a file larger than the context window cannot be used in a single request at all. Experienced teams optimize by: (1) breaking reference data into modular documents (one per entity/topic), (2) attaching only the documents a given query needs, (3) caching file names and URIs in Redis/Firestore with expiration timestamps, and (4) pre-computing embeddings of file summaries to decide which files to attach. This prevents 'token bloat', where a single broad request drags in hundreds of kilobytes of irrelevant context.
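Choosing files by summary embedding, as in the last optimization above, can be sketched with plain cosine similarity and a token budget. The three-dimensional vectors, file names, and token counts are toy values for illustration; real embeddings would come from an embedding model of your choice. The selection is greedy: it walks the files from most to least similar and keeps each one that still fits the budget.

```python
import math
from typing import List, Sequence, Tuple

FileEntry = Tuple[str, Sequence[float], int]  # (file name, summary vector, token count)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_files(query_vec: Sequence[float], files: List[FileEntry], token_budget: int) -> List[str]:
    """Pick the most relevant files whose combined token count fits the budget."""
    ranked = sorted(files, key=lambda entry: cosine(query_vec, entry[1]), reverse=True)
    chosen: List[str] = []
    used = 0
    for name, _vec, tokens in ranked:
        if used + tokens <= token_budget:
            chosen.append(name)
            used += tokens
    return chosen


# Toy catalog: hypothetical file names, made-up vectors and token counts.
catalog: List[FileEntry] = [
    ("files/widgets", [1.0, 0.0, 0.0], 40_000),
    ("files/gadgets", [0.0, 1.0, 0.0], 30_000),
    ("files/pricing", [0.7, 0.7, 0.0], 50_000),
]
print(select_files([0.9, 0.1, 0.0], catalog, token_budget=100_000))
```

Only the selected names would then be passed to generate_content, so the request carries the files most likely to matter instead of the whole set.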
Check your understanding
You have a 2GB product catalog and need to answer customer questions about product specifications. Why shouldn't you upload the entire catalog as a single file, and what's the token-cost implication if you do?
Answer hint
A referenced file is supplied to the model in full, so a single 2GB text file would amount to roughly 500M tokens at about 4 characters per token, far more than a context window of around 1M tokens; the request would fail rather than merely cost more. Even a file that does fit is billed in full on every query, however narrow the question. Split the catalog into per-category or per-product files, pick the few that match the question, and attach only those, so that each request carries a small fraction of the catalog.
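The size argument can be checked with back-of-the-envelope arithmetic. The sketch assumes about 4 characters per token, one byte per character, and a context window of 1,048,576 tokens; all three are approximations, and the helper names are illustrative.

```python
CHARS_PER_TOKEN = 4                # rough rule of thumb for English text (assumption)
CONTEXT_WINDOW_TOKENS = 1_048_576  # assumed window of a 1M-token model


def estimated_tokens(size_bytes: int) -> int:
    """Estimate tokens for plain text, assuming one byte per character."""
    return size_bytes // CHARS_PER_TOKEN


def fits_in_context(size_bytes: int, reserve_tokens: int = 8_192) -> bool:
    """Check whether a file leaves room for the question and the answer."""
    return estimated_tokens(size_bytes) + reserve_tokens <= CONTEXT_WINDOW_TOKENS


catalog_bytes = 2 * 1024**3   # the 2GB catalog from the question
category_bytes = 400 * 1024   # a hypothetical 400KB per-category file

print(estimated_tokens(catalog_bytes) // CONTEXT_WINDOW_TOKENS)  # how many windows the catalog spans
print(fits_in_context(catalog_bytes), fits_in_context(category_bytes))
```

Under these assumptions the full catalog spans hundreds of context windows, while a single category file fits with room to spare.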