How to set up Vertex AI in Python
Direct answer
Use the vertexai Python SDK to initialize your Google Cloud project and location, then create and call a GenerativeModel instance for text generation.
Setup
Install
pip install google-cloud-aiplatform
The google-cloud-aiplatform package provides the vertexai module used below.
Env vars
GOOGLE_CLOUD_PROJECT
GOOGLE_APPLICATION_CREDENTIALS
Imports
import os
import vertexai
from vertexai.generative_models import GenerativeModel
Examples
In: Explain quantum computing
Out: Quantum computing is a type of computation that uses quantum bits to perform operations exponentially faster than classical computers for certain problems.
In: Summarize the benefits of AI in healthcare
Out: AI in healthcare improves diagnostics, personalizes treatment, automates administrative tasks, and accelerates drug discovery.
In: (empty prompt)
Out: Error: Input prompt cannot be empty.
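The empty-prompt error above is cheap to catch on the client side before paying for a failed API call. A minimal sketch, assuming the model object created in the full code below; the generate helper name is illustrative, not part of the SDK:
def generate(model, prompt: str) -> str:
    # Reject empty or whitespace-only prompts before hitting the API
    if not prompt or not prompt.strip():
        raise ValueError("Input prompt cannot be empty.")
    response = model.generate_content(prompt)
    return response.text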
Integration steps
- Install the vertexai SDK and set environment variables for authentication.
- Initialize the vertexai client with your Google Cloud project and location.
- Create a GenerativeModel instance with the desired model name (e.g., 'gemini-2.0-flash').
- Call the model's generate_content() method with your input prompt.
- Extract the generated text from the response object.
Full code
import os
import vertexai
from vertexai.generative_models import GenerativeModel
# Set your Google Cloud project and location
project_id = os.environ["GOOGLE_CLOUD_PROJECT"]
location = "us-central1"
# Initialize the Vertex AI SDK
vertexai.init(project=project_id, location=location)
# Load the Gemini 2.0 Flash model
model = GenerativeModel("gemini-2.0-flash")
# Define the prompt
prompt = "Explain quantum computing"
# Generate content
response = model.generate_content(prompt)
# Print the generated text
print(response.text)
API trace
Request
{"model": "gemini-2.0-flash", "prompt": "Explain quantum computing"}
Response
{"text": "Quantum computing is a type of computation that uses quantum bits...", "metadata": {...}}
Extract
response.text
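Beyond response.text, the response object exposes metadata that is useful for logging and cost tracking. A short sketch, assuming the response object from the full code above; attribute names such as usage_metadata follow the current vertexai SDK and may differ between versions:
# Inspect the response beyond the plain generated text
print(response.text)                                   # generated text
print(response.usage_metadata.prompt_token_count)      # tokens in the prompt
print(response.usage_metadata.candidates_token_count)  # tokens generated
print(response.candidates[0].finish_reason)            # why generation stopped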
Variants
Streaming response
Use streaming to display partial results immediately for long or interactive responses.
import os
import vertexai
from vertexai.generative_models import GenerativeModel
project_id = os.environ["GOOGLE_CLOUD_PROJECT"]
location = "us-central1"
vertexai.init(project=project_id, location=location)
model = GenerativeModel("gemini-2.0-flash")
prompt = "Explain quantum computing"
# Stream the generated content
for chunk in model.generate_content(prompt, stream=True):
    print(chunk.text, end="", flush=True)
Async version
Use async calls to integrate Vertex AI generation into asynchronous Python applications.
import os
import asyncio
import vertexai
from vertexai.generative_models import GenerativeModel
async def main():
    project_id = os.environ["GOOGLE_CLOUD_PROJECT"]
    location = "us-central1"
    vertexai.init(project=project_id, location=location)
    model = GenerativeModel("gemini-2.0-flash")
    prompt = "Explain quantum computing"
    response = await model.generate_content_async(prompt)
    print(response.text)
asyncio.run(main())
Alternative model: Gemini 2.5 Pro
Use Gemini 2.5 Pro for higher quality or more complex generation tasks.
import os
import vertexai
from vertexai.generative_models import GenerativeModel
project_id = os.environ["GOOGLE_CLOUD_PROJECT"]
location = "us-central1"
vertexai.init(project=project_id, location=location)
model = GenerativeModel("gemini-2.5-pro")
prompt = "Explain quantum computing"
response = model.generate_content(prompt)
print(response.text)
Performance
Latency: ~800ms for gemini-2.0-flash non-streaming calls
Cost: ~$0.003 per 500 tokens generated with gemini-2.0-flash
Rate limits: default quota of 600 requests per minute (RPM) per project
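Latency varies with region, prompt size, and model, so it is worth measuring in your own environment. A minimal timing sketch, reusing the model and prompt from the full code above:
import time

start = time.perf_counter()
response = model.generate_content("Explain quantum computing")
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"Generated {len(response.text)} characters in {elapsed_ms:.0f} ms")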
- Use concise prompts to reduce token usage.
- Leverage streaming to start processing output early.
- Cache frequent queries to avoid repeated calls (see the sketch after this list).
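For the caching tip, a simple in-memory cache keyed on the prompt string avoids repeated calls for identical queries. This is an illustrative sketch, not an SDK feature; it assumes the model object from the full code and a single-process application (production setups would more likely use a size-bounded or external cache):
from functools import lru_cache

@lru_cache(maxsize=128)
def cached_generate(prompt: str) -> str:
    # Identical prompts are answered from memory after the first API call
    return model.generate_content(prompt).text

print(cached_generate("Explain quantum computing"))  # calls the API
print(cached_generate("Explain quantum computing"))  # served from the cache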
| Approach | Latency | Cost/call | Best for |
|---|---|---|---|
| Standard generate_content() | ~800ms | ~$0.003 | Simple synchronous calls |
| Streaming generate_content(stream=True) | Starts immediately | ~$0.003 | Long or interactive outputs |
| Async generate_content_async() | ~800ms | ~$0.003 | Async Python apps |
Quick tip
Always initialize vertexai with your project and location before creating model instances.
Common mistake
Forgetting to set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of your service account JSON key file.
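A quick startup check surfaces this problem before the first API call fails with an authentication error. A minimal sketch, assuming service-account authentication via the two environment variables listed in Setup (if you authenticate with gcloud application-default credentials instead, GOOGLE_APPLICATION_CREDENTIALS may not be needed):
import os

# Fail fast if authentication is not configured
for var in ("GOOGLE_CLOUD_PROJECT", "GOOGLE_APPLICATION_CREDENTIALS"):
    if not os.environ.get(var):
        raise RuntimeError(f"Set {var} before calling vertexai.init().")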