How to · Intermediate · 4 min read

How to build RAG with Vertex AI

Quick answer
Build Retrieval-Augmented Generation (RAG) with Vertex AI by first embedding your documents (here with OpenAIEmbeddings, though Vertex AI's own embedding models also work), indexing the embeddings in a vector store such as FAISS, then querying the store at question time to retrieve relevant context and passing that context to a Vertex AI Gemini model for generation. The vertexai Python SDK ties the embedding, retrieval, and generation steps together.

PREREQUISITES

  • Python 3.8+
  • Google Cloud project with Vertex AI enabled
  • Service account with Vertex AI permissions
  • pip install google-cloud-aiplatform langchain langchain-community langchain-openai faiss-cpu
  • Set the GOOGLE_APPLICATION_CREDENTIALS environment variable for authentication (and OPENAI_API_KEY for the OpenAI embedding calls)

Setup

Install the required Python packages and set up authentication for Google Vertex AI. Enable the Vertex AI API in your Google Cloud project and point the GOOGLE_APPLICATION_CREDENTIALS environment variable at your service account JSON key file. Because this example embeds with OpenAI, set OPENAI_API_KEY as well.

bash
pip install google-cloud-aiplatform langchain langchain-community langchain-openai faiss-cpu
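Before running anything that hits the API, it helps to fail fast on misconfigured credentials. The helper below is a hypothetical convenience (check_credentials is not part of the Vertex AI SDK; the Google auth libraries read the variable themselves):

```python
import os

def check_credentials(env=os.environ):
    """Return the service-account key path, raising early with a clear message."""
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path or not os.path.exists(path):
        raise RuntimeError(
            "Set GOOGLE_APPLICATION_CREDENTIALS to a service account JSON key file"
        )
    return path
```

Call it once at startup so authentication problems surface as one clear error instead of a deep stack trace on the first API call.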

Step by step

This example shows how to embed documents, create a FAISS vector store, query it with a user question, and generate an answer using Vertex AI's Gemini chat model.
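The retrieval step can be illustrated offline with a toy cosine-similarity search. The vectors below are made up for illustration; in the real pipeline they come from an embedding model, and FAISS performs the same ranking at scale:

```python
import math

# Toy "embeddings": real pipelines obtain these from an embedding model.
doc_vectors = {
    "Vertex AI is Google Cloud's managed ML platform.": [0.9, 0.1, 0.0],
    "FAISS is a popular vector search library.": [0.1, 0.8, 0.1],
}
query_vector = [0.85, 0.15, 0.0]  # pretend embedding of "What is Vertex AI?"

def cosine(a, b):
    """Cosine similarity: dot product over the product of the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Rank documents by similarity to the query, as similarity_search does below.
best = max(doc_vectors, key=lambda d: cosine(doc_vectors[d], query_vector))
print(best)
```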

python
import os

import vertexai
from vertexai.generative_models import GenerativeModel
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.schema import Document

# Initialize Vertex AI with your project and location
vertexai.init(project=os.environ['GOOGLE_CLOUD_PROJECT'], location='us-central1')

# Sample documents to index
texts = [
    "Vertex AI is Google Cloud's managed ML platform.",
    "RAG combines retrieval with generation for better answers.",
    "FAISS is a popular vector search library.",
    "Gemini models provide powerful chat capabilities."
]

# Create LangChain Document objects
docs = [Document(page_content=text) for text in texts]

# Embed with OpenAI (requires OPENAI_API_KEY); VertexAIEmbeddings from the
# langchain-google-vertexai package is a drop-in alternative
embeddings = OpenAIEmbeddings()

# Build FAISS vector store from documents
vector_store = FAISS.from_documents(docs, embeddings)

# User query
query = "What is Vertex AI?"

# Retrieve top 2 relevant documents
retrieved_docs = vector_store.similarity_search(query, k=2)

# Combine retrieved context
context = "\n".join([doc.page_content for doc in retrieved_docs])

# Prepare prompt with context
prompt = f"Use the following context to answer the question:\n{context}\nQuestion: {query}\nAnswer:" 

# Load Vertex AI Gemini generative model
model = GenerativeModel('gemini-2.0-flash')

# Generate answer; length limits go in generation_config, not as a keyword
response = model.generate_content(prompt, generation_config={"max_output_tokens": 256})

print("Answer:", response.text.strip())
output
Answer: Vertex AI is Google Cloud's managed ML platform that enables building, deploying, and scaling machine learning models.
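The string assembly above generalizes into a small helper once you have more than one call site. build_rag_prompt is a name invented here, not an SDK function:

```python
def build_rag_prompt(context_docs, question):
    """Join retrieved passages and frame the question, mirroring the prompt above."""
    context = "\n".join(context_docs)
    return (
        "Use the following context to answer the question:\n"
        f"{context}\nQuestion: {question}\nAnswer:"
    )

prompt = build_rag_prompt(
    ["Vertex AI is Google Cloud's managed ML platform."],
    "What is Vertex AI?",
)
print(prompt)
```

Keeping prompt construction in one place makes it easy to tweak the template (citations, formatting instructions, token budgets) without touching the retrieval code.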

Common variations

  • Use GenerativeModel.start_chat() for chat-based RAG that carries conversational context across turns.
  • Replace OpenAIEmbeddings with VertexAIEmbeddings (from the langchain-google-vertexai package) for tighter integration and a single set of credentials.
  • Use asynchronous calls with asyncio and await for scalable applications.
  • Switch vector stores to Chroma or cloud-hosted vector DBs for large-scale retrieval.
python
import asyncio
import os

import vertexai
from vertexai.generative_models import GenerativeModel
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.schema import Document

async def rag_async():
    vertexai.init(project=os.environ['GOOGLE_CLOUD_PROJECT'], location='us-central1')

    texts = ["Vertex AI is a managed ML platform.", "RAG improves generation with retrieval."]
    docs = [Document(page_content=t) for t in texts]
    embeddings = OpenAIEmbeddings()
    vector_store = FAISS.from_documents(docs, embeddings)

    query = "Explain RAG"
    retrieved_docs = vector_store.similarity_search(query, k=1)
    context = "\n".join([doc.page_content for doc in retrieved_docs])

    model = GenerativeModel('gemini-2.0-flash')

    prompt = f"Context:\n{context}\nQuestion: {query}\nAnswer:"

    # Async generation (generate_content blocks; use generate_content_async)
    response = await model.generate_content_async(
        prompt, generation_config={"max_output_tokens": 256}
    )
    print("Async answer:", response.text.strip())

asyncio.run(rag_async())
output
Async answer: RAG stands for Retrieval-Augmented Generation, a technique that improves language model responses by retrieving relevant documents to provide context.

Troubleshooting

  • If you get authentication errors, verify GOOGLE_APPLICATION_CREDENTIALS points to a valid service account JSON with Vertex AI permissions.
  • If embeddings or vector search return no results, confirm the documents were actually indexed and that the same embedding model is used for indexing and querying — mixing models produces meaningless similarity scores.
  • For quota or API errors, ensure your Google Cloud project has Vertex AI enabled and billing is active.
  • Model names such as gemini-2.0-flash change over time; check the Vertex AI model documentation for current names.

Key Takeaways

  • Use Vertex AI's Gemini models combined with vector search for effective RAG implementations.
  • Leverage LangChain and FAISS for easy document embedding and retrieval integration.
  • Always authenticate with a service account and set environment variables correctly for Vertex AI.
  • Consider async and chat-based models for scalable and conversational RAG applications.
Verified 2026-04 · gemini-2.0-flash, OpenAIEmbeddings, FAISS