
Google Vertex AI Guide — Gemini, Model Garden & Production GenAI

Google’s position in the AI landscape is unusual: it runs one of the largest data analytics platforms in the world (BigQuery), maintains a dominant search and advertising business built on information retrieval at scale, and is a primary model provider (Gemini) competing directly with OpenAI and Anthropic.

Vertex AI is Google’s managed platform for running AI workloads, predating the LLM era by several years as a general ML platform. Since 2023, it has been significantly extended to serve as the production infrastructure layer for Gemini and other frontier models on Google Cloud — the equivalent of AWS Bedrock for GCP-first organizations.

For teams whose data lives in BigQuery, whose ETL runs in Dataflow, and whose infrastructure is managed in Google Cloud, the integration advantage of Vertex AI is substantial: Gemini can directly query BigQuery tables via grounding, Vertex AI Search can index Google Drive and Cloud Storage documents without an ETL pipeline, and identity management flows through Google Cloud IAM with the same policies that govern every other GCP resource.

The core differentiation of Vertex AI relative to Bedrock and Azure OpenAI is data platform depth. Gemini’s grounding capability — the ability to query a live data source at inference time and ground the response in retrieved results — has native integrations with Google Search (for web knowledge) and Vertex AI Search (for enterprise document corpora). For organizations already on the Google data stack, this is a meaningful reduction in integration work.

The second differentiation is BigQuery integration. Gemini can be invoked directly from BigQuery SQL using the ML.GENERATE_TEXT function, enabling AI-powered data transformation, classification, and enrichment within existing data pipelines without extracting data to a separate service.


A data engineering team at a media company uses BigQuery as their primary analytics store, Google Drive for document management, and Cloud Storage for raw files. They want to build a content analysis agent that classifies articles, extracts entities from documents, and answers questions about their content library.

The direct API approach requires: an embedding pipeline that reads from Google Drive and writes to a separate vector database, an API key for the Gemini API, custom integration code for BigQuery queries, and a separate logging infrastructure for AI invocations.

The Vertex AI approach: Vertex AI Search creates an index directly from a Google Drive data source with no ETL code. Gemini grounding queries that index at inference time. BigQuery ML can invoke Gemini within existing SQL queries for batch classification. All AI invocations are logged in Cloud Logging and auditable via Cloud Audit Logs.

A second context where Vertex AI has an architectural advantage: organizations that need to evaluate and fine-tune models as part of their production ML workflow. Vertex AI has native model evaluation pipelines (Vertex AI Evaluation), managed fine-tuning for Gemini via supervised fine-tuning and RLHF, and integration with Vertex AI Experiments for tracking evaluation runs. Teams that use Vertex AI for traditional ML (classification models, regression, forecasting) can extend the same pipeline infrastructure to LLM evaluation without introducing new tools.


Like Bedrock, Vertex AI is an umbrella for multiple distinct capabilities. Knowing which components are relevant to generative AI work and what each one is for is the foundational knowledge for working with the platform.

Vertex AI Gemini API (generative AI endpoint): The primary endpoint for Gemini model invocations — chat, completion, multimodal (text + image + audio + video), code generation, and function calling. Accessed via the Vertex AI SDK or REST API.

Model Garden: A catalog of foundation models from Google and third parties (Meta Llama, Mistral, Anthropic Claude via a separate marketplace) available for deployment on Vertex AI infrastructure. Some models are available for direct API use; others require deployment to a managed endpoint.

Vertex AI Search (formerly Enterprise Search): Managed search and RAG. Index documents from Cloud Storage, BigQuery, Google Drive, websites, and enterprise apps. Query via semantic search, keyword search, or hybrid. Supports grounding integration with the Gemini API.

Vertex AI Agent Builder: Managed agent orchestration. Define data stores (for RAG), tools (for function calling), and playbooks (instruction prompts). Vertex AI manages the agent loop. Similar in concept to Amazon Bedrock Agents.

Grounding: A first-class Gemini capability that retrieves information from Google Search (for general web knowledge) or a Vertex AI Search data store (for enterprise documents) before generating a response. Grounding reduces hallucination by anchoring the response to retrieved sources.

Vertex AI Evaluation: Automated evaluation of model outputs using metrics (BLEU, ROUGE, coherence, fluency, groundedness) and human-in-the-loop evaluation workflows. Integrates with Vertex AI Experiments for tracking evaluation results over time.

Vertex AI uses Google Cloud IAM for authentication. Service accounts with the roles/aiplatform.user role can invoke Gemini APIs. When running on GCP (GKE, Cloud Run, Cloud Functions), workload identity provides automatic credential management — no service account keys required. Authentication flows through google-auth or Application Default Credentials.

Gemini is available in multiple sizes optimized for different latency/cost/capability points:

  • Gemini 1.5 Pro: Highest capability, longest context window (up to 2 million tokens), best for complex reasoning
  • Gemini 1.5 Flash: Faster and cheaper, optimized for high-volume inference with good capability
  • Gemini 1.5 Flash-8B: Smallest and fastest, for latency-critical or very high-volume use cases
  • Gemini 2.0 Flash: The 2025 generation, released as experimental and later as stable, multimodal-native

Model selection follows the general principle: start with Flash for development and cost estimation, upgrade to Pro when capability gaps appear in evaluation.
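
One low-friction way to apply that principle is to make the model name a configuration value rather than a hard-coded constant, so the Flash-to-Pro upgrade is a config change rather than a code change. A minimal sketch (the GEMINI_MODEL environment variable is an illustrative name, not a Vertex AI convention):

import os

import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="your-project-id", location="us-central1")

# Default to Flash during development; promote to Pro via configuration once
# evaluation shows a capability gap (GEMINI_MODEL is a hypothetical variable).
model_name = os.environ.get("GEMINI_MODEL", "gemini-1.5-flash-001")
model = GenerativeModel(model_name)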


Terminal window
# Install the Vertex AI SDK
pip install google-cloud-aiplatform
# Set up Application Default Credentials
gcloud auth application-default login
# Or set a service account key file
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account-key.json"

For production on GCP, use workload identity federation — no key file required:

Terminal window
# Allow the GKE Kubernetes service account to impersonate the Vertex AI service account
gcloud iam service-accounts add-iam-policy-binding \
  VERTEX_SA@PROJECT.iam.gserviceaccount.com \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:PROJECT.svc.id.goog[NAMESPACE/KSA_NAME]"

Step 2: Invoke Gemini with Function Calling

import vertexai
from vertexai.generative_models import (
    GenerativeModel, GenerationConfig, Tool, FunctionDeclaration
)

vertexai.init(project="your-project-id", location="us-central1")

# Define tools as function declarations
search_knowledge_base = FunctionDeclaration(
    name="search_knowledge_base",
    description=(
        "Search the company knowledge base for information about products, policies, and FAQs. "
        "Use when the user asks about company-specific information."
    ),
    parameters={
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "The search query"}
        },
        "required": ["query"]
    }
)

get_order_details = FunctionDeclaration(
    name="get_order_details",
    description="Retrieve order details by order ID. Use when the user provides an order number.",
    parameters={
        "type": "object",
        "properties": {
            "order_id": {"type": "string"}
        },
        "required": ["order_id"]
    }
)

model = GenerativeModel(
    model_name="gemini-1.5-flash-001",
    tools=[Tool(function_declarations=[search_knowledge_base, get_order_details])]
)

response = model.generate_content(
    "What is the return policy and what is the status of order #98765?",
    generation_config=GenerationConfig(temperature=0.2)
)

# Handle function calls in the response; skip parts that carry no function call
for part in response.candidates[0].content.parts:
    fn = getattr(part, "function_call", None)
    if fn and fn.name:
        print(f"Tool called: {fn.name}, Arguments: {dict(fn.args)}")

Step 3: Set Up Grounding with Vertex AI Search

from vertexai.generative_models import (
    GenerativeModel, GenerationConfig, Tool, grounding
)

# Grounding with a Vertex AI Search data store
search_tool = Tool.from_retrieval(
    retrieval=grounding.Retrieval(
        source=grounding.VertexAISearch(
            datastore="projects/PROJECT/locations/global/collections/default_collection/dataStores/DATASTORE_ID"
        )
    )
)

model = GenerativeModel("gemini-1.5-pro-001", tools=[search_tool])

response = model.generate_content(
    "What is our company's remote work policy?",
    generation_config=GenerationConfig(temperature=0.1)
)

print(response.text)
# Response includes citation references to source documents

BigQuery ML can invoke the same Gemini model directly from SQL. ML.GENERATE_TEXT is a table-valued function: it takes the remote model, a query that produces a prompt column, and a STRUCT of generation parameters.

-- Create a BigQuery ML remote model backed by Gemini
CREATE OR REPLACE MODEL `your_dataset.gemini_model`
  REMOTE WITH CONNECTION `us.vertex-ai-connection`
  OPTIONS (endpoint = 'gemini-1.5-flash-001');

-- Use the model to classify customer support tickets
SELECT
  ticket_id,
  ticket_text,
  ml_generate_text_llm_result AS classification
FROM ML.GENERATE_TEXT(
  MODEL `your_dataset.gemini_model`,
  (
    SELECT
      ticket_id,
      ticket_text,
      CONCAT(
        'Classify this customer support ticket into one of: billing, technical, shipping, returns, other. ',
        'Respond with only the category name.\n\nTicket: ', ticket_text
      ) AS prompt
    FROM `your_dataset.support_tickets`
    WHERE processed_date = CURRENT_DATE()
  ),
  STRUCT(
    0.1 AS temperature,
    50 AS max_output_tokens,
    TRUE AS flatten_json_output
  )
);

This enables AI-powered enrichment entirely within BigQuery without extracting data to an application layer.
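
If the classification needs to run on a schedule from application code rather than interactively, the same statement can be submitted through the BigQuery client library. A minimal sketch, assuming the remote model, connection, and tables from the SQL above already exist:

from google.cloud import bigquery

# Submit the ML.GENERATE_TEXT classification query as a standard BigQuery job
client = bigquery.Client(project="your-project-id")

query = """
SELECT ticket_id, ml_generate_text_llm_result AS classification
FROM ML.GENERATE_TEXT(
  MODEL `your_dataset.gemini_model`,
  (SELECT ticket_id,
          CONCAT('Classify this ticket: ', ticket_text) AS prompt
   FROM `your_dataset.support_tickets`
   WHERE processed_date = CURRENT_DATE()),
  STRUCT(0.1 AS temperature, 50 AS max_output_tokens, TRUE AS flatten_json_output)
)
"""

for row in client.query(query).result():  # blocks until the job completes
    print(row.ticket_id, row.classification)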

For user-facing applications, stream the response so tokens render as they are generated:

# Stream a response for real-time UI
for chunk in model.generate_content("Explain the difference between RAG and fine-tuning", stream=True):
    if chunk.text:
        print(chunk.text, end="", flush=True)

A production GenAI system on Vertex AI integrates the Gemini API with Google-native data services at every tier.

Production GenAI Architecture — Google Vertex AI: a complete production stack using native Google Cloud services

  • Application tier: Cloud Run · GKE · Cloud Functions — API endpoints
  • Vertex AI Gemini API: function calling · grounding · streaming · multimodal
  • Vertex AI Search: managed RAG — indexes Cloud Storage, Drive, BigQuery
  • Data platform: BigQuery · Cloud Storage · Google Drive · Firestore
  • Operations: Cloud Logging · Cloud Trace · Vertex AI Evaluation

Grounding is architecturally distinct from RAG implemented at the application layer. In application-layer RAG, the application calls a vector store, retrieves documents, appends them to the prompt, and calls the model. In Vertex AI grounding, the retrieval happens inside the Gemini API call — the model retrieves from the configured data store as part of inference, and the response includes both the generated text and citations to source documents.

The practical difference: grounding is simpler to implement (no retrieval code in the application) but less controllable (you cannot intercept retrieved chunks before they reach the model). For applications where retrieval quality is critical and requires custom ranking or filtering, application-layer RAG with Vertex AI Search gives more control.
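
As a hedged illustration of the application-layer variant, the sketch below queries a Vertex AI Search data store through the Discovery Engine client, filters and ranks the results in application code, and only then builds the Gemini prompt. The serving config path and the result-parsing keys (derived_struct_data) are assumptions to adapt to your data store schema:

from google.cloud import discoveryengine_v1 as discoveryengine
from vertexai.generative_models import GenerativeModel

# Assumes vertexai.init(...) has been called as in the earlier examples
search_client = discoveryengine.SearchServiceClient()
serving_config = (
    "projects/PROJECT/locations/global/collections/default_collection/"
    "dataStores/DATASTORE_ID/servingConfigs/default_config"
)

def answer_with_app_layer_rag(question: str) -> str:
    # Retrieve candidates ourselves so they can be re-ranked or filtered before prompting
    results = search_client.search(
        discoveryengine.SearchRequest(
            serving_config=serving_config, query=question, page_size=5
        )
    )
    snippets = []
    for result in results:
        # derived_struct_data typically carries snippets or extractive answers;
        # the exact keys depend on your data store configuration
        data = dict(result.document.derived_struct_data)
        snippets.append(str(data.get("extractive_answers") or data.get("snippets") or data))

    context = "\n\n".join(snippets[:3])  # custom ranking/filtering would happen here
    model = GenerativeModel("gemini-1.5-pro-001")
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return model.generate_content(prompt).text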


A document intelligence pipeline that classifies, extracts, and summarizes uploaded documents using Vertex AI:

import json

import vertexai
from vertexai.generative_models import GenerationConfig, GenerativeModel, Part

vertexai.init(project="your-project", location="us-central1")
model = GenerativeModel("gemini-1.5-pro-001")

def analyze_document(pdf_bytes: bytes) -> dict:
    """Extract structured information from a PDF document."""
    pdf_part = Part.from_data(data=pdf_bytes, mime_type="application/pdf")
    response = model.generate_content(
        [
            pdf_part,
            """Analyze this document and return a JSON object with these fields:
- document_type: the type of document (contract, invoice, report, etc.)
- key_entities: list of key parties, companies, or people mentioned
- dates: list of important dates (as strings in YYYY-MM-DD format)
- summary: 2-3 sentence summary of the document's purpose
Respond with only valid JSON.""",
        ],
        # Request JSON output directly so the response parses without stripping markdown fences
        generation_config=GenerationConfig(response_mime_type="application/json"),
    )
    return json.loads(response.text)

This uses Gemini’s native multimodal capability to process PDFs directly — no OCR preprocessing required for standard PDF documents.

Example: Evaluation with Vertex AI Evaluation

import pandas as pd

from vertexai.evaluation import EvalTask, PointwiseMetric

# EvalTask expects a tabular dataset; a pandas DataFrame works well.
# `model` is the GenerativeModel created in the earlier examples.
eval_dataset = pd.DataFrame([
    {
        "prompt": "What are the key differences between RAG and fine-tuning?",
        "reference": "RAG retrieves relevant context at inference time without changing model weights...",
        "response": model.generate_content("What are the key differences between RAG and fine-tuning?").text,
    },
    # More examples...
])

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[
        "coherence",
        "fluency",
        "groundedness",
        PointwiseMetric(
            metric="answer_quality",
            metric_prompt_template="Rate the quality of this answer on a scale of 1-5...",
        ),
    ],
)

eval_result = eval_task.evaluate()
print(eval_result.summary_metrics)

7. Trade-offs, Limitations & Failure Modes


Gemini vs. Claude and GPT-4 Capability Gaps


Gemini models have different capability profiles than Claude and GPT-4. On instruction-following tasks with complex multi-step reasoning, Gemini 1.5 Pro is competitive with Claude Sonnet-tier models. On tasks that benefit from long context (processing entire codebases, analyzing long documents), Gemini’s 2M token context window is a genuine differentiator — no other production model offers comparable context length.

The gap is more pronounced on specialized coding tasks and nuanced text generation where Claude Sonnet tends to outperform. Run evaluation on your specific use case rather than relying on benchmark comparisons.

Vertex AI Search has strong native coverage for Google-native data sources (Drive, Workspace, Cloud Storage) and reasonable coverage for third-party enterprise apps via pre-built connectors. If your document corpus lives in an unsupported system, you need to export it to Cloud Storage or push it via API before Vertex AI Search can index it. This is more friction than direct vector store ingest for non-Google-native data.

As with Bedrock, not all Gemini models are available in all regions. Multi-region endpoints help but may route inference outside your preferred geography. Verify regional availability for your required model before finalizing architecture, particularly for EU data residency requirements.

Vertex AI Agent Builder is less mature than using the Gemini API directly. The managed agent orchestration has limited observability (you cannot inspect intermediate reasoning steps), limited support for custom logic (the playbook model is less flexible than a code-first graph), and fewer third-party integrations than open-source agent frameworks. For production agents with complex logic, the Gemini API with LangGraph or a similar framework gives significantly more control, as sketched below.
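
A rough sketch of that code-first alternative, assuming the langchain-google-vertexai and langgraph packages are installed; the look_up_order tool is a hypothetical placeholder:

from langchain_core.tools import tool
from langchain_google_vertexai import ChatVertexAI
from langgraph.prebuilt import create_react_agent

@tool
def look_up_order(order_id: str) -> str:
    """Look up an order by ID. (Hypothetical tool; replace with a real backend call.)"""
    return f"Order {order_id}: shipped, arriving Friday."

# ChatVertexAI authenticates via Application Default Credentials, like the SDK examples above
llm = ChatVertexAI(model_name="gemini-1.5-flash-001", temperature=0.2)

# LangGraph owns the reasoning loop, so every intermediate step is inspectable
agent = create_react_agent(llm, tools=[look_up_order])

result = agent.invoke({"messages": [("user", "Where is order #98765?")]})
print(result["messages"][-1].content)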


For a broader set of GenAI interview questions by level, see the GenAI interview questions guide.

“How would you build a RAG system over BigQuery data for a GCP-first team?” The expected answer covers Vertex AI Search indexing a BigQuery export or direct Cloud Storage data, grounding the Gemini API call to the Vertex AI Search data store, and keeping all infrastructure within GCP (Cloud Run or GKE for the application, Cloud Logging for observability). Demonstrating awareness of BigQuery's ML.GENERATE_TEXT function as an alternative for batch processing signals advanced knowledge.

“When would you choose Vertex AI over AWS Bedrock?” The expected answer is infrastructure-context-dependent, not capability-dependent. Vertex AI for GCP-first organizations with data in BigQuery and Drive; Bedrock for AWS-first organizations. Mentioning Gemini’s 2M-token context window as a genuine differentiator for long-document use cases demonstrates product-level knowledge.

“How does Vertex AI grounding work and when would you use it?” Expected: grounding integrates retrieval into the Gemini API call rather than at the application layer, supports Google Search and Vertex AI Search as retrieval sources, reduces hallucination by anchoring responses to retrieved content, and includes citations in the response. Use when simplicity of implementation matters more than retrieval control.


The Vertex AI API has regional endpoints (us-central1-aiplatform.googleapis.com, europe-west4-aiplatform.googleapis.com, etc.). Route requests to the region closest to your application and your users. For global applications, implement region routing at the application layer.
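
In the Python SDK, the region is chosen at initialization time, so regional routing can be expressed as a thin wrapper. A minimal sketch (the helper name and region-selection logic are illustrative):

import vertexai
from vertexai.generative_models import GenerativeModel

def gemini_for_region(project_id: str, location: str) -> GenerativeModel:
    # vertexai.init pins subsequent calls to the regional endpoint,
    # e.g. location="europe-west4" uses europe-west4-aiplatform.googleapis.com
    vertexai.init(project=project_id, location=location)
    return GenerativeModel("gemini-1.5-flash-001")

model = gemini_for_region("your-project-id", "europe-west4")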

Monitor Token Costs with Cloud Billing Alerts


Vertex AI pricing for Gemini models is per character (not per token, unlike most other providers). Set Cloud Billing budget alerts to detect cost anomalies early. Unexpected consumption — from a prompt engineering bug or an unbounded agent loop — will appear in billing data before it appears in application metrics if you do not have token-level logging enabled.
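
Budget alerts can also be created from the CLI. A sketch with placeholder billing account ID, amount, and thresholds (depending on your gcloud version, the command may live under the beta component):

Terminal window
gcloud billing budgets create \
  --billing-account=0X0X0X-0X0X0X-0X0X0X \
  --display-name="genai-monthly-budget" \
  --budget-amount=2000USD \
  --threshold-rule=percent=0.5 \
  --threshold-rule=percent=0.9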

The Vertex AI Evaluation service is one of the best integrated evaluation tools in the managed cloud AI space. Use it before migrating to a new model version. Define a golden dataset of 50–100 representative inputs with reference outputs. Run evaluation on every major model version change and track metric trends over time.

Vertex AI Gemini models have default quota limits per project per minute. For high-volume production applications, request quota increases before you hit them — the process requires business justification and can take several days. Monitor quota consumption via Cloud Monitoring metrics (aiplatform.googleapis.com/generate_content_requests_per_minute) and set alerts at 70% utilization.


Google Vertex AI is the right foundation model infrastructure for teams with GCP-first infrastructure, data in BigQuery and Google Drive, or a need for Gemini’s unique long-context capabilities.

Use Vertex AI when:

  • Your organization runs primarily on GCP and IAM/VPC requirements apply to AI workloads
  • Data lives in BigQuery or Google Drive and native indexing via Vertex AI Search reduces pipeline complexity
  • The task requires processing very long documents and Gemini’s 2M-token context window is a genuine architectural advantage
  • You need model evaluation and fine-tuning in a unified platform that connects to your existing ML pipelines

Consider alternatives when:

  • Organization is AWS-first or Azure-first — the infrastructure integration advantage disappears across cloud boundaries
  • Agent logic is complex and requires fine-grained control over the reasoning loop
  • The specific tasks favor Claude or GPT-4 capability profiles based on your evaluation data

Key operational rules:

  • Use grounding when integration simplicity matters more than retrieval control; use application-layer RAG when you need to customize chunk ranking or filtering
  • Prefer the streaming API for user-facing applications to improve perceived responsiveness
  • Set billing alerts before production deployment — per-character pricing can produce unexpected costs with verbose prompts
  • Evaluate model version changes with Vertex AI Evaluation before promoting to production