
AWS Bedrock Guide 2026 — Managed LLM API for Production GenAI

GenAI engineers who work at AWS-first organizations face a specific problem: building LLM-powered applications while satisfying the security, compliance, and operational requirements that AWS infrastructure was chosen to meet in the first place.

Calling the Anthropic API directly from an AWS Lambda function means your request payloads, which may contain sensitive data in the prompts, leave the AWS boundary and travel to a third-party API over the public internet. It means authentication is a static API key sitting in a Secrets Manager entry instead of being managed by IAM. It means model invocations do not appear in your CloudTrail audit logs. And it means that when you need low-latency, guaranteed throughput for a high-volume application, there is no mechanism for provisioned capacity.

AWS Bedrock is the answer to this. It is a managed service that runs foundation model inference entirely within the AWS boundary: VPC endpoints, IAM-based authentication, CloudTrail logging, and the ability to provision throughput. It adds a managed RAG layer (Knowledge Bases for Amazon Bedrock), a managed agent orchestration layer (Amazon Bedrock Agents), and guardrails for content filtering and safety.

The model catalog includes Claude (Anthropic), Titan (Amazon), Llama (Meta), Mistral, Cohere, and AI21 Labs models — available through a unified API regardless of the underlying provider.

For teams in AWS-first organizations, the integration advantage of Bedrock is not optional infrastructure plumbing — it is a compliance requirement. SOC 2, HIPAA, PCI DSS, and FedRAMP workloads running on AWS cannot route model invocations through third-party public APIs without additional compliance documentation. Bedrock keeps inference within the certified AWS boundary, which is why enterprise legal and security teams approve Bedrock first and evaluate direct model APIs second.


A team at a financial services company builds a document analysis agent using the Anthropic API directly. It works in development. In production review, it fails three compliance requirements: data containing customer financial information cannot leave the AWS boundary, model invocations must be in the audit trail, and the API key rotation schedule does not meet their security policy.

The remediation options are expensive: build a proxy service that runs inside the VPC and intercepts all Anthropic API calls, implement CloudTrail event publishing for every API call, and move to a credential rotation system. Or, switch to Bedrock, where all three are native capabilities.

This is not an edge case. It is the standard discovery path at financial services, healthcare, and government organizations building on AWS.

A second scenario: a high-volume application sends 500 requests per second to Claude via the Anthropic API. The API’s default rate limits impose a per-minute token budget that the application regularly hits. Each rate limit error requires a retry with exponential backoff, adding latency and unpredictability.

Bedrock’s Provisioned Throughput solves this: purchase a fixed number of model units for a given model, and Bedrock guarantees that throughput is available for your account. No shared rate limits, no rate limit errors at steady state, predictable latency. The trade-off is cost — provisioned throughput is significantly more expensive than on-demand pricing and requires committing to a 1-month or 6-month term.


Bedrock is not a single service — it is an umbrella for five distinct capabilities. Understanding which capability to use for a given task is the foundational knowledge for working with Bedrock.

Foundation Model Inference (InvokeModel / Converse API): Direct model invocation. Send a prompt, receive a completion. The Converse API provides a model-agnostic message format that works across all supported models, making it the recommended interface for new applications.

Knowledge Bases for Amazon Bedrock: Managed RAG. You point Bedrock at a data source (S3, Confluence, SharePoint, web crawler), it handles chunking, embedding, and vector storage (in OpenSearch Serverless or other supported vector stores), and it exposes Retrieve and RetrieveAndGenerate APIs for query-time retrieval and generation.

Amazon Bedrock Agents: Managed agent orchestration. Define a set of action groups (tools), an instruction prompt, and a knowledge base. Bedrock manages the ReAct loop — tool selection, tool execution, observation injection, and iteration — without you implementing the loop.
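
Once an agent has been created and given an alias, invoking it is a single runtime call. A minimal sketch, assuming an existing agent; the agent ID, alias ID, and input text below are placeholders:

import uuid
import boto3

agent_runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

# Placeholders: substitute the IDs of an agent you have created and aliased.
response = agent_runtime.invoke_agent(
    agentId="AGENT_ID_HERE",
    agentAliasId="AGENT_ALIAS_ID_HERE",
    sessionId=str(uuid.uuid4()),  # reuse the same session ID to continue a conversation
    inputText="Summarize the open incidents for the payments service.",
)

# The agent streams its final answer back as chunk events.
answer = "".join(
    event["chunk"]["bytes"].decode("utf-8")
    for event in response["completion"]
    if "chunk" in event
)
print(answer)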

Guardrails for Amazon Bedrock: Content filtering, PII detection and redaction, topic denial, and grounding checks. Apply to any model invocation in Bedrock, including those within Knowledge Bases and Agents.

Model Evaluation: Automated evaluation of model outputs on a dataset using both automated metrics (BERTScore, accuracy) and human review workflows. Used for choosing between models and monitoring quality over time.

Every Bedrock API call is an AWS API call, which means it is authenticated via IAM. There are no separate API keys for Bedrock. An EC2 instance with an IAM role that includes bedrock:InvokeModel can call Bedrock without any additional credentials. A Lambda function assumes a role with the same permission. A SageMaker notebook uses its execution role.

This is a significant operational advantage over direct model APIs: credential management follows the IAM lifecycle — rotation, revocation, and audit are handled by your existing AWS identity infrastructure. There is no separate secret to manage.
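
To make the least-privilege point concrete, here is a rough sketch of attaching an inline policy that allows invoking a single model and nothing else. The role name, policy name, and model ARN are illustrative:

import json
import boto3

iam = boto3.client("iam")

# Illustrative least-privilege policy: invoke one specific foundation model, nothing else.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream",
            ],
            "Resource": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-sonnet-4-5-20250929-v1:0",
        }
    ],
}

iam.put_role_policy(
    RoleName="my-lambda-execution-role",   # placeholder: the role your Lambda or ECS task assumes
    PolicyName="bedrock-invoke-only",
    PolicyDocument=json.dumps(policy_document),
)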


Bedrock models are not available by default. In the Bedrock console, navigate to Model Access and request access to the models you need. For Claude models, this typically requires accepting Anthropic’s terms of service. Access is provisioned within minutes for most models.

# Verify access via CLI
aws bedrock list-foundation-models --by-provider anthropic --region us-east-1

The Converse API is the recommended interface for chat-style models. It handles message formatting consistently across Claude, Titan, Llama, and Mistral:

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock.converse(
    modelId="anthropic.claude-sonnet-4-5-20250929-v1:0",
    messages=[
        {"role": "user", "content": [{"text": "Explain the ReAct agent pattern in two paragraphs."}]}
    ],
    system=[{"text": "You are a senior GenAI engineer. Explain concepts clearly and concisely."}],
    inferenceConfig={
        "maxTokens": 1024,
        "temperature": 0.3
    }
)

output_text = response["output"]["message"]["content"][0]["text"]
print(output_text)

The boto3 client automatically uses your environment’s IAM credentials — no API key management required.

For RAG, create a Knowledge Base in the Bedrock console or via the API:

  1. Create an S3 bucket and upload your documents (PDFs, Word files, text files)
  2. Create a Knowledge Base pointing at the S3 prefix
  3. Configure the embedding model (Amazon Titan Embeddings v2 is the default)
  4. Select a vector store (OpenSearch Serverless, Aurora Serverless with pgvector, or Pinecone)
  5. Sync the data source to trigger chunking, embedding, and indexing (this sync can also be started via the API, as sketched below)
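
Step 5, the data source sync, can be triggered programmatically. A minimal sketch using the bedrock-agent control-plane client; the Knowledge Base and data source IDs are placeholders:

import boto3

bedrock_agent_ctl = boto3.client("bedrock-agent", region_name="us-east-1")

# Placeholders: the Knowledge Base and data source must already exist (steps 1-4 above).
sync = bedrock_agent_ctl.start_ingestion_job(
    knowledgeBaseId="KB_ID_HERE",
    dataSourceId="DATA_SOURCE_ID_HERE",
)
print(sync["ingestionJob"]["status"])  # poll get_ingestion_job until the status is COMPLETE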

After setup, query via the API:

bedrock_agent = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

# Retrieve relevant chunks without generation
retrieve_response = bedrock_agent.retrieve(
    knowledgeBaseId="KB_ID_HERE",
    retrievalQuery={"text": "What is the refund policy for enterprise customers?"},
    retrievalConfiguration={
        "vectorSearchConfiguration": {"numberOfResults": 5}
    }
)

# Or retrieve and generate in one call
rag_response = bedrock_agent.retrieve_and_generate(
    input={"text": "What is the refund policy for enterprise customers?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KB_ID_HERE",
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-sonnet-4-5-20250929-v1:0"
        }
    }
)

print(rag_response["output"]["text"])

Create a Guardrail in the Bedrock console or via the API with the protections you need:

bedrock_mgmt = boto3.client("bedrock", region_name="us-east-1")

guardrail = bedrock_mgmt.create_guardrail(
    name="production-guardrail",
    description="Content safety and PII protection for customer-facing agent",
    topicPolicyConfig={
        "topicsConfig": [
            {
                "name": "legal-advice",
                "definition": "Requests for legal advice or interpretation of laws",
                "examples": ["Is this contract legally binding?", "Can I sue for this?"],
                "type": "DENY"
            }
        ]
    },
    sensitiveInformationPolicyConfig={
        "piiEntitiesConfig": [
            {"type": "EMAIL", "action": "ANONYMIZE"},
            {"type": "PHONE", "action": "ANONYMIZE"},
            {"type": "US_SOCIAL_SECURITY_NUMBER", "action": "BLOCK"}
        ]
    },
    contentPolicyConfig={
        "filtersConfig": [
            {"type": "VIOLENCE", "inputStrength": "HIGH", "outputStrength": "HIGH"},
            {"type": "HATE", "inputStrength": "HIGH", "outputStrength": "HIGH"}
        ]
    },
    # Messages returned when the guardrail blocks an input or an output (required fields)
    blockedInputMessaging="Sorry, I can't help with that request.",
    blockedOutputsMessaging="Sorry, I can't provide that response."
)

guardrail_id = guardrail["guardrailId"]

# Apply to model invocations
user_input = "Is this contract legally binding?"  # example user input
response = bedrock.converse(
    modelId="anthropic.claude-sonnet-4-5-20250929-v1:0",
    messages=[{"role": "user", "content": [{"text": user_input}]}],
    guardrailConfig={"guardrailIdentifier": guardrail_id, "guardrailVersion": "DRAFT"}
)

Step 5: Provision Throughput for Production


For high-volume production workloads, purchase Provisioned Throughput via the console or API:

bedrock_mgmt.create_provisioned_model_throughput(
    modelUnits=5,  # number of model units (check the pricing page for throughput per unit)
    provisionedModelName="claude-prod-throughput",
    modelId="anthropic.claude-sonnet-4-5-20250929-v1:0",
    commitmentDuration="SIX_MONTHS"  # or "ONE_MONTH"
)

Use the provisioned model ARN in subsequent converse calls instead of the base model ID.
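
A minimal sketch of that handoff, assuming the create call's response exposes the provisioned model ARN under provisionedModelArn and that the model units have finished provisioning:

import boto3

bedrock_mgmt = boto3.client("bedrock", region_name="us-east-1")
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Assumed response key: provisionedModelArn. Wait for the provisioned model to
# reach the InService state before routing production traffic to it.
provisioned_arn = bedrock_mgmt.create_provisioned_model_throughput(
    modelUnits=5,
    provisionedModelName="claude-prod-throughput",
    modelId="anthropic.claude-sonnet-4-5-20250929-v1:0",
    commitmentDuration="SIX_MONTHS",
)["provisionedModelArn"]

response = bedrock.converse(
    modelId=provisioned_arn,  # provisioned model ARN in place of the base model ID
    messages=[{"role": "user", "content": [{"text": "Summarize this support ticket."}]}],
)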


A production GenAI system on AWS Bedrock typically has five layers: the application tier, the Bedrock orchestration tier, the retrieval tier, the safety tier, and the data tier.

Production GenAI Architecture — AWS Bedrock: a complete production stack using native AWS services.

  • Application Tier: API Gateway · Lambda · ECS (receives user requests)
  • Guardrails: content filtering · PII detection · topic denial
  • Bedrock Agents / Converse API: ReAct loop · tool execution · model invocation
  • Knowledge Bases: managed RAG (retrieval from indexed documents)
  • Data Layer: S3 (documents) · OpenSearch (vectors) · DynamoDB (sessions)

The request flows downward: the application receives it, Guardrails validate input, Bedrock Agents orchestrate the response using model invocation and Knowledge Base retrieval, and results flow back up through Guardrails for output validation before being returned to the application.

In a compliant production deployment:

  • All Bedrock API calls originate from within a VPC using a VPC endpoint (com.amazonaws.us-east-1.bedrock-runtime); a creation sketch follows this list
  • No traffic goes to the public internet for model invocations
  • Lambda and ECS tasks assume IAM roles with minimum-required Bedrock permissions
  • CloudTrail captures every InvokeModel and Converse call with the IAM principal, model ID, input token count, and timestamp
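
Creating the interface endpoint is one API call. A hedged sketch; the VPC, subnet, and security group IDs are placeholders, and the service name assumes us-east-1:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Placeholders: substitute your own VPC, subnet, and security group IDs.
ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.us-east-1.bedrock-runtime",
    SubnetIds=["subnet-0123456789abcdef0"],
    SecurityGroupIds=["sg-0123456789abcdef0"],
    PrivateDnsEnabled=True,  # the default bedrock-runtime hostname resolves to the endpoint inside the VPC
)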

Example: Document Q&A Agent with Knowledge Bases


The most common Bedrock production pattern: a question-answering system over a proprietary document corpus.

import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

def answer_question(question: str, knowledge_base_id: str, model_arn: str) -> dict:
    """Query a Knowledge Base and generate a grounded answer."""
    response = bedrock_agent_runtime.retrieve_and_generate(
        input={"text": question},
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": knowledge_base_id,
                "modelArn": model_arn,
                "retrievalConfiguration": {
                    "vectorSearchConfiguration": {
                        "numberOfResults": 5,
                        "overrideSearchType": "HYBRID"  # combines semantic + keyword search
                    }
                },
                "generationConfiguration": {
                    "promptTemplate": {
                        "textPromptTemplate": (
                            "Answer the question using only the provided context. "
                            "If the context does not contain enough information, say so explicitly. "
                            "Context: $search_results$\nQuestion: $query$"
                        )
                    }
                }
            }
        }
    )
    return {
        "answer": response["output"]["text"],
        "citations": [
            citation["retrievedReferences"]
            for citation in response.get("citations", [])
        ]
    }

Key details: HYBRID search combines semantic similarity (vector) with keyword matching (BM25), which outperforms pure semantic search on queries with specific entity names or product codes. The custom prompt template prevents the model from answering from its training data — it is grounded to the retrieved context.

For user-facing applications, streaming is essential for perceived responsiveness:

def stream_response(prompt: str) -> None:
    """Stream a model response token by token."""
    bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
    response = bedrock.converse_stream(
        modelId="anthropic.claude-sonnet-4-5-20250929-v1:0",
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    for event in response["stream"]:
        if "contentBlockDelta" in event:
            delta = event["contentBlockDelta"]["delta"]
            if "text" in delta:
                print(delta["text"], end="", flush=True)

converse_stream returns a streaming event iterator. Each contentBlockDelta event contains a text fragment. In a web application, forward these fragments to the client via SSE or WebSocket for real-time streaming.
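
One way to do that forwarding is to wrap the stream in a generator that emits SSE frames and hand it to whatever web framework serves the response. A minimal, framework-agnostic sketch; the [DONE] sentinel is a convention of this example, not part of the Bedrock API:

from typing import Iterator

import boto3

def sse_events(prompt: str) -> Iterator[str]:
    """Yield model output fragments as Server-Sent Events frames."""
    bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
    response = bedrock.converse_stream(
        modelId="anthropic.claude-sonnet-4-5-20250929-v1:0",
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    for event in response["stream"]:
        if "contentBlockDelta" in event:
            delta = event["contentBlockDelta"]["delta"]
            if "text" in delta:
                yield f"data: {delta['text']}\n\n"  # one SSE frame per text fragment
    yield "data: [DONE]\n\n"  # sentinel so the client knows the stream has ended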


7. Trade-offs, Limitations & Failure Modes


Bedrock’s model catalog is not always synchronized with the latest model releases from Anthropic and other providers. A new Claude model released today may not appear in Bedrock for weeks or months. For applications where the latest model capabilities matter more than AWS infrastructure compliance, this lag is a real constraint.

Mitigate by checking Bedrock’s model availability during your architecture evaluation, not after. Do not assume the model you want will be available on Bedrock when you need it.

The default chunking strategy in Knowledge Bases (fixed-size chunks at 300 tokens) works poorly for structured documents like tables, code blocks, and multi-part lists. A table chunk that starts halfway through a row produces nonsensical retrievals.

For structured documents, configure custom chunking: hierarchical chunking (preserves document structure), semantic chunking (splits at natural boundaries), or fixed-size with larger overlap. For code documentation, semantic chunking at function/class boundaries produces dramatically better retrieval quality than fixed-size.
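
The chunking strategy is set on the data source's vector ingestion configuration. As an illustration of where that setting lives, a hedged sketch of creating a data source with fixed-size chunks larger than the default; the bucket ARN and Knowledge Base ID are placeholders, and the field names assume the boto3 bedrock-agent create_data_source shape:

import boto3

bedrock_agent_ctl = boto3.client("bedrock-agent", region_name="us-east-1")

bedrock_agent_ctl.create_data_source(
    knowledgeBaseId="KB_ID_HERE",
    name="structured-docs",
    dataSourceConfiguration={
        "type": "S3",
        "s3Configuration": {"bucketArn": "arn:aws:s3:::my-docs-bucket"},
    },
    vectorIngestionConfiguration={
        "chunkingConfiguration": {
            "chunkingStrategy": "FIXED_SIZE",
            "fixedSizeChunkingConfiguration": {
                "maxTokens": 512,         # larger chunks than the 300-token default
                "overlapPercentage": 20,  # overlap reduces the chance of splitting tables mid-row
            },
        }
    },
)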

Not all Bedrock models are available in all AWS regions. If your application runs in eu-west-1 for data residency reasons, verify that your required models are available in that region before committing to the architecture. For models not available in your required region, the workaround is cross-region inference profiles — Bedrock can route inference to another region, but this may conflict with data residency requirements.

Bedrock Agents manages the ReAct loop internally, which makes debugging difficult. You cannot inspect individual Thought/Action/Observation cycles from the outside — you see the final response and the action traces, but not the full reasoning process. For complex agents where understanding intermediate reasoning is important, implementing the loop yourself (using the Converse API and LangGraph) gives significantly better observability.

On-demand Bedrock pricing is per-input-token and per-output-token, similar to direct API pricing. At high volume, the cost difference between on-demand and provisioned throughput can be significant. Provisioned Throughput requires a 1-month or 6-month commitment — model your expected usage carefully before committing, since the commitment is per model unit, not per model.


Bedrock comes up in interviews at AWS-centric companies and in roles that explicitly require AWS cloud experience. For general GenAI interview preparation, see the GenAI interview questions guide.

“How would you build a compliant RAG system on AWS?” The expected answer: S3 for document storage, Knowledge Bases for Amazon Bedrock for managed chunking/embedding/retrieval, Bedrock Converse API for generation, Guardrails for output validation, all within a VPC using VPC endpoints. IAM roles for authentication, CloudTrail for audit logging.

“What is the difference between Bedrock Knowledge Bases and building RAG yourself?” Managed vs. self-hosted trade-off: Bedrock handles the operational overhead (provisioning, scaling, maintenance of the vector store and embedding pipeline) but limits your control over chunking strategies, embedding models, and retrieval algorithms. Self-built RAG gives full control at the cost of operational complexity.

“When would you use Bedrock Agents vs. building an agent with LangGraph?” Bedrock Agents: when you want fully managed orchestration with minimal code, your tools can be expressed as Lambda functions or API schemas, and production debuggability is not the primary concern. LangGraph: when you need custom control flow, conditional logic, or fine-grained observability into the reasoning loop.


Do not commit to Provisioned Throughput until you have production traffic data. On-demand pricing scales with usage — low cost at low volume. Provisioned Throughput has fixed costs regardless of usage. Run on on-demand for at least 30 days of production traffic, measure actual invocation rates and token consumption, then evaluate whether Provisioned Throughput is cost-effective at your actual usage patterns.

Bedrock usage accumulates across teams and applications. Enable Bedrock usage reporting and attribute spend per workload: separate workloads into distinct IAM roles or accounts, or use application inference profiles, which support cost allocation tags. Without cost attribution, a single over-consuming agent can make the overall Bedrock bill uninterpretable.

Implement Gradual Rollout via Model Aliases


When upgrading to a newer model version, roll it out gradually: route a small percentage of traffic to the new model, validate quality metrics and latency, and only then migrate fully. Bedrock does not have native traffic splitting between base models, so implement it at the application layer by assigning model IDs per request based on a percentage split, as sketched below.
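
A minimal sketch of that application-layer split; the model IDs and rollout fraction are illustrative, and the per-request random draw assumes stateless requests (for multi-turn sessions, hash a stable user or session ID instead):

import random

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

CURRENT_MODEL_ID = "CURRENT_MODEL_ID_HERE"      # placeholder: the model currently in production
CANDIDATE_MODEL_ID = "CANDIDATE_MODEL_ID_HERE"  # placeholder: the newer model under evaluation
CANDIDATE_TRAFFIC_FRACTION = 0.10               # 10% of requests go to the candidate

def pick_model_id() -> str:
    """Route a fixed fraction of requests to the candidate model."""
    if random.random() < CANDIDATE_TRAFFIC_FRACTION:
        return CANDIDATE_MODEL_ID
    return CURRENT_MODEL_ID

response = bedrock.converse(
    modelId=pick_model_id(),
    messages=[{"role": "user", "content": [{"text": "Summarize this support ticket."}]}],
)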


AWS Bedrock is the right foundation model infrastructure for teams with existing AWS investments and compliance requirements. Its primary advantages are IAM-based authentication, CloudTrail audit logging, VPC isolation, and a managed RAG layer that eliminates operational overhead for the majority of document Q&A use cases.

Use Bedrock when:

  • Your organization is AWS-first and has IAM and VPC requirements that direct model APIs cannot meet
  • You need managed RAG without operating your own vector store and embedding pipeline
  • Compliance requires all model invocations to be within the AWS boundary and in CloudTrail
  • You need provisioned throughput guarantees for high-volume production workloads

Consider direct model APIs when:

  • The latest model versions matter more than compliance boundary alignment
  • Your infrastructure is multi-cloud or non-AWS
  • You need control over the agent orchestration loop that Bedrock Agents does not provide

Key operational rules:

  • Use the Converse API, not InvokeModel — it is model-agnostic and forward-compatible
  • Use hybrid search in Knowledge Bases — it outperforms pure semantic search on most production query distributions
  • Apply Guardrails to all customer-facing agent endpoints — do not rely on model-level safety alone
  • Tag Bedrock API calls for cost attribution before you have multiple teams sharing an account