AWS Bedrock Guide 2026 — Managed LLM API for Production GenAI
1. Why AWS Bedrock Matters for GenAI
AWS Bedrock solves the compliance and operational gap between direct LLM APIs and the security requirements of AWS-first organizations.
Why AWS Bedrock Exists
GenAI engineers who work at AWS-first organizations face a specific problem: building LLM-powered applications while satisfying the security, compliance, and operational requirements that AWS infrastructure was chosen to meet in the first place.
Calling the Anthropic API directly from an AWS Lambda function means your source code — and possibly sensitive data in your prompts — leaves the AWS boundary and travels to a third-party API over the public internet. It means authentication is a static API key that sits in a Secrets Manager entry instead of being managed by IAM. It means model invocations are not in your CloudTrail audit logs. It means when you need low-latency, guaranteed throughput for a high-volume application, there is no mechanism for provisioned capacity.
AWS Bedrock is the answer to this. It is a managed service that runs foundation model inference entirely within the AWS boundary: VPC endpoints, IAM-based authentication, CloudTrail logging, and the ability to provision throughput. It adds a managed RAG layer (Knowledge Bases for Amazon Bedrock), a managed agent orchestration layer (Amazon Bedrock Agents), and guardrails for content filtering and safety.
The model catalog includes Claude (Anthropic), Titan (Amazon), Llama (Meta), Mistral, Cohere, and AI21 Labs models — available through a unified API regardless of the underlying provider. This guide is a deep dive into Bedrock specifically — for a side-by-side comparison of all three major platforms, see our Cloud AI Platforms guide.
The AWS Organization Context
For teams in AWS-first organizations, the integration advantage of Bedrock is not optional infrastructure plumbing — it is a compliance requirement. SOC 2, HIPAA, PCI DSS, and FedRAMP workloads running on AWS cannot route model invocations through third-party public APIs without additional compliance documentation. Bedrock keeps inference within the certified AWS boundary, which is why enterprise legal and security teams approve Bedrock first and evaluate direct model APIs second.
2. Real-World Problem Context
Two scenarios illustrate where direct model APIs fail at enterprise scale: compliance boundaries and throughput guarantees.
The Direct API Approach at Scale
A team at a financial services company builds a document analysis agent using the Anthropic API directly. It works in development. In production review, it fails three compliance requirements: data containing customer financial information cannot leave the AWS boundary, model invocations must be in the audit trail, and the API key rotation schedule does not meet their security policy.
The remediation options are expensive: build a proxy service that runs inside the VPC and intercepts all Anthropic API calls, implement CloudTrail event publishing for every API call, and move to a credential rotation system. Or, switch to Bedrock, where all three are native capabilities.
This is not an edge case. It is the standard discovery path at financial services, healthcare, and government organizations building on AWS.
The Throughput Problem
A second scenario: a high-volume application sends 500 requests per second to Claude via the Anthropic API. The API’s default rate limits impose a per-minute token budget that the application regularly hits. Each rate limit error requires a retry with exponential backoff, adding latency and unpredictability.
Bedrock’s Provisioned Throughput solves this: purchase a fixed number of model units for a given model, and Bedrock guarantees that throughput is available for your account. No shared rate limits, no rate limit errors at steady state, predictable latency. The trade-off is cost — provisioned throughput is significantly more expensive than on-demand pricing and requires committing to a 1-month or 6-month term.
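On the on-demand tier, the retry-with-backoff loop described above is unavoidable. A minimal sketch of that loop follows; the helper names and the generic invoke/is_throttle hooks are illustrative, not SDK APIs. In practice, invoke would wrap a bedrock.converse(...) call and is_throttle would check for a botocore ClientError with error code "ThrottlingException".

```python
import random
import time

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Full-jitter exponential backoff: uniform in [0, min(cap, base * 2^attempt)]."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def call_with_retry(invoke, is_throttle, max_attempts: int = 5, base: float = 1.0):
    """Retry invoke() on throttling errors, sleeping between attempts.

    invoke is a zero-argument callable wrapping the model invocation;
    is_throttle inspects the raised exception and returns True for
    rate-limit errors that deserve a retry.
    """
    for attempt in range(max_attempts):
        try:
            return invoke()
        except Exception as err:
            if not is_throttle(err):
                raise
            time.sleep(backoff_delay(attempt, base=base))
    raise RuntimeError(f"still throttled after {max_attempts} attempts")
```

With Provisioned Throughput, this loop becomes dead code at steady state, which is exactly the operational simplification being purchased.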
3. How AWS Bedrock Works — Core Services
Bedrock is five distinct services under one umbrella — knowing which capability to use for a given task is foundational.
The Bedrock Service Map
Bedrock is not a single service — it is an umbrella for five distinct capabilities. Understanding which capability to use for a given task is the foundational knowledge for working with Bedrock.
Foundation Model Inference (InvokeModel / Converse API): Direct model invocation. Send a prompt, receive a completion. The Converse API provides a model-agnostic message format that works across all supported models, making it the recommended interface for new applications.
Knowledge Bases for Amazon Bedrock: Managed RAG. You point Bedrock at a data source (S3, Confluence, SharePoint, web crawler), it handles chunking, embedding, and vector storage (in OpenSearch Serverless or other supported vector stores), and exposes a Retrieve and RetrieveAndGenerate API for query-time retrieval and generation.
Amazon Bedrock Agents: Managed agent orchestration. Define a set of action groups (tools), an instruction prompt, and a knowledge base. Bedrock manages the ReAct loop — tool selection, tool execution, observation injection, and iteration — without you implementing the loop.
Guardrails for Amazon Bedrock: Content filtering, PII detection and redaction, topic denial, and grounding checks. Apply to any model invocation in Bedrock, including those within Knowledge Bases and Agents.
Model Evaluation: Automated evaluation of model outputs on a dataset using both automated metrics (BERTScore, accuracy) and human review workflows. Used for choosing between models and monitoring quality over time.
IAM as the Authentication Layer
Every Bedrock API call is an AWS API call, which means it is authenticated via IAM. There are no separate API keys for Bedrock. An EC2 instance with an IAM role that includes bedrock:InvokeModel can call Bedrock without any additional credentials. A Lambda function assumes a role with the same permission. A SageMaker notebook uses its execution role.
This is a significant operational advantage over direct model APIs: credential management follows the IAM lifecycle — rotation, revocation, and audit are handled by your existing AWS identity infrastructure. There is no separate secret to manage.
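As a concrete sketch, a least-privilege policy for an inference-only workload scopes bedrock:InvokeModel to a single model ARN. The policy document below is illustrative: the model ARN is a placeholder, and note that Converse and ConverseStream calls are authorized by the same InvokeModel and InvokeModelWithResponseStream actions.

```python
import json

# Least-privilege policy document for an inference-only role. The model ARN
# is a placeholder: scope it to the model(s) your workload actually calls.
invoke_only_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel",                    # also authorizes Converse
                "bedrock:InvokeModelWithResponseStream",  # also authorizes ConverseStream
            ],
            "Resource": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-sonnet-4-5-20250929-v1:0",
        }
    ],
}

print(json.dumps(invoke_only_policy, indent=2))
```

Attach it inline to the Lambda or ECS task role; nothing in this statement grants model-access management or provisioning rights.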
4. Using AWS Bedrock Step by Step
These five steps take you from enabling model access to provisioning throughput for production workloads.
Step 1: Enable Model Access
Bedrock models are not available by default. In the Bedrock console, navigate to Model Access and request access to the models you need. For Claude models, this typically requires accepting Anthropic’s terms of service. Access is provisioned within minutes for most models.
```shell
# Verify access via CLI
aws bedrock list-foundation-models --by-provider anthropic --region us-east-1
```

Step 2: Invoke a Model
The Converse API is the recommended interface for chat-style models. It handles message formatting consistently across Claude, Titan, Llama, and Mistral:
```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock.converse(
    modelId="anthropic.claude-sonnet-4-5-20250929-v1:0",
    messages=[
        {"role": "user", "content": [{"text": "Explain the ReAct agent pattern in two paragraphs."}]}
    ],
    system=[{"text": "You are a senior GenAI engineer. Explain concepts clearly and concisely."}],
    inferenceConfig={
        "maxTokens": 1024,
        "temperature": 0.3
    }
)

output_text = response["output"]["message"]["content"][0]["text"]
print(output_text)
```

The boto3 client automatically uses your environment’s IAM credentials — no API key management required.
Step 3: Set Up a Knowledge Base
For RAG, create a Knowledge Base in the Bedrock console or via the API:
- Create an S3 bucket and upload your documents (PDFs, Word files, text files)
- Create a Knowledge Base pointing at the S3 prefix
- Configure the embedding model (Amazon Titan Embeddings v2 is the default)
- Select a vector store (OpenSearch Serverless, Aurora Serverless with pgvector, or Pinecone)
- Sync the data source to trigger chunking, embedding, and indexing
After setup, query via the API:
```python
bedrock_agent = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

# Retrieve relevant chunks without generation
retrieve_response = bedrock_agent.retrieve(
    knowledgeBaseId="KB_ID_HERE",
    retrievalQuery={"text": "What is the refund policy for enterprise customers?"},
    retrievalConfiguration={
        "vectorSearchConfiguration": {"numberOfResults": 5}
    }
)

# Or retrieve and generate in one call
rag_response = bedrock_agent.retrieve_and_generate(
    input={"text": "What is the refund policy for enterprise customers?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KB_ID_HERE",
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-sonnet-4-5-20250929-v1:0"
        }
    }
)
print(rag_response["output"]["text"])
```

Step 4: Configure Guardrails
Create a Guardrail in the Bedrock console or via the API with the protections you need:
```python
bedrock_mgmt = boto3.client("bedrock", region_name="us-east-1")

guardrail = bedrock_mgmt.create_guardrail(
    name="production-guardrail",
    description="Content safety and PII protection for customer-facing agent",
    topicPolicyConfig={
        "topicsConfig": [
            {
                "name": "legal-advice",
                "definition": "Requests for legal advice or interpretation of laws",
                "examples": ["Is this contract legally binding?", "Can I sue for this?"],
                "type": "DENY"
            }
        ]
    },
    sensitiveInformationPolicyConfig={
        "piiEntitiesConfig": [
            {"type": "EMAIL", "action": "ANONYMIZE"},
            {"type": "PHONE", "action": "ANONYMIZE"},
            {"type": "US_SOCIAL_SECURITY_NUMBER", "action": "BLOCK"}
        ]
    },
    contentPolicyConfig={
        "filtersConfig": [
            {"type": "VIOLENCE", "inputStrength": "HIGH", "outputStrength": "HIGH"},
            {"type": "HATE", "inputStrength": "HIGH", "outputStrength": "HIGH"}
        ]
    },
    # Required: the messages returned when the guardrail blocks input or output
    blockedInputMessaging="Sorry, I can't help with that request.",
    blockedOutputsMessaging="Sorry, I can't provide that response."
)

guardrail_id = guardrail["guardrailId"]

# Apply to model invocations ("bedrock" is the bedrock-runtime client from
# Step 2; user_input is the incoming request text)
response = bedrock.converse(
    modelId="anthropic.claude-sonnet-4-5-20250929-v1:0",
    messages=[{"role": "user", "content": [{"text": user_input}]}],
    guardrailConfig={"guardrailIdentifier": guardrail_id, "guardrailVersion": "DRAFT"}
)
```

Step 5: Provision Throughput for Production
For high-volume production workloads, purchase Provisioned Throughput via the console or API:
```python
bedrock_mgmt.create_provisioned_model_throughput(
    modelUnits=5,  # Number of model units (check pricing page for throughput per unit)
    provisionedModelName="claude-prod-throughput",
    modelId="anthropic.claude-sonnet-4-5-20250929-v1:0",
    commitmentDuration="SixMonths"  # or "OneMonth"
)
```

Use the provisioned model ARN in subsequent converse calls instead of the base model ID.
5. AWS Bedrock Architecture and Design
A production GenAI stack on Bedrock routes requests through five distinct layers, each mapping to a native AWS service.
Production GenAI Stack on AWS
A production GenAI system on AWS Bedrock typically has five layers: the application tier, the Bedrock orchestration tier, the retrieval tier, the safety tier, and the data tier.
📊 Visual Explanation
Production GenAI Architecture — AWS Bedrock
A complete production stack using native AWS services
The request flows downward: the application receives it, Guardrails validate input, Bedrock Agents orchestrate the response using model invocation and Knowledge Base retrieval, and results flow back up through Guardrails for output validation before being returned to the application.
IAM and VPC Architecture
In a compliant production deployment:
- All Bedrock API calls originate from within a VPC using a VPC endpoint (com.amazonaws.us-east-1.bedrock-runtime)
- No traffic goes to the public internet for model invocations
- Lambda and ECS tasks assume IAM roles with minimum-required Bedrock permissions
- CloudTrail captures every InvokeModel and Converse call with the IAM principal, model ID, input token count, and timestamp
6. AWS Bedrock Code Examples in Python
These examples cover the two most common Bedrock production patterns: Knowledge Base Q&A and streaming completions.
Example: Document Q&A Agent with Knowledge Bases
The most common Bedrock production pattern: a question-answering system over a proprietary document corpus.
```python
import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

def answer_question(question: str, knowledge_base_id: str, model_arn: str) -> dict:
    """Query a Knowledge Base and generate a grounded answer."""
    response = bedrock_agent_runtime.retrieve_and_generate(
        input={"text": question},
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": knowledge_base_id,
                "modelArn": model_arn,
                "retrievalConfiguration": {
                    "vectorSearchConfiguration": {
                        "numberOfResults": 5,
                        "overrideSearchType": "HYBRID"  # Combines semantic + keyword search
                    }
                },
                "generationConfiguration": {
                    "promptTemplate": {
                        "textPromptTemplate": (
                            "Answer the question using only the provided context. "
                            "If the context does not contain enough information, say so explicitly. "
                            "Context: $search_results$\nQuestion: $query$"
                        )
                    }
                }
            }
        }
    )

    return {
        "answer": response["output"]["text"],
        "citations": [
            citation["retrievedReferences"]
            for citation in response.get("citations", [])
        ]
    }
```

Key details: HYBRID search combines semantic similarity (vector) with keyword matching (BM25), which outperforms pure semantic search on queries with specific entity names or product codes. The custom prompt template prevents the model from answering from its training data — it is grounded to the retrieved context.
Example: Streaming Responses
For user-facing applications, streaming is essential for perceived responsiveness:
```python
def stream_response(prompt: str) -> None:
    """Stream a model response token by token."""
    bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

    response = bedrock.converse_stream(
        modelId="anthropic.claude-sonnet-4-5-20250929-v1:0",
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )

    for event in response["stream"]:
        if "contentBlockDelta" in event:
            delta = event["contentBlockDelta"]["delta"]
            if "text" in delta:
                print(delta["text"], end="", flush=True)
```

converse_stream returns a streaming event iterator. Each contentBlockDelta event contains a text fragment. In a web application, forward these fragments to the client via SSE or WebSocket for real-time streaming.
7. AWS Bedrock Trade-offs and Pitfalls
Bedrock’s managed abstractions introduce real constraints around model availability, retrieval quality, and observability that affect production designs.
Model Availability Lag
Bedrock’s model catalog is not always synchronized with the latest model releases from Anthropic and other providers. A new Claude model released today may not appear in Bedrock for weeks or months. For applications where the latest model capabilities matter more than AWS infrastructure compliance, this lag is a real constraint.
Mitigate by checking Bedrock’s model availability during your architecture evaluation, not after. Do not assume the model you want will be available on Bedrock when you need it.
Knowledge Bases: Chunking Quality
The default chunking strategy in Knowledge Bases (fixed-size chunks at 300 tokens) works poorly for structured documents like tables, code blocks, and multi-part lists. A table chunk that starts halfway through a row produces nonsensical retrievals.
For structured documents, configure custom chunking: hierarchical chunking (preserves document structure), semantic chunking (splits at natural boundaries), or fixed-size with larger overlap. For code documentation, semantic chunking at function/class boundaries produces dramatically better retrieval quality than fixed-size.
Cross-Region Availability
Not all Bedrock models are available in all AWS regions. If your application runs in eu-west-1 for data residency reasons, verify that your required models are available in that region before committing to the architecture. For models not available in your required region, the workaround is cross-region inference profiles — Bedrock can route inference to another region, but this may conflict with data residency requirements.
Bedrock Agents: Limited Observability
Bedrock Agents manages the ReAct loop internally, which makes debugging difficult. You cannot inspect individual Thought/Action/Observation cycles from the outside — you see the final response and the action traces, but not the full reasoning process. For complex agents where understanding intermediate reasoning is important, implementing the loop yourself (using the Converse API and LangGraph) gives significantly better observability.
Cost at Scale
On-demand Bedrock pricing is per-input-token and per-output-token, similar to direct API pricing. At high volume, the cost difference between on-demand and provisioned throughput can be significant. Provisioned Throughput requires a 1-month or 6-month commitment — model your expected usage carefully before committing, since the commitment is per model unit, not per model.
8. AWS Bedrock Interview Questions
Bedrock comes up in interviews at AWS-centric companies and in roles that explicitly require AWS cloud experience. For general GenAI interview preparation, see the GenAI interview questions guide.
“How would you build a compliant RAG system on AWS?” The expected answer: S3 for document storage, Knowledge Bases for Amazon Bedrock for managed chunking/embedding/retrieval, Bedrock Converse API for generation, Guardrails for output validation, all within a VPC using VPC endpoints. IAM roles for authentication, CloudTrail for audit logging.
“What is the difference between Bedrock Knowledge Bases and building RAG yourself?” Managed vs. self-hosted trade-off: Bedrock handles the operational overhead (provisioning, scaling, maintenance of the vector store and embedding pipeline) but limits your control over chunking strategies, embedding models, and retrieval algorithms. Self-built RAG gives full control at the cost of operational complexity.
“When would you use Bedrock Agents vs. building an agent with LangGraph?” Bedrock Agents: when you want fully managed orchestration with minimal code, your tools can be expressed as Lambda functions or API schemas, and production debuggability is not the primary concern. LangGraph: when you need custom control flow, conditional logic, or fine-grained observability into the reasoning loop.
9. AWS Bedrock in Production
Three rules govern reliable, cost-efficient Bedrock deployments: delay throughput commitments, tag early, and roll out model upgrades gradually.
Start with On-Demand, Move to Provisioned
Do not commit to Provisioned Throughput until you have production traffic data. On-demand pricing scales with usage — low cost at low volume. Provisioned Throughput has fixed costs regardless of usage. Run on the on-demand tier for at least 30 days of production traffic, measure actual invocation rates and token consumption, then evaluate whether Provisioned Throughput is cost-effective at your actual usage patterns.
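The evaluation itself is simple arithmetic once you have 30 days of measured token volume. A sketch of the break-even comparison; all prices are illustrative parameters, so take real per-token and per-model-unit rates from the Bedrock pricing page for your model and region.

```python
def on_demand_monthly(input_tokens: int, output_tokens: int,
                      in_price_per_1k: float, out_price_per_1k: float) -> float:
    """Projected monthly on-demand spend from measured token volumes."""
    return (input_tokens / 1000) * in_price_per_1k + (output_tokens / 1000) * out_price_per_1k

def provisioned_monthly(model_units: int, unit_hourly_rate: float,
                        hours_per_month: float = 730.0) -> float:
    """Fixed monthly cost of a Provisioned Throughput commitment."""
    return model_units * unit_hourly_rate * hours_per_month

def provisioned_is_cheaper(input_tokens: int, output_tokens: int,
                           in_price_per_1k: float, out_price_per_1k: float,
                           model_units: int, unit_hourly_rate: float) -> bool:
    """Compare the fixed commitment against projected on-demand spend."""
    return provisioned_monthly(model_units, unit_hourly_rate) < on_demand_monthly(
        input_tokens, output_tokens, in_price_per_1k, out_price_per_1k
    )
```

Run the comparison against both average and peak months: the commitment only pays for itself if it is cheaper in the typical month, not just the busiest one.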
Tag Everything for Cost Attribution
Bedrock usage accumulates across teams and applications. Enable Bedrock usage reporting and attribute costs per workload: create an application inference profile per team or application, apply cost allocation tags to the profile, and invoke models through the profile’s ARN instead of the base model ID. Without cost attribution, a single over-consuming agent can make the overall Bedrock bill uninterpretable.
Implement Gradual Rollout for Model Upgrades
When upgrading to a newer model version, route a percentage of traffic to the new model and validate quality metrics and latency before full migration. Bedrock has no native traffic splitting for foundation models, so implement the split at the application layer by assigning model IDs per request based on a percentage.
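A sketch of that application-layer split, hashing a stable request key so each user stays pinned to one model for the whole rollout. The stable model ID below matches the one used throughout this guide; the candidate ID is a hypothetical successor.

```python
import hashlib

STABLE_MODEL = "anthropic.claude-sonnet-4-5-20250929-v1:0"
CANDIDATE_MODEL = "anthropic.claude-next-v1:0"  # hypothetical newer version

def pick_model(request_key: str, candidate_pct: int) -> str:
    """Route candidate_pct percent of request keys to the candidate model.

    Hashing a stable key (user or session ID) instead of random sampling
    keeps assignment deterministic, which makes quality and latency
    comparisons between the two cohorts much cleaner.
    """
    bucket = int(hashlib.sha256(request_key.encode("utf-8")).hexdigest(), 16) % 100
    return CANDIDATE_MODEL if bucket < candidate_pct else STABLE_MODEL
```

Ramp candidate_pct from 5 to 50 to 100 as the new model clears your quality gates; the returned ID drops straight into the modelId argument of converse.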
10. Summary & Key Takeaways
AWS Bedrock is the right foundation model infrastructure for teams with existing AWS investments and compliance requirements. Its primary advantages are IAM-based authentication, CloudTrail audit logging, VPC isolation, and a managed RAG layer that eliminates operational overhead for the majority of document Q&A use cases.
Use Bedrock when:
- Your organization is AWS-first and has IAM and VPC requirements that direct model APIs cannot meet
- You need managed RAG without operating your own vector store and embedding pipeline
- Compliance requires all model invocations to be within the AWS boundary and in CloudTrail
- You need provisioned throughput guarantees for high-volume production workloads
Consider direct model APIs when:
- The latest model versions matter more than compliance boundary alignment
- Your infrastructure is multi-cloud or non-AWS
- You need control over the agent orchestration loop that Bedrock Agents does not provide
Key operational rules:
- Use the Converse API, not InvokeModel — it is model-agnostic and forward-compatible
- Use hybrid search in Knowledge Bases — it outperforms pure semantic search on most production query distributions
- Apply Guardrails to all customer-facing agent endpoints — do not rely on model-level safety alone
- Tag Bedrock API calls for cost attribution before you have multiple teams sharing an account
Related
- Cloud AI Platforms Compared — How Bedrock compares to Vertex AI and Azure OpenAI Service
- Google Vertex AI Deep-Dive — The comparable managed platform for GCP-first organizations
- Azure AI Foundry Guide — The comparable managed platform for Azure/Microsoft-first organizations
- Agentic Design Patterns — The patterns (ReAct, Plan-and-Execute) that Bedrock Agents implements
- AI Agents and Agentic Systems — The foundational concepts behind Bedrock’s agent orchestration capabilities
- GenAI Engineering Tools — The broader tool ecosystem including observability and deployment options
Frequently Asked Questions
What is AWS Bedrock?
AWS Bedrock is a managed service that runs foundation model inference entirely within the AWS boundary. It provides VPC endpoints, IAM-based authentication, CloudTrail logging, and provisioned throughput. The model catalog includes Claude (Anthropic), Titan (Amazon), Llama (Meta), Mistral, Cohere, and AI21 Labs — accessible through a unified API.
When should I use AWS Bedrock instead of calling LLM APIs directly?
Use Bedrock when your organization requires that source code and prompts stay within the AWS boundary, when you need IAM-based authentication instead of static API keys, when model invocations must appear in CloudTrail audit logs, or when you need provisioned throughput for guaranteed capacity. For regulated industries (financial services, healthcare, defense), Bedrock provides the compliance controls that direct API calls cannot offer.
How does AWS Bedrock compare to Google Vertex AI and Azure OpenAI?
AWS Bedrock offers the widest model catalog (Claude, Llama, Mistral, Titan, Cohere, AI21). Azure OpenAI provides exclusive access to GPT-4o and OpenAI models with deep Microsoft 365 integration. Google Vertex AI provides Gemini with native BigQuery integration and the largest context window. Choose based on your cloud ecosystem — Bedrock for AWS-first, Azure OpenAI for Microsoft-first, Vertex AI for GCP-first organizations.
What models are available in AWS Bedrock?
AWS Bedrock provides access to Claude (Anthropic), Titan (Amazon), Llama (Meta), Mistral, Cohere, and AI21 Labs models through a unified API. Models are not available by default — you must request access in the Bedrock console and accept provider terms of service. Not all models are available in all AWS regions, so verify regional availability before committing to your architecture.
What is the Converse API in AWS Bedrock?
The Converse API is Bedrock's recommended interface for chat-style model invocations. It provides a model-agnostic message format that works consistently across Claude, Titan, Llama, and Mistral models. Unlike the older InvokeModel API, the Converse API is forward-compatible and handles message formatting differences between providers, making it the preferred choice for new applications.
How does RAG work with Bedrock Knowledge Bases?
Bedrock Knowledge Bases provide managed RAG by handling document chunking, embedding, and vector storage. You point Bedrock at a data source (S3, Confluence, SharePoint, web crawler), and it indexes documents into a supported vector store like OpenSearch Serverless. At query time, the Retrieve API returns relevant chunks and the RetrieveAndGenerate API combines retrieval with model generation in a single call.
What are Bedrock Guardrails and how do they work?
Bedrock Guardrails provide content filtering, PII detection and redaction, topic denial, and grounding checks that apply to any model invocation. You configure policies for content types (violence, hate speech), sensitive information (email, phone, SSN), and denied topics. Guardrails validate both input and output, and can be applied to direct model calls as well as Knowledge Base and Agent invocations.
What is Provisioned Throughput in AWS Bedrock?
Provisioned Throughput lets you purchase a fixed number of model units for guaranteed inference capacity. It eliminates shared rate limits and rate limit errors at steady state, providing predictable latency for high-volume production workloads. The trade-off is cost — provisioned throughput requires committing to a 1-month or 6-month term and is significantly more expensive than on-demand pricing.
How does authentication work in AWS Bedrock?
Every Bedrock API call is authenticated via IAM — there are no separate API keys to manage. An EC2 instance with an IAM role that includes bedrock:InvokeModel can call Bedrock without additional credentials. Lambda functions assume roles with the same permission. Credential management follows the IAM lifecycle for rotation, revocation, and audit, using your existing AWS identity infrastructure.
What are the main limitations of AWS Bedrock?
Key limitations include model availability lag (new model releases may not appear in Bedrock for weeks or months), limited observability in Bedrock Agents (you cannot inspect individual reasoning cycles), default chunking quality issues in Knowledge Bases for structured documents like tables and code, and cross-region availability gaps where not all models are available in every AWS region.