AWS Bedrock Guide 2026 — Managed LLM API for Production GenAI
1. Introduction and Motivation
Why AWS Bedrock Exists
GenAI engineers who work at AWS-first organizations face a specific problem: building LLM-powered applications while satisfying the security, compliance, and operational requirements that AWS infrastructure was chosen to meet in the first place.
Calling the Anthropic API directly from an AWS Lambda function means your request payloads, including any sensitive data in your prompts, leave the AWS boundary and travel to a third-party API over the public internet. It means authentication is a static API key that sits in a Secrets Manager entry instead of being managed by IAM. It means model invocations do not appear in your CloudTrail audit logs. It means that when you need low-latency, guaranteed throughput for a high-volume application, there is no mechanism for provisioned capacity.
AWS Bedrock is the answer to this. It is a managed service that runs foundation model inference entirely within the AWS boundary: VPC endpoints, IAM-based authentication, CloudTrail logging, and the ability to provision throughput. It adds a managed RAG layer (Knowledge Bases for Amazon Bedrock), a managed agent orchestration layer (Amazon Bedrock Agents), and guardrails for content filtering and safety.
The model catalog includes Claude (Anthropic), Titan (Amazon), Llama (Meta), Mistral, Cohere, and AI21 Labs models — available through a unified API regardless of the underlying provider.
The AWS Organization Context
For teams in AWS-first organizations, the integration advantage of Bedrock is not optional infrastructure plumbing — it is a compliance requirement. SOC 2, HIPAA, PCI DSS, and FedRAMP workloads running on AWS cannot route model invocations through third-party public APIs without additional compliance documentation. Bedrock keeps inference within the certified AWS boundary, which is why enterprise legal and security teams approve Bedrock first and evaluate direct model APIs second.
2. Real-World Problem Context
The Direct API Approach at Scale
A team at a financial services company builds a document analysis agent using the Anthropic API directly. It works in development. In production review, it fails three compliance requirements: data containing customer financial information cannot leave the AWS boundary, model invocations must be in the audit trail, and the API key rotation schedule does not meet their security policy.
The remediation options are expensive: build a proxy service that runs inside the VPC and intercepts all Anthropic API calls, implement CloudTrail event publishing for every API call, and move to a credential rotation system. Or, switch to Bedrock, where all three are native capabilities.
This is not an edge case. It is the standard discovery path at financial services, healthcare, and government organizations building on AWS.
The Throughput Problem
A second scenario: a high-volume application sends 500 requests per second to Claude via the Anthropic API. The API’s default rate limits impose a per-minute token budget that the application regularly hits. Each rate limit error requires a retry with exponential backoff, adding latency and unpredictability.
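That retry logic is code the team now owns and operates. A minimal sketch of the generic backoff wrapper such a team typically ends up writing (purely illustrative; the exception types and limits are placeholders, not any SDK's actual error classes):

import random
import time

def call_with_backoff(fn, max_attempts=6, base_delay=0.5, retryable=(Exception,)):
    """Retry fn() with exponential backoff and jitter on retryable errors."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Sleep 0.5s, 1s, 2s, ... plus jitter so clients do not retry in lockstep
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.25))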
Bedrock’s Provisioned Throughput solves this: purchase a fixed number of model units for a given model, and Bedrock guarantees that throughput is available for your account. No shared rate limits, no rate limit errors at steady state, predictable latency. The trade-off is cost — provisioned throughput is significantly more expensive than on-demand pricing and requires committing to a 1-month or 6-month term.
3. Core Concepts & Mental Model
The Bedrock Service Map
Bedrock is not a single service — it is an umbrella for five distinct capabilities. Understanding which capability to use for a given task is the foundational knowledge for working with Bedrock.
Foundation Model Inference (InvokeModel / Converse API): Direct model invocation. Send a prompt, receive a completion. The Converse API provides a model-agnostic message format that works across all supported models, making it the recommended interface for new applications.
Knowledge Bases for Amazon Bedrock: Managed RAG. You point Bedrock at a data source (S3, Confluence, SharePoint, web crawler); it handles chunking, embedding, and vector storage (in OpenSearch Serverless or other supported vector stores) and exposes Retrieve and RetrieveAndGenerate APIs for query-time retrieval and generation.
Amazon Bedrock Agents: Managed agent orchestration. Define a set of action groups (tools), an instruction prompt, and a knowledge base. Bedrock manages the ReAct loop — tool selection, tool execution, observation injection, and iteration — without you implementing the loop.
Guardrails for Amazon Bedrock: Content filtering, PII detection and redaction, topic denial, and grounding checks. Apply to any model invocation in Bedrock, including those within Knowledge Bases and Agents.
Model Evaluation: Automated evaluation of model outputs on a dataset using both automated metrics (BERTScore, accuracy) and human review workflows. Used for choosing between models and monitoring quality over time.
IAM as the Authentication Layer
Every Bedrock API call is an AWS API call, which means it is authenticated via IAM. There are no separate API keys for Bedrock. An EC2 instance with an IAM role that includes bedrock:InvokeModel can call Bedrock without any additional credentials. A Lambda function assumes a role with the same permission. A SageMaker notebook uses its execution role.
This is a significant operational advantage over direct model APIs: credential management follows the IAM lifecycle — rotation, revocation, and audit are handled by your existing AWS identity infrastructure. There is no separate secret to manage.
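As an illustration, a least-privilege policy for an application that only needs to invoke a single Claude model might look like the sketch below. The policy name, model ID, and region are placeholders for your own values; attach the resulting policy to the role your Lambda or ECS task assumes.

import json
import boto3

iam = boto3.client("iam")

# Allow synchronous and streaming invocation of one specific foundation model only
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["bedrock:InvokeModel", "bedrock:InvokeModelWithResponseStream"],
            "Resource": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-sonnet-4-5-20250929-v1:0",
        }
    ],
}

iam.create_policy(
    PolicyName="bedrock-invoke-claude-only",  # placeholder name
    PolicyDocument=json.dumps(policy_document),
)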
4. Step-by-Step Explanation
Step 1: Enable Model Access
Bedrock models are not available by default. In the Bedrock console, navigate to Model Access and request access to the models you need. For Claude models, this typically requires accepting Anthropic’s terms of service. Access is provisioned within minutes for most models.
# Verify access via CLI
aws bedrock list-foundation-models --by-provider anthropic --region us-east-1

Step 2: Invoke a Model
The Converse API is the recommended interface for chat-style models. It handles message formatting consistently across Claude, Titan, Llama, and Mistral:
import boto3
import json

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock.converse(
    modelId="anthropic.claude-sonnet-4-5-20250929-v1:0",
    messages=[
        {"role": "user", "content": [{"text": "Explain the ReAct agent pattern in two paragraphs."}]}
    ],
    system=[{"text": "You are a senior GenAI engineer. Explain concepts clearly and concisely."}],
    inferenceConfig={
        "maxTokens": 1024,
        "temperature": 0.3
    }
)

output_text = response["output"]["message"]["content"][0]["text"]
print(output_text)

The boto3 client automatically uses your environment’s IAM credentials — no API key management required.
Step 3: Set Up a Knowledge Base
For RAG, create a Knowledge Base in the Bedrock console or via the API (a code sketch of the API route follows the steps below):
- Create an S3 bucket and upload your documents (PDFs, Word files, text files)
- Create a Knowledge Base pointing at the S3 prefix
- Configure the embedding model (Amazon Titan Embeddings v2 is the default)
- Select a vector store (OpenSearch Serverless, Aurora Serverless with pgvector, or Pinecone)
- Sync the data source to trigger chunking, embedding, and indexing
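For teams that drive this through code rather than the console, the same steps map onto the bedrock-agent control-plane client. The sketch below assumes an existing OpenSearch Serverless collection and a service role for the Knowledge Base; every name, ARN, and field mapping is a placeholder to replace with your own.

import boto3

bedrock_agent = boto3.client("bedrock-agent", region_name="us-east-1")

# 1. Create the Knowledge Base (vector type, Titan Embeddings v2, OpenSearch Serverless)
kb = bedrock_agent.create_knowledge_base(
    name="product-docs-kb",  # placeholder
    roleArn="arn:aws:iam::123456789012:role/BedrockKbServiceRole",  # placeholder
    knowledgeBaseConfiguration={
        "type": "VECTOR",
        "vectorKnowledgeBaseConfiguration": {
            "embeddingModelArn": "arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-embed-text-v2:0"
        },
    },
    storageConfiguration={
        "type": "OPENSEARCH_SERVERLESS",
        "opensearchServerlessConfiguration": {
            "collectionArn": "arn:aws:aoss:us-east-1:123456789012:collection/abc123",  # placeholder
            "vectorIndexName": "bedrock-kb-index",
            "fieldMapping": {"vectorField": "embedding", "textField": "text", "metadataField": "metadata"},
        },
    },
)
kb_id = kb["knowledgeBase"]["knowledgeBaseId"]

# 2. Point a data source at the S3 bucket holding your documents
ds = bedrock_agent.create_data_source(
    knowledgeBaseId=kb_id,
    name="product-docs-s3",
    dataSourceConfiguration={
        "type": "S3",
        "s3Configuration": {"bucketArn": "arn:aws:s3:::my-product-docs"},  # placeholder
    },
)

# 3. Sync: trigger chunking, embedding, and indexing
bedrock_agent.start_ingestion_job(
    knowledgeBaseId=kb_id,
    dataSourceId=ds["dataSource"]["dataSourceId"],
)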
After setup, query via the API:
bedrock_agent = boto3.client("bedrock-agent-runtime", region_name="us-east-1")
# Retrieve relevant chunks without generation
retrieve_response = bedrock_agent.retrieve(
    knowledgeBaseId="KB_ID_HERE",
    retrievalQuery={"text": "What is the refund policy for enterprise customers?"},
    retrievalConfiguration={
        "vectorSearchConfiguration": {"numberOfResults": 5}
    }
)

# Or retrieve and generate in one call
rag_response = bedrock_agent.retrieve_and_generate(
    input={"text": "What is the refund policy for enterprise customers?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KB_ID_HERE",
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-sonnet-4-5-20250929-v1:0"
        }
    }
)
print(rag_response["output"]["text"])

Step 4: Configure Guardrails
Create a Guardrail in the Bedrock console or via the API with the protections you need:
bedrock_mgmt = boto3.client("bedrock", region_name="us-east-1")
guardrail = bedrock_mgmt.create_guardrail(
    name="production-guardrail",
    description="Content safety and PII protection for customer-facing agent",
    topicPolicyConfig={
        "topicsConfig": [
            {
                "name": "legal-advice",
                "definition": "Requests for legal advice or interpretation of laws",
                "examples": ["Is this contract legally binding?", "Can I sue for this?"],
                "type": "DENY"
            }
        ]
    },
    sensitiveInformationPolicyConfig={
        "piiEntitiesConfig": [
            {"type": "EMAIL", "action": "ANONYMIZE"},
            {"type": "PHONE", "action": "ANONYMIZE"},
            {"type": "US_SOCIAL_SECURITY_NUMBER", "action": "BLOCK"}
        ]
    },
    contentPolicyConfig={
        "filtersConfig": [
            {"type": "VIOLENCE", "inputStrength": "HIGH", "outputStrength": "HIGH"},
            {"type": "HATE", "inputStrength": "HIGH", "outputStrength": "HIGH"}
        ]
    },
    # Messages returned to the caller when the guardrail blocks input or output (required by CreateGuardrail)
    blockedInputMessaging="Sorry, I can't help with that request.",
    blockedOutputsMessaging="Sorry, I can't provide that response."
)
guardrail_id = guardrail["guardrailId"]
# Apply to model invocations
response = bedrock.converse(
    modelId="anthropic.claude-sonnet-4-5-20250929-v1:0",
    messages=[{"role": "user", "content": [{"text": user_input}]}],
    guardrailConfig={"guardrailIdentifier": guardrail_id, "guardrailVersion": "DRAFT"}
)
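When a guardrail intervenes, the Converse API reports it in the response rather than raising an exception. A minimal check, assuming the stopReason value shown below (verify against the current API reference):

# stopReason is reported as "guardrail_intervened" when the guardrail blocks or redacts content
if response.get("stopReason") == "guardrail_intervened":
    print("Guardrail intervened; the output text below is the configured blocked message.")

print(response["output"]["message"]["content"][0]["text"])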
Step 5: Provision Throughput for Production

For high-volume production workloads, purchase Provisioned Throughput via the console or API:
bedrock_mgmt.create_provisioned_model_throughput(
    modelUnits=5,  # Number of model units (check pricing page for throughput per unit)
    provisionedModelName="claude-prod-throughput",
    modelId="anthropic.claude-sonnet-4-5-20250929-v1:0",
    commitmentDuration="SixMonths"  # or "OneMonth"
)

Use the provisioned model ARN in subsequent converse calls instead of the base model ID.
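Concretely, capture the create call's response and pass its ARN as the modelId. The response field name below reflects my reading of the API; confirm it in the boto3 documentation:

bedrock_mgmt = boto3.client("bedrock", region_name="us-east-1")
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

pt = bedrock_mgmt.create_provisioned_model_throughput(
    modelUnits=5,
    provisionedModelName="claude-prod-throughput",
    modelId="anthropic.claude-sonnet-4-5-20250929-v1:0",
    commitmentDuration="SixMonths",
)

# Subsequent invocations target the provisioned capacity via its ARN
response = bedrock.converse(
    modelId=pt["provisionedModelArn"],
    messages=[{"role": "user", "content": [{"text": "Summarize this filing."}]}],
)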
5. Architecture & System View
Production GenAI Stack on AWS
A production GenAI system on AWS Bedrock typically has five layers: the application tier, the Bedrock orchestration tier, the retrieval tier, the safety tier, and the data tier.
📊 Visual Explanation
Diagram: Production GenAI Architecture — AWS Bedrock. A complete production stack using native AWS services.
The request flows downward: the application receives it, Guardrails validate input, Bedrock Agents orchestrate the response using model invocation and Knowledge Base retrieval, and results flow back up through Guardrails for output validation before being returned to the application.
IAM and VPC Architecture
In a compliant production deployment:
- All Bedrock API calls originate from within a VPC using a VPC endpoint (com.amazonaws.us-east-1.bedrock-runtime); a creation sketch follows this list
- No traffic goes to the public internet for model invocations
- Lambda and ECS tasks assume IAM roles with minimum-required Bedrock permissions
- CloudTrail captures every InvokeModel and Converse call with the IAM principal, model ID, input token count, and timestamp
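Creating the interface endpoint is a one-time piece of infrastructure. A sketch with placeholder VPC, subnet, and security group IDs (add a second endpoint for bedrock-agent-runtime if you use Knowledge Bases or Agents):

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Interface endpoint so bedrock-runtime traffic stays on the AWS network
ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",              # placeholder
    ServiceName="com.amazonaws.us-east-1.bedrock-runtime",
    SubnetIds=["subnet-0123456789abcdef0"],      # placeholder
    SecurityGroupIds=["sg-0123456789abcdef0"],   # placeholder
    PrivateDnsEnabled=True,  # resolve the standard endpoint name to the private interface
)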
6. Practical Examples
Example: Document Q&A Agent with Knowledge Bases
The most common Bedrock production pattern: a question-answering system over a proprietary document corpus.
import boto3
bedrock_agent_runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")
def answer_question(question: str, knowledge_base_id: str, model_arn: str) -> dict:
    """Query a Knowledge Base and generate a grounded answer."""
    response = bedrock_agent_runtime.retrieve_and_generate(
        input={"text": question},
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": knowledge_base_id,
                "modelArn": model_arn,
                "retrievalConfiguration": {
                    "vectorSearchConfiguration": {
                        "numberOfResults": 5,
                        "overrideSearchType": "HYBRID"  # Combines semantic + keyword search
                    }
                },
                "generationConfiguration": {
                    "promptTemplate": {
                        "textPromptTemplate": (
                            "Answer the question using only the provided context. "
                            "If the context does not contain enough information, say so explicitly. "
                            "Context: $search_results$\nQuestion: $query$"
                        )
                    }
                }
            }
        }
    )

    return {
        "answer": response["output"]["text"],
        "citations": [
            citation["retrievedReferences"]
            for citation in response.get("citations", [])
        ]
    }

Key details: HYBRID search combines semantic similarity (vector) with keyword matching (BM25), which outperforms pure semantic search on queries with specific entity names or product codes. The custom prompt template instructs the model to answer only from the retrieved context, not from its training data.
Example: Streaming Responses
For user-facing applications, streaming is essential for perceived responsiveness:
def stream_response(prompt: str) -> None:
    """Stream a model response token by token."""
    bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

    response = bedrock.converse_stream(
        modelId="anthropic.claude-sonnet-4-5-20250929-v1:0",
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )

    for event in response["stream"]:
        if "contentBlockDelta" in event:
            delta = event["contentBlockDelta"]["delta"]
            if "text" in delta:
                print(delta["text"], end="", flush=True)

converse_stream returns a streaming event iterator. Each contentBlockDelta event contains a text fragment. In a web application, forward these fragments to the client via SSE or WebSocket for real-time streaming.
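As an illustration of that hand-off, a framework-agnostic generator that re-frames each delta as an SSE data event; how you attach it to an HTTP response depends on your web framework:

import json

import boto3

def sse_events(prompt: str):
    """Yield converse_stream deltas as Server-Sent Events lines."""
    bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
    response = bedrock.converse_stream(
        modelId="anthropic.claude-sonnet-4-5-20250929-v1:0",
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    for event in response["stream"]:
        if "contentBlockDelta" in event:
            delta = event["contentBlockDelta"]["delta"]
            if "text" in delta:
                # Each SSE message is a "data: ..." line followed by a blank line
                yield f"data: {json.dumps({'text': delta['text']})}\n\n"
    yield "data: [DONE]\n\n"  # conventional end-of-stream marker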
7. Trade-offs, Limitations & Failure Modes
Model Availability Lag
Bedrock’s model catalog is not always synchronized with the latest model releases from Anthropic and other providers. A new Claude model released today may not appear in Bedrock for weeks or months. For applications where the latest model capabilities matter more than AWS infrastructure compliance, this lag is a real constraint.
Mitigate by checking Bedrock’s model availability during your architecture evaluation, not after. Do not assume the model you want will be available on Bedrock when you need it.
Knowledge Bases: Chunking Quality
The default chunking strategy in Knowledge Bases (fixed-size chunks at 300 tokens) works poorly for structured documents like tables, code blocks, and multi-part lists. A table chunk that starts halfway through a row produces nonsensical retrievals.
For structured documents, configure custom chunking: hierarchical chunking (preserves document structure), semantic chunking (splits at natural boundaries), or fixed-size with larger overlap. For code documentation, semantic chunking at function/class boundaries produces dramatically better retrieval quality than fixed-size.
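Chunking is configured per data source at ingestion time. A sketch of overriding the default with larger fixed-size chunks and more overlap; the configuration keys reflect my reading of the bedrock-agent API (hierarchical and semantic strategies use their own sub-configurations), so verify them before relying on this:

import boto3

bedrock_agent = boto3.client("bedrock-agent", region_name="us-east-1")

bedrock_agent.create_data_source(
    knowledgeBaseId="KB_ID_HERE",
    name="structured-docs",
    dataSourceConfiguration={
        "type": "S3",
        "s3Configuration": {"bucketArn": "arn:aws:s3:::my-structured-docs"},  # placeholder
    },
    vectorIngestionConfiguration={
        "chunkingConfiguration": {
            # Other strategy values: "HIERARCHICAL", "SEMANTIC", "NONE"
            "chunkingStrategy": "FIXED_SIZE",
            "fixedSizeChunkingConfiguration": {"maxTokens": 512, "overlapPercentage": 20},
        }
    },
)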
Cross-Region Availability
Not all Bedrock models are available in all AWS regions. If your application runs in eu-west-1 for data residency reasons, verify that your required models are available in that region before committing to the architecture. For models not available in your required region, the workaround is cross-region inference profiles — Bedrock can route inference to another region, but this may conflict with data residency requirements.
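Cross-region inference profiles are invoked by passing the profile ID where the model ID would normally go. The geo-prefixed ID below follows the documented convention, but treat the exact string as an assumption to confirm for your region and model:

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="eu-west-1")

# The "eu." prefix selects the EU cross-region inference profile for this model
response = bedrock.converse(
    modelId="eu.anthropic.claude-sonnet-4-5-20250929-v1:0",
    messages=[{"role": "user", "content": [{"text": "Summarize the attached policy."}]}],
)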
Bedrock Agents: Limited Observability
Bedrock Agents manages the ReAct loop internally, which makes debugging difficult. You cannot inspect individual Thought/Action/Observation cycles from the outside — you see the final response and the action traces, but not the full reasoning process. For complex agents where understanding intermediate reasoning is important, implementing the loop yourself (using the Converse API and LangGraph) gives significantly better observability.
Cost at Scale
On-demand Bedrock pricing is per-input-token and per-output-token, similar to direct API pricing. At high volume, the cost difference between on-demand and provisioned throughput can be significant. Provisioned Throughput requires a 1-month or 6-month commitment — model your expected usage carefully before committing, since the commitment is per model unit, not per model.
8. Interview Perspective
Bedrock comes up in interviews at AWS-centric companies and in roles that explicitly require AWS cloud experience. For general GenAI interview preparation, see the GenAI interview questions guide.
“How would you build a compliant RAG system on AWS?” The expected answer: S3 for document storage, Knowledge Bases for Amazon Bedrock for managed chunking/embedding/retrieval, Bedrock Converse API for generation, Guardrails for output validation, all within a VPC using VPC endpoints. IAM roles for authentication, CloudTrail for audit logging.
“What is the difference between Bedrock Knowledge Bases and building RAG yourself?” Managed vs. self-hosted trade-off: Bedrock handles the operational overhead (provisioning, scaling, maintenance of the vector store and embedding pipeline) but limits your control over chunking strategies, embedding models, and retrieval algorithms. Self-built RAG gives full control at the cost of operational complexity.
“When would you use Bedrock Agents vs. building an agent with LangGraph?” Bedrock Agents: when you want fully managed orchestration with minimal code, your tools can be expressed as Lambda functions or API schemas, and production debuggability is not the primary concern. LangGraph: when you need custom control flow, conditional logic, or fine-grained observability into the reasoning loop.
9. Production Perspective
Start with On-Demand, Move to Provisioned
Do not commit to Provisioned Throughput until you have production traffic data. On-demand pricing scales with usage — low cost at low volume. Provisioned Throughput has fixed costs regardless of usage. Run on on-demand for at least 30 days of production traffic, measure actual invocation rates and token consumption, then evaluate whether Provisioned Throughput is cost-effective at your actual usage patterns.
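Bedrock publishes invocation and token metrics to CloudWatch, which is the simplest place to get that usage data. A sketch that pulls 30 days of daily invocation counts for one model; the namespace, metric, and dimension names reflect my understanding and should be confirmed in the CloudWatch console:

from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

now = datetime.now(timezone.utc)
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/Bedrock",
    MetricName="Invocations",
    Dimensions=[{"Name": "ModelId", "Value": "anthropic.claude-sonnet-4-5-20250929-v1:0"}],
    StartTime=now - timedelta(days=30),
    EndTime=now,
    Period=86400,          # one datapoint per day
    Statistics=["Sum"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"].date(), int(point["Sum"]))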
Tag Everything for Cost Attribution
Bedrock usage accumulates across teams and applications. Enable model invocation logging and put cost attribution in place early. Taggable Bedrock resources (provisioned throughput, agents, knowledge bases) flow into cost allocation reports directly; for on-demand invocations of base models, the usual route is to create an application inference profile per team or workload, tag it, and invoke through its ARN. Without cost attribution, a single over-consuming agent can make the overall Bedrock bill uninterpretable.
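A sketch of the inference-profile route; the create_inference_profile parameters and response field reflect my understanding of a relatively new API, so check them against the current boto3 documentation before using this:

import boto3

bedrock_mgmt = boto3.client("bedrock", region_name="us-east-1")
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Application inference profile: a taggable wrapper around a base model for one workload
profile = bedrock_mgmt.create_inference_profile(
    inferenceProfileName="support-agent-claude",  # placeholder
    modelSource={
        "copyFrom": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-sonnet-4-5-20250929-v1:0"
    },
    tags=[{"key": "team", "value": "support"}, {"key": "app", "value": "doc-qa-agent"}],
)

# Invoke via the profile ARN so usage rolls up under the team/app tags
response = bedrock.converse(
    modelId=profile["inferenceProfileArn"],
    messages=[{"role": "user", "content": [{"text": "What is the refund policy?"}]}],
)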
Implement Gradual Rollout via Model Aliases
When upgrading to a newer model version, put a model alias in front of your invocations, an application-level name that resolves to a concrete model ID, and route a percentage of traffic to the new model. Validate quality metrics and latency before full migration. Bedrock does not have native traffic splitting, so implement the split at the application layer by assigning model IDs per request based on a percentage (a minimal sketch follows).
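A minimal application-level split; the alias name, candidate model ID, and percentage are illustrative:

import random

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

MODEL_ALIASES = {
    # alias -> (current model ID, candidate model ID, fraction of traffic to the candidate)
    "doc-qa": (
        "anthropic.claude-sonnet-4-5-20250929-v1:0",
        "anthropic.claude-NEXT-MODEL-ID",  # placeholder for the new model ID
        0.10,
    )
}

def resolve_model_id(alias: str) -> str:
    """Send a fixed fraction of requests to the candidate model, the rest to the current one."""
    current, candidate, fraction = MODEL_ALIASES[alias]
    return candidate if random.random() < fraction else current

response = bedrock.converse(
    modelId=resolve_model_id("doc-qa"),
    messages=[{"role": "user", "content": [{"text": "What is the refund policy?"}]}],
)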
10. Summary & Key Takeaways
AWS Bedrock is the right foundation model infrastructure for teams with existing AWS investments and compliance requirements. Its primary advantages are IAM-based authentication, CloudTrail audit logging, VPC isolation, and a managed RAG layer that eliminates operational overhead for the majority of document Q&A use cases.
Use Bedrock when:
- Your organization is AWS-first and has IAM and VPC requirements that direct model APIs cannot meet
- You need managed RAG without operating your own vector store and embedding pipeline
- Compliance requires all model invocations to be within the AWS boundary and in CloudTrail
- You need provisioned throughput guarantees for high-volume production workloads
Consider direct model APIs when:
- The latest model versions matter more than compliance boundary alignment
- Your infrastructure is multi-cloud or non-AWS
- You need control over the agent orchestration loop that Bedrock Agents does not provide
Key operational rules:
- Use the Converse API, not InvokeModel — it is model-agnostic and forward-compatible
- Use hybrid search in Knowledge Bases — it outperforms pure semantic search on most production query distributions
- Apply Guardrails to all customer-facing agent endpoints — do not rely on model-level safety alone
- Tag Bedrock API calls for cost attribution before you have multiple teams sharing an account
Related
- Cloud AI Platforms Compared — How Bedrock compares to Vertex AI and Azure OpenAI Service
- Google Vertex AI Deep-Dive — The comparable managed platform for GCP-first organizations
- Azure OpenAI Service Deep-Dive — The comparable managed platform for Azure/Microsoft-first organizations
- Agentic Design Patterns — The patterns (ReAct, Plan-and-Execute) that Bedrock Agents implements
- AI Agents and Agentic Systems — The foundational concepts behind Bedrock’s agent orchestration capabilities
- GenAI Engineering Tools — The broader tool ecosystem including observability and deployment options