Python Type Hints for AI Engineers — Typed LLM Pipelines (2026)

Python type hints are a primary defense against the most common class of production bugs in LLM pipelines: data shape mismatches. An LLM pipeline chains multiple stages — embedding, retrieval, prompt building, generation, validation. At every boundary, the data must match the expected shape. Without type hints, these mismatches surface as runtime KeyError, TypeError, or AttributeError exceptions in production.

Type hints move these errors from runtime to development time. When you annotate a function that accepts list[DocumentChunk] and returns list[float], your IDE and mypy verify every call site. If a refactor changes the return type of the retrieval stage, mypy flags every downstream function that now receives the wrong type — before you run the code.

Three specific benefits for AI engineers:

  • Catch bugs at development time. A misspelled key in a TypedDict, a swapped argument order, or a missing field in a structured output — mypy catches all of these statically.
  • IDE support for complex data. LLM API responses are deeply nested dictionaries. Type hints give you autocompletion on response["choices"][0]["message"]["content"] instead of guessing key names.
  • Self-documenting pipelines. A signature like def retrieve(query: EmbeddingVector, top_k: int) -> list[ScoredDocument] communicates the interface to every engineer on the team.

Type hints belong in all production AI code. The question is where they deliver the highest value first.

Always type these first: LLM response structures (eliminates dictionary key guessing), pipeline interfaces (makes data flow visible), tool definitions (validates what the LLM sends and receives), and configuration objects (constrains model names, temperature ranges, token limits).
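A configuration object can encode these constraints directly. A minimal sketch, where the GenerationConfig name, the model list, and the valid temperature range are illustrative choices, not from any specific API:

```python
from dataclasses import dataclass
from typing import Literal

# Illustrative model list; extend with the models your project actually uses
ModelName = Literal["gpt-4o", "gpt-4o-mini"]

@dataclass(frozen=True)
class GenerationConfig:
    model: ModelName        # mypy rejects misspelled or unknown model names
    temperature: float = 0.7
    max_tokens: int = 1024

    def __post_init__(self) -> None:
        # Runtime guard for a constraint mypy cannot express statically
        if not 0.0 <= self.temperature <= 2.0:
            raise ValueError(f"temperature out of range: {self.temperature}")
```

With this in place, `GenerationConfig(model="gpt4o")` is a static error and `GenerationConfig(model="gpt-4o", temperature=3.0)` fails fast at construction time.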

Where type hints have the highest impact — at integration boundaries:

# Without types — what does this return?
def process_query(query, context, model):
    ...

# With types — the contract is explicit
def process_query(
    query: str,
    context: list[DocumentChunk],
    model: Literal["gpt-4o", "claude-sonnet-4-20250514"],
) -> GenerationResult:
    ...

The typed version communicates the full contract. The caller knows exactly what to pass and what to expect.

When types are less critical: Throwaway scripts, notebook exploration, and one-off data analysis. But any code that enters a shared repository or runs in production should be typed.


The type system in an AI pipeline follows a clear progression. Raw, untyped data enters from external sources. Each stage refines the type until the final result is fully typed and validated.

Type Flow in an AI Pipeline

  1. Raw LLM Output (dict[str, Any] from the API): raw response JSON, an untyped dictionary of unknown structure.
  2. TypedDict / Pydantic (shape defined, keys known): parse into a typed model, validate required fields, coerce value types.
  3. Validated Type (business rules enforced): content length checks, enum value validation, cross-field constraints.
  4. Pipeline Stage (Generic[T] processing): transform typed input, chain with the next stage, preserve types via generics.
  5. Typed Result (fully verified output): return the typed response, serialize for the client, log with type metadata.

Each layer serves a distinct purpose. The raw output layer acknowledges that external data is uncontrolled. The TypedDict/Pydantic layer imposes structure. The validated type layer applies domain rules. The pipeline stage layer preserves types through generic processing. The typed result layer guarantees consumers receive exactly what they expect. Skipping layers creates gaps where shape mismatches can reach production.


Five core typing patterns that AI engineers use most frequently.

from typing import TypedDict, NotRequired

class ChatMessage(TypedDict):
    role: str
    content: str

class UsageInfo(TypedDict):
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int

class LLMResponse(TypedDict):
    id: str
    model: str
    choices: list[dict[str, ChatMessage]]
    usage: UsageInfo
    system_fingerprint: NotRequired[str]

With this definition, response["usage"]["prompt_tokens"] is known to be int. Your IDE autocompletes the keys. mypy catches response["usages"] as a typo.

from typing import TypeVar, Generic, Callable, Awaitable

T = TypeVar("T")
R = TypeVar("R")

class PipelineStage(Generic[T, R]):
    def __init__(self, name: str, processor: Callable[[T], Awaitable[R]]) -> None:
        self.name = name
        self.processor = processor

    async def execute(self, input_data: T) -> R:
        return await self.processor(input_data)

embed_stage: PipelineStage[str, list[float]] = PipelineStage(
    name="embed", processor=embed_text,
)
# mypy knows: embed_stage.execute("query") returns list[float]

from typing import Protocol, runtime_checkable

@runtime_checkable
class Tool(Protocol):
    @property
    def name(self) -> str: ...
    @property
    def description(self) -> str: ...
    async def execute(self, input_text: str) -> str: ...

# Any class with these methods satisfies Tool — no inheritance needed
class WebSearchTool:
    @property
    def name(self) -> str:
        return "web_search"

    @property
    def description(self) -> str:
        return "Search the web for current information"

    async def execute(self, input_text: str) -> str:
        return await search(input_text)

def register_tool(registry: dict[str, Tool], tool: Tool) -> None:
    registry[tool.name] = tool

from typing import Literal

ModelName = Literal["gpt-4o", "gpt-4o-mini", "claude-sonnet-4-20250514"]
Role = Literal["system", "user", "assistant"]

def create_completion(model: ModelName, messages: list[dict[str, str]]) -> LLMResponse:
    ...

create_completion(model="gpt4o", messages=[])  # mypy error: "gpt4o" not in Literal

from typing import TypedDict, Union, Literal

class TextInput(TypedDict):
    modality: Literal["text"]
    content: str

class ImageInput(TypedDict):
    modality: Literal["image"]
    url: str
    detail: Literal["low", "high", "auto"]

MultiModalInput = Union[TextInput, ImageInput]

def process_input(input_data: MultiModalInput) -> str:
    if input_data["modality"] == "text":
        return input_data["content"]
    return f"[Image: {input_data['url']}]"

A production AI application has multiple layers of type safety. Each layer catches a different category of error.

Type Safety Stack for AI Applications

  • Application Types: domain models such as User, Query, Report.
  • API Response Types: TypedDict for LLM and service responses.
  • LLM Output Types: structured output schemas and parsed results.
  • Pipeline Generic Types: Generic[T, R] for reusable stage components.
  • Tool Interface Protocols: Protocol classes for pluggable tool systems.
  • Runtime Validation (Pydantic): validates external data at system boundaries.

The top layers define what data looks like — checked statically by mypy with zero runtime cost. The middle layers ensure data flows correctly between stages, preserving type information through transformations. The bottom layers operate at system boundaries where external data enters and types cannot be guaranteed statically. Together, static checking catches structural errors during development while runtime validation catches data errors in production.


Three complete examples for real AI engineering problems.

from typing import TypedDict

class DocumentChunk(TypedDict):
    id: str
    text: str
    metadata: dict[str, str]
    score: float

class RetrievalResult(TypedDict):
    query: str
    chunks: list[DocumentChunk]
    total_found: int

class GenerationResult(TypedDict):
    answer: str
    sources: list[str]
    tokens_used: int

async def embed_query(query: str) -> list[float]: ...
async def retrieve(embedding: list[float], top_k: int = 5) -> RetrievalResult: ...
async def generate(query: str, context: RetrievalResult) -> GenerationResult: ...

async def rag_pipeline(query: str) -> GenerationResult:
    embedding = await embed_query(query)
    retrieval = await retrieve(embedding)
    return await generate(query, retrieval)

import asyncio
from typing import TypeVar, Callable, Awaitable

T = TypeVar("T")

async def with_retry(
    fn: Callable[..., Awaitable[T]],
    *args: object,
    max_retries: int = 3,
    base_delay: float = 1.0,
) -> T:
    last_exception: Exception | None = None
    for attempt in range(max_retries):
        try:
            return await fn(*args)
        except Exception as e:
            last_exception = e
            if attempt < max_retries - 1:
                await asyncio.sleep(base_delay * (2 ** attempt))
    raise last_exception  # type: ignore[misc]

# mypy preserves the return type through the wrapper
result: GenerationResult = await with_retry(generate, "What is RAG?", context)

from typing import Protocol, runtime_checkable

@runtime_checkable
class AgentTool(Protocol):
    @property
    def name(self) -> str: ...
    @property
    def parameters_schema(self) -> dict[str, object]: ...
    async def execute(self, **kwargs: object) -> str: ...

class ToolRegistry:
    def __init__(self) -> None:
        self._tools: dict[str, AgentTool] = {}

    def register(self, tool: AgentTool) -> None:
        self._tools[tool.name] = tool

    def list_schemas(self) -> list[dict[str, object]]:
        return [
            {"name": t.name, "parameters": t.parameters_schema}
            for t in self._tools.values()
        ]

Python type hints support two complementary checking strategies. Understanding when to use each is critical for production AI code.

Static vs Runtime Type Checking

  • mypy (static): catches errors before code runs.
  • Pydantic (runtime): validates data when it arrives.

Verdict: use both — mypy for internal code correctness, Pydantic at external data boundaries. Production AI pipelines need static checking for developer errors and runtime validation for external data errors.

Use mypy for: internal function signatures, pipeline stage connections, configuration objects, module-level type correctness.

Use Pydantic for: LLM API response parsing, user input validation, database query results, any data crossing a system boundary.

The optimal pattern — define TypedDict for internal data shapes (zero overhead), use Pydantic models at boundaries where external data enters:

from pydantic import BaseModel, Field

class StructuredLLMOutput(BaseModel):
    answer: str = Field(min_length=1)
    confidence: float = Field(ge=0.0, le=1.0)
    sources: list[str] = Field(min_length=1)

# Validate at boundary, then flow as TypedDict internally
validated = StructuredLLMOutput.model_validate_json(raw_json)

Q1: How would you type a function that accepts either a single prompt or a batch?

Use @overload to define separate signatures. generate("hello") returns GenerationResult. generate(["a", "b"]) returns list[GenerationResult]. mypy gives callers precise return types based on the input type.

Q2: What is the difference between Protocol and ABC for tool interfaces?

ABC uses nominal typing — a class must explicitly inherit from it. Protocol uses structural typing — any class with the required methods satisfies the Protocol without inheritance. For AI tool systems, Protocol is preferred because third-party tools do not need to import your interface.

Q3: Why should you avoid Any in AI pipeline code?

Any disables type checking at that point. If a pipeline stage returns Any, mypy cannot verify downstream stages receive the correct type. Prefer object over Any — it still requires explicit type narrowing before use.
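A short illustration of the narrowing that object forces (extract_answer is a hypothetical helper, not part of any library):

```python
def extract_answer(payload: object) -> str:
    # With object, mypy refuses payload["answer"] until the type is narrowed;
    # with Any, the same unchecked access would pass silently
    if isinstance(payload, dict):
        answer = payload.get("answer")
        if isinstance(answer, str):
            return answer
    raise TypeError(f"unexpected payload shape: {type(payload).__name__}")
```

The explicit isinstance checks double as runtime validation, so malformed payloads fail loudly instead of propagating.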

Q4: How do you handle LLM API responses with optional fields?

Use NotRequired (Python 3.11+) for TypedDict fields that may be absent. This is different from Optional[str], which means the key exists but the value may be None. NotRequired means the key itself may not exist in the dictionary.


[tool.mypy]
python_version = "3.12"
strict = true
warn_return_any = true
disallow_untyped_defs = true

[[tool.mypy.overrides]]
module = ["langchain.*", "chromadb.*"]
ignore_missing_imports = true

Add mypy to your CI pipeline so type errors block merges. Type checking runs in seconds, even for large codebases.

  1. Start at the boundaries. Type all public function signatures in pipeline modules.
  2. Type new code strictly. Every new module gets full strictness. No exceptions.
  3. Work inward. After boundaries are typed, add types to internal helpers.
  4. Eliminate Any. Track Any count as a code quality metric and reduce it each sprint.

  • Using dict instead of TypedDict. Returning dict[str, Any] gives callers no information about keys or value types.
  • Overly broad Union types. Union[str, int, float, list, dict, None] is effectively untyped. Narrow to actual types.
  • Ignoring generic variance. list[Animal] is not a supertype of list[Dog]. Use Sequence[Animal] for read-only covariant access.
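The variance point in the last bullet fits in a few lines (count_animals is a toy example):

```python
from collections.abc import Sequence

class Animal: ...
class Dog(Animal): ...

def count_animals(animals: Sequence[Animal]) -> int:
    # Sequence is covariant and read-only, so a list[Dog] is accepted here;
    # declaring the parameter as list[Animal] would make mypy reject the call
    return len(animals)

dogs: list[Dog] = [Dog(), Dog()]
total = count_animals(dogs)
```

list is invariant because a function holding a list[Animal] could append a Cat into the caller's list of Dogs; the read-only Sequence interface rules that out.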

Python type hints transform AI codebases from fragile dictionary-juggling into verified, self-documenting pipelines. The five core patterns — TypedDict, Generics, Protocol, Literal, and Union — address the specific challenges of typing LLM responses, building reusable pipeline stages, defining tool interfaces, constraining model parameters, and handling multi-modal inputs.

  • Python for GenAI Engineers — Async patterns, Pydantic, and production Python for AI
  • Async Python Guide — asyncio fundamentals for LLM API calls and parallel pipelines
  • Structured Outputs — LLM structured output techniques using typed schemas
  • LLMOps — Operationalizing LLM pipelines with monitoring, versioning, and deployment
  • Pydantic AI — Building type-safe AI agents with Pydantic AI framework

Frequently Asked Questions

Why are Python type hints important for AI engineering?

Type hints catch data shape mismatches at development time — before they become runtime crashes in production LLM pipelines. They enable IDE autocompletion for complex nested structures like LLM API responses, make pipeline interfaces self-documenting, and allow static analysis tools like mypy to verify that every function in your chain receives and returns the correct types.

What is the difference between TypedDict and Pydantic BaseModel for AI code?

TypedDict provides static type checking at development time with zero runtime cost — mypy verifies correct keys and value types. Pydantic BaseModel provides runtime validation, checking and coercing data when objects are created. Use TypedDict for internal data structures. Use Pydantic for external boundaries where data arrives from LLM APIs or user inputs.

How do you type LLM API responses in Python?

Use TypedDict to define the expected shape of LLM responses, including nested structures for choices, messages, and usage metadata. For structured outputs, define a TypedDict or Pydantic model for the parsed content. For streaming responses, type the generator with AsyncIterator parameterized by the chunk type.
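For the streaming case, a minimal sketch in which hard-coded chunks stand in for a real streaming API call:

```python
import asyncio
from collections.abc import AsyncIterator

async def stream_completion(prompt: str) -> AsyncIterator[str]:
    # Stand-in for a streaming LLM call; each yielded chunk is typed as str
    for chunk in ["Hello", ", ", "world"]:
        await asyncio.sleep(0)  # simulate waiting on the network
        yield chunk

async def collect(prompt: str) -> str:
    # The async comprehension preserves the str chunk type for mypy
    return "".join([chunk async for chunk in stream_completion(prompt)])
```

Consumers iterate with `async for chunk in stream_completion(...)` and mypy knows each chunk is a str.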

How do Generic types improve AI pipeline code?

Generic types let you write pipeline components that work with any data type while preserving type information through the chain. A generic retry wrapper preserves the return type of the wrapped function. A generic Pipeline class ensures that chaining stages together is type-safe — connecting incompatible stages is caught before runtime.
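A minimal sketch of such a chaining Pipeline class (the class and its then method are illustrative, built on synchronous callables for brevity):

```python
from collections.abc import Callable
from typing import Generic, TypeVar

T = TypeVar("T")
R = TypeVar("R")
S = TypeVar("S")

class Pipeline(Generic[T, R]):
    def __init__(self, fn: Callable[[T], R]) -> None:
        self.fn = fn

    def then(self, nxt: Callable[[R], S]) -> "Pipeline[T, S]":
        # The next stage's input type must match this stage's output type;
        # mypy flags incompatible chains at the .then() call site
        return Pipeline(lambda value: nxt(self.fn(value)))

    def run(self, value: T) -> R:
        return self.fn(value)

tokenize: Pipeline[str, list[str]] = Pipeline(str.split)
word_count = tokenize.then(len)  # inferred as Pipeline[str, int]
```

Chaining `tokenize.then(some_fn_expecting_int)` would be rejected statically, because the stage's list[str] output does not match the next stage's input.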

What is the Protocol pattern for AI tool interfaces?

Protocol defines structural subtyping — any class that implements the required methods satisfies the Protocol without needing explicit inheritance. For AI tool registries, you define a Protocol with execute and description methods. Any tool class implementing these methods can be registered. Third-party tools work without modification.

How do you configure mypy for an AI Python project?

Start with strict mode in pyproject.toml: strict = true, warn_return_any = true, disallow_untyped_defs = true. Use per-module overrides for third-party AI libraries lacking type stubs. Add mypy to CI so type errors block merges.

When should you use Literal types in AI code?

Use Literal for model name parameters so mypy catches typos, for role fields in chat messages, and for configuration enums like embedding dimensions or similarity metrics. This prevents invalid values from reaching API calls where they would cause runtime errors.

How do you handle Union types for multi-modal AI inputs?

Define each modality as a distinct TypedDict with a discriminator field, then use Union to combine them. In processing functions, use isinstance checks or match statements to narrow the type. mypy verifies that you handle every variant in the Union.