Python Type Hints for AI Engineers — Typed LLM Pipelines (2026)

Python type hints are a primary defense against the most common class of production bugs in LLM pipelines: data shape mismatches. An LLM pipeline chains multiple stages — embedding, retrieval, prompt building, generation, validation. At every boundary, the data must match the expected shape. Without type hints, these mismatches surface as runtime KeyError, TypeError, or AttributeError exceptions in production.

Type hints move these errors from runtime to development time. When you annotate a function that accepts list[DocumentChunk] and returns list[float], your IDE and mypy verify every call site. If a refactor changes the return type of the retrieval stage, mypy flags every downstream function that now receives the wrong type — before you run the code.

Three specific benefits for AI engineers:

  • Catch bugs at development time. A misspelled key in a TypedDict, a swapped argument order, or a missing field in a structured output — mypy catches all of these statically.
  • IDE support for complex data. LLM API responses are deeply nested dictionaries. Type hints give you autocompletion on response["choices"][0]["message"]["content"] instead of guessing key names.
  • Self-documenting pipelines. A signature like def retrieve(query: EmbeddingVector, top_k: int) -> list[ScoredDocument] communicates the interface to every engineer on the team.

Type hints belong in all production AI code. The question is where they deliver the highest value first.

Always type these first: LLM response structures (eliminates dictionary key guessing), pipeline interfaces (makes data flow visible), tool definitions (validates what the LLM sends and receives), and configuration objects (constrains model names, temperature ranges, token limits).
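A configuration object can encode these constraints directly. A minimal sketch, where the GenerationConfig name, the model list, and the valid temperature range are illustrative choices, not from any specific API:

```python
from dataclasses import dataclass
from typing import Literal

# Illustrative model list; extend with the models your project actually uses
ModelName = Literal["gpt-4o", "gpt-4o-mini"]

@dataclass(frozen=True)
class GenerationConfig:
    model: ModelName        # mypy rejects misspelled or unknown model names
    temperature: float = 0.7
    max_tokens: int = 1024

    def __post_init__(self) -> None:
        # Runtime guard for a constraint mypy cannot express statically
        if not 0.0 <= self.temperature <= 2.0:
            raise ValueError(f"temperature out of range: {self.temperature}")
```

With this in place, `GenerationConfig(model="gpt4o")` is a static error and `GenerationConfig(model="gpt-4o", temperature=3.0)` fails fast at construction time.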

Where type hints have the highest impact — at integration boundaries:

# Without types — what does this return?
def process_query(query, context, model):
    ...

# With types — the contract is explicit
def process_query(
    query: str,
    context: list[DocumentChunk],
    model: Literal["gpt-4o", "claude-sonnet-4-20250514"],
) -> GenerationResult:
    ...

The typed version communicates the full contract. The caller knows exactly what to pass and what to expect.

When types are less critical: Throwaway scripts, notebook exploration, and one-off data analysis. But any code that enters a shared repository or runs in production should be typed.


The type system in an AI pipeline follows a clear progression. Raw, untyped data enters from external sources. Each stage refines the type until the final result is fully typed and validated.

Type Flow in an AI Pipeline

  1. Raw LLM Output (dict[str, Any] from the API): raw response JSON, an untyped dictionary of unknown structure.
  2. TypedDict / Pydantic (shape defined, keys known): parse into a typed model, validate required fields, coerce value types.
  3. Validated Type (business rules enforced): content length checks, enum value validation, cross-field constraints.
  4. Pipeline Stage (Generic[T] processing): transform typed input, chain with the next stage, preserve types via generics.
  5. Typed Result (fully verified output): return the typed response, serialize for the client, log with type metadata.

Each layer serves a distinct purpose. The raw output layer acknowledges that external data is uncontrolled. The TypedDict/Pydantic layer imposes structure. The validated type layer applies domain rules. The pipeline stage layer preserves types through generic processing. The typed result layer guarantees consumers receive exactly what they expect. Skipping layers creates gaps where shape mismatches can reach production.


Five core typing patterns that AI engineers use most frequently.

from typing import TypedDict, NotRequired

class ChatMessage(TypedDict):
    role: str
    content: str

class UsageInfo(TypedDict):
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int

class LLMResponse(TypedDict):
    id: str
    model: str
    choices: list[dict[str, ChatMessage]]
    usage: UsageInfo
    system_fingerprint: NotRequired[str]

With this definition, response["usage"]["prompt_tokens"] is known to be int. Your IDE autocompletes the keys. mypy catches response["usages"] as a typo.

from typing import TypeVar, Generic, Callable, Awaitable

T = TypeVar("T")
R = TypeVar("R")

class PipelineStage(Generic[T, R]):
    def __init__(self, name: str, processor: Callable[[T], Awaitable[R]]) -> None:
        self.name = name
        self.processor = processor

    async def execute(self, input_data: T) -> R:
        return await self.processor(input_data)

embed_stage: PipelineStage[str, list[float]] = PipelineStage(
    name="embed", processor=embed_text,
)
# mypy knows: embed_stage.execute("query") returns list[float]

from typing import Protocol, runtime_checkable

@runtime_checkable
class Tool(Protocol):
    @property
    def name(self) -> str: ...
    @property
    def description(self) -> str: ...
    async def execute(self, input_text: str) -> str: ...

# Any class with these methods satisfies Tool — no inheritance needed
class WebSearchTool:
    @property
    def name(self) -> str:
        return "web_search"

    @property
    def description(self) -> str:
        return "Search the web for current information"

    async def execute(self, input_text: str) -> str:
        return await search(input_text)

def register_tool(registry: dict[str, Tool], tool: Tool) -> None:
    registry[tool.name] = tool

from typing import Literal

ModelName = Literal["gpt-4o", "gpt-4o-mini", "claude-sonnet-4-20250514"]
Role = Literal["system", "user", "assistant"]

def create_completion(model: ModelName, messages: list[dict[str, str]]) -> LLMResponse:
    ...

create_completion(model="gpt4o", messages=[])  # mypy error: "gpt4o" not in Literal

from typing import TypedDict, Union, Literal

class TextInput(TypedDict):
    modality: Literal["text"]
    content: str

class ImageInput(TypedDict):
    modality: Literal["image"]
    url: str
    detail: Literal["low", "high", "auto"]

MultiModalInput = Union[TextInput, ImageInput]

def process_input(input_data: MultiModalInput) -> str:
    if input_data["modality"] == "text":
        return input_data["content"]
    return f"[Image: {input_data['url']}]"

A production AI application has multiple layers of type safety. Each layer catches a different category of error.

Type Safety Stack for AI Applications

  • Application Types: domain models such as User, Query, Report.
  • API Response Types: TypedDict for LLM and service responses.
  • LLM Output Types: structured output schemas and parsed results.
  • Pipeline Generic Types: Generic[T, R] for reusable stage components.
  • Tool Interface Protocols: Protocol classes for pluggable tool systems.
  • Runtime Validation (Pydantic): validates external data at system boundaries.

The top layers define what data looks like — checked statically by mypy with zero runtime cost. The middle layers ensure data flows correctly between stages, preserving type information through transformations. The bottom layers operate at system boundaries where external data enters and types cannot be guaranteed statically. Together, static checking catches structural errors during development while runtime validation catches data errors in production.


Three complete examples for real AI engineering problems.

from typing import TypedDict

class DocumentChunk(TypedDict):
    id: str
    text: str
    metadata: dict[str, str]
    score: float

class RetrievalResult(TypedDict):
    query: str
    chunks: list[DocumentChunk]
    total_found: int

class GenerationResult(TypedDict):
    answer: str
    sources: list[str]
    tokens_used: int

async def embed_query(query: str) -> list[float]: ...
async def retrieve(embedding: list[float], top_k: int = 5) -> RetrievalResult: ...
async def generate(query: str, context: RetrievalResult) -> GenerationResult: ...

async def rag_pipeline(query: str) -> GenerationResult:
    embedding = await embed_query(query)
    retrieval = await retrieve(embedding)
    return await generate(query, retrieval)

import asyncio
from typing import TypeVar, Callable, Awaitable

T = TypeVar("T")

async def with_retry(
    fn: Callable[..., Awaitable[T]],
    *args: object,
    max_retries: int = 3,
    base_delay: float = 1.0,
) -> T:
    last_exception: Exception | None = None
    for attempt in range(max_retries):
        try:
            return await fn(*args)
        except Exception as e:
            last_exception = e
            if attempt < max_retries - 1:
                await asyncio.sleep(base_delay * (2 ** attempt))
    raise last_exception  # type: ignore[misc]

# mypy preserves the return type through the wrapper
result: GenerationResult = await with_retry(generate, "What is RAG?", context)

from typing import Protocol, runtime_checkable

@runtime_checkable
class AgentTool(Protocol):
    @property
    def name(self) -> str: ...
    @property
    def parameters_schema(self) -> dict[str, object]: ...
    async def execute(self, **kwargs: object) -> str: ...

class ToolRegistry:
    def __init__(self) -> None:
        self._tools: dict[str, AgentTool] = {}

    def register(self, tool: AgentTool) -> None:
        self._tools[tool.name] = tool

    def list_schemas(self) -> list[dict[str, object]]:
        return [
            {"name": t.name, "parameters": t.parameters_schema}
            for t in self._tools.values()
        ]

Python type hints support two complementary checking strategies. Understanding when to use each is critical for production AI code.

Static vs Runtime Type Checking

  • mypy (static): catches errors before code runs.
  • Pydantic (runtime): validates data when it arrives.

Verdict: use both — mypy for internal code correctness, Pydantic at external data boundaries. Production AI pipelines need static checking for developer errors and runtime validation for external data errors.

Use mypy for: internal function signatures, pipeline stage connections, configuration objects, module-level type correctness.

Use Pydantic for: LLM API response parsing, user input validation, database query results, any data crossing a system boundary.

The optimal pattern — define TypedDict for internal data shapes (zero overhead), use Pydantic models at boundaries where external data enters:

from pydantic import BaseModel, Field

class StructuredLLMOutput(BaseModel):
    answer: str = Field(min_length=1)
    confidence: float = Field(ge=0.0, le=1.0)
    sources: list[str] = Field(min_length=1)

# Validate at boundary, then flow as TypedDict internally
validated = StructuredLLMOutput.model_validate_json(raw_json)

Q1: How would you type a function that accepts either a single prompt or a batch?

Use @overload to define separate signatures. generate("hello") returns GenerationResult. generate(["a", "b"]) returns list[GenerationResult]. mypy gives callers precise return types based on the input type.

Q2: What is the difference between Protocol and ABC for tool interfaces?

ABC uses nominal typing — a class must explicitly inherit from it. Protocol uses structural typing — any class with the required methods satisfies the Protocol without inheritance. For AI tool systems, Protocol is preferred because third-party tools do not need to import your interface.

Q3: Why should you avoid Any in AI pipeline code?

Any disables type checking at that point. If a pipeline stage returns Any, mypy cannot verify downstream stages receive the correct type. Prefer object over Any — it still requires explicit type narrowing before use.
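A short illustration of the narrowing that object forces (extract_answer is a hypothetical helper, not part of any library):

```python
def extract_answer(payload: object) -> str:
    # With object, mypy refuses payload["answer"] until the type is narrowed;
    # with Any, the same unchecked access would pass silently
    if isinstance(payload, dict):
        answer = payload.get("answer")
        if isinstance(answer, str):
            return answer
    raise TypeError(f"unexpected payload shape: {type(payload).__name__}")
```

The explicit isinstance checks double as runtime validation, so malformed payloads fail loudly instead of propagating.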

Q4: How do you handle LLM API responses with optional fields?

Use NotRequired (Python 3.11+) for TypedDict fields that may be absent. This is different from Optional[str], which means the key exists but the value may be None. NotRequired means the key itself may not exist in the dictionary.


[tool.mypy]
python_version = "3.12"
strict = true
warn_return_any = true
disallow_untyped_defs = true

[[tool.mypy.overrides]]
module = ["langchain.*", "chromadb.*"]
ignore_missing_imports = true

Add mypy to your CI pipeline so type errors block merges. Type checking runs in seconds, even for large codebases.

  1. Start at the boundaries. Type all public function signatures in pipeline modules.
  2. Type new code strictly. Every new module gets full strictness. No exceptions.
  3. Work inward. After boundaries are typed, add types to internal helpers.
  4. Eliminate Any. Track Any count as a code quality metric and reduce it each sprint.

  • Using dict instead of TypedDict. Returning dict[str, Any] gives callers no information about keys or value types.
  • Overly broad Union types. Union[str, int, float, list, dict, None] is effectively untyped. Narrow to actual types.
  • Ignoring generic variance. list[Animal] is not a supertype of list[Dog]. Use Sequence[Animal] for read-only covariant access.
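The variance point in the last bullet fits in a few lines (count_animals is a toy example):

```python
from collections.abc import Sequence

class Animal: ...
class Dog(Animal): ...

def count_animals(animals: Sequence[Animal]) -> int:
    # Sequence is covariant and read-only, so a list[Dog] is accepted here;
    # declaring the parameter as list[Animal] would make mypy reject the call
    return len(animals)

dogs: list[Dog] = [Dog(), Dog()]
total = count_animals(dogs)
```

list is invariant because a function holding a list[Animal] could append a Cat into the caller's list of Dogs; the read-only Sequence interface rules that out.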

Python type hints transform AI codebases from fragile dictionary-juggling into verified, self-documenting pipelines. The five core patterns — TypedDict, Generics, Protocol, Literal, and Union — address the specific challenges of typing LLM responses, building reusable pipeline stages, defining tool interfaces, constraining model parameters, and handling multi-modal inputs.

  • Python for GenAI Engineers — Async patterns, Pydantic, and production Python for AI
  • Async Python Guide — asyncio fundamentals for LLM API calls and parallel pipelines
  • Structured Outputs — LLM structured output techniques using typed schemas
  • LLMOps — Operationalizing LLM pipelines with monitoring, versioning, and deployment
  • Pydantic AI — Building type-safe AI agents with Pydantic AI framework

Frequently Asked Questions

Why are Python type hints important for AI engineering?

Type hints catch data shape mismatches at development time — before they become runtime crashes in production LLM pipelines. They enable IDE autocompletion for complex nested structures like LLM API responses, make pipeline interfaces self-documenting, and allow static analysis tools like mypy to verify that every function in your chain receives and returns the correct types.

What is the difference between TypedDict and Pydantic BaseModel for AI code?

TypedDict provides static type checking at development time with zero runtime cost — mypy verifies correct keys and value types. Pydantic BaseModel provides runtime validation, checking and coercing data when objects are created. Use TypedDict for internal data structures. Use Pydantic for external boundaries where data arrives from LLM APIs or user inputs.

How do you type LLM API responses in Python?

Use TypedDict to define the expected shape of LLM responses, including nested structures for choices, messages, and usage metadata. For structured outputs, define a TypedDict or Pydantic model for the parsed content. For streaming responses, type the generator with AsyncIterator parameterized by the chunk type.
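For the streaming case, a minimal sketch in which hard-coded chunks stand in for a real streaming API call:

```python
import asyncio
from collections.abc import AsyncIterator

async def stream_completion(prompt: str) -> AsyncIterator[str]:
    # Stand-in for a streaming LLM call; each yielded chunk is typed as str
    for chunk in ["Hello", ", ", "world"]:
        await asyncio.sleep(0)  # simulate waiting on the network
        yield chunk

async def collect(prompt: str) -> str:
    # The async comprehension preserves the str chunk type for mypy
    return "".join([chunk async for chunk in stream_completion(prompt)])
```

Consumers iterate with `async for chunk in stream_completion(...)` and mypy knows each chunk is a str.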

How do Generic types improve AI pipeline code?

Generic types let you write pipeline components that work with any data type while preserving type information through the chain. A generic retry wrapper preserves the return type of the wrapped function. A generic Pipeline class ensures that chaining stages together is type-safe — connecting incompatible stages is caught before runtime.
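A minimal sketch of such a chaining Pipeline class (the class and its then method are illustrative, built on synchronous callables for brevity):

```python
from collections.abc import Callable
from typing import Generic, TypeVar

T = TypeVar("T")
R = TypeVar("R")
S = TypeVar("S")

class Pipeline(Generic[T, R]):
    def __init__(self, fn: Callable[[T], R]) -> None:
        self.fn = fn

    def then(self, nxt: Callable[[R], S]) -> "Pipeline[T, S]":
        # The next stage's input type must match this stage's output type;
        # mypy flags incompatible chains at the .then() call site
        return Pipeline(lambda value: nxt(self.fn(value)))

    def run(self, value: T) -> R:
        return self.fn(value)

tokenize: Pipeline[str, list[str]] = Pipeline(str.split)
word_count = tokenize.then(len)  # inferred as Pipeline[str, int]
```

Chaining `tokenize.then(some_fn_expecting_int)` would be rejected statically, because the stage's list[str] output does not match the next stage's input.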

What is the Protocol pattern for AI tool interfaces?

Protocol defines structural subtyping — any class that implements the required methods satisfies the Protocol without needing explicit inheritance. For AI tool registries, you define a Protocol with execute and description methods. Any tool class implementing these methods can be registered. Third-party tools work without modification.

How do you configure mypy for an AI Python project?

Start with strict mode in pyproject.toml: strict = true, warn_return_any = true, disallow_untyped_defs = true. Use per-module overrides for third-party AI libraries lacking type stubs. Add mypy to CI so type errors block merges.

When should you use Literal types in AI code?

Use Literal for model name parameters so mypy catches typos, for role fields in chat messages, and for configuration enums like embedding dimensions or similarity metrics. This prevents invalid values from reaching API calls where they would cause runtime errors.

How do you handle Union types for multi-modal AI inputs?

Define each modality as a distinct TypedDict with a discriminator field, then use Union to combine them. In processing functions, use isinstance checks or match statements to narrow the type. mypy verifies that you handle every variant in the Union.