Python Data Structures for GenAI Engineers (2026)
1. Why Data Structures Shape GenAI System Quality
Most Python tutorials treat data structures as a prerequisite topic: learn lists, move on. In GenAI engineering, data structures are never a prerequisite — they are an active design decision that affects correctness, latency, and cost.
Consider what a single RAG query touches before it returns an answer. The user message is a dict inside a list passed to the LLM API. Embedding API responses return a list[float] that your vector database stores and retrieves. Retrieved chunks arrive as a list[dict], deduplicated with a set, and merged with a dict keyed by source. The final prompt is assembled from several str concatenations constrained by a token count. The LLM’s response is parsed and validated by a Pydantic model. Any one of those steps getting the data structure wrong — using a plain dict where a validated Pydantic model belongs, or a list where a set would prevent duplicates — introduces either silent data corruption or runtime errors that appear only in production.
This guide is not an introduction to Python. It is a practical map of which data structures GenAI engineers reach for, why they reach for them, and where each one fails. It focuses on the structures every production LLM application actually uses.
Related foundation: Python for GenAI Engineers covers async patterns and error handling; this guide covers the data layer those patterns operate on.
2. Core Data Structures for GenAI — The Practical Map
Every GenAI application uses the same small set of data structures repeatedly. Understanding their roles and limitations upfront prevents the category of bugs that only appear in production.
| Structure | Where it appears in GenAI code | Key limitation to know |
|---|---|---|
| list | Prompt messages, retrieved chunks, embedding batches | Ordered, allows duplicates — deduplication is manual |
| dict | Prompt message objects, metadata, tool call arguments | No schema enforcement — KeyError risks at runtime |
| set | Deduplication of retrieved chunk IDs or sources | Unordered, elements must be hashable |
| dataclass | Internal pipeline state, intermediate results | No runtime validation — bad data passes silently |
| Pydantic BaseModel | LLM output parsing, API schemas, configuration | Slightly higher overhead, but catches malformed data |
| collections.deque | Rolling conversation context windows | Evicts oldest items silently once maxlen is reached |
| collections.Counter | Token frequency analysis, keyword counting | Not serializable as-is — convert to dict for JSON |
| collections.defaultdict | Grouping chunks by source document | Default factory can mask KeyError logic errors |
The most consequential distinction is between dict and Pydantic BaseModel. A dict accepts any key and any value. A Pydantic model enforces a declared schema at runtime. When an LLM response is missing a field your code later accesses, the dict version raises KeyError deep in your pipeline. The Pydantic version raises ValidationError immediately at the boundary where the LLM output was parsed.
3. Lists, Dicts, and Sets in AI Context
These three structures handle the majority of data flow in any GenAI pipeline — understanding their roles prevents the subtle bugs that only appear in production.
Lists as the Backbone of LLM API Communication
Every call to an LLM API passes a list[dict] as the messages argument. This structure is non-negotiable — it is the interface the APIs define.
```python
# Requires: openai>=1.0.0
from openai import AsyncOpenAI
from typing import TypedDict

# TypedDict gives type hints without runtime overhead
class Message(TypedDict):
    role: str  # "system" | "user" | "assistant" | "tool"
    content: str

# A conversation is a list of messages
conversation: list[Message] = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is retrieval-augmented generation?"},
]

client = AsyncOpenAI()
response = await client.chat.completions.create(
    model="gpt-4o-mini",
    messages=conversation,  # list[dict] — the only accepted format
)

# Append the assistant's reply to maintain conversation history
conversation.append({
    "role": "assistant",
    "content": response.choices[0].message.content,
})
```

List operations that matter:
- `list.append()` — O(1), the correct way to grow a conversation
- `list[start:end]` — O(k) slicing for context truncation (drop oldest messages)
- `list.extend(other)` — merge retrieved chunks from multiple sources
- `sorted(results, key=lambda x: x["score"], reverse=True)` — rerank retrieved results
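The slicing bullet is worth seeing concretely. Below is a minimal sketch of message-count truncation that keeps the system message while dropping the oldest turns; the `truncate_history` helper and the budget of 6 messages are illustrative, not from any library:

```python
def truncate_history(messages: list[dict], max_messages: int = 6) -> list[dict]:
    """Keep the system message plus the most recent turns within the budget."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    keep = max(max_messages - len(system), 0)
    # A negative-index slice keeps the newest `keep` turns
    return system + (turns[-keep:] if keep else [])

history = [{"role": "system", "content": "Be concise."}] + [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"turn {i}"}
    for i in range(10)
]
trimmed = truncate_history(history, max_messages=6)
# trimmed holds the system message plus the 5 most recent turns
```

In production the budget would be measured in tokens rather than message count, but the slicing pattern is the same.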
Dicts as Metadata Containers
Retrieved chunks from vector databases arrive with metadata: source document, page number, section heading, retrieval score. Dicts carry this metadata.
```python
from typing import TypedDict

class RetrievedChunk(TypedDict):
    id: str
    text: str
    score: float
    metadata: dict[str, str | int | float]

# Typical vector database result
chunks: list[RetrievedChunk] = [
    {
        "id": "doc_42_chunk_7",
        "text": "RAG pipelines combine retrieval with generation...",
        "score": 0.923,
        "metadata": {"source": "rag-guide.pdf", "page": 12, "section": "Architecture"},
    },
    {
        "id": "doc_17_chunk_3",
        "text": "Hybrid retrieval uses both dense and sparse methods...",
        "score": 0.891,
        "metadata": {"source": "retrieval-survey.pdf", "page": 5, "section": "Methods"},
    },
]

# Access with .get() to avoid KeyError on optional metadata fields
for chunk in chunks:
    source = chunk["metadata"].get("source", "unknown")
    page = chunk["metadata"].get("page", "N/A")
```

Dict patterns that matter:
- `dict.get(key, default)` — safe access when keys may be absent (common with LLM-generated JSON)
- `{**base_dict, **overrides}` — merge dicts for prompt template substitution
- dict comprehensions — transform lists of chunks into lookup tables by ID
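The merge and comprehension patterns are short enough to sketch. All names below (`base`, `overrides`, `by_id`) are illustrative:

```python
# Merge defaults with per-request overrides — the right-hand dict wins on conflicts
base = {"model": "gpt-4o-mini", "temperature": 0.0, "top_k": 5}
overrides = {"temperature": 0.7}
params = {**base, **overrides}
# params carries temperature=0.7 while the other keys keep their defaults

# Dict comprehension: turn a list of chunks into an O(1) lookup table keyed by ID
chunk_list = [
    {"id": "doc_42_chunk_7", "score": 0.923},
    {"id": "doc_17_chunk_3", "score": 0.891},
]
by_id = {c["id"]: c for c in chunk_list}
# by_id["doc_42_chunk_7"] retrieves a chunk without scanning the list
```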
Sets for Deduplication
Hybrid retrieval (vector + keyword search) often returns the same chunk from both sources. Sets deduplicate in O(1) per lookup.
```python
async def deduplicate_results(
    vector_results: list[RetrievedChunk],
    keyword_results: list[RetrievedChunk],
    top_k: int = 5,
) -> list[RetrievedChunk]:
    """Merge hybrid retrieval results, deduplicate by ID, sort by score."""
    seen_ids: set[str] = set()
    merged: list[RetrievedChunk] = []

    # Interleave results, preferring higher scores
    all_results = sorted(
        vector_results + keyword_results,
        key=lambda x: x["score"],
        reverse=True,
    )

    for chunk in all_results:
        if chunk["id"] not in seen_ids:
            seen_ids.add(chunk["id"])
            merged.append(chunk)
            if len(merged) == top_k:
                break

    return merged
```

Sets also deduplicate source citations — presenting a user with the same source document listed twice looks careless:

```python
# Extract unique sources for citation display
unique_sources: set[str] = {
    chunk["metadata"].get("source", "unknown") for chunk in final_chunks
}
citations = sorted(unique_sources)  # Sorted for stable presentation
```

4. Pydantic Models for LLM Output Validation
Pydantic is the most important data structure tool in the GenAI stack. Its role is to enforce schema contracts at the boundary where LLM-generated data enters your application.
Why Plain Dicts Fail for LLM Output
An LLM instructed to return JSON may:
- Omit a required field
- Return a string where a number is expected
- Return `null` where your code expects a non-null value
- Wrap the JSON in markdown fences (`` ```json ... ``` ``)
Each of these produces a dict that appears valid until you access the problematic field deep in your pipeline. Pydantic catches all of these at parse time.
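The markdown-fence failure mode in particular is worth handling explicitly before parsing. Below is a minimal stdlib sketch; the `strip_json_fences` helper and its regex are illustrative, and in practice the cleaned string would then be handed to a Pydantic model rather than used as a raw dict:

```python
import json
import re

def strip_json_fences(raw: str) -> str:
    """Remove ```json ... ``` fences that an LLM may wrap around its output."""
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", raw, re.DOTALL)
    return match.group(1) if match else raw.strip()

llm_output = '```json\n{"claim": "RAG reduces hallucinations", "confidence": "high"}\n```'
data = json.loads(strip_json_fences(llm_output))
# Without stripping, json.loads would raise JSONDecodeError on the fences
```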
Defining Pydantic Models for Structured LLM Output
```python
# Requires: pydantic>=2.0.0
from pydantic import BaseModel, Field, field_validator
from typing import Literal
from enum import Enum

class Confidence(str, Enum):
    high = "high"
    medium = "medium"
    low = "low"

class ExtractedFact(BaseModel):
    claim: str = Field(description="The factual claim extracted from the document")
    source_sentence: str = Field(description="The exact sentence the claim is drawn from")
    confidence: Confidence
    requires_verification: bool

    @field_validator("claim", "source_sentence")
    @classmethod
    def not_empty(cls, v: str) -> str:
        if not v.strip():
            raise ValueError("Field cannot be empty")
        return v.strip()

class DocumentAnalysis(BaseModel):
    document_id: str
    summary: str = Field(max_length=500)
    key_facts: list[ExtractedFact] = Field(min_length=1)
    overall_confidence: Confidence
    recommended_action: Literal["publish", "review", "reject"]
```

Using Pydantic with OpenAI Structured Output
```python
# Requires: openai>=1.0.0, pydantic>=2.0.0
from openai import AsyncOpenAI
from pydantic import ValidationError

client = AsyncOpenAI()

async def analyze_document(doc_id: str, text: str) -> DocumentAnalysis | None:
    try:
        response = await client.beta.chat.completions.parse(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": (
                        "Extract key facts from the document. "
                        "Return structured JSON matching the schema."
                    ),
                },
                {"role": "user", "content": f"Document ID: {doc_id}\n\n{text}"},
            ],
            response_format=DocumentAnalysis,
        )
        return response.choices[0].message.parsed
    except ValidationError as e:
        # Log the errors for prompt improvement — these are signal, not noise
        print(f"LLM output failed validation for {doc_id}: {e.errors()}")
        return None
```

Using Pydantic with LangChain
```python
# Requires: langchain-openai>=0.1.0, langchain-core>=0.1.0
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
# .with_structured_output() accepts a Pydantic model as the target schema
structured_llm = llm.with_structured_output(DocumentAnalysis)

prompt = ChatPromptTemplate.from_messages([
    ("system", "Extract key facts from the document."),
    ("user", "Document ID: {doc_id}\n\n{text}"),
])

chain = prompt | structured_llm

# result is a validated DocumentAnalysis instance, not a dict
result: DocumentAnalysis = await chain.ainvoke({"doc_id": doc_id, "text": text})
print(result.overall_confidence)  # Type-safe — IDE knows this is Confidence enum
```

5. Data Flow in a RAG Pipeline — Visual Diagram
Understanding how data structures transform through each stage of a RAG pipeline clarifies why each structure was chosen. Raw query strings become embedding vectors; vectors become ranked chunk lists; chunks become formatted prompt strings; LLM responses become validated Pydantic objects.
[Figure: Data Structures Through a RAG Pipeline — each stage transforms data; the right structure at each boundary prevents silent failures.]
The transitions that most often break are the ones involving LLM output: moving from raw LLM response text to a structured data type. Using Pydantic at that boundary is not optional in production — it is how the system stays correct when the LLM is inconsistent.
6. Dataclasses vs Pydantic — Knowing When to Use Each
Dataclasses and Pydantic models are both ways to give structure to data, but they serve different purposes. Choosing the wrong one leads to either over-engineering or undetected data corruption.
Dataclasses: Internal Pipeline State
Dataclasses are Python’s standard library tool for structured data containers. They provide type hints, auto-generated __init__, __repr__, and __eq__. They do not validate at runtime.
```python
from dataclasses import dataclass, field
import time

@dataclass
class RetrievalResult:
    query: str
    chunks: list[dict]
    retrieval_latency_ms: float
    timestamp: float = field(default_factory=time.time)
    from_cache: bool = False

@dataclass
class PipelineState:
    query_id: str
    original_query: str
    rewritten_query: str | None = None
    retrieval_result: RetrievalResult | None = None
    generated_answer: str | None = None
    total_tokens: int = 0

# Internal use — no external data, no validation needed
state = PipelineState(
    query_id="req_abc123",
    original_query="What is RAG?",
)
state.rewritten_query = "What is retrieval-augmented generation and how does it work?"
```

Use dataclasses when:
- Data is created and consumed entirely within your own code
- You control all the code paths that set values
- Performance is critical (no validation overhead)
- You need `__slots__` for memory efficiency on large batches
Pydantic: External Data Boundaries
Use Pydantic whenever data comes from or goes to an external source — LLM APIs, user-facing REST APIs, configuration files.
```python
# Requires: pydantic>=2.0.0, pydantic-settings>=2.0.0
from pydantic import BaseModel, Field
from pydantic_settings import BaseSettings, SettingsConfigDict
from typing import Annotated

class RAGRequest(BaseModel):
    """Validates incoming API requests from users."""
    query: Annotated[str, Field(min_length=1, max_length=2000)]
    top_k: int = Field(default=5, ge=1, le=20)
    temperature: float = Field(default=0.0, ge=0.0, le=2.0)
    filter_sources: list[str] = Field(default_factory=list)

class RAGResponse(BaseModel):
    """Validates outgoing API responses."""
    query_id: str
    answer: str
    sources: list[str]
    tokens_used: int
    retrieval_latency_ms: float

class AppSettings(BaseSettings):
    """Validates environment variables at application startup."""
    # Pydantic v2 style: model_config replaces the deprecated `class Config`
    model_config = SettingsConfigDict(env_file=".env")

    openai_api_key: str
    pinecone_api_key: str
    embedding_model: str = "text-embedding-3-small"
    generation_model: str = "gpt-4o-mini"
    max_concurrent_requests: int = Field(default=20, ge=1, le=100)
```

Decision rule:
| Scenario | Use |
|---|---|
| LLM output parsing | Pydantic |
| API request/response body | Pydantic |
| Environment configuration | Pydantic (BaseSettings) |
| Internal pipeline state | dataclass |
| In-memory agent memory | dataclass |
| Tool call arguments (to be validated) | Pydantic |
| Intermediate computation results | dataclass |
The key insight: if bad data could arrive at the structure — and produce a silent incorrect result or downstream crash — use Pydantic. If the structure only holds data your own code created and already validated, use a dataclass.
7. Collections and Itertools for GenAI Pipelines
The collections and itertools modules contain data structures and utilities that solve specific GenAI problems cleanly.
collections.deque — Rolling Context Windows
Conversation history has a token budget. When the budget is exceeded, you need to evict the oldest messages efficiently. deque with maxlen does this in O(1):
```python
from collections import deque

class ConversationMemory:
    def __init__(self, max_messages: int = 20):
        # maxlen automatically evicts oldest items when at capacity
        self._messages: deque[dict] = deque(maxlen=max_messages)
        self._system_prompt: str = ""

    def set_system(self, prompt: str) -> None:
        self._system_prompt = prompt

    def add_user(self, content: str) -> None:
        self._messages.append({"role": "user", "content": content})

    def add_assistant(self, content: str) -> None:
        self._messages.append({"role": "assistant", "content": content})

    def as_message_list(self) -> list[dict]:
        messages = []
        if self._system_prompt:
            messages.append({"role": "system", "content": self._system_prompt})
        messages.extend(self._messages)
        return messages

# Usage
memory = ConversationMemory(max_messages=10)
memory.set_system("You are a GenAI expert assistant.")
memory.add_user("What is a vector database?")
memory.add_assistant("A vector database stores high-dimensional embeddings...")
# When the 11th message is added, the 1st is automatically dropped
```

collections.defaultdict — Grouping Chunks by Source
When merging results from multiple retrieval passes, grouping chunks by their source document aids deduplication and citation:
```python
from collections import defaultdict

def group_chunks_by_source(
    chunks: list[dict],
) -> dict[str, list[dict]]:
    """Group retrieved chunks by their source document."""
    by_source: defaultdict[str, list[dict]] = defaultdict(list)

    for chunk in chunks:
        source = chunk["metadata"].get("source", "unknown")
        by_source[source].append(chunk)

    # Sort each source's chunks by score
    return {
        source: sorted(chunk_list, key=lambda c: c["score"], reverse=True)
        for source, chunk_list in by_source.items()
    }

# Result: {"rag-guide.pdf": [...], "retrieval-survey.pdf": [...]}
```

collections.Counter — Token and Keyword Analysis
Analyzing keyword frequency across a document corpus or within retrieved chunks helps calibrate retrieval quality:
```python
from collections import Counter
import re

def analyze_keyword_coverage(
    query: str,
    chunks: list[dict],
) -> dict[str, int]:
    """Count how often query keywords appear in retrieved chunks."""
    query_keywords = set(re.findall(r'\b\w{4,}\b', query.lower()))
    combined_text = " ".join(c["text"].lower() for c in chunks)
    words = re.findall(r'\b\w+\b', combined_text)

    word_counts = Counter(words)
    # Return only counts for query-relevant keywords
    return {kw: word_counts.get(kw, 0) for kw in query_keywords}
```

itertools.islice — Lazy Chunk Batching
When processing large document corpora for embedding, loading all chunks into memory at once is impractical. itertools.islice enables lazy batching:
```python
import itertools
from typing import Iterator

def batch_chunks(
    chunks: Iterator[str],
    batch_size: int = 100,
) -> Iterator[list[str]]:
    """Yield batches of chunks without loading all into memory."""
    while True:
        batch = list(itertools.islice(chunks, batch_size))
        if not batch:
            break
        yield batch

# Usage with a generator of document chunks
async def embed_corpus_lazy(document_path: str) -> None:
    chunk_gen = generate_chunks_from_file(document_path)  # generator
    for batch in batch_chunks(chunk_gen, batch_size=50):
        embeddings = await embed_batch(batch)  # 50 at a time, not all at once
        await vector_store.upsert(batch, embeddings)
```

8. Async Data Patterns for GenAI Pipelines
Data structures intersect with async patterns in specific ways that matter for production GenAI systems.
Async-Safe Shared State with asyncio.Queue
When multiple async workers process documents concurrently, shared mutable state requires care. asyncio.Queue is coroutine-safe (safe for concurrent tasks on a single event loop, but not across OS threads) and is the standard structure for producer-consumer patterns:
```python
import asyncio
from dataclasses import dataclass

@dataclass
class EmbeddingJob:
    doc_id: str
    chunk_index: int
    text: str

@dataclass
class EmbeddingResult:
    doc_id: str
    chunk_index: int
    embedding: list[float]

async def embedding_worker(
    queue_in: asyncio.Queue[EmbeddingJob],
    queue_out: asyncio.Queue[EmbeddingResult],
    client: "AsyncOpenAI",
    worker_id: int,
) -> None:
    """Consumer: pulls jobs from queue, pushes results."""
    while True:
        job = await queue_in.get()
        try:
            response = await client.embeddings.create(
                model="text-embedding-3-small",
                input=job.text,
            )
            result = EmbeddingResult(
                doc_id=job.doc_id,
                chunk_index=job.chunk_index,
                embedding=response.data[0].embedding,
            )
            await queue_out.put(result)
        finally:
            queue_in.task_done()

async def run_parallel_ingestion(
    chunks: list[tuple[str, int, str]],  # (doc_id, chunk_index, text)
    num_workers: int = 10,
) -> list[EmbeddingResult]:
    """Run embedding workers in parallel with a shared queue."""
    from openai import AsyncOpenAI
    client = AsyncOpenAI()

    queue_in: asyncio.Queue[EmbeddingJob] = asyncio.Queue()
    queue_out: asyncio.Queue[EmbeddingResult] = asyncio.Queue()

    # Populate input queue
    for doc_id, chunk_index, text in chunks:
        await queue_in.put(EmbeddingJob(doc_id, chunk_index, text))

    # Start workers
    workers = [
        asyncio.create_task(embedding_worker(queue_in, queue_out, client, i))
        for i in range(num_workers)
    ]

    # Wait for all jobs to complete
    await queue_in.join()

    # Cancel workers
    for worker in workers:
        worker.cancel()

    # Collect results
    results = []
    while not queue_out.empty():
        results.append(queue_out.get_nowait())

    return results
```

Streaming Pydantic Validation
When streaming LLM output, partial JSON arrives token by token. Validate only when the stream completes:
```python
# Requires: pydantic>=2.0.0, openai>=1.0.0
import json
from pydantic import BaseModel, ValidationError
from openai import AsyncOpenAI

class StructuredAnswer(BaseModel):
    answer: str
    confidence: float
    sources_used: list[str]

async def stream_and_validate(prompt: str) -> StructuredAnswer | None:
    client = AsyncOpenAI()
    collected_tokens: list[str] = []

    # Collect all streaming tokens
    async with client.chat.completions.stream(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    ) as stream:
        async for event in stream:
            if event.type == "content.delta" and event.delta:
                collected_tokens.append(event.delta)

    # Validate the complete response
    raw_text = "".join(collected_tokens)
    try:
        data = json.loads(raw_text)
        return StructuredAnswer.model_validate(data)
    except (json.JSONDecodeError, ValidationError) as e:
        print(f"Validation failed: {e}")
        return None
```

TypedDict for Lightweight Interop
When working with libraries that return plain dicts (LangChain LCEL chains, many vector DB clients), TypedDict adds type safety without the overhead of Pydantic instantiation:
```python
from typing import TypedDict, Required

class VectorSearchResult(TypedDict, total=False):
    id: Required[str]
    score: Required[float]
    text: Required[str]
    source: str  # Optional — not all results have this
    page: int  # Optional

# Type-safe access — IDE catches typos at write time
def extract_texts(results: list[VectorSearchResult]) -> list[str]:
    return [r["text"] for r in results]
```

9. Interview Preparation — Common Data Structure Questions
Interviewers for GenAI engineering roles ask data structure questions in the context of LLM systems. These are not abstract algorithmic questions. They test whether you know which structure fits a production scenario.
Question 1: “How would you store and manage a multi-turn conversation history with a token limit?”
Use collections.deque with maxlen set to the maximum number of messages the token budget allows. When the deque reaches capacity, it automatically evicts the oldest message — O(1) operation with no manual index management. Maintain the system prompt separately (not in the deque) so it is never evicted. Convert to list only when building the API call.
Question 2: “You have retrieved 15 chunks from vector search and 15 from BM25 keyword search. Several chunks appear in both result sets. How do you merge and deduplicate them?”
Track seen chunk IDs in a `set[str]` as you iterate over results sorted by score. For each result, check `id in seen_ids` in O(1). If not seen, add to the `seen_ids` set and append to the output list. Stop when you have `top_k` unique results. Do not use a list for deduplication — `in` on a list is O(n), making the naive approach O(n²) for large result sets.
Question 3: “An LLM returns JSON for a structured extraction task. What can go wrong if you parse it with json.loads() and access fields directly as a dict?”
Silent failures: the LLM may omit a required field (you get KeyError later), return a string where you expect an integer (type errors downstream), or return null where your code expects a non-null value. A Pydantic model raises ValidationError immediately at the parse step, giving you a clear error with field-level detail. The dict approach defers the failure — often to a silent data corruption rather than an exception.
Question 4: “When would you choose a dataclass over a Pydantic BaseModel for an agent’s memory state?”
Use a dataclass for agent memory state that your own code writes and reads — it has no validation overhead and the structure is fully under your control. Use Pydantic if the memory state is serialized to a database and reloaded (data may be corrupted in storage), or if the state is set by an external call like an LLM tool invocation where the LLM could return malformed data. The boundary question is: “Can bad data arrive here without passing through code I control?”
10. Summary and Next Steps
Python data structures in GenAI engineering are not interchangeable. The right structure at each pipeline boundary determines whether your system catches errors early or fails silently in production.
The five patterns to internalize:
- List + dict for API communication. Every LLM API call takes `list[dict]`. Know how to build, append, slice, and token-budget these message lists.
- Set for deduplication. Hybrid retrieval always produces overlapping results. O(1) `set` membership testing is the correct tool — not `if x in list`.
- Pydantic at external boundaries. Any data that crosses a system boundary — LLM output, API request, environment config — belongs in a Pydantic model. `ValidationError` at parse time is better than `KeyError` deep in business logic.
- Dataclass for internal state. Pipeline intermediate state, agent memory, and results your own code creates belong in dataclasses. No validation overhead, full IDE support.
- `collections.deque` for rolling windows. Token budgets require evicting old messages. `deque(maxlen=N)` does this automatically and in O(1).
Where to go next:
- Python for GenAI Engineers — Production async patterns and Pydantic in depth, including retry logic and structured output
- Async Python Guide — `asyncio.gather`, semaphores, and streaming, which operate on the data structures covered here
- RAG Architecture Guide — How lists, dicts, sets, and Pydantic models compose into a full retrieval-augmented generation system
- AI Agents and Agentic Systems — Agent memory and tool call arguments are where dataclasses and Pydantic interact most
The data layer is invisible when it works correctly and catastrophic when it does not. Getting these fundamentals right is what separates GenAI code that works in a demo from code that runs reliably in production.
Related
- Python for GenAI — Python foundations for AI development
- Async Python — Concurrency for LLM API calls
- LLM Database Integration — Connecting LLMs to databases
- RAG Chunking — Data structures in chunking pipelines
Frequently Asked Questions
What Python data structures are essential for AI and LLM applications?
Lists and dicts are the daily workhorses for prompt messages, retrieval results, and tool calls. Sets enable fast deduplication of retrieved chunks. Dataclasses structure pipeline intermediate state. Pydantic models validate LLM-generated JSON at runtime. The collections module offers deque for rolling context windows and Counter for token frequency analysis.
Why is Pydantic important for LLM applications?
LLMs return unstructured text or loosely formatted JSON. Pydantic models enforce a schema at runtime — if the LLM output is missing a required field or has a wrong type, a ValidationError is raised immediately rather than propagating a silent data error downstream. OpenAI's structured output feature and LangChain's with_structured_output both accept Pydantic models as the target schema.
When should I use dataclasses vs Pydantic in a GenAI project?
Use dataclasses for internal pipeline state that never crosses an external boundary: intermediate retrieval results, pipeline stage outputs, in-memory agent state. Use Pydantic when data comes from or goes to an external source: LLM outputs, API request/response bodies, configuration loaded from environment variables. The rule of thumb is: if it could be wrong, use Pydantic.
Which Python collections are most useful for storing and processing embeddings?
Plain Python lists store small embedding batches. For large-scale numeric work use NumPy arrays (faster dot-product, slicing). deque from the collections module manages rolling conversation context windows with O(1) appends and pops. defaultdict simplifies grouping retrieved chunks by source document. Counter tracks token frequencies across a corpus for analysis.
How do you use sets for deduplication in a RAG pipeline?
Track seen chunk IDs in a set as you iterate over results sorted by score. For each result, check membership in O(1) with id in seen_ids. If not seen, add the ID to the set and append the chunk to the output list. This prevents duplicate chunks from appearing in the final context, which is critical in hybrid RAG retrieval where vector search and keyword search return overlapping results.
How does collections.deque manage conversation context windows?
Create a deque with maxlen set to the maximum number of messages your token budget allows. When the deque reaches capacity, it automatically evicts the oldest message in O(1) with no manual index management. The system prompt is maintained separately so it is never evicted. Convert the deque to a list only when building the final API call.
What is TypedDict and when should you use it instead of Pydantic?
TypedDict adds type safety to plain dicts without the overhead of Pydantic instantiation. Use it when working with libraries that return plain dicts, such as LangChain LCEL chains or vector database clients. TypedDict provides IDE autocompletion and catches key typos at write time, but it does not validate data at runtime like Pydantic does.
How do you batch chunks for embedding using itertools?
Use itertools.islice to lazily pull fixed-size batches from a generator of document chunks without loading all chunks into memory at once. In each iteration, slice the next batch_size items, embed them via the API, and upsert into the vector store. This pattern is essential when processing large corpora where the full dataset does not fit in memory.
How does asyncio.Queue help with parallel embedding pipelines?
asyncio.Queue implements a thread-safe producer-consumer pattern within the event loop. Producers enqueue EmbeddingJob dataclasses, and multiple async workers pull jobs from the queue, call the embedding API, and push results to an output queue. This enables controlled concurrency with a fixed number of workers, preventing API rate limit errors.
Why should you validate LLM-generated JSON with Pydantic instead of json.loads?
json.loads only checks syntax — it produces a dict that accepts any keys and values. If the LLM omits a required field, returns a string where a number is expected, or returns null where your code expects a value, the error surfaces deep in your pipeline as a KeyError or type mismatch. Pydantic raises a ValidationError immediately at parse time with field-level detail, catching malformed data at the boundary.