
Python Data Structures for GenAI Engineers (2026)

1. Why Data Structures Shape GenAI System Quality


Most Python tutorials treat data structures as a prerequisite topic: learn lists, move on. In GenAI engineering, data structures are never a prerequisite — they are an active design decision that affects correctness, latency, and cost.

Consider what a single RAG query touches before it returns an answer. The user message is a dict inside a list passed to the LLM API. Embedding API responses return a list[float] that your vector database stores and retrieves. Retrieved chunks arrive as a list[dict], deduplicated with a set, and merged with a dict keyed by source. The final prompt is assembled from several str concatenations constrained by a token count. The LLM’s response is parsed and validated by a Pydantic model. Any one of those steps getting the data structure wrong — using a plain dict where a validated Pydantic model belongs, or a list where a set would prevent duplicates — introduces either silent data corruption or runtime errors that appear only in production.

This guide is not an introduction to Python. It is a practical map of which data structures GenAI engineers reach for, why they reach for them, and where each one fails — focused on what every production LLM application actually uses.

Related foundation: Python for GenAI Engineers covers async patterns and error handling; this guide covers the data layer those patterns operate on.


2. Core Data Structures for GenAI — The Practical Map


Every GenAI application uses the same small set of data structures repeatedly. Understanding their roles and limitations upfront prevents the category of bugs that only appear in production.

| Structure | Where it appears in GenAI code | Key limitation to know |
|---|---|---|
| list | Prompt messages, retrieved chunks, embedding batches | Ordered, allows duplicates — deduplication is manual |
| dict | Prompt message objects, metadata, tool call arguments | No schema enforcement — KeyError risks at runtime |
| set | Deduplication of retrieved chunk IDs or sources | Unordered, elements must be hashable |
| dataclass | Internal pipeline state, intermediate results | No runtime validation — bad data passes silently |
| Pydantic BaseModel | LLM output parsing, API schemas, configuration | Slightly higher overhead, but catches malformed data |
| collections.deque | Rolling conversation context windows | Fixed max length for O(1) context eviction |
| collections.Counter | Token frequency analysis, keyword counting | Not serializable as-is — convert to dict for JSON |
| collections.defaultdict | Grouping chunks by source document | Default factory can mask KeyError logic errors |

The most consequential distinction is between dict and Pydantic BaseModel. A dict accepts any key and any value. A Pydantic model enforces a declared schema at runtime. When an LLM response is missing a field your code later accesses, the dict version raises KeyError deep in your pipeline. The Pydantic version raises ValidationError immediately at the boundary where the LLM output was parsed.
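The difference is easy to demonstrate. A minimal sketch (the Answer model and payload are illustrative):

```python
import json
from pydantic import BaseModel, ValidationError

class Answer(BaseModel):
    text: str
    confidence: float

# An LLM response that is missing the "confidence" field
raw = '{"text": "RAG combines retrieval with generation."}'
payload = json.loads(raw)  # succeeds — json.loads only checks syntax

# Dict version: the problem surfaces later, wherever the field is read
# payload["confidence"]  # KeyError, possibly deep in the pipeline

# Pydantic version: the problem surfaces immediately, at the boundary
try:
    Answer.model_validate(payload)
except ValidationError as e:
    print(e.errors()[0]["loc"])  # ('confidence',) — field-level detail
```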


3. Lists, Dicts, and Sets — The Workhorses

These three structures handle the majority of data flow in any GenAI pipeline — understanding their roles prevents the subtle bugs that only appear in production.

Lists as the Backbone of LLM API Communication


Every call to an LLM API passes a list[dict] as the messages argument. This structure is non-negotiable — it is the interface the APIs define.

# Requires: openai>=1.0.0
from openai import AsyncOpenAI
from typing import TypedDict

# TypedDict gives type hints without runtime overhead
class Message(TypedDict):
    role: str  # "system" | "user" | "assistant" | "tool"
    content: str

# A conversation is a list of messages
conversation: list[Message] = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is retrieval-augmented generation?"},
]

client = AsyncOpenAI()
response = await client.chat.completions.create(
    model="gpt-4o-mini",
    messages=conversation,  # list[dict] — the only accepted format
)

# Append the assistant's reply to maintain conversation history
conversation.append({
    "role": "assistant",
    "content": response.choices[0].message.content,
})

List operations that matter:

  • list.append() — O(1), the correct way to grow a conversation
  • list[start:end] — O(k) slicing for context truncation (drop oldest messages)
  • list.extend(other) — merge retrieved chunks from multiple sources
  • sorted(results, key=lambda x: x["score"], reverse=True) — rerank retrieved results
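A quick sketch of the truncation and rerank patterns (the messages and scores are made up for illustration):

```python
messages = [{"role": "system", "content": "You are helpful."}] + [
    {"role": "user", "content": f"turn {i}"} for i in range(10)
]

# Context truncation: keep the system prompt plus the most recent 4 turns
truncated = [messages[0]] + messages[-4:]

# Rerank retrieved results by score, highest first
results = [{"id": "a", "score": 0.71}, {"id": "b", "score": 0.93}]
reranked = sorted(results, key=lambda x: x["score"], reverse=True)
print([r["id"] for r in reranked])  # ['b', 'a']
```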

Retrieved chunks from vector databases arrive with metadata: source document, page number, section heading, retrieval score. Dicts carry this metadata.

from typing import TypedDict

class RetrievedChunk(TypedDict):
    id: str
    text: str
    score: float
    metadata: dict[str, str | int | float]

# Typical vector database result
chunks: list[RetrievedChunk] = [
    {
        "id": "doc_42_chunk_7",
        "text": "RAG pipelines combine retrieval with generation...",
        "score": 0.923,
        "metadata": {"source": "rag-guide.pdf", "page": 12, "section": "Architecture"},
    },
    {
        "id": "doc_17_chunk_3",
        "text": "Hybrid retrieval uses both dense and sparse methods...",
        "score": 0.891,
        "metadata": {"source": "retrieval-survey.pdf", "page": 5, "section": "Methods"},
    },
]

# Access with .get() to avoid KeyError on optional metadata fields
for chunk in chunks:
    source = chunk["metadata"].get("source", "unknown")
    page = chunk["metadata"].get("page", "N/A")

Dict patterns that matter:

  • dict.get(key, default) — safe access when keys may be absent (common with LLM-generated JSON)
  • {**base_dict, **overrides} — merge dicts for prompt template substitution
  • dict comprehensions — transform lists of chunks into lookup tables by ID
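For example, merging template defaults with per-request overrides, and building a lookup table by chunk ID (the names here are illustrative):

```python
defaults = {"tone": "concise", "language": "en", "max_tokens": 256}
overrides = {"max_tokens": 512}
params = {**defaults, **overrides}  # right-hand dict wins on key conflicts
print(params["max_tokens"])  # 512

chunks = [
    {"id": "doc_1_chunk_0", "text": "..."},
    {"id": "doc_1_chunk_1", "text": "..."},
]
by_id = {c["id"]: c for c in chunks}  # O(1) lookup by chunk ID
print("doc_1_chunk_1" in by_id)  # True
```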

Hybrid retrieval (vector + keyword search) often returns the same chunk from both sources. Sets deduplicate in O(1) per lookup.

def deduplicate_results(
    vector_results: list[RetrievedChunk],
    keyword_results: list[RetrievedChunk],
    top_k: int = 5,
) -> list[RetrievedChunk]:
    """Merge hybrid retrieval results, deduplicate by ID, sort by score."""
    seen_ids: set[str] = set()
    merged: list[RetrievedChunk] = []
    # Sort the combined results so the higher-scoring copy of a duplicate wins
    all_results = sorted(
        vector_results + keyword_results,
        key=lambda x: x["score"],
        reverse=True,
    )
    for chunk in all_results:
        if chunk["id"] not in seen_ids:
            seen_ids.add(chunk["id"])
            merged.append(chunk)
            if len(merged) == top_k:
                break
    return merged

Sets also deduplicate source citations — presenting a user with the same source document listed twice looks careless:

# Extract unique sources for citation display
unique_sources: set[str] = {
    chunk["metadata"].get("source", "unknown")
    for chunk in final_chunks
}
citations = sorted(unique_sources)  # Sorted for stable presentation

4. Pydantic Models for LLM Output Validation


Pydantic is the most important data structure tool in the GenAI stack. Its role is to enforce schema contracts at the boundary where LLM-generated data enters your application.

An LLM instructed to return JSON may:

  • Omit a required field
  • Return a string where a number is expected
  • Return null where your code expects a non-null value
  • Wrap the JSON in markdown fences: ```json ... ```

Each of these produces a dict that appears valid until you access the problematic field deep in your pipeline. Pydantic catches all of these at parse time.

Defining Pydantic Models for Structured LLM Output

# Requires: pydantic>=2.0.0
from pydantic import BaseModel, Field, field_validator
from typing import Literal
from enum import Enum

class Confidence(str, Enum):
    high = "high"
    medium = "medium"
    low = "low"

class ExtractedFact(BaseModel):
    claim: str = Field(description="The factual claim extracted from the document")
    source_sentence: str = Field(description="The exact sentence the claim is drawn from")
    confidence: Confidence
    requires_verification: bool

    @field_validator("claim", "source_sentence")
    @classmethod
    def not_empty(cls, v: str) -> str:
        if not v.strip():
            raise ValueError("Field cannot be empty")
        return v.strip()

class DocumentAnalysis(BaseModel):
    document_id: str
    summary: str = Field(max_length=500)
    key_facts: list[ExtractedFact] = Field(min_length=1)
    overall_confidence: Confidence
    recommended_action: Literal["publish", "review", "reject"]

Using Pydantic with OpenAI Structured Output

# Requires: openai>=1.0.0, pydantic>=2.0.0
from openai import AsyncOpenAI
from pydantic import ValidationError

client = AsyncOpenAI()

async def analyze_document(doc_id: str, text: str) -> DocumentAnalysis | None:
    try:
        response = await client.beta.chat.completions.parse(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": (
                        "Extract key facts from the document. "
                        "Return structured JSON matching the schema."
                    ),
                },
                {"role": "user", "content": f"Document ID: {doc_id}\n\n{text}"},
            ],
            response_format=DocumentAnalysis,
        )
        return response.choices[0].message.parsed
    except ValidationError as e:
        # Log the errors for prompt improvement — these are signal, not noise
        print(f"LLM output failed validation for {doc_id}: {e.errors()}")
        return None

The same Pydantic model plugs into LangChain via with_structured_output:

# Requires: langchain-openai>=0.1.0, langchain-core>=0.1.0
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
# .with_structured_output() accepts a Pydantic model as the target schema
structured_llm = llm.with_structured_output(DocumentAnalysis)
prompt = ChatPromptTemplate.from_messages([
    ("system", "Extract key facts from the document."),
    ("user", "Document ID: {doc_id}\n\n{text}"),
])
chain = prompt | structured_llm

# result is a validated DocumentAnalysis instance, not a dict
result: DocumentAnalysis = await chain.ainvoke({"doc_id": doc_id, "text": text})
print(result.overall_confidence)  # Type-safe — IDE knows this is Confidence enum

5. Data Flow in a RAG Pipeline — Visual Diagram


Understanding how data structures transform through each stage of a RAG pipeline clarifies why each structure was chosen. Raw query strings become embedding vectors; vectors become ranked chunk lists; chunks become formatted prompt strings; LLM responses become validated Pydantic objects.

Data Structures Through a RAG Pipeline — each stage transforms data, and the right structure at each boundary prevents silent failures:

  • Query stage (str → list[dict] for the LLM API): the raw user query (str) becomes a message list (list[dict] of role/content pairs), then an embedding vector (list[float] — 1536 dimensions for text-embedding-3-small).
  • Retrieval stage (vector → deduplicated chunks): vector search results and keyword search results each arrive as list[dict] with id, text, score, and metadata; deduplication uses set[str] for IDs and list[dict] for the final chunks.
  • Generation stage (chunks → prompt str → Pydantic model): context assembly joins chunk texts into a str within a token budget, the prompt messages are list[dict] (system + user + context), and the LLM response is parsed into a validated Pydantic model.
  • Output stage (validated model → API response): the Pydantic response model (typed, validated, serializable) yields source citations (set[str] → sorted list for display) and the API JSON response via model.model_dump().

The transitions that most often break are the ones involving LLM output: moving from raw LLM response text to a structured data type. Using Pydantic at that boundary is not optional in production — it is how the system stays correct when the LLM is inconsistent.
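At the output stage, the validated model converts back to plain data for the API layer. A short sketch (the RAGAnswer model here is illustrative):

```python
from pydantic import BaseModel

class RAGAnswer(BaseModel):
    answer: str
    sources: list[str]
    tokens_used: int

result = RAGAnswer(
    answer="RAG combines retrieval with generation.",
    sources=["rag-guide.pdf"],
    tokens_used=412,
)

payload = result.model_dump()    # dict — safe to hand to a JSON response layer
body = result.model_dump_json()  # str — already-serialized JSON
print(payload["tokens_used"])  # 412
```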


6. Dataclasses vs Pydantic — Knowing When to Use Each


Dataclasses and Pydantic models are both ways to give structure to data, but they serve different purposes. Choosing the wrong one leads to either over-engineering or undetected data corruption.

Dataclasses are Python’s standard library tool for structured data containers. They provide type hints, auto-generated __init__, __repr__, and __eq__. They do not validate at runtime.

from dataclasses import dataclass, field
import time

@dataclass
class RetrievalResult:
    query: str
    chunks: list[dict]
    retrieval_latency_ms: float
    timestamp: float = field(default_factory=time.time)
    from_cache: bool = False

@dataclass
class PipelineState:
    query_id: str
    original_query: str
    rewritten_query: str | None = None
    retrieval_result: RetrievalResult | None = None
    generated_answer: str | None = None
    total_tokens: int = 0

# Internal use — no external data, no validation needed
state = PipelineState(
    query_id="req_abc123",
    original_query="What is RAG?",
)
state.rewritten_query = "What is retrieval-augmented generation and how does it work?"

Use dataclasses when:

  • Data is created and consumed entirely within your own code
  • You control all the code paths that set values
  • Performance is critical (no validation overhead)
  • You need __slots__ for memory efficiency on large batches

Use Pydantic whenever data comes from or goes to an external source — LLM APIs, user-facing REST APIs, configuration files.

# Requires: pydantic>=2.0.0, pydantic-settings>=2.0.0
from pydantic import BaseModel, Field
from pydantic_settings import BaseSettings, SettingsConfigDict
from typing import Annotated

class RAGRequest(BaseModel):
    """Validates incoming API requests from users."""
    query: Annotated[str, Field(min_length=1, max_length=2000)]
    top_k: int = Field(default=5, ge=1, le=20)
    temperature: float = Field(default=0.0, ge=0.0, le=2.0)
    filter_sources: list[str] = Field(default_factory=list)

class RAGResponse(BaseModel):
    """Validates outgoing API responses."""
    query_id: str
    answer: str
    sources: list[str]
    tokens_used: int
    retrieval_latency_ms: float

class AppSettings(BaseSettings):
    """Validates environment variables at application startup."""
    model_config = SettingsConfigDict(env_file=".env")

    openai_api_key: str
    pinecone_api_key: str
    embedding_model: str = "text-embedding-3-small"
    generation_model: str = "gpt-4o-mini"
    max_concurrent_requests: int = Field(default=20, ge=1, le=100)

Decision rule:

| Scenario | Use |
|---|---|
| LLM output parsing | Pydantic |
| API request/response body | Pydantic |
| Environment configuration | Pydantic (BaseSettings) |
| Internal pipeline state | dataclass |
| In-memory agent memory | dataclass |
| Tool call arguments (to be validated) | Pydantic |
| Intermediate computation results | dataclass |

The key insight: if bad data could arrive at the structure — and produce a silent incorrect result or downstream crash — use Pydantic. If the structure only holds data your own code created and already validated, use a dataclass.


7. Collections and Itertools for GenAI Pipelines


The collections and itertools modules contain data structures and utilities that solve specific GenAI problems cleanly.

collections.deque — Rolling Context Windows


Conversation history has a token budget. When the budget is exceeded, you need to evict the oldest messages efficiently. deque with maxlen does this in O(1):

from collections import deque

class ConversationMemory:
    def __init__(self, max_messages: int = 20):
        # maxlen automatically evicts oldest items when at capacity
        self._messages: deque[dict] = deque(maxlen=max_messages)
        self._system_prompt: str = ""

    def set_system(self, prompt: str) -> None:
        self._system_prompt = prompt

    def add_user(self, content: str) -> None:
        self._messages.append({"role": "user", "content": content})

    def add_assistant(self, content: str) -> None:
        self._messages.append({"role": "assistant", "content": content})

    def as_message_list(self) -> list[dict]:
        messages = []
        if self._system_prompt:
            messages.append({"role": "system", "content": self._system_prompt})
        messages.extend(self._messages)
        return messages

# Usage
memory = ConversationMemory(max_messages=10)
memory.set_system("You are a GenAI expert assistant.")
memory.add_user("What is a vector database?")
memory.add_assistant("A vector database stores high-dimensional embeddings...")
# When the 11th message is added, the 1st is automatically dropped

collections.defaultdict — Grouping Chunks by Source


When merging results from multiple retrieval passes, grouping chunks by their source document aids deduplication and citation:

from collections import defaultdict

def group_chunks_by_source(
    chunks: list[dict],
) -> dict[str, list[dict]]:
    """Group retrieved chunks by their source document."""
    by_source: defaultdict[str, list[dict]] = defaultdict(list)
    for chunk in chunks:
        source = chunk["metadata"].get("source", "unknown")
        by_source[source].append(chunk)
    # Sort each source's chunks by score
    return {
        source: sorted(chunk_list, key=lambda c: c["score"], reverse=True)
        for source, chunk_list in by_source.items()
    }

# Result: {"rag-guide.pdf": [...], "retrieval-survey.pdf": [...]}

collections.Counter — Token and Keyword Analysis


Analyzing keyword frequency across a document corpus or within retrieved chunks helps calibrate retrieval quality:

from collections import Counter
import re

def analyze_keyword_coverage(
    query: str,
    chunks: list[dict],
) -> dict[str, int]:
    """Count how often query keywords appear in retrieved chunks."""
    query_keywords = set(re.findall(r'\b\w{4,}\b', query.lower()))
    combined_text = " ".join(c["text"].lower() for c in chunks)
    words = re.findall(r'\b\w+\b', combined_text)
    word_counts = Counter(words)
    # Return only counts for query-relevant keywords
    return {kw: word_counts.get(kw, 0) for kw in query_keywords}

When processing large document corpora for embedding, loading all chunks into memory at once is impractical. itertools.islice enables lazy batching:

import itertools
from typing import Iterator

def batch_chunks(
    chunks: Iterator[str],
    batch_size: int = 100,
) -> Iterator[list[str]]:
    """Yield batches of chunks without loading all into memory."""
    while True:
        batch = list(itertools.islice(chunks, batch_size))
        if not batch:
            break
        yield batch

# Usage with a generator of document chunks
async def embed_corpus_lazy(document_path: str) -> None:
    chunk_gen = generate_chunks_from_file(document_path)  # generator
    for batch in batch_chunks(chunk_gen, batch_size=50):
        embeddings = await embed_batch(batch)  # 50 at a time, not all at once
        await vector_store.upsert(batch, embeddings)

8. Async Data Patterns for GenAI Pipelines

Section titled “8. Async Data Patterns for GenAI Pipelines”

Data structures intersect with async patterns in specific ways that matter for production GenAI systems.

Async-Safe Shared State with asyncio.Queue


When multiple async workers process documents concurrently, shared mutable state requires care. asyncio.Queue is a coroutine-safe data structure for producer-consumer patterns — safe for concurrent use within a single event loop, though not across threads:

import asyncio
from dataclasses import dataclass

@dataclass
class EmbeddingJob:
    doc_id: str
    chunk_index: int
    text: str

@dataclass
class EmbeddingResult:
    doc_id: str
    chunk_index: int
    embedding: list[float]

async def embedding_worker(
    queue_in: asyncio.Queue[EmbeddingJob],
    queue_out: asyncio.Queue[EmbeddingResult],
    client: "AsyncOpenAI",
    worker_id: int,
) -> None:
    """Consumer: pulls jobs from queue, pushes results."""
    while True:
        job = await queue_in.get()
        try:
            response = await client.embeddings.create(
                model="text-embedding-3-small",
                input=job.text,
            )
            result = EmbeddingResult(
                doc_id=job.doc_id,
                chunk_index=job.chunk_index,
                embedding=response.data[0].embedding,
            )
            await queue_out.put(result)
        finally:
            queue_in.task_done()

async def run_parallel_ingestion(
    chunks: list[tuple[str, int, str]],  # (doc_id, chunk_index, text)
    num_workers: int = 10,
) -> list[EmbeddingResult]:
    """Run embedding workers in parallel with a shared queue."""
    from openai import AsyncOpenAI
    client = AsyncOpenAI()
    queue_in: asyncio.Queue[EmbeddingJob] = asyncio.Queue()
    queue_out: asyncio.Queue[EmbeddingResult] = asyncio.Queue()
    # Populate input queue
    for doc_id, chunk_index, text in chunks:
        await queue_in.put(EmbeddingJob(doc_id, chunk_index, text))
    # Start workers
    workers = [
        asyncio.create_task(embedding_worker(queue_in, queue_out, client, i))
        for i in range(num_workers)
    ]
    # Wait for all jobs to complete
    await queue_in.join()
    # Cancel workers
    for worker in workers:
        worker.cancel()
    # Collect results
    results = []
    while not queue_out.empty():
        results.append(queue_out.get_nowait())
    return results

When streaming LLM output, partial JSON arrives token by token. Validate only when the stream completes:

# Requires: pydantic>=2.0.0, openai>=1.0.0
import json

from openai import AsyncOpenAI
from pydantic import BaseModel, ValidationError

class StructuredAnswer(BaseModel):
    answer: str
    confidence: float
    sources_used: list[str]

async def stream_and_validate(prompt: str) -> StructuredAnswer | None:
    client = AsyncOpenAI()
    collected_tokens: list[str] = []
    # Collect all streaming tokens
    async with client.beta.chat.completions.stream(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    ) as stream:
        async for event in stream:
            if event.type == "content.delta" and event.delta:
                collected_tokens.append(event.delta)
    # Validate the complete response
    raw_text = "".join(collected_tokens)
    try:
        data = json.loads(raw_text)
        return StructuredAnswer.model_validate(data)
    except (json.JSONDecodeError, ValidationError) as e:
        print(f"Validation failed: {e}")
        return None

When working with libraries that return plain dicts (LangChain LCEL chains, many vector DB clients), TypedDict adds type safety without the overhead of Pydantic instantiation:

from typing import TypedDict, Required

class VectorSearchResult(TypedDict, total=False):
    id: Required[str]
    score: Required[float]
    text: Required[str]
    source: str  # Optional — not all results have this
    page: int  # Optional

# Type-safe access — IDE catches typos at write time
def extract_texts(results: list[VectorSearchResult]) -> list[str]:
    return [r["text"] for r in results]

9. Interview Preparation — Common Data Structure Questions


Interviewers for GenAI engineering roles ask data structure questions in the context of LLM systems. These are not abstract algorithmic questions. They test whether you know which structure fits a production scenario.

Question 1: “How would you store and manage a multi-turn conversation history with a token limit?”

Use collections.deque with maxlen set to the maximum number of messages the token budget allows. When the deque reaches capacity, it automatically evicts the oldest message — O(1) operation with no manual index management. Maintain the system prompt separately (not in the deque) so it is never evicted. Convert to list only when building the API call.
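If the interviewer pushes to a budget measured in tokens rather than messages, the same eviction idea can be applied with a manual loop — a sketch assuming a rough 4-characters-per-token estimate in place of a real tokenizer like tiktoken:

```python
from collections import deque

def estimate_tokens(message: dict) -> int:
    # Crude heuristic: ~4 characters per token (use a real tokenizer in production)
    return len(message["content"]) // 4 + 4

def evict_to_budget(messages: deque, budget: int) -> None:
    """Drop oldest messages until the estimated token count fits the budget."""
    while len(messages) > 1 and sum(estimate_tokens(m) for m in messages) > budget:
        messages.popleft()  # O(1) eviction from the left end

history = deque({"role": "user", "content": "x" * 400} for _ in range(10))
evict_to_budget(history, budget=300)
print(len(history))  # 2 — each message estimates to 104 tokens
```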

Question 2: “You have retrieved 15 chunks from vector search and 15 from BM25 keyword search. Several chunks appear in both result sets. How do you merge and deduplicate them?”

Track seen chunk IDs in a set[str] as you iterate over results sorted by score. For each result, check id in seen_ids in O(1). If not seen, add to the seen_ids set and append to the output list. Stop when you have top_k unique results. Do not use a list for deduplication — a membership check with in on a list is O(n), making the naive approach O(n²) for large result sets.

Question 3: “An LLM returns JSON for a structured extraction task. What can go wrong if you parse it with json.loads() and access fields directly as a dict?”

Silent failures: the LLM may omit a required field (you get KeyError later), return a string where you expect an integer (type errors downstream), or return null where your code expects a non-null value. A Pydantic model raises ValidationError immediately at the parse step, giving you a clear error with field-level detail. The dict approach defers the failure — often to a silent data corruption rather than an exception.

Question 4: “When would you choose a dataclass over a Pydantic BaseModel for an agent’s memory state?”

Use a dataclass for agent memory state that your own code writes and reads — it has no validation overhead and the structure is fully under your control. Use Pydantic if the memory state is serialized to a database and reloaded (data may be corrupted in storage), or if the state is set by an external call like an LLM tool invocation where the LLM could return malformed data. The boundary question is: "Can bad data arrive here without passing through code I control?"


Python data structures in GenAI engineering are not interchangeable. The right structure at each pipeline boundary determines whether your system catches errors early or fails silently in production.

The five patterns to internalize:

  1. List + dict for API communication. Every LLM API call takes list[dict]. Know how to build, append, slice, and token-budget these message lists.

  2. Set for deduplication. Hybrid retrieval always produces overlapping results. O(1) set membership testing is the correct tool — not if x in list.

  3. Pydantic at external boundaries. Any data that crosses a system boundary — LLM output, API request, environment config — belongs in a Pydantic model. ValidationError at parse time is better than KeyError deep in business logic.

  4. Dataclass for internal state. Pipeline intermediate state, agent memory, and results your own code creates belong in dataclasses. No validation overhead, full IDE support.

  5. collections.deque for rolling windows. Token budgets require evicting old messages. deque(maxlen=N) does this automatically and in O(1).

Where to go next:

  • Python for GenAI Engineers — Production async patterns and Pydantic in depth, including retry logic and structured output
  • Async Python Guide — asyncio.gather, semaphores, and streaming, which operate on the data structures covered here
  • RAG Architecture Guide — How lists, dicts, sets, and Pydantic models compose into a full retrieval-augmented generation system
  • AI Agents and Agentic Systems — Agent memory and tool call arguments are where dataclasses and Pydantic interact most

The data layer is invisible when it works correctly and catastrophic when it does not. Getting these fundamentals right is what separates GenAI code that works in a demo from code that runs reliably in production.

Frequently Asked Questions

What Python data structures are essential for AI and LLM applications?

Lists and dicts are the daily workhorses for prompt messages, retrieval results, and tool calls. Sets enable fast deduplication of retrieved chunks. Dataclasses structure pipeline intermediate state. Pydantic models validate LLM-generated JSON at runtime. The collections module offers deque for rolling context windows and Counter for token frequency analysis.

Why is Pydantic important for LLM applications?

LLMs return unstructured text or loosely formatted JSON. Pydantic models enforce a schema at runtime — if the LLM output is missing a required field or has a wrong type, a ValidationError is raised immediately rather than propagating a silent data error downstream. OpenAI's structured output feature and LangChain's with_structured_output both accept Pydantic models as the target schema.

When should I use dataclasses vs Pydantic in a GenAI project?

Use dataclasses for internal pipeline state that never crosses an external boundary: intermediate retrieval results, pipeline stage outputs, in-memory agent state. Use Pydantic when data comes from or goes to an external source: LLM outputs, API request/response bodies, configuration loaded from environment variables. The rule of thumb is: if it could be wrong, use Pydantic.

Which Python collections are most useful for storing and processing embeddings?

Plain Python lists store small embedding batches. For large-scale numeric work use NumPy arrays (faster dot-product, slicing). deque from the collections module manages rolling conversation context windows with O(1) appends and pops. defaultdict simplifies grouping retrieved chunks by source document. Counter tracks token frequencies across a corpus for analysis.

How do you use sets for deduplication in a RAG pipeline?

Track seen chunk IDs in a set as you iterate over results sorted by score. For each result, check membership in O(1) with id in seen_ids. If not seen, add the ID to the set and append the chunk to the output list. This prevents duplicate chunks from appearing in the final context, which is critical in hybrid RAG retrieval where vector search and keyword search return overlapping results.

How does collections.deque manage conversation context windows?

Create a deque with maxlen set to the maximum number of messages your token budget allows. When the deque reaches capacity, it automatically evicts the oldest message in O(1) with no manual index management. The system prompt is maintained separately so it is never evicted. Convert the deque to a list only when building the final API call.

What is TypedDict and when should you use it instead of Pydantic?

TypedDict adds type safety to plain dicts without the overhead of Pydantic instantiation. Use it when working with libraries that return plain dicts, such as LangChain LCEL chains or vector database clients. TypedDict provides IDE autocompletion and catches key typos at write time, but it does not validate data at runtime like Pydantic does.

How do you batch chunks for embedding using itertools?

Use itertools.islice to lazily pull fixed-size batches from a generator of document chunks without loading all chunks into memory at once. In each iteration, slice the next batch_size items, embed them via the API, and upsert into the vector store. This pattern is essential when processing large corpora where the full dataset does not fit in memory.

How does asyncio.Queue help with parallel embedding pipelines?

asyncio.Queue implements a coroutine-safe producer-consumer pattern within a single event loop. Producers enqueue EmbeddingJob dataclasses, and multiple async workers pull jobs from the queue, call the embedding API, and push results to an output queue. This enables controlled concurrency with a fixed number of workers, preventing API rate limit errors.

Why should you validate LLM-generated JSON with Pydantic instead of json.loads?

json.loads only checks syntax — it produces a dict that accepts any keys and values. If the LLM omits a required field, returns a string where a number is expected, or returns null where your code expects a value, the error surfaces deep in your pipeline as a KeyError or type mismatch. Pydantic raises a ValidationError immediately at parse time with field-level detail, catching malformed data at the boundary.