Python Data Structures for GenAI Engineers (2026)
1. Why Data Structures Shape GenAI System Quality
Most Python tutorials treat data structures as a prerequisite topic: learn lists, move on. In GenAI engineering, data structures are never a prerequisite — they are an active design decision that affects correctness, latency, and cost.
Consider what a single RAG query touches before it returns an answer. The user message is a dict inside a list passed to the LLM API. Embedding API responses return a list[float] that your vector database stores and retrieves. Retrieved chunks arrive as a list[dict], deduplicated with a set, and merged with a dict keyed by source. The final prompt is assembled from several str concatenations constrained by a token count. The LLM’s response is parsed and validated by a Pydantic model. Any one of those steps getting the data structure wrong — using a plain dict where a validated Pydantic model belongs, or a list where a set would prevent duplicates — introduces either silent data corruption or runtime errors that appear only in production.
This guide is not an introduction to Python. It is a practical map of which data structures GenAI engineers reach for, why they reach for them, and where each one fails. It focuses on the structures every production LLM application actually uses.
Related foundation: Python for GenAI Engineers covers async patterns and error handling; this guide covers the data layer those patterns operate on.
2. Core Data Structures for GenAI — The Practical Map
Every GenAI application uses the same small set of data structures repeatedly. Understanding their roles and limitations upfront prevents the category of bugs that only appear in production.
| Structure | Where it appears in GenAI code | Key limitation to know |
|---|---|---|
| list | Prompt messages, retrieved chunks, embedding batches | Ordered, allows duplicates — deduplication is manual |
| dict | Prompt message objects, metadata, tool call arguments | No schema enforcement — KeyError risks at runtime |
| set | Deduplication of retrieved chunk IDs or sources | Unordered, elements must be hashable |
| dataclass | Internal pipeline state, intermediate results | No runtime validation — bad data passes silently |
| Pydantic BaseModel | LLM output parsing, API schemas, configuration | Slightly higher overhead, but catches malformed data |
| collections.deque | Rolling conversation context windows | Evicts oldest items silently once maxlen is reached |
| collections.Counter | Token frequency analysis, keyword counting | Not serializable as-is — convert to dict for JSON |
| collections.defaultdict | Grouping chunks by source document | Default factory can mask KeyError logic errors |
The most consequential distinction is between dict and Pydantic BaseModel. A dict accepts any key and any value. A Pydantic model enforces a declared schema at runtime. When an LLM response is missing a field your code later accesses, the dict version raises KeyError deep in your pipeline. The Pydantic version raises ValidationError immediately at the boundary where the LLM output was parsed.
3. Lists, Dicts, and Sets in AI Context
These three structures handle the majority of data flow in any GenAI pipeline — understanding their roles prevents the subtle bugs that only appear in production.
Lists as the Backbone of LLM API Communication
Every call to an LLM API passes a list[dict] as the messages argument. This structure is non-negotiable — it is the interface the APIs define.
```python
# Requires: openai>=1.0.0
from openai import AsyncOpenAI
from typing import TypedDict

# TypedDict gives type hints without runtime overhead
class Message(TypedDict):
    role: str  # "system" | "user" | "assistant" | "tool"
    content: str

# A conversation is a list of messages
conversation: list[Message] = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is retrieval-augmented generation?"},
]

client = AsyncOpenAI()
response = await client.chat.completions.create(
    model="gpt-4o-mini",
    messages=conversation,  # list[dict] — the only accepted format
)

# Append the assistant's reply to maintain conversation history
conversation.append({
    "role": "assistant",
    "content": response.choices[0].message.content,
})
```

List operations that matter:
- `list.append()` — O(1), the correct way to grow a conversation
- `list[start:end]` — O(k) slicing for context truncation (drop oldest messages)
- `list.extend(other)` — merge retrieved chunks from multiple sources
- `sorted(results, key=lambda x: x["score"], reverse=True)` — rerank retrieved results
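The slicing bullet is worth seeing concretely. Below is a minimal sketch of message-count truncation that keeps the system message while dropping the oldest turns; the `truncate_history` helper and the budget of 6 messages are illustrative, not from any library:

```python
def truncate_history(messages: list[dict], max_messages: int = 6) -> list[dict]:
    """Keep the system message plus the most recent turns within the budget."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    keep = max(max_messages - len(system), 0)
    # A negative-index slice keeps the newest `keep` turns
    return system + (turns[-keep:] if keep else [])

history = [{"role": "system", "content": "Be concise."}] + [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"turn {i}"}
    for i in range(10)
]
trimmed = truncate_history(history, max_messages=6)
# trimmed holds the system message plus the 5 most recent turns
```

In production the budget would be measured in tokens rather than message count, but the slicing pattern is the same.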
Dicts as Metadata Containers
Retrieved chunks from vector databases arrive with metadata: source document, page number, section heading, retrieval score. Dicts carry this metadata.
```python
from typing import TypedDict

class RetrievedChunk(TypedDict):
    id: str
    text: str
    score: float
    metadata: dict[str, str | int | float]

# Typical vector database result
chunks: list[RetrievedChunk] = [
    {
        "id": "doc_42_chunk_7",
        "text": "RAG pipelines combine retrieval with generation...",
        "score": 0.923,
        "metadata": {"source": "rag-guide.pdf", "page": 12, "section": "Architecture"},
    },
    {
        "id": "doc_17_chunk_3",
        "text": "Hybrid retrieval uses both dense and sparse methods...",
        "score": 0.891,
        "metadata": {"source": "retrieval-survey.pdf", "page": 5, "section": "Methods"},
    },
]

# Access with .get() to avoid KeyError on optional metadata fields
for chunk in chunks:
    source = chunk["metadata"].get("source", "unknown")
    page = chunk["metadata"].get("page", "N/A")
```

Dict patterns that matter:
- `dict.get(key, default)` — safe access when keys may be absent (common with LLM-generated JSON)
- `{**base_dict, **overrides}` — merge dicts for prompt template substitution
- dict comprehensions — transform lists of chunks into lookup tables by ID
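The merge and comprehension patterns are short enough to sketch. All names below (`base`, `overrides`, `by_id`) are illustrative:

```python
# Merge defaults with per-request overrides — the right-hand dict wins on conflicts
base = {"model": "gpt-4o-mini", "temperature": 0.0, "top_k": 5}
overrides = {"temperature": 0.7}
params = {**base, **overrides}
# params carries temperature=0.7 while the other keys keep their defaults

# Dict comprehension: turn a list of chunks into an O(1) lookup table keyed by ID
chunk_list = [
    {"id": "doc_42_chunk_7", "score": 0.923},
    {"id": "doc_17_chunk_3", "score": 0.891},
]
by_id = {c["id"]: c for c in chunk_list}
# by_id["doc_42_chunk_7"] retrieves a chunk without scanning the list
```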
Sets for Deduplication
Hybrid retrieval (vector + keyword search) often returns the same chunk from both sources. Sets deduplicate in O(1) per lookup.
```python
async def deduplicate_results(
    vector_results: list[RetrievedChunk],
    keyword_results: list[RetrievedChunk],
    top_k: int = 5,
) -> list[RetrievedChunk]:
    """Merge hybrid retrieval results, deduplicate by ID, sort by score."""
    seen_ids: set[str] = set()
    merged: list[RetrievedChunk] = []

    # Interleave results, preferring higher scores
    all_results = sorted(
        vector_results + keyword_results,
        key=lambda x: x["score"],
        reverse=True,
    )

    for chunk in all_results:
        if chunk["id"] not in seen_ids:
            seen_ids.add(chunk["id"])
            merged.append(chunk)
            if len(merged) == top_k:
                break

    return merged
```

Sets also deduplicate source citations — presenting a user with the same source document listed twice looks careless:

```python
# Extract unique sources for citation display
unique_sources: set[str] = {
    chunk["metadata"].get("source", "unknown") for chunk in final_chunks
}
citations = sorted(unique_sources)  # Sorted for stable presentation
```

4. Pydantic Models for LLM Output Validation
Pydantic is the most important data structure tool in the GenAI stack. Its role is to enforce schema contracts at the boundary where LLM-generated data enters your application.
Why Plain Dicts Fail for LLM Output
An LLM instructed to return JSON may:
- Omit a required field
- Return a string where a number is expected
- Return `null` where your code expects a non-null value
- Wrap the JSON in markdown fences (`` ```json ... ``` ``)
Each of these produces a dict that appears valid until you access the problematic field deep in your pipeline. Pydantic catches all of these at parse time.
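The markdown-fence failure mode in particular is worth handling explicitly before parsing. Below is a minimal stdlib sketch; the `strip_json_fences` helper and its regex are illustrative, and in practice the cleaned string would then be handed to a Pydantic model rather than used as a raw dict:

```python
import json
import re

def strip_json_fences(raw: str) -> str:
    """Remove ```json ... ``` fences that an LLM may wrap around its output."""
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", raw, re.DOTALL)
    return match.group(1) if match else raw.strip()

llm_output = '```json\n{"claim": "RAG reduces hallucinations", "confidence": "high"}\n```'
data = json.loads(strip_json_fences(llm_output))
# Without stripping, json.loads would raise JSONDecodeError on the fences
```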
Defining Pydantic Models for Structured LLM Output
```python
# Requires: pydantic>=2.0.0
from pydantic import BaseModel, Field, field_validator
from typing import Literal
from enum import Enum

class Confidence(str, Enum):
    high = "high"
    medium = "medium"
    low = "low"

class ExtractedFact(BaseModel):
    claim: str = Field(description="The factual claim extracted from the document")
    source_sentence: str = Field(description="The exact sentence the claim is drawn from")
    confidence: Confidence
    requires_verification: bool

    @field_validator("claim", "source_sentence")
    @classmethod
    def not_empty(cls, v: str) -> str:
        if not v.strip():
            raise ValueError("Field cannot be empty")
        return v.strip()

class DocumentAnalysis(BaseModel):
    document_id: str
    summary: str = Field(max_length=500)
    key_facts: list[ExtractedFact] = Field(min_length=1)
    overall_confidence: Confidence
    recommended_action: Literal["publish", "review", "reject"]
```

Using Pydantic with OpenAI Structured Output
```python
# Requires: openai>=1.0.0, pydantic>=2.0.0
from openai import AsyncOpenAI
from pydantic import ValidationError

client = AsyncOpenAI()

async def analyze_document(doc_id: str, text: str) -> DocumentAnalysis | None:
    try:
        response = await client.beta.chat.completions.parse(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": (
                        "Extract key facts from the document. "
                        "Return structured JSON matching the schema."
                    ),
                },
                {"role": "user", "content": f"Document ID: {doc_id}\n\n{text}"},
            ],
            response_format=DocumentAnalysis,
        )
        return response.choices[0].message.parsed
    except ValidationError as e:
        # Log the errors for prompt improvement — these are signal, not noise
        print(f"LLM output failed validation for {doc_id}: {e.errors()}")
        return None
```

Using Pydantic with LangChain
```python
# Requires: langchain-openai>=0.1.0, langchain-core>=0.1.0
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
# .with_structured_output() accepts a Pydantic model as the target schema
structured_llm = llm.with_structured_output(DocumentAnalysis)

prompt = ChatPromptTemplate.from_messages([
    ("system", "Extract key facts from the document."),
    ("user", "Document ID: {doc_id}\n\n{text}"),
])

chain = prompt | structured_llm

# result is a validated DocumentAnalysis instance, not a dict
result: DocumentAnalysis = await chain.ainvoke({"doc_id": doc_id, "text": text})
print(result.overall_confidence)  # Type-safe — IDE knows this is Confidence enum
```

5. Data Flow in a RAG Pipeline — Visual Diagram
Understanding how data structures transform through each stage of a RAG pipeline clarifies why each structure was chosen. Raw query strings become embedding vectors; vectors become ranked chunk lists; chunks become formatted prompt strings; LLM responses become validated Pydantic objects.
[Figure: Data Structures Through a RAG Pipeline — each stage transforms data; the right structure at each boundary prevents silent failures.]
The transitions that most often break are the ones involving LLM output: moving from raw LLM response text to a structured data type. Using Pydantic at that boundary is not optional in production — it is how the system stays correct when the LLM is inconsistent.
6. Dataclasses vs Pydantic — Knowing When to Use Each
Dataclasses and Pydantic models are both ways to give structure to data, but they serve different purposes. Choosing the wrong one leads to either over-engineering or undetected data corruption.
Dataclasses: Internal Pipeline State
Dataclasses are Python’s standard library tool for structured data containers. They provide type hints, auto-generated __init__, __repr__, and __eq__. They do not validate at runtime.
```python
from dataclasses import dataclass, field
import time

@dataclass
class RetrievalResult:
    query: str
    chunks: list[dict]
    retrieval_latency_ms: float
    timestamp: float = field(default_factory=time.time)
    from_cache: bool = False

@dataclass
class PipelineState:
    query_id: str
    original_query: str
    rewritten_query: str | None = None
    retrieval_result: RetrievalResult | None = None
    generated_answer: str | None = None
    total_tokens: int = 0

# Internal use — no external data, no validation needed
state = PipelineState(
    query_id="req_abc123",
    original_query="What is RAG?",
)
state.rewritten_query = "What is retrieval-augmented generation and how does it work?"
```

Use dataclasses when:
- Data is created and consumed entirely within your own code
- You control all the code paths that set values
- Performance is critical (no validation overhead)
- You need `__slots__` for memory efficiency on large batches
Pydantic: External Data Boundaries
Use Pydantic whenever data comes from or goes to an external source — LLM APIs, user-facing REST APIs, configuration files.
```python
# Requires: pydantic>=2.0.0, pydantic-settings>=2.0.0
from pydantic import BaseModel, Field
from pydantic_settings import BaseSettings, SettingsConfigDict
from typing import Annotated

class RAGRequest(BaseModel):
    """Validates incoming API requests from users."""
    query: Annotated[str, Field(min_length=1, max_length=2000)]
    top_k: int = Field(default=5, ge=1, le=20)
    temperature: float = Field(default=0.0, ge=0.0, le=2.0)
    filter_sources: list[str] = Field(default_factory=list)

class RAGResponse(BaseModel):
    """Validates outgoing API responses."""
    query_id: str
    answer: str
    sources: list[str]
    tokens_used: int
    retrieval_latency_ms: float

class AppSettings(BaseSettings):
    """Validates environment variables at application startup."""
    # Pydantic v2 style: model_config replaces the deprecated `class Config`
    model_config = SettingsConfigDict(env_file=".env")

    openai_api_key: str
    pinecone_api_key: str
    embedding_model: str = "text-embedding-3-small"
    generation_model: str = "gpt-4o-mini"
    max_concurrent_requests: int = Field(default=20, ge=1, le=100)
```

Decision rule:
| Scenario | Use |
|---|---|
| LLM output parsing | Pydantic |
| API request/response body | Pydantic |
| Environment configuration | Pydantic (BaseSettings) |
| Internal pipeline state | dataclass |
| In-memory agent memory | dataclass |
| Tool call arguments (to be validated) | Pydantic |
| Intermediate computation results | dataclass |
The key insight: if bad data could arrive at the structure — and produce a silent incorrect result or downstream crash — use Pydantic. If the structure only holds data your own code created and already validated, use a dataclass.
7. Collections and Itertools for GenAI Pipelines
The collections and itertools modules contain data structures and utilities that solve specific GenAI problems cleanly.
collections.deque — Rolling Context Windows
Conversation history has a token budget. When the budget is exceeded, you need to evict the oldest messages efficiently. deque with maxlen does this in O(1):
```python
from collections import deque

class ConversationMemory:
    def __init__(self, max_messages: int = 20):
        # maxlen automatically evicts oldest items when at capacity
        self._messages: deque[dict] = deque(maxlen=max_messages)
        self._system_prompt: str = ""

    def set_system(self, prompt: str) -> None:
        self._system_prompt = prompt

    def add_user(self, content: str) -> None:
        self._messages.append({"role": "user", "content": content})

    def add_assistant(self, content: str) -> None:
        self._messages.append({"role": "assistant", "content": content})

    def as_message_list(self) -> list[dict]:
        messages = []
        if self._system_prompt:
            messages.append({"role": "system", "content": self._system_prompt})
        messages.extend(self._messages)
        return messages

# Usage
memory = ConversationMemory(max_messages=10)
memory.set_system("You are a GenAI expert assistant.")
memory.add_user("What is a vector database?")
memory.add_assistant("A vector database stores high-dimensional embeddings...")
# When the 11th message is added, the 1st is automatically dropped
```

collections.defaultdict — Grouping Chunks by Source
When merging results from multiple retrieval passes, grouping chunks by their source document aids deduplication and citation:
```python
from collections import defaultdict

def group_chunks_by_source(
    chunks: list[dict],
) -> dict[str, list[dict]]:
    """Group retrieved chunks by their source document."""
    by_source: defaultdict[str, list[dict]] = defaultdict(list)

    for chunk in chunks:
        source = chunk["metadata"].get("source", "unknown")
        by_source[source].append(chunk)

    # Sort each source's chunks by score
    return {
        source: sorted(chunk_list, key=lambda c: c["score"], reverse=True)
        for source, chunk_list in by_source.items()
    }

# Result: {"rag-guide.pdf": [...], "retrieval-survey.pdf": [...]}
```

collections.Counter — Token and Keyword Analysis
Analyzing keyword frequency across a document corpus or within retrieved chunks helps calibrate retrieval quality:
```python
from collections import Counter
import re

def analyze_keyword_coverage(
    query: str,
    chunks: list[dict],
) -> dict[str, int]:
    """Count how often query keywords appear in retrieved chunks."""
    query_keywords = set(re.findall(r'\b\w{4,}\b', query.lower()))
    combined_text = " ".join(c["text"].lower() for c in chunks)
    words = re.findall(r'\b\w+\b', combined_text)

    word_counts = Counter(words)
    # Return only counts for query-relevant keywords
    return {kw: word_counts.get(kw, 0) for kw in query_keywords}
```

itertools.islice — Lazy Chunk Batching
When processing large document corpora for embedding, loading all chunks into memory at once is impractical. itertools.islice enables lazy batching:
```python
import itertools
from typing import Iterator

def batch_chunks(
    chunks: Iterator[str],
    batch_size: int = 100,
) -> Iterator[list[str]]:
    """Yield batches of chunks without loading all into memory."""
    while True:
        batch = list(itertools.islice(chunks, batch_size))
        if not batch:
            break
        yield batch

# Usage with a generator of document chunks
async def embed_corpus_lazy(document_path: str) -> None:
    chunk_gen = generate_chunks_from_file(document_path)  # generator
    for batch in batch_chunks(chunk_gen, batch_size=50):
        embeddings = await embed_batch(batch)  # 50 at a time, not all at once
        await vector_store.upsert(batch, embeddings)
```

8. Async Data Patterns for GenAI Pipelines
Data structures intersect with async patterns in specific ways that matter for production GenAI systems.
Async-Safe Shared State with asyncio.Queue
When multiple async workers process documents concurrently, shared mutable state requires care. asyncio.Queue is coroutine-safe (safe for concurrent tasks on a single event loop, but not across OS threads) and is the standard structure for producer-consumer patterns:
```python
import asyncio
from dataclasses import dataclass

@dataclass
class EmbeddingJob:
    doc_id: str
    chunk_index: int
    text: str

@dataclass
class EmbeddingResult:
    doc_id: str
    chunk_index: int
    embedding: list[float]

async def embedding_worker(
    queue_in: asyncio.Queue[EmbeddingJob],
    queue_out: asyncio.Queue[EmbeddingResult],
    client: "AsyncOpenAI",
    worker_id: int,
) -> None:
    """Consumer: pulls jobs from queue, pushes results."""
    while True:
        job = await queue_in.get()
        try:
            response = await client.embeddings.create(
                model="text-embedding-3-small",
                input=job.text,
            )
            result = EmbeddingResult(
                doc_id=job.doc_id,
                chunk_index=job.chunk_index,
                embedding=response.data[0].embedding,
            )
            await queue_out.put(result)
        finally:
            queue_in.task_done()

async def run_parallel_ingestion(
    chunks: list[tuple[str, int, str]],  # (doc_id, chunk_index, text)
    num_workers: int = 10,
) -> list[EmbeddingResult]:
    """Run embedding workers in parallel with a shared queue."""
    from openai import AsyncOpenAI
    client = AsyncOpenAI()

    queue_in: asyncio.Queue[EmbeddingJob] = asyncio.Queue()
    queue_out: asyncio.Queue[EmbeddingResult] = asyncio.Queue()

    # Populate input queue
    for doc_id, chunk_index, text in chunks:
        await queue_in.put(EmbeddingJob(doc_id, chunk_index, text))

    # Start workers
    workers = [
        asyncio.create_task(embedding_worker(queue_in, queue_out, client, i))
        for i in range(num_workers)
    ]

    # Wait for all jobs to complete
    await queue_in.join()

    # Cancel workers
    for worker in workers:
        worker.cancel()

    # Collect results
    results = []
    while not queue_out.empty():
        results.append(queue_out.get_nowait())

    return results
```

Streaming Pydantic Validation
When streaming LLM output, partial JSON arrives token by token. Validate only when the stream completes:
```python
# Requires: pydantic>=2.0.0, openai>=1.0.0
import json
from pydantic import BaseModel, ValidationError
from openai import AsyncOpenAI

class StructuredAnswer(BaseModel):
    answer: str
    confidence: float
    sources_used: list[str]

async def stream_and_validate(prompt: str) -> StructuredAnswer | None:
    client = AsyncOpenAI()
    collected_tokens: list[str] = []

    # Collect all streaming tokens
    async with client.chat.completions.stream(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    ) as stream:
        async for event in stream:
            if event.type == "content.delta" and event.delta:
                collected_tokens.append(event.delta)

    # Validate the complete response
    raw_text = "".join(collected_tokens)
    try:
        data = json.loads(raw_text)
        return StructuredAnswer.model_validate(data)
    except (json.JSONDecodeError, ValidationError) as e:
        print(f"Validation failed: {e}")
        return None
```

TypedDict for Lightweight Interop
When working with libraries that return plain dicts (LangChain LCEL chains, many vector DB clients), TypedDict adds type safety without the overhead of Pydantic instantiation:
```python
from typing import TypedDict, Required

class VectorSearchResult(TypedDict, total=False):
    id: Required[str]
    score: Required[float]
    text: Required[str]
    source: str  # Optional — not all results have this
    page: int  # Optional

# Type-safe access — IDE catches typos at write time
def extract_texts(results: list[VectorSearchResult]) -> list[str]:
    return [r["text"] for r in results]
```

9. Interview Preparation — Common Data Structure Questions
Interviewers for GenAI engineering roles ask data structure questions in the context of LLM systems. These are not abstract algorithmic questions. They test whether you know which structure fits a production scenario.
Question 1: “How would you store and manage a multi-turn conversation history with a token limit?”
Use collections.deque with maxlen set to the maximum number of messages the token budget allows. When the deque reaches capacity, it automatically evicts the oldest message — O(1) operation with no manual index management. Maintain the system prompt separately (not in the deque) so it is never evicted. Convert to list only when building the API call.
Question 2: “You have retrieved 15 chunks from vector search and 15 from BM25 keyword search. Several chunks appear in both result sets. How do you merge and deduplicate them?”
Track seen chunk IDs in a `set[str]` as you iterate over results sorted by score. For each result, check `id in seen_ids` in O(1). If not seen, add to the `seen_ids` set and append to the output list. Stop when you have `top_k` unique results. Do not use a list for deduplication — `in` on a list is O(n), making the naive approach O(n²) for large result sets.
Question 3: “An LLM returns JSON for a structured extraction task. What can go wrong if you parse it with json.loads() and access fields directly as a dict?”
Silent failures: the LLM may omit a required field (you get KeyError later), return a string where you expect an integer (type errors downstream), or return null where your code expects a non-null value. A Pydantic model raises ValidationError immediately at the parse step, giving you a clear error with field-level detail. The dict approach defers the failure — often to a silent data corruption rather than an exception.
Question 4: “When would you choose a dataclass over a Pydantic BaseModel for an agent’s memory state?”
Use a dataclass for agent memory state that your own code writes and reads — it has no validation overhead and the structure is fully under your control. Use Pydantic if the memory state is serialized to a database and reloaded (data may be corrupted in storage), or if the state is set by an external call like an LLM tool invocation where the LLM could return malformed data. The boundary question is: “Can bad data arrive here without passing through code I control?”
10. Summary and Next Steps
Python data structures in GenAI engineering are not interchangeable. The right structure at each pipeline boundary determines whether your system catches errors early or fails silently in production.
The five patterns to internalize:
- List + dict for API communication. Every LLM API call takes `list[dict]`. Know how to build, append, slice, and token-budget these message lists.
- Set for deduplication. Hybrid retrieval always produces overlapping results. O(1) `set` membership testing is the correct tool — not `if x in list`.
- Pydantic at external boundaries. Any data that crosses a system boundary — LLM output, API request, environment config — belongs in a Pydantic model. `ValidationError` at parse time is better than `KeyError` deep in business logic.
- Dataclass for internal state. Pipeline intermediate state, agent memory, and results your own code creates belong in dataclasses. No validation overhead, full IDE support.
- `collections.deque` for rolling windows. Token budgets require evicting old messages. `deque(maxlen=N)` does this automatically and in O(1).
Where to go next:
- Python for GenAI Engineers — Production async patterns and Pydantic in depth, including retry logic and structured output
- Async Python Guide — `asyncio.gather`, semaphores, and streaming, which operate on the data structures covered here
- RAG Architecture Guide — How lists, dicts, sets, and Pydantic models compose into a full retrieval-augmented generation system
- AI Agents and Agentic Systems — Agent memory and tool call arguments are where dataclasses and Pydantic interact most
The data layer is invisible when it works correctly and catastrophic when it does not. Getting these fundamentals right is what separates GenAI code that works in a demo from code that runs reliably in production.
Related
- Python for GenAI — Python foundations for AI development
- Async Python — Concurrency for LLM API calls
- LLM Database Integration — Connecting LLMs to databases
- RAG Chunking — Data structures in chunking pipelines
Frequently Asked Questions
What Python data structures are essential for AI and LLM applications?
Lists and dicts are the daily workhorses for prompt messages, retrieval results, and tool calls. Sets enable fast deduplication of retrieved chunks. Dataclasses structure pipeline intermediate state. Pydantic models validate LLM-generated JSON at runtime. The collections module offers deque for rolling context windows and Counter for token frequency analysis.
Why is Pydantic important for LLM applications?
LLMs return unstructured text or loosely formatted JSON. Pydantic models enforce a schema at runtime — if the LLM output is missing a required field or has a wrong type, a ValidationError is raised immediately rather than propagating a silent data error downstream. OpenAI's structured output feature and LangChain's with_structured_output both accept Pydantic models as the target schema.
When should I use dataclasses vs Pydantic in a GenAI project?
Use dataclasses for internal pipeline state that never crosses an external boundary: intermediate retrieval results, pipeline stage outputs, in-memory agent state. Use Pydantic when data comes from or goes to an external source: LLM outputs, API request/response bodies, configuration loaded from environment variables. The rule of thumb is: if it could be wrong, use Pydantic.
Which Python collections are most useful for storing and processing embeddings?
Plain Python lists store small embedding batches. For large-scale numeric work use NumPy arrays (faster dot-product, slicing). deque from the collections module manages rolling conversation context windows with O(1) appends and pops. defaultdict simplifies grouping retrieved chunks by source document. Counter tracks token frequencies across a corpus for analysis.
How do you use sets for deduplication in a RAG pipeline?
Track seen chunk IDs in a set as you iterate over results sorted by score. For each result, check membership in O(1) with id in seen_ids. If not seen, add the ID to the set and append the chunk to the output list. This prevents duplicate chunks from appearing in the final context, which is critical in hybrid RAG retrieval where vector search and keyword search return overlapping results.
How does collections.deque manage conversation context windows?
Create a deque with maxlen set to the maximum number of messages your token budget allows. When the deque reaches capacity, it automatically evicts the oldest message in O(1) with no manual index management. The system prompt is maintained separately so it is never evicted. Convert the deque to a list only when building the final API call.
What is TypedDict and when should you use it instead of Pydantic?
TypedDict adds type safety to plain dicts without the overhead of Pydantic instantiation. Use it when working with libraries that return plain dicts, such as LangChain LCEL chains or vector database clients. TypedDict provides IDE autocompletion and catches key typos at write time, but it does not validate data at runtime like Pydantic does.
How do you batch chunks for embedding using itertools?
Use itertools.islice to lazily pull fixed-size batches from a generator of document chunks without loading all chunks into memory at once. In each iteration, slice the next batch_size items, embed them via the API, and upsert into the vector store. This pattern is essential when processing large corpora where the full dataset does not fit in memory.
How does asyncio.Queue help with parallel embedding pipelines?
asyncio.Queue implements a thread-safe producer-consumer pattern within the event loop. Producers enqueue EmbeddingJob dataclasses, and multiple async workers pull jobs from the queue, call the embedding API, and push results to an output queue. This enables controlled concurrency with a fixed number of workers, preventing API rate limit errors.
Why should you validate LLM-generated JSON with Pydantic instead of json.loads?
json.loads only checks syntax — it produces a dict that accepts any keys and values. If the LLM omits a required field, returns a string where a number is expected, or returns null where your code expects a value, the error surfaces deep in your pipeline as a KeyError or type mismatch. Pydantic raises a ValidationError immediately at parse time with field-level detail, catching malformed data at the boundary.