
LlamaIndex Tutorial — Build RAG Without LangChain (2026)

This LlamaIndex tutorial takes you from pip install to a working RAG pipeline in under 50 lines of Python. You will load documents, build a vector index, create a query engine, and get grounded answers from your own data. Every code example runs as-is with LlamaIndex 0.10+.

Who this is for:

  • Beginners: You want a hands-on introduction to RAG using a framework designed specifically for retrieval
  • RAG builders: You need a faster path to production than building from scratch with raw embeddings and vector stores
  • Engineers evaluating options: You are comparing LlamaIndex against LangChain or Haystack for your next project

LlamaIndex is purpose-built for connecting LLMs to your data. Where LangChain is a general-purpose orchestration framework that handles chains, agents, and tool calling, LlamaIndex focuses on one thing: making your data searchable and queryable by an LLM.

  • Data-first design — Every abstraction in LlamaIndex exists to solve a data retrieval problem. Documents, Nodes, Indexes, and Query Engines form a pipeline that mirrors how production RAG systems actually work.
  • Built-in chunking intelligence — LlamaIndex’s SentenceSplitter and SemanticSplitterNodeParser handle document chunking out of the box. You configure chunk size and overlap; the framework handles sentence boundaries and metadata propagation.
  • 160+ data loaders — LlamaHub provides readers for PDFs, Google Docs, Slack, Notion, databases, APIs, and more. Loading data into LlamaIndex rarely requires custom code.
  • Index variety — VectorStoreIndex for similarity search, TreeIndex for hierarchical summarization, KeywordIndex for term-based lookup. Each index type optimizes for a different retrieval pattern.

If your primary use case is document search and Q&A, LlamaIndex gets you to a working prototype in 15 minutes. The equivalent LangChain setup requires more boilerplate because LangChain’s abstractions are designed for broader use cases.


2. LlamaIndex vs LangChain

Both frameworks are production-grade, but they solve different problems. Choosing the wrong one means fighting against the framework instead of building with it.

| Dimension | LlamaIndex | LangChain |
| --- | --- | --- |
| Primary strength | Data ingestion, indexing, and retrieval | Chain composition, agent orchestration, tool calling |
| RAG pipeline | 5-10 lines for a complete pipeline | 15-25 lines with more manual wiring |
| Agent workflows | Basic agent support (improving) | Mature agent framework with LangGraph |
| Data loaders | 160+ via LlamaHub | 80+ via community integrations |
| Chunking | Built-in semantic and sentence splitters | Manual setup with RecursiveCharacterTextSplitter |
| Index types | Vector, Tree, Keyword, Knowledge Graph | Vector only (via retrievers) |
| Learning curve | Lower for RAG-focused work | Lower for general LLM app development |
| Best for | Document Q&A, enterprise search, knowledge bases | Chatbots, multi-tool agents, complex workflows |

Choose LlamaIndex when:

  • Your app is primarily document search and Q&A over internal data
  • You need multiple index types (vector + keyword hybrid, tree-based summarization)
  • You are working with structured data (SQL tables, knowledge graphs) alongside unstructured text
  • You want the fastest path from raw documents to working retrieval

Choose LangChain when:

  • Your app needs complex agent logic with tool calling, reasoning loops, and state management
  • You are building multi-step workflows with conditional branching
  • You need to orchestrate multiple LLM calls in a pipeline (summarize, then classify, then extract)
  • You want a unified interface across different LLM providers

Many production systems use both. LlamaIndex handles the retrieval layer (ingestion, indexing, querying), and LangChain handles the orchestration layer (agent logic, chain composition). LlamaIndex’s as_langchain_tool() method makes this integration seamless.
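
A minimal sketch of that bridge, assuming the QueryEngineTool route in llama-index-core (the tool name and description below are placeholders, and the exact helper name varies by version):

# Sketch: exposing a LlamaIndex query engine as a LangChain tool
# (assumes `index` is a VectorStoreIndex built as in Section 4 and that
# langchain is installed; verify the helper names against your version)
from llama_index.core.tools import QueryEngineTool
tool = QueryEngineTool.from_defaults(
    query_engine=index.as_query_engine(),
    name="company_docs",  # placeholder tool name
    description="Answers questions about internal company documents",
)
langchain_tool = tool.to_langchain_tool()  # hand this to a LangChain agent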


3. Core Concepts — The LlamaIndex Pipeline


LlamaIndex organizes RAG into five stages. Understanding these stages is all you need to build and customize any pipeline.

  1. Documents — Raw data loaded from files, APIs, or databases. Each Document object holds text content and metadata (filename, page number, source URL). LlamaIndex’s SimpleDirectoryReader loads entire folders of mixed file types.

  2. Nodes — Chunks created by splitting Documents. Each Node preserves its parent Document reference and optional relationships to neighboring Nodes. The SentenceSplitter creates Nodes at sentence boundaries with configurable overlap.

  3. Indexes — Data structures that organize Nodes for fast retrieval:

    • VectorStoreIndex — Embeds Nodes and stores vectors for similarity search. The most common choice for RAG.
    • TreeIndex — Builds a hierarchical tree of summaries. Good for answering questions that require synthesizing information across many documents.
    • KeywordIndex — Maps keywords to Nodes using term frequency. Useful as a complement to vector search in hybrid retrieval setups.
  4. Query Engines — Combine retrieval and response synthesis. A Query Engine retrieves relevant Nodes from an Index, formats them as context, and sends the augmented prompt to the LLM. You can configure the number of retrieved chunks, the synthesis strategy, and the response format.

  5. Response Synthesizers — Control how the LLM generates answers from retrieved context. CompactAndRefine (default) packs as many chunks as possible into a single prompt. TreeSummarize builds a hierarchical response from many chunks. Accumulate generates separate responses per chunk and concatenates them.

Load Documents → Parse into Nodes → Build Index → Create Query Engine → Query → Response

Each stage is pluggable. You can swap the document loader, change the chunking strategy, use a different vector store, or customize the response synthesizer — without rewriting the rest of the pipeline.
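
As a sketch of that pluggability, here is the same pipeline assembled from explicit parts, with the default synthesizer swapped for TreeSummarize (assumes OPENAI_API_KEY is set and a ./data folder exists, as in the quick start below):

# Requires: llama-index>=0.10.0
# Sketch: composing the pipeline from explicit, swappable parts
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, get_response_synthesizer
from llama_index.core.query_engine import RetrieverQueryEngine
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
retriever = index.as_retriever(similarity_top_k=5)  # swap point: retrieval strategy
synthesizer = get_response_synthesizer(response_mode="tree_summarize")  # swap point: synthesis
query_engine = RetrieverQueryEngine(retriever=retriever, response_synthesizer=synthesizer)
print(query_engine.query("Summarize the documents."))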


4. Quick Start: A RAG Pipeline in 5 Steps

This section walks you through a complete LlamaIndex RAG pipeline in 5 steps. By the end, you will have a working system that answers questions about your own documents.

Step 1: Install

# Requires: llama-index>=0.10.0
pip install llama-index llama-index-llms-openai llama-index-embeddings-openai

LlamaIndex 0.10+ uses a modular package structure. The core package (llama-index) provides the framework, and provider-specific packages add LLM and embedding integrations.

Set your OpenAI API key:

export OPENAI_API_KEY="sk-..."

Step 2: Load Documents

# Requires: llama-index>=0.10.0
from llama_index.core import SimpleDirectoryReader
# Load all files from a directory (supports PDF, TXT, DOCX, CSV, and more)
documents = SimpleDirectoryReader("./data").load_data()
print(f"Loaded {len(documents)} documents")
print(f"First doc preview: {documents[0].text[:200]}")

SimpleDirectoryReader auto-detects file types and extracts text. Place your documents in a ./data folder — PDFs, text files, Word docs, or any supported format. Each file becomes one or more Document objects with metadata.

Step 3: Build the Index

# Requires: llama-index>=0.10.0
from llama_index.core import VectorStoreIndex, Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
# Configure the LLM and embedding model
Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
# Build the index — this embeds all documents automatically
index = VectorStoreIndex.from_documents(documents)

VectorStoreIndex.from_documents() does three things in one call: splits documents into Nodes using the default SentenceSplitter, embeds each Node with your configured embedding model, and stores the vectors in an in-memory vector store. For production, you will swap the in-memory store for Qdrant, Pinecone, or Weaviate.

Step 4: Create a Query Engine

# Requires: llama-index>=0.10.0
query_engine = index.as_query_engine(
    similarity_top_k=3,  # Retrieve the top 3 most relevant chunks
)

The Query Engine wraps the index with retrieval and synthesis logic. similarity_top_k=3 means each query retrieves the 3 most similar chunks as context for the LLM.

Step 5: Query

# Requires: llama-index>=0.10.0
response = query_engine.query("What are the main topics covered in these documents?")
print(response)
# Access the source nodes used to generate the response
for node in response.source_nodes:
print(f"Source: {node.metadata.get('file_name', 'unknown')} | Score: {node.score:.4f}")

That is the complete pipeline: load, index, query. Five steps, under 20 lines of code, and you have a working RAG system that answers questions grounded in your documents.

Here is the full pipeline as a single runnable script:

# Requires: llama-index>=0.10.0
# pip install llama-index llama-index-llms-openai llama-index-embeddings-openai
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
# 1. Configure
Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
# 2. Load
documents = SimpleDirectoryReader("./data").load_data()
# 3. Index
index = VectorStoreIndex.from_documents(documents)
# 4. Query
query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query("Summarize the key points from these documents.")
print(response)

Eighteen lines. That is the LlamaIndex value proposition — a complete RAG pipeline with document loading, embedding, indexing, retrieval, and response synthesis, all handled by the framework.


5. The LlamaIndex RAG Stack

This diagram shows how data flows through the LlamaIndex RAG stack, from document loading through response generation.

LlamaIndex RAG Pipeline Stack — from raw documents to grounded LLM responses:

  1. Document Loading — SimpleDirectoryReader, PDFs, APIs, databases
  2. Node Parsing — chunking, metadata extraction, relationships
  3. Index Construction — VectorStoreIndex, TreeIndex, KeywordIndex
  4. Query Engine — retrieval plus response synthesis
  5. Response Generation — the LLM generates an answer from retrieved context

Each layer is independently configurable. You can swap SimpleDirectoryReader for a custom loader, change the chunking strategy in the Node Parser, replace the in-memory vector store with a managed database, or customize the response synthesis strategy — without touching the other layers.



6. Four Examples: Basic to Production

These four examples progress from basic usage to production-ready patterns. Each builds on the core pipeline from Section 4.

Example A: Basic Q&A with Local Models

The minimal pipeline from Section 4 already covers basic Q&A. Here is a variation that uses a local model instead of OpenAI:

# Requires: llama-index>=0.10.0
# pip install llama-index llama-index-llms-ollama llama-index-embeddings-huggingface
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, Settings
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
# Use local models — no API key needed
Settings.llm = Ollama(model="llama3.2", request_timeout=120)
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query("What are the key findings?")
print(response)

This version runs entirely on your machine using Ollama for the LLM and a HuggingFace embedding model. Zero API costs, full data privacy.

Example B: Custom Chunking with SentenceSplitter

The default SentenceSplitter works for most cases, but you can customize chunking for better retrieval quality:

# Requires: llama-index>=0.10.0
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, Settings
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
# Custom chunking: smaller chunks with more overlap
splitter = SentenceSplitter(
    chunk_size=512,  # Tokens per chunk (default: 1024)
    chunk_overlap=50,  # Overlapping tokens between chunks (default: 20)
)
documents = SimpleDirectoryReader("./data").load_data()
nodes = splitter.get_nodes_from_documents(documents)
print(f"Created {len(nodes)} nodes from {len(documents)} documents")
index = VectorStoreIndex(nodes)
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query("Explain the technical architecture in detail.")
print(response)

Smaller chunks (512 tokens) with more overlap (50 tokens) improve retrieval precision when your documents contain dense technical content. Larger chunks (1024+) work better for narrative content where context spans multiple paragraphs.

Example C: LlamaIndex with Qdrant for Production


For production, replace the in-memory store with a persistent vector database. Here is the Qdrant integration:

# Requires: llama-index>=0.10.0
# pip install llama-index llama-index-vector-stores-qdrant qdrant-client
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, StorageContext, Settings
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from qdrant_client import QdrantClient
Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
# Connect to Qdrant (docker run -p 6333:6333 qdrant/qdrant)
qdrant_client = QdrantClient(url="http://localhost:6333")
# Create a vector store backed by Qdrant
vector_store = QdrantVectorStore(
    client=qdrant_client,
    collection_name="my_documents",
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
# Load and index — vectors go to Qdrant, not memory
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
# Query — retrieves from Qdrant
query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query("What are the performance benchmarks?")
print(response)

The same 3-line query pattern works regardless of the backend. Swap QdrantVectorStore for PineconeVectorStore or WeaviateVectorStore and the rest of your code stays identical.
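
For example, the Pinecone variant changes only the store construction (a sketch; assumes a Pinecone v3+ client and an existing index named "my-documents", both placeholders):

# pip install llama-index-vector-stores-pinecone
# Sketch: only the vector-store wiring changes; everything else is identical
from pinecone import Pinecone
from llama_index.vector_stores.pinecone import PineconeVectorStore
from llama_index.core import StorageContext
pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")  # placeholder key
vector_store = PineconeVectorStore(pinecone_index=pc.Index("my-documents"))
storage_context = StorageContext.from_defaults(vector_store=vector_store)
# ...then build with VectorStoreIndex.from_documents(documents, storage_context=storage_context)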

Example D: Adding Reranking for Better Retrieval Quality


Reranking uses a cross-encoder model to re-score retrieved chunks, improving precision by 10-25% for complex queries:

# Requires: llama-index>=0.10.0
# pip install llama-index llama-index-postprocessor-cohere-rerank
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, Settings
from llama_index.postprocessor.cohere_rerank import CohereRerank
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
# Reranker: retrieve 10 candidates, rerank, return the top 3
# (CohereRerank reads COHERE_API_KEY from the environment)
reranker = CohereRerank(top_n=3)
query_engine = index.as_query_engine(
    similarity_top_k=10,  # Cast a wide net
    node_postprocessors=[reranker],  # Then rerank for precision
)
response = query_engine.query("Compare the two approaches discussed in the documents.")
print(response)

The pattern is: retrieve broadly (similarity_top_k=10), then rerank for precision (top_n=3). This two-stage approach catches relevant chunks that a tight initial retrieval might miss, then uses a cross-encoder to select the best ones. For more on advanced RAG techniques, including hybrid retrieval and query routing, see the dedicated guide.
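
If you want the same two-stage pattern without a Cohere API key, a local cross-encoder is one option (a sketch using SentenceTransformerRerank from llama_index.core; the model name is a common default, not a requirement):

# pip install llama-index sentence-transformers
# Sketch: local reranking; same retrieve-broadly-then-rerank pattern as above
from llama_index.core.postprocessor import SentenceTransformerRerank
reranker = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-6-v2",  # small local cross-encoder
    top_n=3,
)
query_engine = index.as_query_engine(similarity_top_k=10, node_postprocessors=[reranker])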


7. Limitations and Trade-Offs

LlamaIndex saves significant development time, but it comes with trade-offs you should evaluate before committing.

| Dimension | LlamaIndex | From Scratch |
| --- | --- | --- |
| Time to prototype | 15 minutes | 2-5 days |
| Time to production | 1-2 weeks | 3-6 weeks |
| Abstraction overhead | Moderate — black-box behavior in query engines | None — you control every step |
| Customization depth | Good, but requires understanding internals | Unlimited |
| Debugging | Stack traces can be deep and opaque | Direct and transparent |
| Dependency weight | ~50 packages in a full install | Only what you choose |

Skip the framework when:

  • Single-document Q&A — If you have one document and one question type, the OpenAI API with a simple prompt is faster to build and easier to maintain.
  • Fixed query patterns — If every query follows the same template with no variation, a hardcoded retrieval function beats a framework.
  • Extreme latency requirements — Each LlamaIndex abstraction layer adds 1-5ms of overhead. For sub-10ms retrieval, direct vector database calls are faster.

LlamaIndex has undergone significant API changes. The 0.10.0 release restructured the entire package system:

  • Before 0.10: Single llama-index package with everything bundled
  • After 0.10: Modular packages (llama-index-core, llama-index-llms-openai, etc.)
  • Migration impact: Import paths changed for every integration. from llama_index import became from llama_index.core import plus provider-specific packages.

Pin your versions explicitly in requirements.txt:

# Requires: llama-index>=0.10.0
llama-index==0.10.68
llama-index-llms-openai==0.1.25
llama-index-embeddings-openai==0.1.11

Check the LlamaIndex migration guide before upgrading major versions. Test your full pipeline against a golden dataset after any upgrade.

LlamaIndex’s QueryEngine combines retrieval and synthesis into one call. This is convenient for prototyping but can hide problems:

  • You cannot easily inspect what chunks were retrieved before the LLM sees them
  • Default synthesis strategies may not match your quality requirements
  • Error handling happens inside the framework, making custom retry logic harder

Mitigation: Use index.as_retriever() instead of index.as_query_engine() when you need visibility into the retrieval step. Retrieve nodes first, inspect them, then pass to the LLM yourself.
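
A sketch of that inspect-first pattern (the query string is illustrative):

# Sketch: retrieve first, inspect, then decide what reaches the LLM
retriever = index.as_retriever(similarity_top_k=5)
nodes = retriever.retrieve("What are the performance benchmarks?")
for n in nodes:  # each n is a NodeWithScore
    print(n.score, n.metadata.get("file_name", "unknown"), n.node.get_content()[:80])
# Filter, dedupe, or retry here before building the prompt yourself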


8. Interview Questions — LlamaIndex and RAG


These three questions appear frequently in GenAI engineering interviews. Each answer demonstrates the depth that interviewers look for.

Q1: “Why would you choose LlamaIndex over LangChain for a RAG system?”


Strong answer: “LlamaIndex is purpose-built for the data retrieval pipeline — loading, chunking, indexing, and querying documents. Its abstractions map directly to RAG stages, so you write less boilerplate for the same result. LangChain is a general-purpose orchestration framework where RAG is one of many capabilities. For a project where the primary workflow is document Q&A, LlamaIndex gives me VectorStoreIndex.from_documents() and index.as_query_engine() — two lines that handle embedding, storage, retrieval, and synthesis. The equivalent LangChain setup requires manually wiring the retriever, prompt template, and output parser. If I later need agent orchestration or complex multi-step chains, I would layer LangChain on top and use LlamaIndex’s as_langchain_tool() for the retrieval component.”

Q2: “Explain LlamaIndex’s indexing strategies and when to use each.”


Strong answer: “VectorStoreIndex embeds nodes and retrieves by semantic similarity — use it for general-purpose Q&A where the query and relevant text are semantically close. TreeIndex builds a hierarchical summary tree — use it when the answer requires synthesizing information across many documents, like ‘summarize the quarterly results across all 50 reports.’ KeywordIndex maps terms to nodes using keyword extraction — use it as a fallback when semantic search misses results due to vocabulary mismatch. In production, I often combine VectorStoreIndex with KeywordIndex in a hybrid approach where the vector path captures semantic matches and the keyword path catches exact term matches that embedding models sometimes miss.”

Q3: “How do you customize chunking in LlamaIndex?”


Strong answer: “LlamaIndex provides three main node parsers. SentenceSplitter breaks text at sentence boundaries with configurable chunk_size and chunk_overlap — I use 512 tokens with 50 token overlap for technical documentation where precision matters. SemanticSplitterNodeParser groups semantically similar sentences using an embedding model to create more coherent chunks — better for narrative text but slower due to embedding computation during chunking. TokenTextSplitter does fixed-token splits without sentence awareness — fast but risks splitting mid-sentence. For structured documents like API docs or legal contracts, I write custom node parsers that split on section headers and preserve the document hierarchy as node metadata. The metadata propagation is key — each node carries its parent document reference, page number, and section heading so the retrieval evaluation pipeline can trace answers back to sources.”
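
For reference, the SemanticSplitterNodeParser mentioned in that answer is wired like this (a sketch; breakpoint_percentile_threshold=95 is a commonly cited starting point, not a tuned value):

# Requires: llama-index>=0.10.0
# Sketch: embedding-based chunking; slower than SentenceSplitter because
# sentences are embedded during parsing
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding
splitter = SemanticSplitterNodeParser(
    embed_model=OpenAIEmbedding(model="text-embedding-3-small"),
    breakpoint_percentile_threshold=95,  # higher = fewer, larger chunks
)
nodes = splitter.get_nodes_from_documents(documents)  # `documents` as in Section 4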


9. Production Best Practices

Moving from a prototype to production requires handling persistence, performance, cost, and monitoring.

Re-embedding documents on every restart wastes time and money. Persist your index to disk or a managed vector store:

# Requires: llama-index>=0.10.0
# Save to disk
index.storage_context.persist(persist_dir="./storage")
# Load from disk (no re-embedding)
from llama_index.core import StorageContext, load_index_from_storage
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)

For production workloads, use a managed vector database (Qdrant, Pinecone, Weaviate) instead of local disk persistence. Managed databases provide replication, backups, and horizontal scaling.

Streaming reduces perceived latency by showing tokens as they are generated:

# Requires: llama-index>=0.10.0
query_engine = index.as_query_engine(streaming=True)
streaming_response = query_engine.query("Explain the architecture.")
for text in streaming_response.response_gen:
    print(text, end="", flush=True)

Streaming works with OpenAI, Anthropic, and local models. For web applications, pipe the generator into an SSE (Server-Sent Events) endpoint.
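
A sketch of that SSE wiring (FastAPI is chosen here for illustration; the route and frame format are assumptions, not a LlamaIndex API):

# pip install fastapi uvicorn
# Sketch: stream tokens to the browser as Server-Sent Events
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
app = FastAPI()
@app.get("/ask")
def ask(q: str):
    engine = index.as_query_engine(streaming=True)  # `index` built as above
    streaming_response = engine.query(q)
    def event_stream():
        for token in streaming_response.response_gen:
            yield f"data: {token}\n\n"  # SSE frame format
    return StreamingResponse(event_stream(), media_type="text/event-stream")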

For high-throughput applications, use async to process multiple queries concurrently:

# Requires: llama-index>=0.10.0
import asyncio
async def query_async(engine, question):
    response = await engine.aquery(question)
    return str(response)
async def main():
    questions = [
        "What is the main architecture?",
        "What are the performance benchmarks?",
        "What limitations are documented?",
    ]
    tasks = [query_async(query_engine, q) for q in questions]
    results = await asyncio.gather(*tasks)
    for q, r in zip(questions, results):
        print(f"Q: {q}\nA: {r}\n")
asyncio.run(main())

Async queries share the same index but issue parallel LLM calls. This is particularly valuable when serving multiple users or processing batch questions.

LlamaIndex tracks token usage through callbacks. Monitor your LLM costs per query:

# Requires: llama-index>=0.10.0
from llama_index.core import Settings
from llama_index.core.callbacks import CallbackManager, TokenCountingHandler
import tiktoken
token_counter = TokenCountingHandler(
    tokenizer=tiktoken.encoding_for_model("gpt-4o-mini").encode
)
Settings.callback_manager = CallbackManager([token_counter])
# After queries...
print(f"Embedding tokens: {token_counter.total_embedding_token_count}")
print(f"LLM prompt tokens: {token_counter.prompt_llm_token_count}")
print(f"LLM completion tokens: {token_counter.completion_llm_token_count}")

At OpenAI’s current pricing, gpt-4o-mini costs $0.15 per million input tokens and $0.60 per million output tokens. A typical RAG query with 3 retrieved chunks and a 200-token response costs approximately $0.0002. Track this per query to catch cost anomalies before they hit your bill.
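
The arithmetic behind that estimate, as a sanity check (the 512-token chunks follow the chunking advice above; the 100-token prompt overhead is an assumption):

# Back-of-envelope cost per query; chunk size and prompt overhead are assumptions
input_tokens = 3 * 512 + 100  # 3 retrieved chunks plus question and prompt template
output_tokens = 200
cost = input_tokens / 1_000_000 * 0.15 + output_tokens / 1_000_000 * 0.60
print(f"${cost:.4f}")  # ≈ $0.0004: the same order of magnitude, dominated by input tokens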

Production RAG systems degrade silently. Set up automated evaluation to catch retrieval drift:

  • Relevance scoring — Log the similarity scores of retrieved nodes. Alert when the average score drops below your baseline threshold (a minimal version is sketched after this list).
  • Source diversity — Track how many unique source documents contribute to answers. Low diversity may indicate over-indexing of certain content.
  • Response length — Sudden changes in average response length often signal retrieval problems (too few chunks retrieved, or irrelevant chunks diluting the context).
  • User feedback — Implement thumbs up/down on responses and correlate negative feedback with specific queries and retrieved chunks.
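
Here is a minimal version of the relevance-score alert from the first bullet (the 0.75 baseline is a placeholder; calibrate it against your own traffic):

# Sketch: alert when the average retrieval score drops below a baseline
import logging
BASELINE_SCORE = 0.75  # placeholder; derive from your historical average
def monitored_query(query_engine, question):
    response = query_engine.query(question)
    scores = [n.score for n in response.source_nodes if n.score is not None]
    avg_score = sum(scores) / len(scores) if scores else 0.0
    if avg_score < BASELINE_SCORE:
        logging.warning("Low retrieval score %.3f for query: %s", avg_score, question)
    return response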

Follow this process when upgrading LlamaIndex:

  1. Read the changelog for breaking changes (especially major/minor version bumps)
  2. Run your test suite against a golden dataset (10-20 questions with known correct answers; a minimal check is sketched after this list)
  3. Compare retrieval results — check that the same queries return the same top-k chunks
  4. Monitor costs — new versions sometimes change default embedding or LLM parameters
  5. Stage the upgrade — deploy to a staging environment and run RAG evaluation before production
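
A minimal golden-dataset check for step 2 might look like this (the questions and expected substrings are placeholders for your own corpus):

# Sketch: smoke-test retrieval quality after an upgrade
GOLDEN_SET = [  # placeholder Q&A pairs; use 10-20 from your real data
    ("What is the main architecture?", "transformer"),
    ("What limitations are documented?", "latency"),
]
for question, expected in GOLDEN_SET:
    answer = str(query_engine.query(question)).lower()
    assert expected in answer, f"Regression on: {question!r}"
print("Golden dataset checks passed")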

Key Takeaways

  • LlamaIndex is purpose-built for RAG — its abstractions (Documents, Nodes, Indexes, Query Engines) map directly to the stages of a retrieval pipeline
  • A complete RAG pipeline takes under 20 lines — SimpleDirectoryReader loads data, VectorStoreIndex.from_documents() embeds and indexes, as_query_engine().query() retrieves and generates
  • Three index types cover most use cases — VectorStoreIndex for semantic search, TreeIndex for summarization, KeywordIndex for term matching
  • Custom chunking improves retrieval quality — use SentenceSplitter with 512-token chunks and 50-token overlap for technical docs, SemanticSplitterNodeParser for narrative text
  • Production requires persistence — swap in-memory storage for Qdrant, Pinecone, or Weaviate and use storage_context.persist() for local development
  • Reranking adds 10-25% retrieval precision — retrieve broadly with similarity_top_k=10, rerank with a cross-encoder to top_n=3
  • LlamaIndex and LangChain complement each other — use LlamaIndex for retrieval, LangChain for orchestration, and as_langchain_tool() to bridge them
  • Pin your versions — LlamaIndex has had breaking changes between major versions; pin exact versions in requirements.txt and test upgrades against a golden dataset

Frequently Asked Questions

What is LlamaIndex?

LlamaIndex is an open-source Python framework purpose-built for connecting LLMs to your data. It handles the full RAG pipeline — document loading, chunking, indexing, retrieval, and response synthesis — with purpose-built abstractions that require less custom code than general-purpose frameworks. LlamaIndex supports 160+ data loaders and integrates with all major vector databases and LLM providers.

Is LlamaIndex better than LangChain for RAG?

For RAG-focused projects, LlamaIndex is typically the better choice. Its indexing strategies (VectorStoreIndex, TreeIndex, KeywordIndex), query engines, and retrieval optimization are purpose-built for document search and Q&A. LangChain is better for complex agent workflows with tool use and multi-step reasoning. Many production teams use both — LlamaIndex for retrieval, LangChain for orchestration.

How do I install LlamaIndex?

Run pip install llama-index to install the core package. For specific integrations, install additional packages: pip install llama-index-llms-openai for OpenAI models, pip install llama-index-vector-stores-qdrant for Qdrant, etc. LlamaIndex 0.10+ uses a modular package structure where you install only what you need.

What is a VectorStoreIndex?

VectorStoreIndex is LlamaIndex's most commonly used index type. It converts your documents into vector embeddings, stores them in a vector database, and retrieves the most relevant chunks at query time using similarity search. You create one with VectorStoreIndex.from_documents(documents) and query it with index.as_query_engine().query('your question').

Can LlamaIndex work with Pinecone or Qdrant?

Yes. LlamaIndex integrates with all major vector databases through dedicated packages. Install llama-index-vector-stores-qdrant for Qdrant or llama-index-vector-stores-pinecone for Pinecone. You pass the vector store to VectorStoreIndex via a StorageContext, and LlamaIndex handles embedding, upserting, and querying automatically.

How does LlamaIndex handle document chunking?

LlamaIndex uses a NodeParser to split documents into nodes (chunks). The default SentenceSplitter breaks text at sentence boundaries with configurable chunk_size (default 1024 tokens) and chunk_overlap (default 20 tokens). You can also use SemanticSplitterNodeParser for embedding-based chunking that groups semantically similar sentences, or TokenTextSplitter for fixed-token splits.

Is LlamaIndex production-ready?

Yes. LlamaIndex 0.10+ is used in production at companies processing millions of documents. Key production features include persistent storage (save and reload indexes without re-embedding), async query execution, streaming responses, integration with managed vector databases, and observability through callbacks. The modular package architecture keeps dependencies minimal for deployment.

How do I add streaming to LlamaIndex?

Create a query engine with streaming enabled: query_engine = index.as_query_engine(streaming=True). Then call streaming_response = query_engine.query('your question') and iterate over tokens with for text in streaming_response.response_gen: print(text). Streaming works with all major LLM providers including OpenAI, Anthropic, and local models.

What embedding models work with LlamaIndex?

LlamaIndex supports OpenAI embeddings (text-embedding-3-small, text-embedding-3-large), Cohere embeddings, Hugging Face models via sentence-transformers, Google Vertex AI embeddings, and local models via Ollama. You configure the embedding model globally with Settings.embed_model or per-index. For most projects, text-embedding-3-small offers the best cost-to-quality ratio.

Can I use LlamaIndex and LangChain together?

Yes, and this is a common production pattern. Use LlamaIndex for what it does best — indexing and retrieval — and LangChain for agent orchestration and complex workflows. LlamaIndex provides an as_langchain_tool() method that wraps any query engine as a LangChain tool. This lets you use LlamaIndex retrieval inside LangChain agents with zero friction.

Last updated: March 2026 | LlamaIndex 0.10+ / Python 3.10+