Retrieval 组件

LangChain 的 RAG 组件:从文档加载到向量检索,全流程覆盖。


文档加载(Document Loaders)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
from langchain_community.document_loaders import (
WebBaseLoader, # 网页
PyPDFLoader, # PDF
TextLoader, # 文本
CSVLoader, # CSV
NotionDirectoryLoader, # Notion
)

# 网页
loader = WebBaseLoader("https://docs.something.com/guide")
docs = loader.load()

# PDF
loader = PyPDFLoader("./document.pdf")
docs = loader.load()

# 文本
loader = TextLoader("./readme.md")
docs = loader.load()

文档分割(Text Splitting)

1
2
3
4
5
6
7
8
9
10
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
chunk_size=500, # 每块 Token 数
chunk_overlap=50, # 块之间重叠,防止边界切断
separators=["\n\n", "\n", "。", ". ", " "]
)

chunks = splitter.split_documents(docs)
print(f"分割为 {len(chunks)} 个块")

其他分割器

分割器说明
RecursiveCharacterTextSplitter通用,推荐默认
MarkdownTextSplitter按 Markdown 标题树分割
TokenTextSplitter基于 Token 数分割,更精准
SemanticChunker基于语义相似度分割(更智能但更慢)

向量存储(Vector Stores)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Qdrant, Chroma

# Embedding 模型
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Qdrant(生产推荐)
vectorstore = Qdrant.from_documents(
chunks,
embeddings,
location=":memory:", # 或 "http://localhost:6333"
collection_name="my_docs"
)

# Chroma(开发/测试)
vectorstore = Chroma.from_documents(
chunks,
embeddings,
persist_directory="./chroma_db"
)

检索器(Retrievers)

基础检索

1
2
3
4
5
6
7
8
9
# as_retriever() 转换
retriever = vectorstore.as_retriever(
search_kwargs={"k": 5} # Top-K
)

# 检索
results = retriever.invoke("LangChain 是什么")
for doc in results:
print(doc.page_content[:100])

多重检索策略

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
from langchain.retrievers import (
MultiQueryRetriever, # 多角度查询扩展
ContextualCompressionRetriever, # 压缩上下文
EnsembleRetriever, # 混合检索
)

# MultiQueryRetriever:用 LLM 生成多个查询角度,减少单一查询的信息差
retriever = MultiQueryRetriever.from_llm(
retriever=base_retriever,
llm=llm
)

# ContextualCompression:压缩检索结果,去除无关内容
from langchain_cohere import CohereRerank

compressor = CohareRerank(model="rerank-multilingual-v3.0")
compression_retriever = ContextualCompressionRetriever(
base_retriever=retriever,
compressors=[compressor]
)

完整 RAG 链

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableParallel

# 1. 向量数据库检索器
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# 2. Prompt
SYSTEM_PROMPT = """基于以下上下文回答问题。
如果上下文中没有相关信息,说"我不知道"。

上下文:
{context}
"""

prompt = ChatPromptTemplate.from_messages([
("system", SYSTEM_PROMPT),
("human", "{question}")
])

# 3. 模型
llm = ChatOpenAI(model="gpt-4o", temperature=0)

# 4. RAG 链
rag_chain = (
RunnableParallel({
"context": lambda x: retriever.invoke(x["question"]),
"question": lambda x: x["question"]
})
| prompt
| llm
| StrOutputParser()
)

# 5. 调用
result = rag_chain.invoke({"question": "LangChain 是什么"})

混合检索(Hybrid Search)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
from langchain.retrievers import EnsembleRetriever

# 1. 向量检索
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

# 2. 关键词检索(BM25)
from langchain_community.retrievers import BM25Retriever
keyword_retriever = BM25Retriever.from_texts(chunks)

# 3. 混合合并
ensemble_retriever = EnsembleRetriever(
retrievers=[vector_retriever, keyword_retriever],
weights=[0.6, 0.4] # 向量权重更高
)

Parent Document Retriever(父子文档检索)

解决精确 vs 上下文的两难:

1
2
3
4
5
6
7
8
9
10
11
12
13
from langchain.retrievers import ParentDocumentRetriever

# 小块用于精确检索
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)
# 大块用于完整上下文
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1500)

retriever = ParentDocumentRetriever(
vectorstore=vectorstore,
docstore=docstore, # 存储大块的 docstore
child_splitter=child_splitter,
parent_splitter=parent_splitter
)

流程:

1
用户问题 → 检索小块(精确匹配)→ 获取父文档 ID → 返回完整大块