Retrieval Components

LangChain's RAG components cover the whole pipeline, from document loading to vector retrieval.
Document Loaders

```python
from langchain_community.document_loaders import (
    WebBaseLoader,
    PyPDFLoader,
    TextLoader,
    CSVLoader,
    NotionDirectoryLoader,
)

# Load a web page
loader = WebBaseLoader("https://docs.something.com/guide")
docs = loader.load()

# Load a PDF file
loader = PyPDFLoader("./document.pdf")
docs = loader.load()

# Load a plain-text or Markdown file
loader = TextLoader("./readme.md")
docs = loader.load()
```
Text Splitting

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,      # maximum characters per chunk
    chunk_overlap=50,    # characters shared between neighbouring chunks
    # Separators are tried in order, from coarsest to finest
    separators=["\n\n", "\n", "。", ". ", " "],
)

chunks = splitter.split_documents(docs)
print(f"Split into {len(chunks)} chunks")
```
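To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal standard-library sketch of fixed-window splitting with overlap. It is a simplification and not LangChain's algorithm: `RecursiveCharacterTextSplitter` also tries the separators in order, which this sketch ignores. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
pieces = split_with_overlap(text, chunk_size=10, chunk_overlap=3)
print(pieces)
```

Each chunk starts with the last three characters of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one chunk if it is short enough.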
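Other Splitters

| Splitter | Description |
| --- | --- |
| RecursiveCharacterTextSplitter | General purpose; the recommended default |
| MarkdownTextSplitter | Splits along the Markdown heading structure |
| TokenTextSplitter | Splits by token count, which is more precise |
| SemanticChunker | Splits by semantic similarity (smarter but slower) |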
Vector Stores

```python
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Qdrant, Chroma

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Qdrant, in-memory mode (nothing is persisted)
vectorstore = Qdrant.from_documents(
    chunks,
    embeddings,
    location=":memory:",
    collection_name="my_docs",
)

# Chroma, persisted to a local directory
vectorstore = Chroma.from_documents(
    chunks,
    embeddings,
    persist_directory="./chroma_db",
)
```
Retrievers

Basic Retrieval

```python
# Return the 5 most similar chunks for each query
retriever = vectorstore.as_retriever(
    search_kwargs={"k": 5}
)

results = retriever.invoke("What is LangChain")
for doc in results:
    print(doc.page_content[:100])
```
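What `k` controls can be illustrated without a vector database. The sketch below ranks toy two-dimensional vectors by cosine similarity and keeps the top `k`. The vectors, document ids, and helper names are made up for illustration; a real store uses high-dimensional embeddings and usually an approximate index rather than a full sort.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], index: dict[str, list[float]], k: int) -> list[str]:
    """Return the ids of the k vectors most similar to query_vec."""
    ranked = sorted(index.items(), key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]


# Toy "embeddings" (hypothetical values)
toy_index = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.8, 0.6],
    "doc_c": [0.0, 1.0],
}

nearest = top_k([1.0, 0.1], toy_index, k=2)
print(nearest)
```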
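Multiple Retrieval Strategies

```python
from langchain.retrievers import (
    MultiQueryRetriever,
    ContextualCompressionRetriever,
    EnsembleRetriever,
)
from langchain_cohere import CohereRerank

# Assumes `base_retriever` (e.g. vectorstore.as_retriever()) and `llm` are already defined.

# Multi-query: the LLM rewrites the question into several variants and the results are merged
retriever = MultiQueryRetriever.from_llm(
    retriever=base_retriever,
    llm=llm,
)

# Reranking: Cohere reorders the retrieved documents by relevance
compressor = CohereRerank(model="rerank-multilingual-v3.0")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever,
)
```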
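The idea behind multi-query retrieval can be sketched in plain Python: run several phrasings of the same question and keep the union of results without duplicates. This is an illustration only. In LangChain the variants are generated by the LLM, whereas here they are hard-coded, and `fake_search` is a hypothetical stand-in for a real retriever.

```python
from typing import Callable


def multi_query_retrieve(queries: list[str], search: Callable[[str], list[str]]) -> list[str]:
    """Run every query variant and merge the hits, keeping first-seen order."""
    seen: set[str] = set()
    merged: list[str] = []
    for query in queries:
        for doc in search(query):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


# Hypothetical canned results, keyed by query variant
FAKE_RESULTS = {
    "What is LangChain?": ["intro", "overview"],
    "LangChain definition": ["overview", "glossary"],
    "Explain the LangChain framework": ["intro", "architecture"],
}


def fake_search(query: str) -> list[str]:
    return FAKE_RESULTS.get(query, [])


merged_docs = multi_query_retrieve(list(FAKE_RESULTS), fake_search)
print(merged_docs)
```

Different phrasings surface different documents, so the merged list tends to have better recall than any single query, at the cost of extra retrieval calls.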
Full RAG Chain

```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableParallel

retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

SYSTEM_PROMPT = """Answer the question based on the context below.
If the context does not contain the relevant information, say "I don't know".

Context:
{context}
"""

prompt = ChatPromptTemplate.from_messages([
    ("system", SYSTEM_PROMPT),
    ("human", "{question}"),
])

llm = ChatOpenAI(model="gpt-4o", temperature=0)

rag_chain = (
    # Build the prompt inputs: retrieved documents plus the original question
    RunnableParallel({
        "context": lambda x: retriever.invoke(x["question"]),
        "question": lambda x: x["question"],
    })
    | prompt
    | llm
    | StrOutputParser()
)

result = rag_chain.invoke({"question": "What is LangChain"})
```
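In the chain above, `context` receives a list of document objects, which the prompt template will stringify as-is. A common refinement is to join only the text of each document before it reaches the prompt. The sketch below shows such a helper; `Doc` is a hypothetical stand-in for LangChain's `Document` so the example runs on its own, and `format_docs` is an illustrative name.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Minimal stand-in for a retrieved document (only the text field)."""
    page_content: str


def format_docs(docs: list[Doc]) -> str:
    """Join the text of the retrieved documents, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


context = format_docs([Doc("LangChain is a framework."), Doc("It supports RAG.")])
print(context)
```

Assuming the real `Document` objects expose `page_content` in the same way, the `context` entry could become `lambda x: format_docs(retriever.invoke(x["question"]))`.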
Hybrid Search

```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# Vector retrieval (semantic matching)
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

# Keyword retrieval (BM25); `chunks` are Document objects, so use from_documents
keyword_retriever = BM25Retriever.from_documents(chunks)

# Combine both result lists, weighting vector hits slightly higher
ensemble_retriever = EnsembleRetriever(
    retrievers=[vector_retriever, keyword_retriever],
    weights=[0.6, 0.4],
)
```
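One common way to merge two ranked lists with weights is weighted Reciprocal Rank Fusion, which to my understanding is also the approach `EnsembleRetriever` takes. The sketch below is a standalone illustration, not the library's code; the constant `c=60` is the conventional RRF default and is an assumption here, as are the document ids.

```python
def weighted_rrf(rankings: list[list[str]], weights: list[float], c: int = 60) -> list[str]:
    """Fuse ranked lists: each list adds weight / (rank + c) to a document's score."""
    scores: dict[str, float] = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (rank + c)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)


vector_hits = ["d1", "d2", "d3"]    # hypothetical vector-search ranking
keyword_hits = ["d3", "d1", "d4"]   # hypothetical BM25 ranking

fused = weighted_rrf([vector_hits, keyword_hits], weights=[0.6, 0.4])
print(fused)
```

Documents that both retrievers return (`d1`, `d3`) rise above documents that only one of them found, which is the point of hybrid search.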
Parent Document Retriever

This retriever resolves the trade-off between precise matching (small chunks) and sufficient context (large chunks):
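```python
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Small chunks are embedded and searched; large chunks are what gets returned
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1500)

# Parent chunks are kept in a docstore; an in-memory store is assumed here
docstore = InMemoryStore()

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
```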
Flow:

User question → retrieve small chunks (precise match) → look up the parent document ID → return the full large chunk
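The flow can be sketched with plain dictionaries. This is a toy model under stated assumptions: a substring match stands in for vector similarity, children are fixed-size slices of the parent text, and the function names are hypothetical.

```python
def build_child_index(parents: dict[str, str], child_size: int) -> list[tuple[str, str]]:
    """Slice every parent text into small children, each remembering its parent id."""
    children = []
    for parent_id, text in parents.items():
        for start in range(0, len(text), child_size):
            children.append((text[start:start + child_size], parent_id))
    return children


def retrieve_parents(query: str, children: list[tuple[str, str]], parents: dict[str, str]) -> list[str]:
    """Match against the small children, then return the full parent texts."""
    hit_ids: list[str] = []
    for child_text, parent_id in children:
        # Substring match is a stand-in for a similarity search over child embeddings
        if query.lower() in child_text.lower() and parent_id not in hit_ids:
            hit_ids.append(parent_id)
    return [parents[parent_id] for parent_id in hit_ids]


parents = {
    "p1": "LangChain is a framework for building LLM applications. It ships many loaders.",
    "p2": "Qdrant is a vector database. Chroma is another vector store.",
}
children = build_child_index(parents, child_size=40)
found = retrieve_parents("Chroma", children, parents)
print(found)
```

The match happens on a 40-character child, but the caller receives the whole parent text, so the LLM sees the surrounding context as well.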