LLM Study Notes: LangChain Document Loading

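Document loaders: Document loaders provide a standard interface for reading data from different sources (CSV, PDF, JSON, and so on) into LangChain's document format. This ensures the data can be processed consistently regardless of where it came from.
Every document loader, built-in or custom, must implement the BaseLoader interface. The `Document` class is LangChain's uniform container for documents, and all loaders ultimately return instances of it. A basic Document instance can be created as follows: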

from langchain_core.documents import Document

document = Document(
    page_content="Hello, world!",
    metadata={"source": "https://example.com"},
)
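
To make the loader contract more concrete, here is a minimal, library-free sketch of the pattern. `SimpleDocument` and `LineLoader` are hypothetical stand-ins invented for illustration, not LangChain classes. In LangChain itself you would subclass `BaseLoader` and yield `Document` objects from `lazy_load()`; as far as I know, the inherited `load()` simply collects that iterator into a list, which is what the sketch imitates.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for langchain_core.documents.Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class LineLoader:
    """Hypothetical loader: one document per non-empty line of a string."""

    def __init__(self, text: str, source: str) -> None:
        self.text = text
        self.source = source

    def lazy_load(self) -> Iterator[SimpleDocument]:
        # Yield documents one at a time so callers never hold all of them at once.
        for number, line in enumerate(self.text.splitlines(), start=1):
            if line.strip():
                yield SimpleDocument(
                    page_content=line,
                    metadata={"source": self.source, "line": number},
                )

    def load(self) -> list[SimpleDocument]:
        # Eager variant: materialize everything lazy_load() yields.
        return list(self.lazy_load())


loader = LineLoader("Hello, world!\n\nSecond line", source="https://example.com")
docs = loader.load()
print(len(docs), docs[0].page_content, docs[1].metadata)
```

Writing only `lazy_load()` and deriving `load()` from it keeps the two code paths consistent, which is why the loaders in these notes expose both methods.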
CSVLoader

CSVLoader loads a CSV file and turns each row into one Document.

from langchain_community.document_loaders import CSVLoader

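loader = CSVLoader(
    file_path="./data/stu.csv",
    csv_args={
        "delimiter": ",",
        "quotechar": '"',
        # Use fieldnames only when the file has no header row; otherwise
        # the header row is loaded as if it were data.
        "fieldnames": ["id", "name", "age"],
    },
    encoding="utf-8",
)

# Eager loading: read every row into a list of Documents
documents = loader.load()
# for document in documents:
#     print(type(document), document)

# Lazy loading: yield Documents one at a time
for document in loader.lazy_load():
    print(type(document), document)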
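
The note about `fieldnames` is easier to see with the standard library alone, since `csv_args` is passed on to Python's `csv.DictReader`. The sketch below is an approximation, not CSVLoader's actual code: `rows_to_contents` is a hypothetical helper, the sample data is made up, and the `column: value` layout reflects my understanding of how CSVLoader renders each row into `page_content`.

```python
import csv
import io


def rows_to_contents(text, fieldnames=None):
    """Hypothetical helper: render each CSV row as 'column: value' lines."""
    reader = csv.DictReader(
        io.StringIO(text),
        delimiter=",",
        quotechar='"',
        fieldnames=fieldnames,
    )
    return ["\n".join(f"{key}: {value}" for key, value in row.items()) for row in reader]


# Made-up sample data: one file without a header row, one with.
no_header = "1,Alice,20\n2,Bob,21\n"
with_header = "id,name,age\n1,Alice,20\n"

names = ["id", "name", "age"]
contents_without_header = rows_to_contents(no_header, fieldnames=names)
contents_with_header = rows_to_contents(with_header, fieldnames=names)

print(contents_without_header[0])
print("---")
print(contents_with_header[0])  # the header row has been read as a data row
```

When the file already has a header row, leaving `fieldnames` out lets `DictReader` take the column names from that first row instead.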

JSONLoader

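JSONLoader loads JSON data into Document objects. Dependency: `pip install jq`. The `jq` package wraps jq, a cross-platform JSON processing tool, and LangChain's JSON parsing is built on it. The loader extracts information from the JSON data and wraps it in Document objects; what gets extracted is specified with jq_schema syntax.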

from langchain_community.document_loaders import JSONLoader

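loader = JSONLoader(
    file_path="./data/stu_json_lines.json",
    jq_schema=".name",   # extract the "name" field of each record
    text_content=False,  # the extracted value need not be a string
    json_lines=True,     # the file holds one JSON object per line
)

documents = loader.load()  # load() returns a list of Documents
print(type(documents), documents)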
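
For readers unfamiliar with the JSON Lines layout, the following standard-library sketch shows roughly what `jq_schema=".name"` combined with `json_lines=True` selects. It does not use jq or LangChain; the sample records and the `extract_names` helper are assumptions made up for illustration.

```python
import json

# Hypothetical contents of a JSON Lines file: one JSON object per line.
json_lines_text = """{"name": "Alice", "age": 20}
{"name": "Bob", "age": 21}
"""


def extract_names(text):
    """Hypothetical helper: pull the "name" field out of each JSON line."""
    names = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        # Plays the role of jq_schema=".name"; a missing key gives None,
        # much as jq would produce null.
        names.append(record.get("name"))
    return names


names = extract_names(json_lines_text)
print(names)
```

Each extracted value would become the `page_content` of its own Document, so a two-line file yields two Documents.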

TextLoader: text loading

Purpose: reads a text file (.txt) and puts its entire content into a single Document object.

from langchain_community.document_loaders import TextLoader

loader = TextLoader(
    "xxx.txt",
    encoding="utf-8",
)
docs = loader.load()
print(docs)
print(len(docs))  # 1: the whole file becomes one Document

RecursiveCharacterTextSplitter

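A recursive character text splitter, mainly used to split large documents along natural paragraph boundaries. It is the default text splitter recommended by LangChain: it tries each separator in order and falls back to the next one only when a piece is still larger than the chunk size. This strikes a good balance between preserving context and controlling chunk size, and it works well out of the box.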

from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
# pip install langchain-text-splitters

loader = TextLoader(file_path="./data/Python基础语法.txt", encoding="utf-8")
docs = loader.load()
print(docs, len(docs))

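splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # maximum chunk length, as measured by length_function
    chunk_overlap=50,  # maximum overlap between neighbouring chunks
    # Separators are tried in order, coarsest first. The empty string always
    # matches, so it must come last; anything listed after it is never used.
    separators=["\n\n", "\n", "。", "!", ".", ",", " ", ""],
    length_function=len,
)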

split_docs = splitter.split_documents(docs)
print(len(split_docs))
for doc in split_docs:
    print("="*20)
    print(doc)
    print("="*20)
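
The recursive idea can be sketched in a few lines of plain Python. This is a simplified illustration, not LangChain's implementation: `recursive_split` is a hypothetical function, it drops the separator at chunk boundaries, and it ignores `chunk_overlap` entirely. It does show why separator order matters, and in particular why the empty string belongs at the end of the list.

```python
def recursive_split(text, chunk_size, separators):
    """Simplified sketch: split on the first separator that occurs, then
    recurse with the remaining separators on pieces that are still too long."""
    if len(text) <= chunk_size:
        return [text] if text else []

    # Pick the first separator present in the text; "" always matches.
    for index, sep in enumerate(separators):
        if sep == "" or sep in text:
            break
    else:
        return [text]  # nothing left to split on: give up on this piece

    remaining = separators[index + 1:]
    pieces = list(text) if sep == "" else text.split(sep)

    chunks, current = [], ""
    for piece in pieces:
        if not piece:
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces together
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Still too long: retry with the finer-grained separators.
            chunks.extend(recursive_split(piece, chunk_size, remaining))
    if current:
        chunks.append(current)
    return chunks


text = "aaaa bbbb\n\ncccc dddd eeee"
paragraph_first = recursive_split(text, 10, ["\n\n", "\n", " ", ""])

# With "" placed before ".", the "." separator is never reached.
good_order = recursive_split("ab.cd.ef", 3, [".", ""])
bad_order = recursive_split("ab.cd.ef", 3, ["", "."])

print(paragraph_first)
print(good_order)
print(bad_order)
```

In the last comparison, putting `""` first makes the sketch cut every three characters regardless of sentence boundaries, which is the behavior to avoid when ordering `separators`.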

PyPDFLoader

LangChain supports many PDF loaders; PyPDFLoader is one of them. It depends on the pypdf library: `pip install pypdf`

from langchain_community.document_loaders import PyPDFLoader

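loader = PyPDFLoader(
    file_path="./data/pdf2.pdf",
    # "single" loads the whole PDF as one Document;
    # "page" loads each page as a separate Document
    mode="single",
    password="itheima",
)
# With mode="single" this loop yields one Document; with mode="page", one per page.
for i, doc in enumerate(loader.lazy_load(), start=1):
    print(f"Document {i}:")
    print(type(doc), doc)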

 
