Handle long text

When working with files such as PDFs, you are likely to encounter text that exceeds your language model's context window. To process this text, consider these strategies:

Change LLM: Choose a different LLM that supports a larger context window.
Brute force: Chunk the document and extract content from each chunk.
RAG: Chunk the document, index the chunks, and extract content only from the subset of chunks that look "relevant".

Keep in mind that these strategies have different trade-offs, and the best strategy likely depends on the application you're designing!
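
Before picking a strategy, it can help to check whether the text is likely to fit in the context window at all. The sketch below is only a rough estimate: the helper names, the 4-characters-per-token ratio, and the reserved output budget are all assumptions for illustration, not values taken from any particular model or tokenizer.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate. chars_per_token is an assumed heuristic, not a real tokenizer."""
    return int(len(text) / chars_per_token) + 1


def fits_in_context(text: str, context_window: int, reserved_for_output: int = 1000) -> bool:
    """Return True if the estimated prompt size leaves room for the model's reply."""
    return estimate_tokens(text) <= context_window - reserved_for_output


if __name__ == "__main__":
    sample = "a" * 400_000  # stand-in for a long document
    print(fits_in_context(sample, context_window=8000))
```

If the check fails, one of the three strategies above is needed. For an accurate count, use the tokenizer that matches your model instead of the character heuristic.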

Set up

We need some example data! Let's download an article about cars from Wikipedia and load it as a LangChain Document.

import re

import requests
from langchain_community.document_loaders import BSHTMLLoader

response = requests.get("https://en.wikipedia.org/wiki/Car")
with open("car.html", "w", encoding="utf-8") as f:
    f.write(response.text)
loader = BSHTMLLoader("car.html")
document = loader.load()[0]
document.page_content = re.sub("\n\n+", "\n", document.page_content)

print(len(document.page_content))

Define the schema

Here, we'll define a schema to extract key developments from the text.

from typing import List

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_openai import ChatOpenAI


class KeyDevelopment(BaseModel):
    """Information about a development in the history of cars."""

    year: int = Field(
        ..., description="The year when there was an important historic development."
    )
    description: str = Field(
        ..., description="What happened in this year? What was the development?"
    )
    evidence: str = Field(
        ...,
        description="Repeat verbatim the sentence(s) from which the year and description information were extracted",
    )


class ExtractionData(BaseModel):
    """Extracted information about key developments in the history of cars."""

    key_developments: List[KeyDevelopment]


# Message roles must be the literal strings "system" and "human";
# translated role names are rejected by ChatPromptTemplate.
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert at identifying key historic developments in text. "
            "Only extract important historic developments. Extract nothing if no important information can be found in the text.",
        ),
        ("human", "{text}"),
    ]
)


llm = ChatOpenAI(
    model="gpt-4-0125-preview",
    temperature=0,
)

extractor = prompt | llm.with_structured_output(
    schema=ExtractionData,
    method="function_calling",
    include_raw=False,
)

/home/eugene/.pyenv/versions/3.11.2/envs/langchain_3_11/lib/python3.11/site-packages/langchain_core/_api/beta_decorator.py:86: LangChainBetaWarning: The function `with_structured_output` is in beta. It is actively being worked on, so the API may change.
  warn_beta(

Brute force approach

Split the document into chunks such that each chunk fits into the context window of the LLM.

from langchain_text_splitters import TokenTextSplitter

text_splitter = TokenTextSplitter(
    chunk_size=2000,
    chunk_overlap=20,
)

texts = text_splitter.split_text(document.page_content)
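
To make the roles of chunk_size and chunk_overlap concrete, here is a minimal sketch of sliding-window chunking over a list of tokens. It is not the TokenTextSplitter implementation: the function name is hypothetical, and it takes an already tokenized sequence instead of running a real tokenizer.

```python
from typing import List, Sequence


def split_with_overlap(
    tokens: Sequence[str], chunk_size: int, chunk_overlap: int
) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start : start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the input
    return chunks


if __name__ == "__main__":
    for chunk in split_with_overlap(list("abcdefghij"), chunk_size=4, chunk_overlap=1):
        print(chunk)
```

Each window starts chunk_size - chunk_overlap tokens after the previous one, so a sentence that straddles a boundary has a chance of appearing whole in at least one chunk.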

Use the .batch functionality to run the extraction in parallel across each chunk!

tip

You can often use .batch() to parallelize the extractions! batch uses a thread pool under the hood to help you parallelize workloads.

If your model is exposed via an API, this will likely speed up your extraction flow!

# Limit to the first 3 chunks so the code can be re-run quickly
first_few = texts[:3]

extractions = extractor.batch(
    [{"text": text} for text in first_few],
    {"max_concurrency": 5},  # limit the concurrency by passing max_concurrency!
)
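
For intuition about what a thread-pool-backed batch call does, the sketch below runs a function over a list of inputs with a bounded number of threads using only the standard library. The helper name run_in_batch is hypothetical, and this is an illustration of the general idea, not LangChain's actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_in_batch(
    func: Callable[[T], R], inputs: List[T], max_concurrency: int = 5
) -> List[R]:
    """Apply func to every input using at most max_concurrency threads.

    Results are returned in the same order as the inputs.
    """
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(func, inputs))


if __name__ == "__main__":
    # str.upper stands in for a slow, I/O-bound call such as an API request.
    print(run_in_batch(str.upper, ["chunk one", "chunk two", "chunk three"], 2))
```

Threads help here because each call spends most of its time waiting on the network. Keep max_concurrency low enough to stay within your provider's rate limits.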

Merge results

After extracting data from across the chunks, we'll want to merge the extractions together.

key_developments = []

# Each extraction is an ExtractionData instance; collect its items into one list.
for extraction in extractions:
    key_developments.extend(extraction.key_developments)

key_developments[:20]

[KeyDevelopment(year=1966, description="The Toyota Corolla began production, recognized as the world's best-selling automobile.", evidence="The Toyota Corolla has been in production since 1966 and is recognized as the world's best-selling automobile."),
 KeyDevelopment(year=1769, description='Nicolas-Joseph Cugnot built the first steam-powered road vehicle.', evidence='French inventor Nicolas-Joseph Cugnot built the first steam-powered road vehicle in 1769.'),
 KeyDevelopment(year=1808, description='François Isaac de Rivaz designed and constructed the first internal combustion-powered automobile.', evidence='French-born Swiss inventor François Isaac de Rivaz designed and constructed the first internal combustion-powered automobile in 1808.'),
 KeyDevelopment(year=1886, description='Carl Benz patented his Benz Patent-Motorwagen, inventing the modern car.', evidence='The modern car—a practical, marketable automobile for everyday use—was invented in 1886, when German inventor Carl Benz patented his Benz Patent-Motorwagen.'),
 KeyDevelopment(year=1908, description='The 1908 Model T, an affordable car for the masses, was manufactured by the Ford Motor Company.', evidence='One of the first cars affordable by the masses was the 1908 Model T, an American car manufactured by the Ford Motor Company.'),
 KeyDevelopment(year=1881, description='Gustave Trouvé demonstrated a three-wheeled car powered by electricity.', evidence='In November 1881, French inventor Gustave Trouvé demonstrated a three-wheeled car powered by electricity at the International Exposition of Electricity.'),
 KeyDevelopment(year=1888, description="Bertha Benz undertook the first road trip by car to prove the road-worthiness of her husband's invention.", evidence="In August 1888, Bertha Benz, the wife of Carl Benz, undertook the first road trip by car, to prove the road-worthiness of her husband's invention."),
 KeyDevelopment(year=1896, description='Benz designed and patented the first internal-combustion flat engine, called boxermotor.', evidence='In 1896, Benz designed and patented the first internal-combustion flat engine, called boxermotor.'),
 KeyDevelopment(year=1897, description='Nesselsdorfer Wagenbau produced the Präsident automobil, one of the first factory-made cars in the world.', evidence='The first motor car in central Europe and one of the first factory-made cars in the world, was produced by Czech company Nesselsdorfer Wagenbau (later renamed to Tatra) in 1897, the Präsident automobil.'),
 KeyDevelopment(year=1890, description='Daimler Motoren Gesellschaft (DMG) was founded by Daimler and Maybach in Cannstatt.', evidence='Daimler and Maybach founded Daimler Motoren Gesellschaft (DMG) in Cannstatt in 1890.'),
 KeyDevelopment(year=1902, description='A new model DMG car was produced and named Mercedes after the Maybach engine.', evidence='Two years later, in 1902, a new model DMG car was produced and the model was named Mercedes after the Maybach engine, which generated 35 hp.'),
 KeyDevelopment(year=1891, description='Auguste Doriot and Louis Rigoulot completed the longest trip by a petrol-driven vehicle using a Daimler powered Peugeot Type 3.', evidence='In 1891, Auguste Doriot and his Peugeot colleague Louis Rigoulot completed the longest trip by a petrol-driven vehicle when their self-designed and built Daimler powered Peugeot Type 3 completed 2,100 kilometres (1,300 mi) from Valentigney to Paris and Brest and back again.'),
 KeyDevelopment(year=1895, description='George Selden was granted a US patent for a two-stroke car engine.', evidence='After a delay of 16 years and a series of attachments to his application, on 5 November 1895, Selden was granted a US patent (U.S. patent 549,160) for a two-stroke car engine.'),
 KeyDevelopment(year=1893, description='The first running, petrol-driven American car was built and road-tested by the Duryea brothers.', evidence='In 1893, the first running, petrol-driven American car was built and road-tested by the Duryea brothers of Springfield, Massachusetts.'),
 KeyDevelopment(year=1897, description='Rudolf Diesel built the first diesel engine.', evidence='In 1897, he built the first diesel engine.'),
 KeyDevelopment(year=1901, description='Ransom Olds started large-scale, production-line manufacturing of affordable cars at his Oldsmobile factory.', evidence='Large-scale, production-line manufacturing of affordable cars was started by Ransom Olds in 1901 at his Oldsmobile factory in Lansing, Michigan.'),
 KeyDevelopment(year=1913, description="Henry Ford began the world's first moving assembly line for cars at the Highland Park Ford Plant.", evidence="This concept was greatly expanded by Henry Ford, beginning in 1913 with the world's first moving assembly line for cars at the Highland Park Ford Plant."),
 KeyDevelopment(year=1914, description="Ford's assembly line worker could buy a Model T with four months' pay.", evidence="In 1914, an assembly line worker could buy a Model T with four months' pay."),
 KeyDevelopment(year=1926, description='Fast-drying Duco lacquer was developed, allowing for a variety of car colors.', evidence='Only Japan black would dry fast enough, forcing the company to drop the variety of colours available before 1913, until fast-drying Duco lacquer was developed in 1926.')]
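
A merged list like this may contain repeats when chunks overlap, and it is not in chronological order. The sketch below shows one possible way to de-duplicate and sort it. The Development dataclass is a hypothetical stand-in for KeyDevelopment so the example runs without LangChain, and treating two items as duplicates when they share a year and the same whitespace-normalized, lowercased evidence is an assumption you may want to change.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Development:
    """Hypothetical stand-in for the KeyDevelopment model."""

    year: int
    description: str
    evidence: str


def deduplicate(developments: Iterable[Development]) -> List[Development]:
    """Drop items with the same year and normalized evidence, then sort by year."""
    seen = set()
    unique: List[Development] = []
    for dev in developments:
        # Normalize case and whitespace so trivial differences do not hide duplicates.
        key = (dev.year, " ".join(dev.evidence.lower().split()))
        if key not in seen:
            seen.add(key)
            unique.append(dev)
    return sorted(unique, key=lambda d: d.year)


if __name__ == "__main__":
    items = [
        Development(1897, "Diesel engine built.", "In 1897, he built the first diesel engine."),
        Development(1769, "Steam vehicle built.", "Cugnot built it in 1769."),
        Development(1897, "First diesel engine.", "In 1897,  he built the first diesel engine."),
    ]
    for dev in deduplicate(items):
        print(dev.year, dev.description)
```

The first occurrence of each duplicate is kept, so descriptions from earlier chunks win.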


RAG based approach

Another simple idea is to chunk up the text, but instead of extracting information from every chunk, just focus on the most relevant chunks.

caution

It can be difficult to identify which chunks are relevant.

For example, in the car article we're using here, most of the article contains key development information. So by using RAG, we'll likely be throwing out a lot of relevant information.

We suggest experimenting with your use case and determining whether this approach works or not.

Here's a simple example that relies on the FAISS vectorstore.

from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# Reuse the token-based splitter defined above and index every chunk.
texts = text_splitter.split_text(document.page_content)
vectorstore = FAISS.from_texts(texts, embedding=OpenAIEmbeddings())

retriever = vectorstore.as_retriever(
    search_kwargs={"k": 1}
)  # Only extract from the first document

In this case the RAG extractor is only looking at the top document.

rag_extractor = {
    "text": retriever | (lambda docs: docs[0].page_content)  # fetch content of top doc
} | extractor

results = rag_extractor.invoke("Key developments associated with cars")

for key_development in results.key_developments:
    print(key_development)

year=1924 description="Germany's first mass-manufactured car, the Opel 4PS Laubfrosch, was produced, making Opel the top car builder in Germany with 37.5% of the market." evidence="Germany's first mass-manufactured car, the Opel 4PS Laubfrosch (Tree Frog), came off the line at Rüsselsheim in 1924, soon making Opel the top car builder in Germany, with 37.5 per cent of the market."
year=1925 description='Morris had 41% of total British car production, dominating the market.' evidence='in 1925, Morris had 41 per cent of total British car production.'
year=1925 description='Citroën, Renault, and Peugeot produced 550,000 cars in France, dominating the market.' evidence="Citroën did the same in France, coming to cars in 1919; between them and other cheap cars in reply such as Renault's 10CV and Peugeot's 5CV, they produced 550,000 cars in 1925."
year=2017 description='Production of petrol-fuelled cars peaked.' evidence='Production of petrol-fuelled cars peaked in 2017.'

Common issues

Different methods have their own pros and cons related to cost, speed, and accuracy.

Watch out for these issues:

Chunking content means that the LLM can fail to extract information if the information is spread across multiple chunks.
Large chunk overlap may cause the same information to be extracted twice, so be prepared to de-duplicate!
LLMs can make up data. If you are looking for a single fact across a large text and using a brute force approach, you may end up getting more made-up data.
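
One inexpensive guard against made-up data is to check that each extracted evidence string really occurs in the source text, since the schema asks the model to repeat it verbatim. The sketch below is one possible check, with hypothetical helper names. It ignores case and whitespace differences, which is an assumption, and it will reject evidence that the model paraphrased even slightly.

```python
def normalize(text: str) -> str:
    """Lowercase text and collapse all runs of whitespace to single spaces."""
    return " ".join(text.split()).lower()


def evidence_is_supported(evidence: str, source_text: str) -> bool:
    """Return True if the evidence appears in the source, ignoring case and spacing."""
    needle = normalize(evidence)
    # An empty evidence string would trivially match, so treat it as unsupported.
    return bool(needle) and needle in normalize(source_text)


if __name__ == "__main__":
    source = "In 1897, he built\nthe first diesel engine."
    print(evidence_is_supported("In 1897, he built the first diesel engine.", source))
    print(evidence_is_supported("In 1899, he built a rocket.", source))
```

Items that fail the check can be dropped or flagged for manual review before the results are used downstream.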
