Handle Long Text

When working with files, such as PDFs, you are likely to encounter text that exceeds your language model's context window. To process this text, consider these strategies:

Change LLM: Choose a different LLM that supports a larger context window.
Brute Force: Chunk the document, and extract content from each chunk.
RAG: Chunk the document, index the chunks, and only extract content from a subset of chunks that look "relevant".

Keep in mind that these strategies have different trade-offs, and the best strategy likely depends on the application you are designing.

Set up

We need some example data! Let's download an article about cars from Wikipedia and load it as a LangChain Document.

import re
import requests
from langchain_community.document_loaders import BSHTMLLoader

response = requests.get("https://en.wikipedia.org/wiki/Car")
with open("car.html", "w", encoding="utf-8") as f:
    f.write(response.text)
loader = BSHTMLLoader("car.html")
document = loader.load()[0]
document.page_content = re.sub("\n\n+", "\n", document.page_content)

print(len(document.page_content))
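
The character count printed above can be turned into a rough check of whether the text is likely to fit in a context window. The sketch below is only an approximation: the ratio of 4 characters per token is an assumed rule of thumb for English text, not a real tokenizer, and `estimate_tokens`, `fits_context`, and the window sizes are hypothetical names and values chosen for illustration.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; chars_per_token is an assumed heuristic, not a tokenizer."""
    return int(len(text) / chars_per_token) + 1


def fits_context(text: str, context_window: int, reserved_for_output: int = 1_000) -> bool:
    """Return True if the estimated prompt size leaves room for the model's reply."""
    return estimate_tokens(text) <= context_window - reserved_for_output


# Stand-in text of 200,000 characters (hypothetical; use document.page_content in practice).
sample = "car " * 50_000

print(estimate_tokens(sample))
print(fits_context(sample, context_window=8_192))
print(fits_context(sample, context_window=128_000))
```

If the estimate is close to the limit, counting tokens with the model's actual tokenizer is the safer option.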

Define the schema

Here, we'll define a schema to extract key developments from the text.

from typing import List

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_openai import ChatOpenAI

class KeyDevelopment(BaseModel):
    """Information about a development in the history of cars."""

    year: int = Field(
        ..., description="The year when there was an important historic development."
    )
    description: str = Field(
        ..., description="What happened in this year? What was the development?"
    )
    evidence: str = Field(
        ...,
        description="Repeat verbatim the sentence(s) from which the year and description information were extracted",
    )

class ExtractionData(BaseModel):
    """Extracted information about key developments in the history of cars."""

    key_developments: List[KeyDevelopment]


prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert at identifying key historic developments in text. "
            "Only extract important historic developments. Extract nothing if no important information can be found in the text.",
        ),
        ("human", "{text}"),
    ]
)

llm = ChatOpenAI(
    model="gpt-4-0125-preview",
    temperature=0,
)

extractor = prompt | llm.with_structured_output(
    schema=ExtractionData,
    method="function_calling",
    include_raw=False,
)

/home/eugene/.pyenv/versions/3.11.2/envs/langchain_3_11/lib/python3.11/site-packages/langchain_core/_api/beta_decorator.py:86: LangChainBetaWarning: The function `with_structured_output` is in beta. It is actively being worked on, so the API may change.
  warn_beta(

Brute force approach

Split the document into chunks such that each chunk fits into the context window of the LLM.

from langchain_text_splitters import TokenTextSplitter

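text_splitter = TokenTextSplitter(
    # Controls the size of each chunk (in tokens)
    chunk_size=2000,
    # Controls the overlap between consecutive chunks (in tokens)
    chunk_overlap=20,
)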

texts = text_splitter.split_text(document.page_content)
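
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sliding-window sketch. It is illustrative only and not the library's implementation: it assumes the text has already been turned into a list of tokens (plain strings stand in for model tokens), and `split_with_overlap` is a hypothetical helper name.

```python
from typing import List


def split_with_overlap(
    tokens: List[str], chunk_size: int, chunk_overlap: int
) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start : start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the input
    return chunks


# Stand-in tokens (hypothetical); real tokens would come from a tokenizer.
words = [f"w{i}" for i in range(10)]
chunks = split_with_overlap(words, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last token of the previous one, which is what gives a sentence cut at a boundary a chance of appearing whole in at least one chunk.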

Use the .batch functionality to run the extraction in parallel across each chunk!

Tip

You can often use .batch() to parallelize the extractions! batch uses a threadpool under the hood to help you parallelize workloads.

If your model is exposed via an API, this will likely speed up your extraction flow!

# Limit to the first 3 chunks so the code can be re-run quickly
first_few = texts[:3]

extractions = extractor.batch(
    [{"text": text} for text in first_few],
    {"max_concurrency": 5},  # limit the concurrency by passing max concurrency!
)
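
The idea of running calls on a thread pool with a concurrency cap can be sketched with the standard library alone. This is not how `.batch` is implemented internally; it only illustrates the pattern. `fake_extract` is a hypothetical stand-in for an API-backed extractor call, and `run_batch` is a hypothetical helper.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def fake_extract(text: str) -> Dict[str, int]:
    """Hypothetical stand-in for a network-bound extraction call."""
    return {"n_chars": len(text)}


def run_batch(
    fn: Callable[[str], Dict[str, int]], inputs: List[str], max_concurrency: int = 5
) -> List[Dict[str, int]]:
    """Run fn over inputs with at most max_concurrency calls in flight."""
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # Executor.map returns results in the same order as the inputs.
        return list(pool.map(fn, inputs))


results = run_batch(fake_extract, ["first chunk", "second", "third one"], max_concurrency=2)
print(results)
```

Threads help here because the work is dominated by waiting on network responses; the cap keeps the number of simultaneous requests within whatever rate limit the API enforces.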

Merge results

After extracting data from across the chunks, we'll want to merge the extractions together.

key_developments = []

for extraction in extractions:
    key_developments.extend(extraction.key_developments)

key_developments[:20]

[KeyDevelopment(year=1966, description="The Toyota Corolla began production, recognized as the world's best-selling automobile.", evidence="The Toyota Corolla has been in production since 1966 and is recognized as the world's best-selling automobile."),
 KeyDevelopment(year=1769, description='Nicolas-Joseph Cugnot built the first steam-powered road vehicle.', evidence='French inventor Nicolas-Joseph Cugnot built the first steam-powered road vehicle in 1769.'),
 KeyDevelopment(year=1808, description='François Isaac de Rivaz designed and constructed the first internal combustion-powered automobile.', evidence='French-born Swiss inventor François Isaac de Rivaz designed and constructed the first internal combustion-powered automobile in 1808.'),
 KeyDevelopment(year=1886, description='Carl Benz patented his Benz Patent-Motorwagen, inventing the modern car.', evidence='The modern car—a practical, marketable automobile for everyday use—was invented in 1886, when German inventor Carl Benz patented his Benz Patent-Motorwagen.'),
 KeyDevelopment(year=1908, description='The 1908 Model T, an affordable car for the masses, was manufactured by the Ford Motor Company.', evidence='One of the first cars affordable by the masses was the 1908 Model T, an American car manufactured by the Ford Motor Company.'),
 KeyDevelopment(year=1881, description='Gustave Trouvé demonstrated a three-wheeled car powered by electricity.', evidence='In November 1881, French inventor Gustave Trouvé demonstrated a three-wheeled car powered by electricity at the International Exposition of Electricity.'),
 KeyDevelopment(year=1888, description="Bertha Benz undertook the first road trip by car to prove the road-worthiness of her husband's invention.", evidence="In August 1888, Bertha Benz, the wife of Carl Benz, undertook the first road trip by car, to prove the road-worthiness of her husband's invention."),
 KeyDevelopment(year=1896, description='Benz designed and patented the first internal-combustion flat engine, called boxermotor.', evidence='In 1896, Benz designed and patented the first internal-combustion flat engine, called boxermotor.'),
 KeyDevelopment(year=1897, description='Nesselsdorfer Wagenbau produced the Präsident automobil, one of the first factory-made cars in the world.', evidence='The first motor car in central Europe and one of the first factory-made cars in the world, was produced by Czech company Nesselsdorfer Wagenbau (later renamed to Tatra) in 1897, the Präsident automobil.'),
 KeyDevelopment(year=1890, description='Daimler Motoren Gesellschaft (DMG) was founded by Daimler and Maybach in Cannstatt.', evidence='Daimler and Maybach founded Daimler Motoren Gesellschaft (DMG) in Cannstatt in 1890.'),
 KeyDevelopment(year=1902, description='A new model DMG car was produced and named Mercedes after the Maybach engine.', evidence='Two years later, in 1902, a new model DMG car was produced and the model was named Mercedes after the Maybach engine, which generated 35 hp.'),
 KeyDevelopment(year=1891, description='Auguste Doriot and Louis Rigoulot completed the longest trip by a petrol-driven vehicle using a Daimler powered Peugeot Type 3.', evidence='In 1891, Auguste Doriot and his Peugeot colleague Louis Rigoulot completed the longest trip by a petrol-driven vehicle when their self-designed and built Daimler powered Peugeot Type 3 completed 2,100 kilometres (1,300 mi) from Valentigney to Paris and Brest and back again.'),
 KeyDevelopment(year=1895, description='George Selden was granted a US patent for a two-stroke car engine.', evidence='After a delay of 16 years and a series of attachments to his application, on 5 November 1895, Selden was granted a US patent (U.S. patent 549,160) for a two-stroke car engine.'),
 KeyDevelopment(year=1893, description='The first running, petrol-driven American car was built and road-tested by the Duryea brothers.', evidence='In 1893, the first running, petrol-driven American car was built and road-tested by the Duryea brothers of Springfield, Massachusetts.'),
 KeyDevelopment(year=1897, description='Rudolf Diesel built the first diesel engine.', evidence='In 1897, he built the first diesel engine.'),
 KeyDevelopment(year=1901, description='Ransom Olds started large-scale, production-line manufacturing of affordable cars at his Oldsmobile factory.', evidence='Large-scale, production-line manufacturing of affordable cars was started by Ransom Olds in 1901 at his Oldsmobile factory in Lansing, Michigan.'),
 KeyDevelopment(year=1913, description="Henry Ford began the world's first moving assembly line for cars at the Highland Park Ford Plant.", evidence="This concept was greatly expanded by Henry Ford, beginning in 1913 with the world's first moving assembly line for cars at the Highland Park Ford Plant."),
 KeyDevelopment(year=1914, description="Ford's assembly line worker could buy a Model T with four months' pay.", evidence="In 1914, an assembly line worker could buy a Model T with four months' pay."),
 KeyDevelopment(year=1926, description='Fast-drying Duco lacquer was developed, allowing for a variety of car colors.', evidence='Only Japan black would dry fast enough, forcing the company to drop the variety of colours available before 1913, until fast-drying Duco lacquer was developed in 1926.')]
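
Because consecutive chunks overlap, a merged list like this can contain the same development more than once. A minimal de-duplication sketch follows, under the assumption that two items are duplicates when they share a year and the same evidence after lowercasing and whitespace normalization. `Development` is a hypothetical stand-in that mirrors the fields of `KeyDevelopment`, and the sample values are illustrative only.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Development:
    """Hypothetical stand-in mirroring the fields of KeyDevelopment."""

    year: int
    description: str
    evidence: str


def deduplicate(items: Iterable[Development]) -> List[Development]:
    """Keep the first item for each (year, normalized evidence) pair."""
    seen: Set[Tuple[int, str]] = set()
    unique: List[Development] = []
    for item in items:
        key = (item.year, " ".join(item.evidence.lower().split()))
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


# Illustrative sample values, not real extraction output.
merged = [
    Development(1886, "Carl Benz patented the Motorwagen.", "Carl Benz patented his Benz Patent-Motorwagen."),
    Development(1886, "Benz patents the first modern car.", "Carl Benz  patented his benz patent-motorwagen."),
    Development(1913, "Ford began the moving assembly line.", "Henry Ford began the moving assembly line in 1913."),
]
unique = deduplicate(merged)
print(len(unique))
```

Keying on the evidence rather than the description is a design choice: descriptions are paraphrased by the model and vary between runs, while evidence is requested verbatim.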

RAG based approach

Another simple idea is to chunk up the text, but instead of extracting information from every chunk, focus only on the most relevant chunks.

Caution

It can be difficult to identify which chunks are relevant.

For example, in the car article we're using here, most of the article contains key development information. So by using RAG, we'll likely be throwing out a lot of relevant information.

We suggest experimenting with your use case and determining whether this approach works or not.

Here's a simple example that relies on the FAISS vectorstore.

from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from langchain_core.runnables import RunnableLambda
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter

texts = text_splitter.split_text(document.page_content)
vectorstore = FAISS.from_texts(texts, embedding=OpenAIEmbeddings())

retriever = vectorstore.as_retriever(
    search_kwargs={"k": 1}
)  # Only extract from the first document

In this case, the RAG extractor only looks at the top document.

rag_extractor = {
    "text": retriever | (lambda docs: docs[0].page_content)  # fetch the content of the top doc
} | extractor

results = rag_extractor.invoke("Key developments associated with cars")

for key_development in results.key_developments:
    print(key_development)

year=1924 description="Germany's first mass-manufactured car, the Opel 4PS Laubfrosch, was produced, making Opel the top car builder in Germany with 37.5% of the market." evidence="Germany's first mass-manufactured car, the Opel 4PS Laubfrosch (Tree Frog), came off the line at Rüsselsheim in 1924, soon making Opel the top car builder in Germany, with 37.5 per cent of the market."
year=1925 description='Morris had 41% of total British car production, dominating the market.' evidence='in 1925, Morris had 41 per cent of total British car production.'
year=1925 description='Citroën, Renault, and Peugeot produced 550,000 cars in France, dominating the market.' evidence="Citroën did the same in France, coming to cars in 1919; between them and other cheap cars in reply such as Renault's 10CV and Peugeot's 5CV, they produced 550,000 cars in 1925."
year=2017 description='Production of petrol-fuelled cars peaked.' evidence='Production of petrol-fuelled cars peaked in 2017.'
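
To see why a small k can drop relevant material, here is a toy retriever built on the standard library. It is a sketch under loud assumptions: bag-of-words cosine similarity stands in for real embeddings, `top_k` is a hypothetical helper, and the three chunks are made-up stand-ins for the split article.

```python
import math
import re
from collections import Counter
from typing import List


def bag_of_words(text: str) -> Counter:
    """Lowercased word counts; a crude stand-in for an embedding vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_k(query: str, chunks: List[str], k: int = 1) -> List[str]:
    """Return the k chunks most similar to the query."""
    query_vec = bag_of_words(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, bag_of_words(c)), reverse=True)
    return ranked[:k]


# Made-up stand-in chunks (hypothetical, not the real split article).
chunks = [
    "Carl Benz patented the Motorwagen in 1886.",
    "Cars need regular maintenance and fuel.",
    "Henry Ford began the moving assembly line in 1913.",
]
print(top_k("Benz patent 1886", chunks, k=1))
```

With k=1 only the Benz chunk reaches the extractor. The Ford chunk also describes a key development, but it is never seen, which is the trade-off described in the caution above.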

Common Issues

Different methods have their own pros and cons related to cost, speed, and accuracy.

Watch out for these issues:

Chunking content means that the LLM can fail to extract information if the information is spread across multiple chunks.
Large chunk overlap may cause the same information to be extracted twice, so be prepared to de-duplicate!
LLMs can make up data. If you are looking for a single fact across a large text and using a brute force approach, you may end up getting more made-up data.
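
One way to reduce the risk of made-up data is to check that each extracted evidence string actually occurs in the source text, since the schema asks for a verbatim quote. The sketch below is a hedged illustration, not part of the original pipeline: `split_by_grounding` is a hypothetical helper, matching is done after lowercasing and whitespace normalization, and the sample items (including the deliberately invented second one) are for demonstration only.

```python
from typing import Dict, List, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks do not block a match."""
    return " ".join(text.lower().split())


def split_by_grounding(
    items: List[Dict], source_text: str
) -> Tuple[List[Dict], List[Dict]]:
    """Separate items whose evidence appears in the source from those that do not."""
    source = normalize(source_text)
    grounded: List[Dict] = []
    suspect: List[Dict] = []
    for item in items:
        if normalize(item["evidence"]) in source:
            grounded.append(item)
        else:
            suspect.append(item)
    return grounded, suspect


source_text = (
    "French inventor Nicolas-Joseph Cugnot built the first\n"
    "steam-powered road vehicle in 1769."
)
items = [
    {"year": 1769, "evidence": "Cugnot built the first steam-powered road vehicle in 1769."},
    # Deliberately invented item to show what a failed check looks like.
    {"year": 1801, "evidence": "A steam carriage crossed London in 1801."},
]
grounded, suspect = split_by_grounding(items, source_text)
print(len(grounded), len(suspect))
```

A failed check does not prove the item is wrong, because the model may have paraphrased the quote slightly, so suspect items are better reviewed than silently dropped.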