Handling long text

When working with files such as PDFs, you're likely to encounter text that exceeds your language model's context window. To process this text, consider these strategies:

Change LLM: Choose a different LLM that supports a larger context window.
Brute force: Chunk the document, and extract content from each chunk.
RAG: Chunk the document, index the chunks, and only extract content from a subset of chunks that look "relevant".

Keep in mind that these strategies have different trade-offs, and the best strategy likely depends on the application that you're designing!

Set up

We need some example data! Let's download an article about cars from Wikipedia and load it as a LangChain Document.

import re
import requests
from langchain_community.document_loaders import BSHTMLLoader

response = requests.get("https://en.wikipedia.org/wiki/Car")
with open("car.html", "w", encoding="utf-8") as f:
    f.write(response.text)
loader = BSHTMLLoader("car.html")
document = loader.load()[0]
document.page_content = re.sub("\n\n+", "\n", document.page_content)

print(len(document.page_content))
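
The character count printed above does not directly say whether the text fits in a context window, since windows are measured in tokens. As a rough sketch, the check below assumes about 4 characters per token for English text; that ratio, the `reserved_for_output` budget, and both helper names are illustrative assumptions, not part of LangChain. A real tokenizer gives exact counts.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Roughly estimate the token count from the character count (heuristic only)."""
    return int(len(text) / chars_per_token)


def fits_in_context(text: str, context_window: int, reserved_for_output: int = 1000) -> bool:
    """Return True if the estimated prompt size still leaves room for the model's reply."""
    return estimate_tokens(text) + reserved_for_output <= context_window


sample = "word " * 10_000  # 50,000 characters of placeholder text
print(estimate_tokens(sample))
print(fits_in_context(sample, context_window=8_000))
print(fits_in_context(sample, context_window=128_000))
```

If the estimate lands anywhere near the limit, treat the text as too long and fall back to one of the chunking strategies.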

Define the schema

Here, we'll define the schema to extract key developments from the text.

from typing import List, Optional
from langchain.chains import create_structured_output_runnable
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_openai import ChatOpenAI

class KeyDevelopment(BaseModel):
    """Information about a development in the history of cars."""

    year: int = Field(
        ..., description="The year when there was an important historic development."
    )
    description: str = Field(
        ..., description="What happened in this year? What was the development?"
    )
    evidence: str = Field(
        ...,
        description="Repeat verbatim the sentence(s) from which the year and description information were extracted.",
    )

class ExtractionData(BaseModel):
    """Extracted information about key developments in the history of cars."""

    key_developments: List[KeyDevelopment]

prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert at identifying key historic developments in text. "
            "Only extract important historic developments. Extract nothing if no important information can be found in the text.",
        ),
        ("human", "{text}"),
    ]
)


llm = ChatOpenAI(
    model="gpt-4-0125-preview",
    temperature=0,
)

extractor = prompt | llm.with_structured_output(
    schema=ExtractionData,
    method="function_calling",
    include_raw=False,
)

/home/eugene/.pyenv/versions/3.11.2/envs/langchain_3_11/lib/python3.11/site-packages/langchain_core/_api/beta_decorator.py:86: LangChainBetaWarning: The function `with_structured_output` is in beta. It is actively being worked on, so the API may change.
  warn_beta(

Brute force approach

Split the documents into chunks such that each chunk fits into the context window of the LLM.

from langchain_text_splitters import TokenTextSplitter

text_splitter = TokenTextSplitter(
    chunk_size=2000,
    chunk_overlap=20,
)

texts = text_splitter.split_text(document.page_content)
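
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sliding-window sketch. It is not the `TokenTextSplitter` implementation: whitespace-separated words stand in for model tokens, and `split_with_overlap` is a hypothetical helper written only for illustration.

```python
from typing import List


def split_with_overlap(tokens: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


words = "a b c d e f g h i j".split()
chunks = split_with_overlap(words, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

Each window repeats the last token of the previous one, which is why a sentence that straddles a boundary has a better chance of appearing whole in at least one chunk.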

Use the .batch functionality to run the extraction in parallel across each chunk!

tip

You can often use .batch() to parallelize the extractions! batch uses a thread pool under the hood to help you parallelize workloads.

If your model is exposed via an API, this will likely speed up your extraction flow!

first_few = texts[:3]

extractions = extractor.batch(
    [{"text": text} for text in first_few],
    {"max_concurrency": 5},  # limit the concurrency by passing max_concurrency!
)
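
For intuition about what a thread-pool-backed batch call does, the sketch below reproduces the idea with the standard library only. It is an illustration, not LangChain's implementation; `batch` and `fake_extract` here are hypothetical stand-ins, with `fake_extract` playing the role of a network-bound model call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def batch(func: Callable[[T], R], inputs: List[T], max_concurrency: int = 5) -> List[R]:
    """Apply func to every input using at most max_concurrency worker threads."""
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # Executor.map yields results in input order, even if calls finish out of order.
        return list(pool.map(func, inputs))


def fake_extract(text: str) -> int:
    """Hypothetical stand-in for an extraction call; returns a word count."""
    return len(text.split())


word_counts = batch(fake_extract, ["one two", "three", "four five six"], max_concurrency=2)
print(word_counts)
```

Threads help here because the work is mostly waiting on an API response; for CPU-bound work the speed-up would be much smaller.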

Merge results

After extracting data from across the chunks, we'll want to merge the extractions together.

key_developments = []

for extraction in extractions:
    key_developments.extend(extraction.key_developments)

key_developments[:20]

[KeyDevelopment(year=1966, description="The Toyota Corolla began production, recognized as the world's best-selling automobile.", evidence="The Toyota Corolla has been in production since 1966 and is recognized as the world's best-selling automobile."),
 KeyDevelopment(year=1769, description='Nicolas-Joseph Cugnot built the first steam-powered road vehicle.', evidence='French inventor Nicolas-Joseph Cugnot built the first steam-powered road vehicle in 1769.'),
 KeyDevelopment(year=1808, description='François Isaac de Rivaz designed and constructed the first internal combustion-powered automobile.', evidence='French-born Swiss inventor François Isaac de Rivaz designed and constructed the first internal combustion-powered automobile in 1808.'),
 KeyDevelopment(year=1886, description='Carl Benz patented his Benz Patent-Motorwagen, inventing the modern car.', evidence='The modern car—a practical, marketable automobile for everyday use—was invented in 1886, when German inventor Carl Benz patented his Benz Patent-Motorwagen.'),
 KeyDevelopment(year=1908, description='The 1908 Model T, an affordable car for the masses, was manufactured by the Ford Motor Company.', evidence='One of the first cars affordable by the masses was the 1908 Model T, an American car manufactured by the Ford Motor Company.'),
 KeyDevelopment(year=1881, description='Gustave Trouvé demonstrated a three-wheeled car powered by electricity.', evidence='In November 1881, French inventor Gustave Trouvé demonstrated a three-wheeled car powered by electricity at the International Exposition of Electricity.'),
 KeyDevelopment(year=1888, description="Bertha Benz undertook the first road trip by car to prove the road-worthiness of her husband's invention.", evidence="In August 1888, Bertha Benz, the wife of Carl Benz, undertook the first road trip by car, to prove the road-worthiness of her husband's invention."),
 KeyDevelopment(year=1896, description='Benz designed and patented the first internal-combustion flat engine, called boxermotor.', evidence='In 1896, Benz designed and patented the first internal-combustion flat engine, called boxermotor.'),
 KeyDevelopment(year=1897, description='Nesselsdorfer Wagenbau produced the Präsident automobil, one of the first factory-made cars in the world.', evidence='The first motor car in central Europe and one of the first factory-made cars in the world, was produced by Czech company Nesselsdorfer Wagenbau (later renamed to Tatra) in 1897, the Präsident automobil.'),
 KeyDevelopment(year=1890, description='Daimler Motoren Gesellschaft (DMG) was founded by Daimler and Maybach in Cannstatt.', evidence='Daimler and Maybach founded Daimler Motoren Gesellschaft (DMG) in Cannstatt in 1890.'),
 KeyDevelopment(year=1902, description='A new model DMG car was produced and named Mercedes after the Maybach engine.', evidence='Two years later, in 1902, a new model DMG car was produced and the model was named Mercedes after the Maybach engine, which generated 35 hp.'),
 KeyDevelopment(year=1891, description='Auguste Doriot and Louis Rigoulot completed the longest trip by a petrol-driven vehicle using a Daimler powered Peugeot Type 3.', evidence='In 1891, Auguste Doriot and his Peugeot colleague Louis Rigoulot completed the longest trip by a petrol-driven vehicle when their self-designed and built Daimler powered Peugeot Type 3 completed 2,100 kilometres (1,300 mi) from Valentigney to Paris and Brest and back again.'),
 KeyDevelopment(year=1895, description='George Selden was granted a US patent for a two-stroke car engine.', evidence='After a delay of 16 years and a series of attachments to his application, on 5 November 1895, Selden was granted a US patent (U.S. patent 549,160) for a two-stroke car engine.'),
 KeyDevelopment(year=1893, description='The first running, petrol-driven American car was built and road-tested by the Duryea brothers.', evidence='In 1893, the first running, petrol-driven American car was built and road-tested by the Duryea brothers of Springfield, Massachusetts.'),
 KeyDevelopment(year=1897, description='Rudolf Diesel built the first diesel engine.', evidence='In 1897, he built the first diesel engine.'),
 KeyDevelopment(year=1901, description='Ransom Olds started large-scale, production-line manufacturing of affordable cars at his Oldsmobile factory.', evidence='Large-scale, production-line manufacturing of affordable cars was started by Ransom Olds in 1901 at his Oldsmobile factory in Lansing, Michigan.'),
 KeyDevelopment(year=1913, description="Henry Ford began the world's first moving assembly line for cars at the Highland Park Ford Plant.", evidence="This concept was greatly expanded by Henry Ford, beginning in 1913 with the world's first moving assembly line for cars at the Highland Park Ford Plant."),
 KeyDevelopment(year=1914, description="Ford's assembly line worker could buy a Model T with four months' pay.", evidence="In 1914, an assembly line worker could buy a Model T with four months' pay."),
 KeyDevelopment(year=1926, description='Fast-drying Duco lacquer was developed, allowing for a variety of car colors.', evidence='Only Japan black would dry fast enough, forcing the company to drop the variety of colours available before 1913, until fast-drying Duco lacquer was developed in 1926.')]
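
Because neighboring chunks overlap, a merged list like this can contain the same development more than once. The sketch below shows one possible way to de-duplicate. The `Development` dataclass is a hypothetical stand-in for the `KeyDevelopment` schema, and keying on the year plus the normalized evidence text is an assumption; other keys may suit your data better.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Development:
    """Hypothetical stand-in for the KeyDevelopment schema."""
    year: int
    description: str
    evidence: str


def deduplicate(items: Iterable[Development]) -> List[Development]:
    """Keep the first item for each (year, normalized evidence) pair, preserving order."""
    seen = set()
    unique = []
    for item in items:
        # Lowercase and collapse whitespace so trivial differences do not hide duplicates.
        key = (item.year, " ".join(item.evidence.lower().split()))
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


merged = [
    Development(1897, "Rudolf Diesel built the first diesel engine.", "In 1897, he built the first diesel engine."),
    Development(1897, "The first diesel engine was built.", "In 1897, he built the  first diesel engine."),
    Development(1901, "Olds started production-line manufacturing.", "Started by Ransom Olds in 1901."),
]
unique = deduplicate(merged)
print(len(unique))
```

Exact matching on evidence will miss duplicates where the model quoted different sentences for the same event, so treat this as a first pass.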

RAG based approach

Another simple idea is to chunk up the text, but instead of extracting information from every chunk, just focus on the most relevant chunks.

Caution: It can be difficult to identify which chunks are relevant.

For example, in the car article we're using here, most of the article contains key development information. So by using RAG, we'll likely be throwing out a lot of relevant information.

We suggest experimenting with your use case and determining whether this approach works or not.

Here's a simple example that relies on the FAISS vectorstore.

from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from langchain_core.runnables import RunnableLambda
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter

texts = text_splitter.split_text(document.page_content)
vectorstore = FAISS.from_texts(texts, embedding=OpenAIEmbeddings())

retriever = vectorstore.as_retriever(
    search_kwargs={"k": 1}
)  # Only extract from the first document

In this case, the RAG extractor is only looking at the top document.

rag_extractor = {
    "text": retriever | (lambda docs: docs[0].page_content)  # fetch the content of the top document
} | extractor

results = rag_extractor.invoke("Key developments associated with cars")

for key_development in results.key_developments:
    print(key_development)

year=1924 description="Germany's first mass-manufactured car, the Opel 4PS Laubfrosch, was produced, making Opel the top car builder in Germany with 37.5% of the market." evidence="Germany's first mass-manufactured car, the Opel 4PS Laubfrosch (Tree Frog), came off the line at Rüsselsheim in 1924, soon making Opel the top car builder in Germany, with 37.5% of the market."
year=1925 description='Morris had 41% of total British car production, dominating the market.' evidence='In 1925, Morris had 41% of total British car production.'
year=1925 description='Citroën, Renault, and Peugeot produced 550,000 cars in France, dominating the market.' evidence="Citroën did the same in France, coming to cars in 1919; between them and other cheap cars in reply such as Renault's 10CV and Peugeot's 5CV, they produced 550,000 cars in 1925."
year=2017 description='Production of petrol-fuelled cars peaked.' evidence='Production of petrol-fuelled cars peaked in 2017.'
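
To show the retrieval step in isolation, without an embedding model or a vector store, here is a small sketch that ranks chunks by word-count cosine similarity and keeps the top `k`. This is a deliberately simplified proxy: the example above uses dense embeddings from `OpenAIEmbeddings` with FAISS, whereas `bag_of_words`, `cosine`, and `top_k` below are hypothetical helpers written for illustration.

```python
import math
import re
from collections import Counter
from typing import List


def bag_of_words(text: str) -> Counter:
    """Count lowercase alphanumeric words in the text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query: str, chunks: List[str], k: int = 1) -> List[str]:
    """Return the k chunks most similar to the query, best first."""
    q = bag_of_words(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, bag_of_words(c)), reverse=True)
    return ranked[:k]


chunks = [
    "Carl Benz patented the Motorwagen in 1886.",
    "Cars have controls for driving, parking, and passenger comfort.",
    "Traffic collisions are a leading cause of injury.",
]
best = top_k("Who patented the first car?", chunks, k=1)
print(best)
```

With `k=1`, everything outside the single best-matching chunk is never shown to the extractor, which is the trade-off the caution above describes.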

Common issues

Different methods have their own pros and cons related to cost, speed, and accuracy.

Watch out for these issues:

Chunking content means that the LLM can fail to extract information if the information is spread across multiple chunks.
Large chunk overlap may cause the same information to be extracted twice, so be prepared to de-duplicate!
LLMs can make up data. If you are looking for a single fact across a large text and using a brute force approach, you may end up getting more made-up data.
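
One way to catch some made-up data is to use the `evidence` field from the schema: since it asks for verbatim sentences, each extraction can be checked against the source text. The sketch below is a heuristic under that assumption; `normalize` and `evidence_is_grounded` are hypothetical helpers, and a failed match may just mean the model paraphrased rather than fabricated.

```python
def normalize(text: str) -> str:
    """Lowercase the text and collapse all runs of whitespace to single spaces."""
    return " ".join(text.lower().split())


def evidence_is_grounded(evidence: str, source_text: str) -> bool:
    """Return True if the quoted evidence appears in the source, ignoring case and whitespace."""
    needle = normalize(evidence)
    return bool(needle) and needle in normalize(source_text)


source = "In 1897, he built the first diesel engine.\nBenz patented the Motorwagen in 1886."
print(evidence_is_grounded("In 1897, he built the first   diesel engine.", source))
print(evidence_is_grounded("In 1901, he built the first jet engine.", source))
```

Extractions that fail the check are good candidates for manual review or for dropping before the results are merged.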