Handle Long Text

When working with files, such as PDFs, you are likely to encounter text that exceeds your language model's context window. To process this text, consider these strategies:

Change LLM: Choose a different LLM that supports a larger context window.
Brute Force: Chunk the document and extract content from each chunk.
RAG: Chunk the document, index the chunks, and extract content only from a subset of chunks that look "relevant".

Keep in mind that these strategies have different trade-offs, and the best one likely depends on the application you are building!

Setup

We need some example data! Let's download an article about cars from Wikipedia and load it as a LangChain document.

import re
import requests
from langchain_community.document_loaders import BSHTMLLoader

response = requests.get("https://en.wikipedia.org/wiki/Car")
with open("car.html", "w", encoding="utf-8") as f:
    f.write(response.text)
loader = BSHTMLLoader("car.html")
document = loader.load()[0]
document.page_content = re.sub("\n\n+", "\n", document.page_content)

print(len(document.page_content))

Define the Schema

Here, we'll define a schema to extract key developments from the text.

from typing import List

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_openai import ChatOpenAI

class KeyDevelopment(BaseModel):
    """Information about a development in the history of cars."""

    year: int = Field(
        ..., description="The year when there was an important historic development."
    )
    description: str = Field(
        ..., description="What happened in this year? What was the development?"
    )
    evidence: str = Field(
        ...,
        description="Repeat verbatim the sentence(s) from which the year and description information were extracted.",
    )

class ExtractionData(BaseModel):
    """Extracted information about key developments in the history of cars."""

    key_developments: List[KeyDevelopment]

prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert at identifying key historic developments in text. "
            "Only extract important historic developments. Extract nothing if no important information can be found in the text.",
        ),
        ("human", "{text}"),
    ]
)

llm = ChatOpenAI(
    model="gpt-4-0125-preview",
    temperature=0,
)

extractor = prompt | llm.with_structured_output(
    schema=ExtractionData,
    method="function_calling",
    include_raw=False,
)

/home/eugene/.pyenv/versions/3.11.2/envs/langchain_3_11/lib/python3.11/site-packages/langchain_core/_api/beta_decorator.py:86: LangChainBetaWarning: The function `with_structured_output` is in beta. It is actively being worked on, so the API may change.
  warn_beta(

Brute Force Approach

Split the document into chunks such that each chunk fits into the LLM's context window.

from langchain_text_splitters import TokenTextSplitter

text_splitter = TokenTextSplitter(
    chunk_size=2000,
    chunk_overlap=20,
)

texts = text_splitter.split_text(document.page_content)
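
To make the roles of chunk_size and chunk_overlap concrete, here is a minimal sketch of overlapping windowed splitting in plain Python. It is not the TokenTextSplitter implementation; the helper name split_with_overlap is hypothetical, and whitespace-separated words stand in for real model tokens.

```python
from typing import List


def split_with_overlap(
    tokens: List[str], chunk_size: int, chunk_overlap: int
) -> List[List[str]]:
    """Split tokens into windows of at most chunk_size tokens.

    Consecutive windows share chunk_overlap tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start : start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
        start += step
    return chunks


# Assumption: whitespace "tokens" stand in for real model tokens.
words = "a b c d e f g h i j".split()
chunks = split_with_overlap(words, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

Each window starts chunk_size - chunk_overlap tokens after the previous one, so a larger overlap reduces the chance of cutting a fact in half but repeats more text across chunks.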

Use the .batch functionality to run the extraction in parallel across each chunk!

Tip

You can often use .batch() to parallelize the extractions! batch uses a thread pool under the hood to help you parallelize workloads.

If your model is exposed via an API, this will likely speed up your extraction flow!

first_few = texts[:3]

extractions = extractor.batch(
    [{"text": text} for text in first_few],
    {"max_concurrency": 5},  # limit the concurrency by passing max_concurrency!
)
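
As a rough illustration of what a bounded thread pool does for this kind of workload, the sketch below uses only the standard library. It is not LangChain's implementation: run_batch and fake_extract are hypothetical names, and fake_extract merely stands in for a network-bound model call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def run_batch(
    func: Callable[[Dict[str, str]], int],
    inputs: List[Dict[str, str]],
    max_concurrency: int,
) -> List[int]:
    """Apply func to every input using at most max_concurrency threads.

    Results are returned in the same order as the inputs.
    """
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(func, inputs))


def fake_extract(payload: Dict[str, str]) -> int:
    """Hypothetical stand-in for a model call: returns the text length."""
    return len(payload["text"])


results = run_batch(
    fake_extract,
    [{"text": "abc"}, {"text": "de"}, {"text": ""}],
    max_concurrency=2,
)
print(results)
```

Threads help here because the work is dominated by waiting on the API, and capping the pool size is a simple way to stay under a provider's rate limits.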

Merge results

After extracting data from across the chunks, we'll want to merge the extractions together.

key_developments = []

for extraction in extractions:
    key_developments.extend(extraction.key_developments)

key_developments[:20]

[KeyDevelopment(year=1966, description="The Toyota Corolla began production, recognized as the world's best-selling automobile.", evidence="The Toyota Corolla has been in production since 1966 and is recognized as the world's best-selling automobile."),
 KeyDevelopment(year=1769, description='Nicolas-Joseph Cugnot built the first steam-powered road vehicle.', evidence='French inventor Nicolas-Joseph Cugnot built the first steam-powered road vehicle in 1769.'),
 KeyDevelopment(year=1808, description='François Isaac de Rivaz designed and constructed the first internal combustion-powered automobile.', evidence='French-born Swiss inventor François Isaac de Rivaz designed and constructed the first internal combustion-powered automobile in 1808.'),
 KeyDevelopment(year=1886, description='Carl Benz patented his Benz Patent-Motorwagen, inventing the modern car.', evidence='The modern car—a practical, marketable automobile for everyday use—was invented in 1886, when German inventor Carl Benz patented his Benz Patent-Motorwagen.'),
 KeyDevelopment(year=1908, description='The 1908 Model T, an affordable car for the masses, was manufactured by the Ford Motor Company.', evidence='One of the first cars affordable by the masses was the 1908 Model T, an American car manufactured by the Ford Motor Company.'),
 KeyDevelopment(year=1881, description='Gustave Trouvé demonstrated a three-wheeled car powered by electricity.', evidence='In November 1881, French inventor Gustave Trouvé demonstrated a three-wheeled car powered by electricity at the International Exposition of Electricity.'),
 KeyDevelopment(year=1888, description="Bertha Benz undertook the first road trip by car to prove the road-worthiness of her husband's invention.", evidence="In August 1888, Bertha Benz, the wife of Carl Benz, undertook the first road trip by car, to prove the road-worthiness of her husband's invention."),
 KeyDevelopment(year=1896, description='Benz designed and patented the first internal-combustion flat engine, called boxermotor.', evidence='In 1896, Benz designed and patented the first internal-combustion flat engine, called boxermotor.'),
 KeyDevelopment(year=1897, description='Nesselsdorfer Wagenbau produced the Präsident automobil, one of the first factory-made cars in the world.', evidence='The first motor car in central Europe and one of the first factory-made cars in the world, was produced by Czech company Nesselsdorfer Wagenbau (later renamed to Tatra) in 1897, the Präsident automobil.'),
 KeyDevelopment(year=1890, description='Daimler Motoren Gesellschaft (DMG) was founded by Daimler and Maybach in Cannstatt.', evidence='Daimler and Maybach founded Daimler Motoren Gesellschaft (DMG) in Cannstatt in 1890.'),
 KeyDevelopment(year=1902, description='A new model DMG car was produced and named Mercedes after the Maybach engine.', evidence='Two years later, in 1902, a new model DMG car was produced and the model was named Mercedes after the Maybach engine, which generated 35 hp.'),
 KeyDevelopment(year=1891, description='Auguste Doriot and Louis Rigoulot completed the longest trip by a petrol-driven vehicle using a Daimler powered Peugeot Type 3.', evidence='In 1891, Auguste Doriot and his Peugeot colleague Louis Rigoulot completed the longest trip by a petrol-driven vehicle when their self-designed and built Daimler powered Peugeot Type 3 completed 2,100 kilometres (1,300 mi) from Valentigney to Paris and Brest and back again.'),
 KeyDevelopment(year=1895, description='George Selden was granted a US patent for a two-stroke car engine.', evidence='After a delay of 16 years and a series of attachments to his application, on 5 November 1895, Selden was granted a US patent (U.S. patent 549,160) for a two-stroke car engine.'),
 KeyDevelopment(year=1893, description='The first running, petrol-driven American car was built and road-tested by the Duryea brothers.', evidence='In 1893, the first running, petrol-driven American car was built and road-tested by the Duryea brothers of Springfield, Massachusetts.'),
 KeyDevelopment(year=1897, description='Rudolf Diesel built the first diesel engine.', evidence='In 1897, he built the first diesel engine.'),
 KeyDevelopment(year=1901, description='Ransom Olds started large-scale, production-line manufacturing of affordable cars at his Oldsmobile factory.', evidence='Large-scale, production-line manufacturing of affordable cars was started by Ransom Olds in 1901 at his Oldsmobile factory in Lansing, Michigan.'),
 KeyDevelopment(year=1913, description="Henry Ford began the world's first moving assembly line for cars at the Highland Park Ford Plant.", evidence="This concept was greatly expanded by Henry Ford, beginning in 1913 with the world's first moving assembly line for cars at the Highland Park Ford Plant."),
 KeyDevelopment(year=1914, description="Ford's assembly line worker could buy a Model T with four months' pay.", evidence="In 1914, an assembly line worker could buy a Model T with four months' pay."),
 KeyDevelopment(year=1926, description='Fast-drying Duco lacquer was developed, allowing for a variety of car colors.', evidence='Only Japan black would dry fast enough, forcing the company to drop the variety of colours available before 1913, until fast-drying Duco lacquer was developed in 1926.')]

RAG-based Approach

Another simple idea is to chunk up the text, but instead of extracting information from every chunk, focus only on the most relevant chunks.

Caution

It can be difficult to identify which chunks are relevant.

For example, in the car article we're using here, most of the article contains key development information. So by using RAG, we'll likely be throwing out a lot of relevant information.

We suggest experimenting with your use case and determining whether this approach works or not.

Here's a simple example that relies on the FAISS vectorstore.

from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from langchain_core.runnables import RunnableLambda
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter

texts = text_splitter.split_text(document.page_content)
vectorstore = FAISS.from_texts(texts, embedding=OpenAIEmbeddings())

retriever = vectorstore.as_retriever(
    search_kwargs={"k": 1}
)  # Only extract from the first document

In this case, the RAG extractor only looks at the top document.

rag_extractor = {
    "text": retriever | (lambda docs: docs[0].page_content)  # fetch content of the top doc
} | extractor

results = rag_extractor.invoke("Key developments associated with cars")

for key_development in results.key_developments:
    print(key_development)

year=1924 description="Germany's first mass-manufactured car, the Opel 4PS Laubfrosch, was produced, making Opel the top car builder in Germany with 37.5% of the market." evidence="Germany's first mass-manufactured car, the Opel 4PS Laubfrosch (Tree Frog), came off the line at Rüsselsheim in 1924, soon making Opel the top car builder in Germany, with 37.5 per cent of the market."
year=1925 description='Morris had 41% of total British car production, dominating the market.' evidence='In 1925, Morris had 41 per cent of total British car production.'
year=1925 description='Citroën, Renault, and Peugeot produced 550,000 cars in France, dominating the market.' evidence="Citroën did the same in France, coming to cars in 1919; between them and other cheap cars in reply such as Renault's 10CV and Peugeot's 5CV, they produced 550,000 cars in 1925."
year=2017 description='Production of petrol-fuelled cars peaked.' evidence='Production of petrol-fuelled cars peaked in 2017.'
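
To show the retrieval step in isolation, without an embedding service, here is a toy sketch that ranks chunks by cosine similarity over word counts and keeps the top k. The word-count vectors are an assumption standing in for OpenAIEmbeddings and FAISS, the helper names (bag_of_words, cosine, top_k) are hypothetical, and the sample chunks are made up for illustration.

```python
import math
import re
from collections import Counter
from typing import List


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count its alphanumeric words."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: str, chunks: List[str], k: int) -> List[str]:
    """Return the k chunks most similar to the query."""
    query_vec = bag_of_words(query)
    ranked = sorted(
        chunks, key=lambda c: cosine(query_vec, bag_of_words(c)), reverse=True
    )
    return ranked[:k]


# Made-up sample chunks, for illustration only.
chunks = [
    "Carl Benz patented his Motorwagen in 1886.",
    "Traffic collisions are a leading cause of injury.",
    "The Model T was manufactured by Ford in 1908.",
]
selected = top_k("Who patented the Motorwagen?", chunks, k=1)
print(selected)
```

Raising k sends more chunks to the extractor, which costs more but lowers the risk of discarding relevant text.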

Common Issues

Different methods have their own pros and cons related to cost, speed, and accuracy.

Watch out for these issues:

Chunking content means that the LLM can fail to extract information if the information is spread across multiple chunks.
Large chunk overlap may cause the same information to be extracted twice, so be prepared to de-duplicate!
LLMs can make up data. If you are looking for a single fact across a large text and using a brute force approach, you may end up getting more made-up data.
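
Two of these issues lend themselves to simple post-processing. The sketch below de-duplicates merged extractions and flags items whose evidence quote cannot be found in the source text. It is a sketch under assumptions: Development is a hypothetical stand-in that mirrors the KeyDevelopment fields, the sample data is made up, and exact substring matching is a deliberately strict heuristic that will also reject faithful paraphrases.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Development:
    """Hypothetical stand-in mirroring the KeyDevelopment fields."""

    year: int
    description: str
    evidence: str


def normalize(text: str) -> str:
    """Lowercase the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def deduplicate(items: List[Development]) -> List[Development]:
    """Keep the first item for each (year, normalized evidence) pair."""
    seen = set()
    unique = []
    for item in items:
        key = (item.year, normalize(item.evidence))
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


def is_grounded(item: Development, source_text: str) -> bool:
    """True if the evidence appears in the source, ignoring case and spacing."""
    return normalize(item.evidence) in normalize(source_text)


# Made-up sample data, for illustration only.
source = "In 1897, he built the first diesel engine. Benz patented the Motorwagen in 1886."
items = [
    Development(1897, "First diesel engine built.", "In 1897, he built the first diesel engine."),
    Development(1897, "Diesel engine invented.", "In 1897, he built the  first diesel engine."),
    Development(1950, "Flying cars became common.", "In 1950, flying cars became common."),
]
unique = deduplicate(items)
grounded = [item for item in unique if is_grounded(item, source)]
print(len(unique), len(grounded))
```

Keying on the evidence rather than the description is a design choice: overlapping chunks tend to yield the same quoted sentence even when the model words the description differently.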