Handling long text

When working with files, such as PDFs, you're likely to encounter text that exceeds your language model's context window. To process this text, consider these strategies:

Change LLM: Choose a different LLM that supports a larger context window.
Brute force: Chunk the document, and extract content from each chunk.
RAG: Chunk the document, index the chunks, and only extract content from a subset of chunks that look "relevant".

Keep in mind that these strategies have different trade-offs, and the best strategy likely depends on the application that you're designing!

Set up

We need some example data! Let's download an article about cars from Wikipedia and load it as a LangChain Document.

import re
import requests
from langchain_community.document_loaders import BSHTMLLoader

response = requests.get("https://en.wikipedia.org/wiki/Car")
with open("car.html", "w", encoding="utf-8") as f:
    f.write(response.text)
loader = BSHTMLLoader("car.html")
document = loader.load()[0]
document.page_content = re.sub("\n\n+", "\n", document.page_content)

print(len(document.page_content))

Define the schema

Here, we'll define a schema to extract key developments from the text.

from typing import List
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_openai import ChatOpenAI

class KeyDevelopment(BaseModel):
    """Information about a development in the history of cars."""
    year: int = Field(
        ..., description="The year when there was an important historic development."
    )
    description: str = Field(
        ..., description="What happened in this year? What was the development?"
    )
    evidence: str = Field(
        ...,
        description="Repeat verbatim the sentence(s) from which the year and description information were extracted",
    )

class ExtractionData(BaseModel):
    """Extracted information about key developments in the history of cars."""
    key_developments: List[KeyDevelopment]

prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert at identifying key historic developments in text. "
            "Only extract important historic developments. Extract nothing if no important information can be found in the text.",
        ),
        ("human", "{text}"),
    ]
)

llm = ChatOpenAI(
    model="gpt-4-0125-preview",
    temperature=0,
)

extractor = prompt | llm.with_structured_output(
    schema=ExtractionData,
    method="function_calling",
    include_raw=False,
)

/home/eugene/.pyenv/versions/3.11.2/envs/langchain_3_11/lib/python3.11/site-packages/langchain_core/_api/beta_decorator.py:86: LangChainBetaWarning: The function `with_structured_output` is in beta. It is actively being worked on, so the API may change.
  warn_beta(

Brute force approach

Split the documents into chunks such that each chunk fits into the context window of the LLM.

from langchain_text_splitters import TokenTextSplitter

text_splitter = TokenTextSplitter(
    chunk_size=2000,
    chunk_overlap=20,
)

texts = text_splitter.split_text(document.page_content)
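
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of overlapping chunking over a list of tokens. It is not how `TokenTextSplitter` is implemented: the helper `chunk_tokens` is hypothetical, and whitespace-separated words stand in for real model tokens.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Split tokens into windows of at most chunk_size tokens.

    Each window after the first repeats the last chunk_overlap tokens
    of the previous window.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks: List[List[str]] = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
        start += step
    return chunks


# Hypothetical toy input: ten one-letter "tokens".
words = "a b c d e f g h i j".split()
chunks = chunk_tokens(words, chunk_size=4, chunk_overlap=1)
print(chunks)
```

With a small overlap such as the 20 tokens used above, a sentence that straddles a boundary is at least partly visible in both neighbouring chunks.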

Use .batch to run the extraction in parallel across each chunk!

Tip

You can often use .batch() to parallelize the extractions! batch uses a thread pool under the hood to help you parallelize workloads.

If your model is exposed via an API, this will likely speed up your extraction flow!

first_few = texts[:3]

extractions = extractor.batch(
    [{"text": text} for text in first_few],
    {"max_concurrency": 5},  # limit the concurrency by passing max_concurrency
)
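
As a rough illustration of what a bounded thread pool does for I/O-bound calls, the sketch below runs a stand-in function over several inputs with at most `max_concurrency` threads. `fake_extract` and `run_batch` are hypothetical names used only for this example, not LangChain APIs, and the real `.batch()` may differ in its details.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def fake_extract(text: str) -> str:
    """Hypothetical stand-in for a network-bound model call."""
    time.sleep(0.05)  # simulate API latency
    return text.upper()


def run_batch(fn: Callable[[str], str], inputs: List[str], max_concurrency: int) -> List[str]:
    """Apply fn to every input using at most max_concurrency threads.

    pool.map yields results in input order, regardless of which call finishes first.
    """
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(fn, inputs))


results = run_batch(fake_extract, ["chunk one", "chunk two", "chunk three"], max_concurrency=2)
print(results)
```

Threads help here because each call spends most of its time waiting on the network; the cap keeps the number of simultaneous requests within whatever rate limit applies.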

Merge results

After extracting data from across the chunks, we'll want to merge the extractions together.

key_developments = []

for extraction in extractions:
    key_developments.extend(extraction.key_developments)

key_developments[:20]

[KeyDevelopment(year=1966, description="The Toyota Corolla began production, recognized as the world's best-selling automobile.", evidence="The Toyota Corolla has been in production since 1966 and is recognized as the world's best-selling automobile."),
 KeyDevelopment(year=1769, description='Nicolas-Joseph Cugnot built the first steam-powered road vehicle.', evidence='French inventor Nicolas-Joseph Cugnot built the first steam-powered road vehicle in 1769.'),
 KeyDevelopment(year=1808, description='François Isaac de Rivaz designed and constructed the first internal combustion-powered automobile.', evidence='French-born Swiss inventor François Isaac de Rivaz designed and constructed the first internal combustion-powered automobile in 1808.'),
 KeyDevelopment(year=1886, description='Carl Benz patented his Benz Patent-Motorwagen, inventing the modern car.', evidence='The modern car—a practical, marketable automobile for everyday use—was invented in 1886, when German inventor Carl Benz patented his Benz Patent-Motorwagen.'),
 KeyDevelopment(year=1908, description='The 1908 Model T, an affordable car for the masses, was manufactured by the Ford Motor Company.', evidence='One of the first cars affordable by the masses was the 1908 Model T, an American car manufactured by the Ford Motor Company.'),
 KeyDevelopment(year=1881, description='Gustave Trouvé demonstrated a three-wheeled car powered by electricity.', evidence='In November 1881, French inventor Gustave Trouvé demonstrated a three-wheeled car powered by electricity at the International Exposition of Electricity.'),
 KeyDevelopment(year=1888, description="Bertha Benz undertook the first road trip by car to prove the road-worthiness of her husband's invention.", evidence="In August 1888, Bertha Benz, the wife of Carl Benz, undertook the first road trip by car, to prove the road-worthiness of her husband's invention."),
 KeyDevelopment(year=1896, description='Benz designed and patented the first internal-combustion flat engine, called boxermotor.', evidence='In 1896, Benz designed and patented the first internal-combustion flat engine, called boxermotor.'),
 KeyDevelopment(year=1897, description='Nesselsdorfer Wagenbau produced the Präsident automobil, one of the first factory-made cars in the world.', evidence='The first motor car in central Europe and one of the first factory-made cars in the world, was produced by Czech company Nesselsdorfer Wagenbau (later renamed to Tatra) in 1897, the Präsident automobil.'),
 KeyDevelopment(year=1890, description='Daimler Motoren Gesellschaft (DMG) was founded by Daimler and Maybach in Cannstatt.', evidence='Daimler and Maybach founded Daimler Motoren Gesellschaft (DMG) in Cannstatt in 1890.'),
 KeyDevelopment(year=1902, description='A new model DMG car was produced and named Mercedes after the Maybach engine.', evidence='Two years later, in 1902, a new model DMG car was produced and the model was named Mercedes after the Maybach engine, which generated 35 hp.'),
 KeyDevelopment(year=1891, description='Auguste Doriot and Louis Rigoulot completed the longest trip by a petrol-driven vehicle using a Daimler powered Peugeot Type 3.', evidence='In 1891, Auguste Doriot and his Peugeot colleague Louis Rigoulot completed the longest trip by a petrol-driven vehicle when their self-designed and built Daimler powered Peugeot Type 3 completed 2,100 kilometres (1,300 mi) from Valentigney to Paris and Brest and back again.'),
 KeyDevelopment(year=1895, description='George Selden was granted a US patent for a two-stroke car engine.', evidence='After a delay of 16 years and a series of attachments to his application, on 5 November 1895, Selden was granted a US patent (U.S. patent 549,160) for a two-stroke car engine.'),
 KeyDevelopment(year=1893, description='The first running, petrol-driven American car was built and road-tested by the Duryea brothers.', evidence='In 1893, the first running, petrol-driven American car was built and road-tested by the Duryea brothers of Springfield, Massachusetts.'),
 KeyDevelopment(year=1897, description='Rudolf Diesel built the first diesel engine.', evidence='In 1897, he built the first diesel engine.'),
 KeyDevelopment(year=1901, description='Ransom Olds started large-scale, production-line manufacturing of affordable cars at his Oldsmobile factory.', evidence='Large-scale, production-line manufacturing of affordable cars was started by Ransom Olds in 1901 at his Oldsmobile factory in Lansing, Michigan.'),
 KeyDevelopment(year=1913, description="Henry Ford began the world's first moving assembly line for cars at the Highland Park Ford Plant.", evidence="This concept was greatly expanded by Henry Ford, beginning in 1913 with the world's first moving assembly line for cars at the Highland Park Ford Plant."),
 KeyDevelopment(year=1914, description="Ford's assembly line worker could buy a Model T with four months' pay.", evidence="In 1914, an assembly line worker could buy a Model T with four months' pay."),
 KeyDevelopment(year=1926, description='Fast-drying Duco lacquer was developed, allowing for a variety of car colors.', evidence='Only Japan black would dry fast enough, forcing the company to drop the variety of colours available before 1913, until fast-drying Duco lacquer was developed in 1926.')]
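
The chunks above were produced with `chunk_overlap=20`, so a sentence near a chunk boundary can be extracted twice. A minimal de-duplication sketch follows, assuming two items are duplicates when they share the same year and the same evidence text after whitespace and case normalization. The `Development` dataclass is a stand-in for `KeyDevelopment` so the example runs without LangChain, `dedupe` is a hypothetical helper, and the sample records are toy data.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Development:
    """Stand-in for KeyDevelopment (same three fields)."""
    year: int
    description: str
    evidence: str


def dedupe(items: Iterable[Development]) -> List[Development]:
    """Keep the first item for each (year, normalized evidence) pair, preserving order."""
    seen = set()
    unique: List[Development] = []
    for item in items:
        key = (item.year, " ".join(item.evidence.lower().split()))
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


# Toy data: the first two records quote the same sentence with different spacing.
merged = [
    Development(1897, "Rudolf Diesel built the first diesel engine.", "In 1897, he built the first diesel engine."),
    Development(1897, "The first diesel engine was built.", "In 1897, he built  the first diesel engine."),
    Development(1926, "Duco lacquer was developed.", "fast-drying Duco lacquer was developed in 1926."),
]
unique = dedupe(merged)
print([d.year for d in unique])
```

Keying on the evidence rather than the description is a design choice: descriptions are paraphrased by the model and vary between runs, while the quoted sentence should be stable.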

RAG based approach

Another simple idea is to chunk up the text, but instead of extracting information from every chunk, focus only on the most relevant chunks.

Caution

It can be difficult to identify which chunks are relevant.

For example, in the car article we're using here, most of the article contains key development information. So by using RAG, we'll likely be throwing out a lot of relevant information.

We suggest experimenting with your use case and determining whether this approach works or not.

Here's a simple example that relies on the FAISS vectorstore.

from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

texts = text_splitter.split_text(document.page_content)
vectorstore = FAISS.from_texts(texts, embedding=OpenAIEmbeddings())

retriever = vectorstore.as_retriever(
    search_kwargs={"k": 1}
)  # retrieve only the single most similar chunk

In this case, the RAG extractor only looks at the top document.

rag_extractor = {
    "text": retriever | (lambda docs: docs[0].page_content)  # fetch the content of the top document
} | extractor

results = rag_extractor.invoke("Key developments associated with cars")

for key_development in results.key_developments:
    print(key_development)

year=1924 description="Germany's first mass-manufactured car, the Opel 4PS Laubfrosch, was produced, making Opel the top car builder in Germany with 37.5% of the market." evidence="Germany's first mass-manufactured car, the Opel 4PS Laubfrosch (Tree Frog), came off the line at Rüsselsheim in 1924, making Opel the top car builder in Germany, with 37.5% of the market."
year=1925 description='Morris had 41% of total British car production, dominating the market.' evidence='In 1925, Morris had 41% of total British car production.'
year=1925 description='Citroën, Renault, and Peugeot produced 550,000 cars in France, dominating the market.' evidence="Citroën did the same in France, coming to cars in 1919; between them and other cheap cars such as Renault's 10CV and Peugeot's 5CV, they produced 550,000 cars in 1925."
year=2017 description='Production of petrol-fuelled cars peaked.' evidence='Production of petrol-fuelled cars peaked in 2017.'
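
The retrieval step can be pictured as scoring every chunk against the query and keeping the top `k`. The sketch below uses word overlap as a deliberately crude stand-in for embedding similarity (the FAISS example above uses OpenAI embeddings instead). `score` and `top_k_chunks` are hypothetical helpers, and the three chunks are toy text.

```python
from typing import List


def score(query: str, chunk: str) -> int:
    """Count distinct query words that also occur in the chunk (toy relevance measure)."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))


def top_k_chunks(query: str, chunks: List[str], k: int) -> List[str]:
    """Return the k highest-scoring chunks; sorted() is stable, so ties keep their original order."""
    ranked = sorted(chunks, key=lambda c: score(query, c), reverse=True)
    return ranked[:k]


chunks = [
    "the first steam-powered road vehicle was built in 1769",
    "cars have controls for driving, parking, and passenger comfort",
    "key developments in cars include the moving assembly line",
]
selected = top_k_chunks("key developments associated with cars", chunks, k=2)
print(selected)
```

In this toy run the first chunk is dropped even though it describes a genuine development, because it shares no words with the query. Real embeddings are far better than word overlap, but the same failure mode is what the caution above warns about.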

Common issues

Different methods have their own pros and cons related to cost, speed, and accuracy.

Watch out for these issues:

Chunking content means that the LLM can fail to extract information if the information is spread across multiple chunks.
Large chunk overlap may cause the same information to be extracted twice, so be prepared to de-duplicate!
LLMs can make up data. If you are looking for a single fact across a large text and using a brute force approach, you may end up with more made-up data.
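
One inexpensive guard against made-up data is to check that each `evidence` string actually occurs in the source text, since the schema asks the model to quote it verbatim. A minimal sketch follows, assuming case- and whitespace-insensitive matching is acceptable. `normalize` and `is_grounded` are hypothetical helpers, and plain strings stand in for the extracted objects and the document.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks do not defeat matching."""
    return " ".join(text.lower().split())


def is_grounded(evidence: str, source: str) -> bool:
    """True if the non-empty quoted evidence occurs in the source after normalization."""
    needle = normalize(evidence)
    return bool(needle) and needle in normalize(source)


# Toy source text with a line break inside the quoted sentence.
source = "In 1897, he built the first\ndiesel engine. Production peaked later."
print(is_grounded("In 1897, he built the first diesel engine.", source))
print(is_grounded("In 1900, he built the first jet engine.", source))
```

Items that fail the check are not necessarily wrong (the model may have lightly paraphrased the quote), so it is probably safer to flag them for review than to drop them silently.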