Parsing - Extracting structured data with LangChain

Parsing

LLMs that follow prompt instructions well can be asked to output information in a specific format.

This approach relies on designing good prompts and then parsing the LLM's output to extract the information reliably.
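The prompt-then-parse loop can be sketched without any model at all. In the minimal sketch below, `fake_llm` is a hypothetical stand-in that returns a canned reply, and the prompt wording and helper names are assumptions made for illustration; none of them are part of LangChain.

```python
import json


def build_prompt(text: str) -> str:
    """Ask for a fixed JSON shape so the reply can be parsed mechanically."""
    return (
        "Extract every person mentioned in the text. "
        'Reply with JSON only, shaped like {"people": [{"name": "...", "height_in_meters": 0.0}]}.\n'
        f"Text: {text}"
    )


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a chat model call; returns a canned reply."""
    return '{"people": [{"name": "Anna", "height_in_meters": 1.83}]}'


def parse_reply(reply: str) -> dict:
    """Parse the reply and check that it has the shape the prompt asked for."""
    data = json.loads(reply)
    if not isinstance(data, dict) or not isinstance(data.get("people"), list):
        raise ValueError(f"Unexpected reply shape: {reply!r}")
    return data


result = parse_reply(fake_llm(build_prompt("Anna is 23 years old and she is 6 feet tall")))
print(result["people"][0]["name"])
```

The prompt and the parser are two halves of one contract: whatever shape the prompt requests is the shape the parser must check for.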

Here we use Claude, which is good at following instructions! See Anthropic models.

from langchain_anthropic.chat_models import ChatAnthropic

model = ChatAnthropic(model_name="claude-3-sonnet-20240229", temperature=0)

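Tip: All the same considerations for extraction quality apply to the parsing approach. Review the guidelines for extraction quality.

This tutorial is intentionally simple, but in general you should include reference examples to get the best performance!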
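As a rough illustration of what "including reference examples" can look like, the sketch below interleaves worked input/output pairs between the instructions and the real query. The example texts, the `build_messages` helper, and the `FENCE` constant are all assumptions made for this sketch; only the `(role, content)` pair shape mirrors the prompts used in this tutorial.

```python
import json

FENCE = "`" * 3  # three backticks, built here to keep this listing easy to embed

# Hypothetical reference examples: (input text, expected extraction) pairs.
EXAMPLES = [
    ("Bob is 2 meters tall.", {"people": [{"name": "Bob", "height_in_meters": 2.0}]}),
    ("The weather was nice today.", {"people": []}),
]


def build_messages(query: str) -> list:
    """Return (role, content) pairs: instructions, worked examples, then the query."""
    messages = [("system", "Answer the user query. Wrap the output in `json` tags")]
    for text, expected in EXAMPLES:
        messages.append(("human", text))
        messages.append(("ai", f"{FENCE}json\n{json.dumps(expected)}\n{FENCE}"))
    messages.append(("human", query))
    return messages


for role, content in build_messages("Anna is 23 years old and she is 6 feet tall"):
    print(f"{role}: {content}")
```

Including a negative example (text with no people) shows the model what an empty result should look like. If pairs like these are fed into a prompt template that treats `{...}` as variables, the literal braces in the JSON need to be escaped first.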

Using PydanticOutputParser

The following example uses the built-in PydanticOutputParser to parse the output of a chat model.

from typing import List, Optional

from langchain.output_parsers import PydanticOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field, validator


class Person(BaseModel):
    """Information about a person."""

    name: str = Field(..., description="The name of the person")
    height_in_meters: float = Field(
        ..., description="The height of the person expressed in meters."
    )


class People(BaseModel):
    """Identifying information about all people in a text."""

    people: List[Person]


# Set up a parser that validates model output against the People schema
parser = PydanticOutputParser(pydantic_object=People)

# The parser's format instructions are filled into the system message
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "Answer the user query. Wrap the output in `json` tags\n{format_instructions}",
        ),
        ("human", "{query}"),
    ]
).partial(format_instructions=parser.get_format_instructions())

Let's take a look at what information is sent to the model

query = "Anna is 23 years old and she is 6 feet tall"

print(prompt.format_prompt(query=query).to_string())

System: Answer the user query. Wrap the output in `json` tags
The output should be formatted as a JSON instance that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]} the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output schema:
{"description": "Identifying information about all people in a text.", "properties": {"people": {"title": "People", "type": "array", "items": {"$ref": "#/definitions/Person"}}}, "required": ["people"], "definitions": {"Person": {"title": "Person", "description": "Information about a person.", "type": "object", "properties": {"name": {"title": "Name", "description": "The name of the person", "type": "string"}, "height_in_meters": {"title": "Height In Meters", "description": "The height of the person expressed in meters.", "type": "number"}}, "required": ["name", "height_in_meters"]}}}

H: Anna is 23 years old and she is 6 feet tall

chain = prompt | model | parser
chain.invoke({"query": query})

People(people=[Person(name='Anna', height_in_meters=1.83)])
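To make the parsing step less of a black box, here is a minimal standard-library sketch of the kind of work a schema-validating parser has to do: locate the JSON in the reply, load it, and check each field. It is an illustration under stated assumptions, not LangChain's implementation; `PersonRecord` and `parse_people` are hypothetical names, and the sample reply is made up.

```python
import json
import re
from dataclasses import dataclass
from typing import List


@dataclass
class PersonRecord:
    name: str
    height_in_meters: float


def parse_people(reply: str) -> List[PersonRecord]:
    """Find the outermost JSON object in the reply and validate it field by field."""
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if match is None:
        raise ValueError("No JSON object found in the reply")
    data = json.loads(match.group())
    people = []
    for item in data["people"]:
        if not isinstance(item["name"], str):
            raise ValueError(f"name must be a string: {item!r}")
        people.append(PersonRecord(item["name"], float(item["height_in_meters"])))
    return people


reply = 'Here you go: {"people": [{"name": "Anna", "height_in_meters": 1.83}]}'
print(parse_people(reply))

# Sanity check on the extracted value: 6 feet * 0.3048 m/ft = 1.8288 m, which rounds to 1.83.
print(round(6 * 0.3048, 2))
```

Note that the model did the unit conversion itself: the query gave the height in feet, and the schema description asked for meters.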

Custom Parsing

It's easy to create a custom prompt and parser with LangChain and LCEL.

You can use a simple function to parse the output from the model!

import json
import re
from typing import List, Optional

from langchain_anthropic.chat_models import ChatAnthropic
from langchain_core.messages import AIMessage
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field, validator


class Person(BaseModel):
    """Information about a person."""

    name: str = Field(..., description="The name of the person")
    height_in_meters: float = Field(
        ..., description="The height of the person expressed in meters."
    )


class People(BaseModel):
    """Identifying information about all people in a text."""

    people: List[Person]


# The JSON schema of People is filled into the system message
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "Answer the user query. Output your answer as JSON that matches the given schema: ```json\n{schema}\n```. Make sure to wrap the answer in ```json and ``` tags",
        ),
        ("human", "{query}"),
    ]
).partial(schema=People.schema())


def extract_json(message: AIMessage) -> List[dict]:
    """Extracts JSON content from a message where JSON is embedded between ```json and ``` tags.

    Parameters:
        message (AIMessage): The model message whose content contains JSON.

    Returns:
        list: A list of the parsed JSON objects, one per fenced block.
    """
    text = message.content
    # Capture everything between an opening ```json and the next closing ```
    pattern = r"```json(.*?)```"

    # re.DOTALL lets "." match newlines, so multi-line JSON is captured
    matches = re.findall(pattern, text, re.DOTALL)

    try:
        return [json.loads(match.strip()) for match in matches]
    except Exception:
        raise ValueError(f"Failed to parse: {message}")

query = "Anna is 23 years old and she is 6 feet tall"
print(prompt.format_prompt(query=query).to_string())

System: Answer the user query. Output your answer as JSON that matches the given schema: ```json
{'title': 'People', 'description': 'Identifying information about all people in a text.', 'type': 'object', 'properties': {'people': {'title': 'People', 'type': 'array', 'items': {'$ref': '#/definitions/Person'}}}, 'required': ['people'], 'definitions': {'Person': {'title': 'Person', 'description': 'Information about a person.', 'type': 'object', 'properties': {'name': {'title': 'Name', 'description': 'The name of the person', 'type': 'string'}, 'height_in_meters': {'title': 'Height In Meters', 'description': 'The height of the person expressed in meters.', 'type': 'number'}}, 'required': ['name', 'height_in_meters']}}}
```. Make sure to wrap the answer in ```json and ``` tags
Human: Anna is 23 years old and she is 6 feet tall

chain = prompt | model | extract_json
chain.invoke({"query": query})

[{'people': [{'name': 'Anna', 'height_in_meters': 1.83}]}]
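It can help to see how this style of regex-based extraction behaves on replies that are not perfectly formed. The sketch below reuses the same regex idea as `extract_json` but runs it against hand-written inputs; `SimpleNamespace` is a hypothetical stand-in for a model message (only a `.content` string is needed), and `FENCE` and `extract_json_blocks` are names introduced for this sketch.

```python
import json
import re
from types import SimpleNamespace

FENCE = "`" * 3  # three backticks, built here to keep this listing easy to embed
PATTERN = re.escape(FENCE) + r"json(.*?)" + re.escape(FENCE)


def extract_json_blocks(message) -> list:
    """Parse every fenced json block found in message.content."""
    matches = re.findall(PATTERN, message.content, re.DOTALL)
    try:
        return [json.loads(match.strip()) for match in matches]
    except Exception:
        raise ValueError(f"Failed to parse: {message}")


two_blocks = SimpleNamespace(
    content=f'{FENCE}json\n{{"a": 1}}\n{FENCE}\nsome text\n{FENCE}json\n{{"b": 2}}\n{FENCE}'
)
no_fence = SimpleNamespace(content='{"a": 1}')
broken = SimpleNamespace(content=f"{FENCE}json\n{{not valid}}\n{FENCE}")

print(extract_json_blocks(two_blocks))  # one dict per fenced block
print(extract_json_blocks(no_fence))    # empty list: nothing was fenced
try:
    extract_json_blocks(broken)
except ValueError as err:
    print("rejected:", type(err).__name__)
```

Two behaviors are worth noticing: a reply with valid JSON but no fence yields an empty list rather than an error, and a single malformed block causes the whole call to fail. Depending on the application, either case may deserve explicit handling.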

Other Libraries

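If you're looking at extracting using a parsing approach, check out the Kor library. It's written by one of the LangChain maintainers, and it helps to craft a prompt that takes examples into account, allows controlling formats (e.g., JSON or CSV), and expresses the schema in TypeScript. It seems to work pretty well!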
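Controlling the output format matters because the parser has to match it. As a rough illustration of the CSV case, the sketch below parses a CSV-shaped reply with the standard library. This is plain Python, not Kor's API; the column names, the `parse_csv_reply` helper, and the sample reply are assumptions made for illustration.

```python
import csv
import io


def parse_csv_reply(reply: str) -> list:
    """Parse a model reply that is expected to be CSV with a header row."""
    reader = csv.DictReader(io.StringIO(reply.strip()))
    rows = []
    for row in reader:
        rows.append(
            {"name": row["name"], "height_in_meters": float(row["height_in_meters"])}
        )
    return rows


sample_reply = "name,height_in_meters\nAnna,1.83\nBob,2.0\n"
print(parse_csv_reply(sample_reply))
```

CSV tends to suit flat records like these; nested structures are usually easier to express in JSON.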
