https://kazuhira-r.hatenablog.com/entry/2025/02/24/195146

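Handling non-uniform document lengths
- Real-world document collections often contain documents of varying lengths; splitting them makes it possible to process every document in a consistent way
Overcoming model limitations
- Many embedding models and language models have a maximum input size; splitting the text makes it possible to process documents that would otherwise exceed that limit
Improving representation quality
- For long documents, embeddings and other representations try to capture too much information at once, so their quality can degrade
- Splitting lets each section be represented in a more focused and accurate way
Improving retrieval precision
- In information retrieval systems, splitting makes search results more granular, so a query can be matched more precisely to the relevant section of a document
Optimizing computational resources
- Working with smaller chunks of text is more memory-efficient and also allows processing tasks to be parallelized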

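There are four approaches to text splitting.

Length-based
- Text splitters / Approaches / Length-based
- There is a token-based approach, which is convenient for language models, and a character-based approach, which simply splits by number of characters
Text-structure based
- Text splitters / Approaches / Text-structured based
- Text is organized into hierarchical units such as paragraphs, sentences, and words, so this structure is used to split while preserving the natural flow of the language and the semantic coherence of each piece
Document-structure based
- Text splitters / Approaches / Document-structured based
- Some document formats, such as HTML, Markdown, and JSON, have an inherent structure that tends to group semantically related text together, so splitting the document along that structure can be useful
Semantic meaning based
- Text splitters / Approaches / Semantic meaning based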
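
To make the character-based flavor of the length-based approach concrete, here is a minimal sketch in plain Python. The function name `split_by_length` and its parameters are my own for illustration; this is not LangChain's implementation, just the basic idea of fixed-size chunks with an optional overlap.

```python
def split_by_length(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share chunk_overlap characters.
    (Hypothetical helper for illustration, not a LangChain API.)
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current chunk already reaches the end of the text
    return chunks


chunks = split_by_length("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

A splitter like this ignores sentence and paragraph boundaries entirely, which is exactly what the structure-based approaches try to improve on.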

The how-to guides for Text splitters are listed here.

How-to guides / Components / Text splitters

Embedding models are an abstraction over embeddings, that is, over mapping text into a vector space.

Embedding models | 🦜️🔗 LangChain

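Embedding models are used through two methods.

embed_documents … computes text embeddings for multiple documents
embed_query … computes a text embedding for a single query

This distinction matters because some models use different embedding strategies for documents (the things being searched)
and for queries (the input used to search).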
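
As a rough illustration of why the two methods are kept separate, here is a toy embedder in plain Python. Everything in it is an assumption made for the sketch: the class name `ToyPrefixEmbeddings`, the character-code bucketing, and the "passage: " / "query: " prefixes merely mimic the idea that a model may treat documents and queries differently. It is not how any real model computes embeddings.

```python
import math


class ToyPrefixEmbeddings:
    """Hypothetical embedder that buckets words into a small vector.

    Documents and queries get different prefixes before embedding,
    so the same text yields different vectors depending on its role.
    """

    def __init__(self, size: int = 8) -> None:
        self.size = size

    def _embed(self, text: str) -> list[float]:
        vector = [0.0] * self.size
        for word in text.lower().split():
            # Toy bucketing: sum of character codes modulo the vector size.
            vector[sum(ord(ch) for ch in word) % self.size] += 1.0
        norm = math.sqrt(sum(v * v for v in vector)) or 1.0
        return [v / norm for v in vector]

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        # One vector per document, each embedded with the document prefix.
        return [self._embed("passage: " + text) for text in texts]

    def embed_query(self, text: str) -> list[float]:
        # A single vector, embedded with the query prefix.
        return self._embed("query: " + text)


embedder = ToyPrefixEmbeddings()
text = "Cats enjoy their own space."
doc_vectors = embedder.embed_documents([text])
query_vector = embedder.embed_query(text)
print(len(doc_vectors[0]), len(query_vector))  # 8 8
print(doc_vectors[0] == query_vector)  # False
```

The two vectors have the same dimensionality, so they can still be compared, but they are not identical even for identical text.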

The similarity between embeddings is measured with one of the following three similarity metrics.

Cosine similarity
Euclidean distance
Dot product
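
For reference, the three metrics can be written in a few lines of plain Python. This is only a sketch with function names of my own choosing; real vector stores compute these in optimized native code.

```python
import math


def dot_product(a: list[float], b: list[float]) -> float:
    """Sum of element-wise products; larger means more similar."""
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a: list[float], b: list[float]) -> float:
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between the vectors; ignores their lengths."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot_product(a, b) / (norm_a * norm_b)


a = [1.0, 0.0]
b = [0.0, 1.0]
c = [2.0, 0.0]

print(cosine_similarity(a, c))   # 1.0 (same direction)
print(cosine_similarity(a, b))   # 0.0 (orthogonal)
print(dot_product(a, c))         # 2.0
print(euclidean_distance(a, c))  # 1.0
```

Note that `a` and `c` point in the same direction, so cosine similarity treats them as identical, while Euclidean distance and dot product are sensitive to the difference in length.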

The available integrations are listed here.

Embedding models | 🦜️🔗 LangChain

The how-to guides for Embedding models are listed here.

How-to guides / Components / Embedding models

Vector stores are an abstraction over data stores that can index and retrieve information based on
text embeddings (vector representations).

Vector stores | 🦜️🔗 LangChain

The available integrations are listed here.

Vector stores | 🦜️🔗 LangChain

With Vector stores, you mainly use the following methods.

add_documents … adds a list of documents to the vector database
delete … deletes a list of documents from the vector database
similarity_search … searches for documents similar to the given query

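The semantic search the tutorial talks about refers to this search for similar documents.

Most Vector stores in LangChain require an Embedding model at initialization.

After initialization you work with the three methods above, and in some cases you can also filter
by the metadata attached to the documents.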
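
To show how these pieces fit together, here is a toy in-memory vector store in plain Python. It is a sketch under my own assumptions, not LangChain's implementation: `ToyDocument`, `ToyVectorStore`, the letter-count `letter_counts` embedding, and the dict-equality `filter` are all made up for illustration (real stores use their own filter syntax).

```python
import math
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class ToyDocument:
    page_content: str
    metadata: dict = field(default_factory=dict)


def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Hypothetical store: takes an embedding function at initialization."""

    def __init__(self, embed: Callable[[str], list[float]]) -> None:
        self._embed = embed
        self._items: dict[str, tuple[list[float], ToyDocument]] = {}
        self._next_id = 0

    def add_documents(self, documents: list[ToyDocument]) -> list[str]:
        # Embed each document as it is stored and return the assigned ids.
        ids = []
        for document in documents:
            doc_id = str(self._next_id)
            self._next_id += 1
            self._items[doc_id] = (self._embed(document.page_content), document)
            ids.append(doc_id)
        return ids

    def delete(self, ids: list[str]) -> None:
        for doc_id in ids:
            self._items.pop(doc_id, None)

    def similarity_search(
        self, query: str, k: int = 4, filter: Optional[dict] = None
    ) -> list[ToyDocument]:
        # Embed the query, keep documents whose metadata matches, rank by cosine.
        query_vector = self._embed(query)
        candidates = [
            (vector, doc)
            for vector, doc in self._items.values()
            if filter is None
            or all(doc.metadata.get(key) == value for key, value in filter.items())
        ]
        candidates.sort(key=lambda item: _cosine(query_vector, item[0]), reverse=True)
        return [doc for _, doc in candidates[:k]]


def letter_counts(text: str) -> list[float]:
    """Toy embedding: how often each letter a-z occurs in the text."""
    return [float(text.lower().count(ch)) for ch in "abcdefghijklmnopqrstuvwxyz"]


store = ToyVectorStore(letter_counts)
ids = store.add_documents(
    [
        ToyDocument("dogs are loyal", {"source": "a"}),
        ToyDocument("cats like space", {"source": "b"}),
    ]
)
print(store.similarity_search("dog", k=1)[0].page_content)  # dogs are loyal
print(store.similarity_search("dog", k=1, filter={"source": "b"})[0].page_content)  # cats like space
store.delete([ids[0]])
print(len(store.similarity_search("dog")))  # 1
```

The point of the sketch is the shape of the interface: the embedding function is fixed when the store is created, and both stored documents and incoming queries pass through it.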

The how-to guides for Vector stores are listed here.

How-to guides / Components / Vector stores

This falls more on the Retrievers side, but some data stores also support hybrid search,
which combines keyword search with semantic search.

Hybrid Search | 🦜️🔗 LangChain

Last are Retrievers. Retrievers are an interface for interacting with various types of retrieval systems.

Retrievers | 🦜️🔗 LangChain

The available integrations are listed here.

Retrievers | 🦜️🔗 LangChain

When you call a Retriever with a query, it returns a list of documents with the following attributes.

page_content … the content of the document (a string)
metadata … arbitrary metadata associated with the document
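
Since a Retriever is only "query in, documents out", it does not have to be backed by vectors at all. The following plain-Python sketch shows a keyword-based one. `ToyDocument` and `ToyKeywordRetriever` are hypothetical names for this illustration, and the word-overlap ranking is an assumption of mine, not something taken from LangChain.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDocument:
    page_content: str
    metadata: dict = field(default_factory=dict)


class ToyKeywordRetriever:
    """Hypothetical retriever: ranks documents by shared words with the query."""

    def __init__(self, documents: list[ToyDocument], k: int = 2) -> None:
        self.documents = documents
        self.k = k

    def invoke(self, query: str) -> list[ToyDocument]:
        words = set(query.lower().split())
        scored = []
        for document in self.documents:
            overlap = len(words & set(document.page_content.lower().split()))
            if overlap > 0:  # skip documents that share no word with the query
                scored.append((overlap, document))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [document for _, document in scored[: self.k]]


docs = [
    ToyDocument("Dogs are great companions", {"source": "mammal-pets-doc"}),
    ToyDocument("Cats are independent pets", {"source": "mammal-pets-doc"}),
]
retriever = ToyKeywordRetriever(docs)
for hit in retriever.invoke("independent cats"):
    print(hit.page_content, hit.metadata)
```

Whatever the search mechanism behind it, the caller only ever sees a list of objects with `page_content` and `metadata`.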

The how-to guides for Retrievers are listed here.

How-to guides / Components / Retrievers

For this post, I'll follow the tutorial but use Ollama for the Embedding model and Qdrant for the Vector store.

Environment

Here is the environment used this time.

$ python3 --version
Python 3.12.3


$ uv --version
uv 0.6.2

Ollama.

$ bin/ollama serve
$ bin/ollama --version
ollama version is 0.5.11

Qdrant is assumed to be running at 172.17.0.2.

$ ./qdrant --version
qdrant 1.13.4

Setup

First, create the project.

$ uv init --vcs none langchain-tutorial-semantic-search
$ cd langchain-tutorial-semantic-search
$ rm main.py

Install the dependencies needed this time.

$ uv add langchain-community langchain-ollama langchain-qdrant pypdf

I'll add mypy and Ruff as well.

$ uv add --dev mypy ruff

The list of installed dependencies.

$ uv pip list
Package                  Version
------------------------ ---------
aiohappyeyeballs         2.4.6
aiohttp                  3.11.12
aiosignal                1.3.2
annotated-types          0.7.0
anyio                    4.8.0
attrs                    25.1.0
certifi                  2025.1.31
charset-normalizer       3.4.1
dataclasses-json         0.6.7
frozenlist               1.5.0
greenlet                 3.1.1
grpcio                   1.70.0
grpcio-tools             1.70.0
h11                      0.14.0
h2                       4.2.0
hpack                    4.1.0
httpcore                 1.0.7
httpx                    0.28.1
httpx-sse                0.4.0
hyperframe               6.1.0
idna                     3.10
jsonpatch                1.33
jsonpointer              3.0.0
langchain                0.3.19
langchain-community      0.3.18
langchain-core           0.3.37
langchain-ollama         0.2.3
langchain-qdrant         0.2.0
langchain-text-splitters 0.3.6
langsmith                0.3.10
marshmallow              3.26.1
multidict                6.1.0
mypy                     1.15.0
mypy-extensions          1.0.0
numpy                    2.2.3
ollama                   0.4.7
orjson                   3.10.15
packaging                24.2
portalocker              2.10.1
propcache                0.3.0
protobuf                 5.29.3
pydantic                 2.10.6
pydantic-core            2.27.2
pydantic-settings        2.8.0
pypdf                    5.3.0
python-dotenv            1.0.1
pyyaml                   6.0.2
qdrant-client            1.13.2
requests                 2.32.3
requests-toolbelt        1.0.0
ruff                     0.9.7
setuptools               75.8.0
sniffio                  1.3.1
sqlalchemy               2.0.38
tenacity                 9.0.0
typing-extensions        4.12.2
typing-inspect           0.9.0
urllib3                  2.3.0
yarl                     1.18.3
zstandard                0.23.0

pyproject.toml

[project]
name = "langchain-tutorial-semantic-search"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.12"
dependencies = [
    "langchain-community>=0.3.18",
    "langchain-ollama>=0.2.3",
    "langchain-qdrant>=0.2.0",
    "pypdf>=5.3.0",
]

[dependency-groups]
dev = [
    "mypy>=1.15.0",
    "ruff>=0.9.7",
]

[tool.mypy]
strict = true
disallow_any_unimported = true
#disallow_any_expr = true
disallow_any_explicit = true
warn_unreachable = true
pretty = true

Trying out the semantic search from the LangChain tutorial

Now let's work through it, following this page.

Build a semantic search engine | 🦜️🔗 LangChain

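Looking at how the content is organized, I'll split the work into three parts.

Storing documents in the vector database

First, I'll go as far as storing documents in the vector database.

In the tutorial, this corresponds to three sections.

For Vector stores, I won't go as far as searching in this part.

Here is the source code I wrote.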

hello_load_documents.py

from langchain_community.document_loaders import PyPDFLoader
from langchain_core.documents import Document
from langchain_ollama import OllamaEmbeddings
from langchain_qdrant import QdrantVectorStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams

documents = [
    Document(
        page_content="Dogs are great companions, known for their loyalty and friendliness.",
        metadata={"source": "mammal-pets-doc"},
    ),
    Document(
        page_content="Cats are independent pets that often enjoy their own space.",
        metadata={"source": "mammal-pets-doc"},
    ),
]

file_path = "example_data/nke-10k-2023.pdf"
loader = PyPDFLoader(file_path)

docs = loader.load()

print(f"loaded document count = {len(docs)}")

print()

print(f"{docs[0].page_content[:200]}\n")

print()

print(docs[0].metadata)

print()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
all_splits = text_splitter.split_documents(docs)

print(f"all splits count = {len(all_splits)}")

embeddings = OllamaEmbeddings(
    model="all-minilm:l6-v2", base_url="http://localhost:11434"
)

vector_1 = embeddings.embed_query(all_splits[0].page_content)
vector_2 = embeddings.embed_query(all_splits[1].page_content)

assert len(vector_1) == len(vector_2)
print(f"Generated vectors of length {len(vector_1)}")
print(vector_1[:10])

client = QdrantClient("http://172.17.0.2:6333")
client.delete_collection(collection_name="tutorial_collection")
client.create_collection(
    collection_name="tutorial_collection",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

vector_store = QdrantVectorStore(
    client=client, collection_name="tutorial_collection", embedding=embeddings
)

ids = vector_store.add_documents(all_splits)

I'll explain each part in turn below.

Here is how to run it.

$ uv run hello_load_documents.py

A sample of Document. The data defined here is not actually used; it is only shown as an example of the type.

documents = [
    Document(
        page_content="Dogs are great companions, known for their loyalty and friendliness.",
        metadata={"source": "mammal-pets-doc"},
    ),
    Document(
        page_content="Cats are independent pets that often enjoy their own space.",
        metadata={"source": "mammal-pets-doc"},
    ),
]

Here is the code that loads the documents actually used this time.

file_path = "example_data/nke-10k-2023.pdf"
loader = PyPDFLoader(file_path)

docs = loader.load()

The input is a PDF file, and the loader used is PyPDFLoader.

PyPDFLoader | 🦜️🔗 LangChain

How to load PDFs | 🦜️🔗 LangChain

example_data/nke-10k-2023.pdf refers to this PDF file.

https://github.com/langchain-ai/langchain/blob/langchain-core%3D%3D0.3.37/docs/docs/example_data/nke-10k-2023.pdf

I download it and read it as a local file.

$ mkdir example_data
$ curl -L https://raw.githubusercontent.com/langchain-ai/langchain/refs/tags/langchain-core%3D%3D0.3.37/docs/docs/example_data/nke-10k-2023.pdf -o example_data/nke-10k-2023.pdf

Print the contents of the loaded documents.

print(f"loaded document count = {len(docs)}")

print()

print(f"{docs[0].page_content[:200]}\n")

print()

print(docs[0].metadata)

print()

This prints the number of loaded documents, the first 200 characters of the first document, and the first document's metadata.
The result looks like this.

loaded document count = 107

Table of Contents
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
FORM 10-K
(Mark One)
☑  ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934
F


{'producer': 'EDGRpdf Service w/ EO.Pdf 22.0.40.0', 'creator': 'EDGAR Filing HTML Converter', 'creationdate': '2023-07-20T16:22:00-04:00', 'title': '0000320187-23-000039', 'author': 'EDGAR Online, a division of Donnelley Financial Solutions', 'subject': 'Form 10-K filed on 2023-07-20 for the period ending 2023-05-31', 'keywords': '0000320187-23-000039; ; 10-K', 'moddate': '2023-07-20T16:22:08-04:00', 'source': 'example_data/nke-10k-2023.pdf', 'total_pages': 107, 'page': 0, 'page_label': '1'}

Next is splitting the text.

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
all_splits = text_splitter.split_documents(docs)

print(f"all splits count = {len(all_splits)}")

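Here the documents are split into chunks of up to 1000 characters, with 200 characters of overlap between chunks. Letting
neighboring chunks overlap reduces the chance that a sentence in a chunk gets cut off from important context.

RecursiveCharacterTextSplitter splits recursively until each chunk is an appropriate size. It splits on
common separators such as newlines.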

How to recursively split text by characters | 🦜️🔗 LangChain
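
The recursive idea can be sketched in plain Python as follows. This is a deliberately simplified version under my own assumptions: the function name `recursive_split` is made up, there is no chunk overlap, empty pieces are dropped, and the separator list is just a plausible default. The real RecursiveCharacterTextSplitter handles more cases.

```python
def recursive_split(
    text: str, chunk_size: int, separators: tuple[str, ...] = ("\n\n", "\n", " ", "")
) -> list[str]:
    """Split text so that every chunk has at most chunk_size characters.

    Coarse separators are tried first; finer ones are used only for
    pieces that are still too long. (Simplified sketch, no overlap.)
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    separator, rest = separators[0], separators[1:]

    # Split on the current separator and recurse into oversized pieces.
    pieces: list[str] = []
    for part in text.split(separator):
        if len(part) > chunk_size:
            pieces.extend(recursive_split(part, chunk_size, rest))
        elif part:
            pieces.append(part)

    # Merge neighboring pieces back together while they still fit.
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "aaa bbb\n\nccc ddd eee"
print(recursive_split(sample, chunk_size=8))  # ['aaa bbb', 'ccc ddd', 'eee']
```

In the sample, the paragraph break is tried first; only the second paragraph is too long, so only that one is broken further at spaces.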

add_start_index=True stores, in each chunk's metadata, the character offset at which that chunk starts within its original document, under the key start_index.

This time the text was split into 516 chunks.

all splits count = 516

Text embedding.

embeddings = OllamaEmbeddings(
    model="all-minilm:l6-v2", base_url="http://localhost:11434"
)

vector_1 = embeddings.embed_query(all_splits[0].page_content)
vector_2 = embeddings.embed_query(all_splits[1].page_content)

assert len(vector_1) == len(vector_2)
print(f"Generated vectors of length {len(vector_1)}")
print(vector_1[:10])

This time I used Ollama for the text embeddings, with the all-minilm:l6-v2 model.

embeddings = OllamaEmbeddings(
    model="all-minilm:l6-v2", base_url="http://localhost:11434"
)

OllamaEmbeddings | 🦜️🔗 LangChain

Here, as a sample, the first two chunks are vectorized to check the dimensionality of the vectors. Then
the first 10 elements of the first vector are printed.

vector_1 = embeddings.embed_query(all_splits[0].page_content)
vector_2 = embeddings.embed_query(all_splits[1].page_content)

assert len(vector_1) == len(vector_2)
print(f"Generated vectors of length {len(vector_1)}")
print(vector_1[:10])

Here is the result. The vectors have 384 dimensions.

Generated vectors of length 384
[-0.024527563, -0.118282035, 0.004233229, 0.018769965, 0.0025654335, 0.09103639, 0.035418395, 0.012415745, -0.0065588024, -0.033638902]

Finally, the data is stored in Qdrant, with the text being embedded along the way.

client = QdrantClient("http://172.17.0.2:6333")
client.delete_collection(collection_name="tutorial_collection")
client.create_collection(
    collection_name="tutorial_collection",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

vector_store = QdrantVectorStore(
    client=client, collection_name="tutorial_collection", embedding=embeddings
)

ids = vector_store.add_documents(all_splits)

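This part works with the Qdrant client directly: it deletes any collection with the same name and then creates the Qdrant collection.
The vector size is 384 and the distance metric is cosine similarity.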

client = QdrantClient("http://172.17.0.2:6333")
client.delete_collection(collection_name="tutorial_collection")
client.create_collection(
    collection_name="tutorial_collection",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)
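
The collection size has to match the dimensionality of the embedding model, which the script hard-codes as 384. One possible alternative, sketched below, is to measure it by embedding a short probe text. `detect_vector_size` and `FakeEmbeddings` are hypothetical names of my own; the sketch only assumes that the embeddings object has the `embed_query` method used earlier in this post.

```python
from typing import Protocol


class SupportsEmbedQuery(Protocol):
    def embed_query(self, text: str) -> list[float]: ...


def detect_vector_size(
    embeddings: SupportsEmbedQuery, probe_text: str = "dimension probe"
) -> int:
    """Embed a short probe text and return the length of the resulting vector."""
    return len(embeddings.embed_query(probe_text))


class FakeEmbeddings:
    """Stand-in for a real embedding model so the sketch runs without Ollama."""

    def embed_query(self, text: str) -> list[float]:
        return [0.0] * 384  # pretend the model produces 384-dimensional vectors


size = detect_vector_size(FakeEmbeddings())
print(size)  # 384
```

With the objects from this post, the same helper could be called as `detect_vector_size(embeddings)` and the result passed as `size` to `VectorParams`, at the cost of one extra embedding request.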

Then a Vector store backed by Qdrant is created by specifying the Qdrant client, the collection name, and the
Ollama-based Embedding model.

vector_store = QdrantVectorStore(
    client=client, collection_name="tutorial_collection", embedding=embeddings
)

ids = vector_store.add_documents(all_splits)

Finally, the documents are stored.

Qdrant | 🦜️🔗 LangChain

The text embedding is performed at the same time as the documents are stored.

So this is the heaviest part of the script.

ids = vector_store.add_documents(all_splits)
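
Because this single call does all the embedding work, it can run for a while without any output. One way to make progress visible, sketched here under my own assumptions, is to feed the documents in smaller batches. `iter_batches`, `add_in_batches`, and `RecordingStore` are hypothetical names; the only thing assumed about the store is the `add_documents` method used above. The real store may well batch internally already, so this is mostly about reporting progress.

```python
from typing import Any, Iterator, Sequence


def iter_batches(items: Sequence[Any], batch_size: int) -> Iterator[Sequence[Any]]:
    """Yield consecutive slices of items, each with at most batch_size elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def add_in_batches(
    vector_store: Any, documents: Sequence[Any], batch_size: int = 64
) -> list[str]:
    """Call vector_store.add_documents once per batch and report progress."""
    ids: list[str] = []
    for batch in iter_batches(documents, batch_size):
        ids.extend(vector_store.add_documents(list(batch)))
        print(f"stored {len(ids)} / {len(documents)} documents")
    return ids


class RecordingStore:
    """Stand-in for a real vector store; it only records the batch sizes it receives."""

    def __init__(self) -> None:
        self.batch_sizes: list[int] = []

    def add_documents(self, documents: list[Any]) -> list[str]:
        self.batch_sizes.append(len(documents))
        return [f"id-{i}" for i in range(len(documents))]


store = RecordingStore()
ids = add_in_batches(store, ["a", "b", "c", "d", "e"], batch_size=2)
print(store.batch_sizes)  # [2, 2, 1]
```

With the objects from this post, the call would look like `add_in_batches(vector_store, all_splits)`.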

The Qdrant Web UI is reachable at http://[host running Qdrant]:6333/dashboard, so I check the result there.

Looks good.

Searching

Next, let's search the vector database.

This is the relevant part of the tutorial.

Build a semantic search engine / Vector stores / Usage

Here is the source code I wrote.

hello_query.py

from langchain_ollama import OllamaEmbeddings
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
import sys

embeddings = OllamaEmbeddings(
    model="all-minilm:l6-v2", base_url="http://localhost:11434"
)

client = QdrantClient("http://172.17.0.2:6333")

vector_store = QdrantVectorStore(
    client=client, collection_name="tutorial_collection", embedding=embeddings
)

query = sys.argv[1]

print(f"query = {query}")

print()

results = vector_store.similarity_search(query)

print(f"result count = {len(results)}")

print(f"first document = {results[0]}")

Up to the point where the Vector store is created, the pieces involved are the same as when loading the documents.

The query is taken from a command-line argument.

query = sys.argv[1]

The search is done with similarity_search. The query is also vectorized at this point.

results = vector_store.similarity_search(query)

This time it prints the number of hits and the first document.

Execution results.

$ uv run hello_query.py 'How many distribution centers does Nike have in the US?'
query = How many distribution centers does Nike have in the US?

result count = 4
first document = page_content='direct to consumer operations sell products through the following number of retail stores in the United States:
U.S. RETAIL STORES NUMBER
NIKE Brand factory stores 213
NIKE Brand in-line stores (including employee-only stores) 74
Converse stores (including factory stores) 82
TOTAL 369
In the United States, NIKE has eight significant distribution centers. Refer to Item 2. Properties for further information.
2023 FORM 10-K 2' metadata={'producer': 'EDGRpdf Service w/ EO.Pdf 22.0.40.0', 'creator': 'EDGAR Filing HTML Converter', 'creationdate': '2023-07-20T16:22:00-04:00', 'title': '0000320187-23-000039', 'author': 'EDGAR Online, a division of Donnelley Financial Solutions', 'subject': 'Form 10-K filed on 2023-07-20 for the period ending 2023-05-31', 'keywords': '0000320187-23-000039; ; 10-K', 'moddate': '2023-07-20T16:22:08-04:00', 'source': 'example_data/nke-10k-2023.pdf', 'total_pages': 107, 'page': 4, 'page_label': '5', 'start_index': 3125, '_id': 'b88bb8f3-10c1-4147-9047-4ecdfd335912', '_collection_name': 'tutorial_collection'}


$ uv run hello_query.py 'When was Nike incorporated?'
query = When was Nike incorporated?

result count = 4
first document = page_content='Table of Contents
PART I
ITEM 1. BUSINESS
GENERAL
NIKE, Inc. was incorporated in 1967 under the laws of the State of Oregon. As used in this Annual Report on Form 10-K (this "Annual Report"), the terms "we," "us," "our,"
"NIKE" and the "Company" refer to NIKE, Inc. and its predecessors, subsidiaries and affiliates, collectively, unless the context indicates otherwise.
Our principal business activity is the design, development and worldwide marketing and selling of athletic footwear, apparel, equipment, accessories and services. NIKE is
the largest seller of athletic footwear and apparel in the world. We sell our products through NIKE Direct operations, which are comprised of both NIKE-owned retail stores
and sales through our digital platforms (also referred to as "NIKE Brand Digital"), to retail accounts and to a mix of independent distributors, licensees and sales' metadata={'producer': 'EDGRpdf Service w/ EO.Pdf 22.0.40.0', 'creator': 'EDGAR Filing HTML Converter', 'creationdate': '2023-07-20T16:22:00-04:00', 'title': '0000320187-23-000039', 'author': 'EDGAR Online, a division of Donnelley Financial Solutions', 'subject': 'Form 10-K filed on 2023-07-20 for the period ending 2023-05-31', 'keywords': '0000320187-23-000039; ; 10-K', 'moddate': '2023-07-20T16:22:08-04:00', 'source': 'example_data/nke-10k-2023.pdf', 'total_pages': 107, 'page': 3, 'page_label': '4', 'start_index': 0, '_id': 'b19e33b5-8e05-4282-9063-bc308dd64e0d', '_collection_name': 'tutorial_collection'}

The results match the tutorial.

For asynchronous use, it seems you prefix the method name with a, as in asimilarity_search.

Also, to get the scores, it seems you use the similarity_search_with_score method.
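
As a small sketch of how the scored variant might be consumed: the code below assumes `similarity_search_with_score` returns (document, score) pairs, and it runs against a stand-in store so that no Qdrant or Ollama is needed. `format_scored_results` and `FakeScoredStore` are hypothetical names, and the scores in the stand-in are made up purely for illustration, not measured results.

```python
from types import SimpleNamespace
from typing import Any


def format_scored_results(vector_store: Any, query: str, k: int = 4) -> list[str]:
    """Return one 'score  text' line per hit, in the order the store returned them."""
    lines = []
    for document, score in vector_store.similarity_search_with_score(query, k=k):
        lines.append(f"{score:.3f}  {document.page_content[:50]}")
    return lines


class FakeScoredStore:
    """Stand-in for a real vector store; the scores below are invented."""

    def similarity_search_with_score(
        self, query: str, k: int = 4
    ) -> list[tuple[Any, float]]:
        hits = [
            (SimpleNamespace(page_content="NIKE has eight significant distribution centers."), 0.75),
            (SimpleNamespace(page_content="NIKE, Inc. was incorporated in 1967."), 0.5),
        ]
        return hits[:k]


for line in format_scored_results(FakeScoredStore(), "distribution centers", k=1):
    print(line)  # 0.750  NIKE has eight significant distribution centers.
```

How to read the score (higher is better or lower is better) depends on the store and the distance metric, so it is worth checking for the store in use before comparing values.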

Using a Retriever

Last is the Retriever. Here I'm just trying it out briefly.

Build a semantic search engine / Retrievers

In this case the Retriever is built on the Vector store, and I use two approaches: one using @chain, and one that obtains
a Retriever through the Vector store's as_retriever method.

Here is the source code.

hello_retriever.py

from langchain_core.documents import Document
from langchain_core.runnables import chain
from langchain_ollama import OllamaEmbeddings
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient

embeddings = OllamaEmbeddings(
    model="all-minilm:l6-v2", base_url="http://localhost:11434"
)

client = QdrantClient("http://172.17.0.2:6333")

vector_store = QdrantVectorStore(
    client=client, collection_name="tutorial_collection", embedding=embeddings
)


@chain
def retriever(query: str) -> list[Document]:
    return vector_store.similarity_search(query, k=1)


results = retriever.batch(
    [
        "How many distribution centers does Nike have in the US?",
        "When was Nike incorporated?",
    ],
)

print(results[0][0])
print()
print(results[1][0])
print()

r = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 1},
)

results = r.batch(
    [
        "How many distribution centers does Nike have in the US?",
        "When was Nike incorporated?",
    ],
)

print(results[0][0])
print()
print(results[1][0])
print()

The queries that were passed as command-line arguments earlier are specified directly here.

Here are the execution results.

$ uv run hello_retriever.py
/path/to/langchain-tutorial-semantic-search/.venv/lib/python3.12/site-packages/langchain/__init__.py:30: UserWarning: Importing debug from langchain root module is no longer supported. Please use langchain.globals.set_debug() / langchain.globals.get_debug() instead.
  warnings.warn(
page_content='direct to consumer operations sell products through the following number of retail stores in the United States:
U.S. RETAIL STORES NUMBER
NIKE Brand factory stores 213
NIKE Brand in-line stores (including employee-only stores) 74
Converse stores (including factory stores) 82
TOTAL 369
In the United States, NIKE has eight significant distribution centers. Refer to Item 2. Properties for further information.
2023 FORM 10-K 2' metadata={'producer': 'EDGRpdf Service w/ EO.Pdf 22.0.40.0', 'creator': 'EDGAR Filing HTML Converter', 'creationdate': '2023-07-20T16:22:00-04:00', 'title': '0000320187-23-000039', 'author': 'EDGAR Online, a division of Donnelley Financial Solutions', 'subject': 'Form 10-K filed on 2023-07-20 for the period ending 2023-05-31', 'keywords': '0000320187-23-000039; ; 10-K', 'moddate': '2023-07-20T16:22:08-04:00', 'source': 'example_data/nke-10k-2023.pdf', 'total_pages': 107, 'page': 4, 'page_label': '5', 'start_index': 3125, '_id': 'b88bb8f3-10c1-4147-9047-4ecdfd335912', '_collection_name': 'tutorial_collection'}

page_content='Table of Contents
PART I
ITEM 1. BUSINESS
GENERAL
NIKE, Inc. was incorporated in 1967 under the laws of the State of Oregon. As used in this Annual Report on Form 10-K (this "Annual Report"), the terms "we," "us," "our,"
"NIKE" and the "Company" refer to NIKE, Inc. and its predecessors, subsidiaries and affiliates, collectively, unless the context indicates otherwise.
Our principal business activity is the design, development and worldwide marketing and selling of athletic footwear, apparel, equipment, accessories and services. NIKE is
the largest seller of athletic footwear and apparel in the world. We sell our products through NIKE Direct operations, which are comprised of both NIKE-owned retail stores
and sales through our digital platforms (also referred to as "NIKE Brand Digital"), to retail accounts and to a mix of independent distributors, licensees and sales' metadata={'producer': 'EDGRpdf Service w/ EO.Pdf 22.0.40.0', 'creator': 'EDGAR Filing HTML Converter', 'creationdate': '2023-07-20T16:22:00-04:00', 'title': '0000320187-23-000039', 'author': 'EDGAR Online, a division of Donnelley Financial Solutions', 'subject': 'Form 10-K filed on 2023-07-20 for the period ending 2023-05-31', 'keywords': '0000320187-23-000039; ; 10-K', 'moddate': '2023-07-20T16:22:08-04:00', 'source': 'example_data/nke-10k-2023.pdf', 'total_pages': 107, 'page': 3, 'page_label': '4', 'start_index': 0, '_id': 'b19e33b5-8e05-4282-9063-bc308dd64e0d', '_collection_name': 'tutorial_collection'}

page_content='direct to consumer operations sell products through the following number of retail stores in the United States:
U.S. RETAIL STORES NUMBER
NIKE Brand factory stores 213
NIKE Brand in-line stores (including employee-only stores) 74
Converse stores (including factory stores) 82
TOTAL 369
In the United States, NIKE has eight significant distribution centers. Refer to Item 2. Properties for further information.
2023 FORM 10-K 2' metadata={'producer': 'EDGRpdf Service w/ EO.Pdf 22.0.40.0', 'creator': 'EDGAR Filing HTML Converter', 'creationdate': '2023-07-20T16:22:00-04:00', 'title': '0000320187-23-000039', 'author': 'EDGAR Online, a division of Donnelley Financial Solutions', 'subject': 'Form 10-K filed on 2023-07-20 for the period ending 2023-05-31', 'keywords': '0000320187-23-000039; ; 10-K', 'moddate': '2023-07-20T16:22:08-04:00', 'source': 'example_data/nke-10k-2023.pdf', 'total_pages': 107, 'page': 4, 'page_label': '5', 'start_index': 3125, '_id': 'b88bb8f3-10c1-4147-9047-4ecdfd335912', '_collection_name': 'tutorial_collection'}

page_content='Table of Contents
PART I
ITEM 1. BUSINESS
GENERAL
NIKE, Inc. was incorporated in 1967 under the laws of the State of Oregon. As used in this Annual Report on Form 10-K (this "Annual Report"), the terms "we," "us," "our,"
"NIKE" and the "Company" refer to NIKE, Inc. and its predecessors, subsidiaries and affiliates, collectively, unless the context indicates otherwise.
Our principal business activity is the design, development and worldwide marketing and selling of athletic footwear, apparel, equipment, accessories and services. NIKE is
the largest seller of athletic footwear and apparel in the world. We sell our products through NIKE Direct operations, which are comprised of both NIKE-owned retail stores
and sales through our digital platforms (also referred to as "NIKE Brand Digital"), to retail accounts and to a mix of independent distributors, licensees and sales' metadata={'producer': 'EDGRpdf Service w/ EO.Pdf 22.0.40.0', 'creator': 'EDGAR Filing HTML Converter', 'creationdate': '2023-07-20T16:22:00-04:00', 'title': '0000320187-23-000039', 'author': 'EDGAR Online, a division of Donnelley Financial Solutions', 'subject': 'Form 10-K filed on 2023-07-20 for the period ending 2023-05-31', 'keywords': '0000320187-23-000039; ; 10-K', 'moddate': '2023-07-20T16:22:08-04:00', 'source': 'example_data/nke-10k-2023.pdf', 'total_pages': 107, 'page': 3, 'page_label': '4', 'start_index': 0, '_id': 'b19e33b5-8e05-4282-9063-bc308dd64e0d', '_collection_name': 'tutorial_collection'}

That's it for this time.

Wrapping up

I tried out the semantic search from the LangChain tutorial.

It feels like quite a few of the basic building blocks have shown up now.

Maybe I'll try RAG next.