https://kazuhira-r.hatenablog.com/entry/2023/12/08/003510

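What is this post for?

A while ago I wrote this entry:

Accessing the chat model of an OpenAI API-compatible server started with llama-cpp-python from the OpenAI Python API library - CLOVER🍀

That time I accessed the chat model API of an OpenAI API-compatible server started with llama-cpp-python. This time I'd like to use the embeddings API to vectorize text.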

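OpenAI embeddings

The OpenAI documentation on embeddings is here.

Embeddings

The concepts section of the Introduction describes embeddings as follows.

- A vector representation of data that is meant to preserve the meaning of the content
- Chunks of similar data tend to have embeddings that are close together

Introduction

Let's look a little further on the Embeddings page.

OpenAI's text embeddings are said to measure the relatedness of text strings.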

OpenAI’s text embeddings measure the relatedness of text strings.

Common uses are as follows.

- Search (where results are ranked by relevance to a query string)
- Clustering (where text strings are grouped by similarity)
- Recommendations (where items with related text strings are recommended)
- Anomaly detection (where outliers with little relatedness are identified)
- Diversity measurement (where similarity distributions are analyzed)
- Classification (where text strings are classified by their most similar label)

An embedding is represented as a vector (list) of floating-point numbers. The relatedness of two vectors can be measured by the distance between them.
A small distance indicates high relatedness, and a large distance indicates low relatedness.

An embedding is a vector (list) of floating point numbers. The distance between two vectors measures their relatedness. Small distances suggest high relatedness and large distances suggest low relatedness.

Cosine similarity is the recommended function for measuring distance.

Embeddings / Limitations & risks / Which distance function should I use?

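On the other hand, it also says that the choice of distance function does not matter much.

Also, OpenAI embeddings are normalized to length 1, which gives them the following properties.

- Cosine similarity can be computed slightly faster using only a dot product
- Cosine similarity and Euclidean distance produce identical rankings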
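
To convince myself of these two properties, here is a minimal sketch using numpy. The vectors are random stand-ins that I normalize to length 1 myself, not real OpenAI embeddings, and the sizes (5 documents, 8 dimensions) are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for embeddings: 5 document vectors and 1 query vector.
docs = rng.normal(size=(5, 8))
query = rng.normal(size=8)

# Normalize every vector to length 1, as OpenAI embeddings are described to be.
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query /= np.linalg.norm(query)

# Cosine similarity with explicit norms, and the plain dot product.
cosine = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
dot = docs @ query

# Euclidean distance; for unit vectors, ||a - b||^2 = 2 - 2 * (a . b).
euclid = np.linalg.norm(docs - query, axis=1)

rank_by_cosine = np.argsort(-cosine)  # most similar first
rank_by_euclid = np.argsort(euclid)   # closest first

print(np.allclose(cosine, dot))
print(rank_by_cosine.tolist() == rank_by_euclid.tolist())
```

Because the squared Euclidean distance between unit vectors is 2 minus twice the dot product, a larger dot product always means a smaller distance, which is why both orderings agree.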

Embeddings are used by calling the Embeddings API.

ENDPOINTS / Embeddings

Creating an embedding is covered here.

ENDPOINTS / Embeddings / Create embeddings

At a minimum, the input text and the model are required as parameters.
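
To make the shape of that call concrete, here is a rough sketch that builds (but does not send) the HTTP request using only the standard library. The helper name build_embeddings_request is my own invention for illustration, and the API key is a dummy value.

```python
import json
import urllib.request


def build_embeddings_request(base_url, api_key, text, model):
    """Build (but do not send) a POST request for the Create embeddings endpoint."""
    # The two required parameters go into the JSON body.
    body = json.dumps({"input": text, "model": model}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/embeddings",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


req = build_embeddings_request(
    "https://api.openai.com/v1", "dummy-api-key", "Hello World.", "text-embedding-ada-002"
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing the request to urllib.request.urlopen would actually send it. In the rest of this post I use the OpenAI Python library instead of raw HTTP.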

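There are two generations of embedding models: the second generation (model IDs ending in -002) and the first generation (model IDs ending in -001).

Model generation	Tokenizer	Max input tokens
V2	cl100k_base	8191
V1	GPT-2/GPT-3	2046

The first-generation models are deprecated, and using the second-generation model is recommended.

The only second-generation model is text-embedding-ada-002. Its tokenizer is cl100k_base and its maximum input is 8191 tokens.
The output has 1536 dimensions.

As with text generation models, the number of tokens is reflected in the price for embeddings as well.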

Pricing

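Use cases and samples are here.

Embeddings / Use cases

Vector databases for storing the vectorized data are also introduced.

Embeddings / Limitations & risks / How can I retrieve K nearest embedding vectors quickly?

Finally, about risks.

Embeddings / Limitations & risks

The documentation warns that OpenAI's embedding models may be unreliable or pose social risks in some cases,
and that some kind of mitigation should be put in place.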

Our embedding models may be unreliable or pose social risks in certain cases, and may cause harm in the absence of mitigations.

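It also describes concrete examples and how recent the models' knowledge is.

Now that I have a rough idea of what embeddings are, this time I'll try vectorizing text data with the embeddings API.
As the OpenAI API-compatible server, I'll use llama-cpp-python.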

Environment

The environment this time is as follows.

$ python3 -V
Python 3.10.12

The llama-cpp-python version.

$ pip3 freeze | grep llama_cpp_python
llama_cpp_python==0.2.20

I'll use this model.
Note: I realized afterwards that this model is probably not the right choice for this purpose...

TheBloke/Llama-2-7B-Chat-GGUF · Hugging Face

Start the server.

$ python3 -m llama_cpp.server --model llama-2-7b-chat.Q4_K_M.gguf

Vectorizing text

First, let's vectorize some text.

Install the OpenAI API library.

$ pip3 install openai

Version.

$ pip3 freeze | grep openai
openai==1.3.7

Here is the program I wrote.

to_vector.py

import sys
import time
from openai import OpenAI

text = sys.argv[1]

start_time = time.perf_counter()

openai = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy-api-key")

response = openai.embeddings.create(input=text, model="text-embedding-ada-002")

elapsed_time = time.perf_counter() - start_time

print(f"raw response = {response}")

print()

print(f"input text = {text}, to vector = {len(response.data[0].embedding)}")

print()

print(f"elapsed time = {elapsed_time:.3f} sec")

For base_url, specify the OpenAI API-compatible endpoint of llama-cpp-python. The API key can be any value.

openai = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy-api-key")

Then call the Embeddings Create API. I specified text-embedding-ada-002 as the model name, but when targeting a llama-cpp-python server, any name will do.

response = openai.embeddings.create(input=text, model="text-embedding-ada-002")

Then print the vectorized result. The text to vectorize is passed as a command-line argument.

print(f"raw response = {response}")

print()

print(f"input text = {text}, to vector = {len(response.data[0].embedding)}")

Now let's run it.

$ python3 to_vector.py 'Hello World.'

Result.

$ python3 to_vector.py 'Hello World.'
raw response = CreateEmbeddingResponse(data=[Embedding(embedding=[0.34358564019203186, 0.8082753419876099, 1.666818380355835, -1.0591312646865845, 0.06530340015888214, -0.5284493565559387, -0.6435621380805969, 1.137858271598816, 1.3028210401535034, 0.2946239709854126, -0.5300033092498779, -1.1164494752883911, 0.7548168301582336, 0.40373867750167847, 1.1290810108184814, -0.7566999197006226, 0.267472505569458, 0.7608996629714966, -0.017051421105861664, -1.01883065700531, -0.1597931832075119, -1.3573452234268188, 0.8877636790275574, -1.2786635160446167, -0.24713543057441711, 0.6535821557044983, -0.5414071083068848, 

(omitted)

-2.555652379989624, 0.18512584269046783, 0.42112821340560913, -0.04915745556354523, -0.1607542186975479, -1.6608763933181763, -1.181817650794983, 0.655724287033081, -0.15193837881088257, 0.18946832418441772, -0.06836213171482086, -0.19648043811321259, -1.2785874605178833, 1.3186522722244263, 0.26095831394195557, 0.595634400844574, -0.5786678194999695, -1.9923450946807861, 0.5934603810310364, -0.5940259099006653, 0.1100892424583435, 0.6473436951637268, -0.3595812916755676, 0.5893478393554688, 1.695295810699463], index=0, object='embedding')], model='text-embedding-ada-002', object='list', usage=Usage(prompt_tokens=4, total_tokens=4))

input text = Hello World., to vector = 4096

elapsed time = 0.774 sec

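For this text, it takes less than one second.

The vector had 4096 dimensions. In other words, the embedding list in the result contains 4096 elements.
I recall the OpenAI documentation saying 1536, but that figure is for text-embedding-ada-002. The dimension presumably depends on the model that actually serves the request, and 4096 matches the hidden size of Llama 2 7B.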

$ python3 to_vector.py 'this is apple.'
raw response = CreateEmbeddingResponse(data=[Embedding(embedding=[0.11018591374158859, 0.6475743651390076, 2.998861074447632, -0.586685836315155, 0.09722156822681427, -1.347420334815979, 0.048076026141643524, 2.2477073669433594, 0.57712721824646, 0.28412631154060364, -0.2854914367198944, -1.3638404607772827, 1.5588339567184448, 0.07106658816337585, 1.6731715202331543, -0.9277191758155823, 0.5767037272453308, 1.015160083770752, -0.494839608669281, -0.9214868545532227, -1.3904584646224976, -1.822590708732605, 1.106003761291504, -1.147274374961853, -1.0618056058883667, 0.18969042599201202, -1.3648691177368164, 


(omitted)

-1.3505096435546875, -1.383029580116272, -5.094209671020508, -0.881213366985321, 0.28645381331443787, -0.6735928058624268, 0.49416056275367737, -1.0014870166778564, 1.0167360305786133, -0.4009270668029785, 0.032198164612054825, -0.5705236792564392, 1.8743517398834229, -0.6221891045570374, -1.7606291770935059, 0.4385615885257721, -0.2885802686214447, -0.7018359899520874, 0.23984572291374207, 0.07199232280254364, 1.4006679058074951, 1.113827109336853], index=0, object='embedding')], model='text-embedding-ada-002', object='list', usage=Usage(prompt_tokens=5, total_tokens=5))

input text = this is apple., to vector = 4096

elapsed time = 0.934 sec

For now, I understand how to vectorize text.

Trying a search

Finally, let's try a search. As it turns out, this didn't go very well...

The search sample in the documentation is this.

from openai.embeddings_utils import get_embedding, cosine_similarity

def search_reviews(df, product_description, n=3, pprint=True):
   embedding = get_embedding(product_description, model='text-embedding-ada-002')
   df['similarities'] = df.ada_embedding.apply(lambda x: cosine_similarity(x, embedding))
   res = df.sort_values('similarities', ascending=False).head(n)
   return res

res = search_reviews(df, 'delicious beans', n=3)

Embeddings / Use cases

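It computes cosine similarity with cosine_similarity from openai.embeddings_utils, but the current OpenAI Python library no longer has this function.

It appears to have been removed in 1.0.

There is even an issue reporting that the sample doesn't work...

v1.0 drops embeddings_util.py breaking semantic text search · Issue #676 · openai/openai-python · GitHub

The function is defined here, and it is easy to port as long as numpy is available, so that's the approach I took.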

https://github.com/openai/openai-python/blob/v0.28.1/openai/embeddings_utils.py#L65-L66

So, install numpy.

$ pip3 install numpy

Version.

$ pip3 freeze | grep numpy
numpy==1.26.2

Here is the program I wrote.

search.py

import sys
from openai import OpenAI
import numpy as np

## https://github.com/openai/openai-python/blob/v0.28.1/openai/embeddings_utils.py#L65-L66
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

seeds = [
    {"name": "Apple", "feature": "With red flesh and thin skin, it has a balanced taste of mild acidity and sweetness."},
    {"name": "Banana", "feature": "A yellow fruit with a smooth texture and mild sweetness, known for its high nutritional value."},
    {"name": "Grapes", "feature": "Purple in color, these small fruits cluster together with juicy and mildly tangy flavor."},
    {"name": "Melon", "feature": "A green fruit with a refreshing texture and aroma, rich in sweetness and water content."},
    {"name": "Orange", "feature": "Wrapped in an orange peel, it offers a harmonious blend of refreshing acidity and sweetness, rich in vitamin C."},
    {"name": "Strawberry", "feature": "Recognized by its red hue, it carries a distinctive fragrance and a sweet-tart taste, with tiny seeds adding texture."},
    {"name": "Pineapple", "feature": "Featuring yellow flesh, it has a sweet-tangy flavor and a unique texture, accompanied by a rich aroma."},
    {"name": "Mango", "feature": "An orange fruit with a rich aroma and intense sweetness, offering a smooth and luscious flesh."},
    {"name": "Kiwi", "feature": "Green flesh with a balanced combination of acidity and sweetness, enhanced by small black seeds for texture."},
    {"name": "Peach", "feature": "Displaying peach-colored flesh, it is juicy and soft with a sweet aroma, complemented by the peach's beautiful appearance."}
]

openai = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy-api-key")

docs = []

for seed in seeds:
    feature = seed["feature"]
    response = openai.embeddings.create(input=feature, model="text-embedding-ada-002")
    docs.append({"name": seed["name"], "feature": feature, "embedding": response.data[0].embedding})


query = sys.argv[1]
response = openai.embeddings.create(input=query, model="text-embedding-ada-002")
query_embedding = response.data[0].embedding

docs_with_similarity = [{
    "name": d["name"],
    "feature": d["feature"],
    "embedding": d["embedding"],
    "similarity":  cosine_similarity(d["embedding"], query_embedding)
} for d in docs]

sorted_docs = sorted(docs_with_similarity, key=lambda d: d["similarity"], reverse=True)

print(f"query = {query}")
print()

print("ranking:")
for doc in sorted_docs:
    print(f"  name: {doc['name']}")
    print(f"    feature: {doc['feature']}")
    print(f"    similarity: {doc['similarity']}")

This is the function that computes cosine similarity.

## https://github.com/openai/openai-python/blob/v0.28.1/openai/embeddings_utils.py#L65-L66
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
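
To get a feel for the values this function returns, here is a small sketch with hand-made toy vectors (not real embeddings). Vectors pointing the same way should give about 1, orthogonal vectors about 0, and opposite vectors about -1. Since the function divides by both norms, the length of the vectors does not affect the result.

```python
import numpy as np


def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


a = np.array([1.0, 2.0, 3.0])

# A scaled copy points in the same direction, so the length difference is ignored.
same_direction = cosine_similarity(a, 10 * a)
# Perpendicular vectors share no direction at all.
orthogonal = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
# A negated copy points in exactly the opposite direction.
opposite = cosine_similarity(a, -a)

print(same_direction, orthogonal, opposite)
```

Dividing by the norms matters when the vectors are not already normalized to length 1. In that case a plain dot product would not be a cosine similarity.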

Given documents made up of fruit names and features like the following,

seeds = [
    {"name": "Apple", "feature": "With red flesh and thin skin, it has a balanced taste of mild acidity and sweetness."},
    {"name": "Banana", "feature": "A yellow fruit with a smooth texture and mild sweetness, known for its high nutritional value."},
    {"name": "Grapes", "feature": "Purple in color, these small fruits cluster together with juicy and mildly tangy flavor."},
    {"name": "Melon", "feature": "A green fruit with a refreshing texture and aroma, rich in sweetness and water content."},
    {"name": "Orange", "feature": "Wrapped in an orange peel, it offers a harmonious blend of refreshing acidity and sweetness, rich in vitamin C."},
    {"name": "Strawberry", "feature": "Recognized by its red hue, it carries a distinctive fragrance and a sweet-tart taste, with tiny seeds adding texture."},
    {"name": "Pineapple", "feature": "Featuring yellow flesh, it has a sweet-tangy flavor and a unique texture, accompanied by a rich aroma."},
    {"name": "Mango", "feature": "An orange fruit with a rich aroma and intense sweetness, offering a smooth and luscious flesh."},
    {"name": "Kiwi", "feature": "Green flesh with a balanced combination of acidity and sweetness, enhanced by small black seeds for texture."},
    {"name": "Peach", "feature": "Displaying peach-colored flesh, it is juicy and soft with a sweet aroma, complemented by the peach's beautiful appearance."}
]

I vectorize the features.

docs = []

for seed in seeds:
    feature = seed["feature"]
    response = openai.embeddings.create(input=feature, model="text-embedding-ada-002")
    docs.append({"name": seed["name"], "feature": feature, "embedding": response.data[0].embedding})
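
The loop above sends one request per feature. The OpenAI Create embeddings API also accepts a list of strings as input, so the same thing could probably be done in a single call. The following is only a sketch: embed_texts is a hypothetical helper of mine, client is assumed to be an OpenAI instance from openai 1.x, and I have not verified that every OpenAI API-compatible server handles list input the same way.

```python
def embed_texts(client, texts, model="text-embedding-ada-002"):
    """Embed several texts with a single Create embeddings call.

    `client` is assumed to be an openai.OpenAI instance (openai 1.x).
    """
    response = client.embeddings.create(input=texts, model=model)
    # Each returned item carries the index of the input it belongs to,
    # so sort by it to be sure the vectors line up with `texts`.
    items = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in items]
```

With the variables from search.py, the call would look like embed_texts(openai, [seed["feature"] for seed in seeds]).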

The search string is passed as a command-line argument and is vectorized as well.

query = sys.argv[1]
response = openai.embeddings.create(input=query, model="text-embedding-ada-002")
query_embedding = response.data[0].embedding

Then I compute the cosine similarity between the vector of each document (the embedding added above) and the vector of the search string.

docs_with_similarity = [{
    "name": d["name"],
    "feature": d["feature"],
    "embedding": d["embedding"],
    "similarity":  cosine_similarity(d["embedding"], query_embedding)
} for d in docs]

Sort the results in descending order of cosine similarity.

sorted_docs = sorted(docs_with_similarity, key=lambda d: d["similarity"], reverse=True)
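
As an aside, the similarity calculation and the sort can also be done in one go with numpy by stacking the document vectors into a matrix. This is a sketch of an alternative, not what search.py does. The function name rank_by_cosine_similarity and the top_k parameter are my own, and the vectors are 2-dimensional toys rather than real embeddings.

```python
import numpy as np


def rank_by_cosine_similarity(doc_embeddings, query_embedding, top_k=None):
    """Return (index, similarity) pairs, most similar first."""
    docs = np.asarray(doc_embeddings, dtype=float)    # shape: (n_docs, dim)
    query = np.asarray(query_embedding, dtype=float)  # shape: (dim,)
    # Cosine similarity of every document against the query at once.
    similarities = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
    # Negate so that argsort yields descending order; top_k=None keeps everything.
    order = np.argsort(-similarities)[:top_k]
    return [(int(i), float(similarities[i])) for i in order]


# Toy 2-dimensional vectors, not real embeddings.
toy_docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_query = [2.0, 0.1]

for index, similarity in rank_by_cosine_similarity(toy_docs, toy_query, top_k=2):
    print(index, round(similarity, 3))
```

For ten documents the plain list comprehension is more than enough. A brute-force scan like this only becomes a concern with far more vectors, which is where the vector databases mentioned in the documentation come in.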

Print the results.

print(f"query = {query}")
print()

print("ranking:")
for doc in sorted_docs:
    print(f"  name: {doc['name']}")
    print(f"    feature: {doc['feature']}")
    print(f"    similarity: {doc['similarity']}")

Let's try it.

$ python3 search.py green

The result was somewhat underwhelming...

query = green

ranking:
  name: Kiwi
    feature: Green flesh with a balanced combination of acidity and sweetness, enhanced by small black seeds for texture.
    similarity: 0.04215274492006107
  name: Banana
    feature: A yellow fruit with a smooth texture and mild sweetness, known for its high nutritional value.
    similarity: 0.0344139587395739
  name: Melon
    feature: A green fruit with a refreshing texture and aroma, rich in sweetness and water content.
    similarity: 0.025705263210870112
  name: Peach
    feature: Displaying peach-colored flesh, it is juicy and soft with a sweet aroma, complemented by the peach's beautiful appearance.
    similarity: 0.024638075297955937
  name: Mango
    feature: An orange fruit with a rich aroma and intense sweetness, offering a smooth and luscious flesh.
    similarity: 0.014813710044700355
  name: Apple
    feature: With red flesh and thin skin, it has a balanced taste of mild acidity and sweetness.
    similarity: 0.014664657619712652
  name: Pineapple
    feature: Featuring yellow flesh, it has a sweet-tangy flavor and a unique texture, accompanied by a rich aroma.
    similarity: 0.010742138593120263
  name: Grapes
    feature: Purple in color, these small fruits cluster together with juicy and mildly tangy flavor.
    similarity: 0.008852824997386993
  name: Orange
    feature: Wrapped in an orange peel, it offers a harmonious blend of refreshing acidity and sweetness, rich in vitamin C.
    similarity: -0.015618559051309908
  name: Strawberry
    feature: Recognized by its red hue, it carries a distinctive fragrance and a sweet-tart taste, with tiny seeds adding texture.
    similarity: -0.04353728373064396

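It feels close, yet also a bit off.
I tried other words and also Japanese, but the results were underwhelming every time...
Note: or rather, I suspect I chose the wrong model.

For now I understand how to use it, so I'll stop here for this time.

Conclusion

I tried vectorizing text with an OpenAI API-compatible server started with llama-cpp-python.

I think I've roughly grasped how to use the API and what the terms mean, but the results were a bit underwhelming.

Well, what I'm using isn't OpenAI itself, so I'll leave it at that for now.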