https://touch-sp.hatenablog.com/entry/2025/02/07/124149

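Introduction

Running Text Generation Inference (TGI) in Docker is the recommended setup.

This time I installed Docker inside WSL2 and ran Text Generation Inference there as the server.

On the client side I used Python on Windows.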

WSL2 (server side) setup

Ubuntu 24.04 on WSL2

Initial check

If the following lines are not already in /etc/wsl.conf, you need to add them.

[boot]
systemd=true

To edit the file:

sudo nano /etc/wsl.conf
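
If you would rather check the setting without opening an editor, something like the following might do. This is only a minimal sketch and not part of the steps I actually ran: the function name systemd_enabled is my own (hypothetical), and it assumes /etc/wsl.conf can be parsed as an INI-style file.

```python
import configparser
from pathlib import Path


def systemd_enabled(conf_text: str) -> bool:
    """Return True if the text sets systemd=true in the [boot] section."""
    # strict=False tolerates duplicate sections/keys; interpolation=None keeps '%' literal.
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    try:
        parser.read_string(conf_text)
    except configparser.Error:
        # Unparseable content is treated as "not enabled".
        return False
    value = parser.get("boot", "systemd", fallback="false")
    return value.strip().lower() == "true"


if __name__ == "__main__":
    path = Path("/etc/wsl.conf")
    text = path.read_text() if path.is_file() else ""
    print("systemd enabled:", systemd_enabled(text))
```

If it prints False, add the two lines shown above with the editor command.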

Installing Docker Engine and Docker Compose

I followed the instructions here.

After the installation I ran the following. (Replace your_username with your own user name.)

sudo gpasswd -a your_username docker

Installing the NVIDIA Container Toolkit

I followed the instructions here.

At this point I restarted WSL2. (I do not know whether this is actually required.)

Running Text Generation Inference

model=teknium/OpenHermes-2.5-Mistral-7B
# share a volume with the Docker container to avoid downloading weights every run
volume=/home/hoge/data

docker run --gpus all --shm-size 64g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:3.1.0 \
    --model-id $model
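
The container needs some time before it accepts requests, especially when the weights still have to be downloaded. As a rough sketch, readiness could be checked by polling TGI's /health route. The function name wait_until_ready and the timeout values are my own assumptions, and the URL in the comment assumes the -p 8080:80 mapping used above and that the script runs inside WSL2.

```python
import time
import urllib.request


def wait_until_ready(url: str, timeout: float = 600.0, interval: float = 5.0) -> bool:
    """Poll url until it answers with HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                if response.status == 200:
                    return True
        except OSError:
            # Connection refused, timeout, or HTTP error status: not ready yet.
            pass
        time.sleep(interval)
    return False


# Example (assumed URL, matching the port mapping above):
#   wait_until_ready("http://localhost:8080/health")
```

The function returns False instead of raising, so a calling script can decide for itself whether to keep waiting or give up.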

Running inference from the client

You can find the WSL2 IP address with the following command.

ip addr show

It appears that the WSL2 IP address is the one to use, not the Docker container's IP address.
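
Picking the address out of the command output by eye works, but it could also be scripted. The following is a minimal sketch under a few assumptions: the helper name first_ipv4 is hypothetical, and eth0 is assumed to be the WSL2 network interface, which may differ on your machine.

```python
import re
from typing import Optional

# Matches interface header lines such as "2: eth0: <BROADCAST,...>".
HEADER = re.compile(r"^\d+:\s+([^:@\s]+)[:@]")
# Matches IPv4 lines such as "    inet 10.0.0.5/24 ..." (inet6 lines do not match).
INET = re.compile(r"^\s+inet\s+(\d{1,3}(?:\.\d{1,3}){3})/")


def first_ipv4(ip_addr_output: str, interface: str = "eth0") -> Optional[str]:
    """Return the first IPv4 address listed for the given interface, or None."""
    current = None
    for line in ip_addr_output.splitlines():
        header = HEADER.match(line)
        if header:
            current = header.group(1)
            continue
        inet = INET.match(line)
        if inet and current == interface:
            return inet.group(1)
    return None


# Example (run inside WSL2; 8080 is the host port from the docker command):
#   import subprocess
#   output = subprocess.run(["ip", "addr", "show"], capture_output=True, text=True).stdout
#   print(f"http://{first_ipv4(output)}:8080/v1/")
```

The printed URL is what goes into base_url on the Windows side.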

Python script

from huggingface_hub import InferenceClient

client = InferenceClient(
    base_url="http://172.11.18.1:8080/v1/",
)

output = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "how to make dictionary data from 2 lists in python"},
    ],
    stream=True,
    max_tokens=1024,
)

for chunk in output:
    # delta.content may be None (for example on the final chunk), so fall back to "".
    print(chunk.choices[0].delta.content or "", end="")
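
If you want the whole answer as a single string rather than printing it piece by piece, the streamed chunks could be collected as sketched below. The helper names collect_stream and fake_chunk are hypothetical, and the stand-in chunks only imitate the chunk.choices[0].delta.content attribute path used in the script; they are not real server output.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream(chunks: Iterable) -> str:
    """Join the text deltas of a streamed chat completion into one string."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue
        content = chunk.choices[0].delta.content
        if content:  # skip None and empty deltas
            parts.append(content)
    return "".join(parts)


def fake_chunk(text):
    """Build a stand-in object with the same attribute path as a streamed chunk."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo = [fake_chunk("dict("), fake_chunk("zip(keys, values))"), fake_chunk(None)]
print(collect_stream(demo))  # prints: dict(zip(keys, values))
```

With the real client, the output object from the script would be passed in place of demo.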
