https://touch-sp.hateblo.jp/entry/2025/02/24/212031

Introduction

The OS is Ubuntu 24.04.

For how to set up vLLM, see my earlier post:
touch-sp.hateblo.jp

Quantization and Serving

Quantizing the model the same way as when using vLLM with CUDA did not work.

It worked once I used the "api_server.py" that IPEX-LLM provides.

git clone https://github.com/intel/ipex-llm
cd ipex-llm/python/llm/src/ipex_llm/vllm/xpu/entrypoints/openai

python -m api_server \
  --served-model-name DeepSeek-R1-Distill-Qwen-14B-Japanese \
  --model cyberagent/DeepSeek-R1-Distill-Qwen-14B-Japanese \
  --enforce-eager \
  --load-in-low-bit sym_int4 \
  --max-model-len 4096
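Before wiring up a UI, it can help to confirm that the server answers at all. The following is a minimal sketch, assuming the server listens on http://localhost:8000/v1 (the base URL the Gradio client uses) and exposes the usual OpenAI-compatible /chat/completions route. The helper names build_chat_request and smoke_test are hypothetical, introduced only for this illustration.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: same base URL as the Gradio client
MODEL = "DeepSeek-R1-Distill-Qwen-14B-Japanese"  # value passed to --served-model-name


def build_chat_request(prompt: str, max_tokens: int = 64) -> urllib.request.Request:
    """Build (but do not send) a non-streaming chat completion request."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def smoke_test(prompt: str = "こんにちは") -> str:
    """Send one request to the running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling smoke_test() while the server is running should return a short reply; a connection error means the server is not up yet or is listening on a different port.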

Chatting with Gradio

I used Gradio for the client side.

from openai import OpenAI
import gradio as gr

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# System prompt, in Japanese: "You are a sincere and excellent Japanese assistant."
system_prompt_text = "あなたは誠実で優秀な日本人のアシスタントです。"
init = {
    "role": "system",
    "content": system_prompt_text,
}

def make_message(
    message: str,
    history: list[dict]
):    
    if len(history) == 0:
        history.insert(0, init)
    history.append(
        {
            "role": "user", 
            "content": message
        }
    )
    return "", history

def bot(
    history: list[dict]
):
    stream = client.chat.completions.create(
        model="DeepSeek-R1-Distill-Qwen-14B-Japanese",
        messages=history,
        temperature=0.5,
        max_tokens=2048,
        #frequency_penalty=1.1,
        stream=True
    )
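    history.append({"role": "assistant", "content": ""})
    for chunk in stream:
        # delta.content can be None (e.g. on a role-only or final chunk),
        # so fall back to an empty string instead of raising a TypeError.
        history[-1]["content"] += chunk.choices[0].delta.content or ""
        yield history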
    
with gr.Blocks() as demo:
    chatbot = gr.Chatbot(type="messages", allow_tags=["think"])
    msg = gr.Textbox()
    # Button label, in Japanese: "Start a new chat"
    clear = gr.ClearButton([msg, chatbot], value="新しいチャットを開始")

    msg.submit(make_message, [msg, chatbot], [msg, chatbot], queue=False).then(
        bot, chatbot, chatbot
    )

demo.launch()
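The chatbot above is created with allow_tags=["think"], which suggests the model wraps its reasoning in think tags. If you want to show or log the reasoning separately from the final answer, something like the following could work. It is only a sketch: the function name split_reasoning is hypothetical, and it assumes the output has the shape "<think>reasoning</think>answer", possibly without the opening tag.

```python
def split_reasoning(text: str) -> tuple[str, str]:
    """Split model output into (reasoning, answer).

    Assumes the reasoning, if any, ends at the first "</think>".
    """
    open_tag, close_tag = "<think>", "</think>"
    head, sep, tail = text.partition(close_tag)
    if not sep:
        # No closing tag found: treat the whole text as the answer.
        return "", text.strip()
    # Drop the opening tag if the model emitted one.
    reasoning = head.replace(open_tag, "", 1).strip()
    return reasoning, tail.strip()
```

Note that while a response is still streaming, the closing tag may not have arrived yet, in which case this sketch returns the partial text as the answer.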

Execution screen

Supplementary notes

sym_int4

2025-02-25 20:34:13,457 - INFO - Loading model weights took 8.4595 GB
WARNING 02-25 20:34:26 xpu.py:97] Pin memory is not supported on XPU.
INFO 02-25 20:34:27 gpu_executor.py:76] # GPU blocks: 1882, # CPU blocks: 2730
INFO 02-25 20:34:27 gpu_executor.py:80] Maximum concurrency for 4096 tokens per request: 3.68x
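The "Maximum concurrency" figure can be reproduced from the number of GPU blocks. The sketch below assumes each KV-cache block holds 8 tokens; that block size is not stated in the log and is inferred here only because it matches the reported 3.68x. The function name max_concurrency is hypothetical.

```python
def max_concurrency(num_gpu_blocks: int, block_size: int, max_model_len: int) -> float:
    """Number of full-length requests whose KV cache fits in the GPU blocks."""
    return num_gpu_blocks * block_size / max_model_len


# 1882 GPU blocks and 4096 tokens come from the log above;
# block_size=8 is an assumption inferred from the reported value.
print(f"{max_concurrency(1882, 8, 4096):.2f}x")  # prints 3.68x
```

In other words, under that assumption the cache holds 1882 × 8 = 15056 tokens, enough for a little under four requests of 4096 tokens each.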

sym_int8

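This did not run. The weights alone took 15.4058 GB against a total GPU capacity of 15.11 GiB, so loading ended in an out-of-memory error.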

2025-02-25 20:30:45,909 - INFO - Loading model weights took 15.4058 GB
2025-02-25 20:30:46,335 - ERROR - XPU out of memory. Tried to allocate 92.00 MiB. GPU 0 has a total capacity of 15.11 GiB. Of the allocated memory 15.67 GiB is allocated by PyTorch, and 162.40 MiB is reserved by PyTorch but unallocated. Please use `empty_cache` to release all unoccupied cached memory.
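The two logs can be compared with simple arithmetic. This is a rough sketch, assuming the "GB" in the weight-loading line and the "GiB" in the error message are the same unit; the function name headroom_gib is hypothetical.

```python
def headroom_gib(weights_gib: float, capacity_gib: float) -> float:
    """Memory left on the GPU after loading the weights (negative = does not fit)."""
    return capacity_gib - weights_gib


CAPACITY_GIB = 15.11  # total capacity reported in the out-of-memory error

# Weight sizes taken from the two "Loading model weights took ..." log lines.
for name, weights in {"sym_int4": 8.4595, "sym_int8": 15.4058}.items():
    left = headroom_gib(weights, CAPACITY_GIB)
    status = "fits" if left > 0 else "does not fit"
    print(f"{name}: {left:+.2f} GiB after weights ({status})")
```

With sym_int4, roughly 6.65 GiB remains for the KV cache and activations; with sym_int8, the weights already exceed the capacity by about 0.3 GiB.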