Introduction
The OS is Ubuntu 24.04. For how to set up vLLM, see: touch-sp.hateblo.jp
Quantization and Serving
Quantization did not work as-is, just like when using vLLM with CUDA. It worked once I used the `api_server.py` that IPEX-LLM provides.

```
git clone https://github.com/intel/ipex-llm
cd ipex-llm/python/llm/src/ipex_llm/vllm/xpu/entrypoints/openai
python -m api_server \
    --served-model-name DeepSeek-R1-Distill-Qwen-14B-Japanese \
    --model cyberagent/DeepSeek-R1-Distill-Qwen-14B-Japanese \
    --enforce-eager \
    --load-in-low-bit sym_int4 \
    --max-model-len 4096
```
Chatting with Gradio
For the client side, I used Gradio.

```python
from openai import OpenAI
import gradio as gr

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

system_prompt_text = "あなたは誠実で優秀な日本人のアシスタントです。"

init = {
    "role": "system",
    "content": system_prompt_text,
}


def make_message(message: str, history: list[dict]):
    # Prepend the system prompt on the first turn only
    if len(history) == 0:
        history.insert(0, init)
    history.append({"role": "user", "content": message})
    return "", history


def bot(history: list[dict]):
    stream = client.chat.completions.create(
        model="DeepSeek-R1-Distill-Qwen-14B-Japanese",
        messages=history,
        temperature=0.5,
        max_tokens=2048,
        # frequency_penalty=1.1,
        stream=True,
    )
    history.append({"role": "assistant", "content": ""})
    for chunk in stream:
        # Guard against chunks with no content (e.g. the role-only first chunk)
        if chunk.choices[0].delta.content is not None:
            history[-1]["content"] += chunk.choices[0].delta.content
        yield history


with gr.Blocks() as demo:
    chatbot = gr.Chatbot(type="messages", allow_tags=["think"])
    msg = gr.Textbox()
    clear = gr.ClearButton([msg, chatbot], value="新しいチャットを開始")
    msg.submit(make_message, [msg, chatbot], [msg, chatbot], queue=False).then(
        bot, chatbot, chatbot
    )

demo.launch()
```
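The Gradio client above relies on `allow_tags=["think"]` to render the model's reasoning block. If you instead want to separate the reasoning from the final answer programmatically, a small sketch like the following could work. It assumes the DeepSeek-R1 convention of a single `<think>...</think>` block at the start of the output; the `split_think` helper is hypothetical, not part of any library.

```python
import re


def split_think(text: str) -> tuple[str, str]:
    """Split a DeepSeek-R1 style response into (reasoning, answer).

    Assumes the model wraps its chain of thought in one leading
    <think>...</think> block. If no block is found, the whole
    text is treated as the answer.
    """
    match = re.match(r"\s*<think>(.*?)</think>\s*", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer


# Example with a made-up response string
raw = "<think>The user greeted me.</think>Hello! How can I help?"
reasoning, answer = split_think(raw)
```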
Screenshot

Supplementary Notes
sym_int4
```
2025-02-25 20:34:13,457 - INFO - Loading model weights took 8.4595 GB
WARNING 02-25 20:34:26 xpu.py:97] Pin memory is not supported on XPU.
INFO 02-25 20:34:27 gpu_executor.py:76] # GPU blocks: 1882, # CPU blocks: 2730
INFO 02-25 20:34:27 gpu_executor.py:80] Maximum concurrency for 4096 tokens per request: 3.68x
```
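The "Maximum concurrency" figure in this log can be reproduced from the GPU block count. Note the block size of 8 tokens below is an inference from these numbers, not a documented XPU value (vLLM's default on CUDA is 16):

```python
# Numbers taken from the sym_int4 server log above.
gpu_blocks = 1882      # "# GPU blocks: 1882"
block_size = 8         # assumed KV-cache tokens per block (inferred from the log)
max_model_len = 4096   # --max-model-len

kv_cache_tokens = gpu_blocks * block_size      # total tokens the KV cache can hold
concurrency = kv_cache_tokens / max_model_len  # full-length requests servable at once

print(f"{concurrency:.2f}x")
```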
sym_int8
It failed to run.

```
2025-02-25 20:30:45,909 - INFO - Loading model weights took 15.4058 GB
2025-02-25 20:30:46,335 - ERROR - XPU out of memory. Tried to allocate 92.00 MiB. GPU 0 has a total capacity of 15.11 GiB. Of the allocated memory 15.67 GiB is allocated by PyTorch, and 162.40 MiB is reserved by PyTorch but unallocated. Please use `empty_cache` to release all unoccupied cached memory.
```
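The failure is consistent with simple arithmetic: the sym_int8 weights alone take 15.4058 GB, which nearly exhausts the card's 15.11 GiB (about 16.2 GB), leaving essentially nothing for the KV cache. A rough check, taking the logged units at face value:

```python
GIB = 1024 ** 3  # gibibyte
GB = 1000 ** 3   # gigabyte

total_bytes = 15.11 * GIB    # "GPU 0 has a total capacity of 15.11 GiB"
weight_bytes = 15.4058 * GB  # "Loading model weights took 15.4058 GB"

remaining_gb = (total_bytes - weight_bytes) / GB
print(f"{remaining_gb:.2f} GB left for KV cache and activations")
```

Under a gigabyte of headroom is not enough for vLLM to allocate a usable KV cache, hence the out-of-memory error during startup.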