https://touch-sp.hatenablog.com/entry/2026/01/30/173536

Introduction

VibeVoice-ASR is a model that transcribes speech to text.

It is reported to outperform Whisper, the current de facto standard.

PC environment

Windows 11

Demo screen (Gradio demo)

The audio sample is the "male narration, doctor role in a medical web drama" clip, which I downloaded from here.

Running the demo

For environment setup, see the procedure described later in this post.

I use uv.

Downloading the script

These commands assume PowerShell.

# Raw URL of the file to download
$url = "https://raw.githubusercontent.com/microsoft/VibeVoice/main/demo/vibevoice_asr_gradio_demo.py"

# File name to save as
$output = "vibevoice_asr_gradio_demo.py"

# Run the download
Invoke-WebRequest -Uri $url -OutFile $output
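
If you are not on PowerShell, the same download can be sketched with Python's standard library. This is only an illustrative equivalent and not part of the original procedure; the helper name `download_file` is hypothetical.

```python
import urllib.request
from pathlib import Path

# Raw URL of the demo script (same URL as in the PowerShell snippet above)
URL = "https://raw.githubusercontent.com/microsoft/VibeVoice/main/demo/vibevoice_asr_gradio_demo.py"


def download_file(url: str, output: str) -> Path:
    """Hypothetical helper: fetch `url` and save the bytes as `output`."""
    with urllib.request.urlopen(url) as response:
        data = response.read()
    path = Path(output)
    path.write_bytes(data)
    return path


# Usage (performs a network request):
#   download_file(URL, "vibevoice_asr_gradio_demo.py")
```

The file is written to the current directory, which should be the uv project directory so that `uv run` can find it.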

Launch

uv run vibevoice_asr_gradio_demo.py --host localhost

Environment setup

CUDA 12.6

Python 3.13 + Torch 2.9.1

liger-kernel could not be listed in pyproject.toml, so I installed it separately.

[project]
name = "vibevoice-asr"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = "==3.13.*"
dependencies = [
    "flash-attn",
    "hf-xet==1.2.0",
    "torch==2.9.1+cu126",
    "triton-windows==3.6.0.post25",
    "vibevoice @ git+https://github.com/microsoft/VibeVoice"
]

[[tool.uv.index]]
name = "torch-cuda"
url = "https://download.pytorch.org/whl/cu126"
explicit = true

[tool.uv.sources]
torch = [{ index = "torch-cuda" }]
flash-attn = { path = "flash_attn-2.8.3+cu126torch2.9.1cxx11abiTRUE-cp313-cp313-win_amd64.whl" }

uv sync
uv pip install liger-kernel==0.6.4 --no-deps
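
After these two commands, it can help to confirm what actually landed in the environment. The following is a minimal sketch, assuming the distribution names match those in the pyproject.toml and install command above; the function names are hypothetical, and it could be saved under any name (for example `check_env.py`) and started with `uv run`.

```python
import importlib.util
from importlib import metadata

# Distribution names taken from the pyproject.toml and install command above
PACKAGES = ["torch", "flash-attn", "triton-windows", "liger-kernel", "vibevoice"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def cuda_available():
    """Return torch.cuda.is_available(), or None when torch is not installed."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    return torch.cuda.is_available()


for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
print(f"CUDA available: {cuda_available()}")
```

A `NOT INSTALLED` entry for liger-kernel usually means the separate `uv pip install` step was skipped.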

I built flash-attention using this method.

The built wheel file is published here.
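
A prebuilt wheel only installs on the interpreter and platform named in its filename, so it is worth checking those tags before pointing `[tool.uv.sources]` at the file. The sketch below splits the filename on the standard wheel naming scheme; `parse_wheel_filename` and `matches_running_cpython` are hypothetical helpers written for illustration, and the `packaging` library provides a more complete parser if you prefer a dependency.

```python
import sys
from typing import NamedTuple


class WheelTags(NamedTuple):
    name: str
    version: str
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename: str) -> WheelTags:
    """Split a wheel filename into its components (an optional build tag is ignored)."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file: {filename}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel filename: {filename}")
    # The last three fields are always python tag, ABI tag, platform tag
    return WheelTags(parts[0], parts[1], parts[-3], parts[-2], parts[-1])


def matches_running_cpython(tags: WheelTags) -> bool:
    """Rough check: is the python tag cp<major><minor> of this interpreter?"""
    expected = f"cp{sys.version_info.major}{sys.version_info.minor}"
    return tags.python_tag == expected


wheel = "flash_attn-2.8.3+cu126torch2.9.1cxx11abiTRUE-cp313-cp313-win_amd64.whl"
tags = parse_wheel_filename(wheel)
print(tags)
print("Matches this interpreter:", matches_running_cpython(tags))
```

The local version part (`+cu126torch2.9.1...`) is not validated here; it should agree with the CUDA and Torch versions pinned in pyproject.toml.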

For reference, here is the vibevoice version I have confirmed to work.

If you cannot reproduce the setup, pin to this commit.

vibevoice @ git+https://github.com/microsoft/VibeVoice@b2aee8015c3c2d97c388346ebcfffdaf2f427f7d

Python 3.14 + Torch 2.10.0

[project]
name = "vibevoice-asr"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = "==3.14.*"
dependencies = [
    "flash-attn",
    "hf-xet==1.2.0",
    "torch==2.10.0+cu126",
    "triton-windows==3.6.0.post25",
    "vibevoice @ git+https://github.com/microsoft/VibeVoice"
]

[[tool.uv.index]]
name = "torch-cuda"
url = "https://download.pytorch.org/whl/cu126"
explicit = true

[tool.uv.sources]
torch = [{ index = "torch-cuda" }]
flash-attn = { path = "flash_attn-2.8.3+cu126torch2.10.0cxx11abiTRUE-cp314-cp314-win_amd64.whl" }

uv sync
uv pip install liger-kernel==0.6.4 --no-deps

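I built flash-attention using this method.

The built wheel file is published here.

For reference, here is the vibevoice version I have confirmed to work.

If you cannot reproduce the setup, pin to this commit.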

vibevoice @ git+https://github.com/microsoft/VibeVoice@b2aee8015c3c2d97c388346ebcfffdaf2f427f7d
