https://nikkie-ftnext.hatenablog.com/entry/simonw-llm-whisper-api-add-support-gpt-4o-transcribe

Introduction

Yuriko Nanao, happy birthday plus 54 days! This is nikkie.

I extended a plugin for simonw/llm, mainly for my own use.¹

uvx --with git+https://github.com/ftnext/simonw-llm-whisper-api.git@support-other-transcribe-models llm whisper-api -m gpt-4o-transcribe --key $OPENAI_API_KEY audio.wav

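This post records what I learned in the process (it also doubles as preparation for explaining the change in the pull request I plan to open afterwards).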

OpenAI's speech-to-text models

OpenAI currently offers three speech-to-text models.
https://platform.openai.com/docs/api-reference/audio/createTranscription
Originally there was only whisper-1; gpt-4o-transcribe and gpt-4o-mini-transcribe were added later.

Among the plugins for simonw's llm, a tool I have been drawn to, there is one that calls OpenAI's Whisper.
https://pypi.org/project/llm-whisper-api/

From my experience maintaining SpeechRecognition (the blog post above), I figured that support for gpt-4o-transcribe and gpt-4o-mini-transcribe would be easy to add.

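I had consulted Gemini beforehand², and it gave me this plan: "Extract the model name into a variable. Change the function declaration (add a parameter), and pass the model name in from the caller."
https://github.com/simonw/llm-whisper-api/blob/0.1.1/llm_whisper_api.py#L56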

-def transcribe(audio_content: bytes, api_key: str) -> str:
+def transcribe(audio_content: bytes, api_key: str, model: str) -> str:
    # (omitted)
-    data = {"model": "whisper-1", "response_format": "text"}
+    data = {"model": model, "response_format": "text"}

Once I actually tried it, I ran into several lessons I had not seen coming.

Constraints of gpt-4o-transcribe and gpt-4o-mini-transcribe

Calls to Create transcription are not as flexible with these models as with whisper-1 (this is only true as of writing, and I hope it improves in the future).

First, response_format.

For gpt-4o-transcribe and gpt-4o-mini-transcribe, the only supported format is json.

For whisper-1 the plugin specified "text"; I changed this to "json".
As a result, the API's return value has to be handled as JSON.
https://github.com/simonw/llm-whisper-api/blob/0.1.1/llm_whisper_api.py#L61

-        return response.text.strip()
+        return response.json()["text"].strip()

The changes to the tests are described later.
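To make the response_format change concrete, here is a minimal sketch of the two pieces involved: building the form fields and reading the transcript out of a JSON body. The names TRANSCRIBE_MODELS, build_form_data and extract_text are hypothetical helpers for illustration and do not appear in the plugin; the sketch assumes "json" is requested for every model, including whisper-1.

```python
import json

# Model names mentioned in this post; the tuple itself is a hypothetical helper.
TRANSCRIBE_MODELS = ("whisper-1", "gpt-4o-transcribe", "gpt-4o-mini-transcribe")


def build_form_data(model: str) -> dict[str, str]:
    """Build the form fields for Create transcription.

    Assumption: "json" is requested for every model, so one code path
    covers whisper-1 as well as the gpt-4o-* models.
    """
    if model not in TRANSCRIBE_MODELS:
        raise ValueError(f"unsupported model: {model}")
    return {"model": model, "response_format": "json"}


def extract_text(response_body: str) -> str:
    """Read the transcript from a JSON body shaped like {"text": "..."}."""
    return json.loads(response_body)["text"].strip()
```

Requesting "json" everywhere keeps the parsing logic identical across models, at the cost of one extra decoding step for whisper-1.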

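The other point is the extension of the file name sent to the API.
With whisper-1, the plugin names every upload with .mp3 regardless of the actual file, and that works.
ref: https://github.com/simonw/llm-whisper-api/blob/0.1.1/llm_whisper_api.py#L53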

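With gpt-4o-transcribe and gpt-4o-mini-transcribe, however, passing audio.wav while giving the uploaded file name an .mp3 extension results in 400 Bad Request.
The extension had to match the actual file.
whisper-1 keeps working when the extensions match.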
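One way to keep the extension aligned is to derive the upload name from the input path instead of hard-coding .mp3. This is only a sketch under assumptions: upload_filename is a hypothetical helper, and falling back to .mp3 when the path has no extension is my own choice, not something the plugin or the API prescribes.

```python
from pathlib import Path


def upload_filename(audio_path: str) -> str:
    """Return a file name for the multipart upload that keeps the real extension.

    Hypothetical helper: the base name "audio" is arbitrary; only the suffix
    is taken from the input path.
    """
    suffix = Path(audio_path).suffix.lower()
    # Assumption: fall back to .mp3 when the input path has no extension.
    return f"audio{suffix}" if suffix else "audio.mp3"
```

The returned name would then be used as the file name in the multipart upload, for example as the first element of the tuple passed in HTTPX's files argument.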

Tests in simonw/llm-whisper-api

The tests use Click's testing support to exercise the command llm whisper-api audio.mp3 --key x.
https://github.com/simonw/llm-whisper-api/blob/0.1.1/tests/test_whisper_api.py#L16

simonw does not use the SDK; the implementation calls the OpenAI API directly with HTTPX.
So that Create transcription is not actually requested on every test run, the calls are mocked with pytest-httpx.
https://github.com/simonw/llm-whisper-api/blob/0.1.1/tests/test_whisper_api.py#L7-L12

https://pypi.org/project/pytest-httpx/

I have always been a RESPX user, so this was the first I had heard of it.

Because response_format is now "json", I changed the mock so that it returns JSON.

httpx_mock.add_response(
    # (omitted)
-    text=expected_text,
+    json={"text": expected_text},
)

Conclusion

I went in thinking "adding gpt-4o-transcribe (and gpt-4o-mini-transcribe) support to simonw/llm-whisper-api will be easy!", and ended up learning more than I expected.

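With Create transcription, gpt-4o-transcribe (and its sibling) is more constrained than whisper-1
- response_format: JSON only
- The file name extension must match the actual audio file
HTTPX can also be mocked with pytest-httpx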

The full diff is here.

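¹ I wanted to transcribe a NotebookLM Audio Overview.
² I tried it with DeepWiki-Open:
Awesome!! 👏👏

It apparently fetches the specified repository, turns it into embeddings (with an OpenAI model),
and then has Gemini generate the wiki and answer questions RAG-style.
I'm glad the wiki can be exported 😃 https://t.co/8NpRItzuHg
nikkie(にっきー) / にっP (@ftnext), May 6, 2025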

Trying to add support for OpenAI's new speech-to-text models to simonw/llm-whisper-api
