https://touch-sp.hatenablog.com/entry/2026/01/31/162621

PC environment

Windows 11

Setting up the Python environment

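I use uv. The pyproject.toml is included below, so running uv sync is all it takes to build the environment.

Note, however, that I built flash-attention beforehand using the method described here.

The script also runs without flash-attention.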

[project]
name = "deepseekocr"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.12"
dependencies = [
    "addict==2.4.0",
    "easydict==1.13",
    "einops==0.8.2",
    "flash-attn",
    "hf-xet==1.2.0",
    "tokenizers==0.20.3",
    "torch==2.6.0+cu126",
    "torchvision==0.21.0+cu126",
    "transformers==4.46.3"
]

[[tool.uv.index]]
name = "torch-cuda"
url = "https://download.pytorch.org/whl/cu126"
explicit = true

[tool.uv.sources]
torch = [{ index = "torch-cuda" }]
torchvision = [{ index = "torch-cuda" }]
flash-attn = { path = "flash_attn-2.7.3+cu126torch2.6.0cxx11abiFALSE-cp312-cp312-win_amd64.whl" }
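After uv sync, a quick check that the installed versions match the pins can save some debugging. The following is a minimal sketch using only the standard library; the helper name check_pins and the PINNED dictionary are my own illustration (the versions are copied from the pyproject.toml above), not part of the original setup.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the pyproject.toml above (a subset, for illustration).
PINNED = {
    "torch": "2.6.0+cu126",
    "torchvision": "0.21.0+cu126",
    "transformers": "4.46.3",
    "tokenizers": "0.20.3",
}


def check_pins(pins: dict[str, str]) -> dict[str, str]:
    """Return a status per package: 'ok', 'missing', or 'mismatch (found X)'."""
    report = {}
    for name, wanted in pins.items():
        try:
            found = version(name)  # version string from the installed metadata
        except PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if found == wanted else f"mismatch (found {found})"
    return report


if __name__ == "__main__":
    for name, status in check_pins(PINNED).items():
        print(f"{name}: {status}")
```

Saved under a hypothetical name such as check_env.py, it can be run inside the project with uv run check_env.py.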

Results

Input image

I used a blog post as the input.

Output (result)

The result is returned as Markdown.

# スクリプトのダウンロード

PowerShellを想定しています。

# ダウンロードするファイルのRaw URL
$url =
"https://raw.githubusercontent.com/microsoft/Vehicle/main/demo/vehiclecore_asm_gradio_demo.py"

# 保存するファイル名
$output = "vehiclecore_asm_gradio_demo.py"

# ダウンロード実行
Invoke-WebRequest -Uri $url -OutFile $output

# 実行

uv run vehiclecore_asm_gradio_demo.py --host localhost

The code blocks were not recognized as code blocks.

The word "vibevoice_asr_gradio_demo.py" was not recognized correctly either.

It did work well when I used the LightOnOCR-2-1B model, though.
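Misreadings like this can be checked mechanically instead of by eye. Below is a minimal sketch, assuming you know which identifiers should appear in the page; the helper names missing_terms and similarity are hypothetical, and the sample strings are taken from the output above.

```python
from difflib import SequenceMatcher


def missing_terms(ocr_text: str, expected_terms: list[str]) -> list[str]:
    """Return the expected terms that do not appear verbatim in the OCR text."""
    return [term for term in expected_terms if term not in ocr_text]


def similarity(a: str, b: str) -> float:
    """Character-level similarity between 0.0 and 1.0 (difflib's ratio)."""
    return SequenceMatcher(None, a, b).ratio()


# One line from the OCR output above, and the file name it should have contained.
ocr_line = "uv run vehiclecore_asm_gradio_demo.py --host localhost"
expected = "vibevoice_asr_gradio_demo.py"

print(missing_terms(ocr_line, [expected, "--host"]))
print(round(similarity(expected, "vehiclecore_asm_gradio_demo.py"), 2))
```

The first call lists the identifiers that were lost, and the ratio gives a rough sense of how far off the misreading is. This can be handy when comparing several OCR models on the same page.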


Python script

If you do not have flash-attention, comment out the following line.

_attn_implementation='flash_attention_2'

from transformers import AutoModel, AutoTokenizer
import torch

model_name = 'deepseek-ai/DeepSeek-OCR-2'

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    use_safetensors=True,
    torch_dtype=torch.bfloat16,
    _attn_implementation='flash_attention_2'
)

model.to("cuda")

prompt = "<image>\nFree OCR. "
#prompt = "<image>\n<|grounding|>Convert the document to markdown. "
image_file = '1.jpg'
output_path = 'results'


res = model.infer(
    tokenizer,
    prompt=prompt,
    image_file=image_file,
    output_path = output_path,
    base_size = 1024,
    image_size = 768,
    crop_mode=True,
    save_results=True
)
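Instead of commenting the line out by hand, the script could decide at run time whether to request flash-attention. This is only a sketch of that idea, not what the script above does: attention_kwargs is a hypothetical helper, and it only checks that a flash_attn package can be found, not that it imports cleanly on your GPU and driver.

```python
import importlib.util


def attention_kwargs(module_name: str = "flash_attn") -> dict:
    """Build the extra keyword arguments for from_pretrained.

    Returns the flash-attention setting when the package is installed, and an
    empty dict otherwise (equivalent to commenting the line out).
    """
    if importlib.util.find_spec(module_name) is not None:
        return {"_attn_implementation": "flash_attention_2"}
    return {}


# Usage, mirroring the call in the script above (not executed here):
# model = AutoModel.from_pretrained(
#     model_name,
#     trust_remote_code=True,
#     use_safetensors=True,
#     torch_dtype=torch.bfloat16,
#     **attention_kwargs(),
# )
```

With this approach the same script runs unchanged on machines with and without the flash-attention wheel.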
