PC environment
Windows 11
Setting up the Python environment
I use uv. The pyproject.toml is included below, so the environment can be set up with just `uv sync`.
```toml
[project]
name = "lightonocr2"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.14"
dependencies = [
    "pillow==12.1.0",
    "pypdfium2==5.3.0",
    "torch==2.10.0+cu126",
    "torchvision==0.25.0+cu126",
    "transformers==5.0.0",
]

[[tool.uv.index]]
name = "torch-cuda"
url = "https://download.pytorch.org/whl/cu126"
explicit = true

[tool.uv.sources]
torch = [{ index = "torch-cuda" }]
torchvision = [{ index = "torch-cuda" }]
```
Results
Input image
I used a blog article as the input.

Output (results)
The result comes back as Markdown.
Impressively, it correctly identifies that the code block in the source image is PowerShell.
````markdown
# スクリプトのダウンロード

PowerShellを想定しています。

```powershell
# ダウンロードするファイルのRaw URL
$url = "https://raw.githubusercontent.com/microsoft/VibeVoice/main/demo/vibevoice_asr_gradio_demo.py"

# 保存するファイル名
$output = "vibevoice_asr_gradio_demo.py"

# ダウンロード実行
Invoke-WebRequest -Uri $url -OutFile $output
```

## 実行

```powershell
uv run vibevoice_asr_gradio_demo.py --host localhost
```
````
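Since the model returns Markdown, the fenced code blocks (and their language tags, like `powershell` above) can be pulled out for further processing. A small stdlib-only sketch, not part of the original script; the function name and sample text are my own:

```python
import re

# Match fenced code blocks: an optional language tag after the opening
# fence, then the body up to the closing fence. `{3} means three backticks.
FENCE_RE = re.compile(r"`{3}(\w*)\n(.*?)`{3}", re.DOTALL)

def extract_code_blocks(markdown: str):
    """Return (language, code) pairs for every fenced block in the text."""
    return [(lang or "plain", body.strip()) for lang, body in FENCE_RE.findall(markdown)]

# Build a small sample programmatically so its fences don't clash
# with this snippet's own fence.
TICKS = "`" * 3
sample = "\n".join([
    "# Download",
    TICKS + "powershell",
    "Invoke-WebRequest -Uri $url -OutFile $output",
    TICKS,
    "## Run",
    TICKS + "powershell",
    "uv run demo.py --host localhost",
    TICKS,
])

for lang, code in extract_code_blocks(sample):
    print(lang, "->", code)
```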
Python script
```python
import torch
from transformers import LightOnOcrForConditionalGeneration, LightOnOcrProcessor

device = "cuda"
dtype = torch.bfloat16

# Load the model (here from a local directory; the commented-out line
# loads from the Hugging Face Hub instead)
model = LightOnOcrForConditionalGeneration.from_pretrained(
    # "lightonai/LightOnOCR-2-1B",
    "LightOnOCR-2-1B",
    torch_dtype=dtype,
).to(device)

processor = LightOnOcrProcessor.from_pretrained(
    # "lightonai/LightOnOCR-2-1B"
    "LightOnOCR-2-1B",
)

url = "sample.jpg"

conversation = [{
    "role": "user",
    "content": [{"type": "image", "url": url}],
}]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)

# Move tensors to the GPU; cast only floating-point tensors to bfloat16
inputs = {
    k: v.to(device=device, dtype=dtype) if v.is_floating_point() else v.to(device)
    for k, v in inputs.items()
}

output_ids = model.generate(**inputs, max_new_tokens=1024)
# Drop the prompt tokens; decode only the newly generated ones
generated_ids = output_ids[0, inputs["input_ids"].shape[1]:]
output_text = processor.decode(generated_ids, skip_special_tokens=True)
print(output_text)
```
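The slicing `output_ids[0, inputs["input_ids"].shape[1]:]` works because `generate` returns the prompt and the completion as one concatenated sequence. A torch-free sketch with made-up token IDs illustrates the trimming:

```python
# Minimal sketch of the prompt-trimming step, using hypothetical token IDs
# in plain lists instead of real tensors: generate() returns
# prompt + completion as a single sequence, so decoding must skip the prompt.
prompt_ids = [101, 7592, 2088]            # stand-in for the tokenized prompt
completion_ids = [2023, 2003, 19204]      # stand-in for the generated tokens
output_ids = prompt_ids + completion_ids  # what generate() effectively returns

prompt_len = len(prompt_ids)              # inputs["input_ids"].shape[1] in the script
generated_ids = output_ids[prompt_len:]   # output_ids[0, prompt_len:] in the script
print(generated_ids)  # [2023, 2003, 19204]
```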
Running with Gradio
```toml
[project]
name = "lightonocr2"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.14"
dependencies = [
    "gradio==6.5.1",
    "pillow==12.1.0",
    "pypdfium2==5.3.0",
    "torch==2.10.0+cu126",
    "torchvision==0.25.0+cu126",
    "transformers==5.0.0",
]

[[tool.uv.index]]
name = "torch-cuda"
url = "https://download.pytorch.org/whl/cu126"
explicit = true

[tool.uv.sources]
torch = [{ index = "torch-cuda" }]
torchvision = [{ index = "torch-cuda" }]
```
```python
import torch
import gradio as gr
from transformers import LightOnOcrForConditionalGeneration, LightOnOcrProcessor

device = "cuda"
dtype = torch.bfloat16

# Load the model (here from the Hugging Face Hub; the commented-out line
# loads from a local directory instead)
model = LightOnOcrForConditionalGeneration.from_pretrained(
    "lightonai/LightOnOCR-2-1B",
    # "LightOnOCR-2-1B",
    torch_dtype=dtype,
).to(device)

processor = LightOnOcrProcessor.from_pretrained(
    "lightonai/LightOnOCR-2-1B"
    # "LightOnOCR-2-1B",
)

def ocr_process(image):
    """Process an image and extract text using LightOnOCR"""
    if image is None:
        return "Please upload an image"

    conversation = [{
        "role": "user",
        "content": [{"type": "image", "image": image}],
    }]

    inputs = processor.apply_chat_template(
        conversation,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    )

    # Move tensors to the GPU; cast only floating-point tensors to bfloat16
    inputs = {
        k: v.to(device=device, dtype=dtype) if v.is_floating_point() else v.to(device)
        for k, v in inputs.items()
    }

    output_ids = model.generate(**inputs, max_new_tokens=1024)
    # Drop the prompt tokens; decode only the newly generated ones
    generated_ids = output_ids[0, inputs["input_ids"].shape[1]:]
    output_text = processor.decode(generated_ids, skip_special_tokens=True)
    return output_text

# Create Gradio interface
demo = gr.Interface(
    fn=ocr_process,
    inputs=gr.Image(type="pil", label="Upload Image"),
    outputs=gr.Markdown(label="Extracted Text", buttons=["copy"]),
    title="LightOnOCR",
    description="Extract text from images using LightOnOCR-2-1B model",
)

if __name__ == "__main__":
    demo.launch()
```