PC environment
Windows 11
Setting up the Python environment
I use uv. The pyproject.toml is included below, so the environment can be set up with just `uv sync`.
```toml
[project]
name = "lightonocr2"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.14"
dependencies = [
    "pillow==12.1.0",
    "pypdfium2==5.3.0",
    "torch==2.10.0+cu126",
    "torchvision==0.25.0+cu126",
    "transformers==5.0.0",
]

[[tool.uv.index]]
name = "torch-cuda"
url = "https://download.pytorch.org/whl/cu126"
explicit = true

[tool.uv.sources]
torch = [{ index = "torch-cuda" }]
torchvision = [{ index = "torch-cuda" }]
```
Results
Input image
I used a blog article as the input.

Output (results)
The result comes back as Markdown.
Impressively, it correctly identifies that the code block in the source image is PowerShell.
````markdown
# スクリプトのダウンロード

PowerShellを想定しています。

```powershell
# ダウンロードするファイルのRaw URL
$url = "https://raw.githubusercontent.com/microsoft/VibeVoice/main/demo/vibevoice_asr_gradio_demo.py"

# 保存するファイル名
$output = "vibevoice_asr_gradio_demo.py"

# ダウンロード実行
Invoke-WebRequest -Uri $url -OutFile $output
```

## 実行

```powershell
uv run vibevoice_asr_gradio_demo.py --host localhost
```
````
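Since the model returns Markdown, the fenced code blocks (and their language tags, like `powershell` above) can be pulled out for further processing. A small stdlib-only sketch, not part of the original script; the function name and sample text are my own:

```python
import re

# Match fenced code blocks: an optional language tag after the opening
# fence, then the body up to the closing fence. `{3} means three backticks.
FENCE_RE = re.compile(r"`{3}(\w*)\n(.*?)`{3}", re.DOTALL)

def extract_code_blocks(markdown: str):
    """Return (language, code) pairs for every fenced block in the text."""
    return [(lang or "plain", body.strip()) for lang, body in FENCE_RE.findall(markdown)]

# Build a small sample programmatically so its fences don't clash
# with this snippet's own fence.
TICKS = "`" * 3
sample = "\n".join([
    "# Download",
    TICKS + "powershell",
    "Invoke-WebRequest -Uri $url -OutFile $output",
    TICKS,
    "## Run",
    TICKS + "powershell",
    "uv run demo.py --host localhost",
    TICKS,
])

for lang, code in extract_code_blocks(sample):
    print(lang, "->", code)
```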
Python script
```python
import torch
from transformers import LightOnOcrForConditionalGeneration, LightOnOcrProcessor

device = "cuda"
dtype = torch.bfloat16

# Load the model (here from a local directory; the commented-out line
# loads from the Hugging Face Hub instead)
model = LightOnOcrForConditionalGeneration.from_pretrained(
    # "lightonai/LightOnOCR-2-1B",
    "LightOnOCR-2-1B",
    torch_dtype=dtype,
).to(device)

processor = LightOnOcrProcessor.from_pretrained(
    # "lightonai/LightOnOCR-2-1B"
    "LightOnOCR-2-1B",
)

url = "sample.jpg"

conversation = [{
    "role": "user",
    "content": [{"type": "image", "url": url}],
}]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)

# Move tensors to the GPU; cast only floating-point tensors to bfloat16
inputs = {
    k: v.to(device=device, dtype=dtype) if v.is_floating_point() else v.to(device)
    for k, v in inputs.items()
}

output_ids = model.generate(**inputs, max_new_tokens=1024)
# Drop the prompt tokens; decode only the newly generated ones
generated_ids = output_ids[0, inputs["input_ids"].shape[1]:]
output_text = processor.decode(generated_ids, skip_special_tokens=True)
print(output_text)
```
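The slicing `output_ids[0, inputs["input_ids"].shape[1]:]` works because `generate` returns the prompt and the completion as one concatenated sequence. A torch-free sketch with made-up token IDs illustrates the trimming:

```python
# Minimal sketch of the prompt-trimming step, using hypothetical token IDs
# in plain lists instead of real tensors: generate() returns
# prompt + completion as a single sequence, so decoding must skip the prompt.
prompt_ids = [101, 7592, 2088]            # stand-in for the tokenized prompt
completion_ids = [2023, 2003, 19204]      # stand-in for the generated tokens
output_ids = prompt_ids + completion_ids  # what generate() effectively returns

prompt_len = len(prompt_ids)              # inputs["input_ids"].shape[1] in the script
generated_ids = output_ids[prompt_len:]   # output_ids[0, prompt_len:] in the script
print(generated_ids)  # [2023, 2003, 19204]
```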
Running with Gradio
```toml
[project]
name = "lightonocr2"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.14"
dependencies = [
    "gradio==6.5.1",
    "pillow==12.1.0",
    "pypdfium2==5.3.0",
    "torch==2.10.0+cu126",
    "torchvision==0.25.0+cu126",
    "transformers==5.0.0",
]

[[tool.uv.index]]
name = "torch-cuda"
url = "https://download.pytorch.org/whl/cu126"
explicit = true

[tool.uv.sources]
torch = [{ index = "torch-cuda" }]
torchvision = [{ index = "torch-cuda" }]
```
```python
import torch
import gradio as gr
from transformers import LightOnOcrForConditionalGeneration, LightOnOcrProcessor

device = "cuda"
dtype = torch.bfloat16

# Load the model (here from the Hugging Face Hub; the commented-out line
# loads from a local directory instead)
model = LightOnOcrForConditionalGeneration.from_pretrained(
    "lightonai/LightOnOCR-2-1B",
    # "LightOnOCR-2-1B",
    torch_dtype=dtype,
).to(device)

processor = LightOnOcrProcessor.from_pretrained(
    "lightonai/LightOnOCR-2-1B"
    # "LightOnOCR-2-1B",
)

def ocr_process(image):
    """Process an image and extract text using LightOnOCR"""
    if image is None:
        return "Please upload an image"

    conversation = [{
        "role": "user",
        "content": [{"type": "image", "image": image}],
    }]

    inputs = processor.apply_chat_template(
        conversation,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    )

    # Move tensors to the GPU; cast only floating-point tensors to bfloat16
    inputs = {
        k: v.to(device=device, dtype=dtype) if v.is_floating_point() else v.to(device)
        for k, v in inputs.items()
    }

    output_ids = model.generate(**inputs, max_new_tokens=1024)
    # Drop the prompt tokens; decode only the newly generated ones
    generated_ids = output_ids[0, inputs["input_ids"].shape[1]:]
    output_text = processor.decode(generated_ids, skip_special_tokens=True)
    return output_text

# Create Gradio interface
demo = gr.Interface(
    fn=ocr_process,
    inputs=gr.Image(type="pil", label="Upload Image"),
    outputs=gr.Markdown(label="Extracted Text", buttons=["copy"]),
    title="LightOnOCR",
    description="Extract text from images using LightOnOCR-2-1B model",
)

if __name__ == "__main__":
    demo.launch()
```