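◾️Introduction

　What I want to do
"Build a system that extracts data from images or PDFs and stores it in a DB
(using free tools only)." The plan is sketched in "１）Processing flow" and
"２）System architecture" under "【６】Bonus: What I have in mind" below.

As a first step, I looked into the "Tesseract OCR" part of that plan.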

【１】Tesseract OCR
　１）License
　２）Official site
【２】Environment setup
　１）Using Docker
【３】Sample
【４】Options
　１）--psm (page segmentation mode)
　２）--oem
　３）-l jpn
　４）-c preserve_interword_spaces=1
【５】Whitelist / Blacklist
　１）tessedit_char_whitelist
　２）tessedit_char_blacklist
【６】Bonus: What I have in mind
　１）Processing flow
　２）System architecture

【１】Tesseract OCR

* I played with the Java version a little, a long time ago:

https://dk521123.hatenablog.com/entry/2017/01/17/215359
https://dk521123.hatenablog.com/entry/2017/01/09/000100

* "Tesseract" means a four-dimensional cube (hypercube)
* Implemented in C++ (also easy to use from Python)
* Supports Japanese
* Accuracy can be improved through machine learning (training)

１）License

https://github.com/tesseract-ocr/tesseract

* Apache License 2.0

２）Official site

https://github.com/tesseract-ocr/tesseract/wiki

【２】Environment setup

１）Using Docker

* This time I set everything up with Docker

Dockerfile

FROM python:3.13-slim

RUN apt-get update && apt-get install -y --no-install-recommends \
    tesseract-ocr \
    tesseract-ocr-jpn \
    fonts-noto-cjk \
    build-essential \
    gcc \
    libgl1 \
    && apt-get clean && rm -rf /var/lib/apt/lists/*

WORKDIR /app

RUN pip install --no-cache-dir pillow
RUN pip install --no-cache-dir pytesseract

Example commands

# Build the Docker image
docker build -t demo-env .

# Generate the sample files (only if you want to run the samples)
docker run --rm -v "$PWD":/app demo-env python generate_samples.py

【３】Sample

* Prepare an image file yourself beforehand
 => This time I used the sample from the related article below

Python ~ Image processing / Pillow ~
https://dk521123.hatenablog.com/entry/2023/07/10/000000

demo_ocr.py

import pytesseract
from PIL import Image

# Tesseract OCR needs to be installed separately.
# sudo apt install tesseract-ocr -y (Linux)
# brew install tesseract (macOS)
# Windows has an installer

def extract_text_from_image(img_path):
    img = Image.open(img_path)
    text = pytesseract.image_to_string(img, lang='jpn')
    return text

if __name__ == "__main__":
    img_path = "test_files/sample.png"
    extracted_text = extract_text_from_image(img_path)
    print("Extracted Text:")
    print(extracted_text)

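【４】Options

https://tesseract-ocr.github.io/tessdoc/Command-Line-Usage

Options	Explanations	Memo
--psm 6	Best for a single uniform block of text (a paragraph with line breaks)
--oem 3	OCR Engine Mode: default, chosen from LSTM and the legacy engine based on what is available
-l jpn	Use the Japanese model	Requires tesseract-ocr-jpn
-c preserve_interword_spaces=1	Preserve the spaces between words (do not collapse them)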
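
As a rough sketch of how the options in this table can be combined from Python: the helper below assembles them into the `config` string that `pytesseract.image_to_string` accepts. The names `build_config` and `ocr_with_options` are made up for illustration and are not part of pytesseract; the language is passed through the `lang` argument, which corresponds to `-l`.

```python
def build_config(psm=6, oem=3, preserve_spaces=True):
    """Assemble a Tesseract option string (hypothetical helper)."""
    parts = [f"--psm {psm}", f"--oem {oem}"]
    if preserve_spaces:
        parts.append("-c preserve_interword_spaces=1")
    return " ".join(parts)


def ocr_with_options(img_path, lang="jpn", **options):
    """Run OCR on one image with the options built above."""
    # Imported lazily so build_config works without the OCR stack installed.
    import pytesseract
    from PIL import Image

    with Image.open(img_path) as img:
        return pytesseract.image_to_string(
            img, lang=lang, config=build_config(**options)
        )


if __name__ == "__main__":
    # Only prints the option string; no image is read here.
    print(build_config())
```

With the sample image from 【３】, a call such as `ocr_with_options("test_files/sample.png", psm=6)` should behave like the earlier sample plus the extra options.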

１）--psm (page segmentation mode)

* PSM = Page Segmentation Mode
* Controls layout analysis; pick the mode that matches the kind of image being read

https://binary-star.net/tesseract-option-psm

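Value	Explanations	Memo
0	Orientation and script detection (OSD) only
1	Automatic page segmentation with OSD
2	Automatic page segmentation, but no OSD or OCR	OCR is not performed
3	Fully automatic page segmentation, but no OSD	Default
4	Assume a single column of text of variable sizes
5	Assume a single uniform block of vertically aligned text
6	Assume a single uniform block of text	Use this to recognize multi-line text
7	Treat the image as a single text line
8	Treat the image as a single word
9	Treat the image as a single word in a circle	e.g. ①
10	Treat the image as a single character
11	Sparse text. Find as much text as possible in no particular order (no OSD)
12	Sparse text with OSD. Find as much text as possible
13	Raw line. Treat the image as a single text line, bypassing Tesseract-specific processing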
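
Since the best PSM depends on the image, one practical approach is simply to try a few values and compare the output. The following is a minimal sketch under that assumption; `compare_psm` and its `ocr_func` parameter are hypothetical names, and by default it falls back to `pytesseract.image_to_string`.

```python
def compare_psm(image, psm_values=(3, 6, 11), ocr_func=None, lang="jpn"):
    """Run OCR once per PSM value and return {psm: recognized text}."""
    if ocr_func is None:
        # Default to pytesseract; imported lazily so the helper itself
        # only needs the standard library.
        import pytesseract
        ocr_func = pytesseract.image_to_string

    results = {}
    for psm in psm_values:
        results[psm] = ocr_func(image, lang=lang, config=f"--psm {psm}")
    return results
```

`image` can be anything the OCR function accepts (for pytesseract, a PIL image or a file path). Passing a different `ocr_func` makes it easy to swap in another engine or a stub while experimenting.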

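２）--oem

* OCR Engine Mode

Value	Explanations	Memo
0	Legacy Tesseract engine only (the engine used through the 3.x releases)
1	Neural-network LSTM engine only
2	Legacy engine and LSTM engine combined
3	Chosen from LSTM and the legacy engine based on what is available	Default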

３）-l jpn

* Specifies the language
* Requires tesseract-ocr-jpn

Options	Explanations
-l jpn	Japanese
-l jpn_vert	Japanese (vertical writing)
-l eng+jpn	English and Japanese
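
A small sketch of combining languages safely: before joining codes with `+`, check that each one is actually installed. `build_lang` and `installed_languages` are hypothetical helper names; the second one relies on `pytesseract.get_languages`, which needs the tesseract binary to be present.

```python
def build_lang(langs, available):
    """Join language codes with '+', failing early if one is not installed."""
    missing = [code for code in langs if code not in available]
    if missing:
        raise ValueError(f"language data not installed: {', '.join(missing)}")
    return "+".join(langs)


def installed_languages():
    """Ask Tesseract which language data files it can see."""
    # Lazy import: requires pytesseract and the tesseract binary.
    import pytesseract
    return pytesseract.get_languages(config="")
```

For example, `build_lang(["eng", "jpn"], installed_languages())` would produce the `eng+jpn` value from the table, or raise if `tesseract-ocr-jpn` is missing from the image.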

４）-c preserve_interword_spaces=1

* Preserves the spaces between words instead of collapsing them

【５】Whitelist / Blacklist

１）tessedit_char_whitelist

* When only specific characters should be recognized, list the characters to allow

Sample

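import pytesseract
from PIL import Image

# Only the characters listed after tessedit_char_whitelist= may appear in the output.
custom_config = r'--psm 6 -l jpn --oem 3 -c tessedit_char_whitelist=一二三四五六七八九〇万年収〜ー0123456789円会社名職種勤務地給与： '
image = Image.open('sample_image.png')
text = pytesseract.image_to_string(image, config=custom_config)

print("OCR result:")
print(text)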

２）tessedit_char_blacklist

* When specific characters should never be recognized, list the characters to exclude
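
The blacklist is used the same way as the whitelist. Below is a minimal sketch, assuming the goal is to suppress a few symbols that tend to show up as noise; the helper names and the example characters are hypothetical. Whitespace is rejected because options inside the config string are separated by spaces.

```python
def blacklist_config(chars, psm=6):
    """Build a config string that excludes the given characters."""
    if any(ch.isspace() for ch in chars):
        raise ValueError("whitespace cannot be listed in this simple sketch")
    return f"--psm {psm} -c tessedit_char_blacklist={chars}"


def ocr_without_chars(image_path, chars, lang="jpn"):
    """Run OCR while excluding the blacklisted characters."""
    # Lazy import: requires pytesseract, Pillow and the tesseract binary.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(
            image, lang=lang, config=blacklist_config(chars)
        )
```

A call such as `ocr_without_chars("sample_image.png", "|_~")` would then run the same image as the whitelist sample with those three symbols excluded.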

【６】Bonus: What I have in mind

I want to build a system that extracts data from images or PDFs and stores it in a DB
(using free tools only)

１）Processing flow

[PDF / image]
   ↓
[OCR or PDF parser]
   ↓
[Text extraction]
   ↓
[Rule-based processing or lightweight NLP]
   ↓
[Data formatting]
   ↓
[Store in DB]
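
The second half of this flow (rule-based processing, formatting, storing) can be sketched with the standard library alone. Everything specific here is an assumption for illustration: the "label: value" text layout, the field names, and the `postings` table are all hypothetical, and real documents will need their own rules.

```python
import re
import sqlite3

# Hypothetical "label: value" layout (half- or full-width colon).
FIELD_PATTERN = re.compile(
    r"^(?P<label>Company|Job|Location|Salary)\s*[:：]\s*(?P<value>.+)$",
    re.MULTILINE,
)
COLUMNS = ("company", "job", "location", "salary")


def parse_fields(text):
    """Rule-based step: pull 'label: value' pairs out of extracted text."""
    return {
        m.group("label").lower(): m.group("value").strip()
        for m in FIELD_PATTERN.finditer(text)
    }


def store(conn, record):
    """DB step: insert one record; fields that were not found become NULL."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS postings "
        "(company TEXT, job TEXT, location TEXT, salary TEXT)"
    )
    conn.execute(
        "INSERT INTO postings VALUES (:company, :job, :location, :salary)",
        {key: record.get(key) for key in COLUMNS},
    )
    conn.commit()


def run_pipeline(text, conn):
    """Text in, parsed record out, with the record stored along the way."""
    record = parse_fields(text)
    store(conn, record)
    return record
```

The `text` argument stands for whatever the OCR or PDF-parser step returns, so the upstream part can be swapped without touching this code. Using `sqlite3.connect(":memory:")` is a convenient way to try it without creating a file.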

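２）System architecture

Function	Tool	Memo
OCR (images)	Tesseract OCR	Japanese is also supported
PDF parsing	PyMuPDF or pdfminer.six
NLP extraction	Python + regular expressions / spaCy
DB	SQLite or PostgreSQL	Start with SQLite, eventually PostgreSQL?
UI	Streamlit	Can be handled one way or another, so it is deferred for now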
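
The OCR and PDF-parsing rows above could be wired together roughly as follows: choose the reader from the file extension. This is only a sketch under my own assumptions; the function names and the list of image extensions are hypothetical, PyMuPDF is picked arbitrarily over pdfminer.six, and scanned PDFs without a text layer (which would need OCR instead) are not handled.

```python
from pathlib import Path

# Assumed set of image types to send to OCR.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def read_pdf(path):
    """Extract the text layer of a PDF with PyMuPDF."""
    # Lazy import; PyMuPDF is imported under the name "fitz".
    import fitz

    doc = fitz.open(path)
    try:
        return "\n".join(page.get_text() for page in doc)
    finally:
        doc.close()


def read_image(path, lang="jpn"):
    """Extract text from an image with Tesseract OCR."""
    import pytesseract
    from PIL import Image

    with Image.open(path) as img:
        return pytesseract.image_to_string(img, lang=lang)


def extract_text(path, pdf_reader=read_pdf, image_reader=read_image):
    """Dispatch to the PDF parser or OCR based on the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        return pdf_reader(path)
    if suffix in IMAGE_SUFFIXES:
        return image_reader(path)
    raise ValueError(f"unsupported file type: {suffix or path}")
```

Keeping the readers as parameters means either tool in the table can be replaced later (for example pdfminer.six for PDFs) without changing the dispatch logic.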