OCR with Tesseract OCR and pytesseract

岩佐孝浩

June 6, 2022

5 min read

Computer Vision Python

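Introduction

Optical character recognition (OCR) is an interesting area of computer vision that extracts text from images. This article shows how to run OCR on a Japanese PDF using Tesseract OCR v4 and pytesseract. The source text is Osamu Dazai's "Run, Melos!" (走れメロス), whose copyright has expired.

The code examples are available in a GitHub repository for reference.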

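Prerequisites

Make sure the Python libraries used in this article, pdf2image and pytesseract, are installed.

Tesseract OCR itself must also be installed. See the official repository for installation instructions.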
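Before running anything, it can help to confirm that the pieces are in place. The following is a minimal sketch, not part of the original code: the helper name `check_environment` is hypothetical, and it assumes that the Tesseract executable is called `tesseract` and that pdf2image relies on Poppler's `pdftoppm` being on the PATH.

```python
import importlib.util
import shutil


def check_environment() -> dict:
    """Report which prerequisites appear to be available on this machine."""
    return {
        # Python packages imported by the script in this article.
        'pdf2image': importlib.util.find_spec('pdf2image') is not None,
        'pytesseract': importlib.util.find_spec('pytesseract') is not None,
        # External executables (names are assumptions, see the text above).
        'tesseract': shutil.which('tesseract') is not None,
        'pdftoppm': shutil.which('pdftoppm') is not None,
    }


if __name__ == '__main__':
    for name, available in check_environment().items():
        print(f"{name}: {'found' if available else 'MISSING'}")
```

Because the OCR call in this article uses `lang='jpn'`, the Japanese language data is needed as well; running `tesseract --list-langs` should show `jpn` in its output.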

OCR Procedure

Step 1: Convert the PDF to images

First, convert each page of the PDF to a PNG image with pdf2image. The following Python function does this.

from typing import Optional

import pdf2image


def to_images(pdf_path: str, first_page: Optional[int] = None, last_page: Optional[int] = None) -> list:
    """Convert PDF pages to PNG images (page numbers are 1-based)."""
    images = pdf2image.convert_from_path(
        pdf_path=pdf_path,
        fmt='png',
        first_page=first_page,
        last_page=last_page,
    )
    return images
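For a long PDF, holding every page image in memory at once may be costly. One possible approach, sketched here as an assumption rather than something the original script does, is to split the page numbers into batches and pass each pair to `to_images` as `first_page` and `last_page`. The helper name `page_ranges` and the batch size are hypothetical.

```python
from typing import Iterator, Tuple


def page_ranges(total_pages: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive, 1-based (first_page, last_page) pairs covering all pages."""
    if total_pages < 0 or batch_size < 1:
        raise ValueError('total_pages must be >= 0 and batch_size must be >= 1')
    for first in range(1, total_pages + 1, batch_size):
        # The last batch may be shorter than batch_size.
        yield first, min(first + batch_size - 1, total_pages)


if __name__ == '__main__':
    # A 5-page PDF processed two pages at a time.
    for first_page, last_page in page_ranges(5, 2):
        # Each pair could be passed on as to_images(pdf_path, first_page, last_page).
        print(first_page, last_page)
```

The page count itself could come from `pdf2image.pdfinfo_from_path`, which should report a `Pages` entry for the file.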

Step 2: Extract text with Tesseract OCR

Next, extract text from the converted images with pytesseract. The function is as follows.

import pytesseract


def to_string(image) -> str:
    """Run OCR on a single page image using the Japanese language data."""
    return pytesseract.image_to_string(image, lang='jpn')

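Step 3: Normalize the extracted text

To make the extracted text easier to read, normalize it by removing newlines and the unnecessary spaces between Japanese characters.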

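import re


def normalize(target: str) -> str:
    """Remove newlines and whitespace between Japanese characters."""
    result = target.replace('\n', '')
    # Raw strings keep \s intact; the lookahead leaves the next character unconsumed.
    result = re.sub(r'([あ-んア-ン一-鿐])\s+(?=[あ-んア-ン一-鿐])', r'\1', result)
    return result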
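To see what the normalization does in practice, here is a small self-contained sketch. It re-implements `normalize` with one addition that is not in the original: a hypothetical `char_class` parameter, so the article's character class can be compared with a wider one. The wider class (`JAPANESE_WIDE`) is an assumption on my part; it additionally covers ぁ, ァ, ヴ, ヵ, ヶ and the long-vowel mark ー, which fall outside the ranges あ-ん and ア-ン.

```python
import re

# Same character class as the article: hiragana あ-ん, katakana ア-ン, and kanji.
JAPANESE = '[あ-んア-ン一-鿐]'
# Hypothetical wider class that also covers ぁ, ァ, ヴ, ヵ, ヶ and the long-vowel mark ー.
JAPANESE_WIDE = '[ぁ-んァ-ヶー一-鿐]'


def normalize(target: str, char_class: str = JAPANESE) -> str:
    """Remove newlines, then whitespace that sits between two Japanese characters."""
    result = target.replace('\n', '')
    # The lookahead does not consume the following character, so it can serve
    # as the left-hand character of the next match.
    return re.sub(rf'({char_class})\s+(?={char_class})', r'\1', result)


if __name__ == '__main__':
    print(normalize('メロ ス は 激怒\nし た。 OCR test'))
    print(normalize('メ ー ル'))
    print(normalize('メ ー ル', JAPANESE_WIDE))
```

With the article's class, the first sample becomes `メロスは激怒した。 OCR test`: spaces between Japanese characters disappear, while the spaces around the Latin text stay. The second sample is left unchanged because ー is not in the class, whereas the wider class turns it into `メール`.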

Step 4: Save the result

Finally, save the processed text to a file.

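def save(result: str) -> str:
    """Save the OCR result to a UTF-8 text file and return its path."""
    path = 'result.txt'
    # Set the encoding explicitly; the platform default may not handle Japanese text.
    with open(path, 'w', encoding='utf-8') as f:
        f.write(result)
    return path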

Complete Python script

The following script combines all of the steps above.

import re
from datetime import datetime
from typing import Optional

import pdf2image
import pytesseract


def to_images(pdf_path: str, first_page: Optional[int] = None, last_page: Optional[int] = None) -> list:
    """ Convert PDF pages to PNG images.

    Args:
        pdf_path (str): PDF path
        first_page (int): First page to convert (1-based)
        last_page (int): Last page to convert (inclusive)

    Returns:
        list: List of page images
    """

    print(f'Converting a PDF ({pdf_path}) to PNG images...')
    images = pdf2image.convert_from_path(
        pdf_path=pdf_path,
        fmt='png',
        first_page=first_page,
        last_page=last_page,
    )
    print(f'Converted {len(images)} page(s) to PNG images.')
    return images


def to_string(image) -> str:
    """ Run OCR on a single page image.

    Args:
        image: Image data

    Returns:
        str: Text recognized by OCR
    """

    print('Extracting characters from an image...')
    return pytesseract.image_to_string(image, lang='jpn')


def normalize(target: str) -> str:
    """ Normalize result text.

    Applies the following:
    - Remove newlines.
    - Remove spaces between Japanese characters.

    Args:
        target (str): Target text to be normalized

    Returns:
        str: Normalized text
    """

    result = target.replace('\n', '')
    # Raw strings keep \s intact; the lookahead leaves the next character unconsumed.
    result = re.sub(r'([あ-んア-ン一-鿐])\s+(?=[あ-んア-ン一-鿐])', r'\1', result)
    return result


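def save(result: str) -> str:
    """ Save the result text in a UTF-8 text file.

    Args:
        result (str): Result text

    Returns:
        str: Text file path
    """

    path = 'result.txt'
    # Set the encoding explicitly; the platform default may not handle Japanese text.
    with open(path, 'w', encoding='utf-8') as f:
        f.write(result)
    return path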


def main() -> None:
    start = datetime.now()

    # OCR pages 1 and 2 only, then join the per-page text before normalizing.
    images = to_images('run-melos.pdf', 1, 2)
    result = normalize(''.join(to_string(image) for image in images))
    path = save(result)

    end = datetime.now()
    duration = (end - start).total_seconds()

    print('----------------------------------------')
    print(f'Start: {start}')
    print(f'End: {end}')
    print(f'Duration: {int(duration)} seconds')
    print(f'Result file path: {path}')
    print('----------------------------------------')


if __name__ == '__main__':
    main()

Results

Original text
OCR result

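About diff tools

The script removes newlines, so the whole result ends up on a single line. The diff command compares text line by line, which makes it a poor fit for this comparison. A GUI-based tool such as Araxis Merge or WinMerge is recommended instead.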
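If a scriptable comparison is preferred, Python's standard `difflib` module can compare two strings character by character, so missing newlines are not a problem. This is only a sketch of one possible alternative, not the method used in this article; the helper names `similarity` and `char_diff` and the sample strings are made up for illustration.

```python
import difflib
from typing import List, Tuple


def similarity(expected: str, actual: str) -> float:
    """Return a similarity ratio between 0.0 and 1.0 for two strings."""
    # autojunk=False disables a heuristic that can skew results on long texts.
    return difflib.SequenceMatcher(None, expected, actual, autojunk=False).ratio()


def char_diff(expected: str, actual: str) -> List[Tuple[str, str, str]]:
    """List (operation, expected_part, actual_part) for every differing span."""
    matcher = difflib.SequenceMatcher(None, expected, actual, autojunk=False)
    return [
        (tag, expected[i1:i2], actual[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != 'equal'
    ]


if __name__ == '__main__':
    expected = 'メロスは激怒した。'
    actual = 'メロスは激努した'  # made-up OCR output with two errors
    print(f'similarity: {similarity(expected, actual):.2f}')
    for tag, before, after in char_diff(expected, actual):
        print(tag, repr(before), repr(after))
```

For the sample above, `char_diff` reports one replaced character (怒 became 努) and one deleted character (。). On a full text, the same list gives a rough picture of where the OCR went wrong.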

Summary

With Tesseract OCR and pytesseract, text can be extracted efficiently from Japanese PDFs. The results here are good, but accuracy may suffer when the PDF has a complex layout. Adjusting Tesseract's settings may improve accuracy further.
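As a starting point for that kind of tuning, Tesseract options can be passed through the `config` argument of `pytesseract.image_to_string`. The sketch below is an assumption about how one might wire this up, not something the original script does: `build_config` and `to_string_with_config` are hypothetical helpers, and the default values (`--psm 6`, `--oem 1`) are examples rather than recommendations.

```python
def build_config(psm: int = 6, oem: int = 1) -> str:
    """Build a Tesseract command-line option string.

    psm: page segmentation mode (0-13), e.g. 6 = assume a single uniform block of text.
    oem: OCR engine mode (0-3), e.g. 1 = LSTM neural network engine only.
    """
    if not 0 <= psm <= 13:
        raise ValueError('psm must be between 0 and 13')
    if not 0 <= oem <= 3:
        raise ValueError('oem must be between 0 and 3')
    return f'--oem {oem} --psm {psm}'


def to_string_with_config(image, psm: int = 6, oem: int = 1, lang: str = 'jpn') -> str:
    """OCR an image with explicit Tesseract options (hypothetical variant of to_string)."""
    # Imported here so build_config can be used without pytesseract installed.
    import pytesseract

    return pytesseract.image_to_string(image, lang=lang, config=build_config(psm, oem))
```

Which mode works best depends on the page layout, so it is worth trying a few values on a sample page. For vertically typeset Japanese, the `jpn_vert` language data may also be worth a try, assuming it is installed.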

Happy Coding! 🚀