How to Run Tesseract OCR and pytesseract Using AWS Lambda Container Images

岩佐孝浩

June 22, 2022


Lambda Python

Introduction

Developers can run efficient, scalable OCR processing by combining Tesseract OCR and pytesseract with Lambda container images.

The sample code used in this article is available in the GitHub repository.

Prerequisites

Make sure the following tools are installed on your system.

AWS SAM
Python 3.x
Docker (needed by AWS SAM to build and locally invoke the container image)

Project Setup

Directory Structure

Organize the project as follows.

/
|-- src/
|   |-- Dockerfile
|   |-- __init__.py
|   |-- app.py
|   |-- requirements.txt
|   `-- run-melos.pdf
|-- README.md
|-- __init__.py
|-- requirements.txt
`-- template.yaml

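Writing the AWS SAM Template

The following SAM template configures a Lambda function that EventBridge triggers on a schedule. EventBridge is used instead of API Gateway because API Gateway has a 29-second integration timeout limit, which the sample Python script exceeds.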

AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: Tesseract OCR Sample with AWS Lambda Container Images using AWS SAM
Resources:
  TesseractOcrSample:
    Type: AWS::Serverless::Function
    Properties:
      Events:
        Schedule:
          Type: Schedule
          Properties:
            Enabled: true
            Schedule: cron(0 * * * ? *)
      MemorySize: 512
      PackageType: Image
      Timeout: 900
    Metadata:
      DockerTag: latest
      DockerContext: ./src/
      Dockerfile: Dockerfile
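
With the `Schedule` event source, the function is invoked with an EventBridge scheduled event rather than an HTTP request. The sketch below shows one way a handler might recognize such an event. The helper name `is_scheduled_event` and every value in the sample payload (account ID, rule name, timestamp) are illustrative assumptions and are not part of the sample repository.

```python
def is_scheduled_event(event: dict) -> bool:
    """Return True if the event looks like an EventBridge scheduled event."""
    return (
        event.get('source') == 'aws.events'
        and event.get('detail-type') == 'Scheduled Event'
    )


# Illustrative payload: the account ID, rule name, and timestamp are placeholders.
sample_event = {
    'version': '0',
    'id': '00000000-0000-0000-0000-000000000000',
    'detail-type': 'Scheduled Event',
    'source': 'aws.events',
    'account': '123456789012',
    'time': '2022-06-22T00:00:00Z',
    'region': 'ap-northeast-1',
    'resources': ['arn:aws:events:ap-northeast-1:123456789012:rule/example-rule'],
    'detail': {},
}

print(is_scheduled_event(sample_event))
```

The sample handler in this article ignores the event entirely, so a check like this would only matter if the same function were also invoked from other sources.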

Creating the Dockerfile

Create a Dockerfile to define the runtime environment. When processing text in a specific language such as Japanese, set the LANG environment variable (line 3) to avoid encoding problems.

FROM public.ecr.aws/lambda/python:3.9

ENV LANG=ja_JP.UTF-8
WORKDIR ${LAMBDA_TASK_ROOT}
COPY app.py ./
COPY requirements.txt ./
COPY run-melos.pdf ./
RUN rpm -Uvh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm \
    && yum update -y && yum install -y poppler-utils tesseract tesseract-langpack-jpn \
    && pip install -U pip && pip install -r requirements.txt --target "${LAMBDA_TASK_ROOT}"

CMD ["app.lambda_handler"]
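
Both Python libraries are thin wrappers around command-line tools: pdf2image calls Poppler utilities such as `pdftoppm`, and pytesseract calls the `tesseract` binary. That is why the Dockerfile installs them with yum. As a minimal sketch, the following check reports which commands are not on PATH inside the image; the helper name `missing_binaries` is hypothetical and not part of the sample code.

```python
import shutil


def missing_binaries(names: list) -> list:
    """Return the commands from `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


# `pdftoppm` ships with poppler-utils; `tesseract` is the OCR engine itself.
missing = missing_binaries(['pdftoppm', 'tesseract'])
if missing:
    print(f'Missing commands: {", ".join(missing)}')
else:
    print('All required commands are available.')
```

Running a check like this at the start of the handler can turn a confusing runtime failure into an explicit message when a package is dropped from the image by mistake.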

Writing the Python Script

Defining `requirements.txt`

Add the required libraries to requirements.txt.

pdf2image==1.16.0
pytesseract==0.3.9

Implementing `app.py`

This script converts a PDF to images, runs OCR on each image, and logs the result.

import re
from datetime import datetime

import pdf2image
import pytesseract


def lambda_handler(event: dict, context) -> None:
    start = datetime.now()
    result = ''

    images = to_images('run-melos.pdf', 1, 2)
    for image in images:
        result += to_string(image)
    result = normalize(result)

    end = datetime.now()
    duration = end.timestamp() - start.timestamp()

    print('----------------------------------------')
    print(f'Start: {start}')
    print(f'End: {end}')
    print(f'Duration: {int(duration)} seconds')
    print(f'Result: {result}')
    print('----------------------------------------')


def to_images(pdf_path: str, first_page: int = None, last_page: int = None) -> list:
    """ Convert PDF pages to PNG images.

    Args:
        pdf_path (str): PDF path
        first_page (int): First page to convert (1-based)
        last_page (int): Last page to convert

    Returns:
        list: List of image data
    """

    print(f'Convert a PDF ({pdf_path}) to a png...')
    images = pdf2image.convert_from_path(
        pdf_path=pdf_path,
        fmt='png',
        first_page=first_page,
        last_page=last_page,
    )
    print(f'Converted {len(images)} page(s) to PNG images.')
    return images


def to_string(image) -> str:
    """ OCR an image data.

    Args:
        image: Image data

    Returns:
        str: OCR processed characters
    """

    print('Extract characters from an image...')
    return pytesseract.image_to_string(image, lang='jpn')


def normalize(target: str) -> str:
    """ Normalize result text.

    Applying the following:
    - Remove newlines.
    - Remove spaces between Japanese characters.

    Args:
        target (str): Target text to be normalized

    Returns:
        str: Normalized text
    """

    result = re.sub('\n', '', target)
    result = re.sub(r'([あ-んア-ン一-鿐])\s+(?=[あ-んア-ン一-鿐])', r'\1', result)
    return result
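
To make the effect of `normalize` concrete, here is a standalone copy of the same logic applied to a sample string. The input text is made up for illustration, and the constant name `JP` is an assumption introduced only to shorten the pattern.

```python
import re

# Character ranges used by the article's pattern: hiragana, katakana, kanji.
JP = 'あ-んア-ン一-鿐'


def normalize(target: str) -> str:
    """Remove newlines, then remove whitespace between Japanese characters."""
    result = target.replace('\n', '')
    # The lookahead leaves the following character unconsumed, so runs such as
    # "あ い う" collapse completely in a single pass.
    return re.sub(rf'([{JP}])\s+(?=[{JP}])', r'\1', result)


print(normalize('メロスは 激怒 した。\nHello world'))
# → メロスは激怒した。Hello world
```

Whitespace between non-Japanese characters is left untouched, which is why the space in "Hello world" survives.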

Build and Deploy

Building the Application

Run the following command to build the application.

sam build

To run the application locally, run the following command.

sam local invoke

Deploying the Application

If the ECR repository does not exist yet, create it with the following command.

aws ecr create-repository --repository-name tesseract-ocr-lambda

Deploy the application, replacing the account ID and Region in the image repository URI with your own.

sam deploy \
  --stack-name aws-lambda-tesseract-ocr-sample \
  --image-repository 123456789012.dkr.ecr.ap-northeast-1.amazonaws.com/tesseract-ocr-lambda \
  --capabilities CAPABILITY_IAM

After deployment, the Lambda function runs every hour, and the OCR results are written to CloudWatch Logs.
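
Because the handler prints a `Duration: N seconds` line on every run, the log output can also be used to track how long OCR takes. The following is a minimal sketch that assumes the log messages have already been retrieved as plain text. The function name `extract_durations` and the sample log text are hypothetical.

```python
import re


def extract_durations(log_text: str) -> list:
    """Return every duration (in seconds) printed by the handler."""
    pattern = re.compile(r'^Duration: (\d+) seconds$', re.MULTILINE)
    return [int(value) for value in pattern.findall(log_text)]


# Made-up log excerpt covering two invocations.
sample_log = """\
Start: 2022-06-22 00:00:01.000000
End: 2022-06-22 00:00:43.000000
Duration: 42 seconds
Start: 2022-06-22 01:00:01.000000
End: 2022-06-22 01:00:39.000000
Duration: 38 seconds
"""

print(extract_durations(sample_log))
```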

Cleanup

To clean up the provisioned AWS resources, run the following command.

sam delete --stack-name aws-lambda-tesseract-ocr-sample

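Conclusion

Running Tesseract OCR on AWS Lambda as a container image lets you process complex OCR workflows efficiently and at scale. Docker's flexibility lets you build an environment tailored to your specific requirements.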

Happy Coding! 🚀