Enhancing OCR with Tesseract OCR and Pytesseract
Introduction
Optical Character Recognition (OCR) is a fascinating field of computer vision that enables the extraction of text from images. In this post, I will demonstrate how to perform OCR on Japanese PDFs using Tesseract OCR v4 and pytesseract. The source text is Run, Melos! by Osamu Dazai, a work that is now in the public domain.
For those interested, you can access the code examples from my GitHub repository.
Prerequisites
Before we start, ensure the following Python libraries are installed: pdf2image and pytesseract.
Additionally, Tesseract OCR itself must be installed, along with the Japanese language data (jpn); follow the instructions on the official repository to set it up. Note that pdf2image also relies on Poppler being available on your system.
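A quick way to confirm everything is wired up is to ask pytesseract which Tesseract binary and languages it can see (a small sanity-check sketch; get_languages requires a reasonably recent pytesseract release):
import pytesseract

print(pytesseract.get_tesseract_version())   # should report a 4.x version
print(pytesseract.get_languages(config=''))  # 'jpn' should appear in this list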
Steps to Perform OCR
Step 1: Convert PDF to Images
To process the PDF, we first convert each page into a PNG image using pdf2image. Below is the Python function for this task:
def to_images(pdf_path: str, first_page: int = None, last_page: int = None) -> list:
    """ Convert a PDF to PNG images. """
    images = pdf2image.convert_from_path(
        pdf_path=pdf_path,
        fmt='png',
        first_page=first_page,
        last_page=last_page,
    )
    return images
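For example, limiting the conversion to the first two pages keeps experiments fast (a usage sketch; run-melos.pdf is the sample file used in the full script below). If the scan quality is poor, passing a higher dpi to convert_from_path (the default is 200) usually gives Tesseract more to work with.
images = to_images('run-melos.pdf', first_page=1, last_page=2)
print(len(images))  # one PIL image per converted page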
Step 2: Extract Text with Tesseract OCR
After converting to images, use pytesseract to extract text from the image data. Here’s the corresponding function:
def to_string(image) -> str:
    """ Perform OCR on image data. """
    return pytesseract.image_to_string(image, lang='jpn')
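pytesseract passes extra options straight through to Tesseract, so you can experiment with page segmentation modes or, for vertically typeset pages, the jpn_vert model (a sketch of options worth trying; whether jpn_vert is available depends on which language data you installed):
# Treat the page as a single uniform block of text (--psm 6).
text = pytesseract.image_to_string(image, lang='jpn', config='--psm 6')

# Vertical Japanese layouts are often better served by the jpn_vert model.
text_vertical = pytesseract.image_to_string(image, lang='jpn_vert')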
Step 3: Normalize Extracted Text
Normalize the extracted text by removing newlines and unnecessary spaces between Japanese characters for better readability:
def normalize(target: str) -> str:
    """ Normalize OCR result text. """
    result = re.sub(r'\n', '', target)
    result = re.sub(r'([あ-んア-ン一-鿐])\s+((?=[あ-んア-ン一-鿐]))', r'\1\2', result)
    return result
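As a quick illustration of what this does (a made-up input string, not actual OCR output), the stray spaces that Tesseract tends to insert between Japanese characters disappear:
sample = 'メロス は\n激怒 した。'
print(normalize(sample))  # -> 'メロスは激怒した。'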
Step 4: Save the Result
Finally, save the processed text into a file for future use:
def save(result: str) -> str:
    """ Save OCR results to a text file. """
    path = 'result.txt'
    # Write as UTF-8 explicitly so Japanese text is saved correctly
    # regardless of the platform's default encoding.
    with open(path, 'w', encoding='utf-8') as f:
        f.write(result)
    return path
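To spot-check the output, read the file back (result.txt is the path returned by save()):
with open('result.txt', encoding='utf-8') as f:
    print(f.read()[:200])  # preview the first 200 characters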
Full Python Script
Here is the complete script combining all the steps:
import re
from datetime import datetime

import pdf2image
import pytesseract


def to_images(pdf_path: str, first_page: int = None, last_page: int = None) -> list:
    """ Convert a PDF to PNG images.

    Args:
        pdf_path (str): PDF path
        first_page (int): First page to be converted, starting from 1
        last_page (int): Last page to be converted

    Returns:
        list: List of image data
    """
    print(f'Converting a PDF ({pdf_path}) to PNG images...')
    images = pdf2image.convert_from_path(
        pdf_path=pdf_path,
        fmt='png',
        first_page=first_page,
        last_page=last_page,
    )
    print(f'Total number of converted PNG images: {len(images)}.')
    return images


def to_string(image) -> str:
    """ Perform OCR on image data.

    Args:
        image: Image data

    Returns:
        str: OCR-extracted text
    """
    print('Extracting characters from an image...')
    return pytesseract.image_to_string(image, lang='jpn')


def normalize(target: str) -> str:
    """ Normalize OCR result text.

    Applies the following:
    - Remove newlines.
    - Remove spaces between Japanese characters.

    Args:
        target (str): Target text to be normalized

    Returns:
        str: Normalized text
    """
    result = re.sub(r'\n', '', target)
    result = re.sub(r'([あ-んア-ン一-鿐])\s+((?=[あ-んア-ン一-鿐]))', r'\1\2', result)
    return result


def save(result: str) -> str:
    """ Save the result text to a text file.

    Args:
        result (str): Result text

    Returns:
        str: Text file path
    """
    path = 'result.txt'
    # Write as UTF-8 explicitly so Japanese text is saved correctly
    # regardless of the platform's default encoding.
    with open(path, 'w', encoding='utf-8') as f:
        f.write(result)
    return path


def main() -> None:
    start = datetime.now()
    result = ''
    images = to_images('run-melos.pdf', 1, 2)
    for image in images:
        result += to_string(image)
    result = normalize(result)
    path = save(result)
    end = datetime.now()
    duration = end.timestamp() - start.timestamp()
    print('----------------------------------------')
    print(f'Start: {start}')
    print(f'End: {end}')
    print(f'Duration: {int(duration)} seconds')
    print(f'Result file path: {path}')
    print('----------------------------------------')


if __name__ == '__main__':
    main()
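The sample run above only covers the first two pages. To OCR the entire PDF, simply drop the page limits when calling to_images (a one-line tweak to main()):
images = to_images('run-melos.pdf')  # convert every page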
Result
- Original text: the source PDF (run-melos.pdf)
- OCR result: the generated result.txt
Notes on Diff Tools
Since the script removes newlines, comparing results using the diff command may be ineffective. Instead, consider GUI-based tools such as Araxis Merge or WinMerge.
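If you would rather stay in Python, difflib can give a rough similarity score between the OCR output and a transcription of the original (a sketch; original.txt is a hypothetical file containing the source text):
import difflib

with open('original.txt', encoding='utf-8') as f:
    original = f.read()
with open('result.txt', encoding='utf-8') as f:
    ocr = f.read()

# Ratio of matching characters between the two texts (1.0 means identical).
print(difflib.SequenceMatcher(None, original, ocr).ratio())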
Conclusion
Using Tesseract OCR and pytesseract, we can efficiently extract text from Japanese PDFs. While the results are promising, the accuracy may vary based on the complexity of the PDF layout. Further enhancements and experimentation with Tesseract configurations can help improve outcomes.
Happy Coding! 🚀