Enhancing OCR with Tesseract OCR and Pytesseract
Introduction
Optical Character Recognition (OCR) is a fascinating field of computer vision that enables the extraction of text from images. In this post, I will demonstrate how to perform OCR on Japanese PDFs using Tesseract OCR v4 and pytesseract. The source text is Run, Melos! by Osamu Dazai, a work that is now in the public domain.
For those interested, you can access the code examples from my GitHub repository.
Prerequisites
Before we start, ensure the following Python libraries are installed: pdf2image and pytesseract.
Additionally, Tesseract OCR itself must be installed, along with the Japanese language data (jpn); follow the instructions on the official repository to set it up. Note that pdf2image also relies on Poppler being available on your system.
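A quick way to confirm everything is wired up is to ask pytesseract which Tesseract binary and languages it can see (a small sanity-check sketch; get_languages requires a reasonably recent pytesseract release):
import pytesseract

print(pytesseract.get_tesseract_version())   # should report a 4.x version
print(pytesseract.get_languages(config=''))  # 'jpn' should appear in this list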
Steps to Perform OCR
Step 1: Convert PDF to Images
To process the PDF, we first convert each page into a PNG image using pdf2image. Below is the Python function for this task:
def to_images(pdf_path: str, first_page: int = None, last_page: int = None) -> list:
    """ Convert a PDF to PNG images. """
    images = pdf2image.convert_from_path(
        pdf_path=pdf_path,
        fmt='png',
        first_page=first_page,
        last_page=last_page,
    )
    return images
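For example, limiting the conversion to the first two pages keeps experiments fast (a usage sketch; run-melos.pdf is the sample file used in the full script below). If the scan quality is poor, passing a higher dpi to convert_from_path (the default is 200) usually gives Tesseract more to work with.
images = to_images('run-melos.pdf', first_page=1, last_page=2)
print(len(images))  # one PIL image per converted page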
Step 2: Extract Text with Tesseract OCR
After converting to images, use pytesseract to extract text from the image data. Here’s the corresponding function:
def to_string(image) -> str:
    """ Perform OCR on image data. """
    return pytesseract.image_to_string(image, lang='jpn')
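pytesseract passes extra options straight through to Tesseract, so you can experiment with page segmentation modes or, for vertically typeset pages, the jpn_vert model (a sketch of options worth trying; whether jpn_vert is available depends on which language data you installed):
# Treat the page as a single uniform block of text (--psm 6).
text = pytesseract.image_to_string(image, lang='jpn', config='--psm 6')

# Vertical Japanese layouts are often better served by the jpn_vert model.
text_vertical = pytesseract.image_to_string(image, lang='jpn_vert')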
Step 3: Normalize Extracted Text
Normalize the extracted text by removing newlines and unnecessary spaces between Japanese characters for better readability:
def normalize(target: str) -> str:
    """ Normalize OCR result text. """
    result = re.sub(r'\n', '', target)
    result = re.sub(r'([あ-んア-ン一-鿐])\s+((?=[あ-んア-ン一-鿐]))', r'\1\2', result)
    return result
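As a quick illustration of what this does (a made-up input string, not actual OCR output), the stray spaces that Tesseract tends to insert between Japanese characters disappear:
sample = 'メロス は\n激怒 した。'
print(normalize(sample))  # -> 'メロスは激怒した。'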
Step 4: Save the Result
Finally, save the processed text into a file for future use:
def save(result: str) -> str:
    """ Save OCR results to a text file. """
    path = 'result.txt'
    # Write as UTF-8 explicitly so Japanese text is saved correctly
    # regardless of the platform's default encoding.
    with open(path, 'w', encoding='utf-8') as f:
        f.write(result)
    return path
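To spot-check the output, read the file back (result.txt is the path returned by save()):
with open('result.txt', encoding='utf-8') as f:
    print(f.read()[:200])  # preview the first 200 characters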
Full Python Script
Here is the complete script combining all the steps:
import re
from datetime import datetime

import pdf2image
import pytesseract


def to_images(pdf_path: str, first_page: int = None, last_page: int = None) -> list:
    """ Convert a PDF to PNG images.

    Args:
        pdf_path (str): PDF path
        first_page (int): First page to be converted, starting from 1
        last_page (int): Last page to be converted

    Returns:
        list: List of image data
    """
    print(f'Converting a PDF ({pdf_path}) to PNG images...')
    images = pdf2image.convert_from_path(
        pdf_path=pdf_path,
        fmt='png',
        first_page=first_page,
        last_page=last_page,
    )
    print(f'Total number of converted PNG images: {len(images)}.')
    return images


def to_string(image) -> str:
    """ Perform OCR on image data.

    Args:
        image: Image data

    Returns:
        str: OCR-extracted text
    """
    print('Extracting characters from an image...')
    return pytesseract.image_to_string(image, lang='jpn')


def normalize(target: str) -> str:
    """ Normalize OCR result text.

    Applies the following:
    - Remove newlines.
    - Remove spaces between Japanese characters.

    Args:
        target (str): Target text to be normalized

    Returns:
        str: Normalized text
    """
    result = re.sub(r'\n', '', target)
    result = re.sub(r'([あ-んア-ン一-鿐])\s+((?=[あ-んア-ン一-鿐]))', r'\1\2', result)
    return result


def save(result: str) -> str:
    """ Save the result text to a text file.

    Args:
        result (str): Result text

    Returns:
        str: Text file path
    """
    path = 'result.txt'
    # Write as UTF-8 explicitly so Japanese text is saved correctly
    # regardless of the platform's default encoding.
    with open(path, 'w', encoding='utf-8') as f:
        f.write(result)
    return path


def main() -> None:
    start = datetime.now()
    result = ''
    images = to_images('run-melos.pdf', 1, 2)
    for image in images:
        result += to_string(image)
    result = normalize(result)
    path = save(result)
    end = datetime.now()
    duration = end.timestamp() - start.timestamp()
    print('----------------------------------------')
    print(f'Start: {start}')
    print(f'End: {end}')
    print(f'Duration: {int(duration)} seconds')
    print(f'Result file path: {path}')
    print('----------------------------------------')


if __name__ == '__main__':
    main()
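The sample run above only covers the first two pages. To OCR the entire PDF, simply drop the page limits when calling to_images (a one-line tweak to main()):
images = to_images('run-melos.pdf')  # convert every page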
Result
- Original text: the source PDF (run-melos.pdf)
- OCR result: the generated result.txt
Notes on Diff Tools
Since the script removes newlines, comparing results using the diff command may be ineffective. Instead, consider GUI-based tools such as Araxis Merge or WinMerge.
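If you would rather stay in Python, difflib can give a rough similarity score between the OCR output and a transcription of the original (a sketch; original.txt is a hypothetical file containing the source text):
import difflib

with open('original.txt', encoding='utf-8') as f:
    original = f.read()
with open('result.txt', encoding='utf-8') as f:
    ocr = f.read()

# Ratio of matching characters between the two texts (1.0 means identical).
print(difflib.SequenceMatcher(None, original, ocr).ratio())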
Conclusion
Using Tesseract OCR and pytesseract, we can efficiently extract text from Japanese PDFs. While the results are promising, the accuracy may vary based on the complexity of the PDF layout. Further enhancements and experimentation with Tesseract configurations can help improve outcomes.
Happy Coding! 🚀