Python – Tesseract – OCR – IMAGE

You can do some pretty cool things with tesseract-ocr.  PyOCR is a Python wrapper for Tesseract that lets you extract the text from an image in a few lines of code.

Example Image:

aws_.jpg

Example Output:

Tesseract.png

Example Code:

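from PIL import Image as PI
import pyocr
import pyocr.builders
import io
import sys

# Python 3 strings are already Unicode, so the Python 2
# reload(sys) / sys.setdefaultencoding('utf8') workaround is not needed.

# Use the first OCR tool PyOCR can find (normally Tesseract).
tools = pyocr.get_available_tools()
if not tools:
    sys.exit("No OCR tool found; install Tesseract first.")
tool = tools[0]

# Pick the language by name instead of by list position.
# Assumption: English ('eng') is wanted when it is installed.
langs = tool.get_available_languages()
lang = 'eng' if 'eng' in langs else langs[0]

# With TextBuilder, image_to_string returns the recognized text as one string.
txt = tool.image_to_string(
    PI.open('/home/build/aws.jpg'),
    lang=lang,
    builder=pyocr.builders.TextBuilder())

# Write the text as UTF-8; the with block closes the file for us.
with io.open('output.txt', 'w', encoding='utf-8') as output_file:
    output_file.write(txt)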

Another use case I was working on today was extracting the text from a PDF file using Tesseract.  I converted the PDF to an image file first, then performed the steps above to read the text from the new image.
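
The PDF workflow could look roughly like the sketch below. This is not the exact script I ran; it assumes Wand (with ImageMagick and Ghostscript available so PDFs can be read) for the conversion and PyOCR for the recognition. The function names, the 300 dpi resolution, and the 'eng' language are illustrative assumptions.

```python
import io


def pdf_to_page_blobs(pdf_path, resolution=300):
    """Render each PDF page to JPEG bytes using Wand (ImageMagick bindings)."""
    # Imported here so the rest of the sketch loads even without ImageMagick.
    from wand.image import Image

    blobs = []
    with Image(filename=pdf_path, resolution=resolution) as pdf:
        for page in pdf.sequence:
            with Image(image=page) as page_image:
                blobs.append(page_image.make_blob('jpeg'))
    return blobs


def ocr_blobs(blobs, recognize):
    """Run recognize (image bytes -> text) on every page and join the results."""
    return "\n".join(recognize(blob) for blob in blobs)


def make_pyocr_recognizer(lang='eng'):
    """Build a bytes -> text function backed by the first available PyOCR tool."""
    from PIL import Image as PI
    import pyocr
    import pyocr.builders

    tools = pyocr.get_available_tools()
    if not tools:
        raise RuntimeError("No OCR tool found; install Tesseract first.")
    tool = tools[0]

    def recognize(blob):
        # Open the in-memory JPEG with PIL and hand it to the OCR tool.
        return tool.image_to_string(
            PI.open(io.BytesIO(blob)),
            lang=lang,
            builder=pyocr.builders.TextBuilder())

    return recognize
```

With everything installed, ocr_blobs(pdf_to_page_blobs(path), make_pyocr_recognizer()) returns the text of all pages separated by newlines. Passing the recognizer in as a function keeps the page loop independent of the OCR backend.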

Here are a couple of valuable resources I used to complete this little test.

  • Installing Tesseract on a RHEL system – http://www.keienberg.com/install-tesseract-3-04-centos-7/
  • Installing PyOCR and other image conversion tools – https://pythontips.com/2016/02/25/ocr-on-pdf-files-using-python/

Getting all the prerequisites installed was by far the hardest part of this effort.
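
Since the setup is where things tend to go wrong, a quick check like the one below can show which pieces are still missing before you run any OCR code. It is only a sketch: the module names (pyocr, PIL, wand) and the tesseract executable name are assumptions based on the tools used in this post, and finding a module does not prove its native libraries (ImageMagick, Ghostscript) load correctly.

```python
import importlib.util
import shutil


def missing_prerequisites(modules=('pyocr', 'PIL', 'wand'),
                          binaries=('tesseract',)):
    """Return the Python modules and executables that cannot be found."""
    missing = []
    for name in modules:
        # find_spec returns None when a top-level module is not importable.
        if importlib.util.find_spec(name) is None:
            missing.append('python module: ' + name)
    for name in binaries:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(name) is None:
            missing.append('executable: ' + name)
    return missing


print(missing_prerequisites() or 'All prerequisites found')
```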