You can do some pretty cool things with tesseract-ocr. PyOCR is a Python wrapper for Tesseract, and with it you can extract the text from an image.
Example Image:

Example Output:

Example Code:
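from PIL import Image as PI
import pyocr
import pyocr.builders
import io

# Use the first OCR tool PyOCR finds (Tesseract, if it is installed).
tools = pyocr.get_available_tools()
if not tools:
    raise RuntimeError('No OCR tool found; is Tesseract installed?')
tool = tools[0]

# This takes the second installed language; print
# tool.get_available_languages() to see what that is on your system.
lang = tool.get_available_languages()[1]

# With TextBuilder, image_to_string() returns the recognized text as one string.
txt = tool.image_to_string(
    PI.open('/home/build/aws.jpg'),
    lang=lang,
    builder=pyocr.builders.TextBuilder())

# Write the result as UTF-8 rather than changing the interpreter's default encoding.
with io.open('output.txt', 'w', encoding='utf8') as output_file:
    output_file.write(txt)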
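The example above picks the OCR language by its position in the list, which can differ from one machine to the next. Here is a minimal sketch of choosing it by name instead. `pick_language` and `ocr_image` are hypothetical helper names of my own, and I am assuming the English data is installed under Tesseract's usual `eng` code.

```python
def pick_language(available, preferred=('eng',)):
    """Return the first preferred language code that is actually installed."""
    for code in preferred:
        if code in available:
            return code
    raise ValueError('none of %r found in %r'
                     % (list(preferred), list(available)))


def ocr_image(path, preferred=('eng',)):
    """OCR one image file with the first tool PyOCR reports."""
    # Imported here so pick_language() stays usable without the OCR stack.
    from PIL import Image
    import pyocr
    import pyocr.builders

    tools = pyocr.get_available_tools()
    if not tools:
        raise RuntimeError('PyOCR found no OCR tool; is Tesseract installed?')
    tool = tools[0]
    lang = pick_language(tool.get_available_languages(), preferred)
    return tool.image_to_string(
        Image.open(path),
        lang=lang,
        builder=pyocr.builders.TextBuilder())
```

Calling `ocr_image('/home/build/aws.jpg')` should then give the same text as the example, as long as `eng` is among the installed languages.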
Another use case I was working on today was extracting the text from a PDF file using Tesseract. I converted the PDF to an image file first, then performed the above actions to read the text from the new image.
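A rough sketch of that PDF flow, not the exact script I ran: Wand renders each page to JPEG bytes, and each page is then handed to PyOCR. The function names, the `example.pdf` path, the 300 dpi resolution, and the `eng` language code are all assumptions for illustration. Wand needs ImageMagick, which in turn relies on Ghostscript to read PDFs.

```python
import io


def pdf_to_jpeg_blobs(pdf_path, resolution=300):
    """Render each page of a PDF to JPEG bytes using Wand (ImageMagick)."""
    # Imported here so the other helpers work without ImageMagick installed.
    from wand.image import Image

    blobs = []
    with Image(filename=pdf_path, resolution=resolution) as pdf:
        for page in pdf.sequence:
            with Image(image=page) as single:
                blobs.append(single.make_blob('jpeg'))
    return blobs


def ocr_blobs(blobs, image_to_text):
    """Run image_to_text on each page and join the results with blank lines."""
    return '\n\n'.join(image_to_text(io.BytesIO(blob)) for blob in blobs)


def tesseract_image_to_text(lang='eng'):
    """Build a file-object -> text function backed by PyOCR."""
    from PIL import Image as PI
    import pyocr
    import pyocr.builders

    tool = pyocr.get_available_tools()[0]

    def image_to_text(fileobj):
        return tool.image_to_string(
            PI.open(fileobj),
            lang=lang,
            builder=pyocr.builders.TextBuilder())

    return image_to_text


if __name__ == '__main__':
    # 'example.pdf' is a placeholder path.
    pages = pdf_to_jpeg_blobs('example.pdf')
    print(ocr_blobs(pages, tesseract_image_to_text()))
```

Keeping the OCR step as a plain function argument means the page-joining logic can be tried out without Tesseract installed.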
Here are a couple of valuable resources I used to complete this little test.
- Installing Tesseract on a RHEL system – http://www.keienberg.com/install-tesseract-3-04-centos-7/
- Installing PyOCR and other image conversion tools – https://pythontips.com/2016/02/25/ocr-on-pdf-files-using-python/
Getting all the prerequisites installed was by far the hardest part of this effort.
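If you want a quick sanity check before running anything, something like the following can report what is still missing. It is only a sketch for Python 3: `missing_prerequisites` is a hypothetical helper, and the default module and binary names (`pyocr`, `PIL`, `wand`, `tesseract`, and `gs` for Ghostscript) are my assumptions about what this setup needs.

```python
import importlib.util
import shutil


def missing_prerequisites(modules=('pyocr', 'PIL', 'wand'),
                          binaries=('tesseract', 'gs')):
    """Return a list describing Python modules and executables not found."""
    missing = []
    for name in modules:
        # find_spec() returns None when a top-level module cannot be located.
        if importlib.util.find_spec(name) is None:
            missing.append('python module: ' + name)
    for name in binaries:
        # which() returns None when the executable is not on PATH.
        if shutil.which(name) is None:
            missing.append('executable: ' + name)
    return missing


if __name__ == '__main__':
    for item in missing_prerequisites():
        print('missing ' + item)
```

An empty list only means everything was found; it does not confirm that the installed versions work together.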