Python

Python – Tesseract – OCR – IMAGE

You can do some pretty cool things with tesseract-ocr.  Using PyOCR, a Python wrapper for Tesseract, you can extract the text from an image.

Example Image:

aws_.jpg

Example Output:

Tesseract.png

Example Code:

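from PIL import Image as PI
import pyocr
import pyocr.builders
import io
import sys

# Use the first OCR tool PyOCR can find (Tesseract, if it is installed)
tools = pyocr.get_available_tools()
if not tools:
    sys.exit('No OCR tool found')
tool = tools[0]

# Prefer English when it is installed, otherwise fall back to the first language
langs = tool.get_available_languages()
lang = 'eng' if 'eng' in langs else langs[0]

# With TextBuilder, image_to_string returns the recognized text as one string
txt = tool.image_to_string(
 PI.open('/home/build/aws.jpg'),
 lang=lang,
 builder=pyocr.builders.TextBuilder())

# io.open takes care of the UTF-8 encoding, so no sys.setdefaultencoding hack is needed
with io.open('output.txt', 'w', encoding='utf8') as outputFile:
 outputFile.write(txt)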

Another use case I was working on today was extracting the text from a PDF file using Tesseract.  I converted the PDF to an image file first, then performed the above actions to read the text from the new image.
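
Here is a rough sketch of that PDF flow. It assumes Wand (the ImageMagick bindings) does the PDF-to-image conversion, which is what the second resource below uses. The function names and the 300 dpi resolution are my own picks for illustration, not anything PyOCR or Wand requires.

```python
import io


def render_pdf_pages(pdf_path, resolution=300):
    """Render each page of a PDF to JPEG bytes using Wand."""
    # Imported here so the module still loads on a box without Wand installed
    from wand.image import Image

    blobs = []
    with Image(filename=pdf_path, resolution=resolution) as pdf:
        for page in pdf.sequence:
            with Image(image=page) as page_image:
                blobs.append(page_image.make_blob('jpeg'))
    return blobs


def make_pyocr_reader(tool, lang):
    """Return a callable that OCRs one page blob with the given PyOCR tool."""
    from PIL import Image as PI
    import pyocr.builders

    def ocr_page(blob):
        return tool.image_to_string(
            PI.open(io.BytesIO(blob)),
            lang=lang,
            builder=pyocr.builders.TextBuilder())
    return ocr_page


def ocr_pages(page_blobs, ocr_page):
    """Run ocr_page over every page blob and join the page texts."""
    return '\n'.join(ocr_page(blob) for blob in page_blobs)
```

With the tool and lang values from the example code above, usage would look something like ocr_pages(render_pdf_pages('some.pdf'), make_pyocr_reader(tool, lang)).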

Here are a couple of valuable resources I used to complete this little test.

  • Installing Tesseract on a RHEL system – http://www.keienberg.com/install-tesseract-3-04-centos-7/
  • Installing PyOCR and other image conversion tools – https://pythontips.com/2016/02/25/ocr-on-pdf-files-using-python/

Getting all the prerequisites installed was by far the hardest part of this effort.

AWS Lambda and S3

Below are a couple of problems I ran into when writing a Python 2.7 Lambda function that created a file and then uploaded it to S3 with s3.upload_file.

  1. The file I was creating and writing to in the function was empty in S3 after the upload.
    • Turns out I needed the “( )” parentheses on the Python “close” call.  Without them the method is only referenced, never called, so the buffered data may never get flushed to disk before the upload.  Silly issue, but it took me like 20 minutes to figure out….
  2. In your Lambda function, you need to create your files under /tmp, which is your function's ephemeral storage.
    • fileName = '/tmp/' + name
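
Both fixes can be sketched together like this. The function name, its parameters, and the bucket are made up for the example. The only real API call is upload_file(Filename, Bucket, Key) on a boto3 S3 client, which gets passed in so the function is easy to try without AWS.

```python
import os


def write_and_upload(s3_client, bucket, name, lines, tmp_dir='/tmp'):
    """Write lines to a file under tmp_dir, then upload that file to S3."""
    file_name = os.path.join(tmp_dir, name)

    # The with block closes (and flushes) the file before the upload starts,
    # so there is no close() call to forget the parentheses on
    with open(file_name, 'w') as output_file:
        for line in lines:
            output_file.write('%s\n' % line)

    # boto3 S3 client call: upload_file(Filename, Bucket, Key)
    s3_client.upload_file(file_name, bucket, name)
    return file_name
```

Inside a Lambda handler, s3_client would typically be boto3.client('s3').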

s3_image

Cygwin and PIP Package Missing

I installed Cygwin today and forgot to install the PIP package for Python 2.7.

cygwin

Looking online, I found that you could install the PIP package by re-running the Cygwin installation .exe.  However, re-running the .exe took a LOT longer than I expected.

It is 100% faster downloading and executing get-pip.py from the following location:

Execute “python -m pip --version” to verify your version!

Flask-RESTful – Basic Authentication

I’m continuing to develop the REST API that will be used with the API.AI Webhook.  I decided that some sort of authentication is needed.

I played around with adding Basic Authentication to my API, as API.AI supports this.  Below are the steps I took to set up authentication using Flask.  (I recommend reading the Flask-HTTPAuth documentation.)

  1. Include the necessary package
    • Flask-HTTPAuth==2.3.0
    • flaskauth
  2. Add get_password callback function.
    1. @auth.get_password
      def get_password(username):
          if username == 'devopsunleasheduser':
              return 'devopsunleashedpassword'
          return None
  3. Add error_handler callback function (Note: “jsonify()” and “make_response()” both need to be imported from the flask package)
    1. @auth.error_handler
      def unauthorized():
          # return 403 instead of 401 to prevent browsers from displaying the default
          # auth dialog
          return make_response(jsonify({'message': 'Unauthorized access'}), 403)
  4. Add login_required decorator to both classes to verify authentication before returning any info.
    1. decorators = [auth.login_required]
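
To see what a client such as API.AI actually sends, here is a small sketch of how the Basic Authentication header is built: the username and password are joined with a colon and base64-encoded. The function name is just something I picked for the example.

```python
import base64


def basic_auth_header(username, password):
    """Build the Authorization header for HTTP Basic Authentication."""
    credentials = '{}:{}'.format(username, password).encode('utf-8')
    token = base64.b64encode(credentials).decode('ascii')
    return {'Authorization': 'Basic ' + token}
```

Keep in mind that base64 is an encoding, not encryption, so Basic Authentication should only travel over HTTPS.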

 

Python, Flask, and Rest API

This blog post is a continuation of my Google Home\Google Action side project.  The first blog post is here.

The project keeps evolving the more I think about it.  The 2 diagrams below illustrate the workflows that need to be developed.

Populate the Database – Workflow 1

googlehome2

Google Home\Google Action – Workflow 2

googlehome3

So as you can see above, a critical piece of this puzzle is the REST API that needs to be created to allow us to interact with the data.

For the REST API, I have decided to use Python, Flask, and Flask-RESTful.  For this service I can host the API instance on Heroku if needed.  As for which database to use, I am thinking MongoDB, and if needed host it online with mLab.  Tools I will use to help get this API created are Postman, Curl, and PyCharm.  (Great resource)

First – Let's stub out the REST API code and the HTTP method “flask-restful” classes using PyCharm.

restmethods

Below is a “rough” first draft of the python REST API.

from flask import Flask
from flask_restful import reqparse, abort, Api, Resource

app = Flask(__name__)
api = Api(app)

# Sample lunch menu data for testing
LUNCHMENU = {
    '4202017': {'menu': 'Main entry is corn dogs'},
    '4212017': {'menu': 'Main entry is chicken nuggets'},
    '4222017': {'menu': 'Main entry is tacos'},
}

# Argument parser - must add all arguments explicitly here
parser = reqparse.RequestParser()
parser.add_argument('item')
parser.add_argument('date')

# Check if item exists and error if it doesn't
def error_if_item_doesnt_exist(id):
    if id not in LUNCHMENU:
        abort(404, message="Menu Item {} doesn't exist".format(id))

# Check if item exists and error if it does
def error_if_item_does_exist(id):
    if id in LUNCHMENU:
        # 409 Conflict: the item is already there, so a 404 would be misleading
        abort(409, message="Menu Item {} already exists".format(id))


# Define get, delete, and put method class
class MenuItem(Resource):
    def get(self, id):
        error_if_item_doesnt_exist(id)
        return LUNCHMENU[id]

    def delete(self, id):
        error_if_item_doesnt_exist(id)
        del LUNCHMENU[id]
        return '', 204

    def put(self, id):
        args = parser.parse_args()
        # Store under 'menu' so new entries match the sample data
        LUNCHMENU[id] = {'menu': args['item']}
        return LUNCHMENU[id], 201


# Define list and post method class
class MenuList(Resource):
    def get(self):
        return LUNCHMENU

    def post(self):
        args = parser.parse_args()
        error_if_item_does_exist(args['date'])
        # Store under 'menu' so new entries match the sample data
        LUNCHMENU[args['date']] = {'menu': args['item']}
        return LUNCHMENU[args['date']], 201


# Setup Api resource routing to classes here
api.add_resource(MenuList, '/menuitems')
api.add_resource(MenuItem, '/menuitems/<id>')

# If program is executed itself, then run
if __name__ == '__main__':
    app.run(debug=True)

Next step will be to true up the code to accept\parse the incoming JSON and then send the correct JSON format back to the API.AI agent.

  • Request format – https://docs.api.ai/docs/webhook#section-format-of-request-to-the-service
  • Response format – https://docs.api.ai/docs/webhook#section-format-of-response-from-the-service
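
As a starting point for that step, here is a rough sketch of the request-to-response translation. The field names (result.parameters on the way in, speech, displayText, and source on the way out) are my reading of the two format pages above, so verify them against the docs. The 'date' parameter name and the 'lunch-menu-api' source value are made up for the example.

```python
import json


def build_webhook_response(request_body, menu):
    """Turn an API.AI webhook request body into a webhook response dict."""
    request_json = json.loads(request_body)

    # 'date' is a hypothetical parameter name that the agent's intent would define
    parameters = request_json.get('result', {}).get('parameters', {})
    date_key = parameters.get('date')

    entry = menu.get(date_key)
    if entry is None:
        speech = 'Sorry, I could not find a menu for that day.'
    else:
        speech = entry['menu']

    return {
        'speech': speech,
        'displayText': speech,
        'source': 'lunch-menu-api',  # assumed label for this service
    }
```

The menu argument has the same shape as the LUNCHMENU sample data, a dict keyed by date string with a 'menu' entry.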

Python, Google API, and PDF

As a side project, I was thinking about creating a Google Action that returned the daily lunch menu for my children’s school from Google Home.  This would surely come in handy for my family and others.

To complete this project, I was thinking the following tasks would need to be completed.

  1. Find a solution to getting the daily lunch menu via Python (This blog post)
  2. Store the lunch menu data in some location\format so my Google Action can read it
  3. Create Google Action (Create only in my local test environment…)

This blog post goes over what I did for step #1.  There may be easier ways to do this, but it was a good learning experience.

So, the school lunch menu is stored in a PDF file on Google Drive.  I first need to find a way to automatically download this PDF file, read the contents of the PDF, and then parse out the necessary lunch menu information for use.

Step #1 – Download the Lunch Menu PDF file from Google Drive

  1. We will use the Google Drive API and Python to download the PDF file locally.
    • Note that to store the data locally we will need to use “io.FileIO” and not “io.BytesIO” from the provided example in the above link.
  2. You will need to activate the Google API with your project's correct OAuth credentials.  Follow the steps in this link as needed.  This link will also show you how to install the Google API python package.
  3. Get the Google Drive ID associated with the file you wish to download.
    • You can get your file ID from a shareable link to the document.  To get a file's shareable link, go to Google Drive, select your file, and then click “Get shareable link” which will return a URL that will contain the ID.
    • googleapilink.png
  4. Create your Python script to download the file.
from __future__ import print_function
import argparse
import httplib2
import os
import io

from apiclient import discovery
from oauth2client import client
from oauth2client import tools
from oauth2client.file import Storage
from apiclient.http import MediaIoBaseDownload

CREDENTIAL_DIR = './credentials'
CREDENTIAL_FILENAME = 'drive-python-quickstart.json'
CLIENT_SECRET_FILE = 'client_secret.json'
APPLICATION_NAME = 'Lunch Menu Download'
SCOPES = 'https://www.googleapis.com/auth/drive.readonly'
FILE_ID = 'Your File ID Here'
DOWNLOADED_FILE_NAME = 'lunch.pdf'

def get_credentials():
  # Load cached credentials, or run the OAuth flow to create them
  if not os.path.exists(CREDENTIAL_DIR):
    os.makedirs(CREDENTIAL_DIR)
  credential_path = os.path.join(CREDENTIAL_DIR, CREDENTIAL_FILENAME)
  store = Storage(credential_path)
  credentials = store.get()

  if not credentials or credentials.invalid:
    flow = client.flow_from_clientsecrets(CLIENT_SECRET_FILE, SCOPES)
    flow.user_agent = APPLICATION_NAME
    # run_flow expects the command-line flags that oauth2client defines
    flags = argparse.ArgumentParser(parents=[tools.argparser]).parse_args()
    credentials = tools.run_flow(flow, store, flags)
  return credentials

def main():
  credentials = get_credentials()
  http = credentials.authorize(httplib2.Http())
  service = discovery.build('drive', 'v3', http=http)
  request = service.files().get_media(fileId=FILE_ID)

  # io.FileIO writes the download straight to a local file
  fh = io.FileIO(DOWNLOADED_FILE_NAME, 'wb')
  downloader = MediaIoBaseDownload(fh, request)

  done = False
  while not done:
    status, done = downloader.next_chunk()
    print('Download %d%%.' % int(status.progress() * 100))
  fh.close()

if __name__ == '__main__':
  main()
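
If you would rather not copy the ID out of the shareable link by hand (item 3 above), a small helper can pull it out. This is only a sketch: it assumes the link looks like either .../file/d/<id>/view or ...open?id=<id>, the two shapes I have seen, and the function name is my own.

```python
import re

# Two link shapes seen for Drive shareable links (an assumption, not an official list)
_FILE_ID_PATTERNS = [
    re.compile(r'/file/d/([A-Za-z0-9_-]+)'),
    re.compile(r'[?&]id=([A-Za-z0-9_-]+)'),
]


def extract_file_id(share_url):
    """Return the file ID found in a shareable link, or None if there is none."""
    for pattern in _FILE_ID_PATTERNS:
        match = pattern.search(share_url)
        if match:
            return match.group(1)
    return None
```

The result can then be dropped into FILE_ID in the script above.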

Step #2 – Parse the PDF file that you just downloaded

  1. To parse the PDF file with Python, we will use the package Slate.  The Slate package will basically return a Slate object that contains all the text from the PDF file.
  2. The text returned was “for the most part” in a structured order with newline characters separating the contents.
  3. I decided to split the text returned by “\n\n” and use the order to pull out the necessary information.
    • pdf_text_split = stringpdf_text.split("\\n\\n")
  4. Now I step through each item in the list that was returned and place it into 1 of 4 lists based on some compare logic.
    • Main Entry #1
    • Main Entry #2
    • Main Entry #3
    • Snack
  5. Once I parsed all the data from the split and placed it in the correct list, I combined the 4 individual lists into a single list of lists called lunchmenu
    • lunchmenu=[mainentry1,mainentry2,mainentry3,snack1]
  6. Now I can retrieve the items from the lists with ease.
    • lunchmenu[1][12] would return the 13th “Main Entry #2” item.  The #13 would represent the 13th day of school for a particular month, not the 13th day of the month. (M = 1,  F = 5)
    • In my case, the result is “Beef Taco Salad” for the second main entry on the 13th day.
  7. Below is the Python code that performs the above actions.
# Import slate python package for extracting text from a PDF file
import slate

# Open the PDF file and generate the text from it;
# the with block closes the file handle when it is done
with open('lunchmenu.pdf', 'rb') as f:
  pdf_text = slate.PDF(f)

mainentry1 = []
mainentry2 = []
mainentry3 = []
snack1 = []

# str() on the Slate object gives text in which newlines show up as the
# literal characters \n, hence the escaped "\\n\\n" separator
stringpdf_text = str(pdf_text)
pdf_text_split = stringpdf_text.split("\\n\\n")

# Place each chunk of text into its list based on the marker it contains
for food in pdf_text_split:
  if "1)" in food:
    mainentry1.append(food)
  if "2)" in food:
    mainentry2.append(food)
  if "3)" in food:
    mainentry3.append(food)
  if "Snack" in food:
    snack1.append(food)

lunchmenu = [mainentry1, mainentry2, mainentry3, snack1]

# Persist the lunchmenu list here or do something else...

So the next step might be to persist the data somewhere instead of creating lists and then determine the proper way to serve it up to a new Google Action.
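
Since the lists are indexed by school day and not by calendar day, serving up "today's" menu needs a way to go from a date to that index. Below is a minimal sketch that just counts the weekdays in the month before the given date. It assumes school is in session every Monday through Friday, so holidays and breaks would throw it off, and the function name is my own.

```python
import datetime


def school_day_index(day):
    """Return the zero-based school-day index of `day` within its month.

    Assumes every weekday is a school day. Returns None on weekends.
    """
    if day.weekday() >= 5:
        return None  # Saturday or Sunday

    index = 0
    for d in range(1, day.day):
        # weekday() is 0 for Monday through 4 for Friday
        if datetime.date(day.year, day.month, d).weekday() < 5:
            index += 1
    return index
```

For example, lunchmenu[1][school_day_index(datetime.date.today())] would give the second main entry for today, as long as today is a weekday.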