
Python, Google API, and PDF

As a side project, I was thinking about creating a Google Action that returns the daily lunch menu for my children’s school through Google Home.  This would surely come in handy for my family and others.

To complete this project, I figured the following tasks would be needed.

  1. Find a solution to getting the daily lunch menu via Python (This blog post)
  2. Store the lunch menu data in some location/format so my Google Action can read it
  3. Create Google Action (Create only in my local test environment…)

This blog post goes over what I did for step #1.  There may be easier ways to do this, but it was a good learning experience.

So, the school lunch menu is stored in a PDF file on Google Drive.  I first need to find a way to automatically download this PDF file, read the contents of the PDF, and then parse out the necessary lunch menu information for use.

Step #1 – Download the Lunch Menu PDF file from Google Drive

  1. We will use the Google Drive API and Python to download the PDF file locally.
    • Note that to store the data locally we will need to use “io.FileIO” and not “io.BytesIO” from the provided example in the above link.
  2. You will need to activate the Google API with your project's correct OAuth credentials.  Follow the steps in this link as needed.  This link will also show you how to install the Google API Python package.
  3. Get the Google Drive ID associated with the file you wish to download.
    • You can get your file ID from a shareable link to the document.  To get a file's shareable link, go to Google Drive, select your file, and then click “Get shareable link”, which returns a URL that contains the ID.
  4. Create your Python script to download the file.
from __future__ import print_function
import httplib2
import os
import io

from apiclient import discovery
from oauth2client import client
from oauth2client import tools
from oauth2client.file import Storage
from apiclient.http import MediaIoBaseDownload

CREDENTIAL_DIR = './credentials'
CREDENTIAL_FILENAME = 'drive-python-quickstart.json'
CLIENT_SECRET_FILE = 'client_secret.json'
APPLICATION_NAME = 'Lunch Menu Download'
SCOPES = 'https://www.googleapis.com/auth/drive.readonly'
FILE_ID = 'Your File ID Here'
DOWNLOADED_FILE_NAME = 'lunch.pdf'

def get_credentials():
  credential_dir = CREDENTIAL_DIR
  if not os.path.exists(credential_dir):
    os.makedirs(credential_dir)
  credential_path = os.path.join(credential_dir, CREDENTIAL_FILENAME)
  store = Storage(credential_path)
  credentials = store.get()

  if not credentials or credentials.invalid:
    flow = client.flow_from_clientsecrets(CLIENT_SECRET_FILE, SCOPES)
    flow.user_agent = APPLICATION_NAME
    # run_flow needs parsed command-line flags; oauth2client ships a ready-made parser
    flags = tools.argparser.parse_args()
    credentials = tools.run_flow(flow, store, flags)
  return credentials

def main():
  credentials = get_credentials()
  http = credentials.authorize(httplib2.Http())
  service = discovery.build('drive', 'v3', http=http)
  file_id = FILE_ID
  request = service.files().get_media(fileId=file_id)
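  # FileIO writes the downloaded bytes to disk; the with block closes the file when done
  with io.FileIO(DOWNLOADED_FILE_NAME, 'wb') as fh:
    downloader = MediaIoBaseDownload(fh, request)
    done = False

    # Fetch the file chunk by chunk until the downloader reports completion
    while not done:
      status, done = downloader.next_chunk()
      print('Download %d%%.' % int(status.progress() * 100))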

if __name__ == '__main__':
  main()
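
Item 3 above mentions pulling the file ID out of a shareable link. Here is a minimal sketch of that in Python 3, assuming the link uses one of two common Drive URL shapes (".../file/d/<ID>/..." or "...?id=<ID>"). The helper name extract_file_id and the example ID are made up for illustration; the result would go into FILE_ID in the script above.

```python
import re
from urllib.parse import urlparse, parse_qs


def extract_file_id(share_url):
    """Return the Drive file ID found in a shareable link, or None."""
    parsed = urlparse(share_url)

    # Shape 1 (assumed): https://drive.google.com/file/d/<ID>/view?usp=sharing
    match = re.search(r'/d/([A-Za-z0-9_-]+)', parsed.path)
    if match:
        return match.group(1)

    # Shape 2 (assumed): https://drive.google.com/open?id=<ID>
    ids = parse_qs(parsed.query).get('id')
    return ids[0] if ids else None


# Made-up ID, for illustration only
print(extract_file_id('https://drive.google.com/file/d/abc123_XYZ-9/view?usp=sharing'))
```

If your link does not match either shape, the function returns None and you would need to copy the ID by hand.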

Step #2 – Parse the PDF file that you just downloaded

  1. To parse the PDF file with Python, we will use the package Slate.  The Slate package will basically return a Slate object that contains all the text from the PDF file.
  2. The text returned was “for the most part” in a structured order with newline characters separating the contents.
  3. I decided to split the text returned by “\n\n” and use the order to pull out the necessary information.
    • pdf_text_split = stringpdf_text.split("\\n\\n")
  4. Now I step through each item in the list that was returned and place it into 1 of 4 lists based on some compare logic.
    • Main Entry #1
    • Main Entry #2
    • Main Entry #3
    • Snack
  5. Once I parsed all the data from the split and placed it in the correct list, I combined the individual lists into a nested list (a list of four lists) called lunchmenu
    • lunchmenu=[mainentry1,mainentry2,mainentry3,snack1]
  6. Now I can retrieve the items from the lists with ease.
    • lunchmenu[1][12] would return the 13th “Main Entry #2” item.  Here the 13th means the 13th day of school in a particular month, not the 13th calendar day of the month. (M = 1,  F = 5)
    • In my case, the result is “Beef Taco Salad” for the second main entry on the 13th day.
  7. Below is the Python code that performs the above actions.
# Import slate python package for extracting text from a PDF file
import slate

# Open the PDF file downloaded in Step #1; the with block closes it automatically
with open('lunch.pdf', 'rb') as f:
  # Generate text from PDF file
  pdf_text = slate.PDF(f)

# One list per menu category
mainentry1 = []
mainentry2 = []
mainentry3 = []
snack1 = []

# In my run, str() gave a repr-style string where each newline shows up as the
# two characters backslash + n, which is why the separator is escaped below
stringpdf_text = str(pdf_text)
pdf_text_split = stringpdf_text.split("\\n\\n")

# Sort each chunk of text into its category based on the marker it contains
for food in pdf_text_split:
  if "1)" in food:
    mainentry1.append(food)
  if "2)" in food:
    mainentry2.append(food)
  if "3)" in food:
    mainentry3.append(food)
  if "Snack" in food:
    snack1.append(food)

lunchmenu = [mainentry1, mainentry2, mainentry3, snack1]

# Persist the lunchmenu list here or do something else...
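
To see why the separator is written as "\\n\\n", and how the four lists line up with school days, here is a small sketch that needs no PDF. It assumes a plain Python list of page strings behaves like the Slate result for this purpose. The sample menu text, the extra strip() cleanup, and the helper item_for_school_day are all made up for illustration.

```python
# Stand-in for the Slate result: a list with one string per PDF page (assumption)
pages = ["1) Pizza\n\n2) Beef Taco Salad\n\n3) Chef Salad\n\nSnack: Apple\n\n"
         "1) Hot Dog\n\n2) Grilled Cheese\n\n3) Yogurt Plate\n\nSnack: Pretzels"]

# str() of a list is its repr, so each real newline becomes backslash + n
as_text = str(pages)
chunks = as_text.split("\\n\\n")

# Rows: main entry 1, main entry 2, main entry 3, snack
lunchmenu = [[], [], [], []]
markers = ["1)", "2)", "3)", "Snack"]
for chunk in chunks:
    # Trim the brackets and quotes that the repr adds at both ends
    item = chunk.strip("[]'\" ")
    for row, marker in enumerate(markers):
        if marker in item:
            lunchmenu[row].append(item)


def item_for_school_day(menu, row, school_day):
    """school_day is 1-based: 1 is the first school day of the month."""
    return menu[row][school_day - 1]


print(item_for_school_day(lunchmenu, 1, 2))  # prints: 2) Grilled Cheese
```

A substring check like this would misfile an item containing "12)", so a real menu may need a stricter match.
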
So the next step might be to persist the data somewhere instead of only building lists in memory, and then to work out the proper way to serve it up to a new Google Action.
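
One simple way to persist the data is a local JSON file. This is only a sketch, assuming a JSON file is an acceptable store; the eventual Google Action may call for something else. The file name lunchmenu.json, the key names, the functions save_menu and load_menu, and the sample data are all hypothetical.

```python
import json
import os
import tempfile

# Hypothetical key names, one per row of the lunchmenu list
KEYS = ["mainentry1", "mainentry2", "mainentry3", "snack"]


def save_menu(lunchmenu, path):
    """Write the four lists to a JSON file, keyed by category."""
    with open(path, "w") as f:
        json.dump(dict(zip(KEYS, lunchmenu)), f, indent=2)


def load_menu(path):
    """Read the JSON file back into the same list-of-four-lists shape."""
    with open(path) as f:
        data = json.load(f)
    return [data[key] for key in KEYS]


# Made-up sample data standing in for the parsed menu
sample = [["1) Pizza"], ["2) Beef Taco Salad"], ["3) Chef Salad"], ["Snack: Apple"]]
path = os.path.join(tempfile.gettempdir(), "lunchmenu.json")
save_menu(sample, path)
print(load_menu(path) == sample)
```

Keying the rows by name rather than by position means whatever reads the file does not need to remember which index is the snack.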