Early on in my new job at a truck sales company, I was given a unique problem to solve. When we ordered a new truck, we would get a PDF that contained all of that truck's specifications. No CSV, Excel file, or API we could access -- just a PDF for each truck that had been ordered from the manufacturer. And we needed to automate the process of getting those specs added to the item in inventory.
Continuing as-is would have meant spending an unreasonable amount of time manually adding those specs wherever they were needed. The PDF is available for internal use, but that still leaves the problem of making the specs available to people outside the organization through the website.
Enter the Inventory Assistant, a service written in Python and designed to:
It currently contains a module for obtaining data from PDFs and functionality for interfacing with Airtable.
This was my first Python program with a scope large enough to result in multiple iterations/rewrites and 700+ lines of code. Everything I had done previously consisted of small scripts for basic processing. This program has come a long way and helped me learn quite a bit about OOP, threading/multiprocessing, queues, events, RegEx, JSON, APIs, and program design. It currently accomplishes what it needs to, but there's plenty of room for expansion and improvement.
The information below is an overview of the program design. For details on program usage, view the README on GitHub!
The program is currently designed to be modular with a 3 part dataflow:
Input modules:
Output modules:
Everything is event-based where possible. A web interface is in the works for displaying objects currently in inventory and the error log. A webpage for adjusting PDF processing settings is planned but not in progress.
The project started small. I had only been presented with one type of document that needed to be parsed; as far as I knew, that was the only one we dealt with. But the number of document types grew. And then we had other sources of data that needed to be added.
As the codebase grew, it quickly became hard to read and even harder to wrap your head around -- especially when coming back to the project after a few days away. To resolve this, the program was restructured with the idea that the main program would only handle taking in information, reconciling it with existing information, and sending the result out to update external sources.
Standardizing the input and output allowed me to separate the functions that deal with the various data sources out into their own files, making them far easier to manage and update. New sources could be easily added without concern for breaking other parts of the program.
The idea is to make it easy to create new modules as input/output sources that are imported into the main program. The main program will provide those sources with two queues -- one for Inventory Objects, one for errors.
The inventory queue accepts inventory objects, which are instances of an inventoryObject class. Along with each object, the queue is passed the name of the source the information came from, so that source can be skipped as an output. You do not want new input data from a source to go back out to the same source, as that may update an attribute there, which could then register as new input and cause a loop.
The error queue ensures that all functions output errors to the same location and in the same format.
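As a rough illustration of the two queues described above, here is a minimal sketch. The `InventoryObject` dataclass, the `submit` and `report_error` helpers, and the error dictionary layout are all assumptions made for this example, not the program's actual code.

```python
import queue
from dataclasses import dataclass, field


@dataclass
class InventoryObject:
    """Hypothetical stand-in for the program's inventoryObject class."""
    name: str
    attributes: dict = field(default_factory=dict)


# The main program owns both queues and hands them to every module.
inventory_queue = queue.Queue()
error_queue = queue.Queue()


def submit(obj, source):
    """Queue an inventory object together with the name of its source."""
    inventory_queue.put((source, obj))


def report_error(source, message):
    """Queue an error in one shared format (assumed layout)."""
    error_queue.put({"source": source, "message": message})


submit(InventoryObject("TRUCK-001", {"engine": "example"}), source="pdf")
report_error("pdf", "could not read file")
```

Pairing the source name with the object at the moment it is queued means nothing downstream has to guess where the data came from.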
The datastore currently exists as an object that stores inventory objects in a dictionary and contains a method to compare inventory objects and update that dictionary. Once that method has finished, it adds all relevant information to the Output queue.
Once a new inventory object makes it to the datastore, it will be checked against the contents. If another object exists with the same name, the contents of the new object will update the contents of the old. If no other object in the datastore has the same name, the new object will be added to the datastore.
Once all changes have been made, the object and its associated input source are passed to the output queue.
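The merge-or-add behavior described above could look something like the sketch below. It assumes inventory objects can be reduced to a name plus a dictionary of attributes; the `Datastore` class, its `update` method, and the tuple placed on the output queue are hypothetical.

```python
import queue


class Datastore:
    """Keeps one record per inventory object name (illustrative only)."""

    def __init__(self, output_queue):
        self.objects = {}  # name -> dict of attributes
        self.output_queue = output_queue

    def update(self, name, attributes, source):
        if name in self.objects:
            # Same name already stored: new values update the old record.
            self.objects[name].update(attributes)
        else:
            # No match: store the new object as-is.
            self.objects[name] = dict(attributes)
        # Pass the merged record and its input source on to the outputs.
        self.output_queue.put((source, name, dict(self.objects[name])))
        return self.objects[name]


output_queue = queue.Queue()
store = Datastore(output_queue)
store.update("TRUCK-001", {"engine": "A"}, source="pdf")
store.update("TRUCK-001", {"color": "red"}, source="airtable")
```

After both calls the datastore holds a single `TRUCK-001` record containing both attributes, and the output queue holds one entry per update.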
Any inventory object received by the Outputs class is passed on to all output modules that do not share the same source name as the inventory object. Processing of the object is handled by each individual module. The Error queue from the Inputs class is passed to each module on module instantiation.
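A minimal sketch of that dispatch rule follows. The `RecordingModule` class, its `handle` method, and the `dispatch` method name are assumptions for illustration; only the "skip the module whose source name matches" rule and the error queue handed over at instantiation come from the description above.

```python
import queue


class RecordingModule:
    """Hypothetical output module that just remembers what it was given."""

    def __init__(self, source_name, error_queue):
        self.source_name = source_name
        self.error_queue = error_queue  # handed over at instantiation
        self.received = []

    def handle(self, obj):
        self.received.append(obj)


class Outputs:
    def __init__(self, modules):
        self.modules = modules

    def dispatch(self, obj, source):
        """Send obj to every module except the one it came from."""
        delivered = []
        for module in self.modules:
            if module.source_name == source:
                continue  # never echo data back to its origin
            module.handle(obj)
            delivered.append(module.source_name)
        return delivered


errors = queue.Queue()
pdf_out = RecordingModule("pdf", errors)
airtable_out = RecordingModule("airtable", errors)
outputs = Outputs([pdf_out, airtable_out])
delivered = outputs.dispatch({"name": "TRUCK-001"}, source="pdf")
```

Here an object that arrived from the `pdf` source only reaches the `airtable` module.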
The program currently contains one module -- a module for processing PDF files.
There are some functions for Airtable within the main script that need to be separated out into modules, but they are currently too tightly bound to other parts of the code (mainly each other).
When this module is instantiated, it will monitor a given folder for new PDF files. When a PDF is found, it will pull the text out by using pdftotext.exe (a command-line utility from xpdfreader.com). The document is split as needed; sometimes a PDF will contain multiple separate invoices.
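To make the steps above concrete, here is a rough sketch of polling a folder and calling pdftotext. The function names and the polling approach are assumptions; the real module may watch the folder differently. It relies on two documented pdftotext behaviors: passing `-` as the output file writes the text to stdout, and each page of output ends with a form feed character. Deciding where one invoice ends and the next begins is specific to the documents and is not shown.

```python
import subprocess
from pathlib import Path


def find_new_pdfs(folder, seen):
    """Return PDFs in folder that have not been seen yet (simple polling)."""
    new_files = []
    for path in sorted(Path(folder).glob("*.pdf")):
        if path not in seen:
            seen.add(path)
            new_files.append(path)
    return new_files


def extract_text(pdf_path, exe="pdftotext.exe"):
    """Run pdftotext and return the extracted text as a string."""
    result = subprocess.run(
        [exe, "-layout", "-enc", "UTF-8", str(pdf_path), "-"],
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")


def split_pages(text):
    """Split pdftotext output into pages on the form feed character."""
    return [page for page in text.split("\f") if page.strip()]
```

In a long-running service, `find_new_pdfs` would be called on a timer with the same `seen` set, and each new file passed through `extract_text` and `split_pages`.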
Once the document has been split and all the parts have been written to disk as individual PDF files, an inventory object is created for each new document. The individual inventory objects contain the text from each of the document's pages, as well as the information pulled from that text with RegEx. After this has been completed, the inventory object is sent to the Input queue to update the datastore.
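The RegEx step might look roughly like this. The patterns, field names, and sample text below are invented for the example; real spec sheets need patterns written against their actual layout.

```python
import re

# Hypothetical patterns; each captures one value from the page text.
PATTERNS = {
    "vin": re.compile(r"VIN[:\s]+([A-HJ-NPR-Z0-9]{17})"),
    "engine": re.compile(r"Engine[:\s]+(.+)"),
}


def extract_fields(pages):
    """Search the combined page text and return the first match per field."""
    text = "\n".join(pages)
    fields = {}
    for key, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            fields[key] = match.group(1).strip()
    return fields


sample_pages = [
    "VIN: 1ABCD23EFGH456789\nEngine: Example 6.7L Diesel\n",
    "Page two has no matching fields",
]
fields = extract_fields(sample_pages)
```

Keeping the patterns in one dictionary makes it straightforward to add a new document type by swapping in a different set of patterns.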
There are 3 separate functions that currently exist.
These three functions are tightly integrated with other functions and modules, which makes it a challenge to separate them out into their own modules.
#1 is tied to the RegEx processing settings found in the PDF processing module. Uploading data to an Airtable column that cannot be changed will return an error and no updates will be made, so the program needs to ensure that it only pulls data from columns in Airtable whose content can be changed.
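One way to guard against that error is to filter each record before it is uploaded. The sketch below assumes the program already knows which column names are editable; the `EDITABLE_COLUMNS` set and the column names in it are made up for the example, and no Airtable API call is shown.

```python
# Hypothetical set of Airtable columns whose content can be changed.
EDITABLE_COLUMNS = {"Engine", "Transmission", "Color"}


def writable_fields(record, editable=EDITABLE_COLUMNS):
    """Drop any field that targets a column that cannot be changed."""
    return {key: value for key, value in record.items() if key in editable}


record = {"Engine": "Example 6.7L", "Color": "Red", "Record ID": "abc123"}
payload = writable_fields(record)
```

Only `Engine` and `Color` survive in `payload`, so a read-only column never makes it into an update request.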
#2 & #3 are tied together. The filesystem layout for modules did not take this possibility into account, so some minor changes will need to be made.