Automating File Processing with Python and SQL Server
Garry Bargsley ( b | t ) asks us this month what does “Automate all the things” mean to you. After all, if we have to do something over and over, it only makes sense to automate whatever we can to avoid having to do repeat the process manually. We’ve seen this throughout history where repetitive, time-intensive tasks have been automated or simplified.
One of my favorite moments comes from a job I had many years ago. I was tasked with ensuring that we transferred files from our in-house imaging and scanning system to another facility where people would use those images to do data entry, categorize those images, then send them back to us along with the data to import into our system. We worked in auto-lending so had many documents to process as part of the application. Our in-house system worked pretty well to just scan all pages of all documents in the application and tag them appropriately.
At first, I was pretty limited by my knowledge. I knew DOS Batch files, I had access to WinZip, and I had access to an FTP app that could synchronize across sites. That seems like it would work well to handle the files. Look for a “trigger” file to indicate the process was done, zip all files matching that name, move the zip file to the “outbound” folder with its Trigger file, and move on. Turns out – the FTP sync was not always reliable. Sometimes it wouldn’t pick up a file properly or would pick up the trigger file and not the actual file, which resulted in issues trying to process at the data entry site. In each case, I was asked to troubleshoot the issue, which took hours each day with limited logging.
After far too long with this, I decided we needed something different and nobody else was going to be able to help. In the days before PowerShell, options were a bit more limited for Windows. I knew that VBScript was a non-starter. It was already old by then. Over my Christmas vacation, I taught myself Python with the end goals of:
-
Log everything to SQL Server
-
Error handling throughout the process
-
Create modules as much as possible for code re-use
-
Document the code and process so someone else could help if things went wrong.
-
Successfully transfer all files in both directions
It took my entire break, barring some meal times, but I managed to return with some crude, but workable, Python examples. I pleaded my case to replace the existing Batch files and eventually got permission to go ahead. I set up the servers, installed Python, copied over my code, with the appropriate changes so it would run on the servers, and flipped the switch. (no test servers in that system) It failed horribly because it couldn’t keep up with the volume.
I’d made the mistake of trying to use native Python functions and my very limited knowledge to parse files out of tens of thousands, grab the matching files and compress them. This took far longer in a production system than my limited Proof of Concept so the system lagged behind horribly. I was able to look at my logs and realize that the time was spent trying to compress the files. I replaced the native “compress” function with WinZip’s command-line and immediately saw a dramatic increase in speed. The new Python code was working well, I had my logs, I could easily tweak parts of the code because it was modular, others could help me because it was documented, and I could move on from spending most of my day troubleshooting the system.
Later, I was able to build on those modules to create an SFTP and Encryption set of scripts to transfer data back and forth to our partners. We used GnuPG to encrypt and decrypt sensitive data, similar logging code, and Python modules to contain all of the functions. The process was well documented on how to create keys, give the important parts to customers, install them on the servers, and copy the Template script to set up new customers. I found out that this system stayed in place for 5+ years after I left, which is impressive for a person who is primarily a SQL Server worker and taught himself Python.
I can’t share the scripts or the DTS code used to process the files, but hopefully this will encourage others to find solutions to their problems. Maybe you need to look outside of your comfort zone. Learn some Powershell, PowerBI, SSRS, Python, R, or some other language you’ve never used with the end-goal of automating or simplifying your task. You’ll be glad you did!