Validation Value Addition: Adopting Target's Data Validator

I really enjoy watching Python and data-related presentations and conference talks. The topics are a boon, and my latest inspiration comes from the fascinating work happening at Target with their data validator (Video of Presentation). There were a few things I was looking for when developing an approach to data validation:

A Framework: something that would give structure to the raw, ad-hoc way of doing things
Reproducibility: something that could prove valuable in other spheres of influence
Iteration Ease: something that is low-lift to iterate on

Working on the data validation aspect of the state reporting data submission was not without its challenges. I did not anticipate the hours I would spend looking for workarounds or troubleshooting craziness. COVID-19 seems to have wrestled away some of my creativity, so this will be a different style than my usual prose. Here are a few highlights of my learning moments:

State Management / Data Persistence

Airflow is unlike any other framework I have used. It’s a deeply complex system, and just when I think I understand a part of it, it throws a curveball that jolts my understanding of the world. At first it was about wrapping my head around paths when coupled with a Docker environment. While I was barely grasping some aspects of paths, I also spent a considerable amount of time trying to understand how variables and objects change in Airflow. I wanted to organize the code so that it would be easy to pick up and run year to year with minor changes. Although objects are first-class citizens in Python, it still feels weird trying to get classes and objects to play nice with operators. One area where I kept tripping up: I would read self.attribute, but I couldn’t guarantee that its value, whenever the scheduler picked a task up, was the value I was expecting. Maybe I didn’t have enough sleep and the issue was something else entirely. Relatedly, the fact that Airflow reads and executes the DAG file several times is still tricky to manage, and I’m in search of more elegant solutions. Most of my troubles here really stem from not having a deep understanding of this tool, but I’m slowly getting there.
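One possible explanation for the self.attribute surprise (an assumption, not something confirmed above) is that the DAG file is executed from scratch on each parse, so objects defined in it are rebuilt and lose any value set at runtime. The sketch below imitates that with plain Python and no Airflow import; the Validator class, the school_year attribute, and the parse_dag_file helper are all hypothetical names.

```python
# Stand-in for the contents of a DAG file. The class and attribute names
# are hypothetical; this only imitates what a fresh parse does.
DAG_FILE_SOURCE = """
class Validator:
    def __init__(self):
        self.school_year = None  # meant to be filled in later

validator = Validator()
"""


def parse_dag_file():
    """Execute the file source from scratch and return its namespace."""
    namespace = {}
    exec(DAG_FILE_SOURCE, namespace)
    return namespace


# First parse: some code sets the attribute at runtime.
first_parse = parse_dag_file()
first_parse["validator"].school_year = "2020-21"

# Second parse (for example, when a task is picked up later): the object
# is rebuilt by __init__, so the runtime change is not there.
second_parse = parse_dag_file()

print(first_parse["validator"].school_year)
print(second_parse["validator"].school_year)
```

If this is the cause, values that tasks depend on need to be derived deterministically at parse time or stored somewhere outside the process.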

Additionally, I had to decide how to store the results of the validation checks across multiple tasks (a task being a validation). My heart kept wanting to use XComs, but I was concerned about how I would go about deleting that data after the validations were done. In hindsight, an Airflow Variable might have been a good alternative. In general, though, I was leaning toward a solution that wouldn’t rely on the local DigitalOcean machine storage, and settled on a JSON read/write to S3 operation that complemented our existing infrastructure. When multiple tasks might be reading and writing the same file, though, it’s important to guarantee that the tasks aren’t overwriting each other’s data by reading and writing almost simultaneously.
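One way to sidestep the overwrite risk, sketched here as an alternative to a single shared file and not as the approach actually used, is to have each task write its own JSON object under a per-run prefix and merge them in a final step. The functions only rely on three boto3 S3 client calls (put_object, get_object, list_objects_v2). The bucket name, key layout, task names, and the InMemoryS3 stand-in are all assumptions so the example runs without AWS credentials.

```python
import io
import json


class InMemoryS3:
    """Tiny stand-in exposing the three boto3 S3 client calls used below."""

    def __init__(self):
        self._objects = {}

    def put_object(self, Bucket, Key, Body):
        self._objects[(Bucket, Key)] = Body

    def get_object(self, Bucket, Key):
        return {"Body": io.BytesIO(self._objects[(Bucket, Key)])}

    def list_objects_v2(self, Bucket, Prefix):
        keys = sorted(
            key for (bucket, key) in self._objects
            if bucket == Bucket and key.startswith(Prefix)
        )
        return {"Contents": [{"Key": key} for key in keys]}


def write_result(client, bucket, run_id, task_id, result):
    """Write one task's result to its own key, so tasks never share a file."""
    key = f"validations/{run_id}/{task_id}.json"
    body = json.dumps(result).encode("utf-8")
    client.put_object(Bucket=bucket, Key=key, Body=body)
    return key


def read_results(client, bucket, run_id):
    """Collect every task's result for one run into a single dict."""
    prefix = f"validations/{run_id}/"
    listing = client.list_objects_v2(Bucket=bucket, Prefix=prefix)
    results = {}
    # Real S3 omits "Contents" when nothing matches, hence the default.
    for entry in listing.get("Contents", []):
        raw = client.get_object(Bucket=bucket, Key=entry["Key"])["Body"].read()
        task_id = entry["Key"][len(prefix):-len(".json")]
        results[task_id] = json.loads(raw)
    return results


# Usage with the in-memory stand-in; a real run would pass boto3.client("s3").
s3 = InMemoryS3()
write_result(s3, "example-bucket", "2020-08-01", "check_missing_ids",
             {"passed": True, "errors": 0})
write_result(s3, "example-bucket", "2020-08-01", "check_birthdates",
             {"passed": False, "errors": 3})
summary = read_results(s3, "example-bucket", "2020-08-01")
print(summary)
```

Against real S3, list_objects_v2 returns at most 1,000 keys per call, so a run with more validations than that would need pagination.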

Real Talk: A Thing that Took An Unexpectedly Long Time to Root Out

I have accepted that working in Airflow invariably means occasionally spending far too long on something that should have been a 10-minute fix at best. This time around, I was trying to use the e-mail operator and could not, possibly because my Wi-Fi was blocking access. The only error I saw was that the connection wasn’t working, and I tried a whole bunch of things: kept going back to smtplib.py on GitHub, deleted and re-entered the password, and doubted the setup of the Docker container. All of this happened before I decided to try smtplib in a Jupyter notebook as a quick experimental check and noticed that I was running into the same issue. That’s when I realized I was working on the university visitor Wi-Fi, which has been known to be restrictive. It’s not ideal to spend that much time fixing a problem when the solution is to switch one’s Wi-Fi, but it was a humbling experience nonetheless.
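A quick way to rule the network in or out before touching Airflow or Docker is to test whether the SMTP port is reachable at all. This is a minimal sketch using a plain TCP connection in place of a full smtplib login; the can_reach helper is a hypothetical name, and the host and port would come from your own mail configuration.

```python
import socket


def can_reach(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

Calling, say, can_reach("smtp.example.com", 587) (placeholder values) from the same machine and network as the Airflow container gives a fast yes or no. A False result here points at the network or firewall, not at the operator or the password.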

Magnitude of Impact

Throughout this work, I have had to keep reminding myself to keep my eyes on the prize: the magnitude of impact of the automations. I’ve struggled to measure that impact quantitatively, but I think the narrative impact is powerful in this instance. One of the blockers to starting the year strong with compliance reporting is that students have to be submitted to the state reporting apparatus (CEDARS from OSPI in this case). Whenever we submitted this early piece of data, the files could be riddled with errors. This upcoming year, we will be able to submit as soon as CEDARS opens, and with little to no lift we will have the full set of validations at our disposal. Previously, this process felt fairly ad hoc and not systematic. Now we can rapidly create and iterate on data, and we are able to trust the data because our controls are ready to be monitored from day 1. That’s an exciting opportunity.