Divergence Into Networking and the CALPADS Web API

The grind of life post-COVID-19 had taken its toll on my existence. Thankfully, I did not have any serious issues, but I could tell something about this new era of living was wearing me down. I was less energetic in odd ways, but most notably I just felt less creative. I even started wondering if this is what burnout feels like. In some ways, maybe I did need to step back. Fortunately, it didn’t last long before a magnetic creative energy pulled me in a different direction. This time, it was a winding educational journey to a pretty exciting rest stop.

The Driver

I am still developing improvements on the state reporting Airflow project. However, my attempt at a ducttape-based project for automating CALPADS reporting with Selenium had run into a roadblock. We are running a containerized Airflow environment in the cloud, and one of the downloads was inconsistent, if it ran at all. Through that process, I learned that some work is not a good fit for “scheduling” per se. In its stead, I keep clinging to the image of a workbench. While it’s true that the schedule of work might be “irregular”, the core of the work could still benefit from a solid framework. The primary pitfall, in my opinion, is that the lack of a framework quickly devolves into messy ingestion-layer processes. Because I would have to maintain and organize every file I download by hand, there’s a high potential for mismanagement and certainly no historical paper trail, if you will. But what would a workbench look like?

A Selenium + Jupyter Notebook Workbench

I started to explore whether a workbench built on Selenium and Jupyter Notebook would fit the workflow. My idea was to spin up a Selenium chromedriver (not headless this time) and set a download location that would be easily accessible from one or more Jupyter Notebooks. From there, I would manually operate the UI and download a file from either our Student Information System or from CALPADS as needed. Once the file was downloaded, I could process it in the notebook. Even as I write it, I still think it’s a pretty solid idea. I loved that I could control where the download folder would be on a per-working-session basis. It would reduce the need to keep files in a Downloads folder and move them somewhere else, which would immediately address some of the paper trail mismanagement risks.
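A minimal sketch of the per-session download folder idea, assuming Selenium 4 and Chrome. The helper names (session_download_dir, chrome_download_prefs, launch_workbench_browser) and the folder naming scheme are made up for illustration; the two preference keys are standard Chrome download preferences.

```python
from datetime import datetime
from pathlib import Path


def session_download_dir(base_dir, label):
    """Create and return a fresh, timestamped download folder for one working session."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    folder = Path(base_dir).expanduser().resolve() / f"{stamp}-{label}"
    folder.mkdir(parents=True, exist_ok=True)
    return folder


def chrome_download_prefs(folder):
    """Chrome preferences that route downloads to `folder` without a save dialog."""
    return {
        "download.default_directory": str(folder),
        "download.prompt_for_download": False,
    }


def launch_workbench_browser(folder):
    """Start a visible (non-headless) Chrome session that downloads into `folder`."""
    # Imported lazily so the helpers above work without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_experimental_option("prefs", chrome_download_prefs(folder))
    return webdriver.Chrome(options=options)
```

From a notebook cell, something like `folder = session_download_dir("~/workbench", "calpads")` followed by `driver = launch_workbench_browser(folder)` would leave every file downloaded in that session under one dated folder, which is the paper trail the workbench is after.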

In my excitement, I jumped in and started asking questions. One thing on my mind was that while I was working on transforming the downloaded data in Jupyter, my session could time out in the browser. This piqued my curiosity about how the communication worked across ports. I knew Jupyter was served from a localhost server, and I had seen that Selenium was also launching something, DevTools or other, and listening on some port. If I could set up a background task that occasionally “refreshed” my session in some way (probably by sending a call to execute Refresh), it would be golden. This led me down two paths, and I’m only sorry I couldn’t take both, however far they went, at the same time.
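The background "refresh" task could be sketched with nothing more than a standard-library thread. The KeepAlive class below is hypothetical, not something the project ships; it simply calls whatever function it is given on a fixed interval until it is told to stop.

```python
import threading


class KeepAlive:
    """Call `action` every `interval` seconds on a background thread until stopped."""

    def __init__(self, action, interval):
        self._action = action
        self._interval = interval
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self.errors = []

    def _run(self):
        # Event.wait returns False on timeout (time to act) and True once stop() is called.
        while not self._stop.wait(self._interval):
            try:
                self._action()
            except Exception as exc:  # keep the loop alive; record the failure
                self.errors.append(exc)

    def start(self):
        self._thread.start()
        return self

    def stop(self):
        self._stop.set()
        self._thread.join()
```

With a live driver, `KeepAlive(driver.refresh, interval=600).start()` would reload the page every ten minutes. Two caveats apply: a refresh can wipe out a half-filled form, and a Selenium driver is not designed to be shared between threads, so the refresh should not fire while the notebook is issuing other driver commands.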

The Grassy Wear and The Road Not Taken

One path that was exciting was to try to understand how Selenium and Jupyter used computer networking under the hood, and that’s where I started. I watched a few promising videos from past JupyterCons and was beginning to wrap my head around the high-level concepts. At the same time, I was bouncing between the Selenium and chromedriver documentation. It was during this process that I started discovering more about the power of browser Dev Tools. In particular, I was struck by the existence of chrome://tracing (about:tracing). It’s an incredible tool that lets you see under the hood of the browser itself. And as if that wasn’t enough, you can even connect it to remote browsers like the chromedriver sessions, even when they’re headless! Mind blown. Meanwhile, Jupyter was also incredibly enticing. Understanding the kernel and ZeroMQ might have taken the whole year, if not longer, but it would have been absolutely worth it. Some other very honorable mentions include the WebDriver protocol and the Chrome DevTools Protocol. Knowing that both of these protocols exist, and how they help me understand my work right now, is setting me up for success in keeping up with their development in the near future. Between Docker, Jupyter, and chromedriver, I’m left in awe of just how much there is to learn about computer networking. For now, it really is all about the climb, no matter how tall the mountain might be.

The Other, Perhaps The Better Claim

The other path also incorporated some of the learnings from my exploration of Selenium and Jupyter, particularly around better utilizing browser Dev Tools. After closely reinspecting the download that kept failing, I noticed that, at its heart, it was a JavaScript call. I tried calling the method directly, and it still ran into the same issue. Nothing appeared broken, but the download stopped a step shy of actually downloading. (The solution for this might honestly be on the tip of my tongue. It seems something fails to execute in its entirety in headless mode when JavaScript opens a new window/tab for downloading. Switching to the new window appears to crash the driver and has been, so far, unrecoverable.)

I started looking into whether I could mimic this download by using the webdriver to fetch the file directly via a URL. It took some finesse, but I successfully logged in through a Selenium driver and used Requests to fetch the resource directly. One of the final frontiers for the ducttape project was whether automations could take advantage of that authorization and use HTTP sessions alone to download data. Emboldened by my recent success, I asked whether we could drop Selenium altogether in favor of a lightweight, faster, and more robust Web API approach. The added bonus of this approach is that it would likely remain functional even while headless.
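One common way to hand a logged-in browser session over to Requests is to copy the driver's cookies into a requests.Session. The sketch below assumes the site's authorization lives in cookies, which may not be the whole story for CALPADS; the function names are illustrative and not taken from ducttape or calpads.

```python
def copy_cookies(selenium_cookies, session):
    """Copy cookies from Selenium's `driver.get_cookies()` format into a Requests-style session."""
    for cookie in selenium_cookies:
        session.cookies.set(
            cookie["name"],
            cookie["value"],
            domain=cookie.get("domain", ""),
            path=cookie.get("path", "/"),
        )
    return session


def session_from_driver(driver):
    """Build a requests.Session that reuses the logged-in browser's cookies and User-Agent."""
    import requests  # lazy import: only needed when a real session is built

    session = requests.Session()
    session.headers["User-Agent"] = driver.execute_script("return navigator.userAgent")
    return copy_cookies(driver.get_cookies(), session)
```

After logging in through the browser, `session = session_from_driver(driver)` gives an object whose `session.get(url)` calls carry the same cookies the browser would send, so the file can be fetched without triggering the JavaScript download at all.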

CALPADS doesn’t officially support an API, let alone a Python wrapper for one. The first barrier was figuring out how in the world I was going to log into the system. With Selenium, albeit complex on its own, logging in amounted to instructing the driver to click the submit button. Now we were venturing into the unknown and trying to reverse engineer some processes. Using the browser Dev Tools, I reviewed the network traffic and page source to build a mental model of the requests the CALPADS website was making with each link or button click. Using Requests and understanding OAuth/OpenID was a woozy journey. The problem space was unfamiliar, but I felt determined and thankful that the jolt of creativity returned at just the right time. The documentation for how to log in using OAuth is actually mostly great. In order to get CALPADS to work, though, I did have to march to the beat of my own drum for the infamous “OAuth Dance”.
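One generic piece of this kind of reverse engineering can be sketched without knowing anything specific about CALPADS. Browser-based login flows often serve an HTML form whose hidden inputs (anti-forgery tokens, return URLs and the like) must be echoed back in the POST. Whether CALPADS works this way is an assumption here, and the field names in the usage note are invented.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            self.fields[attributes["name"]] = attributes.get("value") or ""


def hidden_fields(html):
    """Return a dict of the hidden form fields found in an HTML page."""
    parser = HiddenInputCollector()
    parser.feed(html)
    parser.close()
    return parser.fields
```

In a login attempt, the dict returned by `hidden_fields(login_page_html)` would be merged with the username and password (for example `{**hidden_fields(page), "Username": user, "Password": password}`, with hypothetical field names) and posted from the same session that fetched the page, so the cookies and tokens line up.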

Telling This With a Smile, Somewhere Ages and Ages Hence

After all of this, I landed on calpads, an experimental Python wrapper for a CALPADS Web API. Because we don’t have an official Web API, I have had to make educated guesses for a few things. As I grow in my knowledge of how Web API design works, I’m hoping to crack more of the mysterious code. I think it’s fair to say this will remain in “experimentation” mode for the foreseeable future, since access to any of these endpoints/resources isn’t guaranteed. For example, one of the things we can do now is actually return resources in JSON format for quite a number of very useful endpoints. But there’s no guarantee that access to the endpoint OR to the formatting will still be here tomorrow. Nonetheless, I am very happy with where I landed on this journey. I learned so much and I am still yearning for more.
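Because neither the endpoints nor their JSON formatting are guaranteed, a wrapper like this probably wants to treat every response defensively. The helper below is a hypothetical sketch, not part of calpads: it only reports JSON when the body actually parses, and otherwise hands back the raw text so the caller can decide what to do.

```python
import json


def parse_resource(content_type, body):
    """Return ("json", data) when the body really is JSON, else ("raw", body)."""
    if "json" in (content_type or "").lower():
        try:
            return "json", json.loads(body)
        except json.JSONDecodeError:
            pass  # the header promised JSON but the body is something else
    return "raw", body
```

With a Requests response, that would look like `kind, data = parse_resource(response.headers.get("Content-Type"), response.text)`. A "raw" result on an endpoint that used to return JSON is a cheap early signal that something upstream has changed, or that the session has been bounced back to a login page.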

I want to close out this lengthy blog post (against all odds, I miss the long blog posts) on a promising note. With just a few more features, if calpads can prove to be stable, here’s what I might be able to achieve next:

- The unit tests for ducttape-calpads took at least an hour; with direct API requests, that time could be cut considerably (somewhere in the range of 50-70%).
- All data downloads will be much faster (~50-70%).
- All data downloads will be more robust (not subject to the whims of UI elements loading as expected).
- Headless. Downloads.
- A workbench:
  - Web API to download the CALPADS files from SIS
  - package/library for common transformations of the data
  - Web API for CALPADS import
  - E.T.L.!!!