Bringing modern dev-ops practices to survey data analysis

How the next Docker revolution can benefit survey data analysis

Geir Freysson
Datasmoothie

--

Over the past decade the container revolution, with technologies like Docker and Kubernetes, has super-charged the scalability of server-side computing. This same revolution is now coming to the developer environment, with technologies like Gitpod and GitHub’s Codespaces. In this article I will highlight how this shift in the technology landscape can bring enormous benefits to industries that rely on my specialty: survey data.

Imagine that anyone within an organisation, whether or not they have ever touched the project before, whether they are on Windows, a Mac or even an iPad, and whether they are working from home or from a beach in Bali, could log in, run a script and regenerate every Excel table, PowerPoint document or dashboard ever delivered to a client, in a matter of minutes. That is the promise of a modern dev-ops environment for survey data.

Step 1: Every project is a git repository

The first step is to move all data processing and report generation code into a git repository. This not only brings version control to projects, but also an audit trail (trust me: it’s sexy).

Every new client project is created as a repository based on a pre-defined template, so all the standard code is ready to customise and run, even while the survey is still in field and all we have is preliminary data. And instead of your data being downloaded and processed on individual (costly) devices, with the security implications that involves, it is stored centrally and securely, and you know exactly who did what to it and when.

Thanks to the template, once the repository is created the README file will have a button with a magic phrase, “ready-to-code”. Click on the button and Gitpod will launch your environment.

Step 2: Ready to code (in the cloud)

Docker-based development environments (like our favourite: Gitpod) allow users to describe their development environment as code. This means that if you specify things the right way, you and your team can have everything you need for a project at your fingertips at the click of a button. No setting up or installing software.

Do you need to tweak your weighting schema? Or band your respondents’ age into Millennials, Gen Xers and Boomers? Click on ready-to-code and you only have one more step before your client gets their results.
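To make the banding example concrete, here is a minimal sketch in plain Python. It is not the Quantipy way of doing it, just an illustration of the idea. The function names and the birth-year cut-offs are my assumptions for this example, so adjust them to whatever definition your project uses.

```python
from bisect import bisect_right
from collections import Counter

# Assumed cut-offs: the first birth year of each band (adjust per project).
BAND_STARTS = [1946, 1965, 1981, 1997]
BAND_LABELS = ["Pre-Boomers", "Boomers", "Gen Xers", "Millennials", "Post-Millennials"]


def band_generation(age, survey_year):
    """Map a respondent's age in the survey year to a generation label."""
    birth_year = survey_year - age
    # bisect_right counts how many band starts are <= birth_year,
    # which is exactly the index of the matching label.
    return BAND_LABELS[bisect_right(BAND_STARTS, birth_year)]


def band_counts(ages, survey_year):
    """Count respondents per generation (unweighted)."""
    return Counter(band_generation(age, survey_year) for age in ages)
```

Calling band_counts([25, 30, 45, 60, 70], 2020) groups the five respondents into two Millennials, one Gen Xer and two Boomers. Keeping the cut-offs in one list at the top means a change request from the client is a one-line edit.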

Step 3: A pipeline of Jupyter notebooks

The template repository is a collection of Jupyter Notebooks that leverage the open source Python library Quantipy (which specialises in survey data), our own Datasmoothie API (for tables, PowerPoint decks and dashboards) and a library called Papermill, which allows coders to re-use Jupyter Notebooks and chain them into a pipeline.
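As a rough sketch of what chaining notebooks with Papermill can look like: the function below runs a list of notebooks in order and saves an executed copy of each. The notebook file names, the run_pipeline helper and the "wave" parameter are hypothetical; the only Papermill call used is papermill.execute_notebook, which takes an input path, an output path and a dictionary of parameters.

```python
from pathlib import Path


def run_pipeline(notebooks, output_dir, parameters=None, execute=None):
    """Run notebooks one after another and return the paths of the executed copies."""
    if execute is None:
        # Imported lazily so this module loads even where Papermill is not installed.
        import papermill
        execute = papermill.execute_notebook

    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    executed = []
    for notebook in notebooks:
        target = out / Path(notebook).name
        # Each notebook receives the same parameters, e.g. the survey wave.
        execute(str(notebook), str(target), parameters=parameters or {})
        executed.append(str(target))
    return executed


# Hypothetical pipeline for a tracker study:
# run_pipeline(
#     ["01_clean.ipynb", "02_weight.ipynb", "03_tables.ipynb"],
#     output_dir="executed",
#     parameters={"wave": 3},
# )
```

The executed copies double as a log: if a step fails, you open the saved notebook and see exactly which cell broke and with what data.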

Your pipeline and processes can also be automated, again by using git:

git push remote weekly
Push your pipeline to the “weekly” repository and use Airflow to run it automatically every week and alert your team on Slack.

Set up a separate repository for all pipelines that need to run on a regular schedule, and use a scheduler like Airflow to both pull new pipelines from it and run them. Your Jupyter Notebook pipeline then runs automatically, and your team can be alerted on Slack whenever a pipeline has run successfully.
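The Slack alert itself needs very little code. Below is a minimal sketch using only the Python standard library and a Slack incoming webhook, which accepts a JSON body with a "text" field. The function names and the message wording are my own assumptions, and the webhook URL is something you would create in your own Slack workspace. In an Airflow setup, a function like this could be called from the last task of the pipeline or from a success callback.

```python
import json
import urllib.request


def build_slack_payload(pipeline_name, commit_id, succeeded=True):
    """Build the JSON body for a Slack incoming webhook."""
    status = "finished successfully" if succeeded else "FAILED"
    text = f"Pipeline '{pipeline_name}' (commit {commit_id}) {status}."
    return json.dumps({"text": text}).encode("utf-8")


def notify_slack(webhook_url, payload, timeout=10):
    """POST the payload to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Including the commit ID in the message means the alert already tells your team which version of the code produced this week's output.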

Finally: The audit trail

An audit trail: that sounds boring. But it’s not. It’s magical. Because all your work is in a git repository, every change to the code that generates output for a project is stamped with a unique ID (in this example, it’s 382964a). If someone needs to reproduce an old version of PowerPoint slides or Excel tables, they simply check out that commit and can reproduce everything that a colleague did before them.

What is more, when outputs are generated, they can be stamped with the unique ID of the commit that generated them. A PowerPoint slideshow can have small print at the bottom of the front page that says “Generated by commit 382964a”. Everything is verifiable and reproducible.

I told you it was sexy.

Your PowerPoint deck, Excel file or any other output can have an ID number that makes it possible for anyone to recreate it from scratch, on any computer or device.
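Getting hold of that ID from inside a script is straightforward. Here is a minimal sketch, assuming git is installed and the script runs inside the project repository. The two function names are hypothetical; the git command used is git rev-parse --short HEAD, which prints the abbreviated ID of the commit that is currently checked out.

```python
import subprocess


def current_commit_id(repo_dir="."):
    """Return the abbreviated ID of the currently checked-out commit."""
    result = subprocess.run(
        ["git", "rev-parse", "--short", "HEAD"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,  # raise if this is not a git repository
    )
    return result.stdout.strip()


def stamp_text(commit_id):
    """Build the small print to place on a generated output."""
    return f"Generated by commit {commit_id}"


# In a report script, the stamp would be added to the front page or a footer:
# footer = stamp_text(current_commit_id())
```

To recreate an old deliverable, a colleague reads the ID off the front page, runs git checkout with that ID and re-runs the pipeline.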

Geir Freysson is co-founder of Datasmoothie, a platform that specialises in survey data analysis and visualisation. If you’re interested in using open source software for survey data analysis, sign up to our newsletter, called Unprompted Awareness.

--

Co-founder of Datasmoothie. I also maintain the open-source survey data library Quantipy and its enterprise equivalent Tally.