5/19/2019
How I Do an Analysis
I’ve been working as a full-time data scientist for the better part of a year now and wanted to take a moment to step back and document my process.
The Booz Allen data science community is largely an R + Python shop, with my team falling into the latter camp. I confess to having reservations about Python for data science, but for small-scale analyses, I’ve seen that it can work.
When I decide to tackle an analysis using Python, here’s what I do.
Requisites
To get the most out of this post, I recommend having at least topical familiarity with the following:
- Command line (knowing how to navigate folders on the command line, knowing what it means to be “in” a directory, etc.)
- Python language and project basics (have used setuptools and friends)
- The make build system
- Git
WARNING: Opinions ahead.
00 - Initialization
Every project I do — including analyses — gets its own folder, so
my first step is always to create a folder. On my work machine, a
Mac, I have a folder ~/Documents/proj where I put all my projects.
If the analysis is stand-alone (i.e. not part of a larger project) it
gets its own folder here:
mkdir ~/Documents/proj/my-analysis
cd $_
Otherwise, if it is a component of a larger project, it becomes a subdirectory of that project.
Document EVERYTHING
An analysis is ultimately a scientific experiment to test one or more hypotheses about some data, whose findings will form the basis for further research, business decisions, or some other communication. However, the recipients of our findings are often not data scientists themselves, and can’t tell just by inspecting our work whether our conclusions are valid. We need another way to build trust with our audiences.
To this end, one of the most important attributes of any scientific experiment is reproducibility: if someone else can emulate our experimental environment and get the same results as us, then that is evidence supporting the analysis’ validity. On the other hand, if someone using an analogous environment fails to reproduce our findings, then this is evidence that something has gone wrong.
This is as true for data science as it is for any other science. Therefore, I strive to make my analyses as easy to reproduce as possible. One of the most important things I do to achieve this is to document EVERYTHING.
As we’ll soon see, this includes (but is not limited to) explicitly writing down:
- Steps to recreate the environment
- Python package dependencies
- Explanations for all changes made to the code base
- Experimental procedure and task interdependencies
The documentation of an analysis should form a cohesive,
well-defined, easy-to-follow narrative with a consistent beginning,
middle, and end. My “beginning” is always README.md, a GitHub
Flavored Markdown document which
contains my initial ideas and plans for the analysis. Here’s an
outline:
# my-analysis
I had this really cool idea about doing this one thing on this one
[dataset][dataset online]. I'm going to compare this model and that
one to see which one is better for my use case.
[dataset online]: https://cool-data.com/info
## Requirements
WIP
## Setup
WIP
I include blank sections for Requirements and Setup because I
use them to document the analysis’ environment. They will be present
in virtually every single README I write, though what exactly they
say will of course depend on the project.
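To give a sense of where those sections end up, here’s a sketch of what they might eventually say for this hypothetical analysis (the exact contents, including the tools listed, are placeholders):
## Requirements
- Python 3.6+ and virtualenv
- make
## Setup
virtualenv -p `which python3` .venv
. .venv/bin/activate
pip install -r requirements.txt
make data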
At this point I also download GitHub’s Python gitignore as
this will be important for the next step:
curl -LsSo .gitignore 'https://raw.githubusercontent.com/github/gitignore/master/Python.gitignore'
Version Control EVERYTHING
Version control is itself a form of documentation: it documents the change history of your files. Making sure I’m still in the new project directory (often referred to as the “project root”), I initialize it as a Git repository:
git init
I track projects with Git whenever I can, even if I’m the only one who will ever see or touch the code. Having a log of every revision you make to a project, and being able to revert those changes, can be indispensable (and this usefulness extends beyond the code files of your project to data and documentation too). For example, if I come back to some funky-looking code a month after I write it, I don’t have to wonder what was going on in my head; I can check the log:
$ git blame src/funstuff.py
a6ea915b (Will Badart 2019-03-26 14:19:12 -0700 1) FOO = 'bar'
a6ea915b (Will Badart 2019-03-26 14:19:12 -0700 2)
a6ea915b (Will Badart 2019-03-26 14:19:12 -0700 3) def funtimes():
c3e471dd (Will Badart 2019-03-27 03:43:05 -0700 4) global FOO
c3e471dd (Will Badart 2019-03-27 03:43:05 -0700 5) FOO = 'teehee'
$ git show c3e471dd
commit c3e471dd812fa8e0b35210478207597daab1c72d
Author: Will Badart <will@willbadart.com>
Date: Wed Mar 27 03:43:05 2019 -0700
Fixed that bug I was having with a really clever and sustainable
solution.
The benefits multiply when it comes time to collaborate on project code. Best of all, Git is completely free money-wise, and insanely cheap in terms of effort and overhead. All you have to do is write a little message when you make a set of changes.
NOTE: You will enjoy your life more if you keep your commits atomic.
NOTE: Check out nbdime to help version control Jupyter notebooks.
At this point, I write my initial commit explaining what steps I took to initialize the project:
git add README.md .gitignore
git commit -v
If you’re planning on using GitHub or some other remote, now would be a good time to create the remote repository in the web interface and configure your local repo to talk to it. Using the URL from the Clone or download button:
git remote add origin git@github.com:wbadart/my-analysis.git
origin is just the conventional name for the main remote. The URL
should reflect your preference for SSH vs. HTTPS (mine is SSH because
it lets me push without entering a password).
01 - Sandbox Configuration
If you try to use your global system Python binary and/or packages for every project (or any project, for that matter), you will find yourself in a world of pain.
Instead, use one of the many virtual environment management tools put
forth by the Python community over the years to isolate your project
environment, and keep projects from overwriting each others’
dependencies. If I know up front that my project will depend on any
Anaconda-only packages, I use conda’s environment management.
Otherwise, I stick with the lightweight, tried and true
virtualenv:
virtualenv -p `which python3` .venv
. .venv/bin/activate
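If I’m going the conda route instead, the equivalent setup is roughly as follows (the environment name and Python version are just examples):
conda create -n my-analysis python=3.6
conda activate my-analysis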
Isolating your environment gives you bonus points for reproducibility; the stronger the isolation, the easier it is to consistently recreate the environment. At the expense of a bit of overhead and boilerplate, you can run your analysis in a Docker container for near-perfect isolation, but for most small to medium analyses, a virtual environment is enough.
Dependency Management
When writing Python software, I prefer to track project dependencies
in the install_requires section of a setup.py file. For an
analysis, however, which will not be “installed” to a new system with
setuptools, I track them in requirements.txt. Some people will
advise you to periodically update your requirements.txt file like
so:
pip freeze > requirements.txt
You would then commit the changes. This is a valid approach that strongly guarantees that the versions of any packages used by the project will align across systems. However, I prefer a different method.
You see, pip freeze outputs the complete contents of your virtual
environment’s site-packages. It’s a lot of information, but it buries
the information I actually try to communicate to others with a
requirements.txt file, namely, the direct dependencies of my
project: the packages my code imports and uses. My dependencies’
dependencies are just noise.
For this reason, I maintain requirements.txt by hand, adding a new
entry every time I pip install something, making sure to include
sufficient version constraints:
$ pip install pandas
Collecting pandas
Using cached https://files.pythonhosted.org/packages/2a/67/0a59cb257c72bb837575ca0ddf5f0fe2a482e98209b7a1bed8cde68ddb46/pandas-0.24.2-cp36-cp36m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl
Requirement already satisfied: numpy>=1.12.0 in ./.venv/lib/python3.6/site-packages (from pandas) (1.16.3)
Requirement already satisfied: python-dateutil>=2.5.0 in ./.venv/lib/python3.6/site-packages (from pandas) (2.8.0)
Requirement already satisfied: pytz>=2011k in ./.venv/lib/python3.6/site-packages (from pandas) (2019.1)
Requirement already satisfied: six>=1.5 in ./.venv/lib/python3.6/site-packages (from python-dateutil>=2.5.0->pandas) (1.12.0)
Installing collected packages: pandas
Successfully installed pandas-0.24.2
$ echo 'pandas==0.24.2' >> requirements.txt
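After a few more installs, the hand-maintained file stays short and legible; it might read something like this (the additional packages and version constraints here are hypothetical):
pandas==0.24.2
scikit-learn>=0.20,<0.22
matplotlib>=3.0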
A caveat: the one package I use in just about every analysis but do
not track in requirements.txt is Jupyter Lab. Unlike the other
packages I use for an analysis, Jupyter Lab isn’t a library I import
and use in the source code. It’s a tool. People may prefer to use
other notebook viewers, such as the old-school Jupyter Notebook
interface.
Of course, I commit changes to requirements.txt whenever I make
them.
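For contrast with the setup.py route mentioned above, installable software would declare the same direct dependencies in install_requires. A minimal sketch (the package metadata here is hypothetical):
# setup.py (sketch) -- for installable software, not an analysis
from setuptools import setup, find_packages

setup(
    name="my-analysis",          # hypothetical package name
    version="0.1.0",
    packages=find_packages(),
    install_requires=[
        "pandas==0.24.2",        # same direct dependencies, same constraints
    ],
)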
02 - Analysis is a DAG, or, How I Learned to Stop Worrying and Love make
In this analysis, my pre-processing script can’t run unless the raw data is present, and the visualization notebook can’t run without the pre-processed data and the serialized model files, which are generated by…
Sound familiar? A full analysis is a network of these interdependent tasks which, when executed in just the right order, take you all the way from raw data to your conclusions.
This ordering of which tasks depend on which others exists in any analysis, regardless of whether it’s explicitly documented. As a courtesy to others trying to run my analysis (which, again, should be easy if my results are to be reproducible, and, by extension, trustworthy) I choose to document it.
Writing down “you need to run the pre-processing script before you
train the models” and such in a README is better than nothing, but
it leaves the door wide open to human error in reproducing your
steps. Therefore, I choose a more expressive medium for declaring my
task dependencies, one whose native language is tasks and
dependencies, and one which can be found on virtually any (*NIX)
system: make.
Just like requirements.txt, I build up a Makefile incrementally
as I execute the tasks of my analysis. It records the commands to run
a task as well as the dependencies of that task. (Here’s a quick tutorial
if you’ve never seen a Makefile before.) For example, I usually
start by recording how to acquire the raw data:
##
# Makefile
# created: MAY 2019
##
data: data/raw/cooldataset.csv data/raw/otherstuff.csv
data/raw:
	mkdir -p $@

# cooldataset.csv depends on data/raw being present
# If it's not, run the data/raw "recipe" above
data/raw/cooldataset.csv: data/raw
	curl -LsSo $@ https://cool-data.com/raw.csv
	chmod -w $@

# same story here
data/raw/otherstuff.csv: data/raw
	aws s3 cp s3://other/stuff.csv ./$@
	chmod -w $@
Now I, or anyone with a copy of the project, can get the raw dataset simply by running:
make data
I leave it as an exercise for the reader to research the internal
mechanics of make, how it analyzes dependencies and only does work
when it has to. Just know that if you’ve accurately stated every
task’s dependencies in the Makefile, you can simply state your end
goal (e.g. make deliverables at the command line) and make will
take care of everything else that needs to be done, without
repeating work that has already been done. This definitely beats
reproducing someone’s steps by hand.
NOTE: When my raw data lands on disk, I chmod -w it (make it
read-only) to prevent accidental changes.
Now let’s say I’ve created a pre-processing script. At this point I’d
record how to use it as well as what’s required to run it in the
Makefile:
data/processed:
	mkdir -p $@

data/processed/cooldataset.csv: data/raw/cooldataset.csv data/processed src/preprocess.py
	python -m src.preprocess $< --gpu -v > $@
The command line arguments of your pre-processing script will of course vary; this is just an example. Now someone can run:
make data/processed/cooldataset.csv
to generate the pre-processed dataset.
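For concreteness, here is a sketch of what a src/preprocess.py matching that recipe might look like; the script itself (including the --gpu flag) is hypothetical, and yours will do whatever cleaning your analysis actually needs:
"""src/preprocess.py (sketch): read a raw CSV, clean it, write the result to stdout."""
import argparse
import sys

import pandas as pd


def main():
    parser = argparse.ArgumentParser(description="Pre-process a raw dataset.")
    parser.add_argument("infile", help="path to the raw CSV (make passes $< here)")
    parser.add_argument("--gpu", action="store_true", help="hypothetical GPU-acceleration flag")
    parser.add_argument("-v", "--verbose", action="store_true", help="log progress to stderr")
    args = parser.parse_args()

    df = pd.read_csv(args.infile)
    df = df.dropna()  # stand-in for the real cleaning steps
    if args.verbose:
        print(f"processed {len(df)} rows", file=sys.stderr)
    df.to_csv(sys.stdout, index=False)  # make redirects stdout into $@


if __name__ == "__main__":
    main()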
Ultimately, I’ll end up with a target such as model or report or
some other deliverable such that someone can clone my analysis’ Git
repository, and simply run make report to completely reproduce my
results. In other words, I try my darnedest to make my results
completely reproducible with three simple commands:
git clone https://github.com/wbadart/my-analysis.git
cd my-analysis
make report
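The report target itself is just one more rule in the Makefile. A sketch of what it might look like, assuming the deliverable is an executed notebook rendered to HTML (the notebook and output names are placeholders, and jupyter nbconvert is assumed to be available):
.PHONY: report
report: report.html

report.html: notebooks/report.ipynb data/processed/cooldataset.csv
	jupyter nbconvert --execute --to html notebooks/report.ipynb --output report --output-dir .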
Since make knows the internal dependencies of our tasks, we can
tell it to run independent tasks concurrently with the -j (“jobs”)
flag:
make -j2 report # Use 2 concurrent worker processes
make -j report # No limit on the number of concurrent jobs
I leave you with some further reading from the folks from whom I first heard this idea: Cookiecutter Data Science. I encourage you to read the whole page, but there’s a link to Analysis is a DAG in the sidebar if you’re pressed for time.
Putting it All Together
With the above process as a backdrop, the sometimes tedious work of
experimentation becomes a lot smoother and more enjoyable. By
recording my procedure in the Makefile and keeping tabs on file
changes with Git, I always know how I got to where I’m at, which
helps me determine where to go next. Having a consistent README
across projects makes it easier for others to hit the ground running
when they want to contribute. Using a sandbox environment keeps me
from getting inaccurate results when package versions change out from
under me.
This flow has worked well for me, but it’s not the only path to reproducibility. If I missed something big, or there’s something I should try, let me know!