Projects

Co-founder at DrivenData

DrivenData is a social enterprise that runs predictive modeling data science competitions for nonprofits, NGOs, and other mission-driven organizations. I built the platform along with my co-founder Peter — frontend, backend, and deployment. We use test-driven development and code reviews to stay sane, Github issues to track bugs and user stories, waffle.io for workflow and sprint planning; Coveralls for tracking code coverage, and Codeship for continuous integration and continuous deployment. Successful builds are automatically deployed to Heroku.

Our backend stack is Django, Postgres, Redis, rq, NumPy/Pandas; the frontend is customized Bootstrap along with some helpful jQuery plugins.


Cookiecutter data science

A logical, reasonably standardized, but flexible project structure for doing and sharing data science work.

Ever tried to reproduce an analysis that you did a few months ago or even a few years ago? You may have written the code, but it's now impossible to decipher whether you should use make_figures.py.old, make_figures_working.py or new_make_figures01.py to get things done.

Here are some questions we've learned to ask with a sense of existential dread: Are we supposed to go in and join the column X to the data before we get started or did that come from one of the notebooks? Come to think of it, which notebook do we have to run first before running the plotting code: was it "process data" or "clean data"? Where did the shapefiles get downloaded from for the geographic plots? Et cetera, times infinity.

These types of questions are painful and are symptoms of a disorganized project. A good project structure encourages practices that make it easier to come back to old work, for example separation of concerns, abstracting analysis as a DAG, and engineering best practices like version control.

Star


Modeling and simulating civil conflict in Uganda

We used data from ACLED (Armed Conflict Location and Event Data Project) and MCMC methods to simulate where civil conflicts happen in both space and time.

In the simulation part of this project, we treated the past data as an empirical distribution from which to sample. This worked out well, because the goal was less to predict exactly when and where conflicts take place than to create plausible scenarios for the optimization part of the project.

In the optimization part of this project, we used the temporal/geospatial conflict occurence model as a source of randomly sampled data points. We then used simulated annealing to get approximate solutions for a traveling salesman problem (visiting all conflicts for aid delivery) combined with a knapsack problem aspect (packing the aid vehicles with the right supplies).


Parallelizing an Integer Linear Program (ILP) with MPI

We modeled an assignment ILP in a small private school where each grade of 80 to 100 students must be divided into four all-day class groups while satisfying several hard constraints (e.g. not placing certain pairs of students in the same class or making sure that all for classes have roughly the same number of students) and soft constraints (e.g. putting two specific friends together).

We achieved excellent speedups (over 100x) by parallelizing this problem. Subdividing the feasible region and using MPI to split up solving, we were able to move from a compute time of almost 4 hours down to about 2 minutes using a cluster of 256 processors. MPI proved a powerful and flexible way to parallelize LP solving.


Predicting workforce attrition in the U.S. Coast Guard

The United States Coast Guard (USCG) must decide each year how many enlisted personnel to promote in each specific rank and occupational specialty, as well as how many new recruits to bring on. These decisions are subject to many constraints, but the most important factor is the rate at which personnel leave the service.

The available data spanned FY 2005-2013. For each year, we had a snapshot of the entire enlisted workforce on the last day of the fiscal year with each member’s service information (e.g. time in service, occupational specialty, rank) and demographic information (e.g. gender, marital status, education level, etc).

The overall goal of the research was to forecast this unknown quantity in a robust and repeatable manner. We used Gaussian Mixture Models (GMMs), clustering, multi-level logistic regression, and other methods in order to forecast attrition.


Analyzing Dunkin' Donuts' business with scraped data

I used data scraped from the Dunkin' Donuts online store locator along with openly available corporate disclosure documents and government data to delve into some of the fundamentals of Dunkin's franchising. For instance, a simple linear model with a mashup of store locations and US census data helps suggest which areas might support new, profitable franchises.

The post also demonstrates the use of some helpful Python packages like Basemap for GIS-like plotting, as well as command line tools such as pdftotext for getting useful information out of non-text documents.


django-hide-herokuapp

Django app for Heroku users designed to hide their *.herokuapp.com site from search engine results. Uses routing rules or custom middleware to make sure that the actual domain name gets into the SERPs and not the backup URL.

Star


dotfiles-copier

A single script to copy all of your scattered dotfiles and config folders to a specified directory. Can automatically git commit changes.

Built on the lightweight click command line framework, and using an extremely simple YAML configuration inspired by Ansible's copy directive.

Star


circlemud-world-parser

Long before massively multiplayer online games with realistic 3D graphics, MUDs were multi-user text-based games that were some of the first virtual worlds to be built on the internet. But these ephemeral worlds have been increasingly dying off, and with them goes much of the creativity that went into creating the tens of thousands of lovingly crafted descriptions of spaces, people, and sometimes incredibly elaborate storylines.

CircleMUD was one of the most popular pieces of software designed for hosting a MUD. Written in 30,000+ lines of C, it was an impressive project for its time and was one of the early collaborative open source projects. It stored its objects (rooms, objects, characters, etc) in flat files based on a highly specific, custom format dating back to its predecessor DikuMUD.

This little digital archeology tool takes these old CircleMUD flat files and renders them in JSON, an approachable and widely used general purpose serialization format.

Star


Syntax highlighting for IJulia notebooks in Pelican

The Pelican plugin for blogging with IPython notebooks didn't know how to highlight notebooks in languages other than Python. To make that work, I wrote some tests and the code to make them pass, then shot a quick pull request to the pelican-plugins repo.