Data Science MVP is a template to jumpstart a minimum viable product (MVP), or rough draft, of a data science project.

It is implemented as a cookiecutter template that generates a directory tree containing subdirectories that hold data, notebooks, reports, source code, etc., with code stubs that present a logical workflow for assembling a pipeline for a reproducible data science project. (In its current version it is a modification of the cookiecutter data-science template, although that is likely to change in the future.)

At its heart is a Jupyter notebook (stored as mvp.ipynb in the /notebooks directory) with markdown and code cells pre-filled with suggestions for how to proceed through building an MVP. Each code cell has a commented-out %%writefile command that will write the cell's contents into a Python script stored in the /src directory. In the top-level directory, there is a script that executes the entire pipeline as implemented in /src.
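For example, a code cell in mvp.ipynb might look like the sketch below (the docstring and comments here are illustrative, not copied from the notebook; the module path and print message match the template's /src layout and pipeline output). Uncommenting the first line and re-running the cell exports it as a script:

```python
# %%writefile ../src/features/build_features.py   <- uncomment to export this cell

def run():
    """Stub for the feature-building stage of the pipeline."""
    print("...building features dataframe")

if __name__ == "__main__":
    run()
```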

Once the mvp.ipynb notebook is completed and the code is exported to /src, the data scientist can focus on making iterative improvements to each step in the pipeline. When the project is complete, it should be easy to export the models and source code into a Python package that can be shared with others.

How do I use this?

Install cookiecutter and generate the template project files

First you will need to install cookiecutter if you haven't already:

$ pip install cookiecutter

and then run the following command from the terminal to create a new project directory loaded with the files and folders from the template:

$ cookiecutter <path-or-url-to-this-template>

The prompt in the terminal will guide you through naming your project and its filepath, and then populate the project directories.

Once it's all set up, initialize a git repo in the project's top level directory and you can get to work.
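That step can look like the following (the project name and README contents here are placeholders, not part of the template; substitute whatever you chose at the cookiecutter prompt):

```shell
mkdir -p my_ds_project && cd my_ds_project    # stand-in for the directory cookiecutter generated
git init -q
git config user.email "you@example.com" && git config user.name "Your Name"
echo "# my_ds_project" > README.md            # the template provides real files; this is a placeholder
git add . && git commit -q -m "Initial commit from the template"
```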

Read through the pipeline script and the default scripts in /src

After you've built the template, you'll want to go into the /notebooks folder and start editing mvp.ipynb. But first, take a look at the pipeline script in the top-level directory, which is reproduced here:

import src.features.build_features
import src.models.train_model
import src.models.predict_model
import src.visualization.visualize

# Usage:
# $ python
if __name__ == "__main__":
    src.features.build_features.run()
    src.models.train_model.run()
    src.models.predict_model.run()
    src.visualization.visualize.run()

Note that this script imports all of the scripts that will be built by the %%writefile commands in mvp.ipynb, and that each of these scripts contains a run() function that executes one module of the pipeline. If you run this script at the command line right out of the box, it will produce this output:

 $ python
...processing raw data
...saving interim data
...processing interim data
...processing external data
...reading interim data
...building features dataframe
...performing train/test split
...saving processed data
...importing ml models
...building ml model
...reading train/test datasets
...saving ml model
...loading trained model
...loading new data
...making some predictions
...drawing some charts
...saving the charts

As you can see, the default run() functions provided in the repo each print out their name and some suggestions for the types of actions you will want to implement inside them. Note that since none of these run() functions take inputs or produce outputs, they are expected to run independently of each other. But you are free to reimplement them however you see fit.
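For instance, if you'd rather have the stages hand data to each other in memory instead of each reading and writing files, one hypothetical variation is to give each run() a return value and feed it to the next stage. The functions below are toy stand-ins, not the template's actual code:

```python
# Toy stand-ins for the run() functions in /src (illustration only).

def build_features():
    """Stand-in for src.features.build_features.run(): return (x, y) rows."""
    return [(x, 2 * x) for x in range(1, 6)]

def train_model(rows):
    """Stand-in for src.models.train_model.run(): fit y = slope * x."""
    return sum(y for _, y in rows) / sum(x for x, _ in rows)

def predict_model(model, x):
    """Stand-in for src.models.predict_model.run()."""
    return model * x

if __name__ == "__main__":
    rows = build_features()           # features stage
    model = train_model(rows)         # training stage: slope == 2.0
    print(predict_model(model, 10))   # prints 20.0
```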

Work through mvp.ipynb and build the first draft of your pipeline

With the dummy code in the included scripts, and the code stubs in the mvp.ipynb notebook, you should have a pretty good idea what steps to take to get a baseline model up and running.

Keep in mind that right now we're just concerned with engineering a pipeline that builds a model, any model. It doesn't have to be a great model. We're just engineering a system that can quickly translate ideas into results. Your best ideas will come later, and they'll come faster if you have the right tools in place to quickly test and refine them.
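As an illustration of how low the bar can be for that first model, here is a majority-class baseline in plain Python (an example, not code from the template). A model this simple still exercises the whole pipeline and gives later models a floor to beat:

```python
from collections import Counter

def fit_majority(labels):
    """'Train' by finding the most common label in the training set."""
    return Counter(labels).most_common(1)[0][0]

def predict_majority(model, n):
    """'Predict' the majority label for n new examples."""
    return [model] * n

if __name__ == "__main__":
    y_train = ["churn", "stay", "stay", "stay"]   # toy training labels
    model = fit_majority(y_train)
    print(predict_majority(model, 3))             # prints ['stay', 'stay', 'stay']
```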

Interpret your results and iterate on your pipeline

When you're done with your MVP, you should have a trained model, and some charts and reports to interpret how well your model performs. This gives you everything you need to go back and make changes at any step where you think improvements can be made, whether it's in collecting more data, engineering more features, choosing a new model, and so on.

If all of the steps run independently, you can re-run your pipeline after any change and see how it impacts the end-to-end workflow. Take care to save any data that you've downloaded or formatted, and re-load the data from disk any time you re-run your pipeline... unless you're working on the Obtaining or Scrubbing data stages. Quick iterations are key to staying productive.
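One simple way to get that save-and-reload behavior is a check-the-disk-first helper like the sketch below (the path, file format, and function name are assumptions for illustration, not the template's API):

```python
import json
from pathlib import Path

def obtain_data(path=Path("data/raw/dataset.json")):
    """Return cached raw data if it exists; otherwise fetch and cache it."""
    if path.exists():                               # fast path on re-runs
        return json.loads(path.read_text())
    data = {"rows": list(range(5))}                 # stand-in for a slow download/scrape
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(data))               # cache for the next run
    return data
```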

Save your work!

At this point, how you proceed is up to you. Your reproducible pipeline lives in the scripts in the /src directory, and any updates need to be added to those files in order to run the full pipeline. You can use Jupyter notebooks, an interactive Python environment, an IDE, or a text editor and the command line to play with new ideas -- /src is a Python package, so you can import any of the functions you've written from the top-level directory.

Just make sure that every time you find something that works, you edit the relevant file in /src and run the pipeline again to make sure it doesn't break anything. And commit those changes to git right away!

Come back later and check for updates

This is still a work in progress. I'm playing around with it, and my students at Metis are working with it and providing feedback. I'd love to hear your suggestions for how to make this workflow even better.

I wouldn't recommend applying any of these updates to a datasciencemvp project that you're already working on -- once you've built a project from the template, you're free to make your own adjustments. But if everything is working as it should, the hope is that this will help you finish projects faster, so you'll be ready to use the latest version on your next one.
