The purpose of this post is to propose a template for machine learning projects that strives to follow these principles:
- All data scientists can quickly set up an identical Docker-based development environment that encourages good software engineering practices.
- Dependency management is handled during the environment’s startup by Miniconda and requires minimal manual changes.
- Notebooks are encouraged for exploration. For production purposes, however, notebooks must be version controlled, parametrized, and run using Papermill.
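To make the last principle concrete, here is a minimal sketch of running a parametrized notebook with Papermill from Python. The helper names (`build_output_path`, `run_notebook`), the notebook path, and the parameter names are hypothetical and are not taken from the template. Only `papermill.execute_notebook` is the library's own call.

```python
from datetime import date
from pathlib import Path


def build_output_path(notebook: Path, out_dir: Path, run_date: date) -> Path:
    """Return a dated output path so every executed run is kept separately."""
    return out_dir / f"{notebook.stem}_{run_date.isoformat()}.ipynb"


def run_notebook(notebook, out_dir, parameters, run_date=None):
    """Execute `notebook` with injected parameters and return the output path."""
    # Imported lazily so the path helper above works without Papermill installed.
    import papermill as pm

    run_date = run_date or date.today()
    output = build_output_path(Path(notebook), Path(out_dir), run_date)
    output.parent.mkdir(parents=True, exist_ok=True)

    # Papermill writes the given values into a new cell placed after the
    # notebook cell tagged "parameters", then executes the notebook.
    pm.execute_notebook(str(notebook), str(output), parameters=parameters)
    return output


# Hypothetical usage (the notebook path and parameter names are assumptions):
# run_notebook("notebooks/train.ipynb", "runs", {"learning_rate": 0.01})
```

Because the input notebook is never modified, it can stay under version control unchanged, while each executed copy records the parameters it ran with.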
The template is available on GitHub at adamnovotnycom/machine-learning-docker-template. The general template structure is as follows:
1. Dockerfile defines the development environment and uses Miniconda as the base image:

       RUN conda env create -f conda.yml
       RUN echo "source activate dev" > ~/.bashrc
2. conda.yml is used for dependency management and includes standard data science packages.
3. The ml_docker_template package should contain all production code, so that it can be installed and run by an external system. As a result, code developed locally also runs easily on an external machine when additional compute power is needed for model training or when additional permissions are required for deployment.