Data Science workflow recommendation
A workflow recommended for data science.
The repository for this workflow is stored here
Production data science template
The template of this repository follows the production-data-science workflow,
which focuses on productionizing a data scientist's work: making analysis and research reusable and applicable to
production.
The workflow is separated into 2 phases:
- The exploration phase is where the data scientist explores the project, working mainly in Jupyter notebooks. All work from this phase is stored in the `exploration` folder.
- The production phase is where the data scientist's work is refactored into packages so that it can be reused and imported. All work from this phase is stored in the `your_package` folder.
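As a minimal sketch of the layout the two phases imply, the snippet below creates an `exploration` folder and a package folder inside a scratch directory. The helper name `create_layout`, the `__init__.py` file, and the temporary directory are assumptions made for illustration; the actual template may contain more files.

```python
import tempfile
from pathlib import Path


def create_layout(root, package_name="your_package"):
    """Create the two top-level folders described above under `root`."""
    root = Path(root)
    exploration = root / "exploration"  # notebooks from the exploration phase
    package = root / package_name       # refactored, importable code
    exploration.mkdir(parents=True, exist_ok=True)
    package.mkdir(parents=True, exist_ok=True)
    # Assumption: an __init__.py marks the folder as a regular Python package.
    (package / "__init__.py").touch()
    return exploration, package


# Try it in a throwaway directory so nothing is written to a real project.
with tempfile.TemporaryDirectory() as tmp:
    exploration_dir, package_dir = create_layout(tmp)
    print(exploration_dir.name, package_dir.name)
```

Keeping notebooks and importable code in separate folders is what lets `pip install -e .` pick up only the package folder.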
How to set up a new repository - for maintainers
git clone https://gitlab.com/Phuc_Su/production_data_science_template.git
git clone <your_project_repository>
cd <your_project_name>
git checkout -b product-initial-setup
# open Finder and copy all contents of production_data_science_template into your project repository, except the .git and .idea folders
conda create --name <environment_name> python=3.6
source activate <environment_name>
pip install git-lfs
# optional: track large-file extensions other than .jpg, .pdf, .csv, .xlsx
git lfs track <add large file path>
# rename the <your package> folder and modify setup.py; the most important field is install_requires. See the example below
# write something about your project in README.md
pip install -e .
pip freeze | grep -v <package_name> > requirements.txt
git add .
git commit -m "First commit"
git push -u origin HEAD
Example of setup.py
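from setuptools import setup


def readme():
    # Assumption: the long description is read from README.md in the project root.
    with open('README.md') as f:
        return f.read()


setup(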
name='your_project',
version='v0.1',
description='',
long_description=readme(),
classifiers=[
'Programming Language :: Python :: 3',
],
url='https://github.com/phucnsp/production_data_science_template',
author='Phuc_Su',
author_email='',
license='',
packages=['your_package'],
install_requires=[
'pypandoc>=1.4',
'watermark>=1.5.0',
'pandas>=0.20.3',
'scikit-learn>=0.19.0',
'scipy>=0.19.1',
'matplotlib>=2.1.0',
'pytest>=3.2.3',
'pytest-runner>=2.12.1',
'click>=6.7'
],
setup_requires=['pytest-runner'],
tests_require=['pytest'],
)
and you are ready~! 🎉
Note: to set up Slack notifications for GitLab merge requests, see the reference here
How to contribute - for developers
First-time setup
```bash
conda create --name <environment_name> python=3.6
source activate <environment_name>
git clone <repository url>
cd to/the/project/directory
pip install -r requirements.txt
pip install -e .
```
For a private repository accessible only through SSH authentication, substitute `https://github.com/` with `git@github.com:`.
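The substitution above can be sketched as a small helper. The function name `https_to_ssh` and its `host` parameter are hypothetical, and the sketch only handles URLs that begin with `https://<host>/`.

```python
def https_to_ssh(url, host="github.com"):
    """Rewrite https://<host>/owner/repo.git as git@<host>:owner/repo.git."""
    prefix = "https://{}/".format(host)
    if not url.startswith(prefix):
        raise ValueError("expected a URL starting with " + prefix)
    # Keep the owner/repo part and swap only the scheme and host prefix.
    return "git@{}:{}".format(host, url[len(prefix):])


print(https_to_ssh("https://github.com/phucnsp/production_data_science_template.git"))
# prints git@github.com:phucnsp/production_data_science_template.git
```

The result can be passed to `git clone` in place of the HTTPS URL.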
Returning to work
Some rules:
- 1 branch / 1 exploration / 1 folder
- Branch-name convention: explore-* for exploration, refactor-* for refactoring
git checkout master
git pull --all
# if you continue to work on old branch
git checkout <branch>
# if you want to start a new exploration
git checkout -b <new_branch>
# if your branch is far behind master and you want to merge
git merge master
#####################
# Start working
#####################
git add <path_to_work_files/folder>
git commit -m "some message"
git push -u origin HEAD
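The branch-name convention above (explore-* and refactor-*) could be checked before pushing with a sketch like the following. The helper name `branch_kind` and the set of characters allowed after the prefix are assumptions; the source only fixes the two prefixes.

```python
import re

# Prefixes come from the convention above; the allowed suffix characters are assumed.
BRANCH_PATTERN = re.compile(r"^(explore|refactor)-[A-Za-z0-9._-]+$")


def branch_kind(name):
    """Return 'explore' or 'refactor' for a conforming name, else None."""
    match = BRANCH_PATTERN.match(name)
    return match.group(1) if match else None


# Hypothetical branch names, used only to exercise the check.
for candidate in ["explore-churn-model", "refactor-features", "bugfix-123"]:
    print(candidate, "->", branch_kind(candidate))
```

A check like this could run in a pre-push hook or in CI, although the template itself does not say that it does.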
Notes
- `requirements.txt` helps set up your virtual environment and makes sure all contributors work in the same environment. Whenever you install a new library, add it to `requirements.txt` by running `pip freeze | grep -v <package_name> > requirements.txt`.
- `setup.py` allows you to create packages that you can redistribute. This script is meant to install your package on the end user's system, not to prepare the development environment.
  - `packages`: in-house development packages.
  - `install_requires`: packages that our development packages depend on.
  - `py_modules=['new_module']`: in-house development modules to install (placed in the root directory).
- `pip install -e .` installs the packages/modules from `setup.py` in editable mode.
- If you want to add large files to the working repository:
pip install git-lfs
git lfs install
# Tell LFS to track files with given path
git lfs track "path_to_large_file"
# Tell LFS to track files with format "*.jpg"
git lfs track "*.jpg"
# Tell LFS to track content of the whole directory
git lfs track "data/*"
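To show what the `pip freeze | grep -v <package_name>` step from the requirements.txt note does, here is a rough Python equivalent. The function name `filter_freeze` and the sample freeze output are made up for illustration, and the sketch uses plain substring matching, whereas `grep` treats its argument as a regular expression.

```python
def filter_freeze(freeze_output, package_name):
    """Drop every line containing `package_name`, roughly as `grep -v` would."""
    kept = [line for line in freeze_output.splitlines() if package_name not in line]
    return "\n".join(kept) + "\n" if kept else ""


# Made-up freeze output: the editable install of the project itself is the
# line that should not end up in requirements.txt.
sample = (
    "pandas==0.20.3\n"
    "-e git+https://example.com/me/your_project.git#egg=your_project\n"
    "click==6.7\n"
)
print(filter_freeze(sample, "your_project"), end="")
```

Because the match is by substring, any dependency whose name contains the package name would be dropped as well, so a distinctive package name is safer.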
How to use the package - for users
Install the library
conda create --name <environment_name> python=3.6
source activate <environment_name>
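# pip needs the #egg=<package_name> fragment to name an editable install from a VCS URL
pip install -e 'git+https://github.com/phucnsp/production_data_science_template.git#egg=<package_name>'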
For a private repository accessible only through SSH authentication, substitute `git+https://github.com` with `git+ssh://git@github.com`.
Note that the `-e` argument above makes the installation editable.