2 Machine Learning Project Workflow
What are the main phases of an ML project?
- Frame problem
- Get data
- Gain insights
- Prepare data
- Choose models
- Fine-tune and combine models
- Present solutions
- Launch, monitor and maintain system
Remember: For all tasks, automate as much as possible
2.1 Frame problem
- What is the business objective, and what are the current solutions?
- How will the new solution be used?
- The type of problem (e.g., supervised/unsupervised, classification/regression) determines the candidate models and how performance is measured
- What is the minimum performance needed?
- Are there comparable problems? Can experience or tools be reused?
- List and verify assumptions (where possible)
2.2 Get data
- List the data needed: where to get it, how much (features, instances), and how much storage space is required
- Get the data and convert it to a convenient format if necessary
- Anonymize sensitive information
- Recheck the data after conversion
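A minimal sketch of this step; the URL and the "email" column are hypothetical stand-ins for a real data source:
```python
import hashlib

import pandas as pd

DATA_URL = "https://example.com/dataset.csv"  # hypothetical source

df = pd.read_csv(DATA_URL)
print(df.shape)  # (instances, features): a first size/storage check

# Anonymize sensitive information by hashing it (column name is hypothetical).
df["email"] = df["email"].map(
    lambda v: hashlib.sha256(str(v).encode()).hexdigest()
)
df.to_csv("dataset_anonymized.csv", index=False)

# Recheck the data after conversion.
print(df.dtypes)
print(df.head())
```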
2.3 Gain Insights
- Create a copy of the data for exploration (sampling it down to a manageable size if necessary).
- Create a notebook to keep a record of data exploration.
- Study each attribute and its characteristics:
- Name
- Type (categorical, int/float, bounded/unbounded, text, structured, etc.)
- % of missing values
- Noisiness and type of noise (stochastic, outliers, rounding errors, etc.)
- Usefulness for the task
- Type of distribution (Gaussian, uniform, logarithmic, etc.)
- For supervised learning tasks, identify the target attribute(s).
- Visualize the data.
- Study the correlations/mutual information between attributes
- Identify promising transformations/feature engineering
- Identify extra data that would be useful.
- Document what we have learned.
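A minimal sketch of this exploration loop, assuming a pandas DataFrame `df` with a hypothetical target column `median_house_value`:
```python
import matplotlib.pyplot as plt
from sklearn.feature_selection import mutual_info_regression

# Work on a copy, sampled down to a manageable size if necessary.
explore = df.sample(n=min(len(df), 10_000), random_state=42)

explore.info()                 # names, types, non-null counts
print(explore.isna().mean())   # fraction of missing values per attribute
print(explore.describe())      # quick look at the distributions
explore.hist(bins=50)          # visualize each numeric attribute
plt.show()

# Correlations and mutual information with the target.
print(explore.corr(numeric_only=True)["median_house_value"])
num = explore.select_dtypes("number").dropna()
mi = mutual_info_regression(num.drop(columns="median_house_value"),
                            num["median_house_value"])
print(dict(zip(num.drop(columns="median_house_value").columns, mi)))
```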
2.4 Prepare data
- Work on copies of the data (keep the original dataset intact).
- Write functions for all data transformations we apply, for three reasons (a sketch follows this list):
- Easily prepare the data the next time we get a fresh dataset
- Easily apply the transformations to the test set/new instances once the solution is live
- Treat preparation choices as hyperparameters
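A minimal sketch of the idea, assuming numeric feature arrays `X_train` and `X_new`: wrapping the transformations in a scikit-learn Pipeline covers all three points, since the pipeline can be refit on fresh data, applied to new instances with `.transform()`, and tuned later:
```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

prep = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # "median" is a preparation choice
    ("scale", StandardScaler()),
])
X_train_prep = prep.fit_transform(X_train)  # fit once on the training data
X_new_prep = prep.transform(X_new)          # reuse on the test set/new instances

# The preparation choice can later be searched like any hyperparameter:
# {"impute__strategy": ["mean", "median", "most_frequent"]}
```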
- Clean the data
- Fix or remove outliers (optional).
- Fill in missing values (e.g., with zero, mean, median…) or drop their rows (or columns).
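A minimal sketch of the cleaning options above, assuming a DataFrame `df` with a hypothetical numeric `income` column:
```python
df = df.copy()  # keep the original intact

# Fix outliers (one option): clip to the 1st/99th percentiles.
low, high = df["income"].quantile([0.01, 0.99])
df["income"] = df["income"].clip(low, high)

# Fill in missing values with the median...
df["income"] = df["income"].fillna(df["income"].median())
# ...or drop the affected rows (or the whole column) instead:
# df = df.dropna(subset=["income"])
# df = df.drop(columns=["income"])
```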
- Perform feature selection (optional)
- Drop the attributes that provide no useful information for the task.
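A minimal sketch of this selection step, assuming numeric arrays `X` and `y` with no missing values: drop near-constant attributes, then keep the features most informative about the target:
```python
from sklearn.feature_selection import (SelectKBest, VarianceThreshold,
                                       mutual_info_regression)

X_var = VarianceThreshold(threshold=0.0).fit_transform(X)  # drop constant columns
selector = SelectKBest(mutual_info_regression, k=10)       # k is a tunable choice
X_selected = selector.fit_transform(X_var, y)
print(selector.get_support())  # mask of the retained features
```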
- Perform feature engineering, where appropriate
- Discretize continuous features.
- Decompose features (e.g., categorical, date/time, etc.).
- Add promising transformations of features (e.g., log(x), sqrt(x), x², etc.).
- Aggregate features into promising new features.
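A minimal sketch of these engineering steps; `df` and its `price`, `timestamp`, `total_rooms`, and `households` columns are hypothetical:
```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

# Discretize a continuous feature into quantile bins.
binner = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
df["price_bin"] = binner.fit_transform(df[["price"]]).ravel()

# Decompose a date/time feature.
df["timestamp"] = pd.to_datetime(df["timestamp"])
df["hour"] = df["timestamp"].dt.hour
df["dayofweek"] = df["timestamp"].dt.dayofweek

# Add promising transformations and aggregates (hypothetical columns).
df["log_price"] = np.log1p(df["price"])
df["rooms_per_household"] = df["total_rooms"] / df["households"]
```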
- Perform feature scaling
- Standardize or normalize features.
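A minimal sketch, assuming numeric training and test arrays; the scaler is fit on the training set only:
```python
from sklearn.preprocessing import MinMaxScaler, StandardScaler

scaler = StandardScaler()  # standardize: zero mean, unit variance
# scaler = MinMaxScaler()  # alternative: normalize to [0, 1]

X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)        # reuse the same statistics
```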
2.5 Choose models
If the dataset is very large, it may help to sample a smaller training set so that many different models can be trained in a reasonable time, but note that this penalizes complex models such as Random Forests and neural networks.
- Train many quick-and-dirty models from different categories (e.g., Linear, Naive Bayes, SVM, Random Forest, Neural Networks, etc.) using standard parameters.
- Measure and compare their performance: For each model, use N-fold cross-validation and compute the mean and standard deviation of the performance measures.
- Analyze the most significant variables for each algorithm.
- Analyze the types of errors the models make: What data would a human have used to avoid these errors?
- Perform a quick round of feature selection and engineering.
- Perform one or two more quick iterations of the five previous steps.
- Shortlist the top three to five most promising models, preferring models that make different types of errors.
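A minimal sketch of this spot-check for a regression task, assuming prepared arrays `X_train` and `y_train`:
```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

models = {
    "linear": LinearRegression(),
    "svm": SVR(),
    "forest": RandomForestRegressor(random_state=42),
}
for name, model in models.items():
    # 5-fold cross-validation; flip the sign to get RMSE back.
    rmse = -cross_val_score(model, X_train, y_train, cv=5,
                            scoring="neg_root_mean_squared_error")
    print(f"{name}: RMSE {rmse.mean():.3f} (std {rmse.std():.3f})")
```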
2.6 Fine-tune and combine models
- Fine-tune on the full training set: use as much data as possible for this step.
- Don’t tweak the model after measuring the generalization error: it would just start overfitting the test set.
- Fine-tune the hyperparameters using cross-validation:
- Treat data transformation choices as hyperparameters, especially when we are not sure about them (e.g., if we are not sure whether to replace missing values with zeros or with the median value, or to just drop the rows).
- Unless there are very few hyperparameter values to explore, prefer random search over grid search. If training is very long, prefer a Bayesian optimization approach (e.g., using Gaussian process priors).
- Try ensemble methods. Combining our best models will often produce better performance than running them individually.
- Once we are confident about the final model, measure its performance on the test set to estimate the generalization error.
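A minimal sketch of this step, assuming the prepared splits `X_train`, `y_train`, `X_test`, `y_test` from above: random search covers both model and preparation hyperparameters, and a voting ensemble combines the shortlisted models:
```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline, make_pipeline

pipe = Pipeline([
    ("impute", SimpleImputer()),
    ("model", RandomForestRegressor(random_state=42)),
])
param_distributions = {
    "impute__strategy": ["mean", "median"],  # preparation choice as hyperparameter
    "model__n_estimators": randint(100, 500),
}
search = RandomizedSearchCV(pipe, param_distributions, n_iter=20, cv=5,
                            scoring="neg_root_mean_squared_error",
                            random_state=42)
search.fit(X_train, y_train)

# Combine the best models into a simple ensemble.
ensemble = VotingRegressor([
    ("forest", search.best_estimator_),
    ("linear", make_pipeline(SimpleImputer(strategy="median"), LinearRegression())),
])
ensemble.fit(X_train, y_train)

# Only now touch the test set, once, to estimate the generalization error.
rmse = np.sqrt(mean_squared_error(y_test, ensemble.predict(X_test)))
print(f"test RMSE: {rmse:.3f}")
```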
2.7 Present solutions
- Document what we have done.
- Create a nice presentation: Make sure to highlight the big picture first.
- Explain why the solution achieves the business objective.
- Don’t forget to present interesting points noticed along the way:
- Describe what worked and what did not.
- List our assumptions and system’s limitations.
- Ensure the key findings are communicated through beautiful visualizations or easy-to-remember statements (e.g., “the median income is the number-one predictor of housing prices”).
2.8 Launch, monitor and maintain system
- Get the solution ready for production (plug into production data inputs, write unit tests, etc.).
- Write monitoring code to check the system’s live performance at regular intervals and trigger alerts when it drops:
- Beware of slow degradation: models tend to “rot” as data evolves.
- Measuring performance may require a human pipeline (e.g., via a crowdsourcing service).
- Also monitor the inputs’ quality (e.g., a malfunctioning sensor sending random values, or another team’s output becoming stale). This is particularly important for online learning systems.
- Retrain the models on a regular basis on fresh data (automate as much as possible).
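A minimal sketch of a monitoring check; `get_recent_labeled_data()` and `send_alert()` are hypothetical hooks for your own data access and alerting, and the threshold is an assumed business requirement:
```python
import numpy as np
from sklearn.metrics import mean_squared_error

RMSE_ALERT_THRESHOLD = 50_000  # assumed minimum acceptable performance

def check_live_performance(model):
    """Run at regular intervals (e.g., from a scheduled job)."""
    X_recent, y_recent = get_recent_labeled_data()  # hypothetical data hook
    rmse = np.sqrt(mean_squared_error(y_recent, model.predict(X_recent)))
    if rmse > RMSE_ALERT_THRESHOLD:
        send_alert(f"Model performance degraded: RMSE={rmse:.0f}")  # hypothetical
    return rmse
```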