H2O comes with many features. This second part of the H2O in practice series proposes a protocol for combining AutoML modeling with traditional modeling and optimization methods. The goal is to define a workflow that can be applied to new use cases to obtain both performance and fast delivery.
In association with EDF Lab and ENEDIS, our common goal is to assess the difficulty of onboarding the H2O platform, to understand how it works, and to identify its strengths and weaknesses in the context of a real project.
In the first article, a real experience with H2O, the challenge was to build a model using AutoML and compare it to a reference model built with a traditional approach. For the second challenge, we were given a prepared dataset from another business problem (still related to preventive maintenance) and five days to produce the best possible model with H2O. To prepare for it, we developed the operational protocol presented in this article. It helped us train a baseline model comparable to the existing one in just two days.
This protocol provides guidance on how to combine AutoML modeling with individual modeling algorithms for increased performance. The duration of training is analyzed over two examples to give a global picture of what can be expected.
Project 1, discovery of segments of underground low-voltage cables in need of replacement: train and test set had a little over 1 million rows and 434 columns each. For this use case we used Sparkling Water, which combines H2O and Spark and distributes the load across a Spark cluster.
- H2O_cluster_version: 126.96.36.199
- H2O_cluster_total_nodes: 10
- H2O_cluster_free_memory: 88.9 Gb
- H2O_cluster_allowed_cores: 50
- H2O_API_Extensions: XGBoost, Algos, Amazon S3, Sparkling Water REST API Extensions, AutoML, Core V3, TargetEncoder, Core V4
- Python_version: 3.6.10 final
Project 2, preventive maintenance, confidential: train and test set had about 85,000 rows each, with 605 columns used for training. In this case, we used a non-distributed version of H2O running on a single node.
- H2O_cluster_version: 188.8.131.52
- H2O_cluster_total_nodes: 1
- H2O_cluster_free_memory: 21.24 Gb
- H2O_cluster_allowed_cores: 144
- H2O_API_Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4
- Python_version: 3.5.2 final
In both cases, the task was classification on a severely unbalanced dataset (< 0.5% of class 1). AUCPR was used as the optimization metric during training. For the final model evaluation, two business metrics were calculated, both representing the number of errors on two different cumulative lengths of feeders. 5-fold cross-validation was used to validate the models. The challenge in both projects was to compare the best model from H2O with the internal reference model, which was already optimized.
We combined AutoML features with individual modeling algorithms to get the best of both worlds. After trying different approaches, we found the proposed protocol to be the most concise and simple.
- Depending on available time:
- If you have enough time, run AutoML without a time limit to construct many models (30+) and get the longest, most accurate training (max_models parameter).
- If you are short on time, or if you are just looking for an approximate result to estimate baseline performance, define the maximum training time (max_runtime_secs parameter).
Calculate a business metric (and/or additional metrics) for each model and collect them in a custom leaderboard.
Merge the AutoML leaderboard with the custom leaderboard and inspect:
- Which model family scores the highest?
- How do business and statistical metrics correlate?
- Are some models performing so poorly that we don't want to spend more time optimizing them?
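One way to build that combined view is to convert the AutoML leaderboard to pandas and join the business metrics on the model id. A sketch with made-up values (in a real run the left table would come from aml.leaderboard.as_data_frame()):

```python
import pandas as pd

# In a real run: lb = aml.leaderboard.as_data_frame()
lb = pd.DataFrame({
    "model_id": ["XGBoost_1", "GBM_2", "DeepLearning_3"],
    "aucpr":    [0.41, 0.37, 0.12],
})

# Custom leaderboard: a business metric computed per model (made-up values).
business = pd.DataFrame({
    "model_id":        ["XGBoost_1", "GBM_2", "DeepLearning_3"],
    "incidents_found": [120, 108, 35],
})

merged = lb.merge(business, on="model_id").sort_values("aucpr", ascending=False)
merged["family"] = merged["model_id"].str.split("_").str[0]

# Which family scores highest? How do the two metrics correlate?
print(merged.groupby("family")["incidents_found"].max())
print(merged["aucpr"].corr(merged["incidents_found"]))
```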
Due to the confidentiality of the project, we will only describe the results to illustrate the example. The XGBoost family performed much better than any other algorithm. Deep learning performed the worst, which is not surprising with tabular data. For example, the best XGBoost model found 3-4 times more incidents than the best deep learning model, depending on the business metric. The second best performing family was GBM, whose best model found about 90% of the incidents found by the best XGBoost model. In both projects, the models with the highest AUCPR had the highest business metrics, but overall the correlation was not strong.
Run AutoML restricted to the most successful families of algorithms (include_algos parameter). Test many models. Select the models you want to optimize (interesting models) and save them.
Print the actual parameters of the models of interest (actual_params).
Use these parameters as a basis for the manual definition of the model. Define the grid search hyperparameters (H2OGridSearch). If you want to test many combinations or you are short on time, use the random grid search strategy. Otherwise, build models from all combinations of the hyperparameters (Cartesian strategy).
Compute additional metrics on the grid search models and possibly inspect variable importance. Models with similar scores may rely on different sets of variables, which may matter more in the business context than the score alone.
Choose the best model(s) for each family and save them.
Alternatively, you can build stacked ensembles and recalculate the additional metrics.
We used two datasets of different sizes (1 million rows x 434 columns and 85,000 rows x 605 columns). Since they were processed in two different environments, we cannot compare processing times directly. With rough estimates of the duration of each phase, we want to give you an idea of what to expect.
Workflow in project 1:
- build 40 AutoML models → ~ 8.5 hours
- extract the parameters of the best model in each family and define the hyperparameters for optimization (in this case XGB and GBM)
- random search:
- 56 XGB models → ~ 8 hours (+ business metrics → ~ 4.5 hours)
- 6 GBM models → ~ 1 h (+ business metrics → ~ 0.5 h)
- save the best models
Given that the dataset was not very large (< 200 MB), we were surprised that AutoML took so long to finish 40 models. Perhaps the models did not converge well, or the cluster resources needed tuning. Our preferred solution was to start the long calculations at the end of the day and have the results in the morning.
Workflow in project 2:
- run AutoML for 10 minutes (fixed time; + business metrics → ~ 15 minutes)
- build 30 GBM models with AutoML → ~ 0.5 h (+ business metrics → ~ 1 h)
- extract the parameters of the best model and define the hyperparameters for grid search
- random search: 72 GBM models → ~ 1 h (+ business metrics → ~ 2 h)
- save the best model
The second dataset was quite small and we managed to complete all the steps within one working day.
The comparison was based solely on the values of the business metrics. We did not perform proper statistical tests on the results, although we did account for the variance in our results. It should therefore not be seen as a benchmark but as an observation. In both cases, we managed to produce models with roughly the same performance as the references. Usually we had several candidates with similar scores. Most importantly, we achieved this in just a fraction of the time needed for the reference model.
With two real-world problems, we showed how to shorten the time needed to build a good baseline model by leveraging automated machine learning with H2O. After investing time in understanding the advantages and limitations of the platform, we were able to build a model comparable to the reference model in just a couple of days. This is a significant speedup compared to the traditional approach. In addition, the user-friendly API shortens coding time and simplifies code maintenance.
The partners who contributed to this work are:
#H2O #practice #protocol #combines #AutoML #traditional #modeling #methods