Gradient boosting is an ensemble machine learning algorithm used to build predictive models for both regression and classification problems. In gradient boosting, a weak predictive model, typically a decision tree, is trained on the input data, and the errors made by that model are then used to train a new model, which is added to the existing one. This additive process is repeated many times until the predictions of the resulting model are sufficiently close to the correct output. However, since the training process is sequential, building an ensemble model that meets our expectations can be very time-consuming.
In the SAP HANA Predictive Analysis Library (PAL for short) we offer our own version of Gradient Boosting Decision Tree (GBDT), called Hybrid Gradient Boosting Tree (HGBT). It is highly customized to the requirements of SAP business software: hybrid data type support (both numerical and categorical), built-in missing value handling and explainability, support for multiple objective functions for different use cases, and several techniques that can greatly accelerate the training process of gradient boosting, among others.
In this blog post, we mainly focus on introducing some techniques for accelerating the training process in HGBT.
When constructing each decision tree model, each node is represented by a feature and a split value. Ideally, once a feature is selected, the optimal split value should be searched over all possible values present in the input data. If the feature is continuous and has many unique values, every one of those values must be considered, so the search for the optimal split value can be very expensive and dramatically slows down the construction of the tree model.
Therefore, reducing the number of candidate split values for continuous features is crucial for speeding up tree construction. In HGBT, this is achieved by histogram splitting. It first discretizes all unique values collected from the continuous feature into a fixed number of bins, i.e. a histogram, and then generates splits based on the discretized feature values.
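To make the saving concrete, here is a minimal sketch in plain Python of the general idea. It is not PAL's internal implementation: the function names, the equal-width binning rule and the bin count of 8 are assumptions chosen purely for illustration. It compares the number of candidate split values for a continuous feature before and after discretization.

```python
import random

def exact_split_candidates(values):
    """Exact splitting: midpoints between consecutive unique values."""
    uniq = sorted(set(values))
    return [(a + b) / 2.0 for a, b in zip(uniq, uniq[1:])]

def histogram_split_candidates(values, max_bin_num):
    """Histogram splitting (assumed equal-width bins): interior bin edges."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return []  # a constant feature offers no split
    width = (hi - lo) / max_bin_num
    return [lo + i * width for i in range(1, max_bin_num)]

random.seed(0)
feature = [random.gauss(0.0, 1.0) for _ in range(10000)]

exact = exact_split_candidates(feature)
hist = histogram_split_candidates(feature, max_bin_num=8)

# Exact splitting has to evaluate one candidate per gap between unique
# values, while histogram splitting evaluates max_bin_num - 1 candidates.
print(len(exact), len(hist))
```

With exact splitting the number of candidates grows with the number of unique values, whereas with binning it is bounded by the number of bins regardless of the data size.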
Reducing the number of features considered when constructing each decision tree model is another way to directly accelerate the training process. This can partially be achieved by treating some related features together as a whole. In HGBT, feature grouping is adopted to achieve this objective to some extent. It is a technique that can discover incoherent relations among sparse features (i.e. numerical features with many 0s) and group them together as one when splitting.
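The exact grouping rule used inside HGBT is not described here, so the following is only a sketch under a stated assumption: two sparse features may share a group if they are rarely non-zero in the same row, and a tolerance bounds the fraction of rows in which they are. The helper names `conflict_rate` and `group_sparse_features`, the greedy strategy and the toy data are all hypothetical.

```python
def conflict_rate(col_a, col_b):
    """Fraction of rows in which both sparse columns are non-zero."""
    conflicts = sum(1 for a, b in zip(col_a, col_b) if a != 0 and b != 0)
    return conflicts / len(col_a)

def group_sparse_features(columns, tolerant_rate):
    """Greedy grouping (an assumed rule, not PAL's actual algorithm):
    put each feature into the first group whose members it conflicts
    with in at most `tolerant_rate` of the rows."""
    groups = []
    for name, col in columns.items():
        for group in groups:
            if all(conflict_rate(col, columns[other]) <= tolerant_rate
                   for other in group):
                group.append(name)
                break
        else:
            groups.append([name])  # no compatible group found
    return groups

columns = {
    "A": [1, 0, 0, 0, 2, 0, 0, 0],
    "B": [0, 3, 0, 0, 0, 0, 4, 0],
    "C": [5, 0, 0, 6, 7, 0, 0, 0],
}

print(group_sparse_features(columns, tolerant_rate=0.0))   # [['A', 'B'], ['C']]
print(group_sparse_features(columns, tolerant_rate=0.25))  # [['A', 'B', 'C']]
```

In this toy data, A and B are never non-zero together, so they can be grouped strictly. C overlaps with A in 2 of 8 rows, so it only joins the group once the tolerance is raised to 0.25.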
Reducing the number of weak learners, i.e. decision trees, is another way to directly reduce the training time.
In general, weak learners trained in the earlier phase are more likely to reduce the true prediction error, while those trained in the later phase can make the ensemble model over-complicated and overfit the training data. To address this issue, we need to continuously monitor the generalization performance of the ensemble model during training, which requires a separate validation set. When the generalization performance of the trained model deteriorates for a few consecutive iterations, the training process should be stopped immediately. This is exactly the early stop mechanism adopted in HGBT.
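The stopping rule can be sketched in a few lines of plain Python. This is an illustration, not HGBT's actual code: the function name, the list of validation losses and the choice to count an iteration as "deteriorating" whenever it fails to improve on the best loss seen so far are all assumptions.

```python
def early_stop_iteration(validation_losses, tolerant_iter_num):
    """Return the number of boosting iterations run before stopping.

    Training stops once the validation loss has failed to improve on the
    best value seen so far for `tolerant_iter_num` consecutive iterations.
    """
    best = float("inf")
    bad_streak = 0
    for i, loss in enumerate(validation_losses, start=1):
        if loss < best:
            best = loss
            bad_streak = 0       # improvement resets the counter
        else:
            bad_streak += 1
            if bad_streak >= tolerant_iter_num:
                return i         # stop immediately
    return len(validation_losses)  # never triggered: all iterations run

# Hypothetical validation losses, one per boosting iteration.
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.53, 0.51, 0.56, 0.60, 0.65]

print(early_stop_iteration(losses, tolerant_iter_num=3))  # 7
```

Here the best loss (0.50) is reached at iteration 4, and iterations 5 to 7 all fail to beat it, so training stops after 7 of the 10 planned iterations.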
Parallelization and Multi-threading
Though the gradient boosting process is iterative and sequential in general, there are still many operations in each iteration that can be parallelized. HGBT utilizes HANA's internal thread management capabilities for more balanced parallelization of training tasks. It allows users to control the degree of parallelism based on their system requirements, and it fully supports the HANA workload management mechanism so that users can balance the workload between ML jobs and other database transactions. Additional acceleration mechanisms such as thread pools are also introduced in HGBT to reduce thread operation overhead in such a high-concurrency environment.
HGBT in PAL can be called through the Python interface via the hana-ml library, within the hana_ml.algorithms.pal.trees module. Two Python classes implement the aforementioned techniques for gradient boosting: HybridGradientBoostingClassifier and HybridGradientBoostingRegressor. They share almost the same parameter set (with only a few exceptions).
Below we briefly describe the parameters related to the aforementioned techniques for speeding up the training of ensemble models in HGBT.
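As a rough analogy only (HGBT runs inside HANA, not in Python), the sketch below shows the pattern of parallelizing independent per-feature work within each sequential boosting iteration while reusing one thread pool across iterations. The helper `build_histogram`, the toy features and the bin edges are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def build_histogram(values, bin_edges):
    """Count how many values fall into each bin defined by sorted interior edges."""
    counts = [0] * (len(bin_edges) + 1)
    for v in values:
        idx = sum(1 for e in bin_edges if v >= e)  # index of the bin holding v
        counts[idx] += 1
    return counts

features = {
    "X0": [0.1, 0.4, 0.5, 0.9],
    "X1": [0.2, 0.2, 0.8, 0.7],
}
edges = [0.25, 0.5, 0.75]

# The pool is created once and reused in every iteration, so threads are
# not created and destroyed in each boosting round.
with ThreadPoolExecutor(max_workers=2) as pool:
    for iteration in range(3):  # iterations themselves stay sequential
        futures = {name: pool.submit(build_histogram, col, edges)
                   for name, col in features.items()}
        histograms = {name: f.result() for name, f in futures.items()}

print(histograms)
```

The iterations remain strictly ordered. Only the work inside one iteration, which is independent per feature, is spread across threads.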
1. Histogram Splitting
- split_method: set split_method='histogram' to trigger histogram splitting
- max_bin_num: specifies the maximum number of bins while building histogram during splitting.
2. Feature Grouping
- feature_grouping: set feature_grouping=True to trigger feature grouping
- tolerant_rate: set the maximum ratio of rows that can violate the strict grouping requirement
3. Early Stop
- validation_set_rate: specifies the sampling rate of validation set
- stratified_validation_set: specifies whether or not to apply stratified method when sampling the validation set
- tolerant_iter_num: specifies the number of successive deteriorating iterations before stopping
4. Parallelization and Multi-threading
- thread_ratio: specifies the ratio of available threads used for training
In this section, we show users how to use histogram splitting for accelerating the training process of HGBT through Python API.
The training dataset is intended for regression and contains 500000 records. It has one ID column (named 'ID'), five numerical feature columns of continuous data type (named 'X0' to 'X4'), and one dependent column (named 'Y') that holds the target values for regression.
>>> train_data.count()
500000
>>> train_data.columns
['ID', 'X0', 'X1', 'X2', 'X3', 'X4', 'Y']
As a baseline, we first train an HGBT regression model with 100 base learners and exact splitting. The training process is repeated 10 times and the average running time is calculated, as follows:
>>> import timeit
>>> from hana_ml.algorithms.pal.trees import HybridGradientBoostingRegressor as HGR
>>> hgr = HGR(n_estimators=100, split_method='exact')  # using exact splitting
>>> agg_time = 0  # for time aggregation
>>> counter = 0
>>> while counter < 10:
...     start = timeit.default_timer()
...     hgr.fit(data=train_data, key='ID', label='Y')
...     stop = timeit.default_timer()
...     agg_time = agg_time + stop - start
...     counter = counter + 1
...
>>> print('AVG Time(in seconds): ', agg_time / 10)
On our server, the above lines of code produced a result like the following:
AVG Time(in seconds): 10.319527600000129
In contrast, we now train an HGBT regression model with the same number of base learners but with histogram splitting. Again, the training process is repeated 10 times and the average running time is calculated, as follows:
>>> import timeit
>>> from hana_ml.algorithms.pal.trees import HybridGradientBoostingRegressor as HGR
>>> hgr2 = HGR(n_estimators=100, split_method='histogram')  # using histogram splitting
>>> agg_time = 0  # for time aggregation
>>> counter = 0
>>> while counter < 10:
...     start = timeit.default_timer()
...     hgr2.fit(data=train_data, key='ID', label='Y')
...     stop = timeit.default_timer()
...     agg_time = agg_time + stop - start
...     counter = counter + 1
...
>>> print('AVG Time(in seconds): ', agg_time / 10)
On our server, the above lines of code produced a result like the following:
AVG Time(in seconds): 7.150385869999991
This example shows that histogram splitting can save a considerable share of training time compared to exact splitting: here the average time drops from about 10.3 to about 7.2 seconds, a reduction of roughly 30%.
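For reference, the saving can be recomputed from the two averages reported above. The variable names are only illustrative, and the timings are those measured on our server, so they will differ on other systems.

```python
exact_avg = 10.319527600000129      # seconds, exact splitting (reported above)
histogram_avg = 7.150385869999991   # seconds, histogram splitting (reported above)

saved_ratio = (exact_avg - histogram_avg) / exact_avg  # share of time saved
speedup = exact_avg / histogram_avg                    # how many times faster

print(f"time saved: {saved_ratio:.1%}, speedup: {speedup:.2f}x")
# time saved: 30.7%, speedup: 1.44x
```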
Besides the aforementioned techniques for accelerating the training process, HGBT in SAP HANA PAL also offers some post-training functionalities like model compression. Interested readers can refer to related blogs for more details.