How do you select the best hyperparameters in tree-based models?

A. Measure performance over training data
B. Measure performance over validation data
C. Both of these
D. Random selection of hyperparameters

The correct answer is C. Both of these.

Hyperparameters are settings that control the learning process of a machine learning model. Unlike model parameters, they are not learned from the data during training, but they can have a significant impact on the model's performance. In tree-based models, common examples include the maximum tree depth, the minimum number of samples per leaf, and, for ensembles, the number of trees.

There are several ways to select hyperparameters. One common approach is to use a validation set: a portion of the data held out from the training process. The model is trained on the training set, its performance is evaluated on the validation set, and the hyperparameters are adjusted to improve validation performance.
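As a concrete illustration, here is a minimal sketch of that loop. It assumes scikit-learn (the original answer names no library), tunes a single hyperparameter, max_depth, on a built-in dataset, and the candidate values are purely illustrative:

```python
# A minimal sketch of validation-set tuning, assuming scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Hold out a validation set that the model never trains on.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

best_depth, best_score = None, -1.0
for depth in [2, 4, 6, 8, 10]:  # illustrative candidate values
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    score = model.score(X_val, y_val)  # accuracy on held-out data
    if score > best_score:
        best_depth, best_score = depth, score

print(f"Best max_depth: {best_depth} (validation accuracy {best_score:.3f})")
```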

Another approach is cross-validation, which divides the training data into k subsets (folds). The model is trained on k − 1 folds and evaluated on the remaining fold, and this is repeated so that each fold serves exactly once as the validation set. The hyperparameters are then chosen to maximize the model's average performance across all folds.
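A hedged sketch of cross-validated tuning, again assuming scikit-learn: GridSearchCV handles the fold rotation and averaging internally, and the parameter grid below is illustrative, not prescriptive.

```python
# A minimal sketch of k-fold cross-validation for hyperparameter tuning.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

param_grid = {
    "max_depth": [2, 4, 6, 8],
    "min_samples_leaf": [1, 5, 10],
}

# Each candidate is scored as mean accuracy over 5 folds,
# with every fold serving once as the validation set.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,
)
search.fit(X, y)

print("Best hyperparameters:", search.best_params_)
print(f"Mean cross-validated accuracy: {search.best_score_:.3f}")
```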

Picking a single hyperparameter configuration at random, without evaluating it on held-out data, is not a good approach: an arbitrary configuration is unlikely to yield a good model. (This is different from randomized search, which samples many configurations and scores each one on validation data; that is a systematic strategy.) It is better to use a validation set or cross-validation.

Here are some additional details about each option:

  • Option A: Measure performance over training data. On its own, this is misleading, because the model is fit to exactly this data; a sufficiently deep tree can score almost perfectly on the training set regardless of how well it generalizes. Training performance alone says little about performance on new data.
  • Option B: Measure performance over validation data. This works because the validation data is held out from the learning process, so validation performance is a reasonable estimate of how the model will perform on new data.
  • Option C: Both of these. This is the best approach because measuring both lets you compare them: a large gap between training and validation performance is the classic sign of overfitting, which points toward hyperparameters such as tree depth that need tightening. See the sketch after this list.
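
Here is a minimal sketch of Option C under the same scikit-learn assumption: sweep one hyperparameter and print training and validation accuracy side by side, so the gap between the two is visible.

```python
# A minimal sketch of Option C: track training and validation accuracy
# together so the gap between them exposes overfitting.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

for depth in [2, 4, 8, 16, None]:  # None lets the tree grow fully
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    val_acc = model.score(X_val, y_val)
    # A large train/validation gap at high depth signals overfitting.
    print(f"max_depth={depth}: train={train_acc:.3f}, val={val_acc:.3f}")
```

As max_depth grows, training accuracy typically climbs toward 1.0 while validation accuracy plateaus or drops; that widening gap is the overfitting signal the comparison is meant to expose.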