trw.train.utils_dataset
Module Contents
Functions
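calculate_weight_by_class(split, output_classification_name, max_weight=None, normalize_by_max_weight=False) | Calculate the counts for each class and calculate weights that compensate for class imbalances.
set_weight_scaled_by_inverse_class_frequency(split, output_classification_name, weights_by_class=None, max_weight=5, normalize_by_max_weight=False) | In classification tasks, we may not have a perfectly balanced dataset.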
- trw.train.utils_dataset.calculate_weight_by_class(split, output_classification_name, max_weight=None, normalize_by_max_weight=False)
Calculate the counts for each class and calculate weights that compensate for class imbalances.
- Parameters
split – the data to be used
output_classification_name – the classification name to be used to calculate the class frequencies
max_weight – the maximum weight possible for a class (unnormalized). This guards against an outlier class (i.e., a class several orders of magnitude smaller than the others): we don’t want an “infinite weight” for this class but an acceptable maximum instead. If None, there is no maximum weight
normalize_by_max_weight – if False, the class with the most instances has a weight of 1.0 and the others have weights > 1.0. If True, the ratios between weights are the same as with False, but the class with the fewest instances has a weight of 1.0 and the others have weights < 1.0. Rationale: it is safer to keep weights below 1.0 so that we don’t blow up the gradient
- Returns
a dictionary mapping each class to a training weight (fewer samples means a higher weight, to compensate)
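The weighting rule described above can be illustrated with a small standalone sketch. It is not the library's implementation: the helper name sketch_weight_by_class is hypothetical, it takes a plain list of class labels instead of a split, and the exact formula (largest count divided by class count, then capped, then optionally divided by the largest weight) is an assumption inferred from the parameter descriptions.

```python
from collections import Counter


def sketch_weight_by_class(classes, max_weight=None, normalize_by_max_weight=False):
    """Hypothetical helper: inverse-frequency weights for a list of class labels."""
    counts = Counter(classes)
    max_count = max(counts.values())

    weights = {}
    for c, count in counts.items():
        # The most frequent class gets 1.0, rarer classes get more than 1.0.
        w = max_count / count
        if max_weight is not None:
            # Cap the weight so an extremely rare class cannot dominate.
            w = min(w, max_weight)
        weights[c] = w

    if normalize_by_max_weight:
        # Rescale so the largest weight is 1.0 and all others fall below it.
        largest = max(weights.values())
        weights = {c: w / largest for c, w in weights.items()}
    return weights


# Assumed toy data: 8 samples of class 0, 4 of class 1, 1 of class 2.
classes = [0] * 8 + [1] * 4 + [2] * 1

weights_a = sketch_weight_by_class(classes)  # {0: 1.0, 1: 2.0, 2: 8.0}
weights_b = sketch_weight_by_class(classes, max_weight=5)  # class 2 is capped at 5
weights_c = sketch_weight_by_class(classes, max_weight=5, normalize_by_max_weight=True)
```

In this sketch the cap is applied before normalization, which matches the description of max_weight as an unnormalized maximum. With normalize_by_max_weight=True the rarest class ends at 1.0 and the others below it.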
- trw.train.utils_dataset.set_weight_scaled_by_inverse_class_frequency(split, output_classification_name, weights_by_class=None, max_weight=5, normalize_by_max_weight=False)
In classification tasks, we may not have a perfectly balanced dataset. This is problematic, in particular in highly unbalanced datasets, since the classifier may just end up learning the ratio of the classes and not what we really care about (i.e., how to discriminate).
A possible solution is to weight each sample by the inverse of its class frequency.
- Parameters
max_weight – the maximum weight a class may have relative to the other classes
split – the split to use
output_classification_name – the name of the classification feature (i.e., it must be a 1D array of classes)
weights_by_class – if specified, use the provided weights. If not, automatically calculate the weights using calculate_weight_by_class
normalize_by_max_weight – if False, the class with the most instances has a weight of 1.0 and the others have weights > 1.0. If True, the ratios between weights are the same as with False, but the class with the fewest instances has a weight of 1.0 and the others have weights < 1.0. Rationale: it is safer to keep weights below 1.0 so that we don’t blow up the gradient
- Returns
the weights by class
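To make the idea of per-sample weighting concrete, here is a minimal sketch of turning class weights into one weight per sample. The helper sketch_sample_weights is hypothetical and works on a plain list of labels. How the library stores the resulting weights in the split is not described in this section, so the sketch simply returns them.

```python
from collections import Counter


def sketch_sample_weights(classes, weights_by_class=None, max_weight=5):
    """Hypothetical helper: one weight per sample from inverse class frequency."""
    if weights_by_class is None:
        # No weights provided: derive them from the class counts (assumed rule).
        counts = Counter(classes)
        max_count = max(counts.values())
        weights_by_class = {
            c: min(max_count / count, max_weight) for c, count in counts.items()
        }
    # Each sample receives the weight of its own class.
    sample_weights = [weights_by_class[c] for c in classes]
    return sample_weights, weights_by_class


# Assumed toy data: four samples of class 0 and one of class 1.
sample_weights, by_class = sketch_sample_weights([0, 0, 0, 0, 1])

# Total weight per class, to check how well the imbalance is compensated.
total_class_0 = sum(w for w, c in zip(sample_weights, [0, 0, 0, 0, 1]) if c == 0)
total_class_1 = sum(w for w, c in zip(sample_weights, [0, 0, 0, 0, 1]) if c == 1)
```

In this toy case both classes contribute the same total weight (4.0 each), so a weighted loss would no longer favor the majority class. When the cap applies, the compensation is only partial, which is the intended trade-off of max_weight.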