trw.train.utils_dataset

Module Contents

Functions

calculate_weight_by_class(split, output_classification_name, max_weight=None, normalize_by_max_weight=False)

Calculate the counts for each class and compute weights that compensate for class imbalances

set_weight_scaled_by_inverse_class_frequency(split, output_classification_name, weights_by_class=None, max_weight=5, normalize_by_max_weight=False)

In classification tasks, we may not have a perfectly balanced dataset. This is problematic, in particular for highly unbalanced datasets.

trw.train.utils_dataset.calculate_weight_by_class(split, output_classification_name, max_weight=None, normalize_by_max_weight=False)

Calculate the counts for each class and compute weights that compensate for class imbalances

Parameters
  • split – the data to be used

  • output_classification_name – the classification name to be used to calculate the class frequencies

  • max_weight – the maximum weight possible for a class (unnormalized). This is to avoid the case where we have an outlier (i.e., a class several orders of magnitude smaller than the others): we don’t want an “infinite weight” for this class but an acceptable maximum instead. If None, there is no maximum weight

  • normalize_by_max_weight – if False, the class with the highest number of instances will have a weight of 1.0 and the others > 1.0. If True, same weight ratios as if False, but the class with the fewest instances will have weight 1.0 and the others < 1.0. Rationale: it is safer to have weights < 1.0 so that we don’t blow up the gradient

Returns
  a dictionary mapping each class to a training weight (fewer counts means a higher weight, to compensate for the imbalance)
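The weighting scheme described above could look roughly like the following self-contained sketch. weights_by_inverse_frequency is a hypothetical helper written only for illustration; the real function operates on a split and a classification feature name rather than a raw list of labels:

    import collections
    import numpy as np

    def weights_by_inverse_frequency(classes, max_weight=None, normalize_by_max_weight=False):
        """Hypothetical helper illustrating the weighting scheme described above."""
        # Count the number of samples per class
        counts = collections.Counter(np.asarray(classes).tolist())
        max_count = max(counts.values())

        # A rarer class gets a proportionally larger weight (majority class -> 1.0)
        weights = {c: max_count / n for c, n in counts.items()}

        # Cap the weight so an extremely rare class cannot get an "infinite" weight
        if max_weight is not None:
            weights = {c: min(w, max_weight) for c, w in weights.items()}

        if normalize_by_max_weight:
            # Rescale so the rarest class gets weight 1.0 and the others < 1.0
            largest = max(weights.values())
            weights = {c: w / largest for c, w in weights.items()}
        return weights

    # e.g. 90 samples of class 0 and 10 samples of class 1
    print(weights_by_inverse_frequency([0] * 90 + [1] * 10))  # {0: 1.0, 1: 9.0}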

trw.train.utils_dataset.set_weight_scaled_by_inverse_class_frequency(split, output_classification_name, weights_by_class=None, max_weight=5, normalize_by_max_weight=False)

In classification tasks, we may not have a perfectly balanced dataset. This is problematic, in particular for highly unbalanced datasets, since the classifier may just end up learning the ratio of the classes and not what we really care about (i.e., how to discriminate between them).

A possible solution is to weight the samples according to the inverse class frequency.

Parameters
  • max_weight – the maximum weight a class may have relative to the other classes

  • split – the split to use

  • output_classification_name – the name of the classification feature (i.e., must be a 1D array of classes)

  • weights_by_class – if specified, use the provided weights. If not, automatically calculate the weights using calculate_weight_by_class

  • normalize_by_max_weight – if False, the class with the highest number of instances will have a weight of 1.0 and the others > 1.0. If True, same weight ratios as if False, but the class with the fewest instances will have weight 1.0 and the others < 1.0. Rationale: it is safer to have weights < 1.0 so that we don’t blow up the gradient

Returns
  the weights by class
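Below is a minimal usage sketch. The split layout (a plain dict mapping feature names to numpy arrays) and the 'label' feature name are assumptions made for illustration, not taken from the documentation above:

    # Hypothetical usage sketch; the split structure and feature names are assumptions.
    import numpy as np
    from trw.train import utils_dataset

    split = {
        'image': np.random.rand(100, 1, 28, 28).astype(np.float32),
        'label': np.asarray([0] * 90 + [1] * 10),  # 9:1 class imbalance
    }

    weights_by_class = utils_dataset.set_weight_scaled_by_inverse_class_frequency(
        split,
        output_classification_name='label',
        max_weight=5,
        normalize_by_max_weight=False,
    )
    # With normalize_by_max_weight=False the majority class keeps weight 1.0 and the
    # minority class gets a larger weight (here capped at max_weight=5 instead of 9)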