imblearn.under_sampling.InstanceHardnessThreshold
-
class imblearn.under_sampling.InstanceHardnessThreshold(estimator=None, ratio='auto', return_indices=False, random_state=None, cv=5, n_jobs=1, **kwargs)
Class to perform under-sampling based on the instance hardness threshold.
Read more in the User Guide.
Parameters: estimator : object, optional (default=RandomForestClassifier())
Classifier used to estimate the instance hardness of the samples. By default, a sklearn.ensemble.RandomForestClassifier is used. If str, the available choices are: 'knn', 'decision-tree', 'random-forest', 'adaboost', 'gradient-boosting' and 'linear-svm'. If object, it must be an estimator inheriting from sklearn.base.ClassifierMixin and exposing a predict_proba method.
Deprecated since version 0.2: Passing estimator as a string is deprecated from 0.2 and will be replaced in 0.4. Use a sklearn.base.ClassifierMixin object instead.
ratio : str, dict, or callable, optional (default='auto')
Ratio to use for resampling the data set.
- If str, has to be one of: (i) 'minority': resample the minority class; (ii) 'majority': resample the majority class; (iii) 'not minority': resample all classes apart from the minority class; (iv) 'all': resample all classes; and (v) 'auto': corresponds to 'all' for over-sampling methods and 'not minority' for under-sampling methods. The targeted classes are over-sampled or under-sampled to reach the same number of samples as the majority or minority class.
- If dict, the keys correspond to the targeted classes and the values to the desired number of samples.
- If callable, a function taking y and returning a dict. The keys correspond to the targeted classes and the values to the desired number of samples.
Warning
This algorithm is a cleaning under-sampling method. When providing a dict, only the targeted classes (the keys) are used; the requested numbers of samples (the values) are discarded.
return_indices : bool, optional (default=False)
Whether or not to return the indices of the samples randomly selected from the majority class.
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
cv : int, optional (default=5)
Number of folds to be used when estimating samples’ instance hardness.
n_jobs : int, optional (default=1)
The number of threads to open if possible.
**kwargs:
Options for the different classifiers.
Deprecated since version 0.2: **kwargs has been deprecated from 0.2 and will be replaced in 0.4. Use a sklearn.base.ClassifierMixin object instead to pass parameters associated with an estimator.
Notes
The method is based on [R45].
Supports multi-class resampling. A one-vs.-rest scheme is used when sampling a class, as proposed in [R45].
See Instance Hardness Threshold.
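As a rough illustration of the thresholding idea behind these notes, the sketch below keeps, for one targeted class, the samples whose estimated probability of belonging to their true class is highest, i.e. the least "hard" ones. It assumes the per-sample probabilities are already available (for instance from cross-validated predict_proba calls). The function name keep_easy_samples and the percentile-based cut-off are illustrative assumptions, not necessarily what the library does internally.

```python
import numpy as np


def keep_easy_samples(proba_true, y, target_class, n_keep):
    """Return indices to keep after thresholding one class (hypothetical helper).

    proba_true[i] is the estimated probability that sample i belongs to its
    true class y[i]; low values mean the sample is "hard".
    """
    proba_true = np.asarray(proba_true, dtype=float)
    y = np.asarray(y)
    mask = y == target_class
    # Percentile chosen so that roughly n_keep samples of the class survive.
    q = (1.0 - n_keep / mask.sum()) * 100.0
    threshold = np.percentile(proba_true[mask], q)
    # Samples of other classes are left untouched.
    keep = ~mask | (proba_true >= threshold)
    return np.flatnonzero(keep)


y = [0, 0, 1, 1, 1, 1, 1, 1]
proba_true = [0.9, 0.8, 0.95, 0.2, 0.7, 0.6, 0.99, 0.4]
kept = keep_easy_samples(proba_true, y, target_class=1, n_keep=2)
print(kept)  # [0 1 2 6]
```

With a cut-off of this kind, samples tied at the threshold are all kept, so the resampled class can end up larger than the requested size.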
References
[R45] Smith, Michael R., Tony Martinez, and Christophe Giraud-Carrier. “An instance level analysis of data complexity.” Machine Learning 95.2 (2014): 225-256.
Examples
>>> from collections import Counter
>>> from sklearn.datasets import make_classification
>>> from imblearn.under_sampling import InstanceHardnessThreshold
>>> X, y = make_classification(n_classes=2, class_sep=2,
... weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,
... n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)
>>> print('Original dataset shape {}'.format(Counter(y)))
Original dataset shape Counter({1: 900, 0: 100})
>>> iht = InstanceHardnessThreshold(random_state=42)
>>> X_res, y_res = iht.fit_sample(X, y)
>>> print('Resampled dataset shape {}'.format(Counter(y_res)))
Resampled dataset shape Counter({1: 840, 0: 100})
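The ratio parameter is documented as also accepting a callable that takes y and returns a dict. The following is a minimal sketch of such a callable, using only the standard library. The helper name not_minority_ratio is hypothetical and not part of imblearn. Since this is a cleaning method, the Warning in the parameter description suggests that only the keys of the returned dict would matter.

```python
from collections import Counter


def not_minority_ratio(y):
    """Hypothetical callable: target every class except the minority one."""
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    n_minority = counts[minority]
    # Keys are the targeted classes, values the desired number of samples.
    return {label: n_minority for label in counts if label != minority}


y = [0] * 100 + [1] * 900
ratio = not_minority_ratio(y)
print(ratio)  # {1: 100}

# Assumed usage, based on the documented signature:
# iht = InstanceHardnessThreshold(ratio=not_minority_ratio)
```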
-
__init__(estimator=None, ratio='auto', return_indices=False, random_state=None, cv=5, n_jobs=1, **kwargs)
-
fit(X, y)
Find the class statistics before performing sampling.
Parameters: X : {array-like, sparse matrix}, shape (n_samples, n_features)
Matrix containing the data to be sampled.
y : array-like, shape (n_samples,)
Corresponding label for each sample in X.
Returns: self : object
Return self.
-
fit_sample(X, y)
Fit the statistics and resample the data directly.
Parameters: X : {array-like, sparse matrix}, shape (n_samples, n_features)
Matrix containing the data to be sampled.
y : array-like, shape (n_samples,)
Corresponding label for each sample in X.
Returns: X_resampled : {array-like, sparse matrix}, shape (n_samples_new, n_features)
The array containing the resampled data.
y_resampled : array-like, shape (n_samples_new,)
The corresponding labels of X_resampled.
-
get_params(deep=True)
Get parameters for this estimator.
Parameters: deep : boolean, optional
If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params : mapping of string to any
Parameter names mapped to their values.
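To illustrate what deep=True changes, here is a small sketch that uses a stand-in class rather than the sampler itself. It assumes get_params behaves as in scikit-learn's BaseEstimator, which is where this method's description appears to come from. ToySampler is a hypothetical class introduced only for this example.

```python
from sklearn.base import BaseEstimator
from sklearn.tree import DecisionTreeClassifier


class ToySampler(BaseEstimator):
    """Hypothetical stand-in for a sampler that wraps a classifier."""

    def __init__(self, estimator=None, cv=5):
        self.estimator = estimator
        self.cv = cv


sampler = ToySampler(estimator=DecisionTreeClassifier(max_depth=3))

# deep=False: only the constructor parameters of the object itself.
shallow = sampler.get_params(deep=False)
print(sorted(shallow))  # ['cv', 'estimator']

# deep=True: parameters of the nested estimator are added,
# prefixed with '<parameter name>__'.
deep = sampler.get_params(deep=True)
print(deep['estimator__max_depth'])  # 3
```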
-
sample(X, y)
Resample the dataset.
Parameters: X : {array-like, sparse matrix}, shape (n_samples, n_features)
Matrix containing the data to be sampled.
y : array-like, shape (n_samples,)
Corresponding label for each sample in X.
Returns: X_resampled : {ndarray, sparse matrix}, shape (n_samples_new, n_features)
The array containing the resampled data.
y_resampled : ndarray, shape (n_samples_new,)
The corresponding labels of X_resampled.
- If