

.. _sphx_glr_auto_examples_over-sampling_plot_comparison_over_sampling.py:


====================================================
Comparison of the different over-sampling algorithms
====================================================

The following example attempts to make a qualitative comparison between the
different over-sampling algorithms available in the imbalanced-learn package.




.. code-block:: python


    # Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>
    # License: MIT

    from collections import Counter

    import matplotlib.pyplot as plt
    import numpy as np

    from sklearn.datasets import make_classification
    from sklearn.svm import LinearSVC

    from imblearn.pipeline import make_pipeline
    from imblearn.over_sampling import ADASYN, SMOTE, RandomOverSampler
    from imblearn.base import SamplerMixin
    from imblearn.utils import hash_X_y

    print(__doc__)








The following function will be used to create a toy dataset. It uses
``make_classification`` from scikit-learn with some parameters fixed.



.. code-block:: python


    def create_dataset(n_samples=1000, weights=(0.01, 0.01, 0.98), n_classes=3,
                       class_sep=0.8, n_clusters=1):
        return make_classification(n_samples=n_samples, n_features=2,
                                   n_informative=2, n_redundant=0, n_repeated=0,
                                   n_classes=n_classes,
                                   n_clusters_per_class=n_clusters,
                                   weights=list(weights),
                                   class_sep=class_sep, random_state=0)








The following function will be used to plot the sample space after resampling
to illustrate the characteristics of an algorithm.



.. code-block:: python



    def plot_resampling(X, y, sampling, ax):
        X_res, y_res = sampling.fit_sample(X, y)
        ax.scatter(X_res[:, 0], X_res[:, 1], c=y_res, alpha=0.8, edgecolor='k')
        # make nice plotting
        ax.spines['top'].set_visible(False)
        ax.spines['right'].set_visible(False)
        ax.get_xaxis().tick_bottom()
        ax.get_yaxis().tick_left()
        ax.spines['left'].set_position(('outward', 10))
        ax.spines['bottom'].set_position(('outward', 10))
        return Counter(y_res)








The following function will be used to plot the decision function of a
classifier given some data.



.. code-block:: python



    def plot_decision_function(X, y, clf, ax):
        plot_step = 0.02
        x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
        y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
        xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                             np.arange(y_min, y_max, plot_step))

        Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
        Z = Z.reshape(xx.shape)
        ax.contourf(xx, yy, Z, alpha=0.4)
        ax.scatter(X[:, 0], X[:, 1], alpha=0.8, c=y, edgecolor='k')








Illustration of the influence of the balancing ratio
##############################################################################


We will first illustrate the influence of the balancing ratio on some toy
data using a linear SVM classifier. The greater the difference between the
number of samples in each class, the poorer the classification results.



.. code-block:: python


    fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 12))

    ax_arr = (ax1, ax2, ax3, ax4)
    weights_arr = ((0.01, 0.01, 0.98), (0.01, 0.05, 0.94),
                   (0.2, 0.1, 0.7), (0.33, 0.33, 0.33))
    for ax, weights in zip(ax_arr, weights_arr):
        X, y = create_dataset(n_samples=1000, weights=weights)
        clf = LinearSVC().fit(X, y)
        plot_decision_function(X, y, clf, ax)
        ax.set_title('Linear SVC with y={}'.format(Counter(y)))
    fig.tight_layout()




.. image:: /auto_examples/over-sampling/images/sphx_glr_plot_comparison_over_sampling_001.png
    :align: center




Random over-sampling to balance the data set
##############################################################################


Random over-sampling can be used to repeat some samples and balance the
number of samples between the classes. It can be seen that with this trivial
approach the decision boundary is already less biased toward the majority
class.



.. code-block:: python


    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 7))
    X, y = create_dataset(n_samples=10000, weights=(0.01, 0.05, 0.94))
    clf = LinearSVC().fit(X, y)
    plot_decision_function(X, y, clf, ax1)
    ax1.set_title('Linear SVC with y={}'.format(Counter(y)))
    pipe = make_pipeline(RandomOverSampler(random_state=0), LinearSVC())
    pipe.fit(X, y)
    plot_decision_function(X, y, pipe, ax2)
    ax2.set_title('Decision function for RandomOverSampler')
    fig.tight_layout()




.. image:: /auto_examples/over-sampling/images/sphx_glr_plot_comparison_over_sampling_002.png
    :align: center
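
The idea of repeating samples can be sketched in a few lines of NumPy. The
helper ``naive_random_over_sample`` below is hypothetical and is not part of
imbalanced-learn; it only assumes that every class should be brought up to the
size of the largest class, and the actual ``RandomOverSampler`` may behave
differently in its details.

.. code-block:: python

    from collections import Counter

    import numpy as np


    def naive_random_over_sample(X, y, random_state=0):
        """Duplicate randomly chosen samples until all classes have equal size.

        Illustrative sketch only, not the imbalanced-learn implementation.
        """
        rng = np.random.RandomState(random_state)
        X, y = np.asarray(X), np.asarray(y)
        counts = Counter(y.tolist())
        n_target = max(counts.values())

        # start from all the original samples
        indices = [np.arange(len(y))]
        for label, count in counts.items():
            if count < n_target:
                candidates = np.flatnonzero(y == label)
                # draw with replacement: the same sample can be repeated
                indices.append(rng.choice(candidates, size=n_target - count,
                                          replace=True))
        indices = np.concatenate(indices)
        return X[indices], y[indices]


    X_toy = np.array([[0., 0.], [1., 1.], [2., 2.],
                      [3., 3.], [4., 4.], [5., 5.]])
    y_toy = np.array([0, 1, 1, 1, 1, 2])
    X_res, y_res = naive_random_over_sample(X_toy, y_toy)
    print(Counter(y_res.tolist()))

No new point is created in the feature space: every row of ``X_res`` is a copy
of a row of ``X_toy``.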




More advanced over-sampling using ADASYN and SMOTE
##############################################################################


Instead of repeating the same samples when over-sampling, we can use some
specific heuristics to generate new samples. ADASYN and SMOTE can be used in
this case.



.. code-block:: python



    # Make an identity sampler
    class FakeSampler(SamplerMixin):

        def fit(self, X, y):
            self.ratio_ = 1
            self.X_hash_ = hash_X_y(X, y)
            return self

        def sample(self, X, y):
            # identity sampler: return the data unchanged
            return X, y

        def _sample(self, X, y):
            pass

        def fit_sample(self, X, y):
            return X, y


    fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 15))
    X, y = create_dataset(n_samples=10000, weights=(0.01, 0.05, 0.94))
    sampler = FakeSampler()
    clf = make_pipeline(sampler, LinearSVC())
    plot_resampling(X, y, sampler, ax1)
    ax1.set_title('Original data - y={}'.format(Counter(y)))

    ax_arr = (ax2, ax3, ax4)
    for ax, sampler in zip(ax_arr, (RandomOverSampler(random_state=0),
                                    SMOTE(random_state=0),
                                    ADASYN(random_state=0))):
        clf = make_pipeline(sampler, LinearSVC())
        clf.fit(X, y)
        plot_resampling(X, y, sampler, ax)
        ax.set_title('Resampling using {}'.format(sampler.__class__.__name__))
    fig.tight_layout()




.. image:: /auto_examples/over-sampling/images/sphx_glr_plot_comparison_over_sampling_003.png
    :align: center
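
As a rough illustration of how such a heuristic can generate new points
instead of copies, the sketch below interpolates between a minority sample and
one of its nearest minority neighbors, which is the core idea usually
attributed to SMOTE. The function ``smote_like_samples`` and its parameters
are hypothetical names chosen for this example; this is a simplified sketch,
not the imbalanced-learn implementation.

.. code-block:: python

    import numpy as np


    def smote_like_samples(X_minority, n_new, k_neighbors=2, random_state=0):
        """Create points on segments joining minority samples to neighbors.

        Illustrative sketch only; assumes k_neighbors < len(X_minority).
        """
        rng = np.random.RandomState(random_state)
        X_minority = np.asarray(X_minority, dtype=float)

        # pairwise Euclidean distances between minority samples
        diff = X_minority[:, None, :] - X_minority[None, :, :]
        dist = np.sqrt((diff ** 2).sum(axis=-1))
        np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbor
        neighbors = np.argsort(dist, axis=1)[:, :k_neighbors]

        # pick a base sample, then one of its neighbors, at random
        base = rng.randint(0, len(X_minority), size=n_new)
        chosen = neighbors[base, rng.randint(0, k_neighbors, size=n_new)]

        # new sample = base + gap * (neighbor - base), with gap in [0, 1)
        gap = rng.uniform(size=(n_new, 1))
        return X_minority[base] + gap * (X_minority[chosen] - X_minority[base])


    X_min = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
    X_new = smote_like_samples(X_min, n_new=10)
    print(X_new.shape)

Because each new point is a convex combination of two existing minority
samples, it always lies on the segment between them.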




The following plot illustrates the difference between ADASYN and SMOTE. ADASYN
will focus on the samples which are difficult to classify with a
nearest-neighbors rule while regular SMOTE will not make any distinction.
Therefore, the decision function differs depending on the algorithm.



.. code-block:: python


    fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(20, 6))
    X, y = create_dataset(n_samples=10000, weights=(0.01, 0.05, 0.94))

    clf = LinearSVC().fit(X, y)
    plot_decision_function(X, y, clf, ax1)
    ax1.set_title('Linear SVC with y={}'.format(Counter(y)))
    sampler = SMOTE()
    clf = make_pipeline(sampler, LinearSVC())
    clf.fit(X, y)
    plot_decision_function(X, y, clf, ax2)
    ax2.set_title('Decision function for {}'.format(sampler.__class__.__name__))
    sampler = ADASYN()
    clf = make_pipeline(sampler, LinearSVC())
    clf.fit(X, y)
    plot_decision_function(X, y, clf, ax3)
    ax3.set_title('Decision function for {}'.format(sampler.__class__.__name__))
    fig.tight_layout()




.. image:: /auto_examples/over-sampling/images/sphx_glr_plot_comparison_over_sampling_004.png
    :align: center
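
The notion of a sample being "difficult to classify with a nearest-neighbors
rule" can be made concrete with a small sketch. Assuming that difficulty is
measured as the fraction of a minority sample's nearest neighbors that belong
to another class, the hypothetical helper ``adasyn_like_weights`` below
returns the share of new samples that would be generated around each minority
sample. It is an illustration of the principle, not the imbalanced-learn
implementation.

.. code-block:: python

    import numpy as np


    def adasyn_like_weights(X, y, minority_label, k_neighbors=3):
        """Share of synthetic samples allotted to each minority sample.

        Illustrative sketch only: the weight grows with the fraction of
        neighbors that do not belong to the minority class.
        """
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)

        # pairwise Euclidean distances over the whole data set
        diff = X[:, None, :] - X[None, :, :]
        dist = np.sqrt((diff ** 2).sum(axis=-1))
        np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbor

        minority_idx = np.flatnonzero(y == minority_label)
        nn = np.argsort(dist[minority_idx], axis=1)[:, :k_neighbors]
        hardness = (y[nn] != minority_label).mean(axis=1)

        total = hardness.sum()
        if total == 0:
            # no difficult sample: fall back to uniform weights
            return np.full(len(minority_idx), 1. / len(minority_idx))
        return hardness / total


    # four minority samples in a tight cluster, one lost among the majority
    X_toy = np.array([[0., 0.], [.1, 0.], [0., .1], [.1, .1], [5., 5.],
                      [5., 5.1], [5.1, 5.], [4.9, 5.], [5., 4.9], [6., 6.]])
    y_toy = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
    weights = adasyn_like_weights(X_toy, y_toy, minority_label=0)
    print(weights)

In this toy case all the weight goes to the isolated minority sample, whereas
a SMOTE-like scheme would pick base samples uniformly.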




These sampling particularities can give rise to some specific issues, as
illustrated below.



.. code-block:: python


    fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 15))
    X, y = create_dataset(n_samples=5000, weights=(0.01, 0.05, 0.94),
                          class_sep=0.8)

    ax_arr = ((ax1, ax2), (ax3, ax4))
    for ax, sampler in zip(ax_arr, (SMOTE(random_state=0),
                                    ADASYN(random_state=0))):
        clf = make_pipeline(sampler, LinearSVC())
        clf.fit(X, y)
        plot_decision_function(X, y, clf, ax[0])
        ax[0].set_title('Decision function for {}'.format(
            sampler.__class__.__name__))
        plot_resampling(X, y, sampler, ax[1])
        ax[1].set_title('Resampling using {}'.format(
            sampler.__class__.__name__))
    fig.tight_layout()




.. image:: /auto_examples/over-sampling/images/sphx_glr_plot_comparison_over_sampling_005.png
    :align: center




SMOTE proposes several variants by identifying specific samples to consider
during the resampling. The borderline version will detect which points lie on
the border between two classes and select them. The SVM version will use
the support vectors found using an SVM algorithm to create new samples.



.. code-block:: python


    fig, ((ax1, ax2), (ax3, ax4),
          (ax5, ax6), (ax7, ax8)) = plt.subplots(4, 2, figsize=(15, 30))
    X, y = create_dataset(n_samples=5000, weights=(0.01, 0.05, 0.94),
                          class_sep=0.8)

    ax_arr = ((ax1, ax2), (ax3, ax4), (ax5, ax6), (ax7, ax8))
    string_add = ['regular', 'borderline-1', 'borderline-2', 'SVM']
    for str_add, ax, sampler in zip(string_add,
                                    ax_arr,
                                    (SMOTE(random_state=0),
                                     SMOTE(random_state=0, kind='borderline1'),
                                     SMOTE(random_state=0, kind='borderline2'),
                                     SMOTE(random_state=0, kind='svm'))):
        clf = make_pipeline(sampler, LinearSVC())
        clf.fit(X, y)
        plot_decision_function(X, y, clf, ax[0])
        ax[0].set_title('Decision function for {} {}'.format(
            str_add, sampler.__class__.__name__))
        plot_resampling(X, y, sampler, ax[1])
        ax[1].set_title('Resampling using {} {}'.format(
            str_add, sampler.__class__.__name__))
    fig.tight_layout()

    plt.show()



.. image:: /auto_examples/over-sampling/images/sphx_glr_plot_comparison_over_sampling_006.png
    :align: center
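
To give an intuition of how border points can be detected, the sketch below
labels each minority sample from the classes of its nearest neighbors,
following the rule usually given for borderline SMOTE: "noise" if all
neighbors belong to another class, "danger" if at least half of them do, and
"safe" otherwise. The helper ``borderline_categories`` and its ``m_neighbors``
parameter are hypothetical names for this example, and the library may differ
in its details.

.. code-block:: python

    import numpy as np


    def borderline_categories(X, y, minority_label, m_neighbors=4):
        """Label each minority sample as 'safe', 'danger' or 'noise'.

        Illustrative sketch only, not the imbalanced-learn implementation.
        """
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)

        # pairwise Euclidean distances over the whole data set
        diff = X[:, None, :] - X[None, :, :]
        dist = np.sqrt((diff ** 2).sum(axis=-1))
        np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbor

        minority_idx = np.flatnonzero(y == minority_label)
        nn = np.argsort(dist[minority_idx], axis=1)[:, :m_neighbors]
        n_other = (y[nn] != minority_label).sum(axis=1)

        return np.where(n_other == m_neighbors, 'noise',
                        np.where(n_other >= m_neighbors / 2.,
                                 'danger', 'safe'))


    X_toy = np.array([
        # minority: a compact cluster of five samples
        [0., 0.], [.1, 0.], [0., .1], [.1, .1], [-.1, 0.],
        # minority: one sample isolated among the majority
        [10., 10.],
        # minority: two samples on the border with the majority
        [5., 5.], [5., 5.05],
        # majority samples
        [10., 10.1], [10.1, 10.], [9.9, 10.], [10., 9.9],
        [5.1, 5.], [4.9, 5.], [5., 4.9],
        [20., 20.], [21., 21.], [22., 22.]])
    y_toy = np.array([0] * 8 + [1] * 10)
    print(borderline_categories(X_toy, y_toy, minority_label=0))

Only the samples labelled "danger" would then serve as base points when
generating new samples.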




**Total running time of the script:** ( 0 minutes  24.983 seconds)



.. container:: sphx-glr-footer


  .. container:: sphx-glr-download

     :download:`Download Python source code: plot_comparison_over_sampling.py <plot_comparison_over_sampling.py>`



  .. container:: sphx-glr-download

     :download:`Download Jupyter notebook: plot_comparison_over_sampling.ipynb <plot_comparison_over_sampling.ipynb>`

.. rst-class:: sphx-glr-signature

    `Generated by Sphinx-Gallery <https://sphinx-gallery.readthedocs.io>`_
