Remote sensing examples

The following examples make use of the Forest Cover Type dataset, fetched with scikit-learn, and the Indian Pines dataset, fetched from OpenML.

# Authors: Joao Fonseca <jpmrfonseca@gmail.com>
#          Manvel Khudinyan <armkhudinyan@gmail.com>
#          Georgios Douzas <gdouzas@icloud.com>
# License: MIT

from collections import Counter
import itertools

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import fetch_openml, fetch_covtype
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, make_scorer, cohen_kappa_score
from imblearn.metrics import classification_report_imbalanced
from imblearn.pipeline import make_pipeline, Pipeline

from gsmote import GeometricSMOTE

print(__doc__)

RANDOM_STATE = 5


def print_class_counts(y):
    """Print the class counts."""
    counts = dict(Counter(y))
    class_counts = pd.DataFrame(
        counts.values(), index=counts.keys(), columns=['Count']
    ).sort_index()
    print(class_counts)


def print_classification_report(clf, X_train, X_test, y_train, y_test):
    """Fit classifier and print classification report."""
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    clf_name = clf.__class__.__name__
    div = '=' * len(clf_name)
    title = f'\n{div}\n{clf_name}\n{div}\n'
    print(title, classification_report_imbalanced(y_test, y_pred))


def plot_confusion_matrix(cm, classes):
    """Plot the normalized confusion matrix."""
    cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=90)
    plt.yticks(tick_marks, classes)
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], '.2f'),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()

Forest Cover Type

The samples in this dataset correspond to 30×30 m patches of forest in the US, collected for the task of predicting each patch's cover type, i.e. the dominant species of tree. There are seven cover types, making this a multiclass classification problem. Each sample has 54 features, described on the dataset's homepage. Some of the features are boolean indicators, while others are discrete or continuous measurements.
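
As a quick sanity check on this feature mix, the boolean indicator columns can be identified by testing which features take only the values 0 and 1. The sketch below is illustrative and fetches its own copy of the data; fetch_covtype caches the download, so the fetch in the next section is not repeated work.

X_check, _ = fetch_covtype(return_X_y=True)

# Count the columns that contain only 0/1 values (the one-hot
# wilderness-area and soil-type indicators).
n_binary = sum(
    np.isin(X_check[:, i], (0.0, 1.0)).all() for i in range(X_check.shape[1])
)
print(f'{n_binary} of {X_check.shape[1]} features are boolean indicators')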

Dataset

The function sklearn.datasets.fetch_covtype() loads the dataset, downloading it from the web if necessary. As the class counts below show, the dataset is heavily imbalanced.
X, y = fetch_covtype(return_X_y=True)
print_class_counts(y)

Out:

    Count
1  211840
2  283301
3   35754
4    2747
5    9493
6   17367
7   20510
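
The counts above can be condensed into a single imbalance ratio: the majority class (2) outnumbers the minority class (4) by roughly 103 to 1. A minimal sketch, reusing the Counter import from the top of the script:

counts = Counter(y)
print(f'Imbalance ratio: {max(counts.values()) / min(counts.values()):.1f}')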

Classification

Below we use a Random Forest classifier to predict the cover type of each patch of forest. Two experiments are run: one using only the classifier, and another using a pipeline of Geometric SMOTE and the classifier. To keep the example fast, only 5% of the data is used for training. A classification report is printed for both experiments; its columns are precision, recall, specificity, F1 score, geometric mean, index balanced accuracy and support.
splitted_data = train_test_split(X, y, test_size=0.95, random_state=RANDOM_STATE, shuffle=True)

clf = RandomForestClassifier(bootstrap=True, n_estimators=10, random_state=RANDOM_STATE)
ovs_clf = make_pipeline(GeometricSMOTE(random_state=RANDOM_STATE), clf)

print_classification_report(clf, *splitted_data)
print_classification_report(ovs_clf, *splitted_data)

Out:

======================
RandomForestClassifier
======================
                    pre       rec       spe        f1       geo       iba       sup

          1       0.82      0.84      0.89      0.83      0.87      0.75    201463
          2       0.85      0.86      0.86      0.86      0.86      0.74    268975
          3       0.78      0.86      0.98      0.82      0.92      0.83     33905
          4       0.82      0.55      1.00      0.66      0.74      0.53      2619
          5       0.81      0.35      1.00      0.49      0.59      0.33      9037
          6       0.73      0.50      0.99      0.60      0.71      0.47     16533
          7       0.90      0.76      1.00      0.82      0.87      0.73     19430

avg / total       0.83      0.83      0.89      0.83      0.86      0.73    551962


========
Pipeline
========
                    pre       rec       spe        f1       geo       iba       sup

          1       0.82      0.85      0.90      0.83      0.87      0.75    201463
          2       0.85      0.87      0.86      0.86      0.86      0.75    268975
          3       0.78      0.86      0.98      0.82      0.92      0.84     33905
          4       0.78      0.58      1.00      0.67      0.76      0.56      2619
          5       0.83      0.33      1.00      0.47      0.58      0.31      9037
          6       0.75      0.49      0.99      0.59      0.70      0.46     16533
          7       0.89      0.75      1.00      0.81      0.86      0.72     19430

avg / total       0.83      0.84      0.89      0.83      0.86      0.74    551962
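
To see what the oversampler contributes inside the pipeline, it can also be applied directly to the training split. Assuming GeometricSMOTE follows imbalanced-learn's resampler API (it is used in an imblearn pipeline above, so it must provide fit_resample), the following sketch prints the rebalanced class counts:

X_train, _, y_train, _ = splitted_data

# By default, every minority class is oversampled up to the majority count.
gsmote_sampler = GeometricSMOTE(random_state=RANDOM_STATE)
X_res, y_res = gsmote_sampler.fit_resample(X_train, y_train)
print_class_counts(y_res)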

Indian Pines

This hyperspectral dataset has 220 spectral bands and 20 m spatial resolution. The imagery was collected on 12 June 1992 and represents a 2.9 by 2.9 km area in Tippecanoe County, Indiana, USA. The area is agricultural and eight land-use classes are present: alfalfa, corn, grass, hay, oats, soybeans, trees, and wheat. The Indian Pines dataset is commonly used for testing and comparing algorithms. The number of samples varies greatly among the classes, making this an imbalanced dataset. The data are made available by Purdue University (https://engineering.purdue.edu/~biehl/MultiSpec/hyperspectral.html).

Dataset

The dataset provides the data as numpy arrays, with the predictor and target variables already split (X and y, respectively). The predictors consist of 220 features, the targets are the land cover classes, and the dataset has 9144 samples.
X, y = fetch_openml('Indian_pines', as_frame=False, return_X_y=True)
print_class_counts(y)

Out:

          Count
Alfalfa      54
Corn       2502
Grass       523
Hay         489
Oats         20
Soybeans   4050
Trees      1294
Wheat       212

Classification

Below we combine the Geometric SMOTE oversampler and a Decision Tree classifier in a pipeline. Scikit-learn's GridSearchCV is used to find the best parameters of the oversampler: the deformation and truncation factors, which control the shape of the geometric region within which synthetic samples are generated. Models are compared with Cohen's kappa score, which corrects for chance agreement and is therefore better suited to imbalanced data than plain accuracy.
splitted_data = train_test_split(X, y, test_size=0.5, random_state=RANDOM_STATE, shuffle=True)

param_grid = {
    'gsmote__deformation_factor': [0.25, 0.50, 0.75],
    'gsmote__truncation_factor': [-0.5, 0.0, 0.5]
}
clf = DecisionTreeClassifier(random_state=RANDOM_STATE)
ovs_clf = Pipeline([
    ('gsmote', GeometricSMOTE(random_state=RANDOM_STATE)),
    ('dt', DecisionTreeClassifier(random_state=RANDOM_STATE)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
scoring = make_scorer(cohen_kappa_score)
gscv = GridSearchCV(ovs_clf, param_grid, scoring=scoring, refit=True, cv=cv, n_jobs=-1)

print_classification_report(clf, *splitted_data)
print_classification_report(gscv, *splitted_data)

Out:

======================
DecisionTreeClassifier
======================
                    pre       rec       spe        f1       geo       iba       sup

    Alfalfa       0.52      0.63      1.00      0.57      0.79      0.60        27
       Corn       0.69      0.71      0.88      0.70      0.79      0.61      1223
      Grass       0.90      0.84      0.99      0.87      0.91      0.82       291
        Hay       0.94      0.92      1.00      0.93      0.96      0.91       239
       Oats       0.33      0.20      1.00      0.25      0.45      0.18        10
   Soybeans       0.82      0.80      0.85      0.81      0.83      0.68      2039
      Trees       0.94      0.98      0.99      0.96      0.99      0.97       633
      Wheat       0.97      0.89      1.00      0.93      0.94      0.88       110

avg / total       0.81      0.81      0.90      0.81      0.85      0.73      4572


============
GridSearchCV
============
                    pre       rec       spe        f1       geo       iba       sup

    Alfalfa       0.53      0.70      1.00      0.60      0.84      0.68        27
       Corn       0.69      0.71      0.88      0.70      0.79      0.62      1223
      Grass       0.87      0.86      0.99      0.86      0.92      0.84       291
        Hay       0.93      0.92      1.00      0.93      0.96      0.91       239
       Oats       0.64      0.90      1.00      0.75      0.95      0.89        10
   Soybeans       0.82      0.79      0.86      0.81      0.83      0.68      2039
      Trees       0.95      0.98      0.99      0.96      0.98      0.97       633
      Wheat       0.97      0.93      1.00      0.95      0.96      0.92       110

avg / total       0.81      0.81      0.90      0.81      0.86      0.73      4572
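
Since print_classification_report refits the grid search on the training split, the selected oversampler parameters and scores can be inspected afterwards. A minimal sketch; the winning values depend on the data split:

# Hyperparameters and cross-validated score chosen by the grid search.
print('Best parameters:', gscv.best_params_)
print(f'Best CV Cohen kappa: {gscv.best_score_:.3f}')

# Cohen kappa of both fitted models on the held-out test split.
_, X_test, _, y_test = splitted_data
for model in (clf, gscv):
    kappa = cohen_kappa_score(y_test, model.predict(X_test))
    print(f'{model.__class__.__name__}: test kappa = {kappa:.3f}')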

Confusion matrix

To describe the per-class performance of a classification model, you can create a normalized confusion matrix. Here, the matrix shows the predictive power of the tuned pipeline of the G-SMOTE oversampler and Decision Tree classifier in discriminating the eight classes of the 220-band AVIRIS hyperspectral dataset (Indian Pine Test Site 3). Each row is normalized by its class support, so the diagonal elements give the proportion of correctly predicted samples per class.
_, X_test, _, y_test = splitted_data
conf_matrix = confusion_matrix(y_test, gscv.predict(X_test), labels=np.unique(y_test))
plot_confusion_matrix(conf_matrix, classes=np.unique(y_test))
plt.show()
[Figure: normalized confusion matrix for the Indian Pines classes]

Total running time of the script: (4 minutes 15.336 seconds)
