Remote sensing examples
The following examples make use of the Forest Cover Type dataset, available through scikit-learn, and the Indian Pines dataset, fetched from OpenML.
# Authors: Joao Fonseca <jpmrfonseca@gmail.com>
# Manvel Khudinyan <armkhudinyan@gmail.com>
# Georgios Douzas <gdouzas@icloud.com>
# Licence: MIT
from collections import Counter
import itertools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import fetch_openml, fetch_covtype
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, make_scorer, cohen_kappa_score
from imblearn.metrics import classification_report_imbalanced
from imblearn.pipeline import make_pipeline, Pipeline
from gsmote import GeometricSMOTE
print(__doc__)
RANDOM_STATE = 5
def print_class_counts(y):
    """Print the class counts."""
    counts = dict(Counter(y))
    class_counts = pd.DataFrame(
        counts.values(), index=counts.keys(), columns=['Count']
    ).sort_index()
    print(class_counts)
def print_classification_report(clf, X_train, X_test, y_train, y_test):
    """Fit classifier and print classification report."""
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    clf_name = clf.__class__.__name__
    div = '=' * len(clf_name)
    title = f'\n{div}\n{clf_name}\n{div}\n'
    print(title, classification_report_imbalanced(y_test, y_pred))
def plot_confusion_matrix(cm, classes):
    """Print and plot the normalized confusion matrix."""
    # Normalize each row so that values are per-class proportions.
    cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=90)
    plt.yticks(tick_marks, classes)
    # Annotate each cell, switching text color for dark backgrounds.
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], '.2f'),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()
Forest Cover Type
The samples in this dataset correspond to 30×30 m patches of forest in the US, collected for the task of predicting each patch's cover type, i.e. the dominant species of tree. There are seven cover types, making this a multiclass classification problem. Each sample has 54 features, described on the dataset's homepage. Some of the features are boolean indicators, while others are discrete or continuous measurements.
Dataset
The function sklearn.datasets.fetch_covtype() will load the dataset, downloading it from the web if necessary. The class counts below show that the dataset is clearly imbalanced.
X, y = fetch_covtype(return_X_y=True)
print_class_counts(y)
Out:
Count
1 211840
2 283301
3 35754
4 2747
5 9493
6 17367
7 20510
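The imbalance can be quantified as the ratio between the largest and smallest class counts; a small sketch reusing Counter:
# Ratio between the majority (class 2) and minority (class 4) counts.
counts = Counter(y)
print(f'Imbalance ratio: {max(counts.values()) / min(counts.values()):.1f}')  # roughly 103 to 1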
Classification
Below we use a Random Forest classifier to predict the cover type of each patch of forest. Two experiments are run: one using only the classifier, and another using a pipeline of Geometric SMOTE and the classifier. A classification report is printed for both experiments.
splitted_data = train_test_split(X, y, test_size=0.95, random_state=RANDOM_STATE, shuffle=True)
clf = RandomForestClassifier(bootstrap=True, n_estimators=10, random_state=RANDOM_STATE)
ovs_clf = make_pipeline(GeometricSMOTE(random_state=RANDOM_STATE), clf)
print_classification_report(clf, *splitted_data)
print_classification_report(ovs_clf, *splitted_data)
Out:
======================
RandomForestClassifier
======================
pre rec spe f1 geo iba sup
1 0.82 0.84 0.89 0.83 0.87 0.75 201463
2 0.85 0.86 0.86 0.86 0.86 0.74 268975
3 0.78 0.86 0.98 0.82 0.92 0.83 33905
4 0.82 0.55 1.00 0.66 0.74 0.53 2619
5 0.81 0.35 1.00 0.49 0.59 0.33 9037
6 0.73 0.50 0.99 0.60 0.71 0.47 16533
7 0.90 0.76 1.00 0.82 0.87 0.73 19430
avg / total 0.83 0.83 0.89 0.83 0.86 0.73 551962
========
Pipeline
========
pre rec spe f1 geo iba sup
1 0.82 0.85 0.90 0.83 0.87 0.75 201463
2 0.85 0.87 0.86 0.86 0.86 0.75 268975
3 0.78 0.86 0.98 0.82 0.92 0.84 33905
4 0.78 0.58 1.00 0.67 0.76 0.56 2619
5 0.83 0.33 1.00 0.47 0.58 0.31 9037
6 0.75 0.49 0.99 0.59 0.70 0.46 16533
7 0.89 0.75 1.00 0.81 0.86 0.72 19430
avg / total 0.83 0.84 0.89 0.83 0.86 0.74 551962
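To see what the oversampling step does before the classifier is fitted, the oversampler can be applied on its own. A minimal sketch, assuming GeometricSMOTE follows the standard imbalanced-learn fit_resample API:
# Apply the oversampler alone and inspect the resampled class counts.
X_train, _, y_train, _ = splitted_data
X_res, y_res = GeometricSMOTE(random_state=RANDOM_STATE).fit_resample(X_train, y_train)
print_class_counts(y_res)  # minority classes are resampled up to the majority count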
Indian Pines
This hyperspectral dataset has 220 spectral bands and 20 m spatial resolution. The imagery was collected on 12 June 1992 and covers a 2.9 by 2.9 km area in Tippecanoe County, Indiana, USA. The area is agricultural and contains eight land-use classes: alfalfa, corn, grass, hay, oats, soybeans, trees, and wheat. The Indian Pines dataset has been widely used for testing and comparing algorithms. The number of samples varies greatly among the classes, making this an imbalanced classification problem. The data are made available by Purdue University (https://engineering.purdue.edu/~biehl/MultiSpec/hyperspectral.html).
Dataset
This dataset provides the data as numpy arrays, with predictor and target variables already split (X and y, respectively). The predictor data consists of 220 features, the target attribute holds the land cover classes, and the dataset has 9144 samples.
# Fetch the Indian Pines dataset from OpenML.
X, y = fetch_openml('Indian_pines', return_X_y=True)
print_class_counts(y)
Out:
Count
Alfalfa 54
Corn 2502
Grass 523
Hay 489
Oats 20
Soybeans 4050
Trees 1294
Wheat 212
Classification
Below we combine the Geometric SMOTE oversampler and a Decision Tree classifier in a pipeline. Scikit-learn's GridSearchCV class is used to find the best parameters of the oversampler.
splitted_data = train_test_split(X, y, test_size=0.5, random_state=RANDOM_STATE, shuffle=True)
param_grid = {
    'gsmote__deformation_factor': [0.25, 0.50, 0.75],
    'gsmote__truncation_factor': [-0.5, 0.0, 0.5],
}
clf = DecisionTreeClassifier(random_state=RANDOM_STATE)
ovs_clf = Pipeline([
    ('gsmote', GeometricSMOTE(random_state=RANDOM_STATE)),
    ('dt', DecisionTreeClassifier(random_state=RANDOM_STATE)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
scoring = make_scorer(cohen_kappa_score)
gscv = GridSearchCV(ovs_clf, param_grid, scoring=scoring, refit=True, cv=cv, n_jobs=-1)
print_classification_report(clf, *splitted_data)
print_classification_report(gscv, *splitted_data)
Out:
======================
DecisionTreeClassifier
======================
pre rec spe f1 geo iba sup
Alfalfa 0.52 0.63 1.00 0.57 0.79 0.60 27
Corn 0.69 0.71 0.88 0.70 0.79 0.61 1223
Grass 0.90 0.84 0.99 0.87 0.91 0.82 291
Hay 0.94 0.92 1.00 0.93 0.96 0.91 239
Oats 0.33 0.20 1.00 0.25 0.45 0.18 10
Soybeans 0.82 0.80 0.85 0.81 0.83 0.68 2039
Trees 0.94 0.98 0.99 0.96 0.99 0.97 633
Wheat 0.97 0.89 1.00 0.93 0.94 0.88 110
avg / total 0.81 0.81 0.90 0.81 0.85 0.73 4572
============
GridSearchCV
============
pre rec spe f1 geo iba sup
Alfalfa 0.53 0.70 1.00 0.60 0.84 0.68 27
Corn 0.69 0.71 0.88 0.70 0.79 0.62 1223
Grass 0.87 0.86 0.99 0.86 0.92 0.84 291
Hay 0.93 0.92 1.00 0.93 0.96 0.91 239
Oats 0.64 0.90 1.00 0.75 0.95 0.89 10
Soybeans 0.82 0.79 0.86 0.81 0.83 0.68 2039
Trees 0.95 0.98 0.99 0.96 0.98 0.97 633
Wheat 0.97 0.93 1.00 0.95 0.96 0.92 110
avg / total 0.81 0.81 0.90 0.81 0.86 0.73 4572
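Because refit=True, the fitted GridSearchCV object also exposes the hyperparameters selected for the oversampler and the corresponding cross-validated kappa score:
# Inspect the selected hyperparameters and the best cross-validated score.
print(gscv.best_params_)
print(f'Best cross-validated kappa: {gscv.best_score_:.3f}')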
Confusion matrix
To describe the per-class performance of the classification models, you can create a normalized confusion matrix. In particular, the matrix below represents the predictive power of the Decision Tree classifier combined with the Geometric SMOTE oversampler in the discrimination of the eight classes of the 220-band AVIRIS hyperspectral image dataset (Indian Pine Test Site 3). The diagonal elements represent the proportion of correctly predicted samples for each class.
_, X_test, _, y_test = splitted_data
conf_matrix = confusion_matrix(y_test, gscv.predict(X_test), labels=np.unique(y_test))
plot_confusion_matrix(conf_matrix, classes=np.unique(y_test))
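Note that plot_confusion_matrix only draws on the current figure; when running the script outside the sphinx-gallery environment, display it explicitly:
plt.show()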
Total running time of the script: (4 minutes 15.336 seconds)