Classifying Palmer Penguins

In this blog post, I work through a complete example of the standard machine learning workflow. My goal is to determine the smallest number of measurements necessary to confidently determine the species of a penguin.
Author

Sally Liu

Published

March 10, 2023

Objective:

This blog post consists of three parts:

  1. My exploration of the penguins dataset, including the construction of some displayed figures and a displayed table.

  2. My feature selection process and the choice and training of a model on the selected features, which achieves 100% accuracy.

  3. My evaluation of model performance by plotting the decision regions of my fitted model.

First, I access the training data and take a look at it:

import pandas as pd

train_url = "https://raw.githubusercontent.com/middlebury-csci-0451/CSCI-0451/main/data/palmer-penguins/train.csv"
train = pd.read_csv(train_url)
train.head()
studyName Sample Number Species Region Island Stage Individual ID Clutch Completion Date Egg Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Body Mass (g) Sex Delta 15 N (o/oo) Delta 13 C (o/oo) Comments
0 PAL0708 27 Gentoo penguin (Pygoscelis papua) Anvers Biscoe Adult, 1 Egg Stage N46A1 Yes 11/29/07 44.5 14.3 216.0 4100.0 NaN 7.96621 -25.69327 NaN
1 PAL0708 22 Gentoo penguin (Pygoscelis papua) Anvers Biscoe Adult, 1 Egg Stage N41A2 Yes 11/27/07 45.1 14.5 215.0 5000.0 FEMALE 7.63220 -25.46569 NaN
2 PAL0910 124 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N67A2 Yes 11/16/09 41.4 18.5 202.0 3875.0 MALE 9.59462 -25.42621 NaN
3 PAL0910 146 Adelie Penguin (Pygoscelis adeliae) Anvers Dream Adult, 1 Egg Stage N82A2 Yes 11/16/09 39.0 18.7 185.0 3650.0 MALE 9.22033 -26.03442 NaN
4 PAL0708 24 Chinstrap penguin (Pygoscelis antarctica) Anvers Dream Adult, 1 Egg Stage N85A2 No 11/28/07 50.6 19.4 193.0 3800.0 MALE 9.28153 -24.97134 NaN

Part 1. Explore the dataset

Below is the basic information about the dataframe, from which we can see all the features and their data types (categorical or numerical):

train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 275 entries, 0 to 274
Data columns (total 17 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   studyName            275 non-null    object 
 1   Sample Number        275 non-null    int64  
 2   Species              275 non-null    object 
 3   Region               275 non-null    object 
 4   Island               275 non-null    object 
 5   Stage                275 non-null    object 
 6   Individual ID        275 non-null    object 
 7   Clutch Completion    275 non-null    object 
 8   Date Egg             275 non-null    object 
 9   Culmen Length (mm)   273 non-null    float64
 10  Culmen Depth (mm)    273 non-null    float64
 11  Flipper Length (mm)  273 non-null    float64
 12  Body Mass (g)        273 non-null    float64
 13  Sex                  265 non-null    object 
 14  Delta 15 N (o/oo)    262 non-null    float64
 15  Delta 13 C (o/oo)    263 non-null    float64
 16  Comments             22 non-null     object 
dtypes: float64(6), int64(1), object(10)
memory usage: 36.6+ KB

We can also take a look at the number of penguins of each species:

#number of species
train['Species'].value_counts()
Adelie Penguin (Pygoscelis adeliae)          118
Gentoo penguin (Pygoscelis papua)            101
Chinstrap penguin (Pygoscelis antarctica)     56
Name: Species, dtype: int64

It seems that this dataset does not contain an even distribution of penguin species, which might affect the training of the model.
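To quantify that imbalance, the class proportions can be computed directly; a small sketch:

# Sketch: share of the training set belonging to each species
train['Species'].value_counts(normalize=True).round(2)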

(1) Construct displayed figure

Below I construct a displayed figure to visualize the body mass of female and male penguins on all three islands:

import seaborn as sns

sns.catplot(data=train, x="Sex", y="Body Mass (g)", hue="Island", kind="swarm")

  • According to the graph, both male and female penguins on Biscoe Island have greater body mass overall.

  • The penguins on Torgersen Island and Dream Island have similar overall body mass.

Then I visualize the relationship between culmen length and culmen depth of the three penguin species:

sns.scatterplot(data=train, x="Culmen Length (mm)", y="Culmen Depth (mm)", hue="Species")

  • From the plot above, we can easily distinguish three clusters representing the three penguin species.

  • Therefore, it seems possible to classify them from the combination of culmen depth and culmen length.

(2) Construct displayed table

Then, I construct a displayed table to see how Culmen Length and Body Mass vary with the island the penguins live on and their sex.

import numpy as np

train.groupby(['Island','Sex'])[['Culmen Length (mm)','Body Mass (g)' ]].aggregate([np.mean, len]).round(2)
                     Culmen Length (mm)        Body Mass (g)
                     mean        len           mean        len
Island      Sex
Biscoe      .        44.50       1             4875.00     1
            FEMALE   42.94       61            4253.28     61
            MALE     47.54       70            5169.29     70
Dream       FEMALE   42.45       49            3435.20     49
            MALE     46.49       47            3975.53     47
Torgersen   FEMALE   37.61       18            3380.56     18
            MALE     40.66       19            3990.79     19
  • According to the table above, the penguins living on Biscoe Island generally have the longest culmen, while the penguins living on Torgersen Island generally have the shortest culmen.

  • Also, female penguins tend to have shorter culmen and smaller body mass than male penguins.

Part 2. Model

In this part, I find three good features of the data and train a model on those features, which eventually achieves 100% testing accuracy.

Data Preparation:

I first prepare the data by encoding the categorical features into numerical columns:

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(train["Species"])

def prepare_data(df):
  # drop identifier and metadata columns that are not useful for prediction
  df = df.drop(["studyName", "Sample Number", "Individual ID", "Date Egg", "Comments", "Region"], axis = 1)
  # remove rows with an invalid "." entry for Sex, then drop rows with missing values
  df = df[df["Sex"] != "."]
  df = df.dropna()
  # encode the species labels as integers
  y = le.transform(df["Species"])
  df = df.drop(["Species"], axis = 1)
  # one-hot encode the remaining categorical columns
  df = pd.get_dummies(df)
  return df, y

X_train_raw, y_train_raw = prepare_data(train) 
X_train, y_train = prepare_data(train) # Will be normalized later

Check the result of the encoding and verify that all the categorical features are encoded into numerical values.

X_train
Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Body Mass (g) Delta 15 N (o/oo) Delta 13 C (o/oo) Island_Biscoe Island_Dream Island_Torgersen Stage_Adult, 1 Egg Stage Clutch Completion_No Clutch Completion_Yes Sex_FEMALE Sex_MALE
1 45.1 14.5 215.0 5000.0 7.63220 -25.46569 1 0 0 1 0 1 1 0
2 41.4 18.5 202.0 3875.0 9.59462 -25.42621 0 0 1 1 0 1 0 1
3 39.0 18.7 185.0 3650.0 9.22033 -26.03442 0 1 0 1 0 1 0 1
4 50.6 19.4 193.0 3800.0 9.28153 -24.97134 0 1 0 1 1 0 0 1
5 33.1 16.1 178.0 2900.0 9.04218 -26.15775 0 1 0 1 0 1 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
269 41.1 17.5 190.0 3900.0 8.94365 -26.06943 0 1 0 1 0 1 0 1
270 45.4 14.6 211.0 4800.0 8.24515 -25.46782 1 0 0 1 0 1 1 0
271 36.2 17.2 187.0 3150.0 9.04296 -26.19444 0 0 1 1 1 0 1 0
272 50.0 15.9 224.0 5350.0 8.20042 -26.39677 1 0 0 1 0 1 0 1
273 48.2 14.3 210.0 4600.0 7.68870 -25.50811 1 0 0 1 0 1 1 0

256 rows × 14 columns

Now we can see that the qualitative columns, such as Island, Sex, and Clutch Completion, are all encoded with numerical values.

Feature selection

Now I want to select three good features (two quantitative features and one qualitative feature) that can be used to predict penguin species.

  • I will first determine some relatively good quantitative features based on the visualization of data points.

  • Then I will find the best combination of three features by searching over and comparing those relatively good features.

(1) Visualize feature relationships

First, I use seaborn to visualize all pairs of quantitative features and see which ones of them can be used for prediction:

import matplotlib.pyplot as plt

sns.pairplot(train, hue="Species", height=3, diag_kind="hist")
plt.show()

  • By analyzing the scatter matrix above, I find that the pairwise combinations of “Culmen Depth”, “Culmen Length”, and “Flipper Length” already show three distinct groups.

  • I can exclude the features “Body Mass”, “Delta 15 N”, and “Delta 13 C”, as their points overlap heavily and therefore do not separate the species well.

(2) Combination Method for feature selection

Having sifted out some relatively good features, I then want to determine the best two quantitative features together with one qualitative feature.

  • Specifically, I use the combinations method to try all combinations of (1 qualitative feature + 2 selected quantitative features) to see which combination achieves the best training accuracy.
from itertools import combinations
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier

# define the three models (fitted below on each feature combination)
svc = SVC(kernel='rbf', C = 800)
dtc = DecisionTreeClassifier(max_depth = 5)
rf = RandomForestClassifier(max_depth=5,n_estimators = 11,random_state=0)

# select relatively good quantitative columns and qualitative columns
all_qual_cols = [ "Sex", "Island"]
all_quant_cols = ['Culmen Length (mm)', 'Flipper Length (mm)','Culmen Depth (mm)']

for qual in all_qual_cols: 
  qual_cols = [col for col in X_train.columns if qual in col ]
  for pair in combinations(all_quant_cols, 2):
    cols = qual_cols + list(pair) 
    print(cols)
     
    #get accuracy for all three models
    svc.fit(X_train[cols], y_train)
    dtc.fit(X_train[cols], y_train)
    rf.fit(X_train[cols], y_train)
    print("svc score:", svc.score(X_train[cols], y_train))
    print("dtc score:", dtc.score(X_train[cols], y_train))
    print("rf score:", rf.score(X_train[cols], y_train))
    
['Sex_FEMALE', 'Sex_MALE', 'Culmen Length (mm)', 'Flipper Length (mm)']
svc score: 0.95703125
dtc score: 0.9921875
rf score: 0.9921875
['Sex_FEMALE', 'Sex_MALE', 'Culmen Length (mm)', 'Culmen Depth (mm)']
svc score: 0.99609375
dtc score: 0.99609375
rf score: 0.99609375
['Sex_FEMALE', 'Sex_MALE', 'Flipper Length (mm)', 'Culmen Depth (mm)']
svc score: 0.796875
dtc score: 0.83984375
rf score: 0.84765625
['Island_Biscoe', 'Island_Dream', 'Island_Torgersen', 'Culmen Length (mm)', 'Flipper Length (mm)']
svc score: 0.95703125
dtc score: 0.99609375
rf score: 1.0
['Island_Biscoe', 'Island_Dream', 'Island_Torgersen', 'Culmen Length (mm)', 'Culmen Depth (mm)']
svc score: 1.0
dtc score: 1.0
rf score: 1.0
['Island_Biscoe', 'Island_Dream', 'Island_Torgersen', 'Flipper Length (mm)', 'Culmen Depth (mm)']
svc score: 0.87109375
dtc score: 0.90625
rf score: 0.9140625

By comparing the three models’ training accuracies for all those combinations, I find that “Island”, “Culmen Length (mm)”, and “Culmen Depth (mm)” are the three best features, with which the SVC, DecisionTree, and RandomForest classifiers can all achieve 100% training accuracy.
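These are training accuracies, so they can be optimistic. As a quick sanity check, the winning combination can also be cross-validated with the same models; a minimal sketch, using the X_train, y_train, and models defined above:

# Sanity-check sketch: cross-validate the winning feature combination
best_cols = ["Island_Biscoe", "Island_Dream", "Island_Torgersen",
             "Culmen Length (mm)", "Culmen Depth (mm)"]
for name, model in [("svc", svc), ("dtc", dtc), ("rf", rf)]:
    cv = cross_val_score(model, X_train[best_cols], y_train, cv=5, scoring="accuracy")
    print(name, "mean cross-validation accuracy:", cv.mean().round(4))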

cols = ["Culmen Depth (mm)", "Culmen Length (mm)", "Island_Biscoe","Island_Dream", "Island_Torgersen"]

Take a look at the selected columns:

df = pd.DataFrame(X_train_raw, columns = ["Culmen Depth (mm)", "Culmen Length (mm)", "Island_Biscoe","Island_Dream", "Island_Torgersen"])
df
Culmen Depth (mm) Culmen Length (mm) Island_Biscoe Island_Dream Island_Torgersen
1 14.5 45.1 1 0 0
2 18.5 41.4 0 0 1
3 18.7 39.0 0 1 0
4 19.4 50.6 0 1 0
5 16.1 33.1 0 1 0
... ... ... ... ... ...
269 17.5 41.1 0 1 0
270 14.6 45.4 1 0 0
271 17.2 36.2 0 0 1
272 15.9 50.0 1 0 0
273 14.3 48.2 1 0 0

256 rows × 5 columns

Model Choices

Now I will decide which model I will use for the training.

  • According to the previous analysis and visualization, the data do not seem to be linearly separable. Therefore, a linear model may not be appropriate for this dataset.
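One quick way to probe that claim is to fit a simple linear classifier on the same selected features and compare its training accuracy; a minimal sketch (it assumes the X_train, y_train, and cols defined above, and I do not rely on it for the results below):

from sklearn.linear_model import LogisticRegression

# Sketch: a linear baseline on the selected features, for comparison only
linear_baseline = LogisticRegression(max_iter=5000)
linear_baseline.fit(X_train[cols], y_train)
print("linear baseline training accuracy:", linear_baseline.score(X_train[cols], y_train))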

To test the model, I download the test data set and prepare it using the prepare_data function.

test_url = "https://raw.githubusercontent.com/middlebury-csci-0451/CSCI-0451/main/data/palmer-penguins/test.csv"
test = pd.read_csv(test_url)

X_test, y_test = prepare_data(test)
To train my model, I will perform the following steps:
  1. Train the model with the train dataset.
  2. Check its training scores and testing scores.
  3. Adjust the hyperparameters using cross validation method.
  4. Test the trained model with the test dataset.
  5. Construct the confusion matrix to validate the model performance.

Support Vector Machine

I choose the SVC model, which takes advantage of kernels to separate and classify data by defining hyperplanes in higher dimensional spaces.

  • Here I use the "rbf" kernel.
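For reference, the rbf kernel measures the similarity between two samples x and x′ as

K(x, x′) = exp(−γ ‖x − x′‖²)

which is where the gamma hyperparameter discussed below comes from: larger gamma makes the similarity decay faster with distance.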

Hyperparameters

  • I take the following two hyperparameters into consideration:
  1. The parameter C indicates how much we care about misclassifications. For large values of C, the algorithm will choose a smaller-margin hyperplane if that hyperplane does a better job of getting all the training points classified correctly. Conversely, a very small value of C will cause the algorithm to look for a larger-margin separating hyperplane, even if that causes more misclassifications.

  2. The parameter gamma determines how far the influence of a single training sample reaches with regard to the decision boundary. For low values of gamma, every training sample influences the boundary, even samples far away from it. For high values of gamma, only the samples close to the boundary influence it.

Data Normalization

The variables in the dataset are measured in different units and therefore have different scales. We would like to have them in a similar scale so that we can compare them in the modelling without giving more importance to the features that have a bigger magnitude. To achieve that, we standardize the numeric columns to have mean zero and variance one.

# standardize the features using the training set's mean and standard deviation
X_test = (X_test - X_train_raw.mean())/X_train_raw.std()
X_train = (X_train_raw - X_train_raw.mean())/X_train_raw.std()
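The same standardization can also be done with scikit-learn's StandardScaler; a minimal sketch (note that StandardScaler divides by the population standard deviation, so its values differ very slightly from the pandas std() used above):

from sklearn.preprocessing import StandardScaler

# Sketch: fit the scaler on the training features only;
# the same fitted scaler would then transform the raw test features
scaler = StandardScaler().fit(X_train_raw)
X_train_scaled = pd.DataFrame(scaler.transform(X_train_raw),
                              columns=X_train_raw.columns, index=X_train_raw.index)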

Take a look at the statistics after normalization

X_train[cols].describe()
         Culmen Depth (mm)  Culmen Length (mm)  Island_Biscoe  Island_Dream  Island_Torgersen
count         2.560000e+02        2.560000e+02   2.560000e+02  2.560000e+02      2.560000e+02
mean          1.124101e-15        1.568190e-15   5.551115e-17 -6.938894e-18      7.632783e-17
std           1.000000e+00        1.000000e+00   1.000000e+00  1.000000e+00      1.000000e+00
min          -2.063704e+00       -2.187067e+00  -1.013763e+00 -7.538723e-01     -3.839323e-01
25%          -7.857451e-01       -8.226041e-01  -1.013763e+00 -7.538723e-01     -3.839323e-01
50%           5.770783e-02        1.725984e-01   9.825705e-01 -7.538723e-01     -3.839323e-01
75%           7.989240e-01        8.300625e-01   9.825705e-01  1.321303e+00     -3.839323e-01
max           2.230238e+00        2.766429e+00   9.825705e-01  1.321303e+00      2.594452e+00

Fit my SVC model:

from sklearn.svm import SVC
cols = ["Culmen Depth (mm)", "Culmen Length (mm)", "Island_Biscoe","Island_Dream", "Island_Torgersen"]

svc = SVC(kernel='rbf')
svc.fit(X_train[cols], y_train)
SVC()

Without adjusting any hyperparameters, see the training and testing accuracy of the model

print(svc.score(X_train[cols], y_train))
print(svc.score(X_test[cols], y_test))
0.99609375
0.9852941176470589

The accuracy is pretty high, but is still less than 1.0.

Now I will adjust two hyperparameters of my SVM model by cross-validation so that it achieves the best accuracy: C and gamma.

First, I want to find the best value for C:

  1. I create a loop to keep track of the value of C along with the change in the cross-validation scores:
C_range=np.arange(1,300)
acc_score=[]

for c in C_range:
    svc = SVC(kernel='rbf', C = c)
    cv_scores = cross_val_score(svc, X_train[cols], y_train, cv=10, scoring='accuracy')
    mean_score = cv_scores.mean()
    #print(f"C = {c}, score = {mean_score.round(5)}")
    acc_score.append(mean_score)


import matplotlib.pyplot as plt
%matplotlib inline

C_range=np.arange(1,300)

# plot the value of C for SVM (x-axis) versus the cross-validated accuracy (y-axis)
plt.plot(C_range,acc_score)
plt.xlabel('Value of C for SVC ')
plt.ylabel('Cross-Validated Accuracy')

Based on the plot, it seems that the cross-validated accuracy achieves its highest score for C roughly between 50 and 175.

So I will set C = 100.

  2. Now I want to find the best gamma value for my SVM using the same method:
gamma_range=[0.0001, 0.0005, 0.001, 0.01,0.02,0.03,0.04,0.05,0.06,0.1,0.3,0.5,1]
acc_score=[]

for g in gamma_range:
    svc = SVC(kernel='rbf', C = 100, gamma=g)
    cv_scores = cross_val_score(svc, X_train[cols], y_train, cv=10, scoring='accuracy')
    mean_score = cv_scores.mean()
    print(f"gamma = {g}, score = {mean_score.round(5)}")
    acc_score.append(mean_score)


import matplotlib.pyplot as plt
%matplotlib inline

gamma_range=[0.0001, 0.0005, 0.001, 0.01,0.02,0.03,0.04,0.05,0.06,0.1,0.3,0.5,1]

# plot the value of gamma for SVM (x-axis) versus the cross-validated accuracy (y-axis)
plt.plot(gamma_range,acc_score)
plt.xlabel('Value of gamma for SVC ')
plt.ylabel('Cross-Validated Accuracy')
gamma = 0.0001, score = 0.98446
gamma = 0.0005, score = 0.99615
gamma = 0.001, score = 0.99615
gamma = 0.01, score = 0.98446
gamma = 0.02, score = 0.99231
gamma = 0.03, score = 0.99231
gamma = 0.04, score = 0.99615
gamma = 0.05, score = 0.99615
gamma = 0.06, score = 0.99615
gamma = 0.1, score = 0.99615
gamma = 0.3, score = 0.98846
gamma = 0.5, score = 0.98846
gamma = 1, score = 0.98446

From the graph above, which traces the cross-validation accuracy as gamma changes, I find that a gamma between 0.04 and 0.1 achieves the best cross-validation accuracy.

So I will set gamma = 0.04.
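Tuning C and gamma one at a time works here, but the two parameters interact; an alternative is to search over both jointly with GridSearchCV. A minimal sketch (the grid values are illustrative, not the ones tuned above):

from sklearn.model_selection import GridSearchCV

# Sketch: joint grid search over C and gamma with 10-fold cross-validation
param_grid = {"C": [1, 10, 100, 300], "gamma": [0.001, 0.01, 0.04, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10, scoring="accuracy")
search.fit(X_train[cols], y_train)
print(search.best_params_, round(search.best_score_, 5))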

Tuning

Now I will add the C and gamma value into my SVM model and see how it performs:

svc = SVC(kernel='rbf',C = 100, gamma =0.04)
svc.fit(X_train[cols], y_train)
SVC(C=100, gamma=0.04)
print(svc.score(X_train[cols], y_train))
print(svc.score(X_test[cols], y_test))

# sanity check: after standardization, each selected feature has unit variance
X_train[cols].var()
1.0
1.0
Culmen Depth (mm)     1.0
Culmen Length (mm)    1.0
Island_Biscoe         1.0
Island_Dream          1.0
Island_Torgersen      1.0
dtype: float64

The accuracy reaches 100%!

Construct Confusion Matrix

from sklearn.metrics import confusion_matrix,ConfusionMatrixDisplay
y_pred = svc.predict(X_test[cols])

cm  = confusion_matrix(y_test, y_pred)
cm
array([[33,  0,  0],
       [ 0, 12,  0],
       [ 0,  0, 23]])

Visualize the confusion matrix:

disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=svc.classes_)
disp.plot()
plt.show()

There are no misclassifications.
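Note that svc.classes_ here are the integer labels produced by the LabelEncoder. To label the matrix with the original species names instead, the encoder's classes can be passed; a small sketch using the same cm:

# Sketch: show the decoded species names on the confusion matrix axes
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=le.classes_)
disp.plot(xticks_rotation=45)
plt.show()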

Part 3. Plotting Decision Regions

Now I will visualize the prediction of my model:

from matplotlib import pyplot as plt
import numpy as np
from matplotlib.patches import Patch

def plot_regions(model, X, y):
    
    x0 = X[X.columns[0]]
    x1 = X[X.columns[1]]
    qual_features = X.columns[2:]
    
    fig, axarr = plt.subplots(1, len(qual_features), figsize = (7, 3))

    # create a grid
    grid_x = np.linspace(x0.min(),x0.max(),501)
    grid_y = np.linspace(x1.min(),x1.max(),501)
    xx, yy = np.meshgrid(grid_x, grid_y)
    
    
    XX = xx.ravel()
    YY = yy.ravel()

    for i in range(len(qual_features)):
      XY = pd.DataFrame({
          X.columns[0] : XX,
          X.columns[1] : YY
      })

      for j in qual_features:
        XY[j] = 0

      XY[qual_features[i]] = 1

      p = model.predict(XY)
      p = p.reshape(xx.shape)
      
      
      # use contour plot to visualize the predictions
      axarr[i].contourf(xx, yy, p, cmap = "jet", alpha = 0.2, vmin = 0, vmax = 2)
      
      ix = X[qual_features[i]] == 1
      # plot the data
      axarr[i].scatter(x0[ix], x1[ix], c = y[ix], cmap = "jet", vmin = 0, vmax = 2)
      
      axarr[i].set(xlabel = X.columns[0], 
            ylabel  = X.columns[1])
      
      patches = []
      for color, spec in zip(["red", "green", "blue"], ["Adelie", "Chinstrap", "Gentoo"]):
        patches.append(Patch(color = color, label = spec))

      plt.legend(title = "Species", handles = patches, loc = "best")
      
      plt.tight_layout() 

Plot the decision regions for my SVC model based on the three features I selected:

svc = SVC(kernel='rbf', C = 100, gamma = 0.04)
# fit on the same (unstandardized) columns that will be plotted, so that the grid of
# predictions and the scattered training points are on the same scale
svc.fit(X_train_raw[cols], y_train_raw)
plot_regions(svc, X_train_raw[cols], y_train_raw)

According to the plot, we can see that all the data are classified correctly!