Deep Music Genre Classification – My Awesome Machine Learning Blog

Here is the link to my source code:

models.py: https://github.com/Sallyliubj/Sallyliubj.github.io/blob/main/posts/Deep%20Music%20Genre%20Classification/models.py

source.py: https://github.com/Sallyliubj/Sallyliubj.github.io/blob/main/posts/Deep%20Music%20Genre%20Classification/source.py

Objective:

This blog post consists of five parts:

Download the music dataframe, select text features and engineered features used for classification, and convert target labels (genres) into integers.
Prepare the dataloader class for my networks, create pipelines, text vectorization, and batch collation functions to return batches of data as tensor.
Implement three Neural Networks for different classifications, implement training loop and evaluation loop to train the data and compare each of the three models on validation data.
Visualize the word embeddings, and explore words that are related to emotions.
(Optional) Create some interesting visualizations that might highlight differences between genres in terms of some of the engineered features.

import pandas as pd
import torch
! pip3 install torchinfo
from torchinfo import summary
from torch import nn
import numpy as np

# for embedding visualization later
! pip3 install plotly
import plotly.express as px 
import plotly.io as pio

# for VSCode plotly rendering
pio.renderers.default = "plotly_mimetype+notebook"

# for appearance
pio.templates.default = "plotly_white"

# for train-test split
from sklearn.model_selection import train_test_split

# for suppressing bugged warnings from torchinfo
import warnings 
warnings.filterwarnings("ignore", category = UserWarning)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

criterion=nn.BCEWithLogitsLoss()
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

Requirement already satisfied: torchinfo in /Users/sallyliu/opt/anaconda3/envs/ml-0451/lib/python3.9/site-packages (1.7.2)
Requirement already satisfied: plotly in /Users/sallyliu/opt/anaconda3/envs/ml-0451/lib/python3.9/site-packages (5.14.1)
Requirement already satisfied: tenacity>=6.2.0 in /Users/sallyliu/opt/anaconda3/envs/ml-0451/lib/python3.9/site-packages (from plotly) (8.2.2)
Requirement already satisfied: packaging in /Users/sallyliu/opt/anaconda3/envs/ml-0451/lib/python3.9/site-packages (from plotly) (22.0)

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload

Part 1: Prepare Dataset

I download a Pandas dataframe containing information on 28,000 musical tracks produced between the years 1950 and 2019.

import pandas as pd

url = "https://raw.githubusercontent.com/PhilChodrow/PIC16B/master/datasets/tcc_ceds_music.csv"
df = pd.read_csv(url)
#saving original dataframe
df_raw = df.copy()
df.head()

	Unnamed: 0	artist_name	track_name	release_date	genre	lyrics	len	dating	violence	world/life	...	sadness	feelings	danceability	loudness	acousticness	instrumentalness	valence	energy	topic	age
0	0	mukesh	mohabbat bhi jhoothi	1950	pop	hold time feel break feel untrue convince spea...	95	0.000598	0.063746	0.000598	...	0.380299	0.117175	0.357739	0.454119	0.997992	0.901822	0.339448	0.137110	sadness	1.0
1	4	frankie laine	i believe	1950	pop	believe drop rain fall grow believe darkest ni...	51	0.035537	0.096777	0.443435	...	0.001284	0.001284	0.331745	0.647540	0.954819	0.000002	0.325021	0.263240	world/life	1.0
2	6	johnnie ray	cry	1950	pop	sweetheart send letter goodbye secret feel bet...	24	0.002770	0.002770	0.002770	...	0.002770	0.225422	0.456298	0.585288	0.840361	0.000000	0.351814	0.139112	music	1.0
3	10	pérez prado	patricia	1950	pop	kiss lips want stroll charm mambo chacha merin...	54	0.048249	0.001548	0.001548	...	0.225889	0.001548	0.686992	0.744404	0.083935	0.199393	0.775350	0.743736	romantic	1.0
4	12	giorgos papadopoulos	apopse eida oneiro	1950	pop	till darling till matter know till dream live ...	48	0.001350	0.001350	0.417772	...	0.068800	0.001350	0.291671	0.646489	0.975904	0.000246	0.597073	0.394375	romantic	1.0

5 rows × 31 columns

Specify the engineered features used for classification:

engineered_features = ['dating', 'violence', 'world/life', 'night/time','shake the audience','family/gospel', 'romantic', 'communication','obscene', 'music', 'movement/places', 'light/visual perceptions','family/spiritual', 'like/girls', 'sadness', 'feelings', 'danceability','loudness', 'acousticness', 'instrumentalness', 'valence', 'energy']  
len(engineered_features)

There are 22 engineered features.

Find out all the genres in the dataframe:

genre_df = df.groupby("genre").size()
genre_df

genre
blues      4604
country    5445
hip hop     904
jazz       3845
pop        7042
reggae     2498
rock       4034
dtype: int64

There are seven genres in the dataframe. Here, I want to reduce the data size by only selecting three genres for classification.

I give each of the genres an integer that I’ll use to encode the genre in the target variable.

genres = {
    "blues" : 0,
    "pop" : 1, 
    "country" : 2
    }
df = df[df["genre"].apply(lambda x: x in genres.keys())]
df["genre"] = df["genre"].apply(genres.get)

Drop the unused columns from the original dataframe:

I select only text features, engineered features, and genres for the dataframe to be used for training:

genres_df= df[["genre","lyrics"]]
engineered_df = df[engineered_features]
df = pd.concat([genres_df,engineered_df],axis = 1)
df.head()

	genre	lyrics	dating	violence	world/life	night/time	shake the audience	family/gospel	romantic	communication	...	family/spiritual	like/girls	sadness	feelings	danceability	loudness	acousticness	instrumentalness	valence	energy
0	1	hold time feel break feel untrue convince spea...	0.000598	0.063746	0.000598	0.000598	0.000598	0.048857	0.017104	0.263751	...	0.000598	0.000598	0.380299	0.117175	0.357739	0.454119	0.997992	0.901822	0.339448	0.137110
1	1	believe drop rain fall grow believe darkest ni...	0.035537	0.096777	0.443435	0.001284	0.001284	0.027007	0.001284	0.001284	...	0.051124	0.001284	0.001284	0.001284	0.331745	0.647540	0.954819	0.000002	0.325021	0.263240
2	1	sweetheart send letter goodbye secret feel bet...	0.002770	0.002770	0.002770	0.002770	0.002770	0.002770	0.158564	0.250668	...	0.002770	0.002770	0.002770	0.225422	0.456298	0.585288	0.840361	0.000000	0.351814	0.139112
3	1	kiss lips want stroll charm mambo chacha merin...	0.048249	0.001548	0.001548	0.001548	0.021500	0.001548	0.411536	0.001548	...	0.001548	0.081132	0.225889	0.001548	0.686992	0.744404	0.083935	0.199393	0.775350	0.743736
4	1	till darling till matter know till dream live ...	0.001350	0.001350	0.417772	0.001350	0.001350	0.001350	0.463430	0.001350	...	0.029755	0.001350	0.068800	0.001350	0.291671	0.646489	0.975904	0.000246	0.597073	0.394375

5 rows × 24 columns

Now we can see that genres have been encoded into integers.

Part 2: Prepare DataLoader

Next I want to wrap the Pandas data frame as a torch data set.

I implement my own Dataset named DataFromDF, where I implement getitem() to help separate the input data into the text features, engineered features, and labels.

from source import DataFromDF
from torch.utils.data import Dataset, DataLoader

Perform train test split and make Datasets from each one:

df_train, df_val = train_test_split(df, shuffle = True, test_size = 0.2)

train_data = DataFromDF(df_train)
val_data  = DataFromDF(df_val)

Text Vectorization

Here, I build a mapping from words to integers using “build_vocab_from_iterator” function.

#! pip3 install torchtext
import torchtext
from torchtext.data import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

from source import Helper
tokenizer = get_tokenizer('basic_english')
Helper = Helper()

vocab = build_vocab_from_iterator(Helper.yield_tokens(train_data), specials=["<unk>"], min_freq = 50)
vocab.set_default_index(vocab["<unk>"])

Creat separate pipelines for text and label:

Represent each headline as a sequence of integers using the vocab.
Pad the headlines with an unused integer index if necessary so that all headlines have the same length.

max_len= 30   
num_tokens = len(vocab.get_itos())

def text_pipeline(x):
        tokens = vocab(tokenizer(x))
        
        y = torch.zeros(max_len, dtype=torch.int64) + num_tokens
        if len(tokens) > max_len:
            tokens = tokens[0:max_len]
        y[0:len(tokens)] = torch.tensor(tokens,dtype=torch.int64)
        return y
    
label_pipeline = lambda x: int(x)

Create collate_batch function that returns the batch of heeadlines as a consolidated tensor:

def collate_batch(batch):
        label_list, text_list, engineered_list = [], [], []
        for (_text, _engineered, _label) in batch:

            # add label to list
            label_list.append(label_pipeline(_label))

            # add text (as sequence of integers) to list
            processed_text = text_pipeline(_text)    
            text_list.append(processed_text)

            # add engineered features to list
            engineered_list.append(_engineered)

        label_list = torch.tensor(label_list, dtype=torch.int64)
        engineered_list = torch.tensor(engineered_list)
        text_list = torch.stack(text_list)
        return label_list.to(device), text_list.to(device), engineered_list.to(device)

Now we can create batches of data containing both the text features and the engineered features:

train_loader = DataLoader(train_data, batch_size=4, shuffle=True, collate_fn=collate_batch)
val_loader = DataLoader(val_data, batch_size=4, shuffle=True, collate_fn=collate_batch)

Part 3: Modeling

I implement three Deep Neural Networks for different classifications:

from models import TextClassificationModel
from models import EngineeredClassificationModel
from models import CombinedNet

Training loop:

In my train() function, I pass the arguments that state whether the model being trained should use only the text features, only the engineered features, or both.

from torch.nn.modules.module import T
import time

def train(dataloader, text = True, engineered = False, k_epochs=5):

    loss_fn = torch.nn.CrossEntropyLoss()

    # Use only text features
    if (text == True) and (engineered == False):
      model = TextClassificationModel(vocab_size, embedding_dim, max_len, 3).to(device)
      model.double()
      optimizer = torch.optim.Adam(model.parameters(), lr=.1)
      
      for epoch in range(1, k_epochs+1): 
        epoch_start_time = time.time()
        # keep track of some counts for measuring accuracy
        total_acc, total_count = 0, 0
        log_interval = 300
        start_time = time.time()  
          
        for idx, (label, text, engineered) in enumerate(dataloader):
          # zero gradients
          optimizer.zero_grad()       
          # form prediction on batch
          predicted_label = model(text)
          # evaluate loss on prediction
          loss = loss_fn(predicted_label, label)
          # compute gradient
          loss.backward()
          # take an optimization step
          optimizer.step()

          # for printing accuracy
          total_acc   += (predicted_label.argmax(1) == label).sum().item()
          total_count += label.size(0)
      
        print(f'| epoch {epoch:3d} | train accuracy {total_acc/total_count:8.3f} | time: {time.time() - epoch_start_time:5.2f}s')

    # Use only engineered features
    elif (text == False) and (engineered == True):
      model = EngineeredClassificationModel().to(device)
      model.double()
      optimizer = torch.optim.Adam(model.parameters(), lr=.1)
        
      for epoch in range(1, k_epochs+1): 
        epoch_start_time = time.time()
        # keep track of some counts for measuring accuracy
        total_acc, total_count = 0, 0
        log_interval = 300
        start_time = time.time()  
          
        for idx, (label, text, engineered) in enumerate(dataloader):
          # zero gradients
          optimizer.zero_grad()
        
          # form prediction on batch
          predicted_label = model(engineered)
          # evaluate loss on prediction
          loss = loss_fn(predicted_label, label)
          # compute gradient
          loss.backward()
          # take an optimization step
          optimizer.step()

          # for printing accuracy
          total_acc   += (predicted_label.argmax(1) == label).sum().item()
          total_count += label.size(0)

        print(f'| epoch {epoch:3d} | train accuracy {total_acc/total_count:8.3f} | time: {time.time() - epoch_start_time:5.2f}s')

    # Use both text and engineered features    
    elif (text==True) and (engineered == True):
      model = CombinedNet(vocab_size, embedding_dim, max_len, num_class).to(device)
      model.double()
      optimizer = torch.optim.Adam(model.parameters(), lr=.1)

      for epoch in range(1, k_epochs+1): 
        epoch_start_time = time.time()
        # keep track of some counts for measuring accuracy
        total_acc, total_count = 0, 0
        log_interval = 300
        start_time = time.time()  
          
        for idx, (label, text, engineered) in enumerate(dataloader):
            # zero gradients
            optimizer.zero_grad()
        
            # form prediction on batch
            predicted_label = model([text, engineered])
            # evaluate loss on prediction
            loss = loss_fn(predicted_label, label)
            # compute gradient
            loss.backward()
            # take an optimization step
            optimizer.step()

            # for printing accuracy
            total_acc   += (predicted_label.argmax(1) == label).sum().item()
            total_count += label.size(0)
        
        print(f'| epoch {epoch:3d} | train accuracy {total_acc/total_count:8.3f} | time: {time.time() - epoch_start_time:5.2f}s')

Evaluation Loop:

In my evaludate() function, I use the same mechanism as the training loop to determine which part of the data should be passed to the model.

def evaluate(dataloader, text = True, engineered = False):

    total_acc, total_count = 0, 0

    # Use only text features
    if (text == True) and (engineered == False):
      model = TextClassificationModel(vocab_size, embedding_dim, max_len, num_class).to(device)
      model.double()
      with torch.no_grad():
          for idx,(label, text, engineered) in enumerate(dataloader):
              predicted_label = model(text)
              total_acc += (predicted_label.argmax(1) == label).sum().item()
              total_count += label.size(0)
    
    # Use only engineered features
    elif (text == False) and (engineered == True):
      model = EngineeredClassificationModel().to(device)
      model.double()
      with torch.no_grad():
          for idx,(label, text, engineered) in enumerate(dataloader):
              predicted_label = model(engineered)
              total_acc += (predicted_label.argmax(1) == label).sum().item()
              total_count += label.size(0)
    
    # Use both text and engineered features
    elif (text == True) and (engineered == True):
      model = CombinedNet(vocab_size, embedding_dim, max_len, num_class).to(device)
      model.double()
      with torch.no_grad():
          for idx,(label, text, engineered) in enumerate(dataloader):
              predicted_label = model([text,engineered])
              total_acc += (predicted_label.argmax(1) == label).sum().item()
              total_count += label.size(0)
    
    return total_acc/total_count

First Network:

Using only the lyrics to perform the classification task.

vocab_size = len(vocab)
embedding_dim = 3
max_len = 30
num_class = 3

train(train_loader, text=True, engineered=False, k_epochs=10)

| epoch   1 | train accuracy    0.491 | time:  7.36s
| epoch   2 | train accuracy    0.560 | time:  8.58s
| epoch   3 | train accuracy    0.585 | time:  7.60s
| epoch   4 | train accuracy    0.600 | time:  7.23s
| epoch   5 | train accuracy    0.611 | time:  7.17s
| epoch   6 | train accuracy    0.615 | time:  7.55s
| epoch   7 | train accuracy    0.622 | time:  7.28s
| epoch   8 | train accuracy    0.629 | time:  7.36s
| epoch   9 | train accuracy    0.625 | time:  7.13s
| epoch  10 | train accuracy    0.630 | time:  7.47s

Compare the model on validation data:

evaluate(val_loader,text=True, engineered=False)

0.3737935068733548

The accuracy on evaluation data is much lower than what we achieve on training data, which suggests that there might be overfitting.

Regardless, the model has made significant improvement on base rate (41%), as computed below:

df_train.groupby("genre").size() / len(df_train)

genre
0    0.265360
1    0.412815
2    0.321826
dtype: float64

Second Network:

Using only the engineered features to perform the classification task.

train(train_loader, text=False, engineered=True, k_epochs=10)

| epoch   1 | train accuracy    0.435 | time:  8.34s
| epoch   2 | train accuracy    0.447 | time:  8.61s
| epoch   3 | train accuracy    0.439 | time:  8.06s
| epoch   4 | train accuracy    0.429 | time:  8.12s
| epoch   5 | train accuracy    0.454 | time:  8.52s
| epoch   6 | train accuracy    0.437 | time:  8.02s
| epoch   7 | train accuracy    0.449 | time:  7.72s
| epoch   8 | train accuracy    0.439 | time:  7.98s
| epoch   9 | train accuracy    0.419 | time:  7.98s
| epoch  10 | train accuracy    0.457 | time:  7.81s

Compare the model on validation data:

evaluate(val_loader,text=False, engineered=True)

0.40889148873939746

The accuracy on evaluation data is slightly lower than what we achieve on training data, so we can conclude that there is no overfitting.

It can be seen that the model has made a slight improvement on the base rate.

Third Network:

Using both the lyrics and the engineered features to perform the classification task.

train(train_loader, text=True, engineered=True, k_epochs=10)

| epoch   1 | train accuracy    0.408 | time:  9.13s
| epoch   2 | train accuracy    0.409 | time:  8.54s
| epoch   3 | train accuracy    0.411 | time:  8.80s
| epoch   4 | train accuracy    0.407 | time:  8.45s
| epoch   5 | train accuracy    0.409 | time:  8.45s
| epoch   6 | train accuracy    0.410 | time:  8.94s
| epoch   7 | train accuracy    0.412 | time:  8.67s
| epoch   8 | train accuracy    0.409 | time:  8.49s
| epoch   9 | train accuracy    0.409 | time:  8.86s
| epoch  10 | train accuracy    0.410 | time:  8.82s

Compare the model on validation data:

evaluate(val_loader,text=True, engineered=True)

0.4205908160280784

The accuracy on evaluation data is very close to what we achieve on training data, so we can conclude that there are no overfitting.

It can be seen that the model has made a slight improvement on the base rate, but the training scores are relatively consistent.

Part 4: Visualize Word Embeddings

I extract the embedding matrix from the corresponding layer of my model, and the words from vocabular:

embedding_matrix = model.embedding.cpu().weight.data.numpy()
tokens = vocab.get_itos()

Then I use PCA to extract a 2-dimensional representation of the matrix that we can plot:

from sklearn.decomposition import PCA
pca = PCA(n_components=2)
weights = pca.fit_transform(embedding_matrix)

Create the word embedding data frame for plotting:

tokens = vocab.get_itos()
tokens.append(" ")
embedding_df = pd.DataFrame({
    'word' : tokens, 
    'x0'   : weights[:,0],
    'x1'   : weights[:,1]
})

Visualize words related to emotions

I’m interested in finding what kinds of words my model associates with the two opposite emotions – happiness and sadness, so I choose several words which I conjectured to be related to happiness and sadness and then visualize them in the scatter plot:

sad = ["sadness", "upset", "depression"]
happy = ["happy", "joy", "delightful"]

highlight_1 = ["laugh","smile","sun","rainbow","light","sweet","chocolate","hug","sing","dance"]
highlight_2 = ["rain","cry","heartbroken","melancholy","cloudy","dark","dim","storm"]

def emotion_mapper(x):
    if x in sad:
        return 1
    elif x in happy:
        return 4
    elif x in highlight_1:
        return 3
    elif x in highlight_2:
        return 2
    else:
        return 0

embedding_df["highlight"] = embedding_df["word"].apply(emotion_mapper)
embedding_df["size"]      = np.array(1.0 + 50*(embedding_df["highlight"] > 0))

sub_df = embedding_df[embedding_df["highlight"] > 0]

import plotly.express as px 

fig = px.scatter(sub_df, 
                 x = "x0", 
                 y = "x1", 
                 color = "highlight",
                 size = list(sub_df["size"]),
                 size_max = 10,
                 hover_name = "word", 
                 text = "word")

fig.update_traces(textposition='top center')


fig.show()

To my surprise, my guessings for words of sadness and happiness do not seem to be unrelated. For example, from the plot, the distance between {“cry”, “heartbroken”, “depression”} and {“sweet”, “smile”, “joy”} are very small, suggesting that they have similarities.

On the other hands, I do find the distance between words {“dance”, “light”, “delightful”, “hug”} very small, which verifies my conjecture that those words are related to happinesss.

Part 5: Optional Extras

I want to address the following two questions:

1. Does blues music tend to have more sadness than other genres? Does pop or rock have more energy?

df_plot = df_raw.groupby(["genre"])[["sadness","energy"]].aggregate(np.mean)

import seaborn as sns
from matplotlib import pyplot as plt

fig, axs = plt.subplots(ncols=2,figsize= (15,8))
ax = sns.barplot(data=df_plot,x=df_plot.index,y="sadness",ax = axs[0])
ax = sns.barplot(data=df_plot,x=df_plot.index,y="energy",ax = axs[1])

Based on the plot above, it seems that blues does not have more sadness than other genres. In fact, country music has the most sadness among all. However, pop and rock do have more energy than other genres.

2. Are acousticness and instrumentalness similar features? Can you find any patterns in when they disagree?

Instrumentalness measures how likely the music contains no spoken word vocals
Acousticness measures how likely the music contains no “electrical amplifications of sounds”.

Theoretically, the two features are opposite: the higher the acousticness, the lower the instrumentalness, and vice versa.

df_raw.corr()
df_merge = pd.concat([df_raw["acousticness"],df_raw["instrumentalness"]],axis = 1)
df_merge.corr()

/var/folders/h4/g2m40gs10fl3yvzdr50l4kb80000gn/T/ipykernel_88446/3954789753.py:1: FutureWarning:

The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.

	acousticness	instrumentalness
acousticness	1.000000	0.007762
instrumentalness	0.007762	1.000000

import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(figsize=(20, 10))
df_merge["average"] = (df_merge['acousticness']+df_merge['instrumentalness'])/2
sns.lineplot(data=df_merge[50:200],ax=ax, marker=".")

According to the plot, it seems that generally, when acousticness is high, instrumentalness tends to be low.

Conclusion

Initially, I wanted to select all the seven genres to train my model for music classification. However, I found that the model became very slow (more than 20 seconds each epoch) and achieved low accuracy (appriximately 0.2), so I then decided to select only three genres for classification. The accuracy did increase, and it took less time to complete each training epoch.
However, the accuracy achieved by the model still could not be considered as high. Specifically, the model for engineered features could only achieve less than 0.4 accuracy (not much greater than random guesswork), which suggested that the model still need improvement.
Then I revised my models by adding nn.Dropout layers and increasing hidden layers. While my models still could not achieve very high accuracy, they all consistently scored above the base rate after training. So I would consider my models to be successful. Comparing all the three models, the first model (textClassification) had the best performance and could reach the accuracy of over 60%. However, the third model(combinedNet) did not show any significant improvement on training score.