Deep Music Genre Classification

In this blog post, I use PyTorch to perform classification on a data set of song attributes, comparing approaches that use only lyrics, approaches that use only quantitative audio features, and approaches that use both.
Author

Sally Liu

Published

May 10, 2023

Objective:

This blog post consists of five parts:

  1. Download the music dataframe, select text features and engineered features used for classification, and convert target labels (genres) into integers.

  2. Prepare the dataloader class for my networks, create pipelines, text vectorization, and batch collation functions to return batches of data as tensor.

  3. Implement three Neural Networks for different classifications, implement training loop and evaluation loop to train the data and compare each of the three models on validation data.

  4. Visualize the word embeddings, and explore words that are related to emotions.

  5. (Optional) Create some interesting visualizations that might highlight differences between genres in terms of some of the engineered features.

import pandas as pd
import torch
! pip3 install torchinfo
from torchinfo import summary
from torch import nn
import numpy as np

# for embedding visualization later
! pip3 install plotly
import plotly.express as px 
import plotly.io as pio

# for VSCode plotly rendering
pio.renderers.default = "plotly_mimetype+notebook"

# for appearance
pio.templates.default = "plotly_white"

# for train-test split
from sklearn.model_selection import train_test_split

# for suppressing bugged warnings from torchinfo
import warnings 
warnings.filterwarnings("ignore", category = UserWarning)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

criterion=nn.BCEWithLogitsLoss()
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
Requirement already satisfied: torchinfo in /Users/sallyliu/opt/anaconda3/envs/ml-0451/lib/python3.9/site-packages (1.7.2)
Requirement already satisfied: plotly in /Users/sallyliu/opt/anaconda3/envs/ml-0451/lib/python3.9/site-packages (5.14.1)
Requirement already satisfied: tenacity>=6.2.0 in /Users/sallyliu/opt/anaconda3/envs/ml-0451/lib/python3.9/site-packages (from plotly) (8.2.2)
Requirement already satisfied: packaging in /Users/sallyliu/opt/anaconda3/envs/ml-0451/lib/python3.9/site-packages (from plotly) (22.0)
%load_ext autoreload
%autoreload 2
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload

Part 1: Prepare Dataset

  • I download a Pandas dataframe containing information on 28,000 musical tracks produced between the years 1950 and 2019.
import pandas as pd

url = "https://raw.githubusercontent.com/PhilChodrow/PIC16B/master/datasets/tcc_ceds_music.csv"
df = pd.read_csv(url)
#saving original dataframe
df_raw = df.copy()
df.head()
Unnamed: 0 artist_name track_name release_date genre lyrics len dating violence world/life ... sadness feelings danceability loudness acousticness instrumentalness valence energy topic age
0 0 mukesh mohabbat bhi jhoothi 1950 pop hold time feel break feel untrue convince spea... 95 0.000598 0.063746 0.000598 ... 0.380299 0.117175 0.357739 0.454119 0.997992 0.901822 0.339448 0.137110 sadness 1.0
1 4 frankie laine i believe 1950 pop believe drop rain fall grow believe darkest ni... 51 0.035537 0.096777 0.443435 ... 0.001284 0.001284 0.331745 0.647540 0.954819 0.000002 0.325021 0.263240 world/life 1.0
2 6 johnnie ray cry 1950 pop sweetheart send letter goodbye secret feel bet... 24 0.002770 0.002770 0.002770 ... 0.002770 0.225422 0.456298 0.585288 0.840361 0.000000 0.351814 0.139112 music 1.0
3 10 pérez prado patricia 1950 pop kiss lips want stroll charm mambo chacha merin... 54 0.048249 0.001548 0.001548 ... 0.225889 0.001548 0.686992 0.744404 0.083935 0.199393 0.775350 0.743736 romantic 1.0
4 12 giorgos papadopoulos apopse eida oneiro 1950 pop till darling till matter know till dream live ... 48 0.001350 0.001350 0.417772 ... 0.068800 0.001350 0.291671 0.646489 0.975904 0.000246 0.597073 0.394375 romantic 1.0

5 rows × 31 columns

Specify the engineered features used for classification:

engineered_features = ['dating', 'violence', 'world/life', 'night/time','shake the audience','family/gospel', 'romantic', 'communication','obscene', 'music', 'movement/places', 'light/visual perceptions','family/spiritual', 'like/girls', 'sadness', 'feelings', 'danceability','loudness', 'acousticness', 'instrumentalness', 'valence', 'energy']  
len(engineered_features) 
22

There are 22 engineered features.

Find out all the genres in the dataframe:

genre_df = df.groupby("genre").size()
genre_df
genre
blues      4604
country    5445
hip hop     904
jazz       3845
pop        7042
reggae     2498
rock       4034
dtype: int64

There are seven genres in the dataframe. Here, I want to reduce the data size by only selecting three genres for classification.

  • I give each of the genres an integer that I’ll use to encode the genre in the target variable.
genres = {
    "blues" : 0,
    "pop" : 1, 
    "country" : 2
    }
df = df[df["genre"].apply(lambda x: x in genres.keys())]
df["genre"] = df["genre"].apply(genres.get)

Drop the unused columns from the original dataframe:

  • I select only text features, engineered features, and genres for the dataframe to be used for training:
genres_df= df[["genre","lyrics"]]
engineered_df = df[engineered_features]
df = pd.concat([genres_df,engineered_df],axis = 1)
df.head()
genre lyrics dating violence world/life night/time shake the audience family/gospel romantic communication ... family/spiritual like/girls sadness feelings danceability loudness acousticness instrumentalness valence energy
0 1 hold time feel break feel untrue convince spea... 0.000598 0.063746 0.000598 0.000598 0.000598 0.048857 0.017104 0.263751 ... 0.000598 0.000598 0.380299 0.117175 0.357739 0.454119 0.997992 0.901822 0.339448 0.137110
1 1 believe drop rain fall grow believe darkest ni... 0.035537 0.096777 0.443435 0.001284 0.001284 0.027007 0.001284 0.001284 ... 0.051124 0.001284 0.001284 0.001284 0.331745 0.647540 0.954819 0.000002 0.325021 0.263240
2 1 sweetheart send letter goodbye secret feel bet... 0.002770 0.002770 0.002770 0.002770 0.002770 0.002770 0.158564 0.250668 ... 0.002770 0.002770 0.002770 0.225422 0.456298 0.585288 0.840361 0.000000 0.351814 0.139112
3 1 kiss lips want stroll charm mambo chacha merin... 0.048249 0.001548 0.001548 0.001548 0.021500 0.001548 0.411536 0.001548 ... 0.001548 0.081132 0.225889 0.001548 0.686992 0.744404 0.083935 0.199393 0.775350 0.743736
4 1 till darling till matter know till dream live ... 0.001350 0.001350 0.417772 0.001350 0.001350 0.001350 0.463430 0.001350 ... 0.029755 0.001350 0.068800 0.001350 0.291671 0.646489 0.975904 0.000246 0.597073 0.394375

5 rows × 24 columns

Now we can see that genres have been encoded into integers.

Part 2: Prepare DataLoader

Next I want to wrap the Pandas data frame as a torch data set.

  • I implement my own Dataset named DataFromDF, where I implement getitem() to help separate the input data into the text features, engineered features, and labels.
from source import DataFromDF
from torch.utils.data import Dataset, DataLoader
  1. Perform train test split and make Datasets from each one:
df_train, df_val = train_test_split(df, shuffle = True, test_size = 0.2)

train_data = DataFromDF(df_train)
val_data  = DataFromDF(df_val)
  1. Text Vectorization
  • Here, I build a mapping from words to integers using “build_vocab_from_iterator” function.
#! pip3 install torchtext
import torchtext
from torchtext.data import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

from source import Helper
tokenizer = get_tokenizer('basic_english')
Helper = Helper()

vocab = build_vocab_from_iterator(Helper.yield_tokens(train_data), specials=["<unk>"], min_freq = 50)
vocab.set_default_index(vocab["<unk>"])
  1. Creat separate pipelines for text and label:
  • Represent each headline as a sequence of integers using the vocab.

  • Pad the headlines with an unused integer index if necessary so that all headlines have the same length.

max_len= 30   
num_tokens = len(vocab.get_itos())

def text_pipeline(x):
        tokens = vocab(tokenizer(x))
        
        y = torch.zeros(max_len, dtype=torch.int64) + num_tokens
        if len(tokens) > max_len:
            tokens = tokens[0:max_len]
        y[0:len(tokens)] = torch.tensor(tokens,dtype=torch.int64)
        return y
    
label_pipeline = lambda x: int(x)
  1. Create collate_batch function that returns the batch of heeadlines as a consolidated tensor:
def collate_batch(batch):
        label_list, text_list, engineered_list = [], [], []
        for (_text, _engineered, _label) in batch:

            # add label to list
            label_list.append(label_pipeline(_label))

            # add text (as sequence of integers) to list
            processed_text = text_pipeline(_text)    
            text_list.append(processed_text)

            # add engineered features to list
            engineered_list.append(_engineered)

        label_list = torch.tensor(label_list, dtype=torch.int64)
        engineered_list = torch.tensor(engineered_list)
        text_list = torch.stack(text_list)
        return label_list.to(device), text_list.to(device), engineered_list.to(device)
  1. Now we can create batches of data containing both the text features and the engineered features:
train_loader = DataLoader(train_data, batch_size=4, shuffle=True, collate_fn=collate_batch)
val_loader = DataLoader(val_data, batch_size=4, shuffle=True, collate_fn=collate_batch)

Part 3: Modeling

I implement three Deep Neural Networks for different classifications:

from models import TextClassificationModel
from models import EngineeredClassificationModel
from models import CombinedNet

Training loop:

  • In my train() function, I pass the arguments that state whether the model being trained should use only the text features, only the engineered features, or both.
from torch.nn.modules.module import T
import time

def train(dataloader, text = True, engineered = False, k_epochs=5):

    loss_fn = torch.nn.CrossEntropyLoss()

    # Use only text features
    if (text == True) and (engineered == False):
      model = TextClassificationModel(vocab_size, embedding_dim, max_len, 3).to(device)
      model.double()
      optimizer = torch.optim.Adam(model.parameters(), lr=.1)
      
      for epoch in range(1, k_epochs+1): 
        epoch_start_time = time.time()
        # keep track of some counts for measuring accuracy
        total_acc, total_count = 0, 0
        log_interval = 300
        start_time = time.time()  
          
        for idx, (label, text, engineered) in enumerate(dataloader):
          # zero gradients
          optimizer.zero_grad()       
          # form prediction on batch
          predicted_label = model(text)
          # evaluate loss on prediction
          loss = loss_fn(predicted_label, label)
          # compute gradient
          loss.backward()
          # take an optimization step
          optimizer.step()

          # for printing accuracy
          total_acc   += (predicted_label.argmax(1) == label).sum().item()
          total_count += label.size(0)
      
        print(f'| epoch {epoch:3d} | train accuracy {total_acc/total_count:8.3f} | time: {time.time() - epoch_start_time:5.2f}s')

    # Use only engineered features
    elif (text == False) and (engineered == True):
      model = EngineeredClassificationModel().to(device)
      model.double()
      optimizer = torch.optim.Adam(model.parameters(), lr=.1)
        
      for epoch in range(1, k_epochs+1): 
        epoch_start_time = time.time()
        # keep track of some counts for measuring accuracy
        total_acc, total_count = 0, 0
        log_interval = 300
        start_time = time.time()  
          
        for idx, (label, text, engineered) in enumerate(dataloader):
          # zero gradients
          optimizer.zero_grad()
        
          # form prediction on batch
          predicted_label = model(engineered)
          # evaluate loss on prediction
          loss = loss_fn(predicted_label, label)
          # compute gradient
          loss.backward()
          # take an optimization step
          optimizer.step()

          # for printing accuracy
          total_acc   += (predicted_label.argmax(1) == label).sum().item()
          total_count += label.size(0)

        print(f'| epoch {epoch:3d} | train accuracy {total_acc/total_count:8.3f} | time: {time.time() - epoch_start_time:5.2f}s')

    # Use both text and engineered features    
    elif (text==True) and (engineered == True):
      model = CombinedNet(vocab_size, embedding_dim, max_len, num_class).to(device)
      model.double()
      optimizer = torch.optim.Adam(model.parameters(), lr=.1)

      for epoch in range(1, k_epochs+1): 
        epoch_start_time = time.time()
        # keep track of some counts for measuring accuracy
        total_acc, total_count = 0, 0
        log_interval = 300
        start_time = time.time()  
          
        for idx, (label, text, engineered) in enumerate(dataloader):
            # zero gradients
            optimizer.zero_grad()
        
            # form prediction on batch
            predicted_label = model([text, engineered])
            # evaluate loss on prediction
            loss = loss_fn(predicted_label, label)
            # compute gradient
            loss.backward()
            # take an optimization step
            optimizer.step()

            # for printing accuracy
            total_acc   += (predicted_label.argmax(1) == label).sum().item()
            total_count += label.size(0)
        
        print(f'| epoch {epoch:3d} | train accuracy {total_acc/total_count:8.3f} | time: {time.time() - epoch_start_time:5.2f}s')

Evaluation Loop:

  • In my evaludate() function, I use the same mechanism as the training loop to determine which part of the data should be passed to the model.
def evaluate(dataloader, text = True, engineered = False):

    total_acc, total_count = 0, 0

    # Use only text features
    if (text == True) and (engineered == False):
      model = TextClassificationModel(vocab_size, embedding_dim, max_len, num_class).to(device)
      model.double()
      with torch.no_grad():
          for idx,(label, text, engineered) in enumerate(dataloader):
              predicted_label = model(text)
              total_acc += (predicted_label.argmax(1) == label).sum().item()
              total_count += label.size(0)
    
    # Use only engineered features
    elif (text == False) and (engineered == True):
      model = EngineeredClassificationModel().to(device)
      model.double()
      with torch.no_grad():
          for idx,(label, text, engineered) in enumerate(dataloader):
              predicted_label = model(engineered)
              total_acc += (predicted_label.argmax(1) == label).sum().item()
              total_count += label.size(0)
    
    # Use both text and engineered features
    elif (text == True) and (engineered == True):
      model = CombinedNet(vocab_size, embedding_dim, max_len, num_class).to(device)
      model.double()
      with torch.no_grad():
          for idx,(label, text, engineered) in enumerate(dataloader):
              predicted_label = model([text,engineered])
              total_acc += (predicted_label.argmax(1) == label).sum().item()
              total_count += label.size(0)
    
    return total_acc/total_count

First Network:

  • Using only the lyrics to perform the classification task.
vocab_size = len(vocab)
embedding_dim = 3
max_len = 30
num_class = 3
train(train_loader, text=True, engineered=False, k_epochs=10)
| epoch   1 | train accuracy    0.491 | time:  7.36s
| epoch   2 | train accuracy    0.560 | time:  8.58s
| epoch   3 | train accuracy    0.585 | time:  7.60s
| epoch   4 | train accuracy    0.600 | time:  7.23s
| epoch   5 | train accuracy    0.611 | time:  7.17s
| epoch   6 | train accuracy    0.615 | time:  7.55s
| epoch   7 | train accuracy    0.622 | time:  7.28s
| epoch   8 | train accuracy    0.629 | time:  7.36s
| epoch   9 | train accuracy    0.625 | time:  7.13s
| epoch  10 | train accuracy    0.630 | time:  7.47s

Compare the model on validation data:

evaluate(val_loader,text=True, engineered=False)
0.3737935068733548

The accuracy on evaluation data is much lower than what we achieve on training data, which suggests that there might be overfitting.

Regardless, the model has made significant improvement on base rate (41%), as computed below:

df_train.groupby("genre").size() / len(df_train)
genre
0    0.265360
1    0.412815
2    0.321826
dtype: float64

Second Network:

  • Using only the engineered features to perform the classification task.
train(train_loader, text=False, engineered=True, k_epochs=10)
| epoch   1 | train accuracy    0.435 | time:  8.34s
| epoch   2 | train accuracy    0.447 | time:  8.61s
| epoch   3 | train accuracy    0.439 | time:  8.06s
| epoch   4 | train accuracy    0.429 | time:  8.12s
| epoch   5 | train accuracy    0.454 | time:  8.52s
| epoch   6 | train accuracy    0.437 | time:  8.02s
| epoch   7 | train accuracy    0.449 | time:  7.72s
| epoch   8 | train accuracy    0.439 | time:  7.98s
| epoch   9 | train accuracy    0.419 | time:  7.98s
| epoch  10 | train accuracy    0.457 | time:  7.81s

Compare the model on validation data:

evaluate(val_loader,text=False, engineered=True)
0.40889148873939746

The accuracy on evaluation data is slightly lower than what we achieve on training data, so we can conclude that there is no overfitting.

It can be seen that the model has made a slight improvement on the base rate.

Third Network:

  • Using both the lyrics and the engineered features to perform the classification task.
train(train_loader, text=True, engineered=True, k_epochs=10)
| epoch   1 | train accuracy    0.408 | time:  9.13s
| epoch   2 | train accuracy    0.409 | time:  8.54s
| epoch   3 | train accuracy    0.411 | time:  8.80s
| epoch   4 | train accuracy    0.407 | time:  8.45s
| epoch   5 | train accuracy    0.409 | time:  8.45s
| epoch   6 | train accuracy    0.410 | time:  8.94s
| epoch   7 | train accuracy    0.412 | time:  8.67s
| epoch   8 | train accuracy    0.409 | time:  8.49s
| epoch   9 | train accuracy    0.409 | time:  8.86s
| epoch  10 | train accuracy    0.410 | time:  8.82s

Compare the model on validation data:

evaluate(val_loader,text=True, engineered=True)
0.4205908160280784

The accuracy on evaluation data is very close to what we achieve on training data, so we can conclude that there are no overfitting.

It can be seen that the model has made a slight improvement on the base rate, but the training scores are relatively consistent.

Part 4: Visualize Word Embeddings

  • I extract the embedding matrix from the corresponding layer of my model, and the words from vocabular:
embedding_matrix = model.embedding.cpu().weight.data.numpy()
tokens = vocab.get_itos()
  • Then I use PCA to extract a 2-dimensional representation of the matrix that we can plot:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
weights = pca.fit_transform(embedding_matrix)
  • Create the word embedding data frame for plotting:
tokens = vocab.get_itos()
tokens.append(" ")
embedding_df = pd.DataFrame({
    'word' : tokens, 
    'x0'   : weights[:,0],
    'x1'   : weights[:,1]
})

Part 5: Optional Extras

  • I want to address the following two questions:

1. Does blues music tend to have more sadness than other genres? Does pop or rock have more energy?

df_plot = df_raw.groupby(["genre"])[["sadness","energy"]].aggregate(np.mean)
import seaborn as sns
from matplotlib import pyplot as plt

fig, axs = plt.subplots(ncols=2,figsize= (15,8))
ax = sns.barplot(data=df_plot,x=df_plot.index,y="sadness",ax = axs[0])
ax = sns.barplot(data=df_plot,x=df_plot.index,y="energy",ax = axs[1])

  • Based on the plot above, it seems that blues does not have more sadness than other genres. In fact, country music has the most sadness among all. However, pop and rock do have more energy than other genres.

2. Are acousticness and instrumentalness similar features? Can you find any patterns in when they disagree?

  • Instrumentalness measures how likely the music contains no spoken word vocals

  • Acousticness measures how likely the music contains no “electrical amplifications of sounds”.

Theoretically, the two features are opposite: the higher the acousticness, the lower the instrumentalness, and vice versa.

df_raw.corr()
df_merge = pd.concat([df_raw["acousticness"],df_raw["instrumentalness"]],axis = 1)
df_merge.corr()
/var/folders/h4/g2m40gs10fl3yvzdr50l4kb80000gn/T/ipykernel_88446/3954789753.py:1: FutureWarning:

The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
acousticness instrumentalness
acousticness 1.000000 0.007762
instrumentalness 0.007762 1.000000
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(figsize=(20, 10))
df_merge["average"] = (df_merge['acousticness']+df_merge['instrumentalness'])/2
sns.lineplot(data=df_merge[50:200],ax=ax, marker=".")

According to the plot, it seems that generally, when acousticness is high, instrumentalness tends to be low.

Conclusion

  • Initially, I wanted to select all the seven genres to train my model for music classification. However, I found that the model became very slow (more than 20 seconds each epoch) and achieved low accuracy (appriximately 0.2), so I then decided to select only three genres for classification. The accuracy did increase, and it took less time to complete each training epoch.

  • However, the accuracy achieved by the model still could not be considered as high. Specifically, the model for engineered features could only achieve less than 0.4 accuracy (not much greater than random guesswork), which suggested that the model still need improvement.

  • Then I revised my models by adding nn.Dropout layers and increasing hidden layers. While my models still could not achieve very high accuracy, they all consistently scored above the base rate after training. So I would consider my models to be successful. Comparing all the three models, the first model (textClassification) had the best performance and could reach the accuracy of over 60%. However, the third model(combinedNet) did not show any significant improvement on training score.