Auditing Allocative Bias

In this blog post, I fit a classifier using data from folktables and perform a bias audit for the algorithm.
Author

Sally Liu

Published

April 11, 2023

Objective:

This blog post consists of six parts:

  1. Downloading the folktables dataset and choosing the prediction problem, a list of features, and the group with respect to which to evaluate bias.

  2. Using simple descriptive analysis to address some questions about the dataset.

  3. Training my model on the training data, incorporating a tunable model-complexity parameter, and using cross-validation to select a good value for it.

  4. Auditing my model: performing an audit in which I address several questions on the test data.

  5. Concluding discussion.

  6. (Optional) Checking for Intersectional Bias in my model.

Part 1: Preparation

Download folktables

from folktables import ACSDataSource, ACSEmployment, BasicProblem, adult_filter
import numpy as np

STATE = "AL"

data_source = ACSDataSource(survey_year='2018', 
                            horizon='1-Year', 
                            survey='person')

acs_data = data_source.get_data(states=[STATE], download=True)

acs_data.head()
RT SERIALNO DIVISION SPORDER PUMA REGION ST ADJINC PWGTP AGEP ... PWGTP71 PWGTP72 PWGTP73 PWGTP74 PWGTP75 PWGTP76 PWGTP77 PWGTP78 PWGTP79 PWGTP80
0 P 2018GQ0000049 6 1 1600 3 1 1013097 75 19 ... 140 74 73 7 76 75 80 74 7 72
1 P 2018GQ0000058 6 1 1900 3 1 1013097 75 18 ... 76 78 7 76 80 78 7 147 150 75
2 P 2018GQ0000219 6 1 2000 3 1 1013097 118 53 ... 117 121 123 205 208 218 120 19 123 18
3 P 2018GQ0000246 6 1 2400 3 1 1013097 43 28 ... 43 76 79 77 80 44 46 82 81 8
4 P 2018GQ0000251 6 1 2701 3 1 1013097 16 25 ... 4 2 29 17 15 28 17 30 15 1

5 rows × 286 columns

Select possible features

possible_features=['AGEP', 'SCHL', 'MAR', 'RELP', 'DIS', 'ESP', 'CIT', 'MIG', 'MIL', 'ANC', 'NATIVITY', 'DEAR', 'DEYE', 'DREM', 'SEX', 'RAC1P', 'ESR']
#acs_data[possible_features].head()
features_to_use = [f for f in possible_features if f not in ["ESR", "RAC1P"]]
EmploymentProblem = BasicProblem(
    features=features_to_use,
    target='ESR',
    target_transform=lambda x: x == 1,
    group='RAC1P',
    preprocess=lambda x: x,
    postprocess=lambda x: np.nan_to_num(x, -1),
)

features, label, group = EmploymentProblem.df_to_numpy(acs_data)
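
As an aside, folktables also ships a predefined ACSEmployment problem (imported above but not used). If I recall the library correctly, its feature list includes RAC1P itself, which is why I define a custom BasicProblem that excludes race from the features and uses it only as the group variable. A minimal sketch of the predefined version, for comparison:

# predefined task from folktables; unlike my custom problem, it keeps RAC1P among the features
features_pre, label_pre, group_pre = ACSEmployment.df_to_numpy(acs_data)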

Perform Train-Test split

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test, group_train, group_test = train_test_split(
    features, label, group, test_size=0.2, random_state=0)

Choose the Problem:

  • Predict employment status on the basis of demographics excluding race, and audit for racial bias.
Choose a Machine Learning Model:
  • DecisionTree Classifier

Here I set up a DecisionTreeClassifier from sklearn in a pipeline:

from sklearn.tree import DecisionTreeClassifier 
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix

model = make_pipeline(StandardScaler(), DecisionTreeClassifier())
model.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('decisiontreeclassifier', DecisionTreeClassifier())])

Part 2: Basic Descriptives

Turn my training data back into a data frame for easy analysis

import pandas as pd
df = pd.DataFrame(X_train, columns = features_to_use)
df["group"] = group_train
df["label"] = y_train
df[:5]
AGEP SCHL MAR RELP DIS ESP CIT MIG MIL ANC NATIVITY DEAR DEYE DREM SEX group label
0 46.0 20.0 2.0 0.0 2.0 0.0 1.0 1.0 4.0 1.0 1.0 2.0 2.0 2.0 2.0 1 True
1 38.0 21.0 3.0 0.0 2.0 0.0 1.0 3.0 4.0 1.0 1.0 2.0 2.0 2.0 2.0 2 True
2 22.0 19.0 5.0 2.0 2.0 0.0 1.0 1.0 4.0 1.0 1.0 2.0 2.0 2.0 2.0 2 True
3 12.0 9.0 5.0 2.0 2.0 1.0 1.0 1.0 0.0 1.0 1.0 2.0 2.0 2.0 1.0 1 False
4 40.0 22.0 1.0 0.0 2.0 0.0 1.0 1.0 4.0 1.0 1.0 2.0 2.0 2.0 1.0 1 True
1. How many individuals are in the data?
sum_individ =df.shape[0]
print("There are", sum_individ, "individuals in the data.")
There are 38221 individuals in the data.
2. Of these individuals, what proportion have target label equal to 1? (In employment prediction, these correspond to employed individuals.)
num_label_1 = np.sum(df["label"] == 1)
print(num_label_1/sum_individ, "of individuals have target label equal to 1 (i.e., are employed).")
0.4091468041129222 of individuals have target label equal to 1 (i.e., are employed).
3. Of these individuals, how many are in each of the groups?
print("Number of individuals that are White: ", np.sum(df["group"]==1))
print("Number of individuals that are Black or African American: ", np.sum(df["group"]==2))
print("Number of individuals that are other self-identified racial groups: ", np.sum(df["group"]>2))
Number of individuals that are White:  28441
Number of individuals that are Black or African American:  8070
Number of individuals that are other self-identified racial groups:  1710
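The same counts can also be read off in one line; a quick sketch using the data frame above:

# counts of individuals per RAC1P group code, in ascending order of code
df["group"].value_counts().sort_index()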
4. In each group, what proportion of individuals have target label equal to 1?
group_label_df = df.groupby(['group', 'label']).size()
group_label_df
group  label
1      False    16421
       True     12020
2      False     5094
       True      2976
3      False       88
       True        68
4      True         1
5      False       23
       True         6
6      False      226
       True       226
7      False        8
       True         3
8      False      251
       True       147
9      False      472
       True       191
dtype: int64

The table above shows the number of employed and unemployed individuals in each group.

  • I write a for-loop to print out the proportion of employed individuals in each group:
for i in range(1,10):
    prop_group_i = group_label_df[i][True]/group_label_df[i].sum() 
    percentage = round(prop_group_i*100,2)
    print("In group "+ str(i)+', '+ str(percentage)+'%'+" of individuals have target label equal to 1.")
In group 1, 42.26% of individuals have target label equal to 1.
In group 2, 36.88% of individuals have target label equal to 1.
In group 3, 43.59% of individuals have target label equal to 1.
In group 4, 100.0% of individuals have target label equal to 1.
In group 5, 20.69% of individuals have target label equal to 1.
In group 6, 50.0% of individuals have target label equal to 1.
In group 7, 27.27% of individuals have target label equal to 1.
In group 8, 36.93% of individuals have target label equal to 1.
In group 9, 28.81% of individuals have target label equal to 1.
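
The same per-group proportions can be computed more compactly with a groupby; a small sketch using the data frame built above:

# proportion of employed (label == True) individuals within each group, as a percentage
(df.groupby("group")["label"].mean() * 100).round(2)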

Part 3: Train the Model

  • I train my DecisionTreeClassifier on the training data and select a good choice of model complexity:
from sklearn.tree import DecisionTreeClassifier 
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix

model = make_pipeline(StandardScaler(), DecisionTreeClassifier())
model.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('decisiontreeclassifier', DecisionTreeClassifier())])
Before tuning my model:
  • The overall accuracy in predicting whether someone is employed:
y_hat = model.predict(X_test)
print("Overall accuracy before tuning:",(y_hat == y_test).mean())
Overall accuracy before tuning: 0.7663248221012976
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test,y_hat, normalize = 'true')
array([[0.81331198, 0.18668802],
       [0.3007366 , 0.6992634 ]])

Tuning the max_depth in my DecisionTree Classifier:

  • I write a loop that records the cross-validation score for each max_depth and keeps track of the best one:
from sklearn.model_selection import cross_val_score

acc_score = []
max_depth_range = np.arange(1,20)
max_score = 0.0
max_depth = 0

for n in max_depth_range:
    model = make_pipeline(StandardScaler(), DecisionTreeClassifier(max_depth=n))
    # cross_val_score clones and refits the pipeline on each fold, so no separate fit is needed here
    cv_scores = cross_val_score(model, X_train, y_train, cv=10, scoring='accuracy')
    mean_score = cv_scores.mean()
    acc_score.append(mean_score)

    # keep track of the maximum cross-validation score and the corresponding depth
    if mean_score > max_score:
        max_score = mean_score
        max_depth = n

print(f"Max_depth: {max_depth}, Max_score: {max_score}")


import matplotlib.pyplot as plt
%matplotlib inline

max_depth_range = np.arange(1,20)

# plot max_depth (x-axis) versus the cross-validated accuracy (y-axis)
plt.plot(max_depth_range,acc_score)
plt.xlabel('max_depth')
plt.ylabel('Cross-Validated Accuracy')
Max_depth: 7, Max_score: 0.8202822282658613
[Figure: cross-validated accuracy as a function of max_depth]

  • Based on the plot, the cross-validated accuracy is highest when max_depth = 7. (An equivalent search with GridSearchCV is sketched below.)
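
As a side note, the same search could be delegated to scikit-learn's GridSearchCV. A minimal sketch, not part of the analysis above; the parameter name follows make_pipeline's automatic step naming:

from sklearn.model_selection import GridSearchCV

# grid-search max_depth with the same 10-fold cross-validation
pipe = make_pipeline(StandardScaler(), DecisionTreeClassifier())
param_grid = {"decisiontreeclassifier__max_depth": np.arange(1, 20)}
search = GridSearchCV(pipe, param_grid, cv=10, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)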

Compare the overall prediction accuracy and confusion matrix after tuning:

model = make_pipeline(StandardScaler(), DecisionTreeClassifier(max_depth=7))
model.fit(X_train, y_train)

y_hat = model.predict(X_test)
print("Overall accuracy after tuning:", (y_hat == y_test).mean())
confusion_matrix(y_test,y_hat, normalize = 'true')
Overall accuracy after tuning: 0.8205316031812474
array([[0.8247019 , 0.1752981 ],
       [0.18542037, 0.81457963]])

After tuning, the overall accuracy improves from about 0.77 to 0.82, and both off-diagonal (error) entries of the normalized confusion matrix shrink, so the tuned model predicts noticeably better.

Part 4: Audit the Model

  • Now, I perform an audit in which I will address the following questions (all on test data):
Overall Measures
  1. What is the overall accuracy of your model?
y_hat = model.predict(X_test)
print("The overall accuracy of the model is:", (y_hat == y_test).mean())
The overall accuracy of the model is: 0.8205316031812474
  2. What is the positive predictive value (PPV) of your model?

I compute TN, FN, TP, and FP from the confusion matrix and calculate the PPV:

cm = confusion_matrix(y_test, y_hat)
TN,FN,TP,FP = cm[0][0], cm[1][0], cm[1][1], cm[0][1]
PPV = TP/(TP+FP)
print("The positive predictive value (PPV) of my model is:", PPV)
The positive predictive value (PPV) of my model is: 0.7650286259541985
  3. What are the overall false negative and false positive rates (FNR and FPR) for your model?
FNR = FN/(TP+FN)
FPR = FP/(FP+TN)
print("The overall false negative rate of my model is:", FNR)
print("The overall false positive rate of my model is:", FPR)
The overall false negative rate of my model is: 0.18542037084074167
The overall false positive rate of my model is: 0.17529809574657412
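Since the same PPV/FNR/FPR arithmetic is repeated below for each subgroup, it can help to wrap it in a small helper. The function name rates below is my own (it is not part of the code above), and this is only a sketch:

def rates(y_true, y_pred):
    """Return (PPV, FNR, FPR) from a binary confusion matrix."""
    # sklearn's confusion matrix has rows = true classes, columns = predicted classes
    (tn, fp), (fn, tp) = confusion_matrix(y_true, y_pred)
    return tp / (tp + fp), fn / (fn + tp), fp / (fp + tn)

PPV, FNR, FPR = rates(y_test, y_hat)   # reproduces the overall values printed above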
By-Group Measures
  1. What is the accuracy of your model on each subgroup?
print("The accuracy of my model on group 1 is:",(y_hat == y_test)[group_test == 1].mean())
print("The accuracy of my model on group 2 is:",(y_hat == y_test)[group_test == 2].mean())
print("The accuracy of my model on group 3 and above is:",(y_hat == y_test)[group_test > 2].mean())
The accuracy of my model on group 1 is: 0.819549929676512
The accuracy of my model on group 2 is: 0.8220048899755501
The accuracy of my model on group 3 and above is: 0.830423940149626
  2. What is the PPV of your model on each subgroup?
cm1 = confusion_matrix(y_test[group_test == 1], y_hat[group_test == 1])
cm2 = confusion_matrix(y_test[group_test == 2], y_hat[group_test == 2])
cm3 = confusion_matrix(y_test[group_test > 2], y_hat[group_test > 2])

TN1,FN1,TP1,FP1 = cm1[0][0], cm1[1][0], cm1[1][1], cm1[0][1]
TN2,FN2,TP2,FP2 = cm2[0][0], cm2[1][0], cm2[1][1], cm2[0][1]
TN3,FN3,TP3,FP3 = cm3[0][0], cm3[1][0], cm3[1][1], cm3[0][1]

print("The PPV of my model on group 1 is:", TP1/(TP1+FP1))
print("The PPV of my model on group 2 is:", TP2/(TP2+FP2))
print("The PPV of my model on group 3 and above is:", TP3/(TP3+FP3))
The PPV of my model on group 1 is: 0.7759051186017478
The PPV of my model on group 2 is: 0.7263922518159807
The PPV of my model on group 3 and above is: 0.7469135802469136
  3. What are the FNR and FPR on each subgroup?
print("The FNR of group 1 is:",FN1/(TP1+FN1),"; The FPR of group 1 is:", FP1/(FP1+TN1))
print("The FNR of group 2 is:",FN2/(TP2+FN2),"; The FPR of group 2 is:", FP2/(FP2+TN2))
print("The FNR of group 3 and above is:",FN3/(TP3+FN3),"; The FPR of group 3 and above is:", FP3/(FP3+TN3))
The FNR of group 1 is: 0.18518518518518517 ; The FPR of group 1 is: 0.1768908598176891
The FNR of group 2 is: 0.18699186991869918 ; The FPR of group 2 is: 0.1729150726855394
The FNR of group 3 and above is: 0.18243243243243243 ; The FPR of group 3 and above is: 0.16205533596837945
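Using the hypothetical rates helper sketched earlier, the same by-group numbers can be produced in a short loop:

# PPV, FNR, FPR for each subgroup of interest
masks = {"group 1": group_test == 1,
         "group 2": group_test == 2,
         "groups 3+": group_test > 2}
for name, mask in masks.items():
    ppv, fnr, fpr = rates(y_test[mask], y_hat[mask])
    print(f"{name}: PPV = {ppv:.3f}, FNR = {fnr:.3f}, FPR = {fpr:.3f}")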
Bias Measures
  1. Is my model approximately calibrated?

(A model is calibrated if, when the test samples are binned by predicted probability, the proportion of positive outcomes in each bin is close to the bin's average predicted probability.)

  • Yes.
from sklearn.calibration import calibration_curve
model = make_pipeline(StandardScaler(), DecisionTreeClassifier(max_depth=7))
model.fit(X_train, y_train)

prob = model.predict_proba(X_test)[:,1]

# predict_proba already returns probabilities in [0, 1], so no normalization is needed
prob_true, prob_pred = calibration_curve(y_test, prob, n_bins = 15)

plt.plot([0, 1], [0, 1], linestyle = '--', label = 'Ideally Calibrated')

# Plot the model's calibration curve
plt.plot(prob_pred, prob_true, marker = '.', label = 'DecisionTree Classifier')

leg = plt.legend(loc = 'upper left')
plt.xlabel('Predicted Probability')
plt.ylabel('Ratio of positives')
plt.show()
[Figure: calibration curve of the tuned decision tree compared with the ideal diagonal]

According to the calibration curve, the model is approximately calibrated.
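
In a bias audit it can also be useful to check calibration within each group rather than only overall. A hedged sketch of how that could look for the two largest groups (an extension, not something computed above):

# group-wise calibration curves for White (1) and Black or African American (2) individuals
for g, name in [(1, "group 1"), (2, "group 2")]:
    mask = group_test == g
    prob_true_g, prob_pred_g = calibration_curve(y_test[mask], prob[mask], n_bins = 10)
    plt.plot(prob_pred_g, prob_true_g, marker = '.', label = name)
plt.plot([0, 1], [0, 1], linestyle = '--', label = 'Ideally Calibrated')
plt.xlabel('Predicted Probability')
plt.ylabel('Ratio of positives')
plt.legend(loc = 'upper left')
plt.show()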

  2. Does my model satisfy approximate error rate balance?

(Are the FNR and FPR approximately the same across all groups?)

  • Yes.

Based on the calculations above, the FNR and FPR are approximately the same for all groups, so my model satisfies approximate error rate balance.

  3. Does my model satisfy statistical parity?

(Is the PPV the same across all groups?)

  • No.

Based on the calculations above, group 1 has the highest PPV (roughly 0.78), while group 2 has a noticeably lower PPV (roughly 0.73). So my model does not satisfy statistical parity.

Part 5: Concluding Discussion

  1. What groups of people could stand to benefit from a system that is able to predict the label you predicted, such as income or employment status? For example, what kinds of companies might want to buy your model for commercial use?
  • Since my model predicts employment status across racial groups, government agencies and diversity-focused organizations might want to buy it to track employment status and monitor the balance of employment rates across races. Companies that care about racial equity and want to promote diversity in their workforce might also be interested.
  2. Based on your bias audit, what could be the impact of deploying your model for large-scale prediction in commercial or governmental settings?
  • Since my model only reaches an accuracy of around 82%, roughly 18% of its predictions on held-out data are still incorrect. If it were deployed for large-scale prediction in commercial or governmental settings, this could translate into a significant number of misclassifications.
  3. Based on your bias audit, do you feel that your model displays problematic bias? What kind (calibration, error rate, etc)?
  • Based on the analysis above, my model approximately satisfies calibration and error rate balance, but it displays problematic bias with respect to statistical parity (its PPV differs across groups).
  4. Beyond bias, are there other potential problems associated with deploying your model that make you uncomfortable? How would you propose addressing some of these problems?
  • My model merges all of the smaller self-identified racial groups into a single category. Because there is much less data for each of these groups than for groups 1 and 2, I decided it would be more practical to combine them. As a result, my model can only make predictions about these groups in aggregate and may not perform well for any single one of them. To address this, I would need more data for each self-identified group, so that the model could be trained and evaluated on each group separately.

Part 6: (Optional) Intersectional Bias

  • Is the FNR significantly higher for Black women than it is for Black men?

First, prepare the data frame:

import pandas as pd
df = pd.DataFrame(X_test, columns = features_to_use)
df["group"] = group_test
df["label"] = y_test

Get the confusion matrices for Black men and Black women:

y_hat = model.predict(X_test)

cm2_men = confusion_matrix(y_test[(group_test == 2)& (df['SEX'] == 1.0)], y_hat[(group_test == 2)& (df['SEX'] == 1.0)])
cm2_women =confusion_matrix(y_test[(group_test == 2)& (df['SEX'] == 2.0)], y_hat[(group_test == 2)& (df['SEX'] == 2.0)])

print("Confusion matrix for Black Men:")
print(cm2_men)

print("Confusion matrix for Black Women:")
print(cm2_women)
Confusion matrix for Black Men:
[[513  83]
 [ 62 261]]
Confusion matrix for Black Women:
[[568 143]
 [ 76 339]]

Calculate and compare the FNR for Black men and Black women in my model:

FNR_black_men = cm2_men[1][0]/(cm2_men[1][0] + cm2_men[1][1])
FNR_black_women = cm2_women[1][0]/(cm2_women[1][0] + cm2_women[1][1])

print(f"The FNR for Black Men is {FNR_black_men}.")
print(f"The FNR for Black Women is {FNR_black_women}.")
The FNR for Black Men is 0.19195046439628483.
The FNR for Black Women is 0.18313253012048192.
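
Before concluding, here is a hedged sketch of how this check could be extended to every (group, sex) combination at once; audit_df is a name I introduce only for illustration:

# FNR for every (group, SEX) combination: among truly employed people,
# the share predicted as not employed; df here is the test-set frame built above
audit_df = pd.DataFrame({"group": group_test, "sex": df["SEX"],
                         "label": y_test, "pred": y_hat})
fnr_by_subgroup = 1 - audit_df[audit_df["label"]].groupby(["group", "sex"])["pred"].mean()
print(fnr_by_subgroup)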
Conclusion:

Based on the calculation, about 19% of employed Black men are misclassified as unemployed, while about 18% of employed Black women are misclassified as unemployed.

The FNR for Black women is approximately the same as the FNR for Black men, which implies that employed Black women are about as likely to be misclassified as unemployed as employed Black men are. On this measure, the model does not show a gender bias among Black/African American individuals.