Researcher ORCID Identifier
https://orcid.org/0009-0004-5367-1745
Graduation Year
2026
Date of Submission
12-2025
Document Type
Campus Only Senior Thesis
Degree Name
Bachelor of Arts
Department
Economics
Reader 1
Michael Gelman
Abstract
Machine learning models are increasingly used in high-stakes employment decisions, including salary prediction and compensation benchmarking. When these models learn from historically biased labor market data, they risk perpetuating discrimination across demographic groups. Despite growing concern about algorithmic fairness, limited empirical work examines how the choice of encoding method and machine learning model jointly affects both predictive accuracy and fairness in salary prediction.
I compare ten model-encoding combinations using salary data from Kaggle spanning multiple countries (N = 6,699). Three encoding strategies (One-Hot, Target Mean, and CatBoost encoding) are paired with four model types: Linear Regression, Random Forest, XGBoost, and CatBoost Regressor. Predictive accuracy is measured using root mean squared error (RMSE) across five-fold cross-validation. Fairness is assessed through normalized prediction residuals on a held-out test set, examining whether models systematically over- or underpredict salaries for women and racial minorities. Two parallel experiments test whether removing protected attributes ("fairness through unawareness") reduces bias.
I find that One-Hot encoding paired with XGBoost achieves both the highest predictive accuracy and the least bias across gender and racial groups, challenging the assumption that accuracy and fairness are necessarily in tension. All models systematically overpredicted women's salaries, consistent with omitted variable bias. Removing gender and race from the models did not reduce this bias: for One-Hot + XGBoost, female overprediction nearly tripled, and all models showed statistically significant gender gaps when protected attributes were removed, compared with only two when they were included. Despite balanced racial representation, 4 of 17 country-race groups, all of them minority groups, showed significant residual disparities, and removing protected attributes preserved the direction of racial bias in 88% of groups while introducing new biases in others. Target Mean encoding paired with XGBoost produced the largest fairness gaps, while One-Hot encoding consistently minimized bias.
These findings demonstrate that encoding choice matters more than model complexity for fairness outcomes, and that attempts to achieve fairness by excluding protected attributes are not only ineffective but counterproductive.
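As a rough illustration of the quantities named in the abstract, the sketch below shows One-Hot and Target Mean encoding of a categorical feature, RMSE, and a per-group mean normalized residual. It is not the thesis code. The function names, the toy data, and the residual definition, (predicted - actual) / actual, are all assumptions made for illustration; the abstract does not state the exact normalization used. CatBoost encoding, model fitting, and cross-validation are omitted.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean


def one_hot(values):
    """Map each category to a 0/1 indicator vector (columns in sorted order)."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


def target_mean_encode(values, targets):
    """Replace each category with the mean target observed for that category."""
    by_category = defaultdict(list)
    for v, t in zip(values, targets):
        by_category[v].append(t)
    means = {c: mean(ts) for c, ts in by_category.items()}
    return [means[v] for v in values]


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    return sqrt(mean((p - a) ** 2 for a, p in zip(actual, predicted)))


def mean_normalized_residual_by_group(actual, predicted, groups):
    """Assumed definition: (predicted - actual) / actual, averaged per group.

    Under this sign convention a positive value means overprediction.
    """
    by_group = defaultdict(list)
    for a, p, g in zip(actual, predicted, groups):
        by_group[g].append((p - a) / a)
    return {g: mean(rs) for g, rs in by_group.items()}


# Toy, made-up values; they are NOT drawn from the thesis dataset.
job = ["analyst", "engineer", "analyst", "engineer"]
gender = ["F", "F", "M", "M"]
salary = [50_000, 80_000, 60_000, 100_000]
predicted = [55_000, 88_000, 57_000, 100_000]  # stand-in model output

encoded_one_hot = one_hot(job)
encoded_target = target_mean_encode(job, salary)
overall_rmse = rmse(salary, predicted)
gaps = mean_normalized_residual_by_group(salary, predicted, gender)
```

In a real pipeline the Target Mean statistics would presumably be computed on training folds only, so that target information does not leak into validation data; the sketch skips that step for brevity.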
Recommended Citation
Zhou, Joshua, "Encoding Fairness: How Model and Feature Encoding Choices Affect Bias in Salary Prediction" (2026). CMC Senior Theses. 4274.
https://scholarship.claremont.edu/cmc_theses/4274
Access to this thesis is restricted to current faculty, students, and staff of the Claremont Colleges.