Week 14: Concluding the second quarter
This week was finals week at school, so I only had one brief meeting with my mentor and Caitlin. Caitlin and I presented our idea of mapping the reliability curves of protected groups to a compromise line and adjusting the model accordingly. Caitlin had spent time considering the same problem, but in the context of equalized odds. In the meeting we found that we were saying many of the same things, and that we were unclear on the distinction between the two criteria (equalized odds and calibration). Here's what we found:
Equalized odds and calibration both deal with minimizing the discrepancy of error in prediction (note: they are about minimizing the discrepancy, not the error itself). What does that mean? Consider the following three scenarios (I created these contrived examples in Microsoft Excel for the purpose of discussion):
Figure 1

Figure 2

Figure 3

For each of these scenarios, one could argue that the discrepancy of error is the same. In Figure 1, one group's error is underestimation while the other's is overestimation, but the total amount of error is the same. In Figure 2, the distribution of the error is different, but again the total amount of error is the same. In Figure 3, the amount of error is different, but the shape (distribution) of the error is the same. This can be summed up in three main forces: direction, distribution, and magnitude.
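Before digging into each force, here is a minimal sketch in Python (with NumPy) that builds toy per-bin error arrays in the spirit of the three figures and summarizes each force. The numbers are invented placeholders, not the actual values from my Excel sheets:

```python
# A minimal sketch of the three "forces" on toy data. The error arrays
# are invented placeholders, not the values from the actual figures.
import numpy as np

# Figure 1 analogue: same total error, opposite direction.
dir_a = np.array([2, 2, 2, 2, 2, 2, 2, 2])   # group A overestimated
dir_b = -dir_a                                # group B underestimated

# Figure 2 analogue: same total error, different distribution.
dist_a = np.array([7, 5, 3, 1, 1, 3, 5, 7])
dist_b = np.array([4, 4, 4, 4, 4, 4, 4, 4])  # same sum, flat shape

# Figure 3 analogue: same shape, zero net error, different magnitude.
mag_a = np.array([-1, -1, -1, -1, 1, 1, 1, 1])
mag_b = 3 * mag_a

for name, (x, y) in {"direction": (dir_a, dir_b),
                     "distribution": (dist_a, dist_b),
                     "magnitude": (mag_a, mag_b)}.items():
    print(name,
          "| net error:", int(x.sum()), "vs", int(y.sum()),
          "| total |error|:", int(np.abs(x).sum()), "vs", int(np.abs(y).sum()),
          "| max per-bin gap:", int(np.abs(x - y).max()))
```

Each scenario looks "equal" under one summary statistic and unequal under another, which is exactly the ambiguity the three forces try to tease apart.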
Direction
Let's continue with the example I introduced last week: evaluating bias in a model that assigns salaries to new employees. In that example, the direction of the bias matters. If female candidates are systematically underestimated while male candidates are systematically overestimated, then the model is biased against women, because it would pay a female candidate less than an equally qualified male candidate. Equalized odds might care about this, if it compares the amounts of overestimation and (independently) the amounts of underestimation. Calibration, however, might care less, since the total accuracy (deviation from the true value) would be the same for both groups.
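As a rough illustration of how the direction force could be checked, here is a sketch that compares the mean signed error per group. The arrays and the helper function are hypothetical, invented just for this post:

```python
# Checking the "direction" force: mean signed error per group.
# y_true, y_pred, and group are hypothetical arrays; a positive result
# means the group is overestimated on average.
import numpy as np

def mean_signed_error(y_true, y_pred, group, g):
    """Average of (predicted - true) salary for members of group g."""
    mask = group == g
    return float(np.mean(y_pred[mask] - y_true[mask]))

# Toy data: equally qualified candidates; women's salaries underestimated
# by 5 and men's overestimated by 5 (in thousands).
y_true = np.array([60.0, 70.0, 80.0, 60.0, 70.0, 80.0])
y_pred = np.array([55.0, 65.0, 75.0, 65.0, 75.0, 85.0])
group  = np.array(["F", "F", "F", "M", "M", "M"])

print(mean_signed_error(y_true, y_pred, group, "F"))  # -5.0 (underestimated)
print(mean_signed_error(y_true, y_pred, group, "M"))  # +5.0 (overestimated)
# Total |error| is identical for both groups, so an accuracy-only
# comparison would miss this directional bias entirely.
```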
Distribution
In Figure 2, the shape of the histogram, i.e., the distribution of the error, differs between groups. While the total error is the same, one could say that low-valued candidates in one group are penalized more than in the other group, and the reverse for high-valued candidates. Equalized odds might care about this force: if you compare the groups' error at any single score bin, they are unequal (e.g., at score bin 6, Series 1 has an error value of 2 while Series 2 has an error value of 7). But calibration might also care about distribution: in the reliability plot from last week, we want both curves to have the same deviation from the line at any one place.
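The distribution force could be checked bin by bin rather than in aggregate. Here is a sketch on made-up per-bin errors; only the bin-6 values (2 vs 7) come from the figure, and the rest are invented so that both series sum to the same total:

```python
# Comparing per-bin error between two groups whose totals are equal.
# Only the bin-6 values (2 vs 7) echo the figure; the rest are invented
# so that both series sum to the same total.
import numpy as np

series_1 = np.array([7, 5, 3, 1, 1, 2, 6, 7])
series_2 = np.array([4, 4, 4, 4, 4, 7, 3, 2])

print("total error:", int(series_1.sum()), "vs", int(series_2.sum()))  # 32 vs 32
gaps = np.abs(series_1 - series_2)
print("per-bin gaps:", gaps)                  # equal totals hide unequal bins
print("worst score bin:", int(np.argmax(gaps)) + 1)  # bin 6 (1-indexed)
```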
Magnitude
The third force is seen in Figure 3. While the shape (distribution) of the error is the same in both groups and the net error is the same (underestimation and overestimation cancel out), the magnitude of the error is different. Equalized odds would care that the amount of error in corresponding bins is different. Calibration, however, might not find this model biased, because each group's overestimation and underestimation split evenly, so each group's predictions remain correct on average.
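A sketch of the magnitude force, again with made-up residuals: the net error per group is identical, so an accuracy-only view sees nothing, while a magnitude view does:

```python
# Same shape, zero net error per group, but different error magnitude.
# The residuals are made up for illustration.
import numpy as np

resid_g1 = np.array([-2.0, -1.0, 1.0, 2.0])
resid_g2 = 3 * resid_g1   # same shape, three times the magnitude

# A total-accuracy view (signed mean) sees no difference between groups...
print("mean signed error:", resid_g1.mean(), "vs", resid_g2.mean())  # 0.0 vs 0.0
# ...but comparing error magnitudes across groups does.
print("mean |error|:", np.abs(resid_g1).mean(), "vs", np.abs(resid_g2).mean())  # 1.5 vs 4.5
```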
Summary
While considering these three forces (in the meeting with Caitlin and Elke, and also just now while writing), I found it difficult to distinguish equalized odds and calibration. Kleinberg et al. proved in their paper that the two are incompatible, i.e., no non-trivial model can satisfy both, yet I found overlap when applying the criteria to the regression/ranking context. Both certainly have to do with mitigating error, though equalized odds might care more about the type of error, while calibration might care more about the total accuracy (blind to over- or underestimation). In this analysis, equalized odds is concerned with all three forces, but calibration is really only concerned with distribution. I'm not sure this is entirely accurate, since it would mean calibration is essentially subsumed within equalized odds.
We decided in the meeting that we need to be more precise with our definitions of equalized odds and calibration, especially in the new context of ranking. Ideally, the two would be distinct, since it might be possible to correct for any one (but not all) of these three forces in a way that meets a definition of fairness without entirely losing the model's utility.
Next Steps
I am officially halfway through this project, but I feel like much of my time has been spent exploring definitions and asking questions. I have a couple of weeks of break before the school year starts again, and I hope to find a way to start answering these questions. I likely won't post during the break, as I will be spending time with my family, but I expect to give a recap of whatever work I do accomplish at the start of the next quarter.
Happy holidays, and until January!