Week 20: Experimental plan
Hello! Full disclosure, last week's blog post contains information on some of the things I did this week. I wrote it with the conclusion in mind, so (I think) it sounds pretty well thought-out. However, last week's work did not go as smoothly as I made it sound. For one, after I collected 11 different datasets on which to test our models, Caitlin realized that we weren't collecting the right kind of data. The datasets I chose didn't have a clear outcome variable (which I tried to write about in last week's post), but we definitely need one to test the models (or so we think).
That wrench in our plans led Caitlin and me to have a long discussion about how the ranking actually works. We drew a lot on the whiteboard, and then I went home and tried thinking through the process myself. The rest of this post is a walkthrough of my reasoning about how an experiment like this might work.
I started by thinking about scoring in the realm of statistical parity. Statistical parity, to recall, is the fairness criterion that says that, if two groups are equal, then they should be represented in the same proportion. For example, if males and females have the same potential for success in college and 50% of the students who apply are male (and 50% female), then the group of accepted students should be 50% male and 50% female.
There are two slightly different ways of defining the "right" proportion, though. One way is to say that, in the world, there are 50% males and 50% females (I'm not trying to erase the transgender community, I'm just simplifying the problem for the sake of hypothetical analysis). Thus, the accepted students should follow the true proportion of males and females. This is called the "We're All Equal" paradigm in this paper. The other way is to define the proportion by the students who applied. If 70% of the students who applied are male, then the proportion of accepted male students should be 0.7, or 70%. This is the "What You See Is What You Get" paradigm as described in the aforementioned paper.
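To make the two paradigms concrete, here is a minimal sketch using the college example (all numbers are hypothetical):

```python
# Toy illustration of the two target-proportion paradigms; all numbers are made up.
n_accepted = 100          # size of the accepted class
world_prop_male = 0.50    # "We're All Equal": proportion of males in the world
applied_prop_male = 0.70  # "What You See Is What You Get": proportion among applicants

# Target number of accepted males under each paradigm
wae_target = round(n_accepted * world_prop_male)        # -> 50
wysiwyg_target = round(n_accepted * applied_prop_male)  # -> 70

print(f"WAE target:     {wae_target} males out of {n_accepted} accepted")
print(f"WYSIWYG target: {wysiwyg_target} males out of {n_accepted} accepted")
```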
Back to my working example, let's consider that the protected attribute, a, has two groups: a1 and a2. The ground truth I will use is that a1 and a2 are equal and that there are the same number of a1 people as there are a2 people (proportion of each is 0.5).
Let's say that we were using a model, m, and we came up with the following scoring:
Ordering this by the score, we get the following rank:
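(For the code-inclined: a ranking here is nothing more than a sort by score. The sketch below uses made-up scores for eight observations, o1 through o8, just to have something concrete to point at; it is not the actual table from our whiteboard.)

```python
# Hypothetical scores for eight observations; "group" is the protected attribute a.
# These values are made up for illustration and are not the original table.
observations = [
    {"id": "o1", "group": "a1", "score": 0.95},
    {"id": "o2", "group": "a1", "score": 0.90},
    {"id": "o3", "group": "a2", "score": 0.85},
    {"id": "o4", "group": "a1", "score": 0.80},
    {"id": "o5", "group": "a1", "score": 0.75},
    {"id": "o6", "group": "a2", "score": 0.70},
    {"id": "o7", "group": "a2", "score": 0.65},
    {"id": "o8", "group": "a2", "score": 0.60},
]

# The ranking is just the observations sorted by score, highest first.
ranking = sorted(observations, key=lambda o: o["score"], reverse=True)
for position, o in enumerate(ranking, start=1):
    print(position, o["id"], o["group"], o["score"])
```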
The question then becomes, is this ranking fair?
To evaluate this, I'll consider the difference in the cumulative score between groups. In the following table, the score of each group is aggregated, and the difference between groups is shown in the rightmost column. When a1-a2 is positive, a1 is favored by the ranking. When a1-a2 is negative, a2 is favored by the ranking.
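To make that bookkeeping concrete, here is a minimal sketch of the running a1-a2 difference, again using hypothetical scores rather than the actual values from the table:

```python
# Walk down a ranking and track the cumulative score of each group.
# The ranking below is hypothetical; what matters is the running a1-a2 difference.
ranking = [
    ("o1", "a1", 0.95), ("o2", "a1", 0.90), ("o3", "a2", 0.85), ("o4", "a1", 0.80),
    ("o5", "a1", 0.75), ("o6", "a2", 0.70), ("o7", "a2", 0.65), ("o8", "a2", 0.60),
]

cumulative = {"a1": 0.0, "a2": 0.0}
for position, (obs_id, group, score) in enumerate(ranking, start=1):
    cumulative[group] += score
    diff = cumulative["a1"] - cumulative["a2"]  # positive: a1 favored; negative: a2 favored
    print(f"rank {position}: {obs_id} ({group})  a1-a2 = {diff:+.2f}")
```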
As you can see, the original ranking definitely favors the group a1.
To correct for this, we could try to make sure that a1-a2 is always close to 0. Consider the following corrected ranking:
This ranking certainly shrinks the a1-a2 difference. Now the model takes turns favoring group a1 and group a2.
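One way a correction like this could be produced (this is just my own greedy sketch, not necessarily the procedure we will actually use) is to fill each position with the best remaining observation from whichever group is currently behind in cumulative score:

```python
# Greedy re-ranking sketch: at each position, take the highest-scored remaining
# observation from whichever group currently has the lower cumulative score.
# This keeps the running a1-a2 difference close to zero. Scores are hypothetical.
from collections import deque

ranking = [  # hypothetical original ranking, best first
    ("o1", "a1", 0.95), ("o2", "a1", 0.90), ("o3", "a2", 0.85), ("o4", "a1", 0.80),
    ("o5", "a1", 0.75), ("o6", "a2", 0.70), ("o7", "a2", 0.65), ("o8", "a2", 0.60),
]

queues = {"a1": deque(), "a2": deque()}
for obs in ranking:                  # preserve the within-group score order
    queues[obs[1]].append(obs)

cumulative = {"a1": 0.0, "a2": 0.0}
corrected = []
while queues["a1"] or queues["a2"]:
    # prefer the group that is currently behind, if it still has candidates
    behind = "a1" if cumulative["a1"] <= cumulative["a2"] else "a2"
    group = behind if queues[behind] else ("a2" if behind == "a1" else "a1")
    obs_id, g, score = queues[group].popleft()
    cumulative[g] += score
    corrected.append(obs_id)

print(corrected)  # roughly alternates between a1 and a2
```

This roughly alternates between the groups and keeps the running a1-a2 difference small, at the cost of moving some observations away from their score-ordered positions.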
However, now we need to consider the tradeoff between fairness and utility.
Interlude: Fairness/Utility tradeoff
In our example, we assume the scoring algorithm is the best that it can be, given the circumstances. For whatever reasons, biased or otherwise, this is the scoring algorithm that was learned. Evaluating the model with the addition of some other information (the ground truth/outcome variable) in some ways weakens the model. Moving observations around to achieve statistical parity reduces the usefulness of the model because the score carries less meaning. Thus, fairness and utility have a tradeoff. Except where the model is already perfectly fair, some amount of utility has to be lost to regain fairness. Going the other way, some bias may have to be tolerated for the model to still be useful at making decisions. Since we haven't figured out a metric for this yet, talking about the tradeoff is kind of abstract.
[End interlude]
Here is the difference between the first (most useful) model and the second (most fair) model:
It is clear that the model changes. If we give a point penalty for every row that the observation moved, then the "most fair" model would have a penalty of 6. This metric is very rough, and it is unclear what a proper threshold might be between utility and fairness. If the threshold were 4, for example, maybe o3 and o5 could be swapped in the "most fair" model.
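As a quick sketch of that penalty, count how many rows each observation moved between the two orderings and sum them. (The lists below reuse the hypothetical ranking and greedy correction from the earlier sketches, so the total comes out to 14 here rather than the 6 from the real tables.)

```python
# Displacement penalty: one point for every row an observation moved
# between the original ("most useful") and corrected ("most fair") rankings.
# Both orderings are hypothetical stand-ins for the real tables.
original = ["o1", "o2", "o3", "o4", "o5", "o6", "o7", "o8"]
corrected = ["o1", "o3", "o6", "o2", "o7", "o4", "o8", "o5"]

penalty = sum(
    abs(original.index(obs_id) - corrected.index(obs_id))
    for obs_id in original
)
print(penalty)  # -> 14 for these toy lists
```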
Conclusion
This is one example of how a scoring model could be evaluated under the criterion of statistical parity. Statistical parity is possibly the easiest of the three criteria, since it only relies on some ground truth about group proportions (e.g., the proportion of a1 and a2 in the world). Equalized odds and calibration are more challenging, because they require some knowledge of the outcome in order to assign a prediction error to each observation. The next step will be to work through those scenarios, and hopefully find some outcome attributes for the datasets I mentioned last week so that we can evaluate fairness.