Week 11: Game Plan

This week, I met with the graduate student, Caitlin, to discuss next steps. (Unfortunately, I missed the weekly meeting with the professor because I was at an interview.) We decided on a strategy for addressing each aspect of ranking and fairness. This strategy is what we are calling 3x3x2: 3 fairness criteria, 3 applications, and 2 tasks.

3 Fairness Criteria

In my last post, I summarized a series of papers related to fairness criteria for classification. After my summary, I introduced three distinct ideas: statistical parity, calibration, and equalized odds.

Statistical Parity

Statistical parity is the idea that two groups should have equal outcomes. For example, if you assume that women are just as qualified to attend graduate school as men, then women should be admitted to graduate programs at the same rate as men. This criterion was described by Friedler et al. as the "We're All Equal" (WAE) axiom. In short, the WAE axiom says that, with respect to the decision at hand, two groups are essentially equal. With respect to graduate school admissions, men and women are equally qualified.
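
As a quick illustration, here's a minimal Python sketch of how you might measure a statistical parity gap. The function name and toy admissions data are my own, purely for illustration:

```python
import numpy as np

def statistical_parity_gap(decisions, group):
    """Difference in positive-decision rates between two groups.

    decisions: 0/1 outcomes (e.g., admitted or not)
    group:     0/1 group labels (e.g., 0 = men, 1 = women)

    A gap near zero means both groups receive positive outcomes at
    roughly the same rate, which is what the WAE axiom asks for.
    """
    decisions = np.asarray(decisions)
    group = np.asarray(group)
    return decisions[group == 1].mean() - decisions[group == 0].mean()

# Toy admissions data, invented for illustration only.
admitted = [1, 0, 1, 1, 0, 1, 0, 1]
gender   = [0, 0, 0, 0, 1, 1, 1, 1]
print(statistical_parity_gap(admitted, gender))  # 0.5 - 0.75 = -0.25
```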

Calibration

Calibration is the idea that your predictions should mean something. In the classification setting, you might assign each individual you are trying to classify a predicted probability of being positive. Those probabilities should be accurate: of all the individuals whose predictions fall in a certain bin, that proportion of them should turn out to be true positives.

For a motivating example (one different from the COMPAS one I've discussed before), consider predicting the effect of smoking on lung cancer. Let's say you have two groups of people: smokers and non-smokers. You are trying to identify individuals with the highest risk of getting lung cancer so that you can spend extra resources in the clinic monitoring these people (to detect the cancer as early as possible). Unfortunately, you only have the resources to monitor a limited number of people.

To illustrate, here's a series of images I used to describe calibration in a presentation. Imagine this blob represented a single group (smokers) for whom you knew whether or not they developed lung cancer (supervised learning). You want to find a method to identify individuals most likely to develop lung cancer.

[Image: a blob of small blue circles, each representing one unlabeled individual in the smoker group]

You start with unlabeled data (pictured above) and then assign each individual (small blue circle) a probability of developing lung cancer. The probabilistic predictions are continuous, so you bin individuals into probability groups. In the picture below, 0% represents individuals with a 0-33% chance of developing lung cancer, 50% represents those with a 34-66% chance, and 100% represents those with a 67-100% chance.

[Image: the same individuals binned into 0%, 50%, and 100% probability groups]

In the picture above, you identify the 6 individuals who have the highest likelihood of developing lung cancer. (Note that this example is slightly misleading; the true bin label would probably be the average of the probability estimates of the individuals in it, so something closer to 80%. In this example, let's assume the average probability for the top bin was in fact 100%.) Now that you've built a classifier for assigning probabilities, you want to test it against the true data.

[Image: the binned predictions compared against the true outcomes]

In this picture, the green circles represent individuals who later developed lung cancer, and red represents those who didn't. This example is well-calibrated; all of the people you said had a 100% chance of developing lung cancer actually did. Same with those you assigned a 0% chance. Of all the individuals you assigned a 50% chance, half of them developed cancer.

To imagine a poorly calibrated example, suppose that half of the 100%-chance people did not develop cancer. That would suggest the model overestimates cancer risk and is not well calibrated in the 100% bin. Why is this important? The fairness criterion says that calibration should be the same in both groups. If you accurately predict cancer in smokers but not in non-smokers, you are being unfair to the non-smokers: with your limited clinic resources, you will miss more of the non-smokers who go on to develop cancer.

Your model is fair if it makes the same amount of error in both groups. Said a different way, having a 100% chance of getting lung cancer should mean the same thing regardless of whether you are a smoker or a non-smoker.
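
As a rough sketch of how you might check this, here's some toy Python (the data, bin edges, and variable names are invented for illustration, not code from our project) that bins predicted risks and compares the mean predicted risk to the observed cancer rate in each bin, separately for smokers and non-smokers:

```python
import numpy as np

# Toy data, invented for illustration: predicted cancer risk, observed
# outcomes, and a smoker/non-smoker flag for each individual.
risk       = np.array([0.9, 0.8, 0.2, 0.1, 0.5, 0.6, 0.9, 0.1, 0.5, 0.4])
got_cancer = np.array([1,   1,   0,   0,   1,   0,   1,   0,   0,   1])
smoker     = np.array([1,   1,   1,   1,   1,   0,   0,   0,   0,   0], dtype=bool)

def calibration_table(probs, outcomes, edges=(0.0, 1/3, 2/3, 1.0)):
    """Mean predicted risk vs. observed cancer rate in each probability bin."""
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        if hi == edges[-1]:
            in_bin = (probs >= lo) & (probs <= hi)  # include 1.0 in the top bin
        else:
            in_bin = (probs >= lo) & (probs < hi)
        if in_bin.any():
            rows.append((f"{lo:.2f}-{hi:.2f}", probs[in_bin].mean(), outcomes[in_bin].mean()))
    return rows

# The fairness criterion asks that predicted ~= observed in every bin
# for BOTH groups, not just one of them.
for name, mask in [("smokers", smoker), ("non-smokers", ~smoker)]:
    print(name, calibration_table(risk[mask], got_cancer[mask]))
```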

Equalized Odds

Equalized odds is similar to calibration in that it is a way of evaluating how well your model fits. Specifically, our definition of equalized odds is that the false positive rate and/or the false negative rate should be the same regardless of group membership. In the example above, let's say you incorrectly predict that 10% of smokers will develop lung cancer. Your false positive rate should be the same for non-smokers: 10% of them should also be incorrect positive predictions. (In this case, I'm treating the prediction as binary: predicted positive or predicted negative. However, you could get this binary classifier from the probabilistic predictor by setting a threshold, like 50% likelihood. Alternatively, you could slide that threshold and trace out an ROC curve.)
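
To make this concrete, here's a minimal sketch (the toy data and the 0.5 threshold are my own choices, not from our project) that thresholds a probabilistic predictor and compares false positive and false negative rates across groups:

```python
import numpy as np

def error_rates(y_true, y_pred):
    """False positive rate and false negative rate for binary predictions."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    fpr = (y_pred & ~y_true).sum() / max((~y_true).sum(), 1)
    fnr = (~y_pred & y_true).sum() / max(y_true.sum(), 1)
    return fpr, fnr

# Toy data: threshold the predicted risk at 0.5 to get binary predictions,
# then compare error rates across groups. Equalized odds asks the rates to
# match (sliding the threshold instead would trace out an ROC curve).
risk   = np.array([0.9, 0.8, 0.2, 0.6, 0.5, 0.7, 0.1, 0.4])
truth  = np.array([1,   1,   0,   0,   1,   1,   0,   0])
smoker = np.array([1,   1,   1,   1,   0,   0,   0,   0], dtype=bool)
pred = risk >= 0.5

# The two groups' error rates differ here, so equalized odds is violated.
print("smokers:    ", error_rates(truth[smoker], pred[smoker]))
print("non-smokers:", error_rates(truth[~smoker], pred[~smoker]))
```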

Kleinberg et al. show that you can have equal false positive rates and false negative rates across groups (or one or the other), but you can't have both of these and calibration at the same time.

3 Applications

We want to consider three distinct applications for ranking: scoring, categorical ranking, and ordering.

Scoring

Scoring is basically linear regression. The idea is that individuals can be assigned a score, and that score tells you something about their relative merit. For example, let's say you were ranking colleges based on their research contributions. You assign We Publish Institution (WPI) a score of 10 and Recent Publication Institute (RPI) a score of 9.5, but Mostly Is Technical-papers school (MIT) a score of 2. Ordering would only tell you WPI > RPI > MIT, but scoring tells you that RPI is pretty similar to WPI while MIT is very different. If you were using this model to make a decision, you might consider choosing WPI or RPI, but definitely not MIT.
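
To make the distinction concrete, here's a tiny sketch using the made-up scores from the example above:

```python
# Scores carry magnitude, not just order: WPI and RPI are close,
# while MIT is far behind, which the ordering alone hides.
scores = {"WPI": 10.0, "RPI": 9.5, "MIT": 2.0}

ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)                         # ['WPI', 'RPI', 'MIT'] -- the ordering view
print(scores["WPI"] - scores["RPI"])  # 0.5 -- nearly interchangeable
print(scores["RPI"] - scores["MIT"])  # 7.5 -- a large gap the ordering hides
```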

Categorical ranking

Categorical ranking assigns individuals to ordered bins; for example, assigning 1 to 5 stars to restaurants to represent quality. You might not care which 5-star restaurant you go to, but you would prefer going to a 5-star over a 2-star.

Ordering

Ordering is the same as scoring, except the score itself has no meaning. Instead, you are only interested in the relative order of individuals. If you are looking for the best applicants to interview for a job, you might assign each applicant a score based on their skills and qualities and then pick the 10 with the highest scores to interview. You don't really know whether #2 is way better than #7; you just know these are the 10 best.
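
A minimal sketch of the same idea (the applicant names and scores here are invented):

```python
# Only the relative order matters: pick the 10 highest-scoring applicants.
# The scores themselves are discarded once the ranking is made.
applicants = {f"applicant_{i}": score
              for i, score in enumerate([7.2, 9.1, 3.4, 8.8, 5.0, 6.6,
                                         9.5, 2.1, 8.0, 7.7, 4.3, 6.9])}

top_10 = sorted(applicants, key=applicants.get, reverse=True)[:10]
print(top_10)  # who to interview; how much better #2 is than #7 is not recorded
```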

2 Tasks

Each of the three fairness criteria should be evaluated for each of the three applications. To evaluate, we consider two tasks: audit and correct.

Audit

To audit, we need to find some way to apply the fairness criteria to the application and determine whether or not it is fair.

Correct

To correct, we need to find some way to make the application fair according to the criteria.

Summary

The fairness criteria, applications, and tasks are organized in the following table:

[Table: the three fairness criteria crossed with the three applications and the two tasks, with some cells highlighted in green]

The green cells are the ones Caitlin and I will be addressing in the next week.
