Week 19: Datasets

Welcome back me! I can't believe we are already halfway through this project. Over break, I was able to review some of my blog posts and notes. It was pretty amazing to see how this project has transformed. At the beginning, I had a vague idea of how research worked. I thought I would read about rankings and fairness and then magically solve the fairness problem for ranking with a few pithy formulas. What I've realized, however, is that research, at least the type I've been doing, is an iterative process. We (Caitlin and I) would read fairness papers, discuss them, come up with an idea of what fairness means in ranking, try to explain it to our advisor, realize we don't understand fairness, and repeat. Two terms in, we have now identified three distinct fairness criteria to be applied to three different types of ranking problems.

Now that we have a pretty good idea of the problem, we would like to apply it to some real data. The IQP team, Diana and the others (see diana.matters-creu.com), have been using several datasets as use cases for designing their product and facilitating the user study (such as datasets about colleges, movies, video games, FIFA, and states). Not all of these datasets include a ranking or scoring attribute; they don't need to. The IQP team only needs datasets with clear observation names (such as the names of states) so that users can build a ranking from them. In contrast, to evaluate the fairness criteria that Caitlin and I have defined, we need datasets with a pre-made ranking or scoring attribute, a protected attribute, and some ground truth or outcome attribute that can be treated as fair. For example, if we were using the COMPAS dataset that I've described before, we would need the COMPAS score attribute, a protected attribute like sex, and some outcome such as the number of months until the defendant was arrested for another crime.

This week, I started searching for datasets that we could use. I looked in previous research papers for open datasets, as well as in open data repositories such as the UCI Machine Learning Repository and Kaggle. The following sections describe some of the datasets that I found (3-4 for each ranking type) and identify the ranking/scoring attribute, one or more potentially protected attributes, and the ground truth/outcome attribute.

Additionally, the datasets are organized by ranking type (scoring, ordering, and categorical). To refresh, scoring is when a continuous score is assigned to each observation, giving some idea of the distance between observations. Ordering is the same except that there is no knowledge of distance. An example is the US News college rankings; we don't know whether College #1 is as different from College #2 as #2 is from #3, so we can't use those distances in our analysis. Categorical is like ordering, except that there are discrete bins, for example, a 5-star rating system. In this scenario, two observations with the same number of stars are ranked the same. The following figure shows how these ranking paradigms are related. A scoring application has the most information on the observations, but you could ignore the score and only consider the ordering of the observations. Further, you could bin observations into categories like "best", "medium", and "worst". In this way, every dataset can be thought of as a categorical problem, but only some datasets have enough information to be scoring.
Figure: All datasets can be categorical, but only some can be scoring.
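To make that nesting concrete, here is a minimal sketch (in Python/pandas, with made-up observations and scores) of how a scoring attribute can be collapsed into an ordering and then binned into categories:

```python
import pandas as pd

# Made-up observations and scores, just to illustrate the three ranking types.
df = pd.DataFrame({
    "observation": ["A", "B", "C", "D", "E"],
    "score": [92.5, 88.0, 87.9, 60.2, 45.0],   # scoring: distances are meaningful
})

# Ordering: keep only the rank and throw away the distances between scores.
df["order"] = df["score"].rank(ascending=False, method="min").astype(int)

# Categorical: bin the scores into discrete groups (here, thirds).
df["category"] = pd.qcut(df["score"], q=3, labels=["worst", "medium", "best"])

print(df)
```

Each step only throws information away, which is why every scoring dataset can be treated as an ordering or categorical one, but not the other way around.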

Scoring

German Credit Data

This dataset was mentioned in the FA*IR top-k ranking algorithm paper that I mentioned before (available at this link). The dataset is available on the UCI Machine Learning Repository. While the algorithm for creating the credit score is not available, there is an attribute called "credit amount" that can be used as an indirect stand-in for the customer's score. Protected attributes could be marital status, sex, or age (discretized, for example, into "old" and "young"). It is difficult to know what the outcome/ground truth variable would be. Ideally, we would have some kind of long-term data, for example, whether or not the customer paid off their credit in full. As of right now, I don't know if that data is available.
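If we do end up using this dataset, I imagine the setup looking roughly like the sketch below. The file name, column names, and the age cutoff are all assumptions on my part; the real UCI file's columns would need to be mapped from its documentation.

```python
import pandas as pd

# Rough setup sketch for the German Credit data.
# ASSUMPTION: "german_credit.csv", "credit_amount", and "age" are placeholder
# names; the actual UCI file's columns would need to be mapped from its docs.
german = pd.read_csv("german_credit.csv")

# Scoring attribute: "credit amount" as an indirect stand-in for a credit score.
scores = german["credit_amount"]

# Protected attribute: age, discretized into "young" vs. "old"
# (the 35-year cutoff is arbitrary, purely for illustration).
german["age_group"] = pd.cut(german["age"], bins=[0, 35, 120],
                             labels=["young", "old"])

# Outcome/ground truth: still an open question for this dataset.
```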

Ordering

US News College Rankings

Contempt for the US News College Rankings is evident in the media. Malcolm Gladwell, author of Outliers, wrote a piece in The New Yorker about this ranking system. (You can find more about it here.) Others have criticized this ranking system, saying that it favors colleges that already have a reputation (ahem, Harvard) and is somewhat of a self-fulfilling prophecy. There are several datasets that include college statistics, most notably the College Scorecard dataset collected by the US government. In this example, the ranking attribute is the US News ranking (or some other published ranking) and the protected attribute could be, for example, the number of students with Pell Grants. Unfortunately, this dataset also doesn't have an outcome variable, but we could use something like graduation rate to measure the success of the students and, by extension, the school.
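As with the credit data, here is a rough sketch of how I picture lining up a published ranking with College Scorecard attributes. All file and column names ("us_news_rankings.csv", "pct_pell", "grad_rate", etc.) are placeholders, not the Scorecard's real column codes:

```python
import pandas as pd

# Join a published ranking to College Scorecard statistics by institution name.
# ASSUMPTION: all file and column names below are placeholders.
colleges = pd.read_csv("college_scorecard.csv")
rankings = pd.read_csv("us_news_rankings.csv")   # institution + us_news_rank

merged = rankings.merge(colleges, on="institution", how="inner")

# Protected attribute: share of students with Pell Grants, discretized
# (the 40% cutoff is arbitrary, purely for illustration).
merged["pell_group"] = pd.cut(merged["pct_pell"], bins=[0, 0.4, 1.0],
                              labels=["low_pell", "high_pell"])

# Outcome stand-in: graduation rate as a rough measure of student success.
outcome = merged["grad_rate"]
```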

Categorical

Restaurant Ratings

An example of a categorical dataset is restaurant ratings, available from sources like yelp.com. Protected attributes could be things like the location of the restaurant (rich part of town vs. poor part of town) and the ethnicity of the food (see this paper). The outcome variable could be something like the restaurant's health grade (an official grade from the health department, which can be A, B, or C).
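Here is a tiny sketch (entirely made-up restaurants) of what the categorical structure looks like: restaurants with the same star rating share a rank, and the health grade plays the role of the outcome.

```python
import pandas as pd

# Hypothetical restaurant data: stars are the categorical ranking, neighborhood
# is a possible protected attribute, and the health grade is the outcome.
restaurants = pd.DataFrame({
    "name":         ["Cafe A", "Diner B", "Bistro C", "Grill D"],
    "stars":        [4.5, 4.5, 3.0, 2.5],
    "neighborhood": ["rich", "poor", "poor", "rich"],
    "health_grade": ["A", "A", "B", "C"],
})

# Two restaurants with the same number of stars receive the same rank.
restaurants["rank"] = restaurants["stars"].rank(ascending=False,
                                                method="min").astype(int)
print(restaurants)
```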

Future steps

I think that it makes sense for the protected attribute to be binary (male/female, poor/rich), but I don't know if it needs to be. Next week, we plan to design the experimental process for testing our datasets, so we will also evaluate whether or not the protected attribute needs to be binary.
