Week 12: Error metrics for calibration and equalized odds
At the end of the last post, I mentioned that we would be looking at the Audit task for Calibration and Equalized Odds under the Scoring application. Caitlin and I divided these tasks: I started looking at error metrics for Calibration and thinking about how we could apply them to scoring, and Caitlin did the same for Equalized Odds.
Calibration in Classification
In my previous post, I described calibration under classification. In this paradigm, one is trying to predict whether or not an individual is in the positive class. Each individual is assigned a probability: the probability that the individual is in the positive class. If a classifier is well calibrated, approximately x% of the people assigned a probability of x% will actually be positive. Basically, calibration means that the probability estimate means what it says it means.
For example, suppose that among individuals to whom the predictor assigned a probabilistic score of 0.7 (70% likely to be in the positive class), about 80% were actually positive. The model under-predicted for these individuals. Among individuals assigned a probabilistic score of 0.4 (40% likely to be in the positive class), about 40% were truly positive, so the model was well calibrated at that score.
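To make that concrete, here is a minimal sketch of how one could check calibration by binning predicted probabilities and comparing each bin's average prediction to the fraction of individuals who were actually positive. This is just my own illustration with made-up numbers, not code or data from our project.

```python
import numpy as np

def calibration_table(y_true, y_prob, n_bins=5):
    """Group predictions into equal-width probability bins and compare each
    bin's average predicted probability to its observed positive rate."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Digitize against the interior edges so that 0.0 and 1.0 both land in a bin.
    bin_idx = np.digitize(y_prob, edges[1:-1])
    rows = []
    for b in range(n_bins):
        mask = bin_idx == b
        if not mask.any():
            continue
        rows.append((y_prob[mask].mean(),   # average predicted probability
                     y_true[mask].mean(),   # fraction actually positive
                     int(mask.sum())))      # number of individuals in the bin
    return rows

# Toy usage: the ~0.7 group is under-predicted, the ~0.4 group is well calibrated.
y_prob = [0.70, 0.72, 0.68, 0.71, 0.69, 0.40, 0.41, 0.43, 0.42, 0.44]
y_true = [1,    1,    1,    1,    0,    0,    1,    0,    1,    0]
for mean_pred, frac_pos, n in calibration_table(y_true, y_prob):
    print(f"predicted ~{mean_pred:.2f} -> observed {frac_pos:.2f} (n={n})")
```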
Problems with applying this directly to Scoring
Scoring can be thought of as a regression problem. In this paradigm, each individual is assigned a score that represents their value in the ranking (i.e., a higher score is better than a lower score). In a supervised learning setting, we would know the true score for each individual. This is not really the same problem as classification: we are not looking for a probability estimate of whether someone is in the positive class, because there is no positive class.
How I addressed it
The first thing I considered was trying to force each individual to have a probability value. This could be done by assuming the true scores follow some underlying distribution and trying to learn a p-value that represents the likelihood that the score we assigned an individual was really their true score. This doesn't make much sense for two reasons. First, each individual is a single data point, and we can't learn a p-value from just one data point. Second, we wouldn't be able to determine whether the probability is well calibrated; my guess is that a well-calibrated version of this model would mean that all p-values are the same.
I decided to turn to Google and look for alternative definitions of calibration that could make more sense for regression. One of the top hits was this post on Stack Exchange. The poster, Berk Ustun, was looking for an alternative method for testing performance. He later answered his own question, citing a couple of papers that describe error metrics, specifically for calibration.
Root Mean Squared Error (RMSE) and Reliability Diagram
In this survey paper, mentioned in the Stack Exchange answer, the authors describe a calibration metric that is based on reliability diagrams.
A reliability diagram compares the predicted probability to the observed proportion of positives. In classification, the X axis can be thought of as the probabilistic prediction for each individual, ordered from lowest to highest, and the Y axis is the fraction of those individuals who were actually positive. A perfectly calibrated predictor would produce a straight diagonal line. In the image below, the blue line represents the reliability of the predictor.
[Image: reliability diagram]
Source: https://www.mathworks.com/matlabcentral/mlc-downloads/downloads/submissions/25704/versions/4/screenshot.png
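As a side note, for the classification case scikit-learn already does this binning via calibration_curve; the sketch below shows how one might draw a reliability diagram like the one above. The synthetic data and the choice of 10 bins are my own assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)

# Synthetic example: predicted probabilities and binary outcomes drawn so that
# the predictor is mildly miscalibrated.
y_prob = rng.uniform(0.0, 1.0, size=2000)
y_true = (rng.uniform(0.0, 1.0, size=2000) < y_prob ** 1.3).astype(int)

# Observed fraction of positives vs. mean predicted probability, per bin.
frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=10)

plt.plot(mean_pred, frac_pos, marker="o", label="predictor")
plt.plot([0, 1], [0, 1], linestyle="--", label="perfectly calibrated")
plt.xlabel("Mean predicted probability")
plt.ylabel("Observed fraction of positives")
plt.legend()
plt.show()
```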
I believe this same idea can work in a regression setting. On the X axis, I would put the individuals' predicted scores, ordered from lowest to highest. On the Y axis, I would put the true scores (which we know in a supervised learning setting). I could then fit a line to these points that would look very similar to the one above.
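Here is a rough sketch of how I imagine building that regression-style reliability diagram: sort individuals by predicted score, split them into equal-sized bins, and compare the average predicted score to the average true score in each bin. The binning scheme and bin count are my own assumptions, not something taken from the paper.

```python
import numpy as np

def regression_reliability(y_pred, y_true, n_bins=10):
    """Sort individuals by predicted score, split them into equal-sized bins,
    and return the average predicted and average true score per bin.
    Plotting mean_true against mean_pred gives a regression analogue of a
    reliability diagram (a well-calibrated scorer hugs the diagonal)."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    order = np.argsort(y_pred)                       # scores from lowest to highest
    pred_bins = np.array_split(y_pred[order], n_bins)
    true_bins = np.array_split(y_true[order], n_bins)
    mean_pred = np.array([b.mean() for b in pred_bins])
    mean_true = np.array([b.mean() for b in true_bins])
    return mean_pred, mean_true

# Toy usage with made-up scores: noisy but roughly unbiased predictions.
rng = np.random.default_rng(1)
true_scores = rng.normal(50.0, 10.0, size=1000)
pred_scores = true_scores + rng.normal(0.0, 5.0, size=1000)
mean_pred, mean_true = regression_reliability(pred_scores, true_scores)
```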
In the paper, the authors suggest using a Mean Calibration Error to evaluate the accuracy of a predictor. This metric is based on root mean squared error (RMSE). I think this same metric can be applied directly to calibration in the scoring setting.
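My reading of that metric, sketched below, is a root mean squared difference between the per-bin average predicted score and the per-bin average true score, reusing the bins from the sketch above; the paper's exact weighting of the bins may differ.

```python
import numpy as np

def mean_calibration_error(mean_pred, mean_true):
    """RMSE-style calibration error: root mean squared difference between the
    average predicted and average true values across bins. A perfectly
    calibrated predictor gives an error of 0."""
    mean_pred = np.asarray(mean_pred, dtype=float)
    mean_true = np.asarray(mean_true, dtype=float)
    return float(np.sqrt(np.mean((mean_pred - mean_true) ** 2)))

# Using the per-bin averages from the regression reliability sketch above:
# error = mean_calibration_error(mean_pred, mean_true)
```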
I discussed this with Caitlin and with our mentor, and it seems like a good way to go. The next step is to determine how to fix a model that is not well calibrated. Until next week!