Human Differences in Judgment Lead to Problems for AI
by Mayank Kejriwal at Singularity Hub: Many people understand the concept of bias at some intuitive level. In society, and in artificial intelligence systems, racial and gender biases are well documented.
If society could somehow remove bias, would all problems go away? The late Nobel laureate Daniel Kahneman, who was a key figure in the field of behavioral economics, argued in his last book that bias is just one side of the coin. Errors in judgments can be attributed to two sources: bias and noise.
Bias and noise both play important roles in fields such as law, medicine, and financial forecasting, where human judgments are central. In our work as computer and information scientists, my colleagues and I have found that noise also plays a role in AI.
Statistical Noise
Noise in this context means variation in how people make judgments of the same problem or situation. The problem of noise is more pervasive than initially meets the eye. A seminal work, dating back all the way to the Great Depression, has found that different judges gave different sentences for similar cases.
Worryingly, sentencing in court cases can depend on things such as the temperature and whether the local football team won. Such factors, at least in part, contribute to the perception that the justice system is not just biased but also arbitrary at times.
Other examples: Insurance adjusters might give different estimates for similar claims, reflecting noise in their judgments. Noise is likely present in all manner of contests, ranging from wine tastings to local beauty pageants to college admissions.
Noise in the Data
On the surface, it doesn’t seem likely that noise could affect the performance of AI systems. After all, machines aren’t affected by weather or football teams, so why would they make judgments that vary with circumstance? On the other hand, researchers know that bias affects AI, because it is reflected in the data that the AI is trained on.
For the new spate of AI models like ChatGPT, the gold standard is human performance on general intelligence problems such as common sense. ChatGPT and its peers are measured against human-labeled commonsense datasets.
Put simply, researchers and developers can ask the machine a commonsense question and compare it with human answers: “If I place a heavy rock on a paper table, will it collapse? Yes or No.” If there is high agreement between the two—in the best case, perfect agreement—the machine is approaching human-level common sense, according to the test.
So where would noise come in?
More here.