amluto 2 days ago

I'm suspicious that there's another factor in play: images being correctly labeled in the training set. Even from high-end hospitals, my personal experience leads me to believe that radiologists make both major types of error on a regular basis: calling out abnormalities in an image that are entirely irrelevant, and missing actual relevant problems that are subsequently easily seen by a doctor who is aware of what seems to be actually wrong. To top it off, surely many patients end up being diagnosed with whatever the radiologist saw, and no one ever confirms that the diagnosis was really correct. (And how could they? A lot of conditions are hard to diagnose!)

So the training data is probably hugely biased, and the models will learn to predict the training labels as opposed to any magically correct “ground truth”. And internally detecting demographics and producing an output biased by demographics may well result in a better match to the training data than a perfect, unbiased output would be.

  • KingOfCoders 2 days ago

    We recently had two ultrasound examinations, both by specialists; they saw different things.

  • resource_waste 2 days ago

    Yeah this is actually it.

    It's much easier for physicians to blame technology than their profession.

    • nradov 2 days ago

      I'm not sure it's a matter of "blame" as such. While there are occasional cases of incompetence or malpractice, biology is inherently messy and uncertain.

teruakohatu 3 days ago

I have not been able to fully digest this paper yet, and medical data is not my speciality. It is interesting but not surprising that models appear able to determine patient demographics that even radiologists cannot. It is also not surprising that models use this to "cheat" (find demographic shortcuts in disease classification).

My understanding is that doctors may unconsciously do this as well, ignoring a possible diagnosis because they don't expect a patient of a certain demographic to have a particular issue.

I would expect that radiologists who practice in very different demographic environments would not do as well when evaluating images from another environment.

At the end of the day radiology is more an art than a science, so the training data may well be faulty. Krupinski (2010) wrote in an interesting paper [1]:

"Medical images need to be interpreted because they are not self-explanatory... In radiology alone, estimates suggest that, in some areas, there may be up to a 30% miss rate and an equally high false positive rate ... interpretation errors can be caused by a host of psychophysical processes ... radiologists are less accurate after a day of reading diagnostic images and that their ability to focus on the display screen is reduced because of myopia. "

I would hope the datasets included a substantial number of images that were originally misclassified by a human.

[1] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3881280/

  • carbocation 2 days ago

    It's not too hard to train yourself to identify sex or approximate age. (I claim this because of time spent reviewing model output for work that I have done to build models to estimate age.) The reason radiologists don't do this is that there is no clinical reason to do it, so it's not a skill that they develop. (Deep neural networks can also beat radiologists at bird classification, another task that is not relevant for their job.)

ogarten 2 days ago

How does this surprise anyone?

Medical data for AI training is almost always sourced in some more or less shady country that lacks any privacy regulations. It's then annotated by a horde of cheap workers who may or may not have advanced medical training.

Even "normal medicine" is extremely biased towards male people fitting inside the norm which is why a lot of things are not detected early enough in women or in people who do not match that norm.

Next thing: doctors often think that their annotations are the absolute gold standard, but they don't necessarily know everything that is in an X-ray or an MRI.

A few years ago we tried to build synthetic data for this exact purpose by simulating medical images for 3D body models with different diseases and nobody we talked to cared about it, because "we have good data".

  • aprilthird2021 2 days ago

    Yep, you nailed it. You really don't have to think hard about why AI, which only learns from what we feed it and what it can access, has gaps and biases more pronounced than the real world's. AI lives in the internet world; it's trained on horrible cesspools of anonymous text like 4chan and Reddit. No wonder it will be biased. If you only tried to feed it sanitized data, you wouldn't have enough to get the results we get now.

    • DrScientist 2 days ago

      > You really don't have to think hard about why AI which only learns from what we feed it

      Sadly I'd say that people are no different.

      > it's trained on horrible cesspools of ....

      So it's really not the future of AI we should be worrying about...

      • aprilthird2021 2 days ago

        People are different because they have human social interaction offline, where terminally-online stereotypes and biases often fall apart.

  • resource_waste 2 days ago

    I have been quite anti-HIPAA since realizing how 'privacy' was the excuse to stunt science.

    My conspiracy: With massive medical data, ML/AI would have been 'discovered'/built sooner. Limiting the data makes it so only a few people can be specialists under the supervision of medical cartels.

    • KingOfCoders 2 days ago

      Great, where can I find your medical data on the web? Care to give a URL? It would be perfect if it included your salary too.

      • zo1 2 days ago

        Not OP. If my doctors and hospitals would give it to me in a good and easily-collatable format, then I and who knows how many other people would gladly donate it to science or for research purposes. Heck some people have this tendency to donate actual body parts and their entire bodies to science, so it's not a big stretch to say some would donate this sort of personal information for a purpose of their own choosing.

        This isn't a fully settled debate (including your salary example) so you can't just assume one side is right and argue it as if it's some unquestionable human right.

        • nradov 2 days ago

          Many healthcare providers now have APIs for patients to download their data in standard HL7 FHIR and/or CDA format. You should ask about this, and if they don't have it then consider switching providers. All modern EHRs have that functionality built in so for providers it's simply a matter of switching it on.

          However, most of that data is useless for research purposes. Even if the format complies with industry standards the quality is often bad with many data elements lacking consistent coding. You can't just feed clinical data from a bunch of different random sources into a research project and expect to get accurate results: it's a garbage in / garbage out issue. That's why most clinical research studies involve just a few provider organizations so that the researchers can properly configure the systems and train the clinicians on consistent data entry.
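
          (For what it's worth, the patient-access path is just a REST API once you have an OAuth token. A hedged sketch below: the base URL, token, and patient id are placeholders, and a real portal would require SMART on FHIR app registration first.)

          ```
          # Hypothetical sketch: read your own record from a provider's FHIR R4 endpoint.
          # BASE, the bearer token, and the patient id are placeholders, not a real service.
          import requests

          BASE = "https://fhir.example-hospital.org/R4"
          headers = {
              "Authorization": "Bearer <access-token>",
              "Accept": "application/fhir+json",
          }

          # Standard FHIR reads: demographics, then imaging studies for that patient.
          patient = requests.get(f"{BASE}/Patient/<patient-id>", headers=headers).json()
          imaging = requests.get(
              f"{BASE}/ImagingStudy",
              params={"patient": "<patient-id>"},
              headers=headers,
          ).json()

          print(patient.get("name"), imaging.get("total"))
          ```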

  • davedx 2 days ago

    I know GPT-4o can diagnose medical images. Is their model likely to be using the same kind of datasets as these models for medical systems?

carbocation 2 days ago

The authors refer to a literature describing "shortcuts" as "correlations that are present in the data but have no real clinical basis, for instance deep models using the hospital as a shortcut for disease prediction". It feels like a parallel language is developing. Most of us would call such a phenomenon "overfitting" or describe some specific issue with generalization. That example is not a shortcut in any normal sense of the word unless you are providing the hospital via some extra path.

They call demographics like age and sex "shortcuts" but I find this to be a frustrating term since it seems to obscure what's happening under the hood. (They cite many papers using the same word, so I'm not blaming them for this usage.) Men are typically larger; old bones do not look like young bones. There is plenty of biology involved in what they refer to as demographic shortcuts.

I think you could take the same results and say "Models are able to distinguish men from women. For our purposes, it's important that they cannot do this. Therefore, we did XYZ on these weakly labeled public databases." But perhaps that sounds less exciting.

  • ivanbakel 2 days ago

    I think you're oversimplifying the issue. It's not important that the model cannot distinguish between people of different demographics; it's important that the model does not use demographic information in place of actual diagnosis for the sake of better accuracy.

    That the model can determine biological sex from X-rays wouldn't be an issue if it never shortcuts the diagnostic process by using biological sex in place of meaningful diagnostic data. I would not like a model to ignore a melanoma in my chest scan because it can deduce that I was born male and my risk of breast cancer is quite low.

    The idea of penalising a model which takes such biological shortcuts (because its subgroup accuracy gets worse) seems like a good solution, and it's cool that the approach works in TFA.
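
    Concretely, that penalty can be as simple as an extra loss term on the worst-vs-best subgroup gap; a minimal sketch (my own illustration in PyTorch, not the paper's actual method):

    ```
    # Cross-entropy plus a penalty on the spread of per-group losses, so the
    # model can't buy overall accuracy by trading one subgroup against another.
    import torch
    import torch.nn.functional as F

    def fairness_penalized_loss(logits, labels, group_ids, lam=1.0):
        base = F.cross_entropy(logits, labels)
        group_losses = []
        for g in torch.unique(group_ids):
            mask = group_ids == g
            group_losses.append(F.cross_entropy(logits[mask], labels[mask]))
        group_losses = torch.stack(group_losses)
        gap = group_losses.max() - group_losses.min()  # worst-vs-best group gap
        return base + lam * gap
    ```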

    • carbocation 2 days ago

      I believe that imbuing what the model does to make a prediction with the idea of "shortcuts" is oversimplifying a more complex issue. I don't think it's helpful to describe a model that can distinguish demographics as taking "shortcuts". I think that adds a layer of jargon that we then need to cut through to understand what is going on.

      There are plenty of tools that have been developed to enforce that models perform in a manner that is unbiased across some dimension (e.g., sex, or hospital, etc). (For example unsupervised domain adaptation.) I think that splitting the field with jargon makes it more difficult to follow the breadth of the field.
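
      For example, a gradient reversal layer (the DANN-style trick commonly used for unsupervised domain adaptation) is only a few lines; a hedged sketch, assuming PyTorch:

      ```
      # Sketch of a gradient reversal layer: the feature extractor is trained to
      # *fool* an auxiliary demographic/hospital classifier, pushing it toward
      # features that don't encode that attribute.
      import torch

      class GradReverse(torch.autograd.Function):
          @staticmethod
          def forward(ctx, x, lam):
              ctx.lam = lam
              return x.view_as(x)

          @staticmethod
          def backward(ctx, grad_output):
              # Flip (and scale) the gradient on the way back.
              return -ctx.lam * grad_output, None

      def grad_reverse(x, lam=1.0):
          return GradReverse.apply(x, lam)

      # Usage: demographic_logits = demographic_head(grad_reverse(features))
      ```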

  • bravura 2 days ago

    This is not overfitting precisely because of the bias/variance tradeoff.

    A model overfits if it is unnecessarily COMPLEX for the training data.

    If there is bias in the training (and validation and test) data that allows a SIMPLE model to fit the data because of a spurious correlation, that is not overfitting.

    • carbocation 2 days ago

      There is no reason to believe that an X-ray model's correct estimation of age is due to "spurious correlation". Rather, it seems to be "undesirable correlation".

      • TeMPOraL 2 days ago

        > Rather, it seems to be "undesirable correlation".

        More specifically, a politically undesirable correlation - as in, "it's there, but its existence upsets some people". It's pretty obvious and self-evident that there are meaningful biological differences related to age, sex, and other demographics. Whether or not they're clinically relevant for the specific diagnosis in question is one thing, but they are clinically relevant for a great many diagnoses; trying to "de-bias" reality here will only lead to unnecessary suffering and loss of life.

        • knallfrosch 2 days ago

          It seems that the models not only use bone structure (or similar) itself, but also improve their predictions with "forbidden" knowledge, such as "green people have an overall lower or higher rate of bone cancer than others" or "people who come to this specialized hospital have bone cancer anyway, so I don't even look at the image".

          Now you can say that this is perfectly fine and represents the most likely real-world use case. Or you might prefer a model that looks at the image only, with the implicit assumption that this "forbidden knowledge" will be added by human doctors later on in the pipeline. This is beneficial because the "forbidden knowledge", such as whether patients from Hospital A always have bone cancer, might change overnight! Imagine the hospital gets assigned a new name in the system and the prediction is shit now.

          This second, "unbiased" AI system will always have a worse performance, because you lobotomize it when you kill the forbidden knowledge with a sledgehammer. This study just showed that "group fairness" is at odds with optimal predictions" and how much it is at odds.

          PS: You might even prefer a society where everyone is worse off, but every protected group is equally badly off. You'd also ban the humans from applying the forbidden knowledge. Whether that is desirable is, of course, out of the scope of the paper.

          • carbocation 2 days ago

            > Or you might prefer a model that looks at the image only

            These models are only looking at the images. They are inferring demographics.

        • RandomLensman 2 days ago

          Yes, these things can be a factor for a specific diagnosis, but why use AI if it (just) goes back to conditional probabilities based on groups instead of making a true individual diagnosis? You want each diagnosis to be correct and not just a good average.

          • TeMPOraL 2 days ago

            You always diagnose on conditional probabilities. The diagnosis is conditioned on your belief in the occurrence of symptoms, which is conditioned on the observations and the results of the tests you run. In the ideal case, you observe well enough to make a definite diagnosis; in the real case, there's always some uncertainty, plus you can't do all the tests simultaneously - which is where all those proxy factors like demographics are useful: they help prioritize tests and narrow down the diagnosis quicker.

            • RandomLensman 2 days ago

              You don't always diagnose on conditional probabilities (a simple example is looking at an x-ray of a broken bone - no priors needed to spot the broken bone).

              Your knowledge guides, but it also doesn't (or shouldn't) blind you.

      • pbhjpbhj 2 days ago

        Do the sorts of diseases ML is being used to detect have a flat incidence profile over age? Even if they do, ruling out other, age-dependent diagnoses would still mean the ML models acquire a measure of patient age.

        • carbocation 2 days ago

          Almost every noncommunicable disease of adulthood becomes more common with age. (The diagnoses in this paper were things like "cardiomegaly" which are mostly just X-ray findings and, while they have ICD codes, are not a meaningful diagnosis that a practicing physician would care about.)

zaptrem 3 days ago

> The researchers also found that they could retrain the models in a way that improves their fairness. However, their approach to "debiasing" worked best when the models were tested on the same types of patients on whom they were trained, such as patients from the same hospital. When these models were applied to patients from different hospitals, the fairness gaps reappeared.

> "I think the main takeaways are, first, you should thoroughly evaluate any external models on your own data because any fairness guarantees that model developers provide on their training data may not transfer to your population. Second, whenever sufficient data is available, you should train models on your own data," says Haoran Zhang, an MIT graduate student and one of the lead authors of the new paper.

This is just overfitting. Why are they training whole models on only one hospital's worth of data when they appear to have access to five? They should be training on all of the data in the world they can get their hands on, then maybe fine-tuning on their specific hospital (maybe it has higher-quality outcomes data that verifies the readings) if there are still accuracy issues. The last five years have taught us that gobbling up everything (even if it's not the best quality) is the way.
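
A minimal sketch of that "gobble everything, then fine-tune locally" recipe (hypothetical, with a stand-in ImageNet backbone rather than whatever the authors actually trained):

```
# Freeze a backbone trained on pooled data and retrain only the classification
# head on the local hospital's (smaller, higher-quality) dataset.
import torch
import torchvision

model = torchvision.models.resnet50(weights="IMAGENET1K_V2")  # stand-in backbone
for p in model.parameters():
    p.requires_grad = False                           # keep the pooled-data features
model.fc = torch.nn.Linear(model.fc.in_features, 2)  # new head for local labels

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-4)
loss_fn = torch.nn.CrossEntropyLoss()
# ...then train for a few epochs on the local hospital's images and labels only.
```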

  • hdhshdhshdjd 3 days ago

    It could be an instrumentation issue; different hospitals use different machines.

    • zaptrem 3 days ago

      Likely this and radiologist practices and storage methods and patient demographics and a bunch of other things. However, if they're training these models with the intention of using them in other hospitals I'd still define that as overfitting.

kosh2 2 days ago

I agree with many posters in here that the cause will likely be bad data one way or another. Maybe we need to take a step back and only use data that are almost 100% accurate.

Like the time of death after the data was collected.

If a model could predict with high accuracy that a patient will die within X days (without proper treatment), that would already be very valuable.

Second, as Sora has shown, going multimodal can have amazing benefits.

Get a breath analysis of the patient, get a video, get a sound recording, get an MRI, get a CT, get a full blood sample, and then let the model do its pattern-finding magic.

throwaway22032 2 days ago

The language used in the article seems political/inflammatory to me.

It's not a lack of "fairness", it's just a lack of accuracy.

Imagine that you train a model to find roofing issues or subsidence or something from aerial imagery. Maybe it performs better on Victorian terraces, because there are lots of those in the UK.

Would you call it unfair because it doesn't do so well on thatched-roof properties? No, it's just inaccurate; calling it unfair is a value judgement.

"Bias" is the better term because it at least has a statistical basis, but "fairness" is, well... inaccurate...

  • KingOfCoders 2 days ago

    I think how scientists - e.g. mathematicians - use 'fairness' and how the general public uses 'fairness' are two different things.

    On top of that, fairness as the general public understands it is ill-defined as well. I once wrote on the many facets of this just concerning fair salaries here: https://www.amazingcto.com/fair-remote-developer-salaries/

  • CoastalCoder 2 days ago

    This was my general take as well.

    Although now that I think of it, there could be a "fairness" angle regarding how many training data are obtained from each population group.

    But if that's what the author meant by fairness, it wasn't clear.

wiradikusuma 3 days ago

Instead of trying to make the model "fair", can we do "model = models.getByRace(x)" so we have an optimized model for each group, instead of one "jack of all trades"?

  • teruakohatu 3 days ago

    This paper covers many demographics, not just ethnicity. So you would really be doing "model = models.get(a, b, x, y, z)". It may be possible, but with medical data in such short supply, you would be lucky to get a large enough dataset for each combination.

    You would also need to assume there is no overlap between the distributions of the training data in each stratum. I would imagine that if you taught someone to detect cancer in x-rays, and only used images of people aged 80+, you might still have some success (>0.5) at detecting cancer in people aged 20-30.

    • sdenton4 3 days ago

      This seems like a place where generative augmentation could help - generate an image of a person with demographic values x, y, z and an early-stage lung cancer. Then you can generate images for more corners of the data, more evenly, and train the classifier on less biased data.
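
      One rough way to operationalize that is a coverage-driven synthesis plan; a sketch below, where generate_image is a hypothetical stand-in for whatever conditional generator you would actually use:

      ```
      # Hypothetical sketch: count real images per (demographics, label) cell and
      # top up the rare cells with synthetic ones from a conditional generator.
      from collections import Counter
      from itertools import product

      def synthesis_plan(metadata, demographic_values, labels, target_per_cell=500):
          """metadata: iterable of (demographics_tuple, label) for the real dataset."""
          counts = Counter(metadata)
          # Enumerate every cell, including ones with zero real examples.
          return {
              (demo, lab): max(0, target_per_cell - counts.get((demo, lab), 0))
              for demo, lab in product(demographic_values, labels)
          }

      # for cell, n_needed in synthesis_plan(meta, demos, labels).items():
      #     synthetic += [generate_image(*cell) for _ in range(n_needed)]  # stand-in
      ```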

      • teruakohatu 3 days ago

        If your model already understands the distribution of the data (fundamentally, this is what statistical models are trying to discover) well enough to generate realistic images, then it can probably do classification well enough too.

        The greater issue is that, for privacy and other reasons, there are many demographics that are not represented at all, or are underrepresented, in medical image datasets.

  • acheong08 2 days ago

    Race doesn't tell you much about a patient's possible afflictions. While there may be some predispositions, there is also a lot more nuance than the color of their skin. Environmental factors play a much larger role (e.g. "Chinese" Malaysians are much more likely to have joint issues due to high humidity and overuse of ACs).

    • nradov 2 days ago

      Since when do humidity and AC cause joint issues?

  • energy123 2 days ago

    You can just do one big model with static enrichment (effectively throw "race" into the model as a feature) and achieve the same thing. But the quality will always be worse for some sub-populations due to different amounts of available data. There's no easy way to fix that.
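
    A toy illustration of that "static enrichment" setup, with random numbers standing in for image embeddings (my own sketch, not from the paper):

    ```
    # One model, demographic passed as an explicit extra feature column.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n = 1000
    image_features = rng.normal(size=(n, 16))      # stand-in for image embeddings
    group = rng.integers(0, 2, size=n)             # 0/1 demographic indicator
    # Synthetic labels whose base rate differs by group:
    logit = image_features[:, 0] + 0.8 * group
    y = (logit + rng.normal(scale=0.5, size=n) > 0.5).astype(int)

    X = np.column_stack([image_features, group])   # demographic as an extra column
    clf = LogisticRegression(max_iter=1000).fit(X, y)

    # Per-group accuracy can still differ if one group has less (or noisier) data.
    for g in (0, 1):
        mask = group == g
        print(g, clf.score(X[mask], y[mask]))
    ```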

  • AnthonyMouse 2 days ago

    It seems like the problem is the entire premise.

    Suppose that older people are more likely to get cancer, so older people with cancer are more represented in the training data. Then we discover that it has a 20% false negative rate for older people but a 35% false negative rate for younger people, i.e. it has a better handle on what cancer looks like in older people. This is fairly intrinsic, the only real way to fix it is to provide it with more samples of images of younger people with cancer, which may not be available.

    But you can also "fix" it by raising the false negative rate for older people to 35%. Doing this is idiotic and should not be done, but it allows people to claim that the model is now "fair", and so they will have the incentive to do things like that as long as we continue to demand "fairness" above other considerations.

    Moreover, you're likely to see similar effects with all kinds of other groups. Maybe farm workers are more likely to get skin cancer because they spend more time in the sun, and you could see the same kind of disparity in its ability to detect cancer in farm workers because the effects of doing farm work also have other long-term physical correlates that show up in the image. Probably nobody is studying this because farm workers are not a protected class, but it's just a thing that happens whenever you divide the population into distinct groups. That doesn't mean there is inherently anything to be done about it. It's a facet of what data is available.
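
    A toy numeric illustration of the levelling-down "fix" described above, reusing the 20% / 35% false-negative numbers (hypothetical data):

    ```
    # "Equalizing" false-negative rates by discarding detections in the
    # better-served group: the gap closes, but no one gets better care.
    import numpy as np

    rng = np.random.default_rng(0)

    def fnr(y_true, y_pred):
        pos = y_true == 1
        return float(np.mean(y_pred[pos] == 0))

    # All 1000 patients in each group actually have cancer; the model misses
    # 20% of them in group A and 35% in group B.
    y = np.ones(1000, dtype=int)
    pred_a = (rng.random(1000) > 0.20).astype(int)
    pred_b = (rng.random(1000) > 0.35).astype(int)
    print(fnr(y, pred_a), fnr(y, pred_b))          # ~0.20, ~0.35

    # Drop ~18.75% of group A's detections: 0.80 * 0.1875 ≈ 0.15, so its miss
    # rate rises from 0.20 to ~0.35, matching group B.
    drop = rng.random(1000) < 0.1875
    pred_a_eq = np.where((pred_a == 1) & drop, 0, pred_a)
    print(fnr(y, pred_a_eq))                        # ~0.35, nobody is better off
    ```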

    • CoastalCoder 2 days ago

      > Probably nobody is studying this because farm workers are not a protected class

      FYI a "protected class" is an attribute category, not an attribute value. In your example it would be something like "career" or "employment type", rather than "farmer".

  • worthless-trash 3 days ago

    Is race really the biggest differentiator in model bias? I would have imagined that environmental factors would have had the largest impact on most models' bias.

    I guess that these factors are harder to prove though.

  • tsimionescu 2 days ago

    That's not how this works. If the model does something like "this image is of race X, race X has more lung cancers, therefore this image is likely lung cancer", then it's not helpful to anyone. You need the model to evaluate based on the image itself, not based on correlations that it can infer from the image - fairness is just a happy byproduct of that.

    • chaorace 2 days ago

      I think what the parent comment meant was that you could force the model to divert its attention elsewhere if you removed race as a variable by making the training data uniform in terms of race. I think it's a smart thought, though I doubt it'd work due to the fuzziness of "race" as a construct. Even if you grouped people using some combination of their self-classified and/or observed racial identity, the model would probably start identifying (and thus start cheating using) even subtler "sub-racial" biomarkers.

      If you ask me, it's probably more effective to compensate for the model's learned racial bias using weights derived from the model outputs via statistical analysis.
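
      One concrete form of that kind of post-hoc compensation is per-group decision thresholds chosen on a validation set; a hedged sketch (my own, not the paper's method):

      ```
      # Instead of retraining, pick a score threshold per group so that each
      # group's sensitivity (true-positive rate) on validation data matches a target.
      import numpy as np

      def per_group_thresholds(scores, y_true, groups, target_sensitivity=0.90):
          thresholds = {}
          for g in np.unique(groups):
              pos_scores = scores[(groups == g) & (y_true == 1)]
              # The (1 - target) quantile of positive scores recalls ~target of them.
              thresholds[g] = np.quantile(pos_scores, 1 - target_sensitivity)
          return thresholds

      def predict(scores, groups, thresholds):
          return np.array([int(s >= thresholds[g]) for s, g in zip(scores, groups)])
      ```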

adammarples 2 days ago

Is it just me or did the article at no point explain why medical models produce biased results? It seems to take this as a given, and an uninteresting one at that, and focuses on trying to correct it without explaining why it happens in the first place. Yes, those models could use race as a shortcut to, presumably, not diagnose cancer in black people, for example, but why is the race shortcut boosting model training accuracy? I am still none the wiser.