Hugging Face releases a benchmark for testing generative AI on health tasks

Kyle Wiggers

18 April 2024 at 6:07 pm

Generative AI models are increasingly being brought to healthcare settings — in some cases prematurely, perhaps. Early adopters believe that they'll unlock increased efficiency while revealing insights that'd otherwise be missed. Critics, meanwhile, point out that these models have flaws and biases that could contribute to worse health outcomes.

But is there a quantitative way to know how helpful, or harmful, a model might be when tasked with things like summarizing patient records or answering health-related questions?

Hugging Face, the AI startup, proposes a solution in a newly released benchmark test called Open Medical-LLM. Created in partnership with researchers at the nonprofit Open Life Science AI and the University of Edinburgh's Natural Language Processing Group, Open Medical-LLM aims to standardize how the performance of generative AI models is evaluated across a range of medical tasks.

New: Open Medical LLM Leaderboard! 🩺
In basic chatbots, errors are annoyances.
In medical LLMs, errors can have life-threatening consequences 🩸
It's therefore vital to benchmark/follow advances in medical LLMs before thinking about deployment.
Blog: https://t.co/pddLtkmhsz
— Clémentine Fourrier 🍊 (@clefourrier) April 18, 2024

Open Medical-LLM isn't a from-scratch benchmark, per se, but rather a stitching-together of existing test sets (MedQA, PubMedQA, MedMCQA and so on) designed to probe models for general medical knowledge and related fields, such as anatomy, pharmacology, genetics and clinical practice. The benchmark contains multiple-choice and open-ended questions that require medical reasoning and understanding, drawing from material including U.S. and Indian medical licensing exams and college biology test question banks.
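To make the idea of a stitched-together benchmark concrete, here is a minimal sketch of how multiple-choice results from several test sets could be rolled up into one score. It is an illustration only, not Hugging Face's evaluation code: the function name, the dataset labels `set_a` and `set_b`, the toy questions and the unweighted averaging are all assumptions made for this example.

```python
from collections import defaultdict


def score_benchmark(items, predict):
    """Return (accuracy per dataset, unweighted mean of those accuracies)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        total[item["dataset"]] += 1
        # A prediction counts only if it matches the answer key exactly.
        if predict(item["question"], item["options"]) == item["answer"]:
            correct[item["dataset"]] += 1
    per_dataset = {name: correct[name] / total[name] for name in total}
    macro = sum(per_dataset.values()) / len(per_dataset) if per_dataset else 0.0
    return per_dataset, macro


# Toy items written for this sketch; they are not taken from any real test set.
ITEMS = [
    {"dataset": "set_a", "question": "Which vitamin deficiency causes scurvy?",
     "options": {"A": "Vitamin C", "B": "Vitamin D"}, "answer": "A"},
    {"dataset": "set_a", "question": "Which organ produces insulin?",
     "options": {"A": "Liver", "B": "Pancreas"}, "answer": "B"},
    {"dataset": "set_b", "question": "Which heart chamber pumps blood to the body?",
     "options": {"A": "Left ventricle", "B": "Right atrium"}, "answer": "A"},
]


def always_first(question, options):
    """Hypothetical stand-in for a model: always picks the first option label."""
    return sorted(options)[0]


per_dataset, macro = score_benchmark(ITEMS, always_first)
print(per_dataset, macro)
```

Averaging per test set rather than per question keeps a large question bank from drowning out a small one. Whether the real leaderboard weights its component sets this way is not stated in the article.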

"[Open Medical-LLM] enables researchers and practitioners to identify the strengths and weaknesses of different approaches, drive further advancements in the field and ultimately contribute to better patient care and outcome," Hugging Face wrote in a blog post.


Hugging Face is positioning the benchmark as a "robust assessment" of healthcare-bound generative AI models. But some medical experts on social media cautioned against putting too much stock into Open Medical-LLM, lest it lead to ill-informed deployments.

On X, Liam McCoy, a resident physician in neurology at the University of Alberta, pointed out that the gap between the "contrived environment" of medical question-answering and actual clinical practice can be quite large.

It is great progress to see these comparisons head-to-head, but important for us to also remember how big the gap is between the contrived environment of medical question answering and actual clinical practice! Not to mention the idiosyncratic risks these metrics can't capture.
— Liam McCoy, MD MSc (@LiamGMcCoy) April 18, 2024

Hugging Face research scientist Clémentine Fourrier, who co-authored the blog post, agreed.

"These leaderboards should only be used as a first approximation of which [generative AI model] to explore for a given use case, but then a deeper phase of testing is always needed to examine the model's limits and relevance in real conditions," Fourrier replied on X. "Medical [models] should absolutely not be used on their own by patients, but instead should be trained to become support tools for MDs."

It brings to mind Google's experience when it tried to bring an AI screening tool for diabetic retinopathy to healthcare systems in Thailand.

Google created a deep learning system that scanned images of the eye, looking for evidence of retinopathy, a leading cause of vision loss. But despite high theoretical accuracy, the tool proved impractical in real-world testing, frustrating both patients and nurses with inconsistent results and a general lack of harmony with on-the-ground practices.


It's telling that of the 139 AI-related medical devices the U.S. Food and Drug Administration has approved to date, none use generative AI. It's exceptionally difficult to test how a generative AI tool's performance in the lab will translate to hospitals and outpatient clinics, and, perhaps more importantly, how the outcomes might trend over time.

That's not to suggest Open Medical-LLM isn't useful or informative. The results leaderboard, if nothing else, serves as a reminder of just how poorly models answer basic health questions. But neither Open Medical-LLM nor any other benchmark, for that matter, is a substitute for carefully thought-out real-world testing.
