Interrater agreement between SCORE-AI and human experts was notable, with almost perfect agreement for generalized epileptiform abnormalities and substantial agreement for focal epileptiform discharges, among other findings.
A recently published diagnostic study showed that SCORE-AI, an artificial intelligence (AI) model trained to interpret routine clinical electroencephalograms (EEGs), achieved diagnostic accuracy similar to that of human experts and may have utility in improving efficiency and consistency in specialized epilepsy centers.1
The characteristics of the EEG data sets included a development data set (n = 30,493; median age, 25.3 years [95% CI, 1.3-76.2]), a multicenter test data set (n = 100; median age, 25.8 years [95% CI, 4.1-85.5]), a single-center test data set (n = 9785; median age, 35.4 years [95% CI, 0.6-87.4]), and a test data set with an external reference standard (n = 60; median age, 36 years [95% CI, 3-75]). In the study, published in JAMA Neurology, SCORE-AI achieved high accuracy, with an area under the receiver operating characteristic curve between 0.89 and 0.96 across the different categories of EEG abnormalities.
"Its application may help to provide useful clinical information in remote and underserved areas where expertise in EEG interpretation is minimal or unavailable," senior investigator Sandor Beniczky, MD, PhD, neurologist and clinical neurophysiologist at Aarhus University, and colleagues, concluded. "Importantly, it may also help reduce the potential for EEG misinterpretation and subsequent mistreatment, improve interrater agreement to optimize routine interpretation by neurologists, and increase efficiency by decompressing excessive workloads for human experts interpreting high volumes of EEGs."
In the development data set of SCORE-AI, the mean EEG duration was 33 minutes (95% CI, 20-77). The model was developed in Python with TensorFlow, using EEGs recorded between 2014 and 2020, with data analyzed throughout 2022. The multicenter test data set and the single-center test data set, which comprised EEGs from patients not included in the development phase, further validated the AI model.
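The article does not describe SCORE-AI's architecture, but a multi-label EEG classifier in TensorFlow, the framework the investigators used, might be organized along the following lines. The input shape, sampling rate, window length, layer choices, and category names below are illustrative assumptions, not details from the study.

```python
# A minimal sketch of a multi-label EEG classifier in TensorFlow/Keras.
# All architectural details here are assumptions; the published model's
# structure is not described in this article.
import tensorflow as tf

N_CHANNELS = 19        # assumed standard 10-20 montage
SAMPLE_RATE = 128      # assumed sampling rate, Hz
WINDOW_SECONDS = 10    # assumed analysis window
CATEGORIES = [         # abnormality categories of the kind the article names
    "focal_epileptiform",
    "generalized_epileptiform",
    "focal_nonepileptiform",
    "diffuse_nonepileptiform",
]

def build_model() -> tf.keras.Model:
    inputs = tf.keras.Input(shape=(SAMPLE_RATE * WINDOW_SECONDS, N_CHANNELS))
    x = inputs
    # Stacked 1D convolutions over time, shared across channels.
    for filters in (32, 64, 128):
        x = tf.keras.layers.Conv1D(filters, kernel_size=7, padding="same",
                                   activation="relu")(x)
        x = tf.keras.layers.MaxPooling1D(pool_size=4)(x)
    x = tf.keras.layers.GlobalAveragePooling1D()(x)
    # Sigmoid head: abnormality categories are not mutually exclusive,
    # so this is multi-label rather than softmax classification. A
    # recording with no category flagged would be read as normal.
    outputs = tf.keras.layers.Dense(len(CATEGORIES), activation="sigmoid")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC(multi_label=True,
                                                num_labels=len(CATEGORIES))])
    return model
```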
At the conclusion of the analysis, findings showed that recordings shorter than 20 minutes yielded a lower area under the curve (AUC), whereas for recordings longer than 20 minutes the AUC varied only slightly with duration. Investigators observed mean AUCs of 0.887 and 0.903 for durations of less than 20 minutes and of more than 20 minutes, respectively.
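For readers unfamiliar with the metric, an AUC such as the values above can be read as the probability that a randomly chosen abnormal recording receives a higher model score than a randomly chosen normal one. A minimal sketch, using toy labels and scores rather than study data:

```python
# Illustrates how an AUC like the reported 0.887 or 0.903 is computed.
# Labels and scores below are fabricated toy values, not study data.
from sklearn.metrics import roc_auc_score

labels = [0, 0, 0, 1, 1, 1, 0, 1]   # 1 = abnormal per the reference standard
scores = [0.1, 0.4, 0.2, 0.8, 0.7, 0.9, 0.6, 0.3]  # model probabilities

print(roc_auc_score(labels, scores))  # area under the ROC curve
```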
Among the 11 human experts rating the multicenter data set (n = 100), there was almost perfect agreement (Gwet AC1 = 0.9) on the presence of generalized epileptiform discharges, and substantial agreement (Gwet AC1, 0.63-0.72) on focal epileptiform discharges, diffuse nonepileptiform abnormalities, and recordings considered to be normal. Moderate interrater agreement (Gwet AC1, 0.50-0.59) was observed for the presence of focal nonepileptiform abnormalities.
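Gwet's AC1 is a chance-corrected agreement coefficient, similar in spirit to Cohen's kappa but more stable when a finding is very rare or very common. A minimal sketch of the two-rater, binary-finding case follows; the study's multirater computation generalizes this, and the ratings shown are toy values, not study data.

```python
# Gwet's AC1 for two raters scoring a binary finding (present/absent).
# The multirater form used in the study differs; this is illustrative only.

def gwet_ac1(rater_a: list[int], rater_b: list[int]) -> float:
    n = len(rater_a)
    # Observed agreement: fraction of recordings both raters scored the same.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement per Gwet: based on mean prevalence of "present".
    pi = (sum(rater_a) + sum(rater_b)) / (2 * n)
    p_e = 2 * pi * (1 - pi)
    return (p_o - p_e) / (1 - p_e)

# Toy example: two readers marking generalized epileptiform discharges.
print(gwet_ac1([1, 0, 0, 1, 0, 0], [1, 0, 0, 1, 0, 1]))  # ~0.68, substantial
```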
In the large external single-center data set (n = 9785), agreement between SCORE-AI and the clinical evaluation of the recordings was within the range of human expert interrater variability for identifying normal recordings (0.74) and recordings with generalized epileptiform abnormalities (0.95), and was significantly higher for the remaining categories (0.64-0.87).
"There is no other open-source or commercially available software package for comprehensive assessment of routine clinical EEGs," the study authors wrote. "Several AI-based models have been developed for detection of epileptiform activity on EEG. However, this aspect is only a fragment of the complete comprehensive EEG assessment. The other major limitation of the previously published models is the high number of false detections (0.73 per minute) precluding their clinical implementation."
The performance of SCORE-AI was compared with that of 3 previously published models (encevis, SpikeNet, Persyst) using data from a previous study consisting of 20-minute routine clinical EEGs containing sharp transients from 60 patients, of whom 30 had epilepsy and 30 had nonepileptic paroxysmal events. This data set had an external, independent reference standard at the recording level, derived from video-EEG recordings obtained during the patients' habitual clinical episodes.
Although the specificities of the 3 previously published models were too low for clinical implementation (3%-63%), SCORE-AI demonstrated substantially greater specificity (90%) and was more specific than the majority consensus of the 3 human experts (73.3%). The sensitivity of SCORE-AI (86.7%) was similar to that of the human experts (93.3%), higher than that of SpikeNet (66.7%), and lower than those of encevis (96.7%) and Persyst (100%). Altogether, the overall accuracy of SCORE-AI (88.3%; 95% CI, 79.2-94.9) was greater than that of the 3 previously published AI models and similar to that of the human experts (83.3%; 95% CI, 73-91.4).
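These percentages are mutually consistent with the 30/30 patient split. The counts below are back-calculated from the reported percentages rather than taken from the paper, so treat them as an illustrative assumption:

```python
# Reproduces the reported SCORE-AI metrics from the 60-patient data set
# (30 with epilepsy, 30 with nonepileptic events). The raw counts are
# inferred from the reported percentages, not stated in the article.
true_positives = 26    # epilepsy recordings flagged abnormal (86.7% of 30)
false_negatives = 4
true_negatives = 27    # nonepileptic recordings read as normal (90% of 30)
false_positives = 3

sensitivity = true_positives / (true_positives + false_negatives)
specificity = true_negatives / (true_negatives + false_positives)
accuracy = (true_positives + true_negatives) / 60

print(f"sensitivity {sensitivity:.1%}, specificity {specificity:.1%}, "
      f"accuracy {accuracy:.1%}")
# -> sensitivity 86.7%, specificity 90.0%, accuracy 88.3%
```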