Accuracy of claims-based algorithms for epilepsy research: Revealing the unseen performance of claims-based studies



To evaluate published algorithms for the identification of epilepsy cases in medical claims data using a unique linked dataset with both clinical and claims data.


Using data from a large, regional health delivery system, we identified all patients who contributed biologic samples to the health system’s Biobank (n ≈ 36,000). From these, we selected all subjects who, between 2014 and 2015, had at least one diagnosis potentially consistent with epilepsy (for example, epilepsy, convulsions, syncope, or collapse) or who were seen at the epilepsy clinic (n = 1,217), plus a random sample of subjects with neither claims nor clinic visits (n = 435). We then reviewed the medical charts of a random subsample of 1,377 subjects to determine epilepsy diagnosis status. Using the chart review as the reference standard, we evaluated the test characteristics of six published algorithms.
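The evaluation step can be illustrated with a minimal sketch that scores one algorithm's yes/no output against the chart-review reference standard. The function names, the unweighted pooling of subjects, and the Wilson score interval are assumptions made for illustration; the study's actual interval method and any sampling weights are not stated here.

```python
from math import sqrt


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a proportion (about 95% when z = 1.96)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half, centre + half


def evaluate_algorithm(algorithm_flags, chart_review):
    """Compare binary algorithm output with the reference standard.

    Both arguments are equal-length sequences of booleans, one per subject:
    algorithm_flags[i] is True if the claims algorithm labels subject i as
    an epilepsy case; chart_review[i] is True if chart review confirms it.
    """
    if len(algorithm_flags) != len(chart_review):
        raise ValueError("inputs must have the same length")
    pairs = list(zip(algorithm_flags, chart_review))
    tp = sum(1 for a, c in pairs if a and c)
    fn = sum(1 for a, c in pairs if not a and c)
    tn = sum(1 for a, c in pairs if not a and not c)
    fp = sum(1 for a, c in pairs if a and not c)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one true case and one true non-case")
    return {
        "sensitivity": tp / (tp + fn),
        "sensitivity_ci": wilson_interval(tp, tp + fn),
        "specificity": tn / (tn + fp),
        "specificity_ci": wilson_interval(tn, tn + fp),
    }


if __name__ == "__main__":
    # Toy data only; these values are not from the study.
    flags = [True, True, False, True, False, False, False, True]
    chart = [True, True, True, False, False, False, False, False]
    print(evaluate_algorithm(flags, chart))
```

Running the same function once per published algorithm, on the same chart-reviewed subsample, would yield directly comparable sensitivity and specificity estimates.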


The best-performing algorithm used diagnostic and prescription drug data (sensitivity = 70%, 95% confidence interval [CI] 66–73%; specificity = 77%, 95% CI 73–81%; and area under the curve [AUC] = 0.73, 95% CI 0.71–0.76) when applied to patients aged 18 years or older. Restricting the sample to adults aged 18–64 years resulted in a modest improvement in accuracy (AUC = 0.75, 95% CI 0.73–0.78). Adding information about current antiepileptic drug use to the algorithm increased test performance (AUC = 0.78, 95% CI 0.76–0.80). Other algorithms varied in their included data types and performed worse.
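As a consistency check on the figures above, the AUC of a dichotomous (yes/no) algorithm can be related to its sensitivity and specificity. This is a sketch under the assumption that the reported AUC refers to the binary algorithm output; the text does not say how it was computed. The ROC curve of a binary classifier is the polyline through $(0,0)$, $(1-\mathrm{Sp},\ \mathrm{Se})$, and $(1,1)$, so its area is the sum of two trapezoids:

```latex
\begin{align*}
\mathrm{AUC}
  &= \tfrac{1}{2}\,(1-\mathrm{Sp})\,\mathrm{Se}
   + \tfrac{1}{2}\,\mathrm{Sp}\,(\mathrm{Se}+1) \\
  &= \tfrac{1}{2}\,(\mathrm{Se}+\mathrm{Sp}), \\[4pt]
\text{with } \mathrm{Se}=0.70,\ \mathrm{Sp}=0.77:\quad
\mathrm{AUC} &= \tfrac{1}{2}\,(0.70+0.77) = 0.735 .
\end{align*}
```

The result of 0.735 agrees with the reported AUC of 0.73 to within the rounding of the published sensitivity and specificity.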


Current approaches for identifying patients with epilepsy in insurance claims have important limitations when applied to the general population. Approaches incorporating a range of information, for example, diagnoses, treatments, and site of care/specialty of physician, improve the performance of identification and could be useful in epilepsy studies using large datasets.