EPIAIDEA is a digital epidemiology and population analytics platform where epidemiologic rigor meets AI-scale methods — translating messy, real-world data into evidence that holds up under scrutiny and guides decisions that affect real populations.
EPIAIDEA — Epidemiology, AI, Data, Evidence, Action — is a framework for doing population health science in an era of abundant, imperfect data. The platform integrates traditional epidemiologic theory with modern AI-enabled methods to extract valid, decision-grade evidence from sources that conventional analysis cannot handle at scale.
The core premise is simple but demanding: AI applied to health data is only as good as the epidemiologic thinking behind it. Prediction accuracy means nothing if the model is trained on biased data, applied to a different population, or used to answer a question it was not designed for. Every analytic decision here begins with the question — not the algorithm — and works backward to the data and method that give the most defensible answer.
This work operates across geographies and health domains — from overdose prevention in rural North Dakota to heat resilience infrastructure in the Southwest US, from GLP-1 misinformation surveillance to fibroid care access mapping in California. What ties it together is a consistent commitment to causal reasoning, equity-conscious design, and outputs that are deployable by the institutions making health decisions.
EPIAIDEA is directed by Akshaya Bhagavathula, Professor of Epidemiology, promoted to full professor at age 39. The platform reflects a sustained body of methodologic and applied work across digital epidemiology, pharmacovigilance, geospatial AI, legal epidemiology, and global burden of disease research.
"The gap between knowing and acting in public health is not a data problem. It is an inference problem."
Health systems are drowning in data but starved for evidence. Electronic health records, insurance claims, syndromic surveillance feeds, digital search behavior — all of this exists, but almost none of it is structured in a way that supports valid causal inference without deliberate epidemiologic curation. EPIAIDEA exists to close that gap. Not by generating more data, but by extracting more signal from the data that already exists — and translating it into the kind of evidence that earns a seat at the policy table.
Each letter in EPIAIDEA names a pillar of the analytic framework. Together they define what it means to do this work with both scientific integrity and real-world impact.
Every analysis is structured around epidemiologic principles: confounding control, bias identification, temporality, and population heterogeneity. The question of whether an association is causal — or merely predictive, or artifactual — shapes every methodologic decision from the outset.
Scalable ML and NLP pipelines built with epidemiologic awareness from the ground up — not bolted on afterward. Models are interrogated for confounding, calibrated against external validation sets, and structured to be interpretable to the policy and clinical audiences who act on their outputs.
The platform draws from a deliberately broad data substrate — Google Search trends, social platform signals, EHR records, insurance claims, Medicare/Medicaid administrative data, and syndromic surveillance feeds — each validated for the specific inference task at hand before any modeling begins.
Health disparities are not discovered at the end of an analysis — they are built into or out of the design from the beginning. Every platform interrogates differential impact by race, geography, income, and structural access before a finding is considered complete or ready for dissemination.
Inference-ready outputs designed for the institutions that make health decisions — not just journal reviewers. This means dashboards that update automatically, maps that communicate access gaps without requiring statistical literacy, and summaries formatted for Medicaid directors, county health officers, and legislative staff.
Where something happens is often as important as what happens. County- and census tract-level spatial analysis surfaces access gaps, structural disparities, and environmental exposures that aggregate national statistics obscure. Geographic precision is what makes an analysis actionable for a local health department with a specific budget question.
The analytic pipeline is structured to preserve validity at every stage — from raw data ingestion through to the final evidence product that reaches a decision-maker.
Real-time tracking of health metrics, payment parity indices, and spatial-temporal epidemiology at the state and county level. Designed for health departments and Medicaid agencies that need continuous situational awareness — not quarterly reports that are obsolete before they are distributed.
County-level structural access mapping using geospatial distance metrics, service listing data, and population demand signals. The goal is to make the invisible visible — identifying where care need exists but care supply does not, and quantifying the magnitude of the gap in terms decision-makers can act on.
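The distance component of this mapping can be sketched with a straight-line (haversine) great-circle calculation from a county centroid to the nearest listed facility. The function names and coordinates below are illustrative, not the production pipeline:

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in miles."""
    r = 3958.8  # mean Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def nearest_facility_miles(centroid, facilities):
    """Distance from a county centroid to the closest facility in a service listing."""
    return min(haversine_miles(*centroid, *f) for f in facilities)
```

Straight-line distance is a lower bound on the real access burden; in deployed analyses, road-network travel time is the more honest metric where routing data are available.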
Matching investment to need by identifying where digital demand signals and structural care access are most mismatched. Evidence that answers the question every health system planner is actually asking: where should the next dollar go to produce the greatest reduction in preventable harm?
These are not aspirational values — they are operational constraints that determine what gets built, how it gets validated, and what it takes for a finding to be called evidence.
Every project begins with a population health question stated in epidemiologic terms — not with an interesting dataset looking for a use case. The question defines data requirements, method selection, and the validity criteria an output must meet before it can be called evidence.
Digital and administrative data sources are not random samples of any population. Selection bias, measurement error, and differential missingness are the norm, not the exception. Every analysis begins by enumerating the threats to validity — and either addressing them or explicitly acknowledging what the analysis cannot claim.
A model that performs well on a held-out test set but cannot explain its predictions to a county health officer is not ready for deployment. Every output is built to be interrogated — which features matter, which populations drive the prediction, and what the confidence intervals actually mean for decision-making under uncertainty.
A finding that exists only in a PDF is not translational research — it is academic output with translational aspirations. I design for deployment from the start: live data connections, automated updates, and formats that reach the audiences who act on evidence rather than just the ones who evaluate it for publication.
Pharmacoepidemiology uses population-level data to study how medicines perform — and where they harm — outside the controlled conditions of a clinical trial. It is one of the most methodologically demanding areas in epidemiology because the data are almost always confounded by indication, and the signals are almost always rare.
Post-market drug safety is a surveillance problem that clinical trials are structurally incapable of solving. Trials are too small, too short, and too selective to detect adverse events that occur in 1 in 1,000 patients, emerge after years of exposure, or concentrate in subpopulations excluded from enrollment. Pharmacovigilance begins where the trial ends — drawing on spontaneous reporting databases, electronic health records, insurance claims, and increasingly, digital patient-reported data.
My work in this domain focuses on disproportionality analysis of spontaneous adverse event reports — specifically the FDA Adverse Event Reporting System (FAERS) — alongside causal modeling frameworks that go beyond association to ask whether a signal is real, how strong it is relative to comparator drugs, and what the plausible mechanism is. The GLP-1 receptor agonist class has been a central focus: a drug class that went from niche diabetes therapy to one of the most prescribed medication classes in history within five years, with a post-market safety profile that is still being characterized.
Disproportionality analysis without causal thinking produces noise. The Reporting Odds Ratio and Proportional Reporting Ratio are screening tools, not evidence — they tell you where to look, not what you have found. My analyses pair signal detection with structured evaluation of confounding by indication, Weber effect bias, notoriety bias, and the plausibility of the proposed mechanism before a signal is characterized as a concern worth communicating.
Calculating Reporting Odds Ratios and Proportional Reporting Ratios across drug–event pairs in the FDA Adverse Event Reporting System. Adjusted for concomitant medications, reporter type, and temporal reporting patterns to reduce Weber effect inflation and notoriety bias.
Applying directed acyclic graphs to the pharmacovigilance setting — specifying confounders, mediators, and colliders in spontaneous reporting data where confounding by indication is nearly universal and cannot be ignored.
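The backdoor logic can be illustrated on a toy three-node DAG for confounding by indication. The graph and helper functions below are hypothetical simplifications (the backdoor flag here ignores collider-closed paths, which real d-separation checks must handle):

```python
def undirected_paths(edges, start, end):
    """All simple paths between two nodes, ignoring edge direction."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    found = []
    def walk(node, path):
        if node == end:
            found.append(path)
            return
        for nxt in sorted(adj.get(node, ())):
            if nxt not in path:
                walk(nxt, path + [nxt])
    walk(start, [start])
    return found

def is_backdoor(path, edges):
    """A backdoor path starts with an arrow pointing INTO the exposure."""
    return (path[1], path[0]) in edges

# Hypothetical DAG: indication drives both prescribing and the event
edges = {("Indication", "Drug"), ("Indication", "Event"), ("Drug", "Event")}
paths = undirected_paths(edges, "Drug", "Event")
backdoors = [p for p in paths if is_backdoor(p, edges)]
# Drug <- Indication -> Event is the open backdoor path;
# adjusting for Indication blocks it.
```

In spontaneous reporting data the "Indication" node is usually unmeasured or crudely coded, which is exactly why a drawn DAG — making the required adjustment set explicit — matters before any disproportionality statistic is interpreted causally.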
Using Medicare and commercial insurance claims to conduct active drug safety surveillance — cohort studies with new-user designs, active comparator selection, and high-dimensional propensity score adjustment to address channeling bias.
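High-dimensional propensity scoring involves automated covariate prioritization across thousands of claims codes; as a much-reduced sketch of the weighting step only, here is inverse-probability-of-treatment weighting on a single simulated confounder (all data synthetic and illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic channeling: disease severity drives both who gets the drug
# and who has the event. Coefficients are invented for illustration.
rng = np.random.default_rng(42)
n = 20_000
severity = rng.normal(size=n)
treated = rng.random(n) < 1 / (1 + np.exp(-1.5 * severity))
p_event = 1 / (1 + np.exp(-(0.4 * treated + 1.0 * severity - 2.0)))
event = rng.random(n) < p_event

# Fit the propensity model and form IPTW weights
X = severity.reshape(-1, 1)
ps = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]
w = np.where(treated, 1 / ps, 1 / (1 - ps))

# Naive vs weighted (confounding-adjusted) risk difference
naive = event[treated].mean() - event[~treated].mean()
adjusted = (np.average(event[treated], weights=w[treated])
            - np.average(event[~treated], weights=w[~treated]))
```

Because sicker patients are channeled to the drug, the naive contrast overstates the drug's risk; the weighted contrast moves toward the true (smaller, still positive) effect. Real hdPS pipelines do this across hundreds of empirically selected claims covariates, with trimming and diagnostics on weight distributions.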
Translating pharmacovigilance findings into structured safety communications for clinical and regulatory audiences — framing absolute risk, number needed to harm, and clinical context rather than reporting raw disproportionality statistics.
A systematic disproportionality analysis of GI adverse event reports for semaglutide, liraglutide, tirzepatide, and dulaglutide in FAERS — covering nausea, vomiting, gastroparesis, ileus, and aspiration-related events. RORs were calculated with 95% confidence intervals and adjusted for concomitant use of other GI-active agents. The analysis characterized the differential signal strength across the GLP-1 class and identified gastroparesis as a disproportionately underreported signal relative to its clinical documentation rate in EHR data — suggesting systematic underreporting in spontaneous surveillance.
Spontaneous reporting databases are not designed for causal inference — they are designed for signal detection. This work applies DAG-based causal reasoning to FAERS analyses to distinguish genuine drug–event associations from artifacts of reporting behavior, notoriety bias, and indication-driven confounding. The structured approach identifies which signals from disproportionality analysis can be elevated to the level of a probable causal relationship and which require active surveillance in claims or EHR data before any clinical communication is warranted.
The GLP-1 safety discourse in public-facing media has diverged significantly from the regulatory and pharmacovigilance evidence base — amplifying rare signals while underreporting common GI tolerability issues that affect treatment adherence. Using the ClaimReview API and structured misinformation surveillance, I mapped the claim typology, source patterns, and temporal velocity of GLP-1 misinformation — particularly around muscle loss, hair thinning, and cardiac risk — providing a baseline for risk communication intervention design.
Browse active implementations and research outputs, or reach out to discuss collaboration, consulting, or speaking.