ER-Reason: A Benchmark Dataset for LLM-Based Clinical Reasoning in the Emergency Room

What is ER-Reason?

ER-Reason is a large-scale benchmark suite for evaluating the clinical reasoning capabilities of large language models (LLMs) in the emergency room (ER), a high-stakes environment where clinicians make rapid, life-critical decisions.

ER-Reason moves beyond multiple-choice, exam-style QA and tests LLMs on realistic, multi-stage clinical workflows grounded in real-world electronic health records (EHRs). It simulates the full ER decision-making pipeline, including triage, treatment selection, and final diagnosis, and includes expert-written rationales that capture the step-by-step thinking physicians use in real clinical settings.

For detailed column descriptions of the dataset, see ER-Reason Dataset Column Descriptions (ER-Reason_Column_Descriptions.md in the repository root).

✅ Key Features

📚 Real-World Clinical Data
Includes 3,984 patients and 25,174 de-identified longitudinal clinical notes from an academic medical center. Document types include:
- Discharge summaries
- Progress notes
- History & Physical (H&P)
- Consult notes
- Imaging reports
- Echocardiography reports
- ER provider notes

🧠 Clinical Reasoning Annotations
Includes 72 physician-authored rationales that explain the reasoning behind clinical decisions. They are modeled after residency-level teaching and capture reasoning not typically found in EHR documentation.

⚕️ Workflow-Aligned Clinical Tasks
Tasks are structured around the actual ER care process:
- Triage Intake
- EHR Review
- Initial Assessment
- Treatment Planning
- Disposition Planning (Admit, Discharge, ICU)
- Final Diagnosis

🤖 Model Compatibility
ER-Reason includes code and templates for evaluating:
- LLaMA 3.2-3B-Instruct
- GPT-4o
- GPT-3.5 Turbo
- O3-Mini

🧪 LLM Evaluation Tasks
The benchmark includes structured tasks and metrics to evaluate:
- Acuity classification: determining patient urgency based on symptoms and clinical history
- EHR summarization: summarizing key aspects of a patient's clinical history
- Rationale generation: creating the reasoning behind clinical decisions, aligned with physician thinking
- Diagnosis inference: inferring the most likely diagnosis based on EHR data and symptoms
- Disposition prediction: predicting whether a patient should be admitted, discharged, or sent to the ICU
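
To make the classification-style tasks above concrete, here is a minimal sketch of how disposition predictions could be scored against reference labels. It is not the repository's evaluation code: the function names `normalize` and `disposition_accuracy` are hypothetical, and the label strings are assumed from the Admit / Discharge / ICU options listed under Disposition Planning, so they may not match the dataset's actual values.

```python
from collections import Counter

# Labels assumed from the "Disposition Planning (Admit, Discharge, ICU)" task;
# the exact strings stored in the dataset may differ.
DISPOSITIONS = ("admit", "discharge", "icu")


def normalize(label: str) -> str:
    """Lowercase and strip a label so that 'ICU ' matches 'icu'."""
    return label.strip().lower()


def disposition_accuracy(predictions, references):
    """Return (overall accuracy, per-label recall) for disposition labels.

    Hypothetical helper for illustration only.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    correct = Counter()  # correct predictions, keyed by reference label
    total = Counter()    # number of cases, keyed by reference label
    for pred, ref in zip(predictions, references):
        ref_norm = normalize(ref)
        total[ref_norm] += 1
        if normalize(pred) == ref_norm:
            correct[ref_norm] += 1
    n = sum(total.values())
    accuracy = sum(correct.values()) / n if n else 0.0
    # Report recall only for labels that appear in the references.
    recall = {lab: correct[lab] / total[lab] for lab in DISPOSITIONS if total[lab]}
    return accuracy, recall
```

Reporting per-label recall next to overall accuracy is a design choice here: if one disposition is much rarer than the others, overall accuracy alone can hide poor performance on it.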

🚀 Getting Started

Step 1: Clone the ER-Reason repository

git clone https://github.com/AlaaLab/ER-Reason.git

Step 2: Install dependencies

pip install -r <path-to-repo>/ER-Reason/requirements.txt

Step 3: Download the dataset

Download the ER-Reason dataset from PhysioNet:
https://physionet.org/projects/JGAP8qn2p4CPnPeXceVE/overview/

Note: Access requires registration and data use agreement approval.
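
Once access is approved and the files are downloaded, a first step is usually to group notes by patient. The sketch below assumes the notes are available as a CSV file; the file format, the function name `load_notes_by_patient`, and the column names `patient_id` and `note_text` are all hypothetical placeholders, so check ER-Reason_Column_Descriptions.md for the real column names before using it.

```python
import csv
from collections import defaultdict


def load_notes_by_patient(csv_path, patient_col="patient_id", text_col="note_text"):
    """Group note texts by patient from a CSV export.

    Both column names are hypothetical; replace them with the names
    documented in ER-Reason_Column_Descriptions.md.
    """
    notes = defaultdict(list)
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            notes[row[patient_col]].append(row[text_col])
    return dict(notes)
```

The function returns a plain dictionary mapping each patient identifier to a list of that patient's note texts, in file order.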
