A collection of papers and resources across the four principal components of dialogue summarization: Challenges, Techniques, Datasets, and Metrics. The sub-folders contain our code for crawling the Semantic Scholar and DBLP databases. We intend to update this repository periodically to keep it relevant in the current era of LLMs. Stay tuned!
- Overview
- The CADS Taxonomy
- Methodology
- Challenges
- Techniques
- Datasets
- Evaluation Metrics
- Future Research Directions
- Citation
This repository accompanies our systematic literature review on abstractive dialogue summarization, published in the Journal of Artificial Intelligence Research (JAIR) 2025. Our work examines 1,262 papers published between 2019 and 2024, with 133 papers selected for comprehensive analysis based on strict inclusion criteria.
We introduce the CADS Taxonomy (Challenges of Abstractive Dialogue Summarization), a novel framework that categorizes the major challenges in dialogue summarization and maps them to corresponding techniques, datasets, and evaluation metrics. This taxonomy provides a unified understanding of the field's inherent complexity and highlights areas where significant progress has been made or where more research is needed.
Figure 1: Overview of the six challenges in dialogue summarization, including a brief description of each challenge and an estimate of progress on related sub-challenges. Green indicates mostly mitigated, orange indicates good progress, and red indicates that marked challenges remain.
Our analysis revealed six major challenges in dialogue summarization:
- Language: Handling the idiosyncratic nature of spoken language, including informal expressions, ungrammatical structures, colloquialisms, and domain-specific terminology.
- Structure: Extracting and representing the underlying organization of a conversation, including topic flow, dialogue phases, and utterance dependencies.
- Comprehension: Understanding both explicit and implicit content in dialogues, requiring local and global context awareness and background knowledge integration.
- Speaker: Tracking participant roles, their interactions, and references to entities and other participants, including role changes across topics.
- Salience: Identifying content critical for insightful summaries, locating relevant text spans, and adapting to different viewpoints and audiences.
- Factuality: Ensuring generated summaries accurately reflect the conversation, avoiding both extrinsic errors (hallucinations) and intrinsic errors (misrepresentations).
Our systematic literature review followed the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) methodology to ensure transparency, reproducibility, and comprehensiveness. The review process comprised two sequential stages with rigorously defined parameters.
We conducted a systematic search of academic literature using two established scholarly databases: Semantic Scholar and DBLP (the dblp computer science bibliography). The search strategy employed a structured combination of anchor terms and domain-specific search terms to maximize retrieval of relevant publications.
Anchor Terms:
- dialogue summarization
- dialog summarization
- conversation summarization
Search Terms: We systematically combined each anchor term with 28 search terms representing techniques, applications, and concepts relevant to dialogue summarization:
| Category | Terms |
|---|---|
| Core Concepts | technique, approach, dataset, evaluation, task, metric |
| Technical Terms | language model, NLP, semantic representation, information extraction |
| Advanced Concepts | text simplification, topic segmentation, personalization, controllable |
| Implementation | machine, automatic, training |
| Subdomains | meeting summarization, chat summarization, customer service summarization, legal, email, interview, medical |
| Ethical Considerations | ethic, cost, privacy, co2 |
This combinatorial approach yielded 112 distinct search queries combining anchor and search terms, which, after deduplication across both databases, resulted in 1,262 unique candidate papers published between January 2019 and March 2024.
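For illustration, below is a minimal sketch of how such a combinatorial query set can be generated and run against the Semantic Scholar Graph API (the endpoint and parameters follow the public API; the helper names, abbreviated term list, and title-based deduplication are simplifications for this sketch, not necessarily how the repository's crawler works):

```python
import itertools
import requests

ANCHOR_TERMS = ["dialogue summarization", "dialog summarization", "conversation summarization"]
SEARCH_TERMS = ["technique", "approach", "dataset", "evaluation"]  # abbreviated; the full set has 28 terms

S2_SEARCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search"

def build_queries(anchors, terms):
    """Combine every anchor term with every search term into one query string."""
    return [f"{a} {t}" for a, t in itertools.product(anchors, terms)]

def search_semantic_scholar(query, limit=100):
    """Fetch one page of results from the Semantic Scholar Graph API."""
    params = {"query": query, "limit": limit, "fields": "title,year,externalIds"}
    response = requests.get(S2_SEARCH_URL, params=params, timeout=30)
    response.raise_for_status()
    return response.json().get("data", [])

# Deduplicate results across all queries by (lowercased) title; a production
# crawler would also respect API rate limits and merge in DBLP results.
seen, papers = set(), []
for query in build_queries(ANCHOR_TERMS, SEARCH_TERMS):
    for paper in search_semantic_scholar(query):
        key = paper["title"].lower()
        if key not in seen:
            seen.add(key)
            papers.append(paper)
```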
The candidate papers underwent systematic screening by three domain experts, applying the following predefined eligibility criteria:
Inclusion Criteria:
- Long or short papers published in peer-reviewed venues
- Open-access publication ensuring accessibility
- Published since 2019, corresponding to the emergence of Transformer-based models for summarization
Exclusion Criteria:
- Non-English primary dataset to ensure consistent linguistic analysis
- Multi-modal studies incorporating visual or audio elements
- Focus on extractive rather than abstractive summarization
- Use of non-Transformer-based methods
The screening process involved:
- Initial assessment of titles and abstracts, reducing the corpus from 1,262 to 186 documents
- Comprehensive full-text examination with dual-reviewer validation
- Consensus resolution for disagreements regarding inclusion or exclusion
This rigorous filtering process resulted in the final inclusion of 133 papers that formed the evidentiary base for our systematic review. Each included paper was subsequently categorized according to its primary addressed challenges, employed techniques, utilized datasets, and evaluation methodologies.
- Language Challenge: Handling the idiosyncratic nature of spoken language
- Structure Challenge: Extracting and representing conversation organization
- Comprehension Challenge: Understanding explicit and implicit content
- Speaker Challenge: Tracking participant roles and interactions
- Salience Challenge: Identifying and focusing on relevant information
- Factuality Challenge: Ensuring reliability of generated summaries
The language challenge addresses the idiosyncratic nature of spoken language and individual speech patterns, including linguistic noise, content repetition, and domain-specific terminology.
Approach Categories:
- Pre-training: Adapting models to dialogue-specific language patterns
- Training tasks: Masking key dialogue elements, part-of-speech tagging (a toy masking example follows this list)
- Pre-processing: Anaphora resolution techniques
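As a toy illustration of the masking-style training task mentioned above, the following sketch masks speaker names and selected keywords in a SAMSum-style dialogue so that a model must reconstruct them during denoising training (the utterance format and keyword list are illustrative assumptions):

```python
import re

def mask_dialogue(dialogue, keywords, mask_token="<mask>"):
    """Mask speaker names and selected keywords so a model must reconstruct
    them from context, a denoising-style training task."""
    masked = []
    for line in dialogue.splitlines():
        line = re.sub(r"^\w+(?=:)", mask_token, line)  # mask the "Speaker:" prefix
        for keyword in keywords:
            line = re.sub(rf"\b{re.escape(keyword)}\b", mask_token, line, flags=re.IGNORECASE)
        masked.append(line)
    return "\n".join(masked)

dialogue = "Amanda: I baked cookies. Do you want some?\nJerry: Sure! Save me some cookies."
print(mask_dialogue(dialogue, keywords=["cookies"]))
# <mask>: I baked <mask>. Do you want some?
# <mask>: Sure! Save me some <mask>.
```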
The structure challenge focuses on extracting a conversation's underlying organization, identifying topic flow and dialogue phases, and analyzing utterance dependencies.
Approach Categories:
- Architecture modification: Graph structures to capture dialogue dynamics
- Pre-training: Adaptations for handling long context
- Training tasks: Understanding dialogue structure
- Importance measures: Grouping sentences into sub-topics
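A sketch of the importance-measure idea: embed utterances with a Sentence-BERT encoder (Reimers & Gurevych, 2019) and open a new sub-topic segment wherever the similarity between adjacent utterances drops (the model name and threshold are illustrative choices):

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works here

def segment_by_topic(utterances, threshold=0.4):
    """Start a new sub-topic segment whenever the similarity between two
    adjacent utterances drops below the threshold (TextTiling-style)."""
    embeddings = model.encode(utterances)
    segments, current = [], [utterances[0]]
    for i in range(1, len(utterances)):
        similarity = cosine_similarity([embeddings[i - 1]], [embeddings[i]])[0][0]
        if similarity < threshold:
            segments.append(current)
            current = []
        current.append(utterances[i])
    segments.append(current)
    return segments
```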
The comprehension challenge involves understanding and contextualizing dialogue content, including both explicit and implicit information, requiring awareness of both local and global context.
Approach Categories:
- Architecture modification: Context-aware frameworks to capture local and global context
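As a schematic of what such context-aware frameworks do, here is a toy PyTorch module of our own construction (not a published architecture) that fuses per-utterance local states with a pooled global dialogue state:

```python
import torch
import torch.nn as nn

class LocalGlobalFusion(nn.Module):
    """Fuse per-utterance (local) states with a mean-pooled dialogue-level
    (global) state so each position sees both kinds of context."""

    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, utterance_states):  # (num_utterances, dim)
        global_state = utterance_states.mean(dim=0, keepdim=True)
        fused = torch.cat(
            [utterance_states, global_state.expand_as(utterance_states)], dim=-1
        )
        return self.proj(fused)  # (num_utterances, dim)

states = torch.randn(12, 256)  # 12 encoded utterances
contextualized = LocalGlobalFusion(256)(states)
```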
The speaker challenge addresses tracking participants, their actions, entity references, and revealing participant relationships in dialogues.
Approach Categories:
- Training tasks: Understanding speaker roles and dynamics
- Architecture modification: Graph-based models for speaker structures
- Pre-processing: Enhancing speaker representation (a minimal example follows this list)
- Post-processing: Improving pronoun resolution
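A minimal pre-processing sketch in the spirit of enhanced speaker representation: map speaker names to indexed special tokens, both in turn prefixes and in in-utterance mentions (the token format and the `Speaker: text` convention are assumptions for illustration):

```python
import re

def normalize_speakers(dialogue):
    """Map speaker names to indexed tokens (<speaker1>, <speaker2>, ...) in both
    the turn prefixes and in-utterance mentions of known speakers."""
    speaker_ids, normalized = {}, []
    for line in dialogue.splitlines():
        match = re.match(r"^(\w+):\s*(.*)$", line)
        if not match:
            normalized.append(line)
            continue
        name, text = match.groups()
        token = speaker_ids.setdefault(name, f"<speaker{len(speaker_ids) + 1}>")
        for other_name, other_token in speaker_ids.items():
            text = re.sub(rf"\b{other_name}\b", other_token, text)
        normalized.append(f"{token}: {text}")
    return "\n".join(normalized)

print(normalize_speakers("Amanda: Hey Jerry, want cookies?\nJerry: Sure, Amanda!"))
# <speaker1>: Hey Jerry, want cookies?
# <speaker2>: Sure, <speaker1>!
```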
The salience challenge involves identifying and focusing on the most relevant information within a dialogue, including content critical for insightful summaries and information relevant to specific audiences.
Approach Categories:
- Training tasks: Distinguishing between salient and non-salient content
- Architecture modification: Modified attention mechanisms
- Human feedback: Incorporating human judgment
- Loss function: Special losses for missing or irrelevant information
- Pre-processing: Methods to identify key entities
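A toy version of such an entity-focused pre-processing step, using spaCy's off-the-shelf NER to rank entities by frequency and prepend them as a salience hint (the separator and prompt format are our own illustrative choices):

```python
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def prepend_key_entities(dialogue, top_k=5):
    """Rank named entities by mention frequency and prepend the most frequent
    ones as a salience hint for the summarizer."""
    counts = Counter(ent.text for ent in nlp(dialogue).ents)
    ranked = [entity for entity, _ in counts.most_common(top_k)]
    return " | ".join(ranked) + " => " + dialogue
```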
The factuality challenge addresses the reliability of generated summaries, preventing both extrinsic errors (hallucinations) and intrinsic errors (misrepresentations).
Approach Categories:
- Training tasks: Estimating factual aspects
- Architecture modification: Special encoders for dialogue states
- Human feedback: Using human assessments
- Loss function: Encouraging factual content generation
- Post-processing: Correcting factual errors
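As a sketch of factuality-oriented post-processing, one can flag summary sentences that an off-the-shelf NLI model does not judge as entailed by the dialogue. Here `roberta-large-mnli` stands in for a dedicated consistency checker; the overall procedure is an illustrative assumption, not a surveyed paper's method:

```python
from transformers import pipeline

# An off-the-shelf NLI model stands in for a dedicated consistency checker.
nli = pipeline("text-classification", model="roberta-large-mnli")

def flag_unsupported(summary_sentences, dialogue):
    """Flag summary sentences the dialogue does not entail, as candidates for
    correction or removal (long dialogues need chunking in practice)."""
    flagged = []
    for sentence in summary_sentences:
        result = nli([{"text": dialogue, "text_pair": sentence}])[0]
        if result["label"] != "ENTAILMENT":
            flagged.append((sentence, result["label"]))
    return flagged
```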
For a comprehensive analysis of techniques addressing dialogue summarization challenges, see our techniques overview.
Researchers have proposed various techniques to address the challenges in dialogue summarization. Below, we categorize these techniques and link them to the challenges they primarily address.
| Challenge | Approach Category | Representative Papers |
|---|---|---|
| Language | Pre-training | Raffel et al. (2020), Zou et al. (2021), Zhou et al. (2023) |
| | Training tasks | Zhu et al. (2020), Khalifa et al. (2021), Bertsch et al. (2022) |
| | Pre-processing | Ganesh & Dingliwal (2019) |
| Structure | Architecture modification | Li (2022), Lei et al. (2021), Gao et al. (2023), Hua et al. (2023) |
| | Pre-training | Lee et al. (2021), Peysakhovich & Lerer (2023), Xu et al. (2024) |
| | Training tasks | Feng et al. (2021), Liu et al. (2021), Yang et al. (2022) |
| | Importance measures | Reimers & Gurevych (2019), Liang et al. (2023) |
| Comprehension | Architecture modification | Wang et al. (2023) |
| Speaker | Training tasks | Gan et al. (2021), Qi et al. (2021), Naraki et al. (2022) |
| | Architecture modification | Lei et al. (2021), Liu et al. (2021), Hua et al. (2022) |
| | Pre-processing | Joshi et al. (2020), Lee et al. (2021) |
| | Post-processing | Fang et al. (2022), Liu & Chen (2022) |
| Salience | Training tasks | Chauhan et al. (2022), Liu et al. (2022), Ghadimi & Beigy (2022) |
| | Architecture modification | Li et al. (2021), Hua et al. (2023) |
| | Human feedback | Chen et al. (2023) |
| | Loss function | Huang et al. (2023) |
| | Pre-processing | Liu & Chen (2021), Jung et al. (2023) |
| Factuality | Training tasks | Gan et al. (2021), Tang et al. (2022) |
| | Architecture modification | Wu et al. (2021), Zhao et al. (2021), Nair et al. (2023) |
| | Human feedback | Chen et al. (2023) |
| | Loss function | Liu et al. (2022), Huang et al. (2023) |
| | Post-processing | Fu et al. (2021), Li et al. (2023) |
- Graph-based Models: Using graph structures to represent dialogue components, their relationships, and semantic connections.
  - Lei et al. (2021), Hua et al. (2022), Gao et al. (2023)
- Hierarchical Encoders: Processing dialogue at multiple levels (words, turns, topics) to capture different structural elements.
  - Zhu et al. (2020), Lei et al. (2021)
- Multi-stage Approaches: Breaking long dialogues into manageable segments before summarization.
  - Zhang et al. (2021), Laskar et al. (2023), Sharma et al. (2023)
- Contrastive Learning: Using positive and negative examples to help models distinguish between important and irrelevant content (see the sketch after this list).
  - Liu et al. (2022), Huang et al. (2023)
- Sketch-based Generation: Creating a structural outline before generating the full summary.
  - Wu et al. (2021)
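To make the contrastive-learning pattern concrete, a minimal PyTorch sketch of a margin-based objective that scores the reference summary above a perturbed negative relative to the dialogue encoding (the cosine scoring and margin are illustrative choices, not any specific paper's loss):

```python
import torch
import torch.nn.functional as F

def contrastive_margin_loss(anchor, positive, negative, margin=1.0):
    """Push the dialogue encoding (anchor) at least `margin` closer to the
    reference summary (positive) than to a perturbed summary (negative)."""
    pos_sim = F.cosine_similarity(anchor, positive, dim=-1)
    neg_sim = F.cosine_similarity(anchor, negative, dim=-1)
    return F.relu(margin - (pos_sim - neg_sim)).mean()

# Toy usage with random encodings standing in for real model outputs.
anchor, positive, negative = (torch.randn(4, 256) for _ in range(3))
loss = contrastive_margin_loss(anchor, positive, negative)
```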
We identified 18 widely-used datasets across different dialogue subdomains; they vary in size, complexity, and the challenges they primarily address. For detailed information, see our datasets overview.
| Subdomain | Dataset | Primary Challenge | Key Characteristics | Usage Count |
|---|---|---|---|---|
| Daily Chat | DialogSum | Speaker | 13k dialogues, ~1k tokens input, ~130 tokens summary | 26 |
| Online Chat | SAMSum | Speaker, Salience | 16k dialogues, 2-3 speakers, ~94 tokens per conversation | 68 |
| | FORUM | Salience | 100 discussion threads from online forums | 5 |
| | CRD3 | Language | 159 episodes, ~2550 turns per dialogue | 6 |
| Meeting | AMI | Salience, Comprehension | 137 business meetings, ~6k tokens, 535 turns, 4 speakers | 63 |
| | ICSI | Salience, Language | 59 academic meetings, ~13k tokens, 819 turns, 6 speakers | 21 |
| | QMSum | Salience, Language | 232 meetings, 1.8k query-summary pairs | 18 |
| | Elitr | Salience | 120 technical project meetings, ~7k words | 7 |
| | MeetingBank | Structure | 1366 city council meetings, ~28k tokens per transcript | 7 |
| Screenplay | MediaSum | Structure | 463.6k media interview transcripts | 15 |
| | SummScreen | Speaker, Salience | 27k TV series transcripts, ~6.6k tokens | 8 |
| Customer Service | TweetSumm | Language | 1.1k Twitter customer support exchanges | 7 |
| | TODSum | Language, Factuality | 10k task-oriented dialogues | 6 |
| Medical | MTS-Dialog | Comprehension | 1.7k doctor-patient conversations | 5 |
| Debates | ADSC | Comprehension | 45 two-party dialogues on social/political topics | 5 |
| Others | ConvoSumm | Salience | 2k dialogues from various sources | 5 |
| | Ferranti | Factuality | 4k items for factual error correction | 5 |
| | DialSum | Comprehension | Two-party discussions on images | 10 |
Due to the limited number of available datasets, researchers have explored ways to generate data artificially:
- Weak labeling: Using heuristics to generate proxy summaries (Sznajder et al., 2022); a toy version is sketched after this list
- Paraphrasing: Modifying existing dialogues while maintaining coherence (Liu et al., 2022; Wahle et al., 2022, 2023)
- Structure alteration: Random modifications to conversation structures (Chen & Yang, 2021; Park et al., 2022)
- LLM generation: Using language models to create ground truth summaries (Asi et al., 2022; Nair et al., 2023; Zhou et al., 2023)
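As a toy illustration of the weak-labeling idea referenced above (far simpler than the heuristics actually used by Sznajder et al., 2022), a lead-turn heuristic that takes a dialogue's opening utterances as a proxy summary:

```python
def weak_label(turns, max_turns=2):
    """Naive weak-labeling heuristic: treat the opening turns as a proxy summary.
    `turns` is a list of (speaker, text) pairs; real heuristics are more refined."""
    proxy_summary = " ".join(text for _, text in turns[:max_turns])
    return {"dialogue": turns, "summary": proxy_summary}

example = [("Agent", "Your package ships tomorrow."), ("Customer", "Great, thank you!")]
print(weak_label(example)["summary"])  # Your package ships tomorrow. Great, thank you!
```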
For a detailed analysis of evaluation metrics used in dialogue summarization, see our metrics overview.
Dialogue summarization uses various evaluation metrics, which we categorize into four groups:
- Count-based Metrics (a usage example follows this list)
  - ROUGE (used in 127 papers): Measures recall of n-gram overlap
  - BLEU (21 papers): Measures precision of n-gram overlap
  - METEOR, CIDEr, chrF++, etc.
- Model-based Metrics
  - BERTScore (38 papers): Token-level similarity using BERT embeddings
  - BARTScore (13 papers): Evaluates text generation likelihood
  - FactCC (9 papers): Measures factual consistency
- QA-based Metrics
  - FEQA (38 papers): Question generation and answering for faithfulness
  - SummaQA, QuestEval
- Human Evaluation
  - Performance assessment: Likert scales, pairwise comparison, best-worst scaling
  - Agreement metrics: Krippendorff's Alpha, Cohen's Kappa, Fleiss' Kappa
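To make the count-based and model-based categories concrete, here is a minimal sketch using the `rouge-score` and `bert-score` packages (the package choices and example texts are ours; the surveyed papers use a variety of implementations):

```python
from rouge_score import rouge_scorer        # pip install rouge-score
from bert_score import score as bert_score  # pip install bert-score

reference = "Amanda baked cookies and will bring Jerry some tomorrow."
candidate = "Amanda made cookies and plans to bring some to Jerry."

# Count-based: n-gram overlap between candidate and reference.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)
print({name: round(s.fmeasure, 3) for name, s in rouge.items()})

# Model-based: token-level similarity in contextual embedding space.
P, R, F1 = bert_score([candidate], [reference], lang="en")
print(f"BERTScore F1: {F1.item():.3f}")
```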
| Category | Characteristic | Description | Common Metrics |
|---|---|---|---|
| Accuracy | Factuality | Adherence to facts in the source | FactCC, FEQA, QuestEval |
| Content | Relevance | Importance of included information | ROUGE, Human evaluation |
| | Coverage | Inclusion of all key topics | ROUGE-L, Human evaluation |
| | Informativeness | Depth and usefulness of information | SummaQA, Human evaluation |
| Readability | Coherence | Logical flow and connections | Human evaluation, BARTScore |
| | Structure | Organization of information | Human evaluation |
| | Conciseness | Brevity without redundancy | Human evaluation |
| Style | Consistency | Uniform perspective and tense | Human evaluation |
| | Fluency | Grammatical correctness | Perplexity, Human evaluation |
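Many of the characteristics above fall back on human evaluation, whose reliability is in turn quantified with the agreement metrics listed earlier. A minimal example computing Cohen's Kappa with scikit-learn (the ratings are invented for illustration):

```python
from sklearn.metrics import cohen_kappa_score

# Two annotators rating the same six summaries for coherence (1-5 Likert scale).
rater_a = [4, 3, 5, 2, 4, 4]
rater_b = [4, 3, 4, 2, 5, 4]

# Quadratic weights penalize large disagreements more, which suits ordinal scales.
print(cohen_kappa_score(rater_a, rater_b, weights="quadratic"))
```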
Our analysis reveals several promising research directions:
- LLMs Impact: Exploring how Large Language Models may reduce challenges in language handling while still struggling with comprehension and factuality. Recent work suggests LLMs match or exceed task-specific models but face limitations in robustness.
- Personalized Summarization: Growing demand for summaries tailored to specific user needs and perspectives. Only a few datasets (e.g., the Chinese CSDS) currently incorporate personalized summaries.
- Realistic Datasets: Need for high-quality, realistic datasets rather than large-scale synthetic ones, especially as models improve in few-shot learning capabilities.
- Improved Evaluation: Development of specialized metrics for dialogue summarization, exploring LLM-based evaluation approaches such as GEMBA and ICE.
- Cross-challenge Approaches: Developing techniques that address multiple challenges simultaneously rather than focusing on individual aspects.
If you find this repository or our paper useful for your research, please cite our work:
@article{kirstein2025cads,
title={CADS: A Systematic Literature Review on the Challenges of Abstractive Dialogue Summarization},
author={Kirstein, Frederic and Wahle, Jan Philip and Gipp, Bela and Ruas, Terry},
journal={Journal of Artificial Intelligence Research},
volume={82},
pages={313--365},
year={2025}
}