Public email corpus for Rspamd integration and regression testing.
This repository contains:
- Base corpus - SpamAssassin public corpus (~1000 messages) for load testing
- Regression tests - Real emails from bug reports that caused issues
- Edge cases - Specific test cases for corner cases and special scenarios
corpus/
  spam/ - Spam messages from SpamAssassin corpus
  ham/ - Legitimate messages from SpamAssassin corpus
  edge-cases/ - Special test cases (unicode, large files, malformed, etc.)
regression/
  issue-NNNN.eml - Email from GitHub issue #NNNN
  issue-NNNN.yaml - Expected test results and metadata
scripts/
  prepare-corpus.sh - Download and prepare SpamAssassin corpus
  build-archive.sh - Build zip archive for CI
  validate-corpus.sh - Validate all emails in corpus
  add-regression.sh - Add new regression test from issue
- name: Download test corpus
  run: |
    curl -L https://github.com/rspamd/rspamd-test-corpus/releases/download/v1.0/corpus.zip -o corpus.zip
    unzip corpus.zip -d corpus/

- name: Clone corpus
  run: |
    git clone --depth=1 https://github.com/rspamd/rspamd-test-corpus.git corpus

When you find a problematic email that causes a bug:
# Add the email
./scripts/add-regression.sh issue-1234 /path/to/email.eml
# This creates:
# - regression/issue-1234.eml (the email)
# - regression/issue-1234.yaml (metadata)

Example metadata file (regression/issue-1234.yaml):
issue: 1234
title: "MIME parser crashes on malformed Content-Type"
date: 2025-10-17
category: parser
expected:
  should_not_crash: true
  symbols:
    - MISSING_SUBJECT
  score_min: 0.0
  score_max: 5.0

cd scripts
./prepare-corpus.sh

This will:
- Download SpamAssassin public corpus
- Extract and organize emails into corpus/spam/ and corpus/ham/
- Select ~1000 representative messages
./scripts/build-archive.sh
# Creates: releases/rspamd-test-corpus-YYYYMMDD.zip

- Total messages: ~1200
  - Spam: ~400
  - Ham: ~600
  - Edge cases: ~100
  - Regression: ~100 (growing)
- Size: ~15 MB (uncompressed), ~3 MB (zip)
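
The figures above are approximate and will drift as regression cases are added. Below is a minimal sketch for recomputing them from a checkout. It assumes the directory layout described earlier, with edge-cases/ located under corpus/ (an assumption; adjust the paths if your checkout differs). The function names count_files and corpus_stats are hypothetical and are not part of the repository's scripts.

```shell
# Count regular files under a directory (recursively).
# Optional second argument: a find(1) name pattern, e.g. '*.eml'.
# Prints 0 if the directory does not exist.
count_files() {
  cf_dir="$1"
  cf_pattern="${2:-*}"
  if [ -d "$cf_dir" ]; then
    find "$cf_dir" -type f -name "$cf_pattern" | wc -l | tr -d ' '
  else
    echo 0
  fi
}

# Print a one-line summary for a repository root (default: current directory).
# Assumed layout: corpus/spam, corpus/ham, corpus/edge-cases, regression.
corpus_stats() {
  cs_root="${1:-.}"
  cs_spam=$(count_files "$cs_root/corpus/spam")
  cs_ham=$(count_files "$cs_root/corpus/ham")
  cs_edge=$(count_files "$cs_root/corpus/edge-cases")
  # Count only .eml files so the .yaml metadata is not counted as a message.
  cs_reg=$(count_files "$cs_root/regression" '*.eml')
  echo "spam=$cs_spam ham=$cs_ham edge-cases=$cs_edge regression=$cs_reg total=$(($cs_spam + $cs_ham + $cs_edge + $cs_reg))"
}
```

Calling `corpus_stats /path/to/rspamd-test-corpus` prints the per-category counts on a single line, which is easy to compare against the numbers listed here.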
- SpamAssassin Public Corpus (https://spamassassin.apache.org/old/publiccorpus/)
- License: Public domain / freely redistributable
- Messages from 2002-2003, cleaned and anonymized
- Real emails from GitHub issues (with sensitive data removed)
- Contributed by Rspamd community
- Manually created test cases
- Synthetic emails for specific scenarios
- SpamAssassin corpus: Public domain (see SpamAssassin license)
- Regression tests: MIT License (see LICENSE)
- Scripts: MIT License (see LICENSE)
- Open an issue in rspamd/rspamd with a description of the bug
- Use scripts/add-regression.sh to add the email to this repository
- Remove any sensitive information (passwords, real email addresses, etc.)
- Submit a PR with the new regression test
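
For the step about removing sensitive information, a rough first pass over email addresses can be scripted. The sketch below is an illustration only: redact_addresses is a hypothetical helper, not something scripts/add-regression.sh is known to provide, and the pattern is a deliberately simple approximation of an address rather than a full RFC 5322 matcher.

```shell
# Replace anything that looks like an email address with user@example.com.
# Reads from stdin and writes to stdout. The regex is an approximation and
# will also rewrite address-like tokens such as Message-ID values.
redact_addresses() {
  sed -E 's/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/user@example.com/g'
}
```

A possible use is `redact_addresses < original.eml > regression/issue-1234.eml`. Because rewriting headers can change how a message is parsed, it is worth confirming that the redacted file still reproduces the bug, and reviewing it by hand for passwords or other data the pattern cannot catch.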
To update the base corpus:
cd scripts
./prepare-corpus.sh --refresh
./build-archive.sh

Create a new release with the updated archive.
Validate all emails in the corpus:
./scripts/validate-corpus.sh

This checks:
- All files are valid RFC822 messages
- No sensitive data leaked (email addresses, IPs from non-public ranges)
- Proper file encoding
- Metadata files are valid YAML
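
To give a feel for what such checks can look like, here is a minimal sketch of two of them. It is not the contents of validate-corpus.sh: the function names looks_like_message and missing_metadata are hypothetical, the message check is only a shallow heuristic (first header line, after an optional mbox-style "From " envelope line), and the metadata check only confirms that each regression email has a matching .yaml file, not that the YAML parses.

```shell
# Return 0 if the file starts like an RFC 822 style message:
# an optional mbox "From " envelope line, then a "Name: value" header field.
looks_like_message() {
  awk '
    NR == 1 && /^From / { next }            # skip mbox envelope line
    { ok = ($0 ~ /^[!-9;-~]+:/); exit }     # first real line must be a header field
    END { exit !ok }
  ' "$1"
}

# Print every regression/issue-*.eml under the given directory that has
# no issue-*.yaml metadata file next to it.
missing_metadata() {
  for mm_eml in "$1"/issue-*.eml; do
    [ -e "$mm_eml" ] || continue            # glob matched nothing
    [ -f "${mm_eml%.eml}.yaml" ] || echo "$mm_eml"
  done
}
```

For example, `missing_metadata regression` prints nothing when every regression email has its metadata file, and `looks_like_message corpus/spam/some-file || echo "invalid"` flags a file whose first line is not a header field (the file name here is a placeholder).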
- GitHub Issues: https://github.com/rspamd/rspamd-test-corpus/issues
- Rspamd Community: https://rspamd.com/support.html