Rspamd Test Email Corpus

Public email corpus for Rspamd integration and regression testing.

Purpose

This repository contains:

  1. Base corpus - SpamAssassin public corpus (~1000 messages) for load testing
  2. Regression tests - Real emails from bug reports that triggered bugs in Rspamd
  3. Edge cases - Hand-built messages for corner cases (unicode, large files, malformed input)

Repository Structure

corpus/
  spam/         - Spam messages from SpamAssassin corpus
  ham/          - Legitimate messages from SpamAssassin corpus
  edge-cases/   - Special test cases (unicode, large files, malformed, etc.)

regression/
  issue-NNNN.eml   - Email from GitHub issue #NNNN
  issue-NNNN.yaml  - Expected test results and metadata

scripts/
  prepare-corpus.sh   - Download and prepare SpamAssassin corpus
  build-archive.sh    - Build zip archive for CI
  validate-corpus.sh  - Validate all emails in corpus
  add-regression.sh   - Add new regression test from issue
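
The layout above implies that every regression email is paired with a metadata file of the same base name. As a minimal sketch of checking that pairing (the `missing_metadata` helper is hypothetical and is not one of the repository's scripts):

```shell
# Hypothetical helper: print regression emails that have no metadata file.
# Assumes the layout described above:
#   regression/issue-NNNN.eml  +  regression/issue-NNNN.yaml
missing_metadata() {
  dir="${1:-regression}"
  for eml in "$dir"/issue-*.eml; do
    # If the glob matched nothing, the literal pattern is left behind; skip it.
    [ -e "$eml" ] || continue
    yaml="${eml%.eml}.yaml"
    # Print the email path only when its metadata file is absent.
    [ -f "$yaml" ] || printf '%s\n' "$eml"
  done
}
```

Running `missing_metadata regression` from the repository root would print one path per unpaired email and nothing when every email has metadata.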

Usage in CI

Download pre-built corpus

- name: Download test corpus
  run: |
    # -f makes curl fail on an HTTP error instead of saving the error page as corpus.zip
    curl -fL https://github.com/rspamd/rspamd-test-corpus/releases/download/v1.0/corpus.zip -o corpus.zip
    unzip corpus.zip -d corpus/

Use latest from main branch

- name: Clone corpus
  run: |
    git clone --depth=1 https://github.com/rspamd/rspamd-test-corpus.git corpus

Adding Regression Tests

When an email triggers a bug, add it as a regression test:

# Add the email
./scripts/add-regression.sh issue-1234 /path/to/email.eml

# This creates:
# - regression/issue-1234.eml (the email)
# - regression/issue-1234.yaml (metadata)

Example metadata file (regression/issue-1234.yaml):

issue: 1234
title: "MIME parser crashes on malformed Content-Type"
date: 2025-10-17
category: parser
expected:
  should_not_crash: true
  symbols:
    - MISSING_SUBJECT
  score_min: 0.0
  score_max: 5.0
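
How a test harness consumes this metadata is not described here. As a rough sketch only, the score bounds could be read and compared against a score obtained from a scan. Both `yaml_value` and `score_in_range` are hypothetical helpers, and the line-based read below only handles the flat `key: value` lines shown in the example; it is not a YAML parser.

```shell
# Hypothetical helper: print the value of the first "key: value" line.
# $1 = metadata file, $2 = key. Leading indentation is ignored, so nested
# keys such as score_min under "expected:" are found as well.
yaml_value() {
  awk -v key="$2" '
    { line = $0; sub(/^[ \t]+/, "", line) }
    index(line, key ":") == 1 {
      sub(/^[^:]*:[ \t]*/, "", line)
      print line
      exit
    }' "$1"
}

# Hypothetical helper: succeed when a score lies within [score_min, score_max].
# $1 = metadata file, $2 = observed score (assumed to come from a scan result).
# Returns 2 when either bound is missing from the metadata.
score_in_range() {
  min=$(yaml_value "$1" score_min)
  max=$(yaml_value "$1" score_max)
  [ -n "$min" ] && [ -n "$max" ] || return 2
  # Adding 0 forces a numeric comparison rather than a string comparison.
  awk -v s="$2" -v lo="$min" -v hi="$max" \
    'BEGIN { exit !(s + 0 >= lo + 0 && s + 0 <= hi + 0) }'
}
```

With the example file above, `score_in_range regression/issue-1234.yaml 3.2` would succeed and a score of `7.5` would fail.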

Building the Corpus

Initial setup (download SpamAssassin corpus)

cd scripts
./prepare-corpus.sh

This will:

  1. Download the SpamAssassin public corpus
  2. Extract and organize the emails into corpus/spam/ and corpus/ham/
  3. Select ~1000 representative messages

Create release archive

# Run from the repository root
./scripts/build-archive.sh
# Creates: releases/rspamd-test-corpus-YYYYMMDD.zip

Corpus Statistics

  • Total messages: ~1200
    • Spam: ~400
    • Ham: ~600
    • Edge cases: ~100
    • Regression: ~100 (growing)
  • Size: ~15 MB (uncompressed), ~3 MB (zip)
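
The figures above are approximate. One rough way to recount them locally, assuming one message per file and the directory layout from Repository Structure, is sketched below. `count_messages` and `corpus_summary` are hypothetical helpers, not part of the repository's scripts.

```shell
# Hypothetical helper: print the number of regular files under a directory.
# $1 = directory, $2 = optional file name pattern (defaults to every file).
# Prints 0 when the directory does not exist.
count_messages() {
  dir="$1"
  pattern="${2:-*}"
  if [ -d "$dir" ]; then
    # tr strips the padding that some wc implementations add.
    find "$dir" -type f -name "$pattern" | wc -l | tr -d ' '
  else
    echo 0
  fi
}

# Print one count per category; run from the repository root.
corpus_summary() {
  printf 'spam:       %s\n' "$(count_messages corpus/spam)"
  printf 'ham:        %s\n' "$(count_messages corpus/ham)"
  printf 'edge cases: %s\n' "$(count_messages corpus/edge-cases)"
  # Count only the emails, not their .yaml metadata files.
  printf 'regression: %s\n' "$(count_messages regression '*.eml')"
}
```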

Sources

Base Corpus

Regression Tests

  • Real emails from GitHub issues (with sensitive data removed)
  • Contributed by Rspamd community

Edge Cases

  • Manually created test cases
  • Synthetic emails for specific scenarios

License

  • SpamAssassin corpus: Public domain (see SpamAssassin license)
  • Regression tests: MIT License (see LICENSE)
  • Scripts: MIT License (see LICENSE)

Contributing

Adding problematic emails

  1. Open an issue in rspamd/rspamd describing the bug
  2. Use scripts/add-regression.sh to add the email to this repository
  3. Remove any sensitive information (passwords, real email addresses, etc.)
  4. Submit a PR with the new regression test

Updating corpus

To update the base corpus:

cd scripts
./prepare-corpus.sh --refresh
./build-archive.sh

Create a new release with the updated archive.

Validation

Validate all emails in the corpus:

./scripts/validate-corpus.sh

This checks:

  • All files are valid RFC 822 messages
  • No sensitive data has leaked (email addresses, IP addresses from non-public ranges)
  • Proper file encoding
  • Metadata files are valid YAML
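
The internals of validate-corpus.sh are not shown here. As an illustration of what two of these checks could look like, the sketch below uses a first-line heuristic for message structure and a scan for RFC 1918 private IPv4 ranges. Both function names are hypothetical, and the heuristics are assumptions rather than the script's actual logic.

```shell
# Hypothetical check: does the file start like an email message?
# Accepts a header field ("Name: value") or an mbox-style "From " line.
# This is a heuristic on the first line only, not a full RFC 822 parse.
looks_like_message() {
  # Field names are printable ASCII other than ':' (codes 33-57 and 59-126).
  head -n 1 "$1" | LC_ALL=C grep -Eq '^(From |[!-9;-~]+:)'
}

# Hypothetical check: print lines that contain an RFC 1918 private IPv4
# address (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16), with line numbers.
# Exits non-zero when no such address is found, as grep does.
find_private_ips() {
  o='[0-9]{1,3}'
  # The leading and trailing groups stop matches inside longer numbers,
  # so 110.1.2.3 is not reported as 10.1.2.3.
  grep -nE "(^|[^0-9.])(10\.$o\.$o\.$o|192\.168\.$o\.$o|172\.(1[6-9]|2[0-9]|3[01])\.$o\.$o)([^0-9.]|\$)" "$@"
}
```

A file would pass this sketch when `looks_like_message` succeeds and `find_private_ips` finds nothing.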

Related

Contact
