
🎢 Using the Discogs database export for local graph exploration. 🎢


discogsography


A modern Python 3.13+ microservices system for processing Discogs database exports into multiple storage backends. The system downloads monthly data dumps from Discogs, parses the XML files, and stores the data in both Neo4j (graph database) and PostgreSQL (relational database) for different query patterns and use cases.

Overview

Discogsography consists of four microservices that work together to process and monitor the complete Discogs database:

  1. Dashboard - Real-time monitoring dashboard with WebSocket updates for all services
  2. Extractor - Downloads Discogs XML dumps from S3, validates checksums, parses XML to JSON, and publishes to message queues (a streaming-parse sketch follows this list)
  3. Graphinator - Consumes messages and builds a graph database in Neo4j with relationships between artists, labels, releases, and masters
  4. Tableinator - Consumes messages and stores denormalized data in PostgreSQL for fast queries and full-text search
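
The extractor's streaming parse can be sketched in a few lines (this assumes lxml and illustrative element names; the real implementation lives in extractor/extractor.py):

# Hypothetical sketch: stream a Discogs XML dump one record at a time.
# Element names and the file path are illustrative, not the project's code.
import json

from lxml import etree

def stream_artists(xml_path: str):
    """Yield one artist dict at a time without loading the whole file."""
    for _, elem in etree.iterparse(xml_path, events=("end",), tag="artist"):
        yield {"id": elem.findtext("id"), "name": elem.findtext("name")}
        elem.clear()  # free memory held by already-processed elements
        while elem.getprevious() is not None:
            del elem.getparent()[0]

for artist in stream_artists("discogs_artists.xml"):
    print(json.dumps(artist))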

Architecture

graph TD
    S3[("Discogs S3<br/>Data Dumps")]
    EXT[["Extractor<br/>(XML β†’ JSON)"]]
    RMQ{{"RabbitMQ<br/>Message Queue"}}
    NEO4J[(Neo4j<br/>Graph DB)]
    PG[(PostgreSQL<br/>Relational DB)]
    GRAPH[["Graphinator"]]
    TABLE[["Tableinator"]]
    DASH[["Dashboard<br/>(Monitoring)"]]

    S3 -->|Download & Parse| EXT
    EXT -->|Publish Messages| RMQ
    RMQ -->|Consume| GRAPH
    RMQ -->|Consume| TABLE
    GRAPH -->|Store| NEO4J
    TABLE -->|Store| PG

    DASH -.->|Monitor| EXT
    DASH -.->|Monitor| GRAPH
    DASH -.->|Monitor| TABLE
    DASH -.->|Query Stats| RMQ
    DASH -.->|Query Stats| NEO4J
    DASH -.->|Query Stats| PG

    style S3 fill:#e1f5fe
    style RMQ fill:#fff3e0
    style NEO4J fill:#f3e5f5
    style PG fill:#e8f5e9
    style DASH fill:#fce4ec

Features

  • Automatic Updates: Periodic checking for new Discogs data releases (configurable interval, default 15 days)
  • Efficient Processing: Hash-based deduplication to avoid reprocessing unchanged records (see the sketch after this list)
  • Concurrent Processing: Multi-threaded XML parsing and concurrent message processing
  • Fault Tolerance: Message acknowledgment, automatic retries, and graceful shutdown
  • Progress Tracking: Real-time progress monitoring with detailed statistics
  • Docker Support: Full Docker Compose setup with security hardening (non-root, read-only filesystems, capability dropping) - see Docker Security and Dockerfile Standards
  • Type Safety: Comprehensive type hints and mypy validation
  • Security: Bandit security scanning, secure coding practices, and container security best practices
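
The hash-based deduplication mentioned above can be sketched in a few lines (function and storage names are illustrative, not the project's actual code):

# Illustrative sketch of hash-based deduplication: a record is only
# reprocessed when the hash of its canonical JSON form changes.
import hashlib
import json

def record_hash(record: dict) -> str:
    """Stable SHA-256 over the canonical JSON form of a record."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

seen: dict[str, str] = {}  # data_id -> hash of the last processed version

def should_process(data_id: str, record: dict) -> bool:
    """Return False for records whose content has not changed."""
    new_hash = record_hash(record)
    if seen.get(data_id) == new_hash:
        return False  # unchanged since the last run; skip it
    seen[data_id] = new_hash
    return True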

Quick Start

Prerequisites

  • Python 3.13+
  • Docker and Docker Compose
  • ~100GB free disk space for Discogs data
  • 8GB+ RAM recommended

Using Docker Compose (Recommended)

  1. Clone the repository:

    git clone https://github.com/SimplicityGuy/discogsography.git
    cd discogsography
  2. Start all services:

    docker-compose up -d
  3. Monitor the logs:

    docker-compose logs -f extractor
  4. Access the services (the dashboard web UI, the RabbitMQ management console, Neo4j Browser, and so on) on the ports published in docker-compose.yml.

Local Development

  1. Install uv package manager:

    curl -LsSf https://astral.sh/uv/install.sh | sh
  2. Install dependencies:

    uv sync --all-extras
  3. Set up pre-commit hooks:

    uv run pre-commit install
  4. Set environment variables:

    export AMQP_CONNECTION="amqp://guest:guest@localhost:5672/"
    export NEO4J_ADDRESS="bolt://localhost:7687"
    export NEO4J_USERNAME="neo4j"
    export NEO4J_PASSWORD="password"
    export POSTGRES_ADDRESS="localhost:5432"
    export POSTGRES_USERNAME="postgres"
    export POSTGRES_PASSWORD="password"
    export POSTGRES_DATABASE="discogsography"
  5. Run services:

    # Terminal 1 - Dashboard
    uv run python dashboard/dashboard.py
    
    # Terminal 2 - Extractor
    uv run python extractor/extractor.py
    
    # Terminal 3 - Graphinator
    uv run python graphinator/graphinator.py
    
    # Terminal 4 - Tableinator
    uv run python tableinator/tableinator.py

Configuration

Environment Variables

| Variable | Description | Default | Service |
| --- | --- | --- | --- |
| AMQP_CONNECTION | RabbitMQ connection string | Required | All |
| DISCOGS_ROOT | Path for downloaded files | /discogs-data | Extractor |
| PERIODIC_CHECK_DAYS | Days between update checks | 15 | Extractor |
| NEO4J_ADDRESS | Neo4j bolt address | Required | Dashboard, Graphinator |
| NEO4J_USERNAME | Neo4j username | Required | Dashboard, Graphinator |
| NEO4J_PASSWORD | Neo4j password | Required | Dashboard, Graphinator |
| POSTGRES_ADDRESS | PostgreSQL host:port | Required | Dashboard, Tableinator |
| POSTGRES_USERNAME | PostgreSQL username | Required | Dashboard, Tableinator |
| POSTGRES_PASSWORD | PostgreSQL password | Required | Dashboard, Tableinator |
| POSTGRES_DATABASE | PostgreSQL database | Required | Dashboard, Tableinator |
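
A minimal sketch of reading and validating these variables at startup (the project's actual loader lives in common/config.py; only a subset of variables is shown):

# Illustrative config loader: fail fast when a required variable is missing.
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Config:
    amqp_connection: str
    neo4j_address: str
    neo4j_username: str
    neo4j_password: str

def load_config() -> Config:
    def require(name: str) -> str:
        value = os.environ.get(name)
        if not value:
            raise RuntimeError(f"missing required environment variable: {name}")
        return value

    return Config(
        amqp_connection=require("AMQP_CONNECTION"),
        neo4j_address=require("NEO4J_ADDRESS"),
        neo4j_username=require("NEO4J_USERNAME"),
        neo4j_password=require("NEO4J_PASSWORD"),
    )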

Data Volume

The complete Discogs dataset includes:

  • ~15 million releases
  • ~2 million artists
  • ~2 million masters
  • ~1.5 million labels

Processing the full dataset requires:

  • ~50GB for compressed XML files
  • ~100GB for extracted data
  • Several hours for initial processing (varies by hardware)

Usage Examples

Neo4j Queries (via Graphinator)

Find all albums by an artist:

MATCH (a:Artist {name: "Pink Floyd"})<-[:BY]-(r:Release)
RETURN r.title, r.id
LIMIT 10

Find all members of a band:

MATCH (member:Artist)-[:MEMBER_OF]->(band:Artist {name: "The Beatles"})
RETURN member.name

Find all releases on a label:

MATCH (r:Release)-[:ON]->(l:Label {name: "Blue Note"})
RETURN r.title, r.id
LIMIT 10

PostgreSQL Queries (via Tableinator)

Search releases by title:

SELECT data->>'title' as title, data->>'year' as year
FROM releases
WHERE data->>'title' ILIKE '%dark side%'
LIMIT 10;

Get artist details:

SELECT data->>'name' as name, data->>'profile' as profile
FROM artists
WHERE data->>'name' = 'Miles Davis';

Find releases by year:

SELECT data->>'title' as title, data->>'artist' as artist
FROM releases
WHERE (data->>'year')::int = 1969
LIMIT 10;
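
To run such queries from application code, here is a minimal parameterized sketch (assuming the psycopg2 driver and the connection settings shown earlier; parameterization avoids SQL injection):

# Illustrative sketch: parameterized JSONB query with psycopg2.
import psycopg2

conn = psycopg2.connect(
    host="localhost",
    port=5432,
    user="postgres",
    password="password",
    dbname="discogsography",
)
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT data->>'title', data->>'year'
        FROM releases
        WHERE data->>'title' ILIKE %s
        LIMIT 10
        """,
        ("%dark side%",),
    )
    for title, year in cur.fetchall():
        print(title, year)
conn.close()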

Monitoring

Built-in Monitoring

Each service provides detailed progress logging:

  • Record processing rates (records/second)
  • Queue depths and consumer health
  • Error counts and retry statistics
  • Stall detection (no activity for >2 minutes)
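
Stall detection can be sketched as a watchdog that tracks the time of the last processed record (names and intervals here are illustrative, not the services' exact implementation):

# Illustrative stall watchdog: warn when no record has been processed
# for longer than the threshold.
import asyncio
import time

STALL_THRESHOLD = 120.0  # seconds without activity before warning
last_activity = time.monotonic()

def touch() -> None:
    """Call this whenever a record is processed."""
    global last_activity
    last_activity = time.monotonic()

async def stall_watchdog() -> None:
    while True:
        await asyncio.sleep(30)  # check twice a minute
        idle = time.monotonic() - last_activity
        if idle > STALL_THRESHOLD:
            print(f"⚠️ No activity for {idle:.0f}s; consumer may be stalled")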

Debug Utilities

The utilities/ directory contains helpful debugging tools:

# Check for errors in logs
uv run python utilities/check_errors.py

# Monitor queue statistics
uv run python utilities/check_queues.py

# Real-time queue monitoring
uv run python utilities/monitor_queues.py

# System health dashboard
uv run python utilities/system_monitor.py

Development

Code Quality

The project uses modern Python tooling:

  • uv: Fast package management with lock files
  • ruff: Fast Python linting and formatting
  • mypy: Static type checking
  • bandit: Security vulnerability scanning
  • pre-commit: Automated code quality checks

Run all checks:

uv run pre-commit run --all-files

Testing

Run the test suite:

uv run pytest                    # Run all tests
uv run pytest --cov              # Run with coverage report
uv run pytest -m "not e2e"       # Run all tests except E2E
uv run pytest tests/dashboard/   # Run dashboard tests only

For E2E tests with Playwright:

# One-time setup
uv run playwright install chromium        # Install browser
uv run playwright install-deps chromium   # Install system dependencies

# Run E2E tests (starts test server automatically)
uv run task test-e2e

# Or using pytest directly:
uv run pytest tests/dashboard/test_dashboard_ui.py -m e2e

Project Structure

discogsography/
β”œβ”€β”€ common/             # Shared configuration and utilities
β”‚   β”œβ”€β”€ config.py       # Configuration management
β”‚   └── health_server.py # Health check endpoints
β”œβ”€β”€ dashboard/          # Real-time monitoring dashboard
β”‚   └── dashboard.py    # Web UI with WebSocket updates
β”œβ”€β”€ extractor/          # XML parsing and message publishing
β”‚   β”œβ”€β”€ extractor.py    # Main service
β”‚   └── discogs.py      # S3 download logic
β”œβ”€β”€ graphinator/        # Neo4j graph database service
β”‚   └── graphinator.py  # Graph relationship builder
β”œβ”€β”€ tableinator/        # PostgreSQL service
β”‚   └── tableinator.py  # Relational data storage
β”œβ”€β”€ utilities/          # Debugging and monitoring tools
β”œβ”€β”€ docker-compose.yml  # Container orchestration
└── pyproject.toml      # Project metadata and dependencies

Logging Conventions

All logger calls (logger.info, logger.warning, logger.error) in this project follow a consistent emoji pattern for visual clarity. Each message starts with an emoji followed by exactly one space before the message text.

Emoji Key

| Emoji | Usage | Example |
| --- | --- | --- |
| πŸš€ | Startup messages | logger.info("πŸš€ Starting service...") |
| βœ… | Success/completion messages | logger.info("βœ… Operation completed successfully") |
| ❌ | Errors | logger.error("❌ Failed to connect to database") |
| ⚠️ | Warnings | logger.warning("⚠️ Connection timeout, retrying...") |
| πŸ›‘ | Shutdown/stop messages | logger.info("πŸ›‘ Shutting down gracefully") |
| πŸ“Š | Progress/statistics | logger.info("πŸ“Š Processed 1000 records") |
| πŸ“₯ | Downloads | logger.info("πŸ“₯ Starting download of data") |
| ⬇️ | Downloading files | logger.info("⬇️ Downloading file.xml") |
| πŸ”„ | Processing operations | logger.info("πŸ”„ Processing batch of messages") |
| ⏳ | Waiting/pending | logger.info("⏳ Waiting for messages...") |
| πŸ“‹ | Metadata operations | logger.info("πŸ“‹ Loaded metadata from cache") |
| πŸ” | Checking/searching | logger.info("πŸ” Checking for updates...") |
| πŸ“„ | File operations | logger.info("πŸ“„ File created successfully") |
| πŸ†• | New versions | logger.info("πŸ†• Found newer version available") |
| ⏰ | Periodic operations | logger.info("⏰ Running periodic check") |
| πŸ”§ | Setup/configuration | logger.info("πŸ”§ Creating database indexes") |
| 🐰 | RabbitMQ connections | logger.info("🐰 Connected to RabbitMQ") |
| πŸ”— | Neo4j connections | logger.info("πŸ”— Connected to Neo4j") |
| 🐘 | PostgreSQL operations | logger.info("🐘 Connected to PostgreSQL") |
| πŸ’Ύ | Database save operations | logger.info("πŸ’Ύ Updated artist ID=123 in Neo4j") |
| πŸ₯ | Health server | logger.info("πŸ₯ Health server started on port 8001") |
| ⏩ | Skipping operations | logger.info("⏩ Skipped artist ID=123 (no changes)") |

Example Usage

logger.info("πŸš€ Starting Discogs data extractor")
logger.error("❌ Failed to connect to Neo4j: connection refused")
logger.warning("⚠️ Slow consumer detected, processing delayed")
logger.info("βœ… All files processed successfully")

Data Schema

Neo4j Graph Model

The graph database uses the following node types and relationships:

Nodes:

  • Artist - Musicians, bands, and other creators
  • Label - Record labels and imprints
  • Master - Master recordings (the "ideal" version of a release)
  • Release - Specific releases/pressings of recordings
  • Genre - Musical genres
  • Style - Musical styles (sub-genres)

Relationships:

  • (Artist)-[:MEMBER_OF]->(Artist) - Band membership
  • (Artist)-[:ALIAS_OF]->(Artist) - Artist aliases
  • (Release)-[:BY]->(Artist) - Release credits
  • (Release)-[:ON]->(Label) - Label releases
  • (Release)-[:DERIVED_FROM]->(Master) - Master/release connection
  • (Label)-[:SUBLABEL_OF]->(Label) - Label hierarchy
  • (Release)-[:IS]->(Genre) - Genre classification
  • (Release)-[:IS]->(Style) - Style classification
  • (Style)-[:PART_OF]->(Genre) - Style/genre hierarchy
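
As an illustration of how such edges might be built idempotently, here is a minimal sketch using the official neo4j Python driver (connection details and the helper name are assumptions, not the graphinator's exact code):

# Illustrative sketch: MERGE keeps nodes and relationships idempotent,
# so reprocessing the same message cannot create duplicates.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def link_release_to_artist(release_id: str, artist_id: str) -> None:
    query = """
    MERGE (r:Release {id: $release_id})
    MERGE (a:Artist {id: $artist_id})
    MERGE (r)-[:BY]->(a)
    """
    with driver.session() as session:
        session.run(query, release_id=release_id, artist_id=artist_id)

link_release_to_artist("12345", "678")
driver.close()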

PostgreSQL Schema

Each table stores the record's complete JSON payload keyed by its Discogs ID, alongside a content hash used for change detection:

CREATE TABLE artists (
    data_id VARCHAR PRIMARY KEY,
    hash VARCHAR NOT NULL,
    data JSONB NOT NULL
);

CREATE TABLE labels (
    data_id VARCHAR PRIMARY KEY,
    hash VARCHAR NOT NULL,
    data JSONB NOT NULL
);

CREATE TABLE masters (
    data_id VARCHAR PRIMARY KEY,
    hash VARCHAR NOT NULL,
    data JSONB NOT NULL
);

CREATE TABLE releases (
    data_id VARCHAR PRIMARY KEY,
    hash VARCHAR NOT NULL,
    data JSONB NOT NULL
);
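
A change-aware upsert against this schema might look like the following sketch (assuming the psycopg2 driver; the hash column lets the database skip rewriting unchanged rows):

# Illustrative upsert: only rewrite a row when its content hash differs.
import psycopg2
from psycopg2.extras import Json

UPSERT = """
INSERT INTO artists (data_id, hash, data)
VALUES (%s, %s, %s)
ON CONFLICT (data_id) DO UPDATE
SET hash = EXCLUDED.hash, data = EXCLUDED.data
WHERE artists.hash <> EXCLUDED.hash
"""

def upsert_artist(conn, data_id: str, content_hash: str, record: dict) -> None:
    with conn.cursor() as cur:
        cur.execute(UPSERT, (data_id, content_hash, Json(record)))
    conn.commit()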

Performance Considerations

Processing Speed

Typical processing rates on modern hardware:

  • Extractor: 5,000-10,000 records/second
  • Graphinator: 1,000-2,000 records/second (Neo4j transactions)
  • Tableinator: 3,000-5,000 records/second (PostgreSQL inserts)

Resource Requirements

Recommended specifications for processing the full dataset:

  • CPU: 4+ cores (8+ recommended)
  • RAM: 8GB minimum (16GB+ recommended)
  • Storage: 200GB+ SSD recommended
  • Network: Stable internet for initial download (~50GB)

Optimization Tips

  1. Neo4j Tuning: Increase heap size in Neo4j configuration for better performance
  2. PostgreSQL Tuning: Adjust shared_buffers and work_mem for large datasets
  3. RabbitMQ: Monitor queue depths and adjust prefetch counts if needed (a consumer sketch follows this list)
  4. Disk I/O: Use SSD storage for Discogs data directory
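
Prefetch tuning might look like the following minimal consumer (this assumes the aio-pika client and an illustrative queue name; adjust prefetch_count to your hardware):

# Illustrative consumer: prefetch_count caps unacknowledged messages
# per consumer, which throttles memory use and balances load.
import asyncio
import aio_pika

def handle(body: bytes) -> None:
    ...  # process one message

async def main() -> None:
    connection = await aio_pika.connect_robust("amqp://guest:guest@localhost:5672/")
    channel = await connection.channel()
    await channel.set_qos(prefetch_count=100)
    queue = await channel.declare_queue("releases", durable=True)

    async with queue.iterator() as messages:
        async for message in messages:
            async with message.process():  # ack on success, reject on error
                handle(message.body)

asyncio.run(main())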

Troubleshooting

Common Issues

Extractor fails to download:

  • Check internet connection
  • Verify S3 bucket is accessible (public bucket, no auth required)
  • Ensure sufficient disk space in DISCOGS_ROOT

Services can't connect to RabbitMQ:

  • Verify RabbitMQ is running: docker-compose ps
  • Check connection string format
  • Ensure network connectivity between services

Neo4j connection errors:

  • Verify Neo4j is running and accessible
  • Check bolt protocol address (usually bolt://localhost:7687)
  • Ensure authentication credentials are correct

PostgreSQL connection issues:

  • Verify PostgreSQL is running
  • Check host:port format in POSTGRES_ADDRESS
  • Ensure database exists and user has permissions

Debugging Tips

  1. Check Service Logs: Each service writes to its own log file
  2. Monitor Queues: Use RabbitMQ management UI to check message flow
  3. Verify Data: Query databases directly to ensure data is being stored
  4. Use Debug Utilities: Run the tools in utilities/ for detailed diagnostics

Contributing

We welcome contributions! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes and add tests
  4. Ensure all checks pass (uv run pre-commit run --all-files)
  5. Commit your changes (git commit -m '✨ Add amazing feature')
  6. Push to the branch (git push origin feature/amazing-feature)
  7. Open a Pull Request

Development Guidelines

  • Follow the project's logging conventions (see above)
  • Add type hints to all functions
  • Write tests for new functionality
  • Update documentation as needed
  • Ensure all pre-commit hooks pass

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Discogs for providing the monthly data dumps
  • The Python community for excellent libraries and tools
  • Contributors and users of this project
