A modern Python 3.13+ microservices system for processing Discogs database exports into multiple storage backends. The system downloads monthly data dumps from Discogs, parses the XML files, and stores the data in both Neo4j (graph database) and PostgreSQL (relational database) for different query patterns and use cases.
Discogsography consists of four microservices that work together to process and monitor the complete Discogs database:
- Dashboard - Real-time monitoring dashboard with WebSocket updates for all services
- Extractor - Downloads Discogs XML dumps from S3, validates checksums, parses XML to JSON, and publishes to message queues
- Graphinator - Consumes messages and builds a graph database in Neo4j with relationships between artists, labels, releases, and masters
- Tableinator - Consumes messages and stores denormalized data in PostgreSQL for fast queries and full-text search
graph TD
S3[("Discogs S3<br/>Data Dumps")]
EXT[["Extractor<br/>(XML β JSON)"]]
RMQ{{"RabbitMQ<br/>Message Queue"}}
NEO4J[(Neo4j<br/>Graph DB)]
PG[(PostgreSQL<br/>Relational DB)]
GRAPH[["Graphinator"]]
TABLE[["Tableinator"]]
DASH[["Dashboard<br/>(Monitoring)"]]
S3 -->|Download & Parse| EXT
EXT -->|Publish Messages| RMQ
RMQ -->|Consume| GRAPH
RMQ -->|Consume| TABLE
GRAPH -->|Store| NEO4J
TABLE -->|Store| PG
DASH -.->|Monitor| EXT
DASH -.->|Monitor| GRAPH
DASH -.->|Monitor| TABLE
DASH -.->|Query Stats| RMQ
DASH -.->|Query Stats| NEO4J
DASH -.->|Query Stats| PG
style S3 fill:#e1f5fe
style RMQ fill:#fff3e0
style NEO4J fill:#f3e5f5
style PG fill:#e8f5e9
style DASH fill:#fce4ec
- Automatic Updates: Periodic checking for new Discogs data releases (configurable interval, default 15 days)
- Efficient Processing: Hash-based deduplication to avoid reprocessing unchanged records (see the sketch after this list)
- Concurrent Processing: Multi-threaded XML parsing and concurrent message processing
- Fault Tolerance: Message acknowledgment, automatic retries, and graceful shutdown
- Progress Tracking: Real-time progress monitoring with detailed statistics
- Docker Support: Full Docker Compose setup with security hardening (non-root, read-only filesystems, capability dropping) - see Docker Security and Dockerfile Standards
- Type Safety: Comprehensive type hints and mypy validation
- Security: Bandit security scanning, secure coding practices, and container security best practices
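To make the deduplication idea concrete, here is a minimal sketch, assuming a stable JSON serialization; `record_hash` and `needs_processing` are hypothetical helpers for illustration, not the extractor's actual code:

```python
import hashlib
import json


def record_hash(record: dict) -> str:
    """Compute a deterministic hash of a parsed record (illustrative)."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_processing(record: dict, stored_hash: str | None) -> bool:
    """Process only records that are new or whose content changed."""
    return stored_hash is None or record_hash(record) != stored_hash
```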
- Python 3.13+
- Docker and Docker Compose
- ~100GB free disk space for Discogs data
- 8GB+ RAM recommended
- Clone the repository:
git clone https://github.com/SimplicityGuy/discogsography.git
cd discogsography
- Start all services:
docker-compose up -d
- Monitor the logs:
docker-compose logs -f extractor
- Access the services:
- Dashboard: http://localhost:8003 (real-time monitoring UI)
- RabbitMQ Management: http://localhost:15672 (user: discogsography, pass: discogsography)
- Neo4j Browser: http://localhost:7474 (user: neo4j, pass: discogsography)
- PostgreSQL: localhost:5432 (user: discogsography, pass: discogsography, db: discogsography)
- Install the uv package manager:
curl -LsSf https://astral.sh/uv/install.sh | sh
- Install dependencies:
uv sync --all-extras
- Set up pre-commit hooks:
uv run pre-commit install
- Set environment variables:
export AMQP_CONNECTION="amqp://guest:guest@localhost:5672/"
export NEO4J_ADDRESS="bolt://localhost:7687"
export NEO4J_USERNAME="neo4j"
export NEO4J_PASSWORD="password"
export POSTGRES_ADDRESS="localhost:5432"
export POSTGRES_USERNAME="postgres"
export POSTGRES_PASSWORD="password"
export POSTGRES_DATABASE="discogsography"
- Run services:
# Terminal 1 - Dashboard
uv run python dashboard/dashboard.py
# Terminal 2 - Extractor
uv run python extractor/extractor.py
# Terminal 3 - Graphinator
uv run python graphinator/graphinator.py
# Terminal 4 - Tableinator
uv run python tableinator/tableinator.py
Variable | Description | Default | Service |
---|---|---|---|
AMQP_CONNECTION | RabbitMQ connection string | Required | All |
DISCOGS_ROOT | Path for downloaded files | /discogs-data | Extractor |
PERIODIC_CHECK_DAYS | Days between update checks | 15 | Extractor |
NEO4J_ADDRESS | Neo4j bolt address | Required | Dashboard, Graphinator |
NEO4J_USERNAME | Neo4j username | Required | Dashboard, Graphinator |
NEO4J_PASSWORD | Neo4j password | Required | Dashboard, Graphinator |
POSTGRES_ADDRESS | PostgreSQL host:port | Required | Dashboard, Tableinator |
POSTGRES_USERNAME | PostgreSQL username | Required | Dashboard, Tableinator |
POSTGRES_PASSWORD | PostgreSQL password | Required | Dashboard, Tableinator |
POSTGRES_DATABASE | PostgreSQL database | Required | Dashboard, Tableinator |
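As a rough illustration of how the required/default semantics above can map to code, here is a minimal sketch assuming a frozen dataclass; the project's actual loader lives in common/config.py and may differ:

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractorConfig:
    amqp_connection: str                 # Required (no default)
    discogs_root: str = "/discogs-data"  # Default per the table above
    periodic_check_days: int = 15        # Default per the table above


def load_extractor_config() -> ExtractorConfig:
    try:
        amqp = os.environ["AMQP_CONNECTION"]  # Required: fail fast if unset
    except KeyError as exc:
        raise RuntimeError("AMQP_CONNECTION must be set") from exc
    return ExtractorConfig(
        amqp_connection=amqp,
        discogs_root=os.environ.get("DISCOGS_ROOT", "/discogs-data"),
        periodic_check_days=int(os.environ.get("PERIODIC_CHECK_DAYS", "15")),
    )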
The complete Discogs dataset includes:
- ~15 million releases
- ~2 million artists
- ~2 million masters
- ~1.5 million labels
Processing the full dataset requires:
- ~50GB for compressed XML files
- ~100GB for extracted data
- Several hours for initial processing (varies by hardware)
Find all albums by an artist:
MATCH (a:Artist {name: "Pink Floyd"})-[:BY]-(r:Release)
RETURN r.title, r.id
LIMIT 10
Find all members of a band:
MATCH (member:Artist)-[:MEMBER_OF]->(band:Artist {name: "The Beatles"})
RETURN member.name
Find all releases on a label:
MATCH (r:Release)-[:ON]->(l:Label {name: "Blue Note"})
RETURN r.title, r.id
LIMIT 10
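These queries can also be run outside the Neo4j Browser. A minimal sketch using the official neo4j Python driver, assuming the Docker Compose credentials listed earlier (adjust for your deployment):

```python
from neo4j import GraphDatabase

# Docker Compose default credentials from the setup section above.
driver = GraphDatabase.driver(
    "bolt://localhost:7687", auth=("neo4j", "discogsography")
)
with driver.session() as session:
    result = session.run(
        "MATCH (a:Artist {name: $name})<-[:BY]-(r:Release) "
        "RETURN r.title AS title, r.id AS id LIMIT 10",
        name="Pink Floyd",
    )
    for record in result:
        print(record["title"], record["id"])
driver.close()
```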
Search releases by title:
SELECT data->>'title' as title, data->>'year' as year
FROM releases
WHERE data->>'title' ILIKE '%dark side%'
LIMIT 10;
Get artist details:
SELECT data->>'name' as name, data->>'profile' as profile
FROM artists
WHERE data->>'name' = 'Miles Davis';
Find releases by year:
SELECT data->>'title' as title, data->>'artist' as artist
FROM releases
WHERE (data->>'year')::int = 1969
LIMIT 10;
Each service provides detailed progress logging:
- Record processing rates (records/second)
- Queue depths and consumer health
- Error counts and retry statistics
- Stall detection (no activity for >2 minutes)
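A minimal sketch of the stall-detection idea, assuming a simple monotonic-clock timer; the services' actual implementation may differ:

```python
import time

STALL_THRESHOLD_SECONDS = 120  # the ">2 minutes" threshold noted above


class StallDetector:
    """Illustrative stall detector based on a monotonic clock."""

    def __init__(self) -> None:
        self.last_activity = time.monotonic()

    def record_activity(self) -> None:
        """Call whenever a record or message is processed."""
        self.last_activity = time.monotonic()

    def is_stalled(self) -> bool:
        """True when no activity has been recorded within the threshold."""
        return time.monotonic() - self.last_activity > STALL_THRESHOLD_SECONDS
```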
The utilities/ directory contains helpful debugging tools:
# Check for errors in logs
uv run python utilities/check_errors.py
# Monitor queue statistics
uv run python utilities/check_queues.py
# Real-time queue monitoring
uv run python utilities/monitor_queues.py
# System health dashboard
uv run python utilities/system_monitor.py
The project uses modern Python tooling:
- uv: Fast package management with lock files
- ruff: Fast Python linting and formatting
- mypy: Static type checking
- bandit: Security vulnerability scanning
- pre-commit: Automated code quality checks
Run all checks:
uv run pre-commit run --all-files
Run the test suite:
uv run pytest # Run all tests
uv run pytest --cov # Run with coverage report
uv run pytest -m "not e2e" # Run all tests except E2E
uv run pytest tests/dashboard/ # Run dashboard tests only
For E2E tests with Playwright:
# One-time setup
uv run playwright install chromium # Install browser
uv run playwright install-deps chromium # Install system dependencies
# Run E2E tests (starts test server automatically)
uv run task test-e2e
# Or using pytest directly:
uv run pytest tests/dashboard/test_dashboard_ui.py -m e2e
discogsography/
├── common/                  # Shared configuration and utilities
│   ├── config.py            # Configuration management
│   └── health_server.py     # Health check endpoints
├── dashboard/               # Real-time monitoring dashboard
│   └── dashboard.py         # Web UI with WebSocket updates
├── extractor/               # XML parsing and message publishing
│   ├── extractor.py         # Main service
│   └── discogs.py           # S3 download logic
├── graphinator/             # Neo4j graph database service
│   └── graphinator.py       # Graph relationship builder
├── tableinator/             # PostgreSQL service
│   └── tableinator.py       # Relational data storage
├── utilities/               # Debugging and monitoring tools
├── docker-compose.yml       # Container orchestration
└── pyproject.toml           # Project metadata and dependencies
All logger calls (logger.info, logger.warning, logger.error) in this project follow a consistent emoji pattern for visual clarity. Each message starts with an emoji followed by exactly one space before the message text.
Emoji | Usage | Example |
---|---|---|
🚀 | Startup messages | logger.info("🚀 Starting service...") |
✅ | Success/completion messages | logger.info("✅ Operation completed successfully") |
❌ | Errors | logger.error("❌ Failed to connect to database") |
⚠️ | Warnings | logger.warning("⚠️ Connection timeout, retrying...") |
🛑 | Shutdown/stop messages | logger.info("🛑 Shutting down gracefully") |
📊 | Progress/statistics | logger.info("📊 Processed 1000 records") |
📥 | Downloads | logger.info("📥 Starting download of data") |
⬇️ | Downloading files | logger.info("⬇️ Downloading file.xml") |
🔄 | Processing operations | logger.info("🔄 Processing batch of messages") |
⏳ | Waiting/pending | logger.info("⏳ Waiting for messages...") |
📋 | Metadata operations | logger.info("📋 Loaded metadata from cache") |
🔍 | Checking/searching | logger.info("🔍 Checking for updates...") |
📁 | File operations | logger.info("📁 File created successfully") |
🆕 | New versions | logger.info("🆕 Found newer version available") |
⏰ | Periodic operations | logger.info("⏰ Running periodic check") |
🔧 | Setup/configuration | logger.info("🔧 Creating database indexes") |
🐰 | RabbitMQ connections | logger.info("🐰 Connected to RabbitMQ") |
🔗 | Neo4j connections | logger.info("🔗 Connected to Neo4j") |
🐘 | PostgreSQL operations | logger.info("🐘 Connected to PostgreSQL") |
💾 | Database save operations | logger.info("💾 Updated artist ID=123 in Neo4j") |
🏥 | Health server | logger.info("🏥 Health server started on port 8001") |
⏩ | Skipping operations | logger.info("⏩ Skipped artist ID=123 (no changes)") |
logger.info("π Starting Discogs data extractor")
logger.error("β Failed to connect to Neo4j: connection refused")
logger.warning("β οΈ Slow consumer detected, processing delayed")
logger.info("β
All files processed successfully")
The graph database uses the following node types and relationships:
Nodes:
- Artist - Musicians, bands, and other creators
- Label - Record labels and imprints
- Master - Master recordings (the "ideal" version of a release)
- Release - Specific releases/pressings of recordings
- Genre - Musical genres
- Style - Musical styles (sub-genres)
Relationships:
- (Artist)-[:MEMBER_OF]->(Artist) - Band membership
- (Artist)-[:ALIAS_OF]->(Artist) - Artist aliases
- (Release)-[:BY]->(Artist) - Release credits
- (Release)-[:ON]->(Label) - Label releases
- (Release)-[:DERIVED_FROM]->(Master) - Master/release connection
- (Label)-[:SUBLABEL_OF]->(Label) - Label hierarchy
- (Release)-[:IS]->(Genre) - Genre classification
- (Release)-[:IS]->(Style) - Style classification
- (Style)-[:PART_OF]->(Genre) - Style/genre hierarchy
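As a hedged sketch of how such relationships can be built idempotently (not necessarily the graphinator's actual queries), Cypher MERGE matches an existing node or relationship before creating one, so reprocessing the same message never duplicates graph data:

```python
from neo4j import GraphDatabase

# Illustrative query: MERGE is idempotent, so replaying a message
# leaves the graph unchanged instead of creating duplicates.
LINK_RELEASE_TO_ARTIST = """
MERGE (r:Release {id: $release_id})
MERGE (a:Artist {id: $artist_id})
MERGE (r)-[:BY]->(a)
"""

# Docker Compose default credentials from the setup section above.
driver = GraphDatabase.driver(
    "bolt://localhost:7687", auth=("neo4j", "discogsography")
)
with driver.session() as session:
    session.run(LINK_RELEASE_TO_ARTIST, release_id="12345", artist_id="678")
driver.close()
```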
Each table stores the complete JSON data with efficient indexing:
CREATE TABLE artists (
data_id VARCHAR PRIMARY KEY,
hash VARCHAR NOT NULL,
data JSONB NOT NULL
);
CREATE TABLE labels (
data_id VARCHAR PRIMARY KEY,
hash VARCHAR NOT NULL,
data JSONB NOT NULL
);
CREATE TABLE masters (
data_id VARCHAR PRIMARY KEY,
hash VARCHAR NOT NULL,
data JSONB NOT NULL
);
CREATE TABLE releases (
data_id VARCHAR PRIMARY KEY,
hash VARCHAR NOT NULL,
data JSONB NOT NULL
);
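One way the hash column can support the deduplication described earlier is a conditional upsert: the row is rewritten only when its content hash changed. This is an illustrative sketch using psycopg2, not necessarily the tableinator's exact SQL:

```python
import json

import psycopg2

# ON CONFLICT rewrites the row only when the content hash differs,
# so unchanged records are skipped entirely.
UPSERT = """
INSERT INTO releases (data_id, hash, data)
VALUES (%s, %s, %s)
ON CONFLICT (data_id) DO UPDATE
SET hash = EXCLUDED.hash, data = EXCLUDED.data
WHERE releases.hash IS DISTINCT FROM EXCLUDED.hash
"""

# Docker Compose default credentials from the setup section above.
conn = psycopg2.connect(
    host="localhost",
    port=5432,
    user="discogsography",
    password="discogsography",
    dbname="discogsography",
)
with conn, conn.cursor() as cur:
    record = {"id": "12345", "title": "Example Release"}  # hypothetical record
    cur.execute(UPSERT, ("12345", "example-content-hash", json.dumps(record)))
conn.close()
```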
Typical processing rates on modern hardware:
- Extractor: 5,000-10,000 records/second
- Graphinator: 1,000-2,000 records/second (Neo4j transactions)
- Tableinator: 3,000-5,000 records/second (PostgreSQL inserts)
Recommended specifications for processing the full dataset:
- CPU: 4+ cores (8+ recommended)
- RAM: 8GB minimum (16GB+ recommended)
- Storage: 200GB+ SSD recommended
- Network: Stable internet for initial download (~50GB)
- Neo4j Tuning: Increase heap size in Neo4j configuration for better performance
- PostgreSQL Tuning: Adjust shared_buffers and work_mem for large datasets
- RabbitMQ: Monitor queue depths and adjust prefetch counts if needed
- Disk I/O: Use SSD storage for Discogs data directory
Extractor fails to download:
- Check internet connection
- Verify S3 bucket is accessible (public bucket, no auth required)
- Ensure sufficient disk space in DISCOGS_ROOT
Services can't connect to RabbitMQ:
- Verify RabbitMQ is running: docker-compose ps
- Check connection string format
- Ensure network connectivity between services
Neo4j connection errors:
- Verify Neo4j is running and accessible
- Check bolt protocol address (usually bolt://localhost:7687)
- Ensure authentication credentials are correct
PostgreSQL connection issues:
- Verify PostgreSQL is running
- Check host:port format in POSTGRES_ADDRESS
- Ensure database exists and user has permissions
- Check Service Logs: Each service writes to its own log file
- Monitor Queues: Use RabbitMQ management UI to check message flow
- Verify Data: Query databases directly to ensure data is being stored
- Use Debug Utilities: Run the tools in utilities/ for detailed diagnostics
We welcome contributions! Please:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Make your changes and add tests
- Ensure all checks pass (uv run pre-commit run --all-files)
- Commit your changes (git commit -m '✨ Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
- Follow the project's logging conventions (see above)
- Add type hints to all functions
- Write tests for new functionality
- Update documentation as needed
- Ensure all pre-commit hooks pass
This project is licensed under the MIT License - see the LICENSE file for details.
- Discogs for providing the monthly data dumps
- The Python community for excellent libraries and tools
- Contributors and users of this project
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation:
- CLAUDE.md - Detailed technical documentation for development
- Task Automation - Taskipy commands and workflows
- Docker Security - Container security best practices
- Dockerfile Standards - Dockerfile implementation standards