feat: [#405] deploy Hetzner demo tracker and document process (in progress) by josecelano · Pull Request #406 · torrust/torrust-tracker-deployer

josecelano · 2026-03-03T19:42:11Z

Closes #405

Summary

Real-world deployment of a Torrust Tracker demo instance to Hetzner Cloud using the deployer tool, with full documentation of every step, decision, and problem encountered. The documentation will serve as both an internal reference and a source for a blog post on torrust.com.

Progress

What's in this PR

Documentation (`docs/deployments/hetzner-demo-tracker/`)

New deployment journal with per-command subdirectories:

prerequisites.md — Hetzner account, API token, SSH key setup, tool versions
deployment-spec.md — Environment config decisions and sanitized config
commands/provision/ — Command walkthrough, problems (5), improvements (7), bugs, cleanup procedure
commands/configure/ — Docker installation walkthrough
commands/release/ — Image pull walkthrough
commands/run/ — Service start walkthrough, problems, improvements, bugs (3)
post-provision/ — DNS setup, volume setup, Hetzner backups
verify/ — Full verification index: HTTP tracker, UDP tracker, API, Grafana, health check, Docker services, MySQL, storage, backup
maintenance/
- secrets-rotation.md — Complete procedure for rotating all 7 secrets; records first rotation (2026-03-04)
- os-updates.md — apt update procedure with log of first run (59 updates, reboot, all services healthy)
- uptime-monitoring.md — Documents Hetzner monitoring gap vs DigitalOcean; lists external tools (UptimeRobot, Freshping, etc.)
tracker-registry.md — newTrackon submission; explains why only udp1 is listed (udp2 kept quiet for production debugging)
bugs.md — All 11 deployer bugs found, with severity, status, and links to full descriptions
improvements.md — All 13 improvement recommendations, with links to full descriptions
observations.md — Cross-cutting insights and deployer learnings

Code fixes (`src/`, `templates/`)

fix: IdentitiesOnly=yes added to default SSH options (src/adapters/ssh/client.rs)
fix: IdentitiesOnly=yes added to Ansible ssh_args (templates/ansible/ansible.cfg)
feat: SSH retry budget increased to 300s (60 × 5s)
feat: Full SSH stderr logged in retry messages for easier diagnosis
fix: release no longer hard-fails when docker is not in PATH inside Docker container

New skill

.github/skills/usage/operations/debug-command-failure/skill.md

Service Endpoints

Service	URL	Status
HTTP Tracker 1	`https://http1.torrust-tracker-demo.com/announce`	✅ Running
HTTP Tracker 2	`https://http2.torrust-tracker-demo.com/announce`	✅ Running
UDP Tracker 1	`udp://udp1.torrust-tracker-demo.com:6969`	✅ Running
UDP Tracker 2	`udp://udp2.torrust-tracker-demo.com:6868`	✅ Running
Tracker API	`https://api.torrust-tracker-demo.com/api/v1`	✅ Running
Grafana	`https://grafana.torrust-tracker-demo.com`	✅ Running

Bugs Found (11 total, 1 fixed)

ID	Command	Description	Severity	Status
B-01	`create`	Template binds to `0.0.0.0` (IPv4 only)	Major	🔴 Open
B-02	`create`	Template defaults to SQLite silently	Major	🔴 Open
B-03	`create`	`instance_name: null` unexplained	Minor	🔴 Open
B-04	`provision`	SSH probe budget too short for Hetzner (120 s)	Major	🔴 Open
B-05	`provision`	Passphrase-protected SSH keys fail silently in Docker	Major	🔴 Open
B-06	`provision`	UDP tracker domains missing from output	Minor	🔴 Open
B-07	`release`	Fails when `docker` not in PATH	High	🟢 Fixed
B-08	`run`	MySQL `"root"` username not rejected at creation time	High	🔴 Open
B-09	`run`	MySQL root password silently diverges	Medium	🔴 Open
B-10	`run`	MySQL password not URL-encoded in connection string	High	🔴 Open
B-11	`test`	DNS check false positives with floating IP	Minor	🔴 Open

Key learnings documented

Passphrase-protected SSH keys fail silently in Docker — no agent, no TTY, signing fails. Root cause of all provision failures. Fix: remove passphrase from deployment keys.
docker compose restart does not re-read env vars — must use up -d --force-recreate after rotating secrets in .env.
MySQL password URL-encoding in tracker.toml — / in password must be encoded as %2F in the connection string.
Hetzner has no native uptime monitoring — requires an external service such as UptimeRobot.
UDP Tracker 2 kept off public tracker lists — public registration causes constant announce noise that makes production log debugging impractical.
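
The URL-encoding learning above can be illustrated with a minimal sketch. It assumes the connection string has the form `mysql://user:password@host:port/database`; `percent_encode` and `mysql_url` are hypothetical helpers written for this example, not functions from the deployer.

```rust
/// Percent-encode every byte that is not an RFC 3986 "unreserved" character.
fn percent_encode(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for byte in input.bytes() {
        match byte {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' => {
                out.push(byte as char)
            }
            // Everything else (including '/') becomes %XX in uppercase hex.
            _ => out.push_str(&format!("%{:02X}", byte)),
        }
    }
    out
}

/// Hypothetical helper: build a MySQL connection URL with encoded credentials.
/// The URL layout is an assumption made for this sketch.
fn mysql_url(user: &str, password: &str, host: &str, port: u16, database: &str) -> String {
    format!(
        "mysql://{}:{}@{}:{}/{}",
        percent_encode(user),
        percent_encode(password),
        host,
        port,
        database
    )
}
```

With this sketch, a password such as `p/w` is rendered as `p%2Fw`, so the `/` can no longer be mistaken for the start of the path component.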

…irectories

- Rename configuration.md → deployment-spec.md (it describes what to deploy, not a command)
- Add commands/create/README.md documenting create template, validate, create environment steps
- Add commands/provision/README.md documenting provision command and server details (attempt 1)
- Split problems.md into commands/create/problems.md (problems 1–3) and commands/provision/problems.md (problems 4–5)
- Move create/ and provision/ docs under a commands/ parent folder for cleaner structure
- Rename screenshot to hetzner-console-provisioned-server-details-attempt-1.png
- Update all internal cross-links to reflect new paths

…er errors

- Add .github/skills/usage/operations/debug-command-failure/skill.md covering the 5-layer investigation workflow: console output, environment state, trace log, build artifacts, and manual verification
- Include common error patterns table (failed_step + error_kind → cause)
- Include recovery guide for pre- vs post-cloud-resource failure scenarios
- Register skill in AGENTS.md table (alphabetical order)

Replace the vague 'timeout too short' explanation with a precise timeline reconstructed from data/logs/log.txt:

- tofu apply completed in 19s (15:30:13 → 15:30:32)
- SSH probe attempts 1-3: ~7s each → port 22 not yet open (ConnectTimeout)
- SSH probe attempts 4-60: ~2-3s each → TCP connects but auth rejected
- sshd was listening within ~17s of server appearing
- Real bottleneck: cloud-init had not yet created the torrust user + authorized_keys, so every attempt was rejected with permission denied
- cloud-init finished between 15:33:32 and ~15:44:00 (>3m32s after server)
- Update Resolution note: retry may also time out for the same reason

…ement recommendations

Based on problems encountered during the Hetzner provision attempt:

1. Distinguish SSH failure reason in probe loop (timeout vs auth rejected)
   - Both currently surface as Ok(false) with no diagnostic detail
   - Recommendation: capture stderr and log different message per failure type
2. Classify error_kind more precisely for auth failures
   - 'NetworkConnectivity' is misleading for permission-denied cases
   - Recommendation: add SshAuthenticationFailed / SshUserNotReady variant
3. Include per-attempt failure details in trace file
   - Current trace only summarises with '60 attempts (120s total)'
   - Investigation required manual timestamp comparison in data/logs/log.txt
   - Recommendation: condensed probe phase summary in trace
4. Configurable SSH connectivity timeout
   - 120s hardcoded; Hetzner ccx23 cloud-init took >3m32s
   - Recommendation: per-provider default + env config field + CLI flag
5. Resume capability after partial provision failure
   - destroy + recreate discards a healthy server just because SSH probe timed out
   - Recommendation: provision --resume or standalone wait-for-ssh command

Without IdentitiesOnly=yes, SSH offers all keys loaded in the agent to the server before trying the -i key. If enough agent keys are loaded, the server hits its MaxAuthTries limit and disconnects with:

Received disconnect: Too many authentication failures

This causes every wait_for_connectivity attempt to fail regardless of whether the host is reachable, making provision always time out when run outside Docker on a machine with an active SSH agent.

Add IdentitiesOnly=yes to build_default_ssh_options() so only the explicitly configured private key is used for authentication.
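
As a rough illustration of the fix described above, the sketch below builds an SSH option list that pins authentication to one key. The function name `default_ssh_options` and its signature are hypothetical stand-ins; the actual shape of build_default_ssh_options() is not shown in this PR description.

```rust
/// Hypothetical stand-in for the deployer's default SSH option builder.
/// Returns the arguments that would be passed to the `ssh` binary.
fn default_ssh_options(private_key_path: &str) -> Vec<String> {
    vec![
        // Use the explicitly configured private key...
        "-i".to_string(),
        private_key_path.to_string(),
        // ...and tell ssh not to offer any other identities from the agent,
        // so the server's MaxAuthTries limit is not exhausted by agent keys.
        "-o".to_string(),
        "IdentitiesOnly=yes".to_string(),
    ]
}
```

Both `-i` and `-o IdentitiesOnly=yes` are standard OpenSSH client options; only the surrounding function is invented for the example.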

Same root cause as the Rust SSH client fix (019e39c): when the SSH agent has many keys loaded, SSH tries all of them before the -i key, hitting the server's MaxAuthTries limit with:

Too many authentication failures

This broke Ansible tasks (wait-cloud-init.yml and all subsequent playbooks) when running outside Docker on a machine with an active agent.

Add -o IdentitiesOnly=yes to the ssh_args in ansible.cfg so Ansible only uses the key specified via --private-key.

Previously, each 'Still waiting for SSH connectivity' log message gave no indication of why the connection was failing. The reason field now contains the raw stderr from the failed SSH attempt, e.g.:

- ssh: connect to host 46.225.234.201 port 22: Connection timed out
- torrust@46.225.234.201: Permission denied (publickey).
- Received disconnect: Too many authentication failures

This makes it immediately visible whether sshd is not yet listening, the user/key is not yet provisioned, or another error is occurring, without needing to re-run with verbose SSH flags.

Replace the indirect test_connectivity() path (which discarded stderr via check_command_with_options) with execute_with_options() directly so the raw stderr is accessible and logged in the retry loop.
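
The three stderr messages quoted above map naturally onto distinct failure reasons. The following is only a sketch of how such a classification could look; `SshProbeFailure` and `classify_ssh_stderr` are hypothetical names, and matching on substrings is an assumption made for this example rather than the deployer's approach.

```rust
/// Hypothetical failure categories for a single SSH probe attempt.
#[derive(Debug, PartialEq)]
enum SshProbeFailure {
    /// sshd is not reachable yet (port closed or filtered).
    PortNotReachable,
    /// TCP connected, but the user or key was rejected.
    AuthRejected,
    /// The server disconnected after too many offered keys.
    TooManyAuthFailures,
    /// Anything this sketch does not recognise.
    Other,
}

/// Classify raw ssh stderr by looking for well-known OpenSSH messages.
fn classify_ssh_stderr(stderr: &str) -> SshProbeFailure {
    if stderr.contains("Too many authentication failures") {
        SshProbeFailure::TooManyAuthFailures
    } else if stderr.contains("Permission denied") {
        SshProbeFailure::AuthRejected
    } else if stderr.contains("Connection timed out") || stderr.contains("Connection refused") {
        SshProbeFailure::PortNotReachable
    } else {
        SshProbeFailure::Other
    }
}
```

A classification like this would let the retry loop log "sshd not listening yet" separately from "user or key not provisioned yet", which is the distinction the investigation had to reconstruct by hand.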

Change SSH connectivity wait defaults:

- DEFAULT_MAX_RETRY_ATTEMPTS: 60 (was 60, same count)
- DEFAULT_RETRY_INTERVAL_SECS: 5s (was 2s)
- Total budget: 300s / 5 minutes (was 120s)

2 seconds between attempts is too aggressive: it floods logs and adds unnecessary load. 5 minutes total is a more realistic budget for slower providers and small machines where cloud-init user creation can take over 3 minutes (observed on Hetzner ccx23 in failed attempt 2).
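
The arithmetic behind the old and new budgets can be checked with a small sketch. `RetryBudget` is a hypothetical type used only for this illustration; the attempt counts and intervals are the values stated above, and 212 s corresponds to the ">3m32s" cloud-init delay reported for the failed attempt.

```rust
use std::time::Duration;

/// Hypothetical model of a fixed-interval retry budget.
struct RetryBudget {
    max_attempts: u32,
    interval: Duration,
}

impl RetryBudget {
    /// Total wait time if every attempt is used: attempts × interval.
    fn total(&self) -> Duration {
        self.interval * self.max_attempts
    }

    /// Whether the budget is at least as long as an observed readiness delay.
    fn covers(&self, observed_delay: Duration) -> bool {
        self.total() >= observed_delay
    }
}
```

Under these numbers, 60 × 2 s gives 120 s, which is shorter than a 212 s delay, while 60 × 5 s gives 300 s, which is longer.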

…README

- Add cleanup-between-attempts.md with step-by-step procedure for cleaning up a failed provision before retrying:
  1. Build updated Docker image (if code changed)
  2. Destroy failed environment on provider (tofu destroy)
  3. Verify server is gone on Hetzner Console
  4. Purge local data with 'purge --force' command
  5. Clear data/logs/log.txt for a clean retry log
- Update provision README:
  - Status updated from ProvisionFailed to 'Cleaned - ready to retry'
  - SSH timeout updated from 120s to 300s (60 × 5s) to reflect new defaults
  - Add link to cleanup-between-attempts.md

…n floating IPs

- Add post-provision/ directory with README, dns-setup.md, and volume-setup.md
- DNS setup: assign IPv4 (116.202.176.169) and IPv6 (2a01:4f8:1c0c:9aae::1) floating IPs, document permanent netplan configuration, include three screenshots
- Volume setup: guide for creating and mounting a 50 GB Hetzner volume at /opt/torrust/storage
- Update deployment README.md with Phase 3.5 section and TOC entries
- Update issue tracker with Phase 3.5 tasks (3.5.1 and 3.5.2 complete)
- Add technical words to project-words.txt (blkid, mountpoint, nofail, NXDOMAIN, netplan, etc.)

- Apply permanent netplan config for both floating IPs (116.202.176.169/32 and 2a01:4f8:1c0c:9aae::1/64) on the server
- Record actual commands run and their output in dns-setup.md Step 1.5
- Document two problems: SSH host key mismatch and netplan file permissions
- Add improvement note: use 'install -m 600' instead of 'tee' for correct permissions
- Mark task 3.5.3 complete in issue tracker
- Add kernel network terms to project-words.txt (qdisc, qlen, codel)
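
The permissions improvement noted above (writing the netplan file with mode 600 rather than relying on the default mode) could be expressed programmatically as in the sketch below. It is Unix-only, uses only the Rust standard library, and `write_private_file` is a hypothetical helper, not part of the deployer.

```rust
use std::fs::{self, OpenOptions};
use std::io::Write;
use std::os::unix::fs::{OpenOptionsExt, PermissionsExt};
use std::path::Path;

/// Write `contents` to `path` so that the file ends up with mode 0600.
fn write_private_file(path: &Path, contents: &str) -> std::io::Result<()> {
    let mut file = OpenOptions::new()
        .write(true)
        .create(true)
        .truncate(true)
        .mode(0o600) // only takes effect when the file is newly created
        .open(path)?;
    file.write_all(contents.as_bytes())?;
    // Enforce the mode even if the file already existed with looser permissions.
    fs::set_permissions(path, fs::Permissions::from_mode(0o600))
}
```

Setting the mode at creation time avoids a window in which the file exists with broader permissions, which is the same motivation as preferring 'install -m 600' over 'tee'.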

- Update dns-setup.md: replace manual UI instructions with Cloud API approach using POST /v1/zones/{zone}/rrsets; add actual command outputs and verification results; document the old dns.hetzner.com API incompatibility with Cloud Console zones
- Add hetzner-dns-create-api-token-form.png screenshot
- Mark tasks 3.5.4 and 3.5.5 as done in issue tracker
- Add rrset/rrsets to project-words.txt

All 12 DNS records (A + AAAA for http1, http2, api, grafana, udp1, udp2) now resolve globally to floating IPs 116.202.176.169 / 2a01:4f8:1c0c:9aae::1.

- Create 50 GB ext4 volume (torrust-tracker-demo-storage, id=104927743) in nbg1 via Cloud API; attach to server 122663759
- Mount at /opt/torrust/storage with discard,nofail,defaults via fstab (UUID=6fb9df14-c744-4e50-a48d-9ca4522a02de)
- Set ownership to torrust:torrust; verified fstab remount
- Replace placeholder draft with actual commands and outputs
- Add screenshots: volumes list + configure popup
- Mark tasks 3.5.6-3.5.9 done in issue tracker
- Add automount, ENDSSH to project-words.txt

Hetzner does not support volume snapshots (only server root disk snapshots). Remove the incorrect 'Targeted backups via snapshot' bullet and add a Problems entry explaining the limitation and the alternative (application-level backup command or rsync/tar).

- Add commands/configure/README.md with actual output (took ~103 s)
- Add commands/release/README.md skeleton (placeholder for next step)
- Mark task 3.2 done in issue tracker

Server post-configure state: Docker 28.2.2 and Docker Compose v2.29.2 installed and running. No containers running yet (expected: release and run come next).

Add a 'Design Notes' section to post-provision/README.md covering:

- Why the deployer has no failure-recovery (intentional design: complexity vs. fast server recreation)
- The sequencing dilemma: volume must be set up before configure (so Ansible writes data directly to the volume), but this means extra manual work if provision fails and the server must be recreated
- Floating IPs are safe: just reassign to the new server, no DNS changes
- Volume is the painful part: detach, reattach, remount on the new server
- Alternative approach: defer volume setup until after run succeeds, then migrate data (better for prod, acceptable overhead for demo)

Also update status table: DNS and Volume setup both marked done.

- New observations.md: cross-cutting deployer insights gathered during this deployment. First entry documents the theoretical state recovery path via environment.json snapshots, with a prominent warning about the risks of partial execution states.
- Updated main README.md ToC: added configure, release, run to the deployment commands list and added observations.md as item 7.
- Updated Phase 4 section to reference the completed configure README.
- Added run/README.md placeholder (parallel to existing release placeholder).

Refs #405

… not in PATH)

The local docker-compose.yml validator added in PR #384 runs 'docker compose config --quiet' on the host. When the deployer is invoked via its Docker container (the standard usage), the docker binary is not installed inside the container and the command fails with ENOENT, aborting the release.

Documents:

- Root cause: local validator assumes docker is in PATH, but the deployer container has no docker binary installed
- Fix applied: handle ErrorKind::NotFound gracefully (skip with warning)
- State recovery: how environment.json was manually reset from ReleaseFailed to Configured so release could be retried

Also adds 'ENOENT' to project-words.txt.

Refs #405

When the deployer runs inside a Docker container (the standard production usage), the 'docker' binary is not installed inside the container. The local validator was treating io::ErrorKind::NotFound as a hard failure, blocking the release command entirely.

Fix: match on NotFound specifically and skip validation with a warning log. Any other OS error (e.g. PermissionDenied) is still a hard failure.

The warning message makes the skip reason explicit: 'Skipping local docker-compose.yml validation: docker is not available in PATH (deployer may be running inside a container). The rendered file will be validated by Docker Compose on the remote host.'

Also updates the CommandExecutionFailed error message and test to reflect that NotFound is no longer an error case.

Fixes release failure documented in: docs/deployments/hetzner-demo-tracker/commands/release/bugs.md

Refs #405
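
The behaviour described above (skip on NotFound, fail on any other OS error) can be sketched as a small decision function. `LocalValidation` and `handle_spawn_error` are hypothetical names for this example; the real validator's types are not shown in this PR description.

```rust
use std::io;

/// Hypothetical outcome of trying to launch the local validator.
#[derive(Debug, PartialEq)]
enum LocalValidation {
    /// The docker binary is missing: skip validation and only warn.
    Skipped,
    /// Any other OS-level error is still treated as a hard failure.
    Failed(String),
}

/// Decide what to do with an error returned when spawning the validator process.
fn handle_spawn_error(err: &io::Error) -> LocalValidation {
    match err.kind() {
        io::ErrorKind::NotFound => LocalValidation::Skipped,
        other => LocalValidation::Failed(format!("{:?}", other)),
    }
}
```

In practice the `io::Error` would come from something like `std::process::Command::new("docker").output()`, which returns a NotFound error when the binary is not in PATH.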

- release/README.md: populated with actual output (~134 s, state=Released), plus reference to bugs.md for the docker-not-in-PATH issue on first attempt
- hetzner-demo-tracker/README.md: Phase 5 section filled in
- issue 405: task 3.3 marked done

Refs #405

Add bugs.md documenting two bugs found during the run command:

- Bug 1: MYSQL_USER='root' is not rejected at environment creation time, causing MySQL 8.4 to refuse to start with a confusing error
- Bug 2: MySQL root password silently gets a '_root' suffix appended, diverging from the configured password in the env JSON config

Add problems.md documenting the run command failure symptom, confirmed root cause, required fix, and related deployer improvement needed.
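
Bug 1 suggests an early validation step at environment creation time. The sketch below shows one possible shape for such a check; `validate_mysql_username` is a hypothetical function and the error wording is invented for illustration.

```rust
/// Hypothetical creation-time check for the application database user.
/// Rejects "root" because that account is reserved for MySQL itself and
/// cannot double as the regular application user.
fn validate_mysql_username(username: &str) -> Result<(), String> {
    if username.is_empty() {
        return Err("MySQL username must not be empty".to_string());
    }
    if username == "root" {
        return Err(
            "MySQL username 'root' is reserved; choose a regular user name".to_string(),
        );
    }
    Ok(())
}
```

Failing at this point would surface the problem when the environment is created, rather than later when the database container refuses to start.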

The '_root' suffix is a placeholder stub with a comment explicitly stating it should be managed securely in production. Update the root cause description to reflect this — it was not a design decision but an unfinished implementation.

Add Bug 3 to bugs.md: MySQL password is not URL-encoded when rendered into the tracker.toml connection string. A '/' in the password causes an InvalidPort parse error and the tracker enters a restart loop. Documents the manual workaround applied (%2F encoding + scp + restart).

Add improvements.md clarifying that the 'Running' state only means docker compose up -d succeeded, not that all services are healthy. Documents why run intentionally does not wait for service health and points to the 'test' command as the right verification tool.

- udp-tracker.md: mark verified, consolidate BEP 15 script to cover both trackers in one loop, add nc reliability note, update results table with actual connection IDs from successful handshakes
- verify/README.md: mark UDP Tracker as verified (all services now verified except browser check for Grafana)
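
For readers unfamiliar with the BEP 15 handshake mentioned above, the sketch below builds a connect request and parses a connect response using the packet layout from the BEP 15 specification. It does no network I/O, and the function names are hypothetical; it is not the verification script from udp-tracker.md.

```rust
/// Magic constant that identifies a BEP 15 connect request.
const PROTOCOL_ID: u64 = 0x0000_0417_2710_1980;
/// Action code for "connect".
const ACTION_CONNECT: u32 = 0;

/// Build the 16-byte connect request: protocol_id, action, transaction_id (big-endian).
fn build_connect_request(transaction_id: u32) -> [u8; 16] {
    let mut packet = [0u8; 16];
    packet[0..8].copy_from_slice(&PROTOCOL_ID.to_be_bytes());
    packet[8..12].copy_from_slice(&ACTION_CONNECT.to_be_bytes());
    packet[12..16].copy_from_slice(&transaction_id.to_be_bytes());
    packet
}

/// Parse a connect response (action, transaction_id, connection_id) and
/// return the connection ID if the action and transaction ID match.
fn parse_connect_response(response: &[u8], expected_transaction_id: u32) -> Option<u64> {
    if response.len() < 16 {
        return None;
    }
    let action = u32::from_be_bytes([response[0], response[1], response[2], response[3]]);
    let transaction_id =
        u32::from_be_bytes([response[4], response[5], response[6], response[7]]);
    if action != ACTION_CONNECT || transaction_id != expected_transaction_id {
        return None;
    }
    let mut id = [0u8; 8];
    id.copy_from_slice(&response[8..16]);
    Some(u64::from_be_bytes(id))
}
```

A successful handshake is one where the tracker echoes the transaction ID and returns a connection ID, which is the value the results table records.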

…yment

- Add maintenance/README.md index
- Add maintenance/secrets-rotation.md with 7 rotation steps
- Include secret-to-files relationship map (admin token in .env AND prometheus.yml x2; MySQL torrust password in .env, tracker.toml, backup.conf; MySQL root and Grafana passwords in .env only)
- Step 1 expanded into 1a/1b/1c covering both .env and prometheus.yml token update, with tracker + prometheus restart and verification
- Update deployment README.md ToC with Maintenance section

…edure

- Add maintenance/os-updates.md with step-by-step apt update procedure, reboot check, service verification, and a log table
- Update maintenance/README.md with OS Updates row
- Add 'autoremove' to project-words.txt

docker compose restart is not enough when rotating the admin token because it is injected as an env var at container creation time. Use:

docker compose up -d --force-recreate tracker prometheus

Add note in step 2e clarifying that restart IS sufficient for tracker.toml and backup.conf since they are bind-mounted files re-read on startup.
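
The rule above (recreate for env-var secrets, restart for bind-mounted files) can be captured in a small decision helper. This is only a sketch: `SecretLocation` and `compose_args` are hypothetical names, and the helper merely assembles the arguments for the two docker commands quoted in the text.

```rust
/// Hypothetical description of where a rotated secret lives.
#[derive(Debug, Clone, Copy, PartialEq)]
enum SecretLocation {
    /// Injected from .env as an environment variable at container creation.
    EnvVar,
    /// Stored in a bind-mounted file that the service re-reads on startup.
    BindMountedFile,
}

/// Build the `docker` arguments needed for the services to pick up a rotated secret.
fn compose_args(location: SecretLocation, services: &[&str]) -> Vec<String> {
    let mut args: Vec<String> = match location {
        // Env vars are fixed at creation time, so the container must be recreated.
        SecretLocation::EnvVar => vec!["compose", "up", "-d", "--force-recreate"],
        // Bind-mounted files are re-read on startup, so a restart is enough.
        SecretLocation::BindMountedFile => vec!["compose", "restart"],
    }
    .into_iter()
    .map(String::from)
    .collect();
    args.extend(services.iter().map(|s| s.to_string()));
    args
}
```

For the admin token case this yields the same argument list as the command shown above for the tracker and prometheus services.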

…ST health check endpoint

- os-updates.md: update log entry with successful reboot result
- secrets-rotation.md: tick off SSH access and API with new token
- health-check.md: add Public REST API Health Check section (port 1212 via Caddy, no auth token required)

Hetzner does not include native uptime monitoring, unlike DigitalOcean. Add documentation to cover this gap:

- Explain context and comparison with DigitalOcean
- Add comparison table of external tools (UptimeRobot, Freshping, Better Uptime, Checkly, statuspage.io) with free-tier details
- List public endpoints to monitor with expected responses
- Note UDP monitoring limitations
- Add Checkly, Freshping, statuspage to project-words.txt

- Add tracker-registry.md with newTrackon submission instructions
- Only udp1 (udp://udp1.torrust-tracker-demo.com:6969/announce) is submitted; udp2 kept off public lists to avoid announce noise in logs
- Mark submitted on 2026-03-04, pending appearance in public list
- Link from deployment README ToC (item 10)
- Remove newTrackon section from uptime-monitoring.md (wrong place)
- Add Trackon to project-words.txt

Collect all deployer bugs found during this deployment in one file with links to full descriptions in the relevant command docs. 11 bugs total across create/provision/release/run/test:

- B-01: create template binds to 0.0.0.0 (IPv4 only)
- B-02: create template defaults to SQLite silently
- B-03: instance_name: null unexplained in template
- B-04: provision SSH probe budget too short for Hetzner (120s)
- B-05: passphrase-protected SSH keys fail silently in Docker
- B-06: UDP tracker domains missing from provision output
- B-07: release fails when docker not in PATH [FIXED]
- B-08: MySQL 'root' username not rejected at creation time
- B-09: MySQL root password silently diverges (_root suffix)
- B-10: MySQL password not URL-encoded in tracker.toml
- B-11: test DNS check false positives with floating IP

Collect all deployer improvement recommendations found during this deployment in one file with links to full descriptions. 13 items across create/provision/run/cross-cutting/post-provision:

- I-01: create template should document instance_name auto-generation
- I-02: create template should default to [::] for public sockets
- I-03: create template should prompt for database choice
- I-04: provision SSH probe should distinguish failure reasons
- I-05: provision error_kind should be specific for SSH auth failures
- I-06: provision trace file should include per-attempt SSH details
- I-07: provision SSH connectivity timeout should be configurable
- I-08: provision should detect passphrase-protected SSH keys early
- I-09: provision should have wait-for-ssh or --resume flag
- I-10: provision output should include IPv6 address
- I-11: run should add lightweight post-start health check
- I-12: env config should support floating IP for DNS checks
- I-13: post-provision netplan config should use correct permissions

josecelano · 2026-03-04T18:27:15Z

ACK 675b826

josecelano added 15 commits March 3, 2026 12:28

docs: [#405] create deployment journal directory structure for Hetzne…

e5ad25c

…r demo tracker

docs: [#405] document and complete prerequisites for Hetzner demo tra…

739f003

…cker

docs: [#405] configure environment and create for Hetzner demo tracke…

61e24fe

…r (Phase 2)

docs: [#405] document provision success, passphrase bug, IPv6 omissio…

9ba435f

…n, and UDP domains bug

docs: [#405] add attempt-4 screenshot and document Hetzner activity l…

fda04f1

…og confusion

docs: [#405] mark provision task as complete in issue tracker

3872406

josecelano self-assigned this Mar 3, 2026

josecelano added 14 commits March 4, 2026 11:13

josecelano added 24 commits March 4, 2026 15:35

docs(verify): add Grafana verification results

c0dd03a

- grafana.md: mark verified, mask admin password, switch curl commands from URL-embedded to -u flag (URL form breaks on passwords with /), add note about this, complete results table
- verify/README.md: mark Grafana as verified

docs(verify): add Docker services health and log verification

c5ab1db

docs(verify): add MySQL database connectivity verification

8f97bf8

docs(verify): add storage volume mount verification

31c4869

docs(verify): add actual tree output to storage verification

6870512

docs(verify): add backup verification and document credentials oversight

487c1b5

docs(verify): add Torrust tracker client announce tests for HTTP and …

7a4f714

…UDP trackers

docs(deploy): update progress — all 9 services verified, fill in serv…

7001123

…ice endpoints table

docs(post-provision): add screenshot of Hetzner backups enabled state

36af759

docs(post-provision): add Hetzner backups step to ToC and post-provis…

ea7ea22

…ion index

docs(maintenance): mark Hetzner Cloud and DNS API tokens as deleted

96d1628

Steps 5 and 6 completed on 2026-03-04. Both tokens deleted from their respective consoles.

docs(maintenance): mark Grafana admin password as rotated (step 3 done)

8a0064d

docs(maintenance): mark tracker admin token rotation as done (step 1)

c55763c

docs(maintenance): mark MySQL torrust and root password rotation as d…

d3d1401

…one (step 2)

docs(maintenance): mark SSH deployer key rotation as done (step 4)

fda93d0

docs(maintenance): mark local file archival done (step 7) and secrets…

954a6a8

… rotation complete

docs(maintenance): add step 4g - delete old SSH key from Hetzner cons…

4b55ad0

…ole (done)

josecelano requested a review from da2ce7 March 4, 2026 18:20

josecelano added 2 commits March 4, 2026 18:21

josecelano marked this pull request as ready for review March 4, 2026 18:27

josecelano merged commit 29f5d17 into main Mar 4, 2026
33 checks passed

josecelano merged 59 commits into main from 405-deploy-hetzner-demo-tracker-and-document-process
