feat: [#405] deploy Hetzner demo tracker and document process (in progress) #406

Merged
josecelano merged 59 commits into main from 405-deploy-hetzner-demo-tracker-and-document-process on Mar 4, 2026


@josecelano commented Mar 3, 2026

Closes #405

Summary

Real-world deployment of a Torrust Tracker demo instance to Hetzner Cloud using the deployer tool, with full documentation of every step, decision, and problem encountered. The documentation will serve as both an internal reference and a source for a blog post on torrust.com.

Progress

  • Phase 1: Prerequisites documented
  • Phase 2: Environment created and configured
  • Phase 3.1: Infrastructure provisioned (Hetzner ccx23 at nbg1)
  • Phase 3.2: Post-provision manual steps (DNS, volume, Hetzner backups)
  • Phase 3.3: Configure instance (Docker 28.2.2, Docker Compose v2.29.2)
  • Phase 3.4: Release application (all Docker images staged)
  • Phase 3.5: Run services (all 5 services healthy)
  • Phase 4: Verify and document (HTTP tracker, UDP tracker, API, Grafana, health check, MySQL, storage, backup)
  • Phase 5: Post-deployment maintenance
    • Secrets rotation (all 7 secrets rotated)
    • OS updates (59 packages incl. 37 security, reboot, all services healthy)
    • Uptime monitoring documented (Hetzner has no native monitoring — external tools listed)
    • Tracker registry — udp1 submitted to newTrackon (2026-03-04)
  • Phase 6: Bug and improvement indexes

What's in this PR

Documentation (docs/deployments/hetzner-demo-tracker/)

New deployment journal with per-command subdirectories:

  • prerequisites.md — Hetzner account, API token, SSH key setup, tool versions
  • deployment-spec.md — Environment config decisions and sanitized config
  • commands/provision/ — Command walkthrough, problems (5), improvements (7), bugs, cleanup procedure
  • commands/configure/ — Docker installation walkthrough
  • commands/release/ — Image pull walkthrough
  • commands/run/ — Service start walkthrough, problems, improvements, bugs (3)
  • post-provision/ — DNS setup, volume setup, Hetzner backups
  • verify/ — Full verification index: HTTP tracker, UDP tracker, API, Grafana, health check, Docker services, MySQL, storage, backup
  • maintenance/
    • secrets-rotation.md — Complete procedure for rotating all 7 secrets; records first rotation (2026-03-04)
    • os-updates.md: apt update procedure with a log of the first run (59 updates, reboot, all services healthy)
    • uptime-monitoring.md — Documents Hetzner monitoring gap vs DigitalOcean; lists external tools (UptimeRobot, Freshping, etc.)
  • tracker-registry.md — newTrackon submission; explains why only udp1 is listed (udp2 kept quiet for production debugging)
  • bugs.md — All 11 deployer bugs found, with severity, status, and links to full descriptions
  • improvements.md — All 13 improvement recommendations, with links to full descriptions
  • observations.md — Cross-cutting insights and deployer learnings

Code fixes (src/, templates/)

  • fix: IdentitiesOnly=yes added to default SSH options (src/adapters/ssh/client.rs)
  • fix: IdentitiesOnly=yes added to Ansible ssh_args (templates/ansible/ansible.cfg)
  • feat: SSH retry budget increased to 300s (60 × 5s)
  • feat: Full SSH stderr logged in retry messages for easier diagnosis
  • fix: release no longer hard-fails when docker is not in PATH inside Docker container

New skill

  • .github/skills/usage/operations/debug-command-failure/skill.md

Service Endpoints

| Service | URL | Status |
| --- | --- | --- |
| HTTP Tracker 1 | https://http1.torrust-tracker-demo.com/announce | ✅ Running |
| HTTP Tracker 2 | https://http2.torrust-tracker-demo.com/announce | ✅ Running |
| UDP Tracker 1 | udp://udp1.torrust-tracker-demo.com:6969 | ✅ Running |
| UDP Tracker 2 | udp://udp2.torrust-tracker-demo.com:6868 | ✅ Running |
| Tracker API | https://api.torrust-tracker-demo.com/api/v1 | ✅ Running |
| Grafana | https://grafana.torrust-tracker-demo.com | ✅ Running |

Bugs Found (11 total, 1 fixed)

| ID | Command | Description | Severity | Status |
| --- | --- | --- | --- | --- |
| B-01 | create | Template binds to 0.0.0.0 (IPv4 only) | Major | 🔴 Open |
| B-02 | create | Template defaults to SQLite silently | Major | 🔴 Open |
| B-03 | create | instance_name: null unexplained | Minor | 🔴 Open |
| B-04 | provision | SSH probe budget too short for Hetzner (120 s) | Major | 🔴 Open |
| B-05 | provision | Passphrase-protected SSH keys fail silently in Docker | Major | 🔴 Open |
| B-06 | provision | UDP tracker domains missing from output | Minor | 🔴 Open |
| B-07 | release | Fails when docker not in PATH | High | 🟢 Fixed |
| B-08 | run | MySQL "root" username not rejected at creation time | High | 🔴 Open |
| B-09 | run | MySQL root password silently diverges | Medium | 🔴 Open |
| B-10 | run | MySQL password not URL-encoded in connection string | High | 🔴 Open |
| B-11 | test | DNS check false positives with floating IP | Minor | 🔴 Open |

Key learnings documented

  1. Passphrase-protected SSH keys fail silently in Docker — no agent, no TTY, signing fails. Root cause of all provision failures. Fix: remove passphrase from deployment keys.
  2. docker compose restart does not re-read env vars — must use up -d --force-recreate after rotating secrets in .env.
  3. MySQL password URL-encoding in tracker.toml: a / in the password must be encoded as %2F in the connection string.
  4. Hetzner has no native uptime monitoring — requires an external service such as UptimeRobot.
  5. UDP Tracker 2 kept off public tracker lists — public registration causes constant announce noise that makes production log debugging impractical.

…irectories

- Rename configuration.md → deployment-spec.md (it describes what to deploy, not a command)
- Add commands/create/README.md documenting create template, validate, create environment steps
- Add commands/provision/README.md documenting provision command and server details (attempt 1)
- Split problems.md into commands/create/problems.md (problems 1–3) and commands/provision/problems.md (problems 4–5)
- Move create/ and provision/ docs under a commands/ parent folder for cleaner structure
- Rename screenshot to hetzner-console-provisioned-server-details-attempt-1.png
- Update all internal cross-links to reflect new paths
…er errors

- Add .github/skills/usage/operations/debug-command-failure/skill.md
  covering the 5-layer investigation workflow: console output, environment
  state, trace log, build artifacts, and manual verification
- Include common error patterns table (failed_step + error_kind → cause)
- Include recovery guide for pre- vs post-cloud-resource failure scenarios
- Register skill in AGENTS.md table (alphabetical order)
Replace the vague 'timeout too short' explanation with a precise timeline
reconstructed from data/logs/log.txt:

- tofu apply completed in 19s (15:30:13 → 15:30:32)
- SSH probe attempts 1-3: ~7s each → port 22 not yet open (ConnectTimeout)
- SSH probe attempts 4-60: ~2-3s each → TCP connects but auth rejected
- sshd was listening within ~17s of server appearing
- Real bottleneck: cloud-init had not yet created the torrust user +
  authorized_keys — every attempt was rejected with permission denied
- cloud-init finished between 15:33:32 and ~15:44:00 (>3m32s after server)
- Update Resolution note: retry may also time out for the same reason
…ement recommendations

Based on problems encountered during the Hetzner provision attempt:

1. Distinguish SSH failure reason in probe loop (timeout vs auth rejected)
   - Both currently surface as Ok(false) with no diagnostic detail
   - Recommendation: capture stderr and log different message per failure type

2. Classify error_kind more precisely for auth failures
   - 'NetworkConnectivity' is misleading for permission-denied cases
   - Recommendation: add SshAuthenticationFailed / SshUserNotReady variant

3. Include per-attempt failure details in trace file
   - Current trace only summarises with '60 attempts (120s total)'
   - Investigation required manual timestamp comparison in data/logs/log.txt
   - Recommendation: condensed probe phase summary in trace

4. Configurable SSH connectivity timeout
   - 120s hardcoded; Hetzner ccx23 cloud-init took >3m32s
   - Recommendation: per-provider default + env config field + CLI flag

5. Resume capability after partial provision failure
   - destroy + recreate discards a healthy server just because SSH probe timed out
   - Recommendation: provision --resume or standalone wait-for-ssh command
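
Recommendations 1 and 2 could be sketched roughly as follows. This is only an illustration: the enum, its variant names, and `classify_ssh_stderr` are hypothetical and do not exist in the deployer. The matched substrings are standard OpenSSH client messages.

```rust
/// Hypothetical classification of a failed SSH probe, derived from stderr.
#[derive(Debug, PartialEq, Eq)]
pub enum SshProbeFailure {
    /// TCP connection could not be established (sshd not listening yet).
    PortNotReachable,
    /// sshd answered but rejected the key (user or authorized_keys not ready).
    AuthRejected,
    /// The server disconnected after too many offered keys (MaxAuthTries).
    TooManyAuthFailures,
    /// Anything else; the raw text is kept for the log.
    Other(String),
}

/// Maps raw SSH stderr to a failure reason (assumed helper, not deployer code).
pub fn classify_ssh_stderr(stderr: &str) -> SshProbeFailure {
    let text = stderr.to_lowercase();
    if text.contains("too many authentication failures") {
        SshProbeFailure::TooManyAuthFailures
    } else if text.contains("permission denied") {
        SshProbeFailure::AuthRejected
    } else if text.contains("connection timed out") || text.contains("connection refused") {
        SshProbeFailure::PortNotReachable
    } else {
        SshProbeFailure::Other(stderr.trim().to_string())
    }
}
```

With a classification like this, the probe loop could log a different message per failure type and report a more specific error_kind than 'NetworkConnectivity' when every attempt ended in AuthRejected.
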
Without IdentitiesOnly=yes, SSH offers all keys loaded in the agent to
the server before trying the -i key. If enough agent keys are loaded,
the server hits its MaxAuthTries limit and disconnects with:

  Received disconnect: Too many authentication failures

This causes every wait_for_connectivity attempt to fail regardless of
whether the host is reachable, making provision always time out when
run outside Docker on a machine with an active SSH agent.

Add IdentitiesOnly=yes to build_default_ssh_options() so only the
explicitly configured private key is used for authentication.
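
A minimal sketch of the idea, assuming the options are assembled as a flat argument list. The function names, signatures, and the ConnectTimeout option are assumptions for illustration; only IdentitiesOnly=yes and the use of -i come from the commit message.

```rust
/// Hypothetical stand-in for the deployer's default SSH option builder.
pub fn default_ssh_options(connect_timeout_secs: u32) -> Vec<String> {
    vec![
        "-o".to_string(),
        format!("ConnectTimeout={}", connect_timeout_secs),
        // Offer only the key passed with -i, never the keys held by the agent.
        "-o".to_string(),
        "IdentitiesOnly=yes".to_string(),
    ]
}

/// Builds the full argument list for an `ssh` invocation (illustrative only).
pub fn ssh_args(key_path: &str, user: &str, host: &str, connect_timeout_secs: u32) -> Vec<String> {
    let mut args = default_ssh_options(connect_timeout_secs);
    args.push("-i".to_string());
    args.push(key_path.to_string());
    args.push(format!("{}@{}", user, host));
    args
}
```

Because the option is part of the defaults, every caller that builds on them gets the same behaviour regardless of how many keys the local agent holds.
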
Same root cause as the Rust SSH client fix (019e39c): when the SSH
agent has many keys loaded, SSH tries all of them before the -i key,
hitting the server's MaxAuthTries limit with:

  Too many authentication failures

This broke Ansible tasks (wait-cloud-init.yml and all subsequent
playbooks) when running outside Docker on a machine with an active
agent. Add -o IdentitiesOnly=yes to the ssh_args in ansible.cfg so
Ansible only uses the key specified via --private-key.
Previously, each 'Still waiting for SSH connectivity' log message gave
no indication of why the connection was failing. The reason field now
contains the raw stderr from the failed SSH attempt, e.g.:

  ssh: connect to host 46.225.234.201 port 22: Connection timed out
  torrust@46.225.234.201: Permission denied (publickey).
  Received disconnect: Too many authentication failures

This makes it immediately visible whether sshd is not yet listening,
the user/key is not yet provisioned, or another error is occurring,
without needing to re-run with verbose SSH flags.

Replace the indirect test_connectivity() path (which discarded stderr
via check_command_with_options) with execute_with_options() directly
so the raw stderr is accessible and logged in the retry loop.
Change SSH connectivity wait defaults:
- DEFAULT_MAX_RETRY_ATTEMPTS: 60 (unchanged)
- DEFAULT_RETRY_INTERVAL_SECS: 5s (was 2s)
- Total budget: 300s / 5 minutes (was 120s)

2 seconds between attempts is too aggressive — it floods logs and adds
unnecessary load. 5 minutes total is a more realistic budget for
slower providers and small machines where cloud-init user creation can
take over 3 minutes (observed on Hetzner ccx23 in failed attempt 2).
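
The budget arithmetic and the retry loop with a visible failure reason can be sketched as below. The two constants use the names and values from the commit message; `total_budget_secs`, the closure-based `wait_for_connectivity` signature, and the log wording are assumptions, not the deployer's actual code.

```rust
use std::thread;
use std::time::Duration;

/// Names and values as given in the commit message.
pub const DEFAULT_MAX_RETRY_ATTEMPTS: u32 = 60;
pub const DEFAULT_RETRY_INTERVAL_SECS: u64 = 5;

/// Nominal wait budget in seconds: attempts multiplied by interval.
pub fn total_budget_secs(attempts: u32, interval_secs: u64) -> u64 {
    u64::from(attempts) * interval_secs
}

/// Hypothetical retry loop. `probe` returns Ok(()) on success, or the raw
/// stderr of the failed attempt, which is logged as the failure reason.
/// Returns the attempt number that succeeded, or the last reason seen.
pub fn wait_for_connectivity<F>(
    mut probe: F,
    max_attempts: u32,
    interval: Duration,
) -> Result<u32, String>
where
    F: FnMut() -> Result<(), String>,
{
    let mut last_reason = String::from("no attempts made");
    for attempt in 1..=max_attempts {
        match probe() {
            Ok(()) => return Ok(attempt),
            Err(stderr) => {
                last_reason = stderr.trim().to_string();
                eprintln!(
                    "Still waiting for SSH connectivity ({}/{}): {}",
                    attempt, max_attempts, last_reason
                );
                // No need to sleep after the final failed attempt.
                if attempt < max_attempts {
                    thread::sleep(interval);
                }
            }
        }
    }
    Err(last_reason)
}
```

With the new defaults the nominal budget is 60 × 5 s = 300 s, against 60 × 2 s = 120 s before. The time each SSH attempt itself takes comes on top of that.
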
…README

- Add cleanup-between-attempts.md with step-by-step procedure for
  cleaning up a failed provision before retrying:
  1. Build updated Docker image (if code changed)
  2. Destroy failed environment on provider (tofu destroy)
  3. Verify server is gone on Hetzner Console
  4. Purge local data with 'purge --force' command
  5. Clear data/logs/log.txt for a clean retry log

- Update provision README:
  - Status updated from ProvisionFailed to 'Cleaned - ready to retry'
  - SSH timeout updated from 120s to 300s (60 × 5s) to reflect new defaults
  - Add link to cleanup-between-attempts.md
@josecelano self-assigned this Mar 3, 2026
…n floating IPs

- Add post-provision/ directory with README, dns-setup.md, and volume-setup.md
- DNS setup: assign IPv4 (116.202.176.169) and IPv6 (2a01:4f8:1c0c:9aae::1) floating IPs,
  document permanent netplan configuration, include three screenshots
- Volume setup: guide for creating and mounting a 50 GB Hetzner volume at /opt/torrust/storage
- Update deployment README.md with Phase 3.5 section and TOC entries
- Update issue tracker with Phase 3.5 tasks (3.5.1 and 3.5.2 complete)
- Add technical words to project-words.txt (blkid, mountpoint, nofail, NXDOMAIN, netplan, etc.)
- Apply permanent netplan config for both floating IPs (116.202.176.169/32 and
  2a01:4f8:1c0c:9aae::1/64) on the server
- Record actual commands run and their output in dns-setup.md Step 1.5
- Document two problems: SSH host key mismatch and netplan file permissions
- Add improvement note: use 'install -m 600' instead of 'tee' for correct permissions
- Mark task 3.5.3 complete in issue tracker
- Add kernel network terms to project-words.txt (qdisc, qlen, codel)
- Update dns-setup.md: replace manual UI instructions with Cloud API
  approach using POST /v1/zones/{zone}/rrsets; add actual command outputs
  and verification results; document the old dns.hetzner.com API
  incompatibility with Cloud Console zones
- Add hetzner-dns-create-api-token-form.png screenshot
- Mark tasks 3.5.4 and 3.5.5 as done in issue tracker
- Add rrset/rrsets to project-words.txt

All 12 DNS records (A + AAAA for http1, http2, api, grafana, udp1, udp2)
now resolve globally to floating IPs 116.202.176.169 / 2a01:4f8:1c0c:9aae::1.
- Create 50 GB ext4 volume (torrust-tracker-demo-storage, id=104927743)
  in nbg1 via Cloud API; attach to server 122663759
- Mount at /opt/torrust/storage with discard,nofail,defaults via fstab
  (UUID=6fb9df14-c744-4e50-a48d-9ca4522a02de)
- Set ownership to torrust:torrust; verified fstab remount
- Replace placeholder draft with actual commands and outputs
- Add screenshots: volumes list + configure popup
- Mark tasks 3.5.6-3.5.9 done in issue tracker
- Add automount, ENDSSH to project-words.txt
Hetzner does not support volume snapshots (only server root disk
snapshots). Remove the incorrect 'Targeted backups via snapshot' bullet
and add a Problems entry explaining the limitation and the alternative
(application-level backup command or rsync/tar).
- Add commands/configure/README.md with actual output (took ~103 s)
- Add commands/release/README.md skeleton (placeholder for next step)
- Mark task 3.2 done in issue tracker

Server post-configure state: Docker 28.2.2 and Docker Compose v2.29.2
installed and running. No containers running yet (expected — release
and run come next).
Add a 'Design Notes' section to post-provision/README.md covering:

- Why the deployer has no failure-recovery (intentional design: complexity
  vs. fast server recreation)
- The sequencing dilemma: volume must be set up before configure (so
  Ansible writes data directly to the volume), but this means extra
  manual work if provision fails and the server must be recreated
- Floating IPs are safe: just reassign to the new server, no DNS changes
- Volume is the painful part: detach, reattach, remount on the new server
- Alternative approach: defer volume setup until after run succeeds, then
  migrate data (better for prod, acceptable overhead for demo)

Also update status table: DNS and Volume setup both marked done.
- New observations.md: cross-cutting deployer insights gathered during this
  deployment. First entry documents the theoretical state recovery path via
  environment.json snapshots, with a prominent warning about the risks of
  partial execution states.
- Updated main README.md ToC: added configure, release, run to the
  deployment commands list and added observations.md as item 7.
- Updated Phase 4 section to reference the completed configure README.
- Added run/README.md placeholder (parallel to existing release placeholder).

Refs #405
… not in PATH)

The local docker-compose.yml validator added in PR #384 runs
'docker compose config --quiet' on the host. When the deployer is invoked via
its Docker container (the standard usage), the docker binary is not installed
inside the container and the command fails with ENOENT, aborting the release.

Documents:
- Root cause: local validator assumes docker is in PATH, but the deployer
  container has no docker binary installed
- Fix applied: handle ErrorKind::NotFound gracefully (skip with warning)
- State recovery: how environment.json was manually reset from ReleaseFailed
  to Configured so release could be retried

Also adds 'ENOENT' to project-words.txt.

Refs #405
When the deployer runs inside a Docker container (the standard production
usage), the 'docker' binary is not installed inside the container. The
local validator was treating io::ErrorKind::NotFound as a hard failure,
blocking the release command entirely.

Fix: match on NotFound specifically and skip validation with a warning log.
Any other OS error (e.g. PermissionDenied) is still a hard failure.

The warning message makes the skip reason explicit:
  'Skipping local docker-compose.yml validation: docker is not available
   in PATH (deployer may be running inside a container). The rendered file
   will be validated by Docker Compose on the remote host.'

Also updates the CommandExecutionFailed error message and test to reflect
that NotFound is no longer an error case.
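
The shape of the fix can be illustrated with a small sketch. `LocalValidation` and `run_local_validator` are hypothetical names, and the real validator's types and arguments may differ; the point is only the match on io::ErrorKind::NotFound.

```rust
use std::io;
use std::process::Command;

/// Outcome of the optional local validation step (illustrative type).
#[derive(Debug, PartialEq, Eq)]
pub enum LocalValidation {
    Passed,
    /// The validator ran and rejected the file; stderr is kept.
    Failed(String),
    /// The validator binary is not in PATH, so validation was skipped.
    Skipped,
}

/// Hypothetical helper: runs `program args...` as a local validator.
pub fn run_local_validator(program: &str, args: &[&str]) -> io::Result<LocalValidation> {
    match Command::new(program).args(args).output() {
        Ok(output) if output.status.success() => Ok(LocalValidation::Passed),
        Ok(output) => Ok(LocalValidation::Failed(
            String::from_utf8_lossy(&output.stderr).trim().to_string(),
        )),
        // Missing binary: warn and skip instead of aborting the release.
        Err(err) if err.kind() == io::ErrorKind::NotFound => {
            eprintln!("Skipping local validation: {} is not available in PATH", program);
            Ok(LocalValidation::Skipped)
        }
        // Any other OS error (for example PermissionDenied) stays a hard failure.
        Err(err) => Err(err),
    }
}
```

A call such as `run_local_validator("docker", &["compose", "config", "--quiet"])` would then return Skipped inside a container without the docker binary, and Passed or Failed on a host that has it.
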

Fixes release failure documented in:
docs/deployments/hetzner-demo-tracker/commands/release/bugs.md

Refs #405
- release/README.md: populated with actual output (~134 s, state=Released),
  plus reference to bugs.md for the docker-not-in-PATH issue on first attempt
- hetzner-demo-tracker/README.md: Phase 5 section filled in
- issue 405: task 3.3 marked done

Refs #405
Add bugs.md documenting two bugs found during the run command:
- Bug 1: MYSQL_USER='root' is not rejected at environment creation time,
  causing MySQL 8.4 to refuse to start with a confusing error
- Bug 2: MySQL root password silently gets a '_root' suffix appended,
  diverging from the configured password in the env JSON config

Add problems.md documenting the run command failure symptom, confirmed
root cause, required fix, and related deployer improvement needed.
The '_root' suffix is a placeholder stub with a comment explicitly
stating it should be managed securely in production. Update the root
cause description to reflect this — it was not a design decision but
an unfinished implementation.
Add Bug 3 to bugs.md: MySQL password is not URL-encoded when rendered
into the tracker.toml connection string. A '/' in the password causes
an InvalidPort parse error and the tracker enters a restart loop.
Documents the manual workaround applied (%2F encoding + scp + restart).
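
A likely explanation for the InvalidPort error is that an unencoded '/' ends the authority part of the URL, so the text after "user:" is parsed as a port. The sketch below shows one way to percent-encode the credentials before rendering. Both function names and the URL layout are assumptions for illustration, not the deployer's template code.

```rust
/// Percent-encodes every byte that is not an RFC 3986 "unreserved" character.
pub fn percent_encode(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for byte in input.bytes() {
        match byte {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' => {
                out.push(byte as char);
            }
            // Everything else, including '/' and '@', becomes %XX.
            _ => out.push_str(&format!("%{:02X}", byte)),
        }
    }
    out
}

/// Hypothetical helper that renders a MySQL connection URL with encoded credentials.
pub fn mysql_url(user: &str, password: &str, host: &str, port: u16, database: &str) -> String {
    format!(
        "mysql://{}:{}@{}:{}/{}",
        percent_encode(user),
        percent_encode(password),
        host,
        port,
        database
    )
}
```

Encoding everything outside the unreserved set is stricter than required, but it avoids having to reason about which delimiters are safe in the userinfo component.
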

Add improvements.md clarifying that the 'Running' state only means
docker compose up -d succeeded, not that all services are healthy.
Documents why run intentionally does not wait for service health and
points to the 'test' command as the right verification tool.
- grafana.md: mark verified, mask admin password, switch curl commands
  from URL-embedded to -u flag (URL form breaks on passwords with /),
  add note about this, complete results table
- verify/README.md: mark Grafana as verified
- udp-tracker.md: mark verified, consolidate BEP 15 script to cover
  both trackers in one loop, add nc reliability note, update results
  table with actual connection IDs from successful handshakes
- verify/README.md: mark UDP Tracker as verified (all services now
  verified except browser check for Grafana)
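
For reference, the BEP 15 connect handshake used in the UDP verification can be sketched as two pure functions. This is not the script from udp-tracker.md; the function names are illustrative, and sending the packet over a UDP socket is left out.

```rust
/// Magic constant that identifies a BEP 15 connect request.
pub const PROTOCOL_ID: u64 = 0x0417_2710_1980;
const ACTION_CONNECT: u32 = 0;

/// Builds the 16-byte connect request: protocol id, action, transaction id.
pub fn build_connect_request(transaction_id: u32) -> [u8; 16] {
    let mut packet = [0u8; 16];
    packet[0..8].copy_from_slice(&PROTOCOL_ID.to_be_bytes());
    packet[8..12].copy_from_slice(&ACTION_CONNECT.to_be_bytes());
    packet[12..16].copy_from_slice(&transaction_id.to_be_bytes());
    packet
}

/// Reads a big-endian u32 from exactly four bytes.
fn read_u32(bytes: &[u8]) -> u32 {
    let mut buf = [0u8; 4];
    buf.copy_from_slice(bytes);
    u32::from_be_bytes(buf)
}

/// Parses a connect response and returns the connection id, or None if the
/// packet is too short, has the wrong action, or echoes a different transaction id.
pub fn parse_connect_response(response: &[u8], expected_transaction_id: u32) -> Option<u64> {
    if response.len() < 16 {
        return None;
    }
    if read_u32(&response[0..4]) != ACTION_CONNECT
        || read_u32(&response[4..8]) != expected_transaction_id
    {
        return None;
    }
    let mut id = [0u8; 8];
    id.copy_from_slice(&response[8..16]);
    Some(u64::from_be_bytes(id))
}
```

A successful handshake is one where the tracker echoes the transaction id and returns a connection id, which is what the results table records for both trackers.
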
…yment

- Add maintenance/README.md index
- Add maintenance/secrets-rotation.md with 7 rotation steps
- Include secret-to-files relationship map (admin token in .env AND
  prometheus.yml x2; MySQL torrust password in .env, tracker.toml,
  backup.conf; MySQL root and Grafana passwords in .env only)
- Step 1 expanded into 1a/1b/1c covering both .env and prometheus.yml
  token update, with tracker + prometheus restart and verification
- Update deployment README.md ToC with Maintenance section
Steps 5 and 6 completed on 2026-03-04.
Both tokens deleted from their respective consoles.
…edure

- Add maintenance/os-updates.md with step-by-step apt update procedure,
  reboot check, service verification, and a log table
- Update maintenance/README.md with OS Updates row
- Add 'autoremove' to project-words.txt
docker compose restart is not enough when rotating the admin token because
it is injected as an env var at container creation time. Use:
  docker compose up -d --force-recreate tracker prometheus

Add note in step 2e clarifying that restart IS sufficient for tracker.toml
and backup.conf since they are bind-mounted files re-read on startup.
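
The rule can be written down as a small decision function. This is a hypothetical sketch: the file names and both compose commands come from the notes above, while the types, function names, and the conservative default for unlisted files are assumptions.

```rust
/// How a running service must be reloaded after a secret changes.
#[derive(Debug, PartialEq, Eq)]
pub enum ReloadAction {
    Restart,
    ForceRecreate,
}

/// Picks the reload action for a changed file (illustrative mapping).
pub fn reload_action_for(changed_file: &str) -> ReloadAction {
    match changed_file {
        // Bind-mounted config files are re-read when the service starts.
        "tracker.toml" | "backup.conf" => ReloadAction::Restart,
        // .env values are injected at container creation time. Unlisted
        // files also fall back to recreation as the conservative choice.
        _ => ReloadAction::ForceRecreate,
    }
}

/// Renders the docker compose command line for the given services.
pub fn compose_command(action: &ReloadAction, services: &[&str]) -> String {
    let base = match action {
        ReloadAction::Restart => "docker compose restart",
        ReloadAction::ForceRecreate => "docker compose up -d --force-recreate",
    };
    format!("{} {}", base, services.join(" "))
}
```

For the admin token, `compose_command(&reload_action_for(".env"), &["tracker", "prometheus"])` yields the command quoted above.
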
…ST health check endpoint

- os-updates.md: update log entry with successful reboot result
- secrets-rotation.md: tick off SSH access and API with new token
- health-check.md: add Public REST API Health Check section
  (port 1212 via Caddy, no auth token required)
Hetzner does not include native uptime monitoring unlike DigitalOcean.
Add documentation to cover this gap:

- Explain context and comparison with DigitalOcean
- Add comparison table of external tools (UptimeRobot, Freshping,
  Better Uptime, Checkly, statuspage.io) with free-tier details
- List public endpoints to monitor with expected responses
- Note UDP monitoring limitations
- Add Checkly, Freshping, statuspage to project-words.txt
- Add tracker-registry.md with newTrackon submission instructions
- Only udp1 (udp://udp1.torrust-tracker-demo.com:6969/announce) is
  submitted; udp2 kept off public lists to avoid announce noise in logs
- Mark submitted on 2026-03-04, pending appearance in public list
- Link from deployment README ToC (item 10)
- Remove newTrackon section from uptime-monitoring.md (wrong place)
- Add Trackon to project-words.txt
@josecelano requested a review from da2ce7 on March 4, 2026 18:20
Collect all deployer bugs found during this deployment in one file
with links to full descriptions in the relevant command docs.

11 bugs total across create/provision/release/run/test:
- B-01: create template binds to 0.0.0.0 (IPv4 only)
- B-02: create template defaults to SQLite silently
- B-03: instance_name: null unexplained in template
- B-04: provision SSH probe budget too short for Hetzner (120s)
- B-05: passphrase-protected SSH keys fail silently in Docker
- B-06: UDP tracker domains missing from provision output
- B-07: release fails when docker not in PATH [FIXED]
- B-08: MySQL 'root' username not rejected at creation time
- B-09: MySQL root password silently diverges (_root suffix)
- B-10: MySQL password not URL-encoded in tracker.toml
- B-11: test DNS check false positives with floating IP
Collect all deployer improvement recommendations found during this
deployment in one file with links to full descriptions.

13 items across create/provision/run/cross-cutting/post-provision:
- I-01: create template should document instance_name auto-generation
- I-02: create template should default to [::] for public sockets
- I-03: create template should prompt for database choice
- I-04: provision SSH probe should distinguish failure reasons
- I-05: provision error_kind should be specific for SSH auth failures
- I-06: provision trace file should include per-attempt SSH details
- I-07: provision SSH connectivity timeout should be configurable
- I-08: provision should detect passphrase-protected SSH keys early
- I-09: provision should have wait-for-ssh or --resume flag
- I-10: provision output should include IPv6 address
- I-11: run should add lightweight post-start health check
- I-12: env config should support floating IP for DNS checks
- I-13: post-provision netplan config should use correct permissions
@josecelano marked this pull request as ready for review on March 4, 2026 18:27
@josecelano

ACK 675b826

@josecelano merged commit 29f5d17 into main on Mar 4, 2026
33 checks passed
