feat: [#405] deploy Hetzner demo tracker and document process (in progress) #406
Merged
josecelano merged 59 commits into main on Mar 4, 2026
…irectories
- Rename configuration.md → deployment-spec.md (it describes what to deploy, not a command)
- Add commands/create/README.md documenting create template, validate, create environment steps
- Add commands/provision/README.md documenting provision command and server details (attempt 1)
- Split problems.md into commands/create/problems.md (problems 1–3) and commands/provision/problems.md (problems 4–5)
- Move create/ and provision/ docs under a commands/ parent folder for cleaner structure
- Rename screenshot to hetzner-console-provisioned-server-details-attempt-1.png
- Update all internal cross-links to reflect new paths
…er errors
- Add .github/skills/usage/operations/debug-command-failure/skill.md covering the 5-layer investigation workflow: console output, environment state, trace log, build artifacts, and manual verification
- Include common error patterns table (failed_step + error_kind → cause)
- Include recovery guide for pre- vs post-cloud-resource failure scenarios
- Register skill in AGENTS.md table (alphabetical order)
Replace the vague 'timeout too short' explanation with a precise timeline reconstructed from data/logs/log.txt:
- tofu apply completed in 19s (15:30:13 → 15:30:32)
- SSH probe attempts 1-3: ~7s each → port 22 not yet open (ConnectTimeout)
- SSH probe attempts 4-60: ~2-3s each → TCP connects but auth rejected
- sshd was listening within ~17s of server appearing
- Real bottleneck: cloud-init had not yet created the torrust user + authorized_keys, so every attempt was rejected with permission denied
- cloud-init finished between 15:33:32 and ~15:44:00 (>3m32s after server)
- Update Resolution note: retry may also time out for the same reason
…ement recommendations
Based on problems encountered during the Hetzner provision attempt:
1. Distinguish SSH failure reason in probe loop (timeout vs auth rejected)
   - Both currently surface as Ok(false) with no diagnostic detail
   - Recommendation: capture stderr and log different message per failure type
2. Classify error_kind more precisely for auth failures
   - 'NetworkConnectivity' is misleading for permission-denied cases
   - Recommendation: add SshAuthenticationFailed / SshUserNotReady variant
3. Include per-attempt failure details in trace file
   - Current trace only summarises with '60 attempts (120s total)'
   - Investigation required manual timestamp comparison in data/logs/log.txt
   - Recommendation: condensed probe phase summary in trace
4. Configurable SSH connectivity timeout
   - 120s hardcoded; Hetzner ccx23 cloud-init took >3m32s
   - Recommendation: per-provider default + env config field + CLI flag
5. Resume capability after partial provision failure
   - destroy + recreate discards a healthy server just because SSH probe timed out
   - Recommendation: provision --resume or standalone wait-for-ssh command
Without IdentitiesOnly=yes, SSH offers all keys loaded in the agent to the server before trying the -i key. If enough agent keys are loaded, the server hits its MaxAuthTries limit and disconnects with:
    Received disconnect: Too many authentication failures
This causes every wait_for_connectivity attempt to fail regardless of whether the host is reachable, making provision always time out when run outside Docker on a machine with an active SSH agent.
Add IdentitiesOnly=yes to build_default_ssh_options() so only the explicitly configured private key is used for authentication.
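A minimal sketch of what such a default-option builder could look like. Only the function name and the `IdentitiesOnly=yes` option come from the commit message; the signature, the `private_key` parameter, and the returned `Vec<String>` are assumptions made for illustration, not the deployer's actual code.

```rust
use std::path::Path;

/// Hypothetical shape of the default SSH option builder.
/// The real `build_default_ssh_options()` may take different
/// arguments and include more options than shown here.
fn build_default_ssh_options(private_key: &Path) -> Vec<String> {
    vec![
        // Use the explicitly configured private key.
        "-i".to_string(),
        private_key.display().to_string(),
        // Offer only that key, so keys loaded in the SSH agent are not
        // tried first and cannot exhaust the server's MaxAuthTries.
        "-o".to_string(),
        "IdentitiesOnly=yes".to_string(),
    ]
}
```

The resulting vector can be passed as arguments to an `ssh` child process ahead of the `user@host` target.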
Same root cause as the Rust SSH client fix (019e39c): when the SSH agent has many keys loaded, SSH tries all of them before the -i key, hitting the server's MaxAuthTries limit with:
    Too many authentication failures
This broke Ansible tasks (wait-cloud-init.yml and all subsequent playbooks) when running outside Docker on a machine with an active agent.
Add -o IdentitiesOnly=yes to the ssh_args in ansible.cfg so Ansible only uses the key specified via --private-key.
Previously, each 'Still waiting for SSH connectivity' log message gave no indication of why the connection was failing. The reason field now contains the raw stderr from the failed SSH attempt, e.g.:
    ssh: connect to host 46.225.234.201 port 22: Connection timed out
    torrust@46.225.234.201: Permission denied (publickey).
    Received disconnect: Too many authentication failures
This makes it immediately visible whether sshd is not yet listening, the user/key is not yet provisioned, or another error is occurring, without needing to re-run with verbose SSH flags.
Replace the indirect test_connectivity() path (which discarded stderr via check_command_with_options) with execute_with_options() directly so the raw stderr is accessible and logged in the retry loop.
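Once the raw stderr is available, the three cases named above can be told apart mechanically. The sketch below is one possible way to do that by substring matching on the OpenSSH messages quoted in this commit; the `SshFailure` enum and `classify_ssh_stderr` function are hypothetical names, and this commit itself only logs the raw text.

```rust
/// Hypothetical classification of a failed SSH probe attempt.
#[derive(Debug, PartialEq, Eq)]
enum SshFailure {
    /// sshd is not yet reachable on the port.
    PortNotOpen,
    /// TCP connected, but the user or key is not provisioned yet.
    AuthRejected,
    /// Too many keys were offered before the right one.
    TooManyAuthFailures,
    /// Anything not recognised by the patterns below.
    Other,
}

/// Classify the stderr of a failed `ssh` invocation by substring match.
/// This relies on OpenSSH's English messages and is only a heuristic.
fn classify_ssh_stderr(stderr: &str) -> SshFailure {
    if stderr.contains("Too many authentication failures") {
        SshFailure::TooManyAuthFailures
    } else if stderr.contains("Permission denied") {
        SshFailure::AuthRejected
    } else if stderr.contains("Connection timed out") || stderr.contains("Connection refused") {
        SshFailure::PortNotOpen
    } else {
        SshFailure::Other
    }
}
```

A retry loop could log the variant next to the raw stderr, so the reason is still visible even when the message format is unfamiliar.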
Change SSH connectivity wait defaults:
- DEFAULT_MAX_RETRY_ATTEMPTS: 60 (was 60, same count)
- DEFAULT_RETRY_INTERVAL_SECS: 5s (was 2s)
- Total budget: 300s / 5 minutes (was 120s)
2 seconds between attempts is too aggressive: it floods logs and adds unnecessary load. 5 minutes total is a more realistic budget for slower providers and small machines where cloud-init user creation can take over 3 minutes (observed on Hetzner ccx23 in failed attempt 2).
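The budget arithmetic is simply attempts multiplied by interval. A small sketch, using the two constant names from the commit message; their types and the `total_budget` helper are assumptions for illustration.

```rust
use std::time::Duration;

// Constant names are taken from the commit message; the types are assumed.
const DEFAULT_MAX_RETRY_ATTEMPTS: u32 = 60;
const DEFAULT_RETRY_INTERVAL_SECS: u64 = 5;

/// Total time the probe loop waits before giving up (hypothetical helper).
fn total_budget(attempts: u32, interval_secs: u64) -> Duration {
    Duration::from_secs(u64::from(attempts) * interval_secs)
}
```

With the new defaults, `total_budget(DEFAULT_MAX_RETRY_ATTEMPTS, DEFAULT_RETRY_INTERVAL_SECS)` is 300 seconds; with the old 2-second interval it was 120 seconds.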
…README
- Add cleanup-between-attempts.md with step-by-step procedure for cleaning up a failed provision before retrying:
  1. Build updated Docker image (if code changed)
  2. Destroy failed environment on provider (tofu destroy)
  3. Verify server is gone on Hetzner Console
  4. Purge local data with 'purge --force' command
  5. Clear data/logs/log.txt for a clean retry log
- Update provision README:
  - Status updated from ProvisionFailed to 'Cleaned - ready to retry'
  - SSH timeout updated from 120s to 300s (60 × 5s) to reflect new defaults
  - Add link to cleanup-between-attempts.md
…n, and UDP domains bug
…n floating IPs
- Add post-provision/ directory with README, dns-setup.md, and volume-setup.md
- DNS setup: assign IPv4 (116.202.176.169) and IPv6 (2a01:4f8:1c0c:9aae::1) floating IPs, document permanent netplan configuration, include three screenshots
- Volume setup: guide for creating and mounting a 50 GB Hetzner volume at /opt/torrust/storage
- Update deployment README.md with Phase 3.5 section and TOC entries
- Update issue tracker with Phase 3.5 tasks (3.5.1 and 3.5.2 complete)
- Add technical words to project-words.txt (blkid, mountpoint, nofail, NXDOMAIN, netplan, etc.)
- Apply permanent netplan config for both floating IPs (116.202.176.169/32 and 2a01:4f8:1c0c:9aae::1/64) on the server
- Record actual commands run and their output in dns-setup.md Step 1.5
- Document two problems: SSH host key mismatch and netplan file permissions
- Add improvement note: use 'install -m 600' instead of 'tee' for correct permissions
- Mark task 3.5.3 complete in issue tracker
- Add kernel network terms to project-words.txt (qdisc, qlen, codel)
- Update dns-setup.md: replace manual UI instructions with Cloud API
approach using POST /v1/zones/{zone}/rrsets; add actual command outputs
and verification results; document the old dns.hetzner.com API
incompatibility with Cloud Console zones
- Add hetzner-dns-create-api-token-form.png screenshot
- Mark tasks 3.5.4 and 3.5.5 as done in issue tracker
- Add rrset/rrsets to project-words.txt
All 12 DNS records (A + AAAA for http1, http2, api, grafana, udp1, udp2)
now resolve globally to floating IPs 116.202.176.169 / 2a01:4f8:1c0c:9aae::1.
- Create 50 GB ext4 volume (torrust-tracker-demo-storage, id=104927743) in nbg1 via Cloud API; attach to server 122663759
- Mount at /opt/torrust/storage with discard,nofail,defaults via fstab (UUID=6fb9df14-c744-4e50-a48d-9ca4522a02de)
- Set ownership to torrust:torrust; verified fstab remount
- Replace placeholder draft with actual commands and outputs
- Add screenshots: volumes list + configure popup
- Mark tasks 3.5.6-3.5.9 done in issue tracker
- Add automount, ENDSSH to project-words.txt
Hetzner does not support volume snapshots (only server root disk snapshots). Remove the incorrect 'Targeted backups via snapshot' bullet and add a Problems entry explaining the limitation and the alternative (application-level backup command or rsync/tar).
- Add commands/configure/README.md with actual output (took ~103 s)
- Add commands/release/README.md skeleton (placeholder for next step)
- Mark task 3.2 done in issue tracker
Server post-configure state: Docker 28.2.2 and Docker Compose v2.29.2 installed and running. No containers running yet (expected, since release and run come next).
Add a 'Design Notes' section to post-provision/README.md covering:
- Why the deployer has no failure-recovery (intentional design: complexity vs. fast server recreation)
- The sequencing dilemma: volume must be set up before configure (so Ansible writes data directly to the volume), but this means extra manual work if provision fails and the server must be recreated
- Floating IPs are safe: just reassign to the new server, no DNS changes
- Volume is the painful part: detach, reattach, remount on the new server
- Alternative approach: defer volume setup until after run succeeds, then migrate data (better for prod, acceptable overhead for demo)
Also update status table: DNS and Volume setup both marked done.
- New observations.md: cross-cutting deployer insights gathered during this deployment. First entry documents the theoretical state recovery path via environment.json snapshots, with a prominent warning about the risks of partial execution states.
- Updated main README.md ToC: added configure, release, run to the deployment commands list and added observations.md as item 7.
- Updated Phase 4 section to reference the completed configure README.
- Added run/README.md placeholder (parallel to existing release placeholder).
Refs #405
… not in PATH)
The local docker-compose.yml validator added in PR #384 runs 'docker compose config --quiet' on the host. When the deployer is invoked via its Docker container (the standard usage), the docker binary is not installed inside the container and the command fails with ENOENT, aborting the release.
Documents:
- Root cause: local validator assumes docker is in PATH, but the deployer container has no docker binary installed
- Fix applied: handle ErrorKind::NotFound gracefully (skip with warning)
- State recovery: how environment.json was manually reset from ReleaseFailed to Configured so release could be retried
Also adds 'ENOENT' to project-words.txt.
Refs #405
When the deployer runs inside a Docker container (the standard production usage), the 'docker' binary is not installed inside the container. The local validator was treating io::ErrorKind::NotFound as a hard failure, blocking the release command entirely.
Fix: match on NotFound specifically and skip validation with a warning log. Any other OS error (e.g. PermissionDenied) is still a hard failure.
The warning message makes the skip reason explicit:
    'Skipping local docker-compose.yml validation: docker is not available in PATH (deployer may be running inside a container). The rendered file will be validated by Docker Compose on the remote host.'
Also updates the CommandExecutionFailed error message and test to reflect that NotFound is no longer an error case.
Fixes release failure documented in: docs/deployments/hetzner-demo-tracker/commands/release/bugs.md
Refs #405
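The pattern described here, skipping on NotFound and failing hard on any other OS error, can be sketched with the standard library alone. The `ValidationOutcome` enum, the `validate_compose_file` function, and its `program` parameter are hypothetical; the deployer's real validator and error types may look different.

```rust
use std::io;
use std::process::{Command, Stdio};

/// Hypothetical result of the local docker-compose.yml check.
#[derive(Debug, PartialEq, Eq)]
enum ValidationOutcome {
    Valid,
    Invalid,
    /// The binary was not found, so validation was skipped.
    Skipped,
}

/// Run `<program> compose -f <file> config --quiet` and interpret the result.
/// `program` is a parameter only so the NotFound path is easy to exercise.
fn validate_compose_file(program: &str, compose_file: &str) -> io::Result<ValidationOutcome> {
    let result = Command::new(program)
        .args(["compose", "-f", compose_file, "config", "--quiet"])
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .status();

    match result {
        Ok(status) if status.success() => Ok(ValidationOutcome::Valid),
        Ok(_) => Ok(ValidationOutcome::Invalid),
        // Binary missing from PATH: warn and skip instead of failing.
        Err(e) if e.kind() == io::ErrorKind::NotFound => {
            eprintln!("Skipping local docker-compose.yml validation: {program} is not available in PATH");
            Ok(ValidationOutcome::Skipped)
        }
        // Any other OS error (e.g. PermissionDenied) stays a hard failure.
        Err(e) => Err(e),
    }
}
```

Calling the function with a program name that does not exist on the machine returns `Ok(ValidationOutcome::Skipped)` rather than an error.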
- release/README.md: populated with actual output (~134 s, state=Released), plus reference to bugs.md for the docker-not-in-PATH issue on first attempt
- hetzner-demo-tracker/README.md: Phase 5 section filled in
- issue 405: task 3.3 marked done
Refs #405
Add bugs.md documenting two bugs found during the run command:
- Bug 1: MYSQL_USER='root' is not rejected at environment creation time, causing MySQL 8.4 to refuse to start with a confusing error
- Bug 2: MySQL root password silently gets a '_root' suffix appended, diverging from the configured password in the env JSON config
Add problems.md documenting the run command failure symptom, confirmed root cause, required fix, and related deployer improvement needed.
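For Bug 1, an early check at environment creation time would surface the problem before MySQL ever starts. The following is a minimal sketch under the assumption that the username is validated as a plain string; `UsernameError` and `validate_mysql_username` are hypothetical names, not part of the deployer.

```rust
/// Hypothetical validation error for the application database user.
#[derive(Debug, PartialEq, Eq)]
enum UsernameError {
    Empty,
    /// "root" is reserved for the MySQL superuser.
    ReservedRoot,
}

/// Reject usernames that cannot be used for the application database user.
fn validate_mysql_username(username: &str) -> Result<(), UsernameError> {
    if username.is_empty() {
        return Err(UsernameError::Empty);
    }
    // Exact match: MySQL account names are case-sensitive.
    if username == "root" {
        return Err(UsernameError::ReservedRoot);
    }
    Ok(())
}
```

Run during config parsing, a check like this would turn the confusing container start-up failure into an immediate, explicit error.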
The '_root' suffix is a placeholder stub with a comment explicitly stating it should be managed securely in production. Update the root cause description to reflect this — it was not a design decision but an unfinished implementation.
Add Bug 3 to bugs.md: MySQL password is not URL-encoded when rendered into the tracker.toml connection string. A '/' in the password causes an InvalidPort parse error and the tracker enters a restart loop. Documents the manual workaround applied (%2F encoding + scp + restart).
Add improvements.md clarifying that the 'Running' state only means docker compose up -d succeeded, not that all services are healthy. Documents why run intentionally does not wait for service health and points to the 'test' command as the right verification tool.
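The fix for Bug 3 amounts to percent-encoding the credentials before they are placed in the URL. Below is a dependency-free sketch; `percent_encode_userinfo` and `mysql_url` are hypothetical helpers, the `mysql://user:password@host:port/database` layout is an assumption about the connection string, and the host and database names in the usage note are made up.

```rust
/// Percent-encode every byte that is not an RFC 3986 "unreserved" character.
/// This is stricter than necessary but always safe for the userinfo part.
fn percent_encode_userinfo(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for byte in input.bytes() {
        match byte {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' => {
                out.push(byte as char)
            }
            _ => out.push_str(&format!("%{byte:02X}")),
        }
    }
    out
}

/// Build a MySQL connection URL with encoded credentials (assumed layout).
fn mysql_url(user: &str, password: &str, host: &str, port: u16, database: &str) -> String {
    format!(
        "mysql://{}:{}@{}:{}/{}",
        percent_encode_userinfo(user),
        percent_encode_userinfo(password),
        host,
        port,
        database
    )
}
```

For example, `mysql_url("torrust", "p/w@1", "mysql", 3306, "tracker")` yields `mysql://torrust:p%2Fw%401@mysql:3306/tracker`, so the `/` in the password can no longer be mistaken for the start of the path.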
- grafana.md: mark verified, mask admin password, switch curl commands from URL-embedded to -u flag (URL form breaks on passwords with /), add note about this, complete results table
- verify/README.md: mark Grafana as verified
- udp-tracker.md: mark verified, consolidate BEP 15 script to cover both trackers in one loop, add nc reliability note, update results table with actual connection IDs from successful handshakes
- verify/README.md: mark UDP Tracker as verified (all services now verified except browser check for Grafana)
…ice endpoints table
…yment
- Add maintenance/README.md index
- Add maintenance/secrets-rotation.md with 7 rotation steps
- Include secret-to-files relationship map (admin token in .env AND prometheus.yml x2; MySQL torrust password in .env, tracker.toml, backup.conf; MySQL root and Grafana passwords in .env only)
- Step 1 expanded into 1a/1b/1c covering both .env and prometheus.yml token update, with tracker + prometheus restart and verification
- Update deployment README.md ToC with Maintenance section
Steps 5 and 6 completed on 2026-03-04. Both tokens deleted from their respective consoles.
…edure
- Add maintenance/os-updates.md with step-by-step apt update procedure, reboot check, service verification, and a log table
- Update maintenance/README.md with OS Updates row
- Add 'autoremove' to project-words.txt
docker compose restart is not enough when rotating the admin token because it is injected as an env var at container creation time. Use:
    docker compose up -d --force-recreate tracker prometheus
Add note in step 2e clarifying that restart IS sufficient for tracker.toml and backup.conf since they are bind-mounted files re-read on startup.
… rotation complete
…ST health check endpoint
- os-updates.md: update log entry with successful reboot result
- secrets-rotation.md: tick off SSH access and API with new token
- health-check.md: add Public REST API Health Check section (port 1212 via Caddy, no auth token required)
Hetzner does not include native uptime monitoring, unlike DigitalOcean. Add documentation to cover this gap:
- Explain context and comparison with DigitalOcean
- Add comparison table of external tools (UptimeRobot, Freshping, Better Uptime, Checkly, statuspage.io) with free-tier details
- List public endpoints to monitor with expected responses
- Note UDP monitoring limitations
- Add Checkly, Freshping, statuspage to project-words.txt
- Add tracker-registry.md with newTrackon submission instructions
- Only udp1 (udp://udp1.torrust-tracker-demo.com:6969/announce) is submitted; udp2 kept off public lists to avoid announce noise in logs
- Mark submitted on 2026-03-04, pending appearance in public list
- Link from deployment README ToC (item 10)
- Remove newTrackon section from uptime-monitoring.md (wrong place)
- Add Trackon to project-words.txt
Collect all deployer bugs found during this deployment in one file with links to full descriptions in the relevant command docs. 11 bugs total across create/provision/release/run/test:
- B-01: create template binds to 0.0.0.0 (IPv4 only)
- B-02: create template defaults to SQLite silently
- B-03: instance_name: null unexplained in template
- B-04: provision SSH probe budget too short for Hetzner (120s)
- B-05: passphrase-protected SSH keys fail silently in Docker
- B-06: UDP tracker domains missing from provision output
- B-07: release fails when docker not in PATH [FIXED]
- B-08: MySQL 'root' username not rejected at creation time
- B-09: MySQL root password silently diverges (_root suffix)
- B-10: MySQL password not URL-encoded in tracker.toml
- B-11: test DNS check false positives with floating IP
Collect all deployer improvement recommendations found during this deployment in one file with links to full descriptions. 13 items across create/provision/run/cross-cutting/post-provision:
- I-01: create template should document instance_name auto-generation
- I-02: create template should default to [::] for public sockets
- I-03: create template should prompt for database choice
- I-04: provision SSH probe should distinguish failure reasons
- I-05: provision error_kind should be specific for SSH auth failures
- I-06: provision trace file should include per-attempt SSH details
- I-07: provision SSH connectivity timeout should be configurable
- I-08: provision should detect passphrase-protected SSH keys early
- I-09: provision should have wait-for-ssh or --resume flag
- I-10: provision output should include IPv6 address
- I-11: run should add lightweight post-start health check
- I-12: env config should support floating IP for DNS checks
- I-13: post-provision netplan config should use correct permissions
ACK 675b826
Closes #405
Summary
Real-world deployment of a Torrust Tracker demo instance to Hetzner Cloud using the deployer tool, with full documentation of every step, decision, and problem encountered. The documentation will serve as both an internal reference and a source for a blog post on torrust.com.
Progress
- … ccx23 at nbg1)
- udp1 submitted to newTrackon (2026-03-04)

What's in this PR

Documentation (docs/deployments/hetzner-demo-tracker/)

New deployment journal with per-command subdirectories:
- secrets-rotation.md: complete procedure for rotating all 7 secrets; records first rotation (2026-03-04)
- os-updates.md: apt update procedure with log of first run (59 updates, reboot, all services healthy)
- uptime-monitoring.md: documents Hetzner monitoring gap vs DigitalOcean; lists external tools (UptimeRobot, Freshping, etc.)
- … udp1 is listed (udp2 kept quiet for production debugging)

Code fixes (src/, templates/)

- fix: IdentitiesOnly=yes added to default SSH options (src/adapters/ssh/client.rs)
- fix: IdentitiesOnly=yes added to Ansible ssh_args (templates/ansible/ansible.cfg)
- feat: SSH retry budget increased to 300s (60 × 5s)
- feat: Full SSH stderr logged in retry messages for easier diagnosis
- fix: release no longer hard-fails when docker is not in PATH inside Docker container

New skill

- .github/skills/usage/operations/debug-command-failure/skill.md

Service Endpoints

- https://http1.torrust-tracker-demo.com/announce
- https://http2.torrust-tracker-demo.com/announce
- udp://udp1.torrust-tracker-demo.com:6969
- udp://udp2.torrust-tracker-demo.com:6868
- https://api.torrust-tracker-demo.com/api/v1
- https://grafana.torrust-tracker-demo.com

Bugs Found (11 total, 1 fixed)

- create: template binds to 0.0.0.0 (IPv4 only)
- create: template defaults to SQLite silently
- create: instance_name: null unexplained
- provision: SSH probe budget too short for Hetzner (120s)
- provision: passphrase-protected SSH keys fail silently in Docker
- provision: UDP tracker domains missing from provision output
- release: fails when docker not in PATH (fixed)
- run: MySQL "root" username not rejected at creation time
- run: MySQL root password silently diverges (_root suffix)
- run: MySQL password not URL-encoded in tracker.toml
- test: DNS check false positives with floating IP

Key learnings documented

- docker compose restart does not re-read env vars; must use up -d --force-recreate after rotating secrets in .env.
- tracker.toml: / in password must be encoded as %2F in the connection string.