Tailscale installer: Add wait mechanism for operator CRD registration #17

@Soypete

Problem

The Tailscale component installer fails when it tries to deploy the Connector resource immediately after the Helm installation completes, because the Connector CRD is not yet registered:

Error: failed to apply manifest (kind=Connector, name=foundry-vip-connector): the server could not find the requested resource

Root Cause

The installation flow in v1/internal/component/tailscale/install.go executes these steps sequentially:

  1. Helm installs the Tailscale operator
  2. Immediately tries to deploy the Connector resource
  3. Immediately tries to deploy the DNSConfig resource

However, the operator needs time to:

  • Start up (pod becomes Running)
  • Register its CRDs with the Kubernetes API server (eventually consistent operation)

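This timing gap means the API server does not yet recognize kind=Connector when the manifest is applied, so the deployment fails with "the server could not find the requested resource".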

Proposed Solution

Add a waitForOperatorReady() method between Helm installation and CRD deployment that:

  1. Waits for the operator pod to be Running
  2. Waits for CRD registration by polling CRDExists() for connectors.tailscale.com
  3. Adds a short buffer (2 seconds) after the CRD appears to allow for full propagation

Implementation

File: v1/internal/component/tailscale/install.go

Add after Helm install (Step 4) and before Connector deployment (Step 5):

// Step 4.5: Wait for operator to be ready and CRDs to be registered
if err := i.waitForOperatorReady(ctx); err != nil {
    return fmt.Errorf("failed waiting for operator to be ready: %w", err)
}

New method:

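func (i *Installer) waitForOperatorReady(ctx context.Context) error {
    // One shared 120-second deadline covers both the pod wait and the CRD wait.
    ctx, cancel := context.WithTimeout(ctx, 120*time.Second)
    defer cancel()

    fmt.Println("Waiting for Tailscale operator to be ready...")

    // Wait for the operator pod to be running.
    if err := pollEvery(ctx, 2*time.Second, func() bool {
        pods, err := i.kubeClient.GetPods(ctx, DefaultNamespace)
        if err != nil {
            return false // treat API errors as transient and retry
        }
        for _, pod := range pods {
            if strings.Contains(pod.Name, "operator") && pod.Status == "Running" {
                return true
            }
        }
        return false
    }); err != nil {
        return fmt.Errorf("waiting for operator pod to be running: %w", err)
    }
    fmt.Println("✓ Operator pod is running")

    // Wait for the Connector CRD to be registered.
    if err := pollEvery(ctx, 2*time.Second, func() bool {
        exists, err := i.kubeClient.CRDExists(ctx, "connectors.tailscale.com")
        return err == nil && exists
    }); err != nil {
        return fmt.Errorf("waiting for Connector CRD to be registered: %w", err)
    }
    fmt.Println("✓ Connector CRD is registered")

    // Short buffer so the API server can start serving the new resource type.
    time.Sleep(2 * time.Second)
    return nil
}

// pollEvery is a small helper introduced for this method. It calls check
// immediately and then once per interval until check returns true or ctx is
// done, in which case it returns ctx.Err() (deadline exceeded or cancelled).
func pollEvery(ctx context.Context, interval time.Duration, check func() bool) error {
    for {
        if check() {
            return nil
        }
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-time.After(interval):
        }
    }
}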

Alternative Approach

Use retry with exponential backoff in DeployConnector() for better resilience.

Testing

  • Fresh cluster installation with use_tailscale: true
  • Verify operator starts successfully
  • Verify Connector and DNSConfig deploy without errors
  • Test timeout scenarios

Context

  • Discovered during pedro-ops cluster deployment testing
  • Required for automated foundry stack install with Tailscale enabled
  • Follow-up to PR #2i: Tailscale component registry integration
