Description
Describe the bug
I currently manage hundreds of Kubernetes clusters, all configured by Flux from a single GitOps repository. We use flux_bootstrap_git to manage the Flux installation for each cluster. On average, this repository receives a new commit every minute during daytime hours.
This commit frequency causes problems when updating the flux_bootstrap_git resource. Whenever the Flux Terraform provider attempts to push a commit to our GitOps repository, the push is almost always rejected and the operation eventually times out with the following error:
│ failed to push manifests: failed to push to remote: command error on
│ refs/heads/main: cannot lock ref 'refs/heads/main': is at
│ da2267035aa50139f41df052947da4e85202c0f0 but expected
│ 71853c6197a6a7f222db0f1978c7cb232b87c5ee
To mitigate this, we’ve increased the timeouts, which has helped to some extent. However, we’ve observed that on every retry, the Terraform provider performs a full clone of the entire repository. This process is time-consuming, given that the repository has over 300,000 commits, and new commits are often added within the retry window.
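For readers unfamiliar with this error: the push names the commit it expects refs/heads/main to point at, and the remote rejects the update if the ref has moved in the meantime. The following is a minimal, standard-library-only sketch of that compare-and-swap behaviour. The `Ref` type, `ErrStaleRef`, and the shortened commit IDs are hypothetical stand-ins for illustration, not part of Git or the provider.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrStaleRef mirrors the "is at X but expected Y" rejection (hypothetical name).
var ErrStaleRef = errors.New("cannot lock ref: remote moved")

// Ref is a hypothetical in-memory stand-in for refs/heads/main on the remote.
type Ref struct {
	mu   sync.Mutex
	head string
}

// Head returns the commit the ref currently points at.
func (r *Ref) Head() string {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.head
}

// Push moves the ref to next only if it still points at expected.
func (r *Ref) Push(expected, next string) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.head != expected {
		return fmt.Errorf("%w: is at %s but expected %s", ErrStaleRef, r.head, expected)
	}
	r.head = next
	return nil
}

func main() {
	remote := &Ref{head: "7185"}
	seen := remote.Head() // the provider clones and records the tip

	// Another writer lands a commit while the provider is still working.
	_ = remote.Push("7185", "da22")

	// The provider's push is rejected because its expected tip is stale.
	err := remote.Push(seen, "f00d")
	fmt.Println(err)
}
```

The longer the gap between recording the tip and pushing, the more likely another writer moves the ref first, which is why clone time matters here.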
A potential improvement could involve modifying the func (prd *providerResourceData) CloneRepository(ctx context.Context) function in internal/provider/provider_resource_data.go to use a shallow clone. Here’s an example of the proposed change:
func (prd *providerResourceData) CloneRepository(ctx context.Context) (*gogit.Client, error) {
	tmpDir, err := manifestgen.MkdirTempAbs("", "flux-bootstrap-")
	if err != nil {
		return nil, fmt.Errorf("could not create temporary working directory for git repository: %w", err)
	}
	gitClient, err := prd.GetGitClient(tmpDir)
	if err != nil {
		return nil, fmt.Errorf("could not create git client: %w", err)
	}
	// TODO: Need to conditionally clone here. If repository is empty this will fail.
	_, err = gitClient.Clone(ctx, prd.GetRepositoryURL().String(), repository.CloneConfig{
		CheckoutStrategy: repository.CheckoutStrategy{
			Branch: prd.git.Branch.ValueString(),
		},
+		ShallowClone: true,
	})
	if err != nil {
		return nil, fmt.Errorf("could not clone git repository: %w", err)
	}
	return gitClient, nil
}
Testing this change locally has shown an improvement in performance. It reduces the time required to clone the repository and should decrease the likelihood of timeouts when applying our Terraform configuration.
Would this be a reasonable proposal for a pull request? Let me know if there are other considerations I should account for.
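To give a rough sense of why a shorter clone should help, here is a back-of-the-envelope sketch. It assumes foreign commits arrive as a Poisson process at about one per minute (the average mentioned above) and that each clone-and-push attempt is independent. The window lengths in `main` are hypothetical, not measurements from our repository.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// conflictProbability returns the chance that at least one foreign commit
// lands during window, assuming Poisson arrivals at commitsPerMinute.
func conflictProbability(commitsPerMinute float64, window time.Duration) float64 {
	return 1 - math.Exp(-commitsPerMinute*window.Minutes())
}

// expectedAttempts returns the mean number of clone-and-push attempts until
// one succeeds, treating attempts as independent (geometric distribution).
func expectedAttempts(commitsPerMinute float64, window time.Duration) float64 {
	return math.Exp(commitsPerMinute * window.Minutes())
}

func main() {
	const rate = 1.0 // about one commit per minute, as described above

	// Hypothetical clone-to-push windows, chosen only for illustration.
	windows := []time.Duration{10 * time.Second, time.Minute, 3 * time.Minute}
	for _, w := range windows {
		fmt.Printf("window=%v conflict=%.2f attempts=%.1f\n",
			w, conflictProbability(rate, w), expectedAttempts(rate, w))
	}
}
```

Under these assumptions the expected number of attempts grows exponentially with the length of the window, so shrinking the clone time pays off more than raising the timeout.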
Steps to reproduce
Note
The failure is transient, so reproducing it may be tricky.
- Bootstrap a repository using flux_bootstrap_git by running terraform apply.
- Modify a property in flux_bootstrap_git, which triggers a new commit to be pushed to the bootstrapped repository.
- Reapply the Terraform configuration (terraform apply).
- While Terraform is applying, continuously push new commits to the bootstrapped repository to intentionally disrupt the process.
(This is especially helpful if the repository is large and slow to clone.)
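The last step can be scripted. Below is a minimal sketch of a helper that pushes empty commits at a fixed interval from a separate local clone. The flag names, the commit message, and the `origin` remote name are assumptions for illustration. Without `-repo` it only prints the git commands it would run.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"os/exec"
	"time"
)

// commitAndPush returns the git invocations for one disrupting commit.
func commitAndPush(branch string, n int) [][]string {
	return [][]string{
		{"git", "commit", "--allow-empty", "-m", fmt.Sprintf("disrupt %d", n)},
		{"git", "push", "origin", "HEAD:" + branch},
	}
}

func main() {
	repo := flag.String("repo", "", "local clone of the bootstrapped repository (empty = dry run)")
	branch := flag.String("branch", "main", "branch that flux_bootstrap_git pushes to")
	count := flag.Int("count", 3, "number of commits to push")
	interval := flag.Duration("interval", 5*time.Second, "pause between pushes")
	flag.Parse()

	for i := 1; i <= *count; i++ {
		for _, args := range commitAndPush(*branch, i) {
			if *repo == "" {
				fmt.Println(args) // dry run: only print the command
				continue
			}
			cmd := exec.Command(args[0], args[1:]...)
			cmd.Dir = *repo
			cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
			if err := cmd.Run(); err != nil {
				fmt.Fprintln(os.Stderr, "command failed:", err)
				os.Exit(1)
			}
		}
		if *repo != "" && i < *count {
			time.Sleep(*interval)
		}
	}
}
```

The helper exits on the first failed command, so if the provider's push wins a race, pull in the clone and restart it.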
Expected behavior
Ideally, flux_bootstrap_git should be designed to scale efficiently and remain resilient under high-frequency repository operations, avoiding timeouts.
Screenshots and recordings
No response
Terraform and provider versions
Terraform v1.10.4
on linux_amd64
+ provider registry.terraform.io/fluxcd/flux v1.2.3
Terraform provider configurations
provider "flux" {
  kubernetes = {
    host                   = var.kubernetes.host
    cluster_ca_certificate = base64decode(var.kubernetes.ca_certificate)
    token                  = var.kubernetes.token
  }
  git = {
    branch = var.branch
    url    = "ssh://[email protected]/${var.github_owner}/${var.repository_name}.git"
    ssh = {
      username    = "git"
      private_key = tls_private_key.main.private_key_pem
    }
  }
}
flux_bootstrap_git resource
resource "flux_bootstrap_git" "this" {
  depends_on = [github_repository_deploy_key.main]
  path             = var.target_path
  components_extra = var.components_extra
  kustomization_override = templatefile("${path.module}/kustomization.tftpl.yaml", {
    // .....
  })
  timeouts = {
    create = "30m"
    delete = "30m"
    update = "30m"
    read   = "10m"
  }
}
Flux version
v2.2.3
Additional context
No response
Code of Conduct
- I agree to follow this project's Code of Conduct
Would you like to implement a fix?
Yes