@valerymo valerymo commented Jul 22, 2025

Overview

Jira: https://issues.redhat.com/browse/MGDAPI-5690

We want to avoid a situation where the CRO applies the same tags to Redis snapshots during every reconciliation cycle.
Instead, we aim to apply only new tags or tags that have changed.
This PR includes changes to the Redis snapshot tagging logic to support this behavior.
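
The "apply only new or changed tags" behavior can be sketched roughly as below. This is a minimal illustration, not the code from this PR: the `Tag` struct and the `filterTags` name are assumptions standing in for the AWS SDK tag type and whatever helper the provider actually uses.

```go
package main

// Tag is a simplified key/value pair. The real provider works with the AWS
// SDK's ElastiCache tag type, so this struct is an assumption for illustration.
type Tag struct {
	Key   string
	Value string
}

// filterTags returns only the desired tags that are missing from existing
// or whose value differs from the existing one. Input order is preserved.
func filterTags(desired, existing []Tag) []Tag {
	current := make(map[string]string, len(existing))
	for _, t := range existing {
		current[t.Key] = t.Value
	}
	var changed []Tag
	for _, t := range desired {
		if v, ok := current[t.Key]; !ok || v != t.Value {
			changed = append(changed, t)
		}
	}
	return changed
}
```

With a helper like this, a reconcile cycle in which nothing changed yields an empty slice, so no tagging call needs to be made.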

NOTES
This PR requires adding the following permission to your AWS user policy:
elasticache:ListTagsForResource

However, we are hitting an AWS limitation:
Inline policies for users, groups, or roles are limited to 2,048 characters per policy.

To address this limitation, we removed two currently unused permissions.
The following permissions can be safely removed (see comments below):

  1. iam:CreateServiceLinkedRole
  • No direct usage found in any provider code
  • This permission might have been added for potential future functionality, but it is not currently used
  • It's only required for CloudWatch alarm operations. However, the RHOAM operator uses CloudWatch only for metrics collection via the GetMetricData API call
  2. cloudwatch:ListMetrics
  • Although the permission is defined, the application never makes the corresponding AWS API call
  • The method func (r *RealCloudWatchClient) ListMetrics(...) exists only to satisfy an interface contract and is not invoked anywhere in the application
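
To see whether a given set of actions still fits under the 2,048-character inline policy limit mentioned above, something like the following sketch could be used. The single-statement policy shape, the `policySize` and `fitsInlineLimit` names, and the use of compact JSON length as the measure are all assumptions; the real CRO policy may be structured differently.

```go
package main

import (
	"encoding/json"
)

// maxInlineUserPolicyChars is the inline policy limit quoted in the PR description.
const maxInlineUserPolicyChars = 2048

// statement and policy model a minimal IAM policy document (assumed shape).
type statement struct {
	Effect   string   `json:"Effect"`
	Action   []string `json:"Action"`
	Resource string   `json:"Resource"`
}

type policy struct {
	Version   string      `json:"Version"`
	Statement []statement `json:"Statement"`
}

// policySize returns the length in bytes of the compact JSON encoding of a
// policy that allows the given actions on all resources.
func policySize(actions []string) (int, error) {
	doc, err := json.Marshal(policy{
		Version:   "2012-10-17",
		Statement: []statement{{Effect: "Allow", Action: actions, Resource: "*"}},
	})
	if err != nil {
		return 0, err
	}
	return len(doc), nil
}

// fitsInlineLimit reports whether the encoded policy stays within the limit.
func fitsInlineLimit(actions []string) (bool, error) {
	n, err := policySize(actions)
	return n <= maxInlineUserPolicyChars, err
}
```

Running this against the action list before and after adding elasticache:ListTagsForResource gives a quick indication of how much headroom is left.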

Verification

  • Check that your AWS user has the elasticache:ListTagsForResource permission in its permissions policies
$ oc get credentialsrequest cloud-resources-aws-credentials -n cloud-resource-operator -o yaml | grep -E "user:|policy:"
$ aws iam get-user-policy --user-name <user-name> --policy-name <policy-name> | grep -i listtagsforresource | grep elast

#expected to see:
  "elasticache:ListTagsForResource"
  • Clone this branch
  • Run make cluster/prepare
  • Run make run
  • Check that tags are created on Redis snapshots
  • Ensure no redundant tag operations on Redis backups/snapshots.
    Tags should be created or updated only if missing or different — e.g., on the first cycle or when tags change.
    On later cycles with no config changes, no tag action should occur.
    Expected logs:
  • In case of new or changed tag(s):
INFO[0030] creating or updating tags on elasticache nodes and snapshots
INFO[0031] Successfully applied 1 new/updated tags to cluster arn:aws:elasticache:<redis-cluster-name>
  • In case of no changes:
INFO[0026] creating or updating tags on elasticache nodes and snapshots 
INFO[0028] Redis cluster arn:aws:elasticache:<redis-cluster-name>: no tag changes required 

openshift-ci bot commented Jul 22, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign carlkyrillos for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

codecov bot commented Jul 23, 2025

Codecov Report

❌ Patch coverage is 75.75758% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 67.88%. Comparing base (5450c62) to head (fd000af).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/providers/aws/provider_redis.go | 75.75% | 7 Missing and 1 partial ⚠️ |

@@            Coverage Diff             @@
##           master     #888      +/-   ##
==========================================
+ Coverage   67.83%   67.88%   +0.05%     
==========================================
  Files          42       42              
  Lines        5326     5350      +24     
==========================================
+ Hits         3613     3632      +19     
- Misses       1350     1354       +4     
- Partials      363      364       +1     

| Files with missing lines | Coverage Δ |
|---|---|
| pkg/providers/aws/credentials.go | 90.19% <ø> (ø) |
| pkg/providers/aws/provider_redis.go | 60.22% <75.75%> (+0.64%) ⬆️ |


@valerymo valerymo force-pushed the MGDAPI-5690-1 branch 3 times, most recently from 467a683 to dc4a74e Compare July 27, 2025 15:27
"elasticache:ModifyCacheSubnetGroup",
"elasticache:DeleteCacheSubnetGroup",
"elasticache:ModifyReplicationGroup",
"elasticache:ListTagsForResource",

Member

In the past we reached an AWS limit on the permissions we could add to a single account. Just wondering: are you seeing these limits when you add this permission? I think it was the main reason why we didn't proceed with this change in the past.

Contributor Author

Yes, that was the reason. I added ListTagsForResource in AWS, but removed something just for testing. I did the same in CRO, and it's working for me now. Logging has also improved.
However, I'm continuing the investigation — even though the CRO logs look good, there are still "AddTag..." events showing up in CloudTrail.
Thank you!

Contributor Author

Hey @austincunningham, just a small fix applied: I had forgotten to add a condition for snapshots:
if len(snapshotList.Snapshots) > 0 && len(filteredTags) > 0
No more tagging events are appearing in CloudTrail.
Thank you
(Remaining: check the unit tests after the latest updates.)
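
For illustration, the snapshot guard described above could look roughly like this. `Snapshot`, `Tag`, and `tagSnapshots` are hypothetical stand-ins for the SDK types and provider code, and the `apply` callback stands in for the actual tagging call.

```go
package main

// Snapshot and Tag are hypothetical stand-ins for the AWS SDK types.
type Snapshot struct{ Name string }
type Tag struct{ Key, Value string }

// tagSnapshots calls apply once per snapshot, but only when there is at least
// one snapshot and at least one tag left after filtering. It returns the
// number of apply calls that completed successfully.
func tagSnapshots(snapshots []Snapshot, filteredTags []Tag, apply func(Snapshot, []Tag) error) (int, error) {
	if len(snapshots) == 0 || len(filteredTags) == 0 {
		return 0, nil // nothing to tag, so no API call is made
	}
	calls := 0
	for _, s := range snapshots {
		if err := apply(s, filteredTags); err != nil {
			return calls, err
		}
		calls++
	}
	return calls, nil
}
```

When the filtered tag list is empty, the function returns before any call to apply, which matches the observation that no further tagging events show up in CloudTrail.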

Contributor Author

@valerymo valerymo Jul 28, 2025

Unit tests should be ok now.

Contributor Author

@valerymo valerymo Jul 29, 2025

Permissions removed:

  1. iam:CreateServiceLinkedRole
  • No direct usage found in any provider code
  • Only referenced in vendor documentation for CloudWatch alarms: "// If you are an IAM user, you must have iam:CreateServiceLinkedRole to create a composite alarm that has Systems Manager OpsItem actions."
  • This permission might have been added for future functionality but isn't currently used

It seems we can safely remove iam:CreateServiceLinkedRole because:

  • It's only needed for CloudWatch alarm operations
  • RHOAM operator only uses CloudWatch for metrics collection (GetMetricData)

  2. cloudwatch:ListMetrics

cloudwatch:ListMetrics can be safely removed because:

  • The permission covers the ListMetrics AWS API call, but the application never makes that call
  • The method func (r *RealCloudWatchClient) ListMetrics(...) exists only to satisfy the interface contract
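
The "exists only to satisfy the interface contract" point can be illustrated with a small sketch. The `MetricsClient` interface, the simplified method signatures, and the `spyClient` type are assumptions for this example; the real client methods take AWS SDK input and output types.

```go
package main

// MetricsClient is a hypothetical interface modelled on the description above.
// Any implementation must provide both methods, even if callers use only one.
type MetricsClient interface {
	ListMetrics(namespace string) ([]string, error)
	GetMetricData(metric string) (float64, error)
}

// spyClient counts how often each method is called.
type spyClient struct {
	listCalls int
	dataCalls int
}

func (s *spyClient) ListMetrics(namespace string) ([]string, error) {
	s.listCalls++
	return nil, nil
}

func (s *spyClient) GetMetricData(metric string) (float64, error) {
	s.dataCalls++
	return 1.5, nil
}

// Compile-time check that spyClient satisfies the interface.
var _ MetricsClient = (*spyClient)(nil)

// collect is the only consumer in this sketch and uses GetMetricData alone,
// so ListMetrics is required by the type system but never invoked at runtime.
func collect(c MetricsClient, metric string) (float64, error) {
	return c.GetMetricData(metric)
}
```

A method that is never invoked never triggers the corresponding AWS API call, which is why its IAM permission is not exercised.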

@valerymo valerymo force-pushed the MGDAPI-5690-1 branch 2 times, most recently from e137f15 to d549f6f Compare July 28, 2025 09:04
"cloudwatch:ListMetrics",
"cloudwatch:GetMetricData",
//"iam:CreateServiceLinkedRole", // Only needed for CloudWatch alarms (not used)
//"cloudwatch:ListMetrics", // Only needed for metric discovery (not used)

Member

I would be inclined to leave ListMetrics in, as it is referenced in the interface.

Member

As the changes are metrics-based, I think I will test this in RHOAM to confirm that the alerting is working.

Member

@austincunningham austincunningham Sep 26, 2025

Something is up with the metrics when deployed on RHOAM.

Steps I took:

  • edited the org in the Makefile to my quay.io org
  • ran make image/push
  • installed the RHOAM addon on a CCS cluster with useclusterstorage: 'false'
  • patched the RHOAM CR with the cluster package workaround
oc -n redhat-rhoam-operator patch rhmis.integreatly.org rhoam \
  --type=merge --subresource=status \
  -p '{"status":{
        "preflightMessage":"preflight checks passed",
        "stage":"Preflight Checks",
        "preflightStatus": "successful",
        "stages":{}
      }}'
  • manually updated the operator images in the RHOAM CSV and the CRO CSV to point to the one I just built
  • port-forwarded the Prometheus service to local port 9089
oc port-forward services/rhoam-prometheus 9089:9090 -n redhat-rhoam-operator-observability
  • checked the status targets and found that the serviceMonitor/redhat-rhoam-operator-observability/cloud-resource-operator-metrics/0 was down.

So we are not serving metrics for Prometheus to consume.

I changed the image back to the normal one, and the metrics endpoint was exposed and working again.

Member

It might be worth checking an image built from master in CRO to see if that has the same issue.

Contributor Author

Update: done. Test: TODO. Thank you

if err != nil {
msg := "failed to add tags to aws elasticache :"
return croType.StatusMessage(msg), err
msg := "Failed to filter already applied tags"

Member

Suggested change
msg := "Failed to filter already applied tags"
msg := "failed to filter already applied tags"

Small thing: we have a convention of always using lower case in error messages, although we don't always appear to follow it.

Contributor Author

Done. Thank you

msg := "failed to add tags to aws elasticache :"
return croType.StatusMessage(msg), err
}
logrus.Infof("Successfully applied %d new/updated tags to cluster %s", len(filteredTags), arn)

Member

@austincunningham austincunningham Sep 26, 2025

Don't put the ARN in the log messages; it is a potential security hole, as it exposes the account number and region.

Member

Maybe use CacheClusterId instead.
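
Where only the ARN is at hand rather than the cluster ID from the API response, one option is to log just its final segment. This is a sketch under the assumption that the ARN has the usual colon-separated layout ending in the resource name; `resourceIDFromARN` is a hypothetical helper, not something in this PR.

```go
package main

import "strings"

// resourceIDFromARN returns the last colon-separated segment of an ARN
// (for example the cluster name), dropping the partition, region and
// account number. Input that does not start with "arn:" is returned unchanged.
func resourceIDFromARN(arn string) string {
	if !strings.HasPrefix(arn, "arn:") {
		return arn
	}
	return arn[strings.LastIndex(arn, ":")+1:]
}
```

Logging resourceIDFromARN(arn) instead of arn keeps the message useful for debugging without revealing the account number or region.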

Contributor Author

Done. Thank you

}
logrus.Infof("Successfully applied %d new/updated tags to cluster %s", len(filteredTags), arn)
} else {
logrus.Infof("Redis cluster %s: no tag changes required", arn)

Member

Same here: don't put the ARN in the log message.

Contributor Author

Done, thank you

})
if err != nil {
// If we can't list tags (permission issue), fall back to applying all tags
logrus.Warnf("Could not list existing tags for %s: %v. Will attempt to apply all tags (may result in unnecessary API calls for already-applied tags).", resourceARN, err)

Member

Same here: don't put the ARN in the log message.
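
The fallback in the hunk above, with the ARN kept out of the warning, might be sketched as follows. The `Tag` type, the `tagsToApply` name, the callback-based signature, and the `errListDenied` sentinel are all assumptions made so the example is self-contained; the real code calls the ElastiCache client and logrus directly.

```go
package main

import "errors"

// Tag is a simplified stand-in for the AWS SDK tag type.
type Tag struct{ Key, Value string }

// errListDenied is a hypothetical error used to simulate a missing
// elasticache:ListTagsForResource permission.
var errListDenied = errors.New("access denied")

// tagsToApply returns the tags that still need to be applied. If the existing
// tags cannot be listed, it warns (without including the ARN) and falls back
// to returning every desired tag.
func tagsToApply(desired []Tag, listExisting func() ([]Tag, error), warn func(msg string)) []Tag {
	existing, err := listExisting()
	if err != nil {
		warn("could not list existing tags, will attempt to apply all tags: " + err.Error())
		return desired
	}
	current := make(map[string]string, len(existing))
	for _, t := range existing {
		current[t.Key] = t.Value
	}
	var changed []Tag
	for _, t := range desired {
		if v, ok := current[t.Key]; !ok || v != t.Value {
			changed = append(changed, t)
		}
	}
	return changed
}
```

The fallback trades some redundant tagging calls for continued operation when the list permission is missing.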

Contributor Author

Done. Thank you

@valerymo valerymo force-pushed the MGDAPI-5690-1 branch 2 times, most recently from df38981 to 294e780 Compare September 28, 2025 09:08