
Conversation

akolarkunnu
Contributor

@akolarkunnu akolarkunnu commented Jun 17, 2025

Description

This fix mainly addresses duplicate master key generation. If key generation for a tenant is in progress and another encryption/decryption request arrives for the same tenant ID, that request will try to generate another master key, because the earlier request has not yet finished creating one. So there is a chance of creating duplicate keys for a single tenant. This fix also improves scalability in this scenario.

Fix: Moved the CountDownLatch from initMasterKey() into the encrypt() and decrypt() methods. All tenants waiting for key generation are tracked in the map tenantWaitingListenerMap. Whenever key generation completes or fails for a tenant, all requestors waiting on that tenant's key are notified.
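
For illustration, a minimal sketch of this waiting-listener pattern (names follow the description above; the locking is deliberately coarse, and this is not the PR's actual implementation):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

public class MasterKeyCoordinator {
    // tenantId -> callbacks waiting for that tenant's master key
    private final Map<String, List<BiConsumer<String, Exception>>> tenantWaitingListenerMap = new HashMap<>();

    /** Returns true if this caller should start key generation; otherwise it is queued. */
    public boolean registerWaiter(String tenantId, BiConsumer<String, Exception> onKeyReady) {
        synchronized (tenantWaitingListenerMap) {
            List<BiConsumer<String, Exception>> waiters =
                tenantWaitingListenerMap.computeIfAbsent(tenantId, id -> new ArrayList<>());
            waiters.add(onKeyReady);
            return waiters.size() == 1; // only the first waiter triggers generation
        }
    }

    /** Called once key generation completes (error == null) or fails (error != null). */
    public void notifyWaiters(String tenantId, String key, Exception error) {
        List<BiConsumer<String, Exception>> waiters;
        synchronized (tenantWaitingListenerMap) {
            waiters = tenantWaitingListenerMap.remove(tenantId);
        }
        if (waiters != null) {
            waiters.forEach(w -> w.accept(key, error)); // invoke callbacks outside the lock
        }
    }
}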

Testing:
Added more test cases covering multi-threaded success and failure scenarios.
Also manually tested single-tenant and multi-tenant use cases with multiple invocations using scripts; multi-tenancy was exercised by setting the property "plugins.ml_commons.multi_tenancy_enabled".

Related Issues

Resolves #3510

Check List

  • New functionality includes testing.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…ility

Removed the usage of CountDownLatch. Every request is submitted and returns a Future.
Added a list to track ongoing master key generation. If a tenant ID is in the list, its key generation is in progress, and the request waits until the other thread completes the generation. Meanwhile the system keeps accepting requests: if a key is already available in the map, the request proceeds; otherwise key generation for the new tenant starts in a different thread. So key generation for multiple tenants can happen simultaneously.

Resolves opensearch-project#3510

Signed-off-by: Abdul Muneer Kolarkunnu <[email protected]>
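
As a rough illustration of this future-based deduplication (hypothetical names; generateMasterKey stands in for the real key-generation call):

import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

public class TenantKeyCache {
    // tenantId -> the (possibly still running) key-generation future
    private final Map<String, CompletableFuture<String>> keyFutures = new ConcurrentHashMap<>();

    public CompletableFuture<String> getOrCreateKey(String tenantId) {
        // computeIfAbsent guarantees a single generation per tenant; concurrent
        // callers for the same tenant share the same future, while other tenants
        // get their own future and can generate keys simultaneously.
        return keyFutures.computeIfAbsent(
            tenantId,
            id -> CompletableFuture
                .supplyAsync(() -> generateMasterKey(id))
                // on failure, drop the entry so a later request can retry
                .whenComplete((key, err) -> { if (err != null) keyFutures.remove(id); })
        );
    }

    private String generateMasterKey(String tenantId) {
        return "key-for-" + tenantId; // placeholder for the real key generation
    }
}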
@akolarkunnu akolarkunnu temporarily deployed to ml-commons-cicd-env-require-approval June 17, 2025 12:23 — with GitHub Actions Inactive
@dhrubo-os
Collaborator

Awesome! Thanks for raising the PR. This will be a great improvement. I'll start actively reviewing this PR tomorrow.

Can you also please update your PR description with details, such as how you tested single tenancy and multi tenancy?


codecov bot commented Jun 17, 2025

Codecov Report

Attention: Patch coverage is 81.81818% with 12 lines in your changes missing coverage. Please review.

Project coverage is 80.40%. Comparing base (969dc3d) to head (80ad33a).
Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
.../opensearch/ml/common/connector/HttpConnector.java 75.00% 3 Missing and 1 partial ⚠️
...g/opensearch/ml/common/connector/McpConnector.java 75.00% 3 Missing and 1 partial ⚠️
.../opensearch/ml/engine/encryptor/EncryptorImpl.java 88.23% 2 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##               main    #3919   +/-   ##
=========================================
  Coverage     80.39%   80.40%           
- Complexity     7910     7915    +5     
=========================================
  Files           693      693           
  Lines         34849    34863   +14     
  Branches       3872     3877    +5     
=========================================
+ Hits          28018    28030   +12     
  Misses         5096     5096           
- Partials       1735     1737    +2     
Flag Coverage Δ
ml-commons 80.40% <81.81%> (+<0.01%) ⬆️

@akolarkunnu akolarkunnu temporarily deployed to ml-commons-cicd-env-require-approval June 17, 2025 21:46 — with GitHub Actions Inactive
@dhrubo-os dhrubo-os had a problem deploying to ml-commons-cicd-env-require-approval June 18, 2025 15:19 — with GitHub Actions Failure
@dhrubo-os dhrubo-os had a problem deploying to ml-commons-cicd-env-require-approval June 19, 2025 16:00 — with GitHub Actions Failure
Member

@dbwiddis dbwiddis left a comment


Hey @akolarkunnu, first I want to thank you for your contribution! I've experienced flaky test behavior from this particular class (see #2888) and this may help with that.

That said, I've made multiple implementations on OpenSearch plugins and our Remote Metadata SDK using futures and learned some hard lessons along the way. I did leave a line by line review above but wanted to follow up with some general comments.

  1. TLDR on using Futures... avoid them if you can implement something with an ActionListener. These are longstanding well-established and tested callback mechanisms that handle most asynchronous work in OpenSearch. When you have a single "thread" with async breaks they are almost always the correct way to handle it. When you are awaiting multiple things to happen, that's where it gets complex. That may be the case here.
  2. OpenSearch has an ActionFuture class that doubles as an ActionListener. It has an actionGet() method that implements some exception handling (unwrapping the nested exceptions, etc.) that is pretty useful. Please take a look at using that.
  3. Generally speaking I see the code replacing an entire map with an entirely new map every time an encrypt/decrypt call is made. It's hard for me to think this approach is thread safe without synchronization of the map. It seems to me we can probably easily add to a map without touching the other keys but the existing "replace the map" implementation raises a lot of questions.
  4. Consider an approach using an AtomicReference to the decryptedCredential map, taking advantage of the updateAndGet() method. (I'm not sure that will work here, but it seems a possible improvement over iterate-all-and-replace-without-atomicity; see the sketch after this list.)
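
A minimal sketch of items 2 and 4, as an illustration only (package paths reflect recent OpenSearch versions and may differ by release; someAsyncCall and the credential map are hypothetical):

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

import org.opensearch.action.support.PlainActionFuture;
import org.opensearch.core.action.ActionListener;

public class Sketches {
    // (2) PlainActionFuture is both an ActionListener and a Future; actionGet()
    // unwraps the nested ExecutionException into a runtime exception.
    static String blockingGet() {
        PlainActionFuture<String> future = PlainActionFuture.newFuture();
        someAsyncCall(future);     // pass it anywhere an ActionListener is expected
        return future.actionGet(); // block only where blocking is acceptable
    }

    // (4) Atomically add one entry instead of replacing the whole map.
    static final AtomicReference<Map<String, String>> decryptedCredential =
        new AtomicReference<>(Map.of());

    static void putCredential(String key, String value) {
        decryptedCredential.updateAndGet(old -> {
            Map<String, String> copy = new HashMap<>(old); // copy-on-write; other keys untouched
            copy.put(key, value);
            return Map.copyOf(copy);
        });
    }

    static void someAsyncCall(ActionListener<String> listener) {
        listener.onResponse("ok"); // placeholder for a real async API
    }
}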

@dhrubo-os dhrubo-os had a problem deploying to ml-commons-cicd-env-require-approval June 20, 2025 13:55 — with GitHub Actions Failure
@dhrubo-os dhrubo-os had a problem deploying to ml-commons-cicd-env-require-approval June 20, 2025 13:55 — with GitHub Actions Error
@jngz-es
Collaborator

jngz-es commented Jun 23, 2025

TLDR on using Futures... avoid them if you can implement something with an ActionListener.

I couldn't agree more on this point.

@dhrubo-os dhrubo-os had a problem deploying to ml-commons-cicd-env-require-approval June 23, 2025 19:26 — with GitHub Actions Failure
@akolarkunnu akolarkunnu requested a deployment to ml-commons-cicd-env-require-approval August 18, 2025 11:31 — with GitHub Actions Waiting
Resolves opensearch-project#3510

Signed-off-by: Abdul Muneer Kolarkunnu <[email protected]>
@akolarkunnu
Contributor Author

@dbwiddis @dhrubo-os Please review the latest changes.

@akolarkunnu
Contributor Author

@dbwiddis @dhrubo-os @pyek-bot Gentle reminder for review!

Member

@dbwiddis dbwiddis left a comment


Hey @akolarkunnu thanks for continuing to try to iterate on this, but using thread-blocking code for concurrency is not ideal.

Please try to find a solution using only a chain of action listeners without blocking. These all naturally block on IO threads and don't consume thread pool resources.
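
For instance, a non-blocking shape along these lines (illustrative only; initMasterKey and doEncrypt are hypothetical stand-ins, and the ActionListener package path reflects recent OpenSearch versions):

import org.opensearch.core.action.ActionListener;

public class AsyncEncryptSketch {
    void encrypt(String tenantId, String plaintext, ActionListener<String> listener) {
        // resolve the master key asynchronously, then encrypt in the callback;
        // no thread is parked waiting on a latch or future
        initMasterKey(tenantId, ActionListener.wrap(
            masterKey -> listener.onResponse(doEncrypt(masterKey, plaintext)),
            listener::onFailure // propagate failures down the chain
        ));
    }

    void initMasterKey(String tenantId, ActionListener<String> listener) {
        listener.onResponse("key-for-" + tenantId); // stand-in for the real async init
    }

    String doEncrypt(String masterKey, String plaintext) {
        return plaintext; // stand-in for the real encryption
    }
}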

initMasterKey(tenantId);
final AwsCrypto crypto = AwsCrypto.builder().withCommitmentPolicy(CommitmentPolicy.RequireEncryptRequireDecrypt).build();
JceMasterKey jceMasterKey = createJceMasterKey(tenantId);
CountDownLatch latch = new CountDownLatch(1);
Member


We really shouldn't use a latch here, as waiting for it blocks the thread and consumes a thread pool resource. Try to stick to plain action listeners here.

final AwsCrypto crypto = AwsCrypto.builder().withCommitmentPolicy(CommitmentPolicy.RequireEncryptRequireDecrypt).build();
JceMasterKey jceMasterKey = createJceMasterKey(tenantId);
CountDownLatch latch = new CountDownLatch(1);
AtomicReference<Object> decryptResponse = new AtomicReference<>();
Member


Using Object for a reference here creates a lot of complexity and loses type safety.
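
A typed reference, for comparison (DecryptResponse is a placeholder name for the actual response type):

AtomicReference<DecryptResponse> decryptResponse = new AtomicReference<>();
// ...
DecryptResponse response = decryptResponse.get(); // no cast needed at the call site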

// checking and waiting if the master key generation triggered by any other thread
if (existingWaitingListener != null && tenantWaitingListenerMap.containsKey(tenantId)) {
log.info("Waiting for other thread to generate master key");
waitingListeners.wait();
Member


This is a blocking call, and it's inside a synchronized block so it'll never release waitingListeners, blocking both this method and every other method synchronizing on the same object.

Development

Successfully merging this pull request may close these issues.

[FEATURE] Improve EncryptorImpl with Asynchronous Handling for Scalability
6 participants