PHOENIX-7677 TTL_DELETE CDC event to use batch mutation #2247
Conversation
Left a couple of comments.
LGTM otherwise, +1
for (int retryCount = 0; retryCount < cdcTtlMutationMaxRetries; retryCount++) {
  try (Table cdcIndexTable =
      env.getConnection().getTable(TableName.valueOf(cdcIndex.getPhysicalName().getBytes()))) {
    cdcIndexTable.put(new ArrayList<>(pendingMutations.values()));
For my understanding, what are the semantics here for partial failures when applying batch mutations? Will we log a failure even if some mutations from the batch were applied after all retries?
Since we are using batchMutate() on a table where all the row keys are expected to go to the same region, either all mutations will succeed or none will.
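The bounded-retry, all-or-nothing pattern in the snippet above can be sketched with plain Java stand-ins. `BatchSink`, `RetryingBatchWriter`, and the `String` mutations are hypothetical placeholders for illustration; the real code uses HBase's `Table.put(List<Put>)` inside the compaction coprocessor.

```java
import java.util.List;

// Hypothetical stand-in for the HBase table write; real code calls Table.put(List<Put>).
interface BatchSink {
    void putAll(List<String> mutations) throws Exception;
}

final class RetryingBatchWriter {
    // Applies the whole batch; on failure, retries the whole batch up to maxRetries
    // times. Because all row keys target the same region, the batch is assumed to
    // either fully apply or fully fail, so no partial state survives a failed attempt.
    static boolean writeWithRetries(BatchSink sink, List<String> batch, int maxRetries) {
        for (int retryCount = 0; retryCount < maxRetries; retryCount++) {
            try {
                sink.putAll(batch);
                return true; // entire batch applied
            } catch (Exception e) {
                // log and fall through to the next retry
            }
        }
        return false; // caller logs a failure after exhausting retries
    }
}
```

A sink that fails transiently succeeds on a later attempt, while a sink that always fails exhausts the retry budget and reports failure for the whole batch.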
 * that all 82 rows have CDC TTL_DELETE events recorded with correct pre-image data.
 */
@Test
public void testCDCBatchMutationsForTTLExpiredRows() throws Exception {
Is there a way to verify the actual batching and retry logic from this IT? Do you think a metric for this operation would be helpful?
A metric would be helpful, but initiating a retry from an IT is not that simple; that should certainly be kept separate.
@palashc I have added a bunch of logs; do you think we still need metrics?
 */
public final class CDCCompactionUtil {

  private static final Logger LOGGER = LoggerFactory.getLogger(CDCCompactionUtil.class);

  // Shared cache for row images across all CompactionScanner instances in the JVM.
  // Entries expire after 1200 seconds (20 minutes) by default.
  // The JVM-level cache helps merge the pre-image for a row with multiple CFs.
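The shared, time-expiring cache described in the comment above can be approximated with a minimal sketch. The class name, the injectable clock, the `String` row images, and lazy eviction are all illustrative assumptions, not the actual Phoenix implementation (which may use a library cache such as Guava's).

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

// Sketch of a JVM-wide cache whose entries expire after a fixed TTL, used to
// merge partial pre-images of a row as each column family is compacted.
final class ExpiringRowImageCache {
    private static final class Entry {
        final String image;
        final long expiresAtMs;
        Entry(String image, long expiresAtMs) {
            this.image = image;
            this.expiresAtMs = expiresAtMs;
        }
    }

    private final ConcurrentHashMap<String, Entry> cache = new ConcurrentHashMap<>();
    private final long ttlMs;
    private final LongSupplier clock; // injectable for deterministic testing

    ExpiringRowImageCache(long ttlMs, LongSupplier clock) {
        this.ttlMs = ttlMs;
        this.clock = clock;
    }

    void put(String rowKey, String preImage) {
        cache.put(rowKey, new Entry(preImage, clock.getAsLong() + ttlMs));
    }

    // Returns null if absent or expired; expired entries are removed lazily on read.
    String get(String rowKey) {
        Entry e = cache.get(rowKey);
        if (e == null) return null;
        if (e.expiresAtMs <= clock.getAsLong()) {
            cache.remove(rowKey);
            return null;
        }
        return e.image;
    }
}
```

With a 1200-second TTL, a pre-image written during the first CF's compaction remains visible while later CFs compact in the same JVM, and silently expires afterwards; as discussed below, this best-effort merging is not guaranteed if the RS crashes between CF compactions.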
Is it guaranteed that for multiple CFs all CompactionScanner instances will run in the same JVM? What if the RS aborts between the compactions of two CFs?
Yes, this is a best-case scenario only. Let me add a note here. Generating a TTL_DELETE event for a multi-CF table is not guaranteed to be accurate due to the edge case with a server crash.
Rather than a comment in the code, let me clarify this in the Jira.
Let me also evaluate doing the batch writes only during CompactionScanner close(). That way, we might be able to guarantee no incorrect values in the CDC event. Let me think about this.
Jira: PHOENIX-7677
Inserting a TTL_DELETE CDC event from the CompactionScanner for one expired row at a time can slightly degrade the write workload on the CDC index table.
To improve the performance of inserting TTL_DELETE events, the CompactionScanner can accumulate 50 mutations by default and then perform batchMutate(). Any mutations not yet inserted into the CDC index should be flushed with batchMutate() when the CompactionScanner closes.
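The proposed accumulate-then-flush behavior can be sketched as a small batching buffer. The class name, the `Consumer`-based sink, and the `String` mutations are illustrative stand-ins, not the Phoenix API; the real flush would call batchMutate() against the CDC index table.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch of the proposed batching: accumulate mutations up to a batch size
// (50 by default in the proposal) and flush them as one batch write; any
// remainder is flushed when the scanner closes.
final class CdcMutationBatcher implements AutoCloseable {
    private final int batchSize;
    private final Consumer<List<String>> batchMutate; // stand-in for batchMutate()
    private final List<String> pending = new ArrayList<>();

    CdcMutationBatcher(int batchSize, Consumer<List<String>> batchMutate) {
        this.batchSize = batchSize;
        this.batchMutate = batchMutate;
    }

    void add(String mutation) {
        pending.add(mutation);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    private void flush() {
        if (!pending.isEmpty()) {
            batchMutate.accept(new ArrayList<>(pending));
            pending.clear();
        }
    }

    @Override
    public void close() {
        flush(); // mutations not yet written go out when the scanner closes
    }
}
```

For example, 7 mutations with a batch size of 3 produce two full batches during scanning and one final batch of 1 at close(), so nothing is lost while per-row write overhead is amortized.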