Skip to content

optimized filtering #21

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

nvdk
Copy link
Member

@nvdk nvdk commented Jan 23, 2025

builds on #20, but has some optimizations wrt to filtering:

  • by grouping services on match pattern: this means we only do detection and filtering once for a given match pattern (takes the pattern and options.sendMatchesOnly into account as a grouping key)
  • by doing pattern matching before origin filter, since the latter seems more costly

edit: can be tested using nvdk/mu-delta-notifier:feature-optimized-filtering. see https://github.com/lblod/app-lblod-harvester/pull/79/files for an example

this greatly reduces the size of delta messages in stacks with large data changes.
@nvdk nvdk force-pushed the feature/optimized-filtering branch 2 times, most recently from 65ce8d9 to 1854db3 Compare January 23, 2025 15:16
Niels V added 2 commits January 23, 2025 16:19
this tries to optimize filtering and match detection in two ways:
- by grouping on match pattern: this means we only do detection and filtering
once for a given match pattern (takes the pattern and sendMatchesOnly into
account)
- by doing matching before origin filter, since the latter seems more costly
@nvdk nvdk force-pushed the feature/optimized-filtering branch from 1854db3 to 34004ff Compare January 23, 2025 15:33
Niels V added 3 commits January 23, 2025 17:00
this is not the prettiest of solutions, but it does achieve the same thing.
bonus is it only happens once on startup
it's mentioned in the expressjs best practices (and elsewhere) that process.env should be used sparingly:

| If you need to write environment-specific code, you can check the value of NODE_ENV with process.env.NODE_ENV. Be aware that checking the value of any environment variable incurs a performance penalty, and so should be done sparingly.
https://expressjs.com/th/advanced/best-practice-performance.html#set-node_env-to-production

it seems this is because process.env is a getter that processes the ENV array each time
}
} else {
if (process.env["DEBUG_DELTA_SEND"] || process.env["DEBUG_DELTA_NOT_SENDING_EMPTY"])
if (DEBUG_DELTA_SEND || DEBUG_DELTA_NOT_SENDING_EMPTY)
console.log(`Changeset empty. Not sending to ${entry.callback.method} ${entry.callback.url}`);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kept this, but I don't think this situation can occur.

Copy link
Contributor

@x-m-el x-m-el left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some optimizations are possible. I added comments in adjusted code below, because the suggestions are interlinked with each other, so separate comments could become confusing:

async function informWatchers( changeSets, res, muCallIdTrail, muSessionId ){
  /* This code used to be inside the full loop, but this is not needed. 
   * If the entry has `sendMatchesOnly` set to true, using changedTriples can be avoided (see below).
   * If `sendMatchesOnly` is set to false, the code using `changedTriples` will always need all triples
   * (because there is no extra "filtering").
   * so there is no need to recalculate it every time.
   */
  let allInserts = [];
  let allDeletes = [];
  changeSets.forEach( (change) => {
    allInserts = [...allInserts, ...change.insert];
    allDeletes = [...allDeletes, ...change.delete];
  } );
  const changedTriples = [...allInserts, ...allDeletes];

  // Iterate over each unique match pattern
  for (const matchKey in groupedServices) {
    const firstEntry = groupedServices[matchKey][0];
    // can use first entry since it's part of grouping
    const sendMatchesOnly = firstEntry.options.sendMatchesOnly;
    let maybePatternFilteredChangesets = changeSets;
    if (sendMatchesOnly) {
      maybePatternFilteredChangesets = filterChangesetsOnPattern(changeSets, firstEntry);
      /* Currently, fully empty change sets are send too, but I don't think 
       * this is desired if `sendMatchesOnly` is set to true?
       * Note: can also do this filtering in `filterChangesetsOnPattern` directly
       * (by having an if-case or similar around `filteredChangesets.push(clonedChangeSet);`)
       */
      maybePatternFilteredChangesets = maybePatternFilteredChangesets.filter(
         (change) =>
           change.insert.length > 0 ||
           change.delete.length > 0 ||  change.effectiveInsert.length > 0 ||
           change.effectiveDelete.length > 0
       );
    }

    /* `filterChangesetsOnPattern` already does all the filtering for matches,
     * so if `sendMatchesOnly` is true, we can just check if there are still
     * non-empty change sets left to send. If all are empty (or none are left,
     * in case of filtering them away as stated above), there are no matches with the spec.
     */
    const someTripleMatchedSpec = sendMatchesOnly
      ? maybePatternFilteredChangesets.length > 0 /* makes the assumption that maybePatternFilteredChangesets has no empty change sets. */
      : changedTriples.some((triple) =>
          tripleMatchesSpec(triple, firstEntry.match)
        );
  
    /* The if can be taken out of the forEach loop, to avoid the loop all together
     * if no triples will be send out. Note that `DEBUG_TRIPLE_MATCHES_SPEC` should
     * be moved before the `if` case then. However, this does change the debug output:
     * before it would print a line per match (true/false), now it will do this
     * per service group. The difference is negligible, so this change could be skipped.
     */ 
  
    if( someTripleMatchedSpec ) {
    const matchingServices = groupedServices[matchKey];
      matchingServices.forEach( async (entry) => {
   /// rest of the code
   });
   }

// for each entity
if( DEBUG_DELTA_MATCH )
console.log(`Checking if we want to send to ${entry.callback.url}`);
const matchSpec = entry.match;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unused code

Suggested change
const matchSpec = entry.match;

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants