
TemplateRegexMatcher.getStartRegex sometimes returns a regex that matches with an index before the license start #244

@sdheh

Version 1.1.11
Example 1: greedy regex after an optional part:

String licenseText = "ab cd text";
String licenseTemplate = "<<beginOptional>>cd<<endOptional>> <<var;name=\"copyright\";original=\"Copyright (c) <year> <copyright holders>  \";match=\".{0,5000}\">> text";
TemplateRegexMatcher templateRegexMatcher = new TemplateRegexMatcher(licenseTemplate);
String startRegex = templateRegexMatcher.getStartRegex(25);
System.out.println("start regex: " + startRegex);
Matcher matcher = Pattern.compile(startRegex).matcher(licenseText);
if (matcher.find()) {
    System.out.println("start index found: " + matcher.start());
}

Returns

start regex: (?im)(\Qcd\E\s*)?(.{0,5000})\Qtext\E\s*
start index found: 0

but the start index should be 3.

Example 2: greedy regex at the start:

String licenseText = "abtext";
String licenseTemplate = "<<var;name=\"copyright\";original=\"Copyright (c) <year> <copyright holders>  \";match=\".{0,5000}\">> text";
TemplateRegexMatcher templateRegexMatcher = new TemplateRegexMatcher(licenseTemplate);
String startRegex = templateRegexMatcher.getStartRegex(25);
System.out.println("start regex: " + startRegex);
Matcher matcher = Pattern.compile(startRegex).matcher(licenseText);
if (matcher.find()) {
    System.out.println("start index found: " + matcher.start());
}

Returns

start regex: (?im)(.?{0,5000})\Qtext\E\s*
start index found: 1

but the start index should be 2.
.?{0,5000} doesn't seem to work as expected. It is an unusual regex that some online regex websites say is invalid: https://regex101.com/r/l3810b/1, regexr.com/81kfo.
https://www.freeformatter.com/java-regex-tester.html says the regular expression is valid.
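Both results can be checked without the library by feeding the two regex strings printed above straight into java.util.regex. This is only a minimal sketch: the class name and the firstMatchStart helper are made up for illustration, and the patterns are copied from the output above rather than generated by getStartRegex. If the second call prints 1, that would be consistent with the pattern behaving like a plain `.?` in front of `text`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReportedStartRegexDemo {
    // Hypothetical helper: start index of the first match, or -1 if there is none.
    static int firstMatchStart(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        return matcher.find() ? matcher.start() : -1;
    }

    public static void main(String[] args) {
        // Regex strings copied from the "start regex:" output in the two examples.
        String example1 = "(?im)(\\Qcd\\E\\s*)?(.{0,5000})\\Qtext\\E\\s*";
        String example2 = "(?im)(.?{0,5000})\\Qtext\\E\\s*";

        // Example 1: reported as 0, although the license text starts at index 3.
        System.out.println("example 1 start: " + firstMatchStart(example1, "ab cd text"));
        // Example 2: reported as 1, although the license text starts at index 2.
        System.out.println("example 2 start: " + firstMatchStart(example2, "abtext"));
    }
}
```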

I think one way to fix this would be to offer a method that returns a regex for finding the beginning of the non-optional part. Otherwise, changing the regular expressions in these two cases to something like the following could work:

(?im)((\Qcd\E\s*)(.{0,5000})\Qtext\E\s*)|(\Qtext\E\s*)
(?im)\Qtext\E\s*

In the first case if there were multiple optional parts it would get even more complicated to do it correctly.
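The two suggested expressions can be tried against the example inputs with plain java.util.regex. This is a sketch under the assumption that the strings above are used verbatim; the class name and the firstMatchStart helper are hypothetical and are not part of TemplateRegexMatcher.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ProposedStartRegexCheck {
    // Hypothetical helper: start index of the first match, or -1 if there is none.
    static int firstMatchStart(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        return matcher.find() ? matcher.start() : -1;
    }

    public static void main(String[] args) {
        // Suggested replacement for example 1: the optional "cd" part is required
        // in the first alternative, and the second alternative starts at "text".
        String proposed1 = "(?im)((\\Qcd\\E\\s*)(.{0,5000})\\Qtext\\E\\s*)|(\\Qtext\\E\\s*)";
        // Suggested replacement for example 2: the leading variable part is dropped.
        String proposed2 = "(?im)\\Qtext\\E\\s*";

        System.out.println(firstMatchStart(proposed1, "ab cd text")); // prints 3
        System.out.println(firstMatchStart(proposed2, "abtext"));     // prints 2
    }
}
```

With these patterns the match can no longer begin in the unrelated text before the license, because nothing in front of the first literal part is allowed to consume arbitrary characters.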

goneall (Member) commented on Jun 15, 2024

Now that I review this issue, I do tend to agree that this is an issue if the method is being used without following on with the template matcher.

@sdheh - It looks like you have a pretty good handle on approaches to fix this. Can you create a pull request?


          TemplateRegexMatcher.getStartRegex sometimes returns a regex that matches with an index before the license start · Issue #244 · spdx/Spdx-Java-Library