Official Apache-1.1 license text is not being matched correctly by LicenseCompareHelper.matchingStandardLicenseIdsWithinText() #230

Description

@pmonks (Collaborator)

When org.spdx.utility.compare.LicenseCompareHelper.matchingStandardLicenseIdsWithinText() is run on the official Apache-1.1 license text, it fails to find any matches. I believe I've narrowed the problem down to the Clause5 alternative text (<alt>) tag in the template: if I remove the example header from the license text and run org.spdx.utility.compare.LicenseCompareHelper.isTextStandardLicense().getDifferenceMessage() on it, I get:

Variable text rule combined-bullet-Clause5 did not match the compare text starting at line #31 column #1 "5" while processing rule var: combined-bullet-Clause5

When I manually convert that <alt> tag into a Java regex and manually cleanse bullet 5 of the Apache-1.1 license text of comment characters and newlines, I do get a match, so I'm pretty confident the problem is in the library rather than the template. Beyond that, I'm not really sure what the root cause might be: comment character handling, regexification of that particular <alt> tag, or something else entirely.
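The manual cleansing step can be sketched roughly as follows. This is an illustration only, not the library's preprocessing: the class name CleanseBulletDemo, the helper cleanse, and the exact " * " comment layout of the input are assumptions made for the example.

```java
final class CleanseBulletDemo {
    // Assumed layout of bullet 5 inside a block comment (not copied from the issue).
    static final String COMMENTED =
        " * 5. Products derived from this software may not be called \"Apache\",\n"
        + " *    nor may \"Apache\" appear in their name, without prior written\n"
        + " *    permission of the Apache Software Foundation.";

    /** Hypothetical helper: drop leading "*" comment markers, then join the lines. */
    static String cleanse(String text) {
        // (?m) lets ^ match at the start of every line, so each marker is removed.
        String withoutMarkers = text.replaceAll("(?m)^[ \\t]*\\*[ \\t]?", "");
        // Collapse every run of whitespace (including line breaks) to one space.
        return withoutMarkers.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(cleanse(COMMENTED));
    }
}
```

The result is a single line starting with "5. Products derived", which is the form that a hand-built regex can be tested against.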

This was reproduced with Spdx-Java-Library v1.11 and SPDX license list v3.23.

Activity

pmonks (Collaborator, Author) commented on Mar 14, 2024

If it's helpful, I'm also seeing similar failures with the official Apache-1.0 license text, though I haven't troubleshot that to the same level of detail as I did with Apache-1.1.

goneall (Member) commented on Feb 26, 2025

This issue is due to line breaks in the original text not being taken into account in the regex in the license XML variable text tag.

Here is the regex:

(.{0,20})\s*((Products derived from this software may not be called\s.+nor may\s.+appear in their name, without prior written permission of\s.+.|Products may not include\s.+in their name, without prior written permission of\s.+.))

And the text it is attempting to match against (with the leading asterisks removed):

5. Products derived from this software may not be called "Apache",
    nor may "Apache" appear in their name, without prior written
    permission of the Apache Software Foundation.

If the line breaks are removed:

5. Products derived from this software may not be called "Apache", nor may "Apache" appear in their name, without prior written permission of the Apache Software Foundation.

the license text matches.
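The behaviour described above can be reproduced outside the library with plain java.util.regex. The sketch below is not the library's code: the class name Clause5RegexDemo and the helper matchesClause5 are made up for illustration, and using Matcher.matches() is an assumption about how the pattern is applied. Because this thread does not say which flags the library compiles the pattern with, the check is run both with default flags and with Pattern.DOTALL.

```java
import java.util.regex.Pattern;

final class Clause5RegexDemo {
    // The regex quoted above, copied verbatim (only Java string escaping added).
    static final String CLAUSE5_REGEX =
        "(.{0,20})\\s*((Products derived from this software may not be called\\s.+nor may\\s.+"
        + "appear in their name, without prior written permission of\\s.+."
        + "|Products may not include\\s.+in their name, without prior written permission of\\s.+.))";

    // Bullet 5 with its original line breaks, leading asterisks removed.
    static final String WRAPPED =
        "5. Products derived from this software may not be called \"Apache\",\n"
        + "    nor may \"Apache\" appear in their name, without prior written\n"
        + "    permission of the Apache Software Foundation.";

    // The same text with the line breaks removed.
    static final String JOINED =
        "5. Products derived from this software may not be called \"Apache\", "
        + "nor may \"Apache\" appear in their name, without prior written "
        + "permission of the Apache Software Foundation.";

    /** Hypothetical helper: full-match the text against the clause 5 regex. */
    static boolean matchesClause5(String text, int flags) {
        return Pattern.compile(CLAUSE5_REGEX, flags).matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesClause5(WRAPPED, 0));              // false
        System.out.println(matchesClause5(WRAPPED, Pattern.DOTALL)); // false
        System.out.println(matchesClause5(JOINED, 0));               // true
        System.out.println(matchesClause5(JOINED, Pattern.DOTALL));  // true
    }
}
```

The wrapped text fails under both flag settings because the literal space in "written permission" cannot match the line break plus indentation that separates those two words in the license text.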

This can be fixed in the template by replacing each literal space with [\r\n\s]+, e.g.:

(Products[\r\n\s]+derived[\r\n\s]+from[\r\n\s]+this[\r\n\s]+software[\r\n\s]+may[\r\n\s]+not[\r\n\s]+be[\r\n\s]+called[\r\n\s]+.+nor may[\r\n\s]+.+appear[\r\n\s]+in[\r\n\s]+their[\r\n\s]+name,[\r\n\s]+without[\r\n\s]+prior[\r\n\s]+written[\r\n\s]+permission[\r\n\s]+of[\r\n\s]+.+.|Products[\r\n\s]+may[\r\n\s]+not[\r\n\s]+include[\r\n\s]+.+in[\r\n\s]+their[\r\n\s]+name,[\r\n\s]+without[\r\n\s]+prior[\r\n\s]+written[\r\n\s]+permission[\r\n\s]+of[\r\n\s]+.+.)

pmonks (Collaborator, Author) commented on Feb 27, 2025

\s should match \r and \n, at least according to the JavaDocs, which implies the character class [\s\r\n] is redundant.

[edit] unless Unix newlines mode is enabled, it seems like.
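For reference, the java.util.regex.Pattern Javadoc defines \s as [ \t\n\x0B\f\r] and, as I read it, describes UNIX_LINES as changing only how ., ^ and $ treat line terminators. A small check along those lines follows; the class and helper names are made up for illustration.

```java
import java.util.regex.Pattern;

final class WhitespaceClassDemo {
    /** True if the whole input is matched by the regex under the given flags. */
    static boolean fullMatch(String regex, String input, int flags) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"\r", "\n", "\r\n", " \t\r\n "};
        for (String s : samples) {
            boolean explicitClass = fullMatch("[\\r\\n\\s]+", s, 0);
            boolean plainS = fullMatch("\\s+", s, 0);
            boolean unixLines = fullMatch("\\s+", s, Pattern.UNIX_LINES);
            // Every sample prints true three times.
            System.out.printf("%d chars: explicit=%b plain=%b unixLines=%b%n",
                    s.length(), explicitClass, plainS, unixLines);
        }
    }
}
```

If this holds for the JVM in use, [\r\n\s]+ and \s+ accept the same inputs, which supports the point that the longer character class is redundant.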

goneall (Member) commented on Feb 28, 2025

> \s should match \r and \n, at least according to the JavaDocs, which implies the character class [\s\r\n] is redundant.
>
> [edit] unless Unix newlines mode is enabled, it seems like.

Good point - I just tested using \s+ and it worked.

It looks like this can be fixed with a change to the license list XML source. I'll add a PR.
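A rough sketch of the \s+ variant, using plain java.util.regex rather than the library. The helper relaxSpaces is hypothetical and deliberately naive: it replaces every literal space, which would be wrong for a regex that contains spaces inside a character class. Compiling with Pattern.DOTALL is also an assumption made here, not something confirmed in this thread; without it, .+ cannot cross the line break after "Apache", in the wrapped text.

```java
import java.util.regex.Pattern;

final class FlexibleWhitespaceDemo {
    // Clause 5 regex as quoted earlier in the thread.
    static final String CLAUSE5_REGEX =
        "(.{0,20})\\s*((Products derived from this software may not be called\\s.+nor may\\s.+"
        + "appear in their name, without prior written permission of\\s.+."
        + "|Products may not include\\s.+in their name, without prior written permission of\\s.+.))";

    // Bullet 5 with its original line breaks, leading asterisks removed.
    static final String WRAPPED =
        "5. Products derived from this software may not be called \"Apache\",\n"
        + "    nor may \"Apache\" appear in their name, without prior written\n"
        + "    permission of the Apache Software Foundation.";

    /** Hypothetical helper: turn each literal space into \s+ (literal replacement). */
    static String relaxSpaces(String regex) {
        return regex.replace(" ", "\\s+");
    }

    static boolean matches(String regex, String text, int flags) {
        return Pattern.compile(regex, flags).matcher(text).matches();
    }

    public static void main(String[] args) {
        String relaxed = relaxSpaces(CLAUSE5_REGEX);
        System.out.println(matches(CLAUSE5_REGEX, WRAPPED, Pattern.DOTALL)); // false
        System.out.println(matches(relaxed, WRAPPED, Pattern.DOTALL));       // true
        System.out.println(matches(relaxed, WRAPPED, 0));                    // false
    }
}
```

The last line shows why the flag matters in this sketch: relaxing the spaces handles the break inside "written permission", but the variable parts (.+) still have to span a line break of their own.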

added a commit that references this issue on Feb 28, 2025

pmonks (Collaborator, Author) commented on Feb 28, 2025

Or might this be covered by the Whitespace section of the matching guidelines? It reads:

Purpose

To avoid the possibility of a non-match due to different spacing of words, line breaks, or paragraphs.

Guideline

All whitespace should be treated as a single blank space.

XML files do not require specific markup to implement this guideline.
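One way a matcher could apply this guideline without touching the template regexes is to collapse whitespace in the compare text before matching. The sketch below illustrates that idea only; it is not a description of what Spdx-Java-Library does, and the class name NormalizeWhitespaceDemo and helper collapseWhitespace are made up for the example.

```java
import java.util.regex.Pattern;

final class NormalizeWhitespaceDemo {
    // Clause 5 regex as quoted earlier in the thread, left unmodified.
    static final String CLAUSE5_REGEX =
        "(.{0,20})\\s*((Products derived from this software may not be called\\s.+nor may\\s.+"
        + "appear in their name, without prior written permission of\\s.+."
        + "|Products may not include\\s.+in their name, without prior written permission of\\s.+.))";

    // Bullet 5 with its original line breaks, leading asterisks removed.
    static final String WRAPPED =
        "5. Products derived from this software may not be called \"Apache\",\n"
        + "    nor may \"Apache\" appear in their name, without prior written\n"
        + "    permission of the Apache Software Foundation.";

    /** Hypothetical helper: treat all whitespace as a single blank space. */
    static String collapseWhitespace(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    static boolean matchesClause5(String text) {
        return Pattern.compile(CLAUSE5_REGEX).matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesClause5(WRAPPED));                     // false
        System.out.println(matchesClause5(collapseWhitespace(WRAPPED))); // true
    }
}
```

The trade-off is that normalizing the text keeps the templates simple, while rewriting the regexes keeps the text untouched; the thread does not settle which side should carry the guideline.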

goneall (Member) commented on Feb 28, 2025

See the discussion in spdx/license-list-XML#2669

I think we have to be careful where we draw the line on overwriting the regexes to implement the matching guidelines; it could get very complicated if we try to cover all of them. Overwriting the whitespace may make practical sense, however.

pmonks (Collaborator, Author) commented on Feb 28, 2025

Yeah, I asked about the potential need for precedence rules (both within the templates and across the matching guidelines) in spdx/license-list-XML#1617, but nobody responded at that time. FWIW, in my initial attempts at implementing matching (before I switched to Spdx-Java-Library) I was already starting to run into this kind of problem: how to interpret a given template and its regex fragments while also ensuring that the general matching guidelines were being correctly implemented.

added a commit that references this issue on Mar 6, 2025

goneall (Member) commented on Mar 6, 2025

reopened this on Apr 30, 2025

pmonks (Collaborator, Author) commented on Apr 30, 2025

I just retested this with version 2.0.0 of Spdx-Java-Library, and it seems to still be happening. Interestingly, org.spdx.utility.compare.LicenseCompareHelper.isTextStandardLicense() is working correctly, but org.spdx.utility.compare.LicenseCompareHelper.matchingStandardLicenseIdsWithinText() is not. This is occurring with both the Apache-1.0 and Apache-1.1 official license texts.

          Official Apache-1.1 license text is not being matched correctly by LicenseCompareHelper.matchingStandardLicenseIdsWithinText() · Issue #230 · spdx/Spdx-Java-Library