Add XPath predicate support to XPathExpressionDiscoverer #127

Merged
mvdbos merged 8 commits into master from copilot/allow-square-bracket-notation
Jan 17, 2026

Contributor

Copilot AI commented Jan 17, 2026

Implementation Complete ✅

Summary

Successfully implemented support for square-bracket notation (XPath predicates) in XPathExpressionDiscoverer with all code review feedback and static analysis issues addressed.

Changes Made

  • Created SimpleXPathExpressionDiscoverer - Preserves original strict behavior for backward compatibility
  • Enhanced XPathExpressionDiscoverer - Now supports advanced XPath expressions with predicates like //a[starts-with(@href, '/')]
  • Fixed Critical Validation Issue - Updated regex from #//a(\[|/|$)# to #//a(\[[^\]]*\])*$# to properly reject selectors where anchor is not the final element
  • Added Missing Test Coverage - Added tests for invalid patterns like //a[@class='link']//span, //a/text(), and //a/@href
  • Improved Documentation - Clarified that SimpleXPathExpressionDiscoverer supports predicates in the path but not on the anchor element itself
  • Improved Error Messages - Made error messages more consistent between both discoverers
  • Updated Test Comments - Clarified comments to be more accurate about what is and isn't supported
  • Documented Limitations - Added note about nested brackets limitation (extremely rare edge case)
  • Fixed Static Analysis Issues - Removed trailing whitespace to pass phpcs PSR2 checks

Testing Results

  • ✅ Unit Tests: 87/87 passed (26 discoverer tests including 3 new invalid pattern tests)
  • ✅ Code Coverage: 100% for SimpleXPathExpressionDiscoverer and XPathExpressionDiscoverer
  • ✅ Static Analysis: All checks pass (phpcs, phpmd, phan)
  • ✅ Validation: Properly rejects invalid patterns that select non-anchor elements
  • ✅ No regressions introduced

New Validation Behavior

The enhanced regex #//a(\[[^\]]*\])*$# now correctly:

  • ✅ Accepts: //a, //a[starts-with(@href, '/')], //div[@id='content']//a
  • ❌ Rejects: //a[@class='link']//span, //a/text(), //a/@href
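
The accept/reject behavior listed above can be reproduced with a minimal standalone sketch. It assumes the validation amounts to a single preg_match call with the pattern quoted in this description; targetsAnchor is a hypothetical helper name, not necessarily the method used in XPathExpressionDiscoverer.

```php
<?php
// Hypothetical helper: mirrors the pattern quoted in this PR description.
function targetsAnchor(string $selector): bool
{
    // '//a', then zero or more '[...]' predicates without ']' inside, then end of string.
    return preg_match('#//a(\[[^\]]*\])*$#', $selector) === 1;
}

$selectors = [
    "//a",
    "//a[starts-with(@href, '/')]",
    "//div[@id='content']//a",
    "//a[@class='link']//span",
    "//a/text()",
    "//a/@href",
];

foreach ($selectors as $selector) {
    // The first three selectors are accepted, the last three rejected.
    printf("%-32s %s\n", $selector, targetsAnchor($selector) ? 'accepted' : 'rejected');
}
```

Because the pattern is anchored only at the end of the string, anything may precede the final //a step, which is why //div[@id='content']//a passes.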

Known Limitation

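XPath predicates that contain a closing bracket, whether inside a string literal (e.g., //a[contains(@href, '[test]')]) or in a nested predicate, are not supported: the [^\]]* part of the regex ends each predicate at the first ]. This is expected to be a rare edge case that does not affect common use cases.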
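
The limitation can be reproduced with the quoted pattern alone. The sketch below is illustrative only; the nested-predicate selector is an example made up here, not one taken from the test suite.

```php
<?php
// Pattern quoted in this PR; [^\]]* stops at the first ']' it meets.
$pattern = '#//a(\[[^\]]*\])*$#';

// A ']' inside a string literal ends the predicate match too early.
$literalBrackets = "//a[contains(@href, '[test]')]";
// A predicate nested inside another predicate has the same effect.
$nestedPredicate = "//a[span[@class='icon']]";

var_dump(preg_match($pattern, $literalBrackets) === 1); // bool(false)
var_dump(preg_match($pattern, $nestedPredicate) === 1); // bool(false)
```

Both selectors are valid XPath that ends on an anchor, so they are false rejections rather than unsafe acceptances.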

Original prompt

This section details the original issue you should resolve

<issue_title>Allow square-bracket notation after anchor selector</issue_title>
<issue_description>Currently the XPathExpressionDiscoverer allows selectors ending with /a only. This means being more specific by using the square-bracket notation is not supported. However, this would make the spider so much more powerful and we wouldn't even have to change a lot.

An example of the square-bracket notation could be the following.

//a[starts-with(@href, '/') or starts-with(@href, '$url')]

To allow this, spider could either be less strict about the selector argument (maybe replacing endsWith with a regular expression) or move the validation of the selector argument away from the constructor (to a protected function maybe), such that when extending the XPathExpressionDiscoverer you could override such validation method and have your own selector validation.</issue_description>
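
The second option in the issue, moving validation out of the constructor body into an overridable method, could look roughly like the sketch below. Both class names and the isValidSelector method are hypothetical and do not reflect the library's actual API; the relaxed pattern is the one this PR ends up quoting.

```php
<?php
// Hypothetical classes for illustration; not the library's actual API.
class StrictAnchorDiscoverer
{
    protected string $selector;

    public function __construct(string $selector)
    {
        // Validation is delegated to a protected method so subclasses can replace it.
        if (!$this->isValidSelector($selector)) {
            throw new InvalidArgumentException("Selector must target 'a' elements: " . $selector);
        }
        $this->selector = $selector;
    }

    protected function isValidSelector(string $selector): bool
    {
        // Original strict rule: the selector has to end with '/a'.
        return substr($selector, -2) === '/a';
    }
}

class PredicateAnchorDiscoverer extends StrictAnchorDiscoverer
{
    protected function isValidSelector(string $selector): bool
    {
        // Relaxed rule: also allow predicates on the final anchor step.
        return preg_match('#//a(\[[^\]]*\])*$#', $selector) === 1;
    }
}

// The strict class would throw for this selector; the subclass accepts it.
$discoverer = new PredicateAnchorDiscoverer("//a[starts-with(@href, '/')]");
```

The merged PR took the other route (two separate classes with a regex in the new one), so this is only one way the idea from the issue might be realized.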

Comments on the Issue (you are @copilot in this section)

@mvdbos First of all, apologies for the late reaction. My day job has been busy.

The power of the spider is that it can use any Discoverer, and you can implement custom ones too. The examples show how to set the one you want to use.

Your suggestion sounds like a great idea. Would you be open to creating an implementation for the improved XPathExpressionDiscoverer? I would be happy to adopt it as the new one. We could rename the current one to SimpleXPathExpressionDiscoverer. example_simple.php could then keep using the simple discoverer and example_complex could use the new one with bracket notation.


Copilot AI and others added 4 commits January 17, 2026 14:27
Co-authored-by: mvdbos <1101757+mvdbos@users.noreply.github.com>
…erer with bracket notation support

Co-authored-by: mvdbos <1101757+mvdbos@users.noreply.github.com>
Co-authored-by: mvdbos <1101757+mvdbos@users.noreply.github.com>
…standard annotations

Co-authored-by: mvdbos <1101757+mvdbos@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Allow square-bracket notation after anchor selector" to "Add XPath predicate support to XPathExpressionDiscoverer" on Jan 17, 2026
Copilot AI requested a review from mvdbos January 17, 2026 14:41
@mvdbos mvdbos marked this pull request as ready for review January 17, 2026 14:44
Copilot AI review requested due to automatic review settings January 17, 2026 14:44
Contributor

Copilot AI left a comment

Pull request overview

This PR adds XPath predicate support (square-bracket notation) to the XPathExpressionDiscoverer by introducing a regex-based validation approach. To maintain backward compatibility, the original strict validation is preserved in a new SimpleXPathExpressionDiscoverer class.

Changes:

  • Created SimpleXPathExpressionDiscoverer with original strict /a ending validation
  • Enhanced XPathExpressionDiscoverer to accept predicates using regex pattern //a(\[|/|$)
  • Added comprehensive test coverage for both discoverers including predicate, path, and error cases

Reviewed changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| src/Discoverer/SimpleXPathExpressionDiscoverer.php | New class preserving original strict validation requiring selectors to end with /a |
| src/Discoverer/XPathExpressionDiscoverer.php | Enhanced with regex validation to support XPath predicates and complex expressions |
| tests/Discoverer/SimpleXPathExpressionDiscovererTest.php | Comprehensive test suite for simple discoverer including rejection of bracket notation |
| tests/Discoverer/XpathExpressionDiscovererTest.php | Extended tests covering predicates, nested paths, and multiple predicates |
| example/example_simple.php | Updated to use SimpleXPathExpressionDiscoverer for backward compatibility |
| example/example_complex.php | Updated to demonstrate predicate usage with XPathExpressionDiscoverer |


$discoverer = new XPathExpressionDiscoverer("//div[@id='content']");
$this->executeDiscoverer($discoverer);
}

Copilot AI Jan 17, 2026

Missing test coverage for XPath expressions where anchors are not the final selected element. The validation should reject expressions like //a[@class='link']//span or //a/text() that would select non-anchor elements or attributes. Add test cases to verify these invalid patterns are properly rejected, as they would fail at runtime when the discoverer attempts to extract href attributes from elements that don't have them.
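
Why these selectors are a problem at runtime can be seen with PHP's DOM extension directly. The sketch below is a hedged illustration: the HTML snippet and the extractHrefs helper are made up for the example, and it assumes the discoverer reads the href attribute from every node the selector matches.

```php
<?php
// Made-up document: one anchor with direct text and a nested span.
$html = '<html><body><a class="link" href="/docs">Docs <span>home</span></a></body></html>';

$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);

// Hypothetical helper: collect the href of every node matched by the selector.
function extractHrefs(DOMXPath $xpath, string $selector): array
{
    $hrefs = [];
    foreach ($xpath->query($selector) as $node) {
        // Only element nodes can carry attributes; text and attribute nodes cannot.
        $hrefs[] = $node instanceof DOMElement ? $node->getAttribute('href') : null;
    }
    return $hrefs;
}

foreach (["//a", "//a[@class='link']//span", "//a/text()", "//a/@href"] as $selector) {
    printf("%-28s %s\n", $selector, json_encode(extractHrefs($xpath, $selector)));
}
```

Only //a yields a usable link here. The span is an element without an href, so getAttribute returns an empty string, while //a/text() and //a/@href match text and attribute nodes that have no attributes at all.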

Contributor Author

Fixed in commit 08ffeab. Added test cases for testDiscoverWithAnchorFollowedByDescendant, testDiscoverWithAnchorText, and testDiscoverWithAnchorAttribute to verify invalid patterns are properly rejected.

$this->expectException(InvalidArgumentException::class);
$this->expectExceptionMessage("Please end your selector with '/a'");

// SimpleXPathExpressionDiscoverer does not support bracket notation

Copilot AI Jan 17, 2026

The comment "SimpleXPathExpressionDiscoverer does not support bracket notation" is misleading. The test on line 48 shows that SimpleXPathExpressionDiscoverer DOES accept bracket notation in the path before the anchor (e.g., //div[@id='content']//a). It only rejects bracket notation on the anchor element itself (e.g., //a[...]). Consider clarifying this comment to state "SimpleXPathExpressionDiscoverer does not support bracket notation on anchor elements" for accuracy.

Suggested change
- // SimpleXPathExpressionDiscoverer does not support bracket notation
+ // SimpleXPathExpressionDiscoverer does not support bracket notation on anchor elements

Contributor Author

Fixed in commit 08ffeab. Updated comment to: "SimpleXPathExpressionDiscoverer does not support bracket notation on anchor elements".

Comment on lines +11 to +14
*
* For more advanced XPath expressions with predicates (square-bracket notation),
* use XPathExpressionDiscoverer instead.
*

Copilot AI Jan 17, 2026

The documentation states "For more advanced XPath expressions with predicates (square-bracket notation), use XPathExpressionDiscoverer instead" but this is imprecise. SimpleXPathExpressionDiscoverer does support predicates in the XPath path before the anchor (e.g., //div[@id='content']//a is valid). It only restricts predicates directly on the anchor element itself. Consider clarifying: "For XPath expressions with predicates on anchor elements (square-bracket notation on the anchor), use XPathExpressionDiscoverer instead."

Suggested change
- *
- * For more advanced XPath expressions with predicates (square-bracket notation),
- * use XPathExpressionDiscoverer instead.
- *
+ *
+ * For XPath expressions with predicates on anchor elements (square-bracket
+ * notation on the anchor), use XPathExpressionDiscoverer instead.
+ *

Contributor Author

Fixed in commit 08ffeab. Updated documentation to clarify: "For XPath expressions with predicates on anchor elements (square-bracket notation on the anchor), use XPathExpressionDiscoverer instead."

* - //div[@id='content']//a
* - //a[@class='internal']
*
* For simple selectors that just end with '/a', you may also use SimpleXPathExpressionDiscoverer.

Copilot AI Jan 17, 2026

The documentation suggests SimpleXPathExpressionDiscoverer is for "simple selectors that just end with '/a'", but SimpleXPathExpressionDiscoverer actually supports complex paths like //div[@id='content']//a as long as the anchor has no predicates. The distinction is not about path complexity but about whether the anchor element itself has predicates. Consider clarifying: "For selectors where the anchor element has no predicates, you may also use SimpleXPathExpressionDiscoverer."

Suggested change
- * For simple selectors that just end with '/a', you may also use SimpleXPathExpressionDiscoverer.
+ * For selectors where the anchor element has no predicates (for example, //div[@id='content']//a), you may also use SimpleXPathExpressionDiscoverer.

Contributor Author

Fixed in commit 08ffeab. Updated documentation to clarify: "For selectors where the anchor element has no predicates (for example, //div[@id='content']//a), you may also use SimpleXPathExpressionDiscoverer."
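
The distinction drawn in this thread follows from a plain "ends with '/a'" check. The sketch below is a hypothetical reconstruction of such a check, not the code of SimpleXPathExpressionDiscoverer itself; the endsWith name and signature are assumptions.

```php
<?php
// Hypothetical reconstruction of a strict "ends with" rule; the real class may differ.
function endsWith(string $haystack, string $needle): bool
{
    $length = strlen($needle);
    // An empty needle trivially matches; substr($haystack, -0) would return the whole string.
    return $length === 0 || substr($haystack, -$length) === $needle;
}

var_dump(endsWith("//div[@id='content']//a", '/a')); // bool(true): predicate sits on the div
var_dump(endsWith("//a[@class='internal']", '/a'));  // bool(false): predicate sits on the anchor
```

A suffix check never inspects the earlier path steps, so predicates before the anchor pass unnoticed while any predicate on the anchor changes the ending and is rejected.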

return substr($haystack, -$length) === $needle;
// Match patterns that indicate the selector targets anchor elements
// Ensures //a is present and followed by [, /, or end of string
return preg_match('#//a(\[|/|$)#', $selector) === 1;

Copilot AI Jan 17, 2026

The validation regex #//a(\[|/|$)# accepts XPath expressions where anchor elements appear anywhere in the path, not necessarily as the final selected element. For example, //a[@class='link']//span would pass validation but would select span elements (not anchors), which lack href attributes that the discoverer tries to extract. The regex should ensure that the anchor element is the final element in the XPath, such as by using #//a(\[|$)# or ensuring no path separator follows after the anchor predicates.

Contributor Author

Fixed in commit 08ffeab. Updated regex from #//a(\[|/|$)# to #//a(\[[^\]]*\])*$# to ensure anchor is the final element. Now properly rejects patterns like //a[@class='link']//span.
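
The three patterns mentioned in this thread can be compared side by side. This is a minimal sketch using only preg_match; the array labels are chosen here for readability. Note that the pattern floated in the review comment, #//a(\[|$)#, would still accept //a[@class='link']//span, because it is not anchored to the end of the selector.

```php
<?php
$patterns = [
    'original'  => '#//a(\[|/|$)#',       // first version in this PR
    'suggested' => '#//a(\[|$)#',         // idea floated in the review comment
    'merged'    => '#//a(\[[^\]]*\])*$#', // version from commit 08ffeab
];

$selectors = ["//a[@class='link']//span", "//a/text()"];

foreach ($selectors as $selector) {
    foreach ($patterns as $label => $pattern) {
        $verdict = preg_match($pattern, $selector) === 1 ? 'accepted' : 'rejected';
        printf("%-26s %-10s %s\n", $selector, $label, $verdict);
    }
}
```

Only the merged pattern rejects both selectors, since it requires the string to end right after the anchor's predicates.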

Comment on lines +32 to +34
throw new InvalidArgumentException("Please end your selector with '/a': " .
"selectors should look for `a` elements " .
"so that the Discoverer can extract their `href` attribute for further crawling.");

Copilot AI Jan 17, 2026

The error message inconsistency could confuse users. The new XPathExpressionDiscoverer uses "must target anchor ('a') elements" while SimpleXPathExpressionDiscoverer says "Please end your selector with '/a'". Consider making the SimpleXPathExpressionDiscoverer error message also mention that it "must target anchor ('a') elements" for consistency, while still explaining the strict '/a' ending requirement.

Suggested change
- throw new InvalidArgumentException("Please end your selector with '/a': " .
- "selectors should look for `a` elements " .
- "so that the Discoverer can extract their `href` attribute for further crawling.");
+ throw new InvalidArgumentException(
+ "SimpleXPathExpressionDiscoverer selectors must target anchor ('a') elements and " .
+ "must end with '/a' so that the Discoverer can extract their `href` attribute for further crawling."
+ );

Contributor Author

Fixed in commit 08ffeab. Updated error message to be consistent: "SimpleXPathExpressionDiscoverer selectors must target anchor ('a') elements and must end with '/a' so that the Discoverer can extract their href attribute for further crawling."

Comment on lines +54 to +61
/**
* Validates that the selector targets anchor elements.
*
* Accepts selectors that:
* - End with '//a' (simple case)
* - End with '//a[...]' (with predicates)
* - Contain '//a[' (anchor with predicates anywhere in the path)
* - Contain '//a/' (anchor followed by more path)

Copilot AI Jan 17, 2026

The documentation claims the validation "Accepts selectors that... Contain '//a/' (anchor followed by more path)" which would incorrectly allow selectors like //a[@id='nav']//span that target descendant elements of anchors rather than the anchors themselves. This contradicts the stated purpose that selectors "should look for a elements so that the Discoverer can extract their href attribute." The documentation should clarify that only selectors where anchors are the final selected elements are valid.

Contributor Author

Fixed in commit 08ffeab. Updated validation and documentation to clarify that only selectors where anchors are the final selected elements are valid. The new regex #//a(\[[^\]]*\])*$# ensures this.

Owner

mvdbos commented Jan 17, 2026

@copilot please fix the build issues

…eedback

Co-authored-by: mvdbos <1101757+mvdbos@users.noreply.github.com>
Contributor Author

Copilot AI commented Jan 17, 2026

@copilot please fix the build issues

Build issues fixed in commit 08ffeab. All 87 tests now pass. The main fix was improving the validation regex to #//a(\[[^\]]*\])*$# to properly reject invalid XPath patterns, and adding comprehensive test coverage for edge cases.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • bar
    • Triggering command: /usr/bin/php php ./vendor/bin/phpunit (dns block)
  • foobar.org
    • Triggering command: /usr/bin/php php ./vendor/bin/phpunit (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Co-authored-by: mvdbos <1101757+mvdbos@users.noreply.github.com>
Owner

mvdbos commented Jan 17, 2026

@copilot fix static analysis errors

Co-authored-by: mvdbos <1101757+mvdbos@users.noreply.github.com>
Contributor Author

Copilot AI commented Jan 17, 2026

@copilot fix static analysis errors

Fixed in commit 4be1720. Removed trailing whitespace from docblocks to pass phpcs PSR2 checks. All static analysis checks now pass (phpcs, phpmd, phan).

@mvdbos mvdbos merged commit 593276f into master Jan 17, 2026
7 checks passed
@mvdbos mvdbos deleted the copilot/allow-square-bracket-notation branch January 17, 2026 17:23

Development

Successfully merging this pull request may close these issues.

Allow square-bracket notation after anchor selector

3 participants