Add XPath predicate support to XPathExpressionDiscoverer#127
Conversation
Co-authored-by: mvdbos <1101757+mvdbos@users.noreply.github.com>
…erer with bracket notation support Co-authored-by: mvdbos <1101757+mvdbos@users.noreply.github.com>
Co-authored-by: mvdbos <1101757+mvdbos@users.noreply.github.com>
…standard annotations Co-authored-by: mvdbos <1101757+mvdbos@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This PR adds XPath predicate support (square-bracket notation) to the XPathExpressionDiscoverer by introducing a regex-based validation approach. To maintain backward compatibility, the original strict validation is preserved in a new SimpleXPathExpressionDiscoverer class.
Changes:
- Created
SimpleXPathExpressionDiscovererwith original strict/aending validation - Enhanced
XPathExpressionDiscovererto accept predicates using regex pattern//a(\[|/|$) - Added comprehensive test coverage for both discoverers including predicate, path, and error cases
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| src/Discoverer/SimpleXPathExpressionDiscoverer.php | New class preserving original strict validation requiring selectors to end with /a |
| src/Discoverer/XPathExpressionDiscoverer.php | Enhanced with regex validation to support XPath predicates and complex expressions |
| tests/Discoverer/SimpleXPathExpressionDiscovererTest.php | Comprehensive test suite for simple discoverer including rejection of bracket notation |
| tests/Discoverer/XpathExpressionDiscovererTest.php | Extended tests covering predicates, nested paths, and multiple predicates |
| example/example_simple.php | Updated to use SimpleXPathExpressionDiscoverer for backward compatibility |
| example/example_complex.php | Updated to demonstrate predicate usage with XPathExpressionDiscoverer |
|
|
||
| $discoverer = new XPathExpressionDiscoverer("//div[@id='content']"); | ||
| $this->executeDiscoverer($discoverer); | ||
| } |
There was a problem hiding this comment.
Missing test coverage for XPath expressions where anchors are not the final selected element. The validation should reject expressions like //a[@class='link']//span or //a/text() that would select non-anchor elements or attributes. Add test cases to verify these invalid patterns are properly rejected, as they would fail at runtime when the discoverer attempts to extract href attributes from elements that don't have them.
There was a problem hiding this comment.
Fixed in commit 08ffeab. Added test cases for testDiscoverWithAnchorFollowedByDescendant, testDiscoverWithAnchorText, and testDiscoverWithAnchorAttribute to verify invalid patterns are properly rejected.
| $this->expectException(InvalidArgumentException::class); | ||
| $this->expectExceptionMessage("Please end your selector with '/a'"); | ||
|
|
||
| // SimpleXPathExpressionDiscoverer does not support bracket notation |
There was a problem hiding this comment.
The comment "SimpleXPathExpressionDiscoverer does not support bracket notation" is misleading. The test on line 48 shows that SimpleXPathExpressionDiscoverer DOES accept bracket notation in the path before the anchor (e.g., //div[@id='content']//a). It only rejects bracket notation on the anchor element itself (e.g., //a[...]). Consider clarifying this comment to state "SimpleXPathExpressionDiscoverer does not support bracket notation on anchor elements" for accuracy.
| // SimpleXPathExpressionDiscoverer does not support bracket notation | |
| // SimpleXPathExpressionDiscoverer does not support bracket notation on anchor elements |
There was a problem hiding this comment.
Fixed in commit 08ffeab. Updated comment to: "SimpleXPathExpressionDiscoverer does not support bracket notation on anchor elements".
| * | ||
| * For more advanced XPath expressions with predicates (square-bracket notation), | ||
| * use XPathExpressionDiscoverer instead. | ||
| * |
There was a problem hiding this comment.
The documentation states "For more advanced XPath expressions with predicates (square-bracket notation), use XPathExpressionDiscoverer instead" but this is imprecise. SimpleXPathExpressionDiscoverer does support predicates in the XPath path before the anchor (e.g., //div[@id='content']//a is valid). It only restricts predicates directly on the anchor element itself. Consider clarifying: "For XPath expressions with predicates on anchor elements (square-bracket notation on the anchor), use XPathExpressionDiscoverer instead."
| * | |
| * For more advanced XPath expressions with predicates (square-bracket notation), | |
| * use XPathExpressionDiscoverer instead. | |
| * | |
| * | |
| * For XPath expressions with predicates on anchor elements (square-bracket | |
| * notation on the anchor), use XPathExpressionDiscoverer instead. | |
| * |
There was a problem hiding this comment.
Fixed in commit 08ffeab. Updated documentation to clarify: "For XPath expressions with predicates on anchor elements (square-bracket notation on the anchor), use XPathExpressionDiscoverer instead."
| * - //div[@id='content']//a | ||
| * - //a[@class='internal'] | ||
| * | ||
| * For simple selectors that just end with '/a', you may also use SimpleXPathExpressionDiscoverer. |
There was a problem hiding this comment.
The documentation suggests SimpleXPathExpressionDiscoverer is for "simple selectors that just end with '/a'", but SimpleXPathExpressionDiscoverer actually supports complex paths like //div[@id='content']//a as long as the anchor has no predicates. The distinction is not about path complexity but about whether the anchor element itself has predicates. Consider clarifying: "For selectors where the anchor element has no predicates, you may also use SimpleXPathExpressionDiscoverer."
| * For simple selectors that just end with '/a', you may also use SimpleXPathExpressionDiscoverer. | |
| * For selectors where the anchor element has no predicates (for example, //div[@id='content']//a), you may also use SimpleXPathExpressionDiscoverer. |
| return substr($haystack, -$length) === $needle; | ||
| // Match patterns that indicate the selector targets anchor elements | ||
| // Ensures //a is present and followed by [, /, or end of string | ||
| return preg_match('#//a(\[|/|$)#', $selector) === 1; |
There was a problem hiding this comment.
The validation regex #//a(\[|/|$)# accepts XPath expressions where anchor elements appear anywhere in the path, not necessarily as the final selected element. For example, //a[@class='link']//span would pass validation but would select span elements (not anchors), which lack href attributes that the discoverer tries to extract. The regex should ensure that the anchor element is the final element in the XPath, such as by using #//a(\[|$)# or ensuring no path separator follows after the anchor predicates.
There was a problem hiding this comment.
Fixed in commit 08ffeab. Updated regex from #//a(\[|/|$)# to #//a(\[[^\]]*\])*$# to ensure anchor is the final element. Now properly rejects patterns like //a[@class='link']//span.
| throw new InvalidArgumentException("Please end your selector with '/a': " . | ||
| "selectors should look for `a` elements " . | ||
| "so that the Discoverer can extract their `href` attribute for further crawling."); |
There was a problem hiding this comment.
The error message inconsistency could confuse users. The new XPathExpressionDiscoverer uses "must target anchor ('a') elements" while SimpleXPathExpressionDiscoverer says "Please end your selector with '/a'". Consider making the SimpleXPathExpressionDiscoverer error message also mention that it "must target anchor ('a') elements" for consistency, while still explaining the strict '/a' ending requirement.
| throw new InvalidArgumentException("Please end your selector with '/a': " . | |
| "selectors should look for `a` elements " . | |
| "so that the Discoverer can extract their `href` attribute for further crawling."); | |
| throw new InvalidArgumentException( | |
| "SimpleXPathExpressionDiscoverer selectors must target anchor ('a') elements and " . | |
| "must end with '/a' so that the Discoverer can extract their `href` attribute for further crawling." | |
| ); |
There was a problem hiding this comment.
Fixed in commit 08ffeab. Updated error message to be consistent: "SimpleXPathExpressionDiscoverer selectors must target anchor ('a') elements and must end with '/a' so that the Discoverer can extract their href attribute for further crawling."
| /** | ||
| * Validates that the selector targets anchor elements. | ||
| * | ||
| * Accepts selectors that: | ||
| * - End with '//a' (simple case) | ||
| * - End with '//a[...]' (with predicates) | ||
| * - Contain '//a[' (anchor with predicates anywhere in the path) | ||
| * - Contain '//a/' (anchor followed by more path) |
There was a problem hiding this comment.
The documentation claims the validation "Accepts selectors that... Contain '//a/' (anchor followed by more path)" which would incorrectly allow selectors like //a[@id='nav']//span that target descendant elements of anchors rather than the anchors themselves. This contradicts the stated purpose that selectors "should look for a elements so that the Discoverer can extract their href attribute." The documentation should clarify that only selectors where anchors are the final selected elements are valid.
There was a problem hiding this comment.
Fixed in commit 08ffeab. Updated validation and documentation to clarify that only selectors where anchors are the final selected elements are valid. The new regex #//a(\[[^\]]*\])*$# ensures this.
|
@copilot please fix the build issues |
…eedback Co-authored-by: mvdbos <1101757+mvdbos@users.noreply.github.com>
Build issues fixed in commit 08ffeab. All 87 tests now pass. The main fix was improving the validation regex to Warning Firewall rules blocked me from connecting to one or more addresses (expand for details)I tried to connect to the following addresses, but was blocked by firewall rules:
If you need me to access, download, or install something from one of these locations, you can either:
|
Co-authored-by: mvdbos <1101757+mvdbos@users.noreply.github.com>
|
@copilot fix static analysis errors |
Co-authored-by: mvdbos <1101757+mvdbos@users.noreply.github.com>
Implementation Complete ✅
Summary
Successfully implemented support for square-bracket notation (XPath predicates) in XPathExpressionDiscoverer with all code review feedback and static analysis issues addressed.
Changes Made
//a[starts-with(@href, '/')]#//a(\[|/|$)#to#//a(\[[^\]]*\])*$#to properly reject selectors where anchor is not the final element//a[@class='link']//span,//a/text(), and//a/@hrefTesting Results
New Validation Behavior
The enhanced regex
#//a(\[[^\]]*\])*$#now correctly://a,//a[starts-with(@href, '/')],//div[@id='content']//a//a[@class='link']//span,//a/text(),//a/@hrefKnown Limitation
XPath predicates containing nested brackets (e.g.,
//a[contains(@href, '[test]')]) are not supported due to regex limitations. This is an extremely rare edge case that doesn't affect common use cases.Original prompt
💬 We'd love your input! Share your thoughts on Copilot coding agent in our 2 minute survey.