feat(extractor): extract default HTML metadata + cache XPath expressions by marevol · Pull Request #164 · codelibs/fess-crawler

marevol · 2026-05-04T22:48:06Z

Summary

Extract standard HTML metadata by default: title, description, OpenGraph (og:title, og:description, og:image, og:type, og:url), Twitter Card, canonical URL, keywords, author.
Parse <script type="application/ld+json"> blocks; expose jsonld.type and jsonld.raw (multivalue). Malformed JSON is skipped with a warn log.
Cache compiled XPathExpression per thread (via ThreadLocal<Map> and ThreadLocal<XPath>) to eliminate per-call recompilation under high crawl rate.
Add setDefaultFieldRules(Map), setExtractDefaultMetadata(boolean), setExtractJsonLd(boolean) for full opt-out / customization. New clearXPathCache() for dynamic rule changes.
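
The per-thread caching idea can be sketched with the JDK's `javax.xml.xpath` API alone. This is a minimal illustration, not the PR's code: the class `XPathCacheSketch` and the `cacheSize()` helper are hypothetical names, and only `clearXPathCache()` mirrors a method name from the summary above.

```java
import java.util.HashMap;
import java.util.Map;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

/** Hypothetical helper illustrating a per-thread XPathExpression cache. */
final class XPathCacheSketch {
    // XPath objects are not thread-safe, so each thread gets its own instance.
    private static final ThreadLocal<XPath> XPATH =
            ThreadLocal.withInitial(() -> XPathFactory.newInstance().newXPath());

    // Per-thread cache keyed by the expression string.
    private static final ThreadLocal<Map<String, XPathExpression>> CACHE =
            ThreadLocal.withInitial(HashMap::new);

    /** Returns the cached compiled expression, compiling it on first use. */
    static XPathExpression compile(String expression) {
        Map<String, XPathExpression> cache = CACHE.get();
        XPathExpression compiled = cache.get(expression);
        if (compiled == null) {
            try {
                compiled = XPATH.get().compile(expression);
            } catch (XPathExpressionException e) {
                // Assumption: callers decide how to report a bad expression.
                throw new IllegalArgumentException("Invalid XPath: " + expression, e);
            }
            cache.put(expression, compiled);
        }
        return compiled;
    }

    /** Clears the cache of the calling thread only. */
    static void clearXPathCache() {
        CACHE.get().clear();
    }

    /** Number of expressions cached for the calling thread. */
    static int cacheSize() {
        return CACHE.get().size();
    }
}
```

Because the map is thread-local, no locking is needed, but a clear issued on one thread does not affect the caches held by other threads.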

Why

Most search use cases want title and description for snippet rendering; users currently have to wire each XPath manually. JSON-LD provides high-quality structured data signals. XPath compilation in a hot loop was wasted work.

Threat model

HTML content is untrusted. JSON-LD parsing uses Jackson with default settings; malformed input is caught and logged, never fails extraction. XPath cache key is the expression string (admin-configured), bounded by configured rules — no unbounded growth from untrusted input.

Tests

12 new tests added (total 18 passing); existing 6 still pass.
Default metadata extraction: title, description, OpenGraph, canonical, keywords, author.
JSON-LD: single block, multiple blocks with array @type, malformed JSON resilience.
XPath cache: same compiled instance reused across calls; clearXPathCache() empties cache.
Opt-out flags actually disable each subsystem.
User-provided rule map overrides defaults.

Verification

mvn -pl fess-crawler test -Dtest=HtmlExtractorTest → 18/18 pass.
mvn -pl fess-crawler test → 1706 run, 0 failures, 55 pre-existing env-dependent errors (Docker/LibreOffice).
mvn formatter:format && mvn license:format clean.

Test plan

CI green
Manual review of XPath cache thread-safety (per-thread cache + ThreadLocal XPath)
Verify no regression on existing fixture tests

Populate ExtractData with standard HTML metadata by default (title, description, OpenGraph, Twitter Card, canonical, keywords, author), parse <script type="application/ld+json"> blocks into jsonld.type and jsonld.raw, and cache compiled XPathExpression objects per thread to eliminate per-call recompilation under high crawl rates. The default-field rule map is fully overridable via setDefaultFieldRules and both subsystems can be disabled independently with setExtractDefaultMetadata / setExtractJsonLd. Malformed JSON-LD blocks are logged and skipped without aborting extraction.

Three regressions / gaps were uncovered in the HtmlExtractor PR #164 review:

1. Malformed XPath expressions (in contentXpath or metadataXpathMap) used to log a warning and yield empty values: XPathAPI.eval threw XPathException for both compile and evaluate failures and the catch handled them uniformly. The compile cache split that into a separate getXPathExpression path that throws CrawlerSystemException, which was not caught downstream and therefore propagated out of createExtractData, aborting the whole extraction. Catch CrawlerSystemException in getStringsByXPath (and in extractJsonLd, for symmetry) and restore the warn+empty contract.

2. extractJsonLd unconditionally putValues for jsonld.raw / jsonld.type, silently overwriting any value that an operator-supplied addMetadata("jsonld.raw"/"jsonld.type", ...) rule had already populated. Mirror the precedence rule used by applyDefaultFieldRules: only auto-populate when the key is absent.

3. collectTypeNodes only inspected @type on the immediate object (or array elements). Schema.org markup commonly nests typed entities under @graph, mainEntity, author, publisher, etc.; those @type values were therefore never exposed via jsonld.type. Walk every object child recursively (skipping @type / @context to avoid double-collection and vocabulary leakage). Recursion is bounded by the parser's existing JSONLD_MAX_NESTING_DEPTH guard.

Five regression tests added: malformed metadata XPath, malformed contentXpath, custom jsonld metadata key precedence, @graph type collection, and the @context-object negative case.
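
The recursive @type walk described in item 3 might look roughly like the following. It is a sketch under stated assumptions: parsed JSON is modelled as plain `Map` / `List` / `String` values rather than Jackson nodes, the class `JsonLdTypeSketch` and its methods are hypothetical, and the sketch carries its own `MAX_DEPTH` limit, whereas the commit relies on the parser's JSONLD_MAX_NESTING_DEPTH guard.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of collecting @type values from nested JSON-LD. */
final class JsonLdTypeSketch {
    // Assumed local bound; the commit relies on the parser's nesting guard instead.
    static final int MAX_DEPTH = 32;

    static List<String> collectTypes(Object node) {
        List<String> out = new ArrayList<>();
        walk(node, 0, out);
        return out;
    }

    private static void walk(Object node, int depth, List<String> out) {
        if (depth > MAX_DEPTH) {
            return;
        }
        if (node instanceof List<?>) {
            for (Object item : (List<?>) node) {
                walk(item, depth + 1, out);
            }
        } else if (node instanceof Map<?, ?>) {
            Map<?, ?> map = (Map<?, ?>) node;
            addTypes(map.get("@type"), out);
            for (Map.Entry<?, ?> entry : map.entrySet()) {
                Object key = entry.getKey();
                // Skip @type (already collected) and @context (vocabulary, not entities).
                if ("@type".equals(key) || "@context".equals(key)) {
                    continue;
                }
                walk(entry.getValue(), depth + 1, out);
            }
        }
    }

    // @type may be a single string or an array of strings.
    private static void addTypes(Object type, List<String> out) {
        if (type instanceof String) {
            out.add((String) type);
        } else if (type instanceof List<?>) {
            for (Object t : (List<?>) type) {
                if (t instanceof String) {
                    out.add((String) t);
                }
            }
        }
    }
}
```

With an input whose `@graph` holds an `Article` with a nested `author` of type `Person`, the walk collects both types, while an `@type` appearing inside an `@context` object is ignored.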

…s blank, match JSON-LD type case-insensitively

Two regressions surfaced in code review of PR #164:

1. extractor.xml in fess-crawler-lasta registers addMetadata("title", "//TITLE"), so the metadataXpathMap loop unconditionally calls putValues("title", []) on pages without a <title>. The default-rule existence check (getValues != null) then sees the empty array and skips the og:title fallback, silently disabling the PR's "extract default HTML metadata" intent in real deployments. Switch the predicate to "has a non-blank value" so default rules backfill when the custom rule produced nothing.

2. JSONLD_XPATH matched only the literal lowercase 'application/ld+json'. Per RFC 6838 / HTML5 the type attribute is case-insensitive and may carry surrounding whitespace; NekoHTML uppercases element names but preserves attribute values verbatim, so 'Application/LD+JSON' or ' application/ld+json ' was missed. Use translate(normalize-space(@type), ...) so common real-world variants are picked up.
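
The case-insensitive match from item 2 can be tried out with the JDK's XPath 1.0 engine. This is a sketch only: the exact expression in the commit is not shown above, so `SKETCH_XPATH` is an assumed reconstruction, `JsonLdXPathSketch` and `countJsonLdBlocks` are hypothetical names, and the input is parsed as well-formed XML with a lowercase `script` element rather than through NekoHTML.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical illustration of matching the JSON-LD type attribute leniently. */
final class JsonLdXPathSketch {
    // Assumed expression: trim and collapse whitespace, then lowercase via translate().
    static final String SKETCH_XPATH =
            "//script[translate(normalize-space(@type), "
            + "'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz')"
            + " = 'application/ld+json']";

    /** Counts script elements whose type matches, ignoring case and outer whitespace. */
    static int countJsonLdBlocks(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(SKETCH_XPATH, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            // Assumption: the sketch simply surfaces parse or XPath errors to the caller.
            throw new IllegalArgumentException(e);
        }
    }
}
```

XPath 1.0 has no lower-case function, which is why `translate` with explicit alphabets is the usual workaround; `normalize-space` handles the leading and trailing whitespace.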

marevol added 4 commits May 5, 2026 07:46

fix(extractor): harden HtmlExtractor JSON-LD DoS, add ThreadLocal cle…

c114ac4

…anup, warn on metadata collisions

marevol wants to merge 4 commits into master from fix/extractor-html-metadata
