
Conversation

@silentninja

Fixes #32

Skips unverified sitemap links by checking them against the sitemap URLs declared in the robots.txt of the target link's host.

@sebastian-nagel (Collaborator) left a comment

Hi @silentninja, thanks for the contribution! The code looks good and the accompanying unit test is very appreciated.

I see that the PR is still marked as "draft", but here is one comment already, as it might require a deeper change:

How are sitemap indexes handled? For example, in the following constellation:

robots.txt -> sitemap-index.xml -> sitemap-news.xml.gz
                                -> sitemap-video.xml.gz
                                -> sitemap-books.xml.gz

In that case, looking into the robots.txt alone is not enough, and a recursive lookup into a sitemap to check whether it's an index seems too expensive, especially because the robots.txt is likely cached while sitemaps definitely are not.

What about keeping a trace in the status index recording, for any sitemap, from which robots.txt it was detected? A similar feature is already available in StormCrawler to trace the seed origin of URLs, see metadata.track.path in MetadataTransfer. Of course, it would be sufficient to track only the original robots.txt host name(s). Note: it's not uncommon that a sitemap is referenced from multiple hosts. This way it would not even be necessary to fetch any robots.txt in case it is not found in the cache.
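
For illustration, a minimal sketch of enabling the path tracking in a StormCrawler configuration (the property names are real; the file layout follows the usual crawler-conf.yaml structure):

    # crawler-conf.yaml (excerpt): record, for every emitted URL, the
    # chain of URLs through which it was discovered
    config:
      metadata.track.path: true
      # optional: also track the discovery depth
      metadata.track.depth: true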

One real example of such a "(news) sitemap detection chain":

https://www.anews.com.tr/robots.txt
  -> https://www.anews.com.tr/sitemap/index.xml
     -> https://www.anews.com.tr/sitemap/news.xml

@silentninja (Author) commented Feb 26, 2025

Thanks for the review @sebastian-nagel.

What about keeping a trace in the status index recording, for any sitemap, from which robots.txt it was detected? A similar feature is already available in StormCrawler to trace the seed origin of URLs, see metadata.track.path in MetadataTransfer

Why do we need to keep traces in the status index if metadata.track.path already contains the trace?

How are sitemap indexes handled

Sitemap indexes are tricky, especially for sitemaps with an incomplete trace. For example, if the seed is https://www.anews.com.tr/sitemap/news.xml in the anews example, we would still have to recursively fetch the sitemap index to find out the lineage in case the domains are different.

@silentninja (Author) commented Feb 26, 2025

@sebastian-nagel I made some commits to check the path using MetadataTransfer as you suggested. Could you check whether this makes sense?

@silentninja marked this pull request as ready for review on March 4, 2025 at 16:20.
@silentninja (Author)

cross-submits within the pay-level domain are definitely not safe for large hosting domains (blogspot.com, github.io, etc.) and would allow injecting spam links

@sebastian-nagel I addressed most of your concerns, except for filtering out the large hosting domains. The Apache HttpClient which we use to get the root domain does not differentiate these private domains. We would need to build a separate list of these large hosting domains from the public suffix list and filter them out. I will create an issue to track it.

@sebastian-nagel (Collaborator)

The Apache HttpClient which we use to get the root domain does not differentiate these private domains. We would need to build a separate list of these large hosting domains from the public suffix list and filter them out.

This is already implemented in crawler-commons' EffectiveTldFinder. It's already a dependency, but we should upgrade it.
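
A minimal sketch of the difference, assuming the three-argument getAssignedDomain(host, strict, excludePrivate) variant available in newer crawler-commons releases (the host names are made up):

    import crawlercommons.domains.EffectiveTldFinder;

    public class AssignedDomainDemo {
        public static void main(String[] args) {
            // blogspot.com is listed in the PRIVATE section of the public
            // suffix list; with private suffixes included, each blog gets
            // its own assigned domain, which is what the cross-submit
            // check needs for large hosting domains
            System.out.println(EffectiveTldFinder.getAssignedDomain(
                    "alice.blogspot.com", false, false)); // alice.blogspot.com

            // excluding the PRIVATE section collapses all blogs into the
            // hosting domain itself
            System.out.println(EffectiveTldFinder.getAssignedDomain(
                    "alice.blogspot.com", false, true)); // blogspot.com
        }
    }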

Commit: …ttp.conn.util.PublicSuffixMatcher to get the hostnames when checking for cross submit
@silentninja (Author)

The Apache HttpClient which we use to get the root domain does not differentiate these private domains. We would need to build a separate list of these large hosting domains from the public suffix list and filter them out.

This is already implemented in crawler-commons' EffectiveTldFinder. It's already a dependency, but we should upgrade it.

Neat! I updated the PR accordingly. Thanks for the suggestion!

@sebastian-nagel (Collaborator) left a comment

Very sorry, @silentninja, for the overlong delay... I'd totally understand if you cannot continue working on this PR now. In that case, I'd be happy to take it over. Let me know... In any case, thank you very much for your work!

}

// Cross-host checks
Metadata targetMetadata = metadataTransfer.getMetaForOutlink(targetURL.toString(), sitemapURL.toString(), metadata);
@sebastian-nagel (Collaborator) commented:

My hint about the MetadataTransfer wasn't about using it for link checking, but about forwarding the robots.txt URL to the record of a sitemap URL.

This can be done by setting the configuration property metadata.track.path to true. The MetadataTransfer is called on all outlinks and also on redirects from the StatusEmitterBolt.

Two caveats:

  • it is not the robots.txt URL that is tracked, but the URL which was about to be fetched and which triggered the robots.txt to be fetched and checked. This should be equivalent, though.
  • in case there is a chain of redirects and/or a sitemap index record between the robots.txt and the news sitemap, all intermediate "hops" are recorded. The first element of Metadata.urlPathKeyName is then the original URL. Note: better to use the string constant than the literal "url.path" (see the sketch below).
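
A minimal sketch of reading the tracked origin back, assuming metadata.track.path is enabled and a Metadata instance is at hand (the surrounding bolt context is omitted):

    // the tracked discovery path, oldest entry first
    String[] path = metadata.getValues(Metadata.urlPathKeyName);
    if (path != null && path.length > 0) {
        // the first element is the original URL, i.e. the one which
        // triggered the robots.txt to be fetched and checked
        String origin = path[0];
        try {
            String originHost = new java.net.URI(origin).getHost();
            // compare originHost with the host of the sitemap outlink here
        } catch (java.net.URISyntaxException e) {
            // malformed origin URL: treat the link as unverified
        }
    }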

}

public String getHost(URI url) {
    if (this.crossSubmitLenient) {
@sebastian-nagel (Collaborator) commented:

👍 Nice. That's a good, lean solution.

ol.getMetadata().setValue(Constants.STATUS_ERROR_MESSAGE, errorMessage);
Values v = new Values(ol.getTargetURL(), ol.getMetadata(), Status.ERROR);
collector.emit(StatusStreamName, tuple, v);
@sebastian-nagel (Collaborator) commented:

Good question whether URLs failing the cross-submit check should be emitted at all:

  • it unnecessarily fills up the status index (these are URLs which are not meant to be fetched)
  • the error status is not necessarily a permanent one: depending on the configuration, an error page is retried after some time, so the cross-submit check would be circumvented after a delay (an alternative is sketched below)
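
A possible alternative, sketched under the assumption that the check runs while iterating over the sitemap outlinks (the names crossSubmitOk and ol are hypothetical): log the rejected URL and drop it instead of emitting it on the status stream.

    // instead of emitting Status.ERROR for a URL that was never meant
    // to be fetched, log it and skip the emit entirely
    if (!crossSubmitOk) {
        LOG.info("Ignoring cross-submitted sitemap link {}", ol.getTargetURL());
        continue;
    }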

* URLs from example.net fail since their robots reference a different sitemap index.
*/
@Test
public void test_cross_host_submission_sitemaps() throws IOException, UnknownFormatException, URISyntaxException {
@sebastian-nagel (Collaborator) commented:

Not a canonical Java method name; test methods are conventionally camelCase, e.g. testCrossHostSubmissionSitemaps.

SitemapType.SITEMAP, type);
}

@Test
@sebastian-nagel (Collaborator) commented:

👍 Good to have a unit test!

@silentninja (Author)

Very sorry, @silentninja, for the overlong delay... I'd totally understand if you cannot continue working on this PR now. In that case, I'd be happy to take it over. Let me know... In any case, thank you very much for your work!

Thanks for reviewing the PR! I'm definitely still interested in working on it. I've been tied up with moving to a new house, but I’ll be able to pick this back up next week.
