Improve extraction of host names and registered domains

- no host name is extracted in the following situations:
  - URL contains 4 slashes after the protocol: `https:////example.org/`. While [java.net.URL](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/net/URL.html) extracts an empty hostname, Nutch's OkHttp-based protocol implementation seems to fetch the resource as if there were only two slashes.
  - similarly, java.net.URL and OkHttp behave differently if there is an overlong (or even invalid?) userinfo before the hostname (`scheme://userinfo@hostname/`)
- IP addresses are not recognized as such if ending in a dot: `https://123.123.123.123./robots.txt`
- the extraction of registered domains (done by crawler-commons' [EffectiveTldFinder](https://crawler-commons.github.io/crawler-commons/1.3/crawlercommons/domains/EffectiveTldFinder.html)) does not extract anything if the hostname is equal to a public suffix (for example, `gov.uk` or `kharkov.ua`)
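
To make two of the cases above concrete, here is a minimal sketch of possible pre-processing steps, using only the Java standard library. It is not the project's actual fix: the class `HostNameSketch` and its methods are hypothetical names chosen for illustration, and the choice to collapse extra slashes only for `http`/`https` URLs (so that `file:///...` is left alone) is an assumption.

```java
import java.util.regex.Pattern;

/** Hypothetical helper illustrating the cases listed above; not existing project code. */
public class HostNameSketch {

    // "http:" or "https:" followed by three or more slashes, e.g. "https:////".
    // Assumption: other schemes (such as "file:") are left untouched.
    private static final Pattern EXTRA_SLASHES =
        Pattern.compile("(?i)^(https?:)/{3,}");

    // Four decimal octets (0-255) separated by dots.
    private static final String OCTET = "(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])";
    private static final Pattern IPV4 =
        Pattern.compile("^" + OCTET + "(\\." + OCTET + "){3}$");

    /** Reduces "https:////example.org/" to "https://example.org/". */
    public static String collapseExtraSlashes(String url) {
        return EXTRA_SLASHES.matcher(url).replaceFirst("$1//");
    }

    /** Removes a single trailing dot (the DNS root label), if present. */
    public static String stripTrailingDot(String host) {
        return host.endsWith(".") ? host.substring(0, host.length() - 1) : host;
    }

    /** Checks for a dotted-quad IPv4 address, tolerating one trailing dot. */
    public static boolean isIPv4Address(String host) {
        return IPV4.matcher(stripTrailingDot(host)).matches();
    }

    public static void main(String[] args) {
        System.out.println(collapseExtraSlashes("https:////example.org/"));
        System.out.println(isIPv4Address("123.123.123.123."));
        System.out.println(isIPv4Address("example.org."));
    }
}
```

Running `main` prints `https://example.org/`, `true` and `false`. The public suffix case is not covered here, since it depends on how `EffectiveTldFinder` is called.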

Improve extraction of host names and registered domains #26
