Commit f630817

docs: extend section on web graphs (#15)
1 parent a1b9275 commit f630817

2 files changed

+44
-5
lines changed

README.md

Lines changed: 44 additions & 5 deletions
@@ -58,9 +58,7 @@ If we click on `CC-MAIN-2024-22' in the dropdown, we are taken to a page listing
5858

5959
![crawl_file_listing.png](img/crawl_file_listing.png)
6060

61-
In this whirlwind tour, we're going to look first at the WARC, WET, and WAT files: the data types which store the crawl data. Later, we will look at the two index files and how these help us access the crawl data we want.
62-
63-
(We also have a [web graph](https://commoncrawl.org/web-graphs) by host and domains, but it is not currently demonstrated in this tour.)
61+
In this whirlwind tour, we're going to look at the WARC, WET, and WAT files: the data types which store the crawl data. Later, we will look at the two index files and how these help us access the crawl data we want. At the [end of the Tour](#other-datasets), we'll mention some of Common Crawl's other datasets and where you can find more information about them.
6462

6563
### WARC
6664

@@ -517,9 +515,50 @@ You have completed the Whirlwind Tour of Common Crawl's Datasets using Python! Y
517515

518516
We make more datasets available than just the ones discussed in this Whirlwind Tour. Below is a short introduction to some of these other datasets, along with links to where you can find out more.
519517

520-
### Web graph
518+
### Web Graphs
519+
520+
Common Crawl regularly releases Web Graphs, which describe the structure and connectivity of the web as captured in the crawl releases. We provide graphs at two levels: host-level and domain-level. Both are available to download [from our website](https://commoncrawl.org/web-graphs).
521+
522+
The host-level graph describes links between pages on the web at the level of hostnames (e.g. `en.wikipedia.org`). The domain-level graph aggregates the host-level graph, describing links at the level of pay-level domains (PLDs), as determined by the public suffix list maintained at [publicsuffix.org](https://publicsuffix.org). The PLD is the domain registered directly under a public suffix, which is often (but not always) the top-level domain (TLD): e.g. for `en.wikipedia.org`, the TLD is `.org` and the PLD is `wikipedia.org`.
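To make the PLD idea concrete, here is a minimal sketch of how a hostname could be reduced to its pay-level domain. The `TOY_PUBLIC_SUFFIXES` set and the `pay_level_domain` function are hypothetical names used for illustration only: the real public suffix list is far larger and also contains wildcard and exception rules, which this sketch ignores. It is not the code used to build the Web Graphs.

```python
# Toy stand-in for the public suffix list (assumption: the real list is
# much larger and has wildcard/exception rules that are ignored here).
TOY_PUBLIC_SUFFIXES = {"org", "com", "co.uk"}


def pay_level_domain(host, suffixes=TOY_PUBLIC_SUFFIXES):
    """Return the public suffix plus one extra label, or None if no match."""
    labels = host.lower().split(".")
    # Check the longest candidate suffix first, so "co.uk" is matched
    # before a shorter suffix could be considered.
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in suffixes:
            return ".".join(labels[i - 1:])
    return None


print(pay_level_domain("en.wikipedia.org"))
print(pay_level_domain("www.example.co.uk"))
```

The second call shows why the public suffix list matters: taking only the label under the TLD would not give a registrable domain for hosts under suffixes such as `co.uk`.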
523+
524+
As an example, let's look at the [Web Graph release for March, April and May 2025](https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2025-mar-apr-may/index.html). This page provides links to download the data associated with the host- and domain-level graphs for those months. The key files needed to construct the graphs are the files containing the vertices or nodes (the hosts or domains) and the files containing the edges (the links between the hosts or domains). These are currently the top two links in each of the tables.
525+
526+
![web-graph.png](img/web-graph.png)
527+
528+
The `.txt` files for nodes and edges are gzip-compressed, tab-separated text files. The "Description" column in the table explains what each column contains. If we download the domain-level graph vertices,
529+
[cc-main-2025-mar-apr-may-domain-vertices.txt](https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2025-mar-apr-may/domain/cc-main-2025-mar-apr-may-domain-vertices.txt.gz), we find that the top of the file looks like this:
530+
531+
```tsv
532+
0 aaa.1111 1
533+
1 aaa.11111 1
534+
2 aaa.2 1
535+
3 aaa.a 1
536+
4 aaa.aa 1
537+
5 aaa.aaa 3
538+
6 aaa.aaaa 1
539+
7 aaa.aaaaaa 1
540+
8 aaa.aaaaaaa 1
541+
9 aaa.aaaaaaaaa 1
542+
```
543+
The first column gives the node ID, the second gives the (pay-level) domain name in reverse domain name notation (so `aaa.1111` stands for the domain `1111.aaa`), and the third column gives the number of hosts in the domain.
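As a rough sketch of how these rows could be read in Python, the snippet below parses tab-separated vertex lines and turns each name back into the usual left-to-right form. It assumes the names are stored in reverse domain name notation and that the downloaded file is gzip-compressed; `parse_vertex_line` and `iter_vertices` are hypothetical helper names, not part of any Common Crawl tooling.

```python
import gzip


def parse_vertex_line(line):
    """Split one vertex row into (node_id, domain, num_hosts)."""
    node_id, reversed_name, num_hosts = line.rstrip("\n").split("\t")
    # Assumption: names are reversed, e.g. "org.wikipedia" -> "wikipedia.org".
    domain = ".".join(reversed(reversed_name.split(".")))
    return int(node_id), domain, int(num_hosts)


def iter_vertices(path):
    """Lazily yield parsed rows from a gzip-compressed vertices file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield parse_vertex_line(line)


# Two rows copied from the listing above, with tabs as separators.
sample = "0\taaa.1111\t1\n5\taaa.aaa\t3\n"
for row in map(parse_vertex_line, sample.splitlines()):
    print(row)
```

Reading the file lazily with a generator keeps memory use low, which matters because the full vertices file is much larger than this ten-line excerpt.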
544+
545+
We can also look at the top of the domain-level edges file, [cc-main-2025-mar-apr-may-domain-edges.txt](https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2025-mar-apr-may/domain/cc-main-2025-mar-apr-may-domain-edges.txt.gz):
546+
547+
```tsv
548+
39 126790965
549+
41 53700629
550+
41 126790965
551+
42 126790965
552+
48 22113090
553+
48 91547783
554+
48 110426784
555+
48 119774627
556+
48 121059062
557+
49 22113090
558+
```
559+
Here, each row defines a link between two domains, with the first column giving the ID of the originating node and the second column giving the ID of the destination node. The node and edge files for the host-level graph are similar to those for the domain-level graph; the only difference is that the host-level vertices file has no column for the number of hosts in a domain.
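The edge rows can be turned into a simple in-memory graph. The sketch below builds an adjacency list and counts incoming links from a few of the rows shown above. The names `parse_edge_line`, `out_links`, and `in_degree` are illustrative choices, and this approach is only practical for small samples: the full edge files are far too large for a plain Python dictionary.

```python
from collections import Counter, defaultdict


def parse_edge_line(line):
    """Split one edge row into (source_id, destination_id)."""
    src, dst = line.rstrip("\n").split("\t")
    return int(src), int(dst)


# The first five rows from the listing above, with tabs as separators.
sample_edges = (
    "39\t126790965\n"
    "41\t53700629\n"
    "41\t126790965\n"
    "42\t126790965\n"
    "48\t22113090\n"
)

out_links = defaultdict(list)  # source node ID -> list of destination IDs
in_degree = Counter()          # destination node ID -> number of incoming links

for line in sample_edges.splitlines():
    src, dst = parse_edge_line(line)
    out_links[src].append(dst)
    in_degree[dst] += 1

print(dict(out_links))
print(in_degree.most_common(1))
```

The node IDs here can be joined against the first column of the vertices file to recover the domain names on each side of a link.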
521560

522-
Common Crawl regularly releases host- and domain-level graphs for visualising the crawl data. The web graphs are available to download [here](https://commoncrawl.org/web-graphs). We provide a [repository](https://github.com/commoncrawl/cc-webgraph) with tools to construct, process, and explore the web graphs.
561+
If you're interested in working more with the Web Graphs, we provide a [repository](https://github.com/commoncrawl/cc-webgraph) with tools to construct, process, and explore the Web Graphs. We also have a [notebook](https://github.com/commoncrawl/cc-notebooks/tree/main/cc-webgraph-statistics) which shows users how to view statistics about the Common Crawl Web Graph data sets and interactively explore the graphs.
523562

524563
### Host index
525564

img/web-graph.png

239 KB