laurieburchell

  • Changed the structure of the tour to focus around tasks
  • Added background info about the data
  • Added more explanation of software tools
  • General prettification
  • Tried to be more consistent with capitalisation, probably introduced a mix of British and American spelling

wumpus commented Jun 7, 2025

The CI failure is OK: Ubuntu 24.04 doesn't have 3.7, and the CI hadn't been run since 24.04 became available. I will fix it in a separate PR.


wumpus left a comment

LGTM. I did not look at every detail but everything I did read was friendlier than the original, so... bravo!

README.md Outdated
Open up `whirlwind.warc` in your favorite text editor. This is the uncompressed
version of the file -- normally we always work with these files while they
are compressed.
This tutorial was written on Linux and MacOS. We think it should also work on Windows WSL, but raise an issue if you encounter problems.

I just made it work in actual Windows, at least the wacky Windows thing that GitHub uses for CI!


## Iterate over warc, wet, wat
(We also have a [web graph](https://commoncrawl.org/web-graphs) by host and domains, but it is not currently demonstrated in this tour.)

I'd suggest adding some nearly-empty sections at the bottom for the things that exist but aren't part of the tour yet: web graph, host index, index annotations. In each case you can point to a GitHub repo or our website, which is better than nothing.

README.md Outdated
Let's start with the cdxj index.
### CDX(J) index

The CDX index files are sorted plain-text files, with each line containing information about a single capture in the WARC. Technically, Common Crawl uses CDXJ index files since the information about each capture is formatted as JSON. We'll use CDX and CDXJ interchangeably in this tour for legacy reasons :)

You can insert emoji 😉
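
For readers of this thread, the CDXJ description quoted from the README above can be sketched in Python. This is only an illustration under stated assumptions: the three-field line layout (sort key, timestamp, JSON object) and the sample line are not taken from this PR, and `parse_cdxj_line` is a hypothetical helper, not part of the tour's code.

```python
import json
from typing import Tuple


def parse_cdxj_line(line: str) -> Tuple[str, str, dict]:
    """Split one CDXJ line into (sort key, timestamp, JSON fields).

    Assumed layout (not confirmed by this PR): "<key> <timestamp> <json object>".
    """
    # Split on the first two spaces only; the JSON payload may contain spaces.
    key, timestamp, payload = line.rstrip("\n").split(" ", 2)
    return key, timestamp, json.loads(payload)


# Made-up sample line for illustration; real index lines carry more fields.
sample = 'org,example)/ 20250607000000 {"url": "https://example.org/", "status": "200"}'
key, timestamp, fields = parse_cdxj_line(sample)
print(key, timestamp, fields["url"])
```

Because each line describes a single capture, a reader can process an index file line by line without loading the whole file.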

wumpus merged commit a1b9275 into main on Jun 10, 2025
11 checks passed
wumpus deleted the ww-remix branch on June 10, 2025 16:13

2 participants