docs: rewrite Whirlwind Tour #13
laurieburchell commented Jun 6, 2025
- Changed the structure of the tour to focus around tasks
- Added background info about the data
- Added more explanation of software tools
- General prettification
- Tried to be more consistent with capitalisation, probably introduced a mix of British and American spelling
The CI failure is OK: Ubuntu 24.04 doesn't have 3.7, and the CI hadn't been run since 24.04 became available. I will fix it in a separate PR.
LGTM. I did not look at every detail but everything I did read was friendlier than the original, so... bravo!
README.md
Outdated
Open up `whirlwind.warc` in your favorite text editor. This is the uncompressed
version of the file -- normally we always work with these files while they
are compressed.
This tutorial was written on Linux and MacOS. We think it should also work on Windows WSL, but raise an issue if you encounter problems.
I just made it work in actual Windows, at least the wacky Windows thing that GitHub uses for CI!
## Iterate over warc, wet, wat
(We also have a [web graph](https://commoncrawl.org/web-graphs) by host and domains, but it is not currently demonstrated in this tour.)
I'd suggest adding some nearly-empty sections at the bottom for the things that exist and aren't part of the tour yet: web graph, host index, index annotations. In each case you can point to a GitHub repo or our website; that's better than nothing.
README.md
Outdated
Let's start with the cdxj index.
### CDX(J) index
The CDX index files are sorted plain-text files, with each line containing information about a single capture in the WARC. Technically, Common Crawl uses CDXJ index files since the information about each capture is formatted as JSON. We'll use CDX and CDXJ interchangeably in this tour for legacy reasons :)
You can insert emoji 😉
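For anyone reading this thread without the README open, here is a minimal sketch of what the quoted CDX(J) paragraph describes: one capture per line, with the per-capture information stored as JSON. It assumes the usual CDXJ layout of a SURT-style key, a timestamp, and a JSON object separated by spaces. The `parse_cdxj_line` helper and the sample line are hypothetical illustrations, not code or data from the tour.

```python
import json


def parse_cdxj_line(line):
    """Split one CDXJ line into (key, timestamp, JSON fields).

    Assumes the layout `<key> <timestamp> <json object>`; this helper is
    a hypothetical illustration, not part of the tour's code.
    """
    # Only split on the first two spaces: the JSON part contains spaces itself.
    key, timestamp, json_blob = line.rstrip("\n").split(" ", 2)
    return key, timestamp, json.loads(json_blob)


# Made-up example line (not taken from the tour's index files).
sample = (
    'org,example)/ 20240101000000 '
    '{"url": "https://example.org/", "status": "200", '
    '"offset": "0", "length": "1234", "filename": "example.warc.gz"}'
)

key, timestamp, fields = parse_cdxj_line(sample)
print(key, timestamp, fields["url"], fields["filename"])
```

Because the key comes first on each line and the files are sorted, a lookup can use binary search over the file rather than a full scan.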