Chore: LLMs generator#463

Open
scode2277 wants to merge 5 commits into develop from chore/llms-generation

Collaborator

@scode2277 scode2277 commented Apr 22, 2026

I've created a script that runs post-build and generates LLM-friendly documentation, overriding the vocs default output.

What it generates:

All files start with instructions for AI assistants on how to cite the source, when to fetch other files, and where to look if the question spans multiple frameworks.

All files are branch-aware: on main, dev: true pages are excluded, while on develop all pages are included so contributors get full coverage. The links in the files follow the same rule and change based on the branch/site the files are built for.
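
A minimal sketch of this branch awareness, assuming the Cloudflare Pages variables `CF_PAGES_BRANCH` and `CF_PAGES_URL` are the inputs. The helper names, the `PRODUCTION_URL` placeholder, the page-object shape, and the fallback to `develop` when no branch is set are all hypothetical and not taken from utils/generate-llms.js.

```javascript
// Hypothetical production origin; the real site URL is not stated in this PR.
const PRODUCTION_URL = 'https://example.org';

// Derive branch-dependent settings from Cloudflare Pages environment variables.
// Assumption: a build with no branch set is treated like develop.
function resolveBuildContext(env) {
  const branch = env.CF_PAGES_BRANCH || 'develop';
  const isMainBranch = branch === 'main';
  // On main, link to production; elsewhere, link to the preview deployment.
  const rawBase = isMainBranch ? PRODUCTION_URL : (env.CF_PAGES_URL || PRODUCTION_URL);
  // Strip trailing slashes so that joining with "/path" never doubles them.
  return { branch, isMainBranch, baseUrl: rawBase.replace(/\/+$/, '') };
}

// Keep every page on non-main branches; drop dev-only pages on main.
function visiblePages(pages, ctx) {
  return pages.filter((page) => !(ctx.isMainBranch && page.dev === true));
}

// Build an absolute link for a sidebar path such as "/intro/llms".
function pageUrl(ctx, link) {
  return `${ctx.baseUrl}${link}`;
}
```

With this shape, the same page list yields different outputs per branch without any branch checks scattered through the generator.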

In addition, I've added a page at the end of the sidebar that explains how to use the generated files and how to reach them directly by following the naming structure -> https://chore-llms-generation.frameworks-573.pages.dev/intro/llms

Also fixed a couple of titles in the DPRK IT Workers framework, as they were causing wrong section headers in the output (the title mentioned "overview" in a file that was not the overview one).

Frameworks PR Checklist

Thank you for contributing to the Security Frameworks! Before you open a PR, make sure to read information for contributors and take a look at the following checklist:

  • Describe your changes, substitute this text with the information
  • If you are touching an existing piece of content, tag current contributors from the attribution list
  • If there is a steward for that framework, ask the steward to review it
  • If you're modifying the general outline, make sure to update it in the vocs.config.ts adding the dev: true parameter
  • If you need feedback for your content from the wider community, share the PR in our Discord
  • Review changes to ensure there are no typos; see instructions below.

@scode2277 scode2277 added the enhancement Updates that improve or refine existing features, user experience, or system performance. label Apr 22, 2026

github-actions Bot commented Apr 22, 2026

Sidebar Configuration Reminder

This PR includes added, renamed, or removed documentation files:

  • docs/pages/intro/llms.mdx (added)

Please ensure that:

  • The sidebar in vocs.config.tsx has been updated to include these files
  • New content has the dev: true parameter so it's marked as under development
  • Sidebar links match the file paths - use the preview deployment to verify

See Contributing Guide – Sidebar & Navigation for more details.


This is an automated reminder. If this PR doesn't need sidebar changes, you can ignore this message.

Collaborator

@frameworks-volunteer frameworks-volunteer left a comment

Model: z-ai/glm-5.1 Reasoning: high Provider: openrouter

Approving. Clean implementation that follows existing patterns well.

Security: No concerns. The script is build-time only -- reads local MDX files and writes to dist. No eval, no child_process, no network calls, no secrets. Env var usage (CF_PAGES_BRANCH, CF_PAGES_URL) matches sitemap-generator.js. gray-matter is already a project dependency.

QA: Script reuses the same vocs.config.tsx regex and sidebar-ordering approach as sitemap-generator.js. The dev:true filtering is actually more thorough (double layer: line-level in config + frontmatter data.dev check). DPRK title fixes are correct -- they remove "Overview" from non-overview pages which would cause confusing section headers in the generated output. The llms.mdx page matches the frontmatter pattern of other intro pages (no contributors field). CI green.

Comment thread utils/generate-llms.js Outdated
if (!match) continue;
const link = match[2];
// On main branch, skip dev-only pages (same logic as the rest of the codebase)
if (isMainBranch && line.includes('dev:') && line.includes('true')) continue;
Collaborator

/iam/overview (vocs.config.tsx:200) inherits dev: true from its parent category (line 198), but this same-line check misses it, so it leaks into llms-iam.txt on main.
searchbar-indexing.js:226-233 and tags-fetcher.js:124-133 handle this by scanning the 3 lines before the link; matching that approach keeps the sidebar scanners aligned.
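
A rough sketch of the look-back idea, for illustration only. The function name `isDevLink` and the sample config lines are hypothetical; this is not the code from searchbar-indexing.js or tags-fetcher.js, which is not shown in this thread.

```javascript
// Returns true when `dev: true` appears on the link's own line or on any of
// the `lookback` lines before it (3 by default, per the comment above).
function isDevLink(lines, index, lookback = 3) {
  const start = Math.max(0, index - lookback);
  return lines
    .slice(start, index + 1)
    .some((line) => /dev:\s*true/.test(line));
}

// Hypothetical config fragment shaped like the /iam case: the parent block
// carries dev: true two lines above the child link.
const sampleLines = [
  '{',
  "  text: 'IAM',",
  '  dev: true,',
  '  items: [',
  "    { text: 'Overview', link: '/iam/overview' },",
  '  ],',
  '}',
];

const overviewIsDev = isDevLink(sampleLines, 4);
```

A fixed window is a heuristic: it only catches a parent flag that sits within the last few lines, so deeper or longer blocks would need a different approach.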

Collaborator Author

@scode2277 scode2277 Apr 24, 2026

Good catch!
The IAM overview page was missing the dev: true parameter and was leaking into the main version of the llms files. Instead of checking only each link's own line for dev: true, the script now tracks brace depth as it scans the config file. When a parent sidebar block carries dev: true, all child links inside it are skipped automatically, even if they don't have the parameter on their own line.

I've also added the dev: true param to the IAM overview in the config file.
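
For readers following along, brace-depth tracking of this kind could look roughly like the sketch below. It is an illustration under simplifying assumptions, not the code in utils/generate-llms.js: the name `collectLinks` and the regexes are hypothetical, braces inside strings are not handled, and a multi-line object is assumed to state `dev: true` before its child links.

```javascript
// Collect sidebar links from config text, skipping dev-only links on main.
function collectLinks(configText, isMainBranch) {
  const links = [];
  const devDepths = []; // depths of currently open blocks flagged dev: true
  let depth = 0;

  for (const line of configText.split('\n')) {
    const opens = (line.match(/\{/g) || []).length;
    const closes = (line.match(/\}/g) || []).length;
    const hasDev = /dev:\s*true/.test(line);

    // A dev flag on a line that closes nothing belongs to a block that stays
    // open, so remember the depth of that block. A flag on a self-contained
    // line (`{ ..., dev: true },`) only affects that line.
    if (hasDev && closes === 0) devDepths.push(depth + opens);

    const match = line.match(/link:\s*['"]([^'"]+)['"]/);
    const insideDevBlock = devDepths.length > 0;
    if (match && !(isMainBranch && (hasDev || insideDevBlock))) {
      links.push(match[1]);
    }

    depth += opens - closes;
    // Forget dev blocks that have been closed.
    while (devDepths.length > 0 && devDepths[devDepths.length - 1] > depth) {
      devDepths.pop();
    }
  }
  return links;
}
```

Unlike a fixed look-back window, this keeps skipping children however far they sit below the parent's flag.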

Collaborator

Ghadi8 commented Apr 23, 2026

Looks good! One thing, flagged as an inline comment, is worth looking out for (the dev: true filter can leak dev content).

@ElliotFriedman
Contributor

Really nice work, this fills a clear gap. A few thoughts after digging in.

Structural question: file granularity

The big one I'd want to think through before merge. We currently bundle the full content of every page into a single llms-{framework}.txt. Some numbers from the preview build:

  • 10 of 31 framework files exceed 10K tokens (bytes ÷ 4 estimate)
  • llms-opsec.txt is 139KB / ~35K tokens; incident-management 111KB; guides 103KB
  • Inside opsec, one page (Travel Security Guide) is ~11.5K tokens — bigger than 22 of the 31 entire framework files on its own
  • The routing index /llms.txt itself is 46KB / ~11.6K tokens because it embeds the H2 outline of every page
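
The bytes ÷ 4 estimate is easy to reproduce. A small sketch follows; the function names are hypothetical, and the sample size treats 139KB as 139,000 bytes, which is an assumption since the exact byte counts are not given here.

```javascript
// Rough heuristic from the list above: about 4 bytes per token.
function estimateTokens(byteLength) {
  return Math.round(byteLength / 4);
}

// Return the files whose estimated token count exceeds a budget (10K here),
// largest first.
function oversizedFiles(files, tokenBudget = 10000) {
  return files
    .map(({ name, bytes }) => ({ name, bytes, tokens: estimateTokens(bytes) }))
    .filter((file) => file.tokens > tokenBudget)
    .sort((a, b) => b.tokens - a.tokens);
}

// Illustrative input only: 139KB taken as 139,000 bytes.
const opsecTokens = estimateTokens(139000); // 34750
```

Running `oversizedFiles` over the files in dist after a build would make it simple to track this number over time.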

The bottleneck is consumer tools, not frontier models or context windows:

  • ChatGPT custom GPT actions cap at 100,000 characters per response (per OpenAI's GPT Actions production notes)
  • Claude Code's WebFetch reportedly truncates the markdown-converted result around 100KB (per public reverse-engineering of its system prompt)
  • Cursor users have widely reported tool-output truncation, though the platform doesn't publish a specific byte cap

llms-opsec.txt will be silently truncated in those clients with no signal that anything is missing.

Surveying prior art (Anthropic's platform.claude.com/llms.txt + /llms-full.txt, Stripe, Vercel, Next.js, docs.expo.dev, VitePress, Svelte), the overwhelming convention is a thin llms.txt index of per-page markdown links with raw content at sibling URLs, sometimes supplemented with llms-full.txt (Anthropic and Svelte do this) for the "ingest everything" case. None of them bundle content per-topic the way we currently do. The llmstxt.org spec itself defines llms.txt as "a markdown file … offers brief background information, guidance, and links to detailed markdown files": an index of links, not embedded content.

Concrete suggestion — hybrid output:

  • Thin /llms.txt index (per-page links + one-line descriptions, drop the nested H3 outline)
  • llms-{framework}-{page}.txt per-page files (most fetches land here)
  • Keep llms-{framework}.txt bundles for the "give me the whole framework" case
  • Add llms-full.txt for "ingest everything"

pageHeadings is already collected, so emitting per-page files in the same loop is a small change. It lets each consumer pick granularity without us guessing.
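
To make the suggestion concrete, here is a sketch of the file set it would produce. The file-name patterns come from the list above; the function name `planOutputs` and the input shape (`slug`, `pages`) are hypothetical and not taken from the script.

```javascript
// List every output file for the hybrid layout: a thin index, a full dump,
// one bundle per framework, and one file per page.
function planOutputs(frameworks) {
  const files = ['llms.txt', 'llms-full.txt'];
  for (const { slug, pages } of frameworks) {
    files.push(`llms-${slug}.txt`); // whole-framework bundle
    for (const page of pages) {
      files.push(`llms-${slug}-${page}.txt`); // per-page file
    }
  }
  return files;
}

const planned = planOutputs([
  { slug: 'opsec', pages: ['overview', 'travel'] },
]);
```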


cloudflare-workers-and-pages Bot commented Apr 24, 2026

Deploying frameworks with Cloudflare Pages

Latest commit: fa821fa
Status: ✅  Deploy successful!
Preview URL: https://ac9e46c3.frameworks-573.pages.dev
Branch Preview URL: https://chore-llms-generation.frameworks-573.pages.dev

@scode2277
Collaborator Author

@ElliotFriedman Thanks for the review. The token issue is worth considering and assessing!

Changes I've made:

  • llms.txt -> page headings removed; it now has only one entry per framework with URL, description, and a Topics line (page titles)
  • llms/{framework-name}.txt -> now a framework index instead of a full content dump: it embeds the overview page for immediate context, then lists links to all per-page files with one-line descriptions (llms/opsec.txt went from 139KB to 11KB)
  • llms/{framework-name}/{page}.txt -> added per-page files, one for every sidebar-listed page, with full stripped content, source URL, and framework attribution

Files are organized like this ->

llms.txt
llms/
     wallet-security.txt      <- Overview file content + all the links to the subpages    
     wallet-security/          
            account-abstraction.txt
            ...
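
As a rough illustration of this layout, the sketch below builds the two kinds of paths and renders a framework index. The path patterns follow the tree above; the function names, the input object shape, and the exact markdown of the rendered index are assumptions and may differ from what the script emits.

```javascript
// Path helpers matching the layout above.
function frameworkIndexPath(framework) {
  return `llms/${framework}.txt`;
}

function pagePath(framework, page) {
  return `llms/${framework}/${page}.txt`;
}

// Render a framework index: the overview content first, then one link per
// page with a one-line description. The output format is illustrative.
function renderFrameworkIndex({ slug, title, overview, pages }, baseUrl) {
  const lines = [`# ${title}`, '', overview.trim(), '', '## Pages', ''];
  for (const page of pages) {
    const url = `${baseUrl}/${pagePath(slug, page.slug)}`;
    lines.push(`- [${page.title}](${url}): ${page.description}`);
  }
  return lines.join('\n') + '\n';
}
```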

Labels

enhancement Updates that improve or refine existing features, user experience, or system performance.

4 participants