yamazakikakuyo/LLM-example-usage-hands-on

RAG + Clustering + Summarization Hands-on (Colab-ready)

This bundle includes four notebooks that share a single storage location on Google Drive:

  1. 01_Embedder.ipynb: loads two demo datasets, (1) 20 Newsgroups for the RAG and clustering tasks and (2) a very long book (the Project Gutenberg eBook of War and Peace) for the summarization task. It then builds embeddings for dataset (1) with either BERT base uncased (on GPU if available) or OpenAI text-embedding-3-small, and saves all artifacts to Drive.
  2. 02_RAGPipeline.ipynb: builds a Chroma vector DB in Drive from the same embeddings, performs retrieval, and runs a RAG chatbot with OpenAI gpt-5-nano. The bot's output includes the retrieved chunks and the assembled prompt.
  3. 03_DocClustering.ipynb: runs KMeans on the saved embeddings, with optional silhouette tuning for K.
  4. 04_HierarchicalSummarizer.ipynb: demonstrates a map→reduce (hierarchical) summarization pipeline on a very long text (Google-style chunking, a parallelizable Map phase, and an iterative Reduce phase). Outputs include citations of the chunk IDs used at each stage.
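
The map→reduce flow of notebook 4 can be illustrated with a minimal, dependency-free sketch. It is not the notebook's actual code: `summarize` is a hypothetical stand-in for the LLM call, the fixed-size character chunking is a simplification, and `chunk_size` and `group_size` are assumed parameter names. What it does show is how chunk IDs can be carried through every Reduce round so the final summary can cite its sources.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_text(text, chunk_size=200):
    """Split text into fixed-size character chunks, each paired with a chunk ID."""
    return [(i // chunk_size, text[i:i + chunk_size])
            for i in range(0, len(text), chunk_size)]


def summarize(text):
    """Hypothetical stand-in for an LLM call: keeps only the first sentence."""
    first = text.strip().split(".")[0].strip()
    return first + "." if first else ""


def map_phase(chunks):
    """Summarize every chunk independently; the calls can run in parallel."""
    with ThreadPoolExecutor() as pool:
        summaries = list(pool.map(lambda c: summarize(c[1]), chunks))
    return [{"summary": s, "sources": [cid]}
            for (cid, _), s in zip(chunks, summaries)]


def reduce_phase(nodes, group_size=4):
    """Merge groups of summaries repeatedly until one remains, keeping chunk-ID citations."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    if not nodes:
        return {"summary": "", "sources": []}
    while len(nodes) > 1:
        merged = []
        for i in range(0, len(nodes), group_size):
            group = nodes[i:i + group_size]
            combined = " ".join(n["summary"] for n in group)
            merged.append({
                "summary": summarize(combined),
                "sources": sorted(cid for n in group for cid in n["sources"]),
            })
        nodes = merged
    return nodes[0]


def hierarchical_summary(text, chunk_size=200, group_size=4):
    """Run the Map phase, then the iterative Reduce phase."""
    return reduce_phase(map_phase(chunk_text(text, chunk_size)), group_size)
```

Replacing `summarize` with a real model call leaves the rest of the structure unchanged; the `sources` list on the returned node names every chunk that fed into the final summary.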

Datasets

  • RAG & Clustering: scikit‑learn’s built-in 20 Newsgroups (English) for simple, license-friendly hands-on use.
  • Summarization: War and Peace (Project Gutenberg, public domain).

Persistence

All notebooks write to the same folder on Google Drive (default: /content/drive/MyDrive/LLM Example Usage Hands-on/rag_lab). You can change it in the config cell at the top of each notebook.
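
As a rough illustration of what such a config cell might look like, the sketch below resolves artifact paths inside the shared folder. Only the default path comes from this README; `DEFAULT_BASE_DIR`, `artifact_path`, and the file name in the usage note are hypothetical names chosen for the example.

```python
from pathlib import Path

# Default folder as stated in this README. In Colab, Drive has to be mounted
# first, e.g. with: from google.colab import drive; drive.mount("/content/drive")
DEFAULT_BASE_DIR = "/content/drive/MyDrive/LLM Example Usage Hands-on/rag_lab"


def artifact_path(name, base_dir=DEFAULT_BASE_DIR, create=False):
    """Return the path of a named artifact inside the shared folder.

    With create=True the folder is created if it does not exist yet.
    """
    base = Path(base_dir)
    if create:
        base.mkdir(parents=True, exist_ok=True)
    return base / name
```

For example, `artifact_path("embeddings.npy")` would point every notebook at the same file, provided they all use the same `base_dir`.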

Demo Mode

Each notebook has DEMO_MODE = True by default to keep runs fast (smaller subsets, fewer chunks). Set it to False for full runs. Important: if you want full runs, run notebook 1 with DEMO_MODE = False before moving on to the other notebooks.
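
One plausible way a flag like this gates the data size is sketched below. The flag name is from this README; `select_subset` and `demo_limit` are hypothetical, and the notebooks may apply the flag differently.

```python
DEMO_MODE = True  # flag name from this README; True keeps runs fast


def select_subset(items, demo_mode=DEMO_MODE, demo_limit=500):
    """Return only the first demo_limit items in demo mode, otherwise all items."""
    return list(items[:demo_limit]) if demo_mode else list(items)
```

Because later notebooks read what notebook 1 saved, embeddings built from a demo subset stay a demo subset downstream, which is why notebook 1 has to be rerun with the flag off before a full run.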

API Keys

  • Set the environment variables OPENAI_API_KEY and HF_TOKEN in the runtime of each notebook that needs them: OPENAI_API_KEY for OpenAI models, HF_TOKEN for the BERT model (loaded from Hugging Face).
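
A small check along the following lines can catch a missing key before any model call is made. The two variable names are from this README; `missing_keys` is a hypothetical helper, not something the notebooks necessarily define.

```python
import os


def missing_keys(required=("OPENAI_API_KEY", "HF_TOKEN"), env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# One way to set a key interactively in a notebook without echoing it:
#   import getpass
#   os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API key: ")
```

Calling `missing_keys()` at the top of a notebook and stopping if the list is non-empty gives a clearer error than a failed API request later on.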
