Conversation

MrOrz (Member) commented Aug 12, 2025

This PR implements the suggested migration from gRPC url-resolver to Gemini's urlContext tool to address performance and maintenance issues.

Changes

  • Replace gRPC dependency with Google AI SDK Gemini urlContext tool
  • Extract title, summary, and topImageUrl as requested
  • Leave html field empty per requirements
  • Add experimental testing script for manual evaluation
  • Update tests to use new implementation
  • Maintain backward compatibility with existing API

Benefits

  • Eliminates gRPC maintenance overhead (binary builds, proto sync)
  • Resolves stability issues that caused the 2025/07/02 downtime
  • Uses reliable cloud service instead of self-hosted resolver
  • Leverages AI for better content extraction and summarization

Closes #373

🤖 Generated with Claude Code

gemini-code-assist bot (Contributor) left a comment

Summary of Changes

Hello @MrOrz, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the application's URL scraping capabilities by replacing the existing gRPC-based resolver with a new implementation utilizing Google's Gemini urlContext tool. The primary goal is to enhance system stability, improve performance, and streamline maintenance by leveraging a reliable cloud-based AI service for content extraction. The changes involve introducing a new Gemini-specific scraping utility, updating the main URL processing logic to use this new utility, and adjusting test suites accordingly. An experimental script is also added to facilitate testing and performance monitoring of the new scraper.

Highlights

  • Core URL Resolver Migration: The core URL resolution mechanism has been migrated from a gRPC-based system to Google's Gemini urlContext tool. This change aims to improve stability, performance, and reduce maintenance overhead.
  • Enhanced Content Extraction: The new Gemini-powered scraper leverages AI to extract more relevant and structured information from web pages, specifically focusing on accurate titles, comprehensive summaries, and the most representative top image URLs.
  • New Experimental Testing Script: A new experimental script has been introduced to facilitate testing and evaluation of the Gemini URL scraper, complete with integration for Langfuse to track performance metrics like success rates and processing times.
  • Removal of gRPC Dependency: The dependency on gRPC has been removed, addressing previous stability issues (like the 2025/07/02 downtime) and simplifying the overall architecture by moving to a managed cloud service.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review of the pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, in both issue comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder at the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check its output and use code with caution.

gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request successfully migrates the URL scraping functionality from a gRPC-based resolver to a new implementation using the Gemini urlContext tool. The changes are well-organized and include a helpful experimentation script and updated tests. My review includes suggestions to improve the robustness of the experiment script and a performance optimization for the new Gemini scraper.

One critical point to note is that while the tests in src/util/__tests__/scrapUrls.js have been updated for the new mock, there seems to be a leftover reference to the old resolveUrl mock on lines 59-60. Since this is outside the changed lines of this PR, I couldn't leave a direct comment, but this will cause the tests to fail and should be addressed.

  console.info('No URLs specified, using default test URLs:', testUrls);
}

console.info(`Testing URL scraping with ${testUrls.length} URLs`);

Severity: high

If the urls argument results in an empty list of URLs after trimming and filtering, the script could encounter division-by-zero errors later when calculating statistics. It's safer to add a check to exit early if testUrls is empty.

  if (testUrls.length === 0) {
    console.info('No valid URLs to process. Exiting.');
    return;
  }
  console.info(`Testing URL scraping with ${testUrls.length} URLs`);

Comment on lines 16 to 21

const project = await new GoogleAuth().getProjectId();
const genAI = new GoogleGenAI({
  vertexai: true,
  project,
  location: 'us-central1', // Using us-central1 for better availability
});

Severity: high

Creating new GoogleAuth and GoogleGenAI instances on every function call is inefficient. These clients are designed to be long-lived and can be reused across multiple requests. Initializing them on each call can add unnecessary latency due to authentication and setup processes. Consider initializing the client once at the module level and reusing it.

For example, you could use a lazy-initialized singleton pattern:

let genAI;

async function getGenAIClient() {
  if (genAI) return genAI;

  const project = await new GoogleAuth().getProjectId();
  genAI = new GoogleGenAI({
    vertexai: true,
    project,
    location: 'us-central1',
  });
  return genAI;
}

export default async function scrapeUrlsWithGemini(urls) {
  // ...
  const genAIClient = await getGenAIClient();
  // ... use genAIClient in the rest of the function
}

if (single) {
  testUrls = [single];
} else if (urls) {
  testUrls = urls.split(',').map(url => url.trim());

Severity: medium

It's good practice to filter out empty strings that might result from splitting the urls string (e.g., from trailing commas or empty entries). This prevents sending empty URL strings to the scraper, which would result in unnecessary and failing API calls.

Suggested change:
- testUrls = urls.split(',').map(url => url.trim());
+ testUrls = urls.split(',').map(url => url.trim()).filter(Boolean);

MrOrz (Member, Author) left a comment

@claude please help fix my comments

claude bot commented Aug 12, 2025

Claude finished @MrOrz's task.


Fixed review comments ✅

I've successfully addressed all the review feedback from @MrOrz:

Changes Made

1. Fixed urlContext tool configuration

  • ✅ Updated tool configuration from functionDeclarations to the correct format: config: { tools: [{ urlContext: {} }] }
  • ✅ Removed unnecessary content parameters for urlContext tool

2. Optimized for multiple URLs in single call

  • ✅ Refactored to process all URLs in one LLM call instead of individual requests
  • ✅ Updated prompt to handle multiple URLs and return JSON array of results
  • ✅ Added fallback handling to ensure all input URLs get results
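
The fallback behavior in item 2 can be sketched as a small pure helper. The names below are illustrative only, not the actual identifiers in src/util/geminiUrlScraper.js:

```javascript
// Hypothetical sketch: ensure every requested URL has an entry in the
// result set, even when the LLM response omits some of them.
function withFallbacks(requestedUrls, llmResults) {
  // Index the results the model actually returned by their original URL.
  const byUrl = new Map(llmResults.map(r => [r.url, r]));

  return requestedUrls.map(
    url =>
      byUrl.get(url) ?? {
        // Fallback entry: keep the URL so callers can still index it.
        url,
        canonical: url,
        title: '',
        summary: '',
        topImageUrl: '',
      }
  );
}
```

Keying the returned results by URL keeps the output aligned with the input order regardless of how the model orders its JSON array.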

3. Switched to gemini-2.5-flash model

  • ✅ Changed from gemini-2.0-flash-001 to gemini-2.5-flash as requested

4. Added canonical URL extraction

  • ✅ Updated prompt to extract canonical URLs from meta tags
  • ✅ Falls back to original URL if no canonical URL found
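
The canonical-URL fallback in item 4 amounts to something like the following tiny helper (a sketch; the real logic lives in the prompt and response handling):

```javascript
// Hypothetical sketch: prefer the canonical URL extracted from the page's
// <link rel="canonical"> tag, falling back to the originally requested URL.
function resolveCanonical(originalUrl, extractedCanonical) {
  return extractedCanonical && extractedCanonical.trim() !== ''
    ? extractedCanonical.trim()
    : originalUrl;
}
```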

5. Performance optimizations

  • ✅ Implemented singleton pattern for GoogleAuth and GoogleGenAI client reuse
  • ✅ Added lazy initialization to avoid creating clients on every function call

6. Fixed experimental script safety

  • ✅ Added filter(Boolean) to remove empty strings from URL splitting
  • ✅ Added early exit if no valid URLs to process (prevents division by zero)
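
Taken together, the two safety fixes in item 6 look roughly like this (a sketch; the actual experiment script parses its CLI arguments differently):

```javascript
// Hypothetical sketch of the experiment script's URL parsing with both
// safety fixes applied: drop empty entries, then bail out early.
function parseTestUrls(urlsArg) {
  const testUrls = (urlsArg ?? '')
    .split(',')
    .map(url => url.trim())
    .filter(Boolean); // drops '' entries from trailing commas or blanks

  if (testUrls.length === 0) {
    // Early exit prevents division-by-zero in the statistics step.
    return null;
  }
  return testUrls;
}
```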

Technical Improvements

  • Efficiency: Single LLM call handles multiple URLs vs. previous one-by-one approach
  • Performance: Client reuse eliminates authentication overhead on subsequent calls
  • Reliability: Better error handling and fallback mechanisms
  • Safety: Prevents edge cases in experimental script

All changes maintain backward compatibility with the existing API while addressing the performance and maintenance concerns mentioned in the review.

claude bot and others added 2 commits August 24, 2025 12:58
- Add Gemini-based URL scraping using @google/genai urlContext tool
- Remove gRPC dependency from scrapUrls.js
- Create experimental testing script for manual evaluation
- Update tests to mock new Gemini scraper
- Maintain backward compatibility with existing API
- Leave html field empty as requested, focus on title/summary/topImageUrl

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-authored-by: Johnson Liang <[email protected]>
- Use correct urlContext tool configuration format
- Process multiple URLs in single LLM call for efficiency
- Switch to gemini-2.5-flash model
- Add canonical URL extraction capability
- Implement client reuse pattern for performance
- Add safety check for empty URL list in experiment script

Co-authored-by: Johnson Liang <[email protected]>
MrOrz force-pushed the claude/issue-373-20250812-0421 branch from 54b8cfc to 5194dc4 on August 24, 2025 04:58
"url": "original URL from the list",
"canonical": "canonical URL if different from original, or same as original",
"title": "The main title of the page",
"summary": "A comprehensive summary of the content that captures the key information for fact-checking purposes",
MrOrz (Member, Author) commented:

@claude I noticed that under this prompt, the LLM would rewrite the whole target document as a summary. However, what I want here is the real text from the URL so that I can index the content and the exact claims in it. Rewritten summaries do not work in my scenario. Please help rewrite the prompt so that the LLM extracts the exact content from the URL instead of a summary.

claude bot commented Aug 24, 2025

Claude finished @MrOrz's task.


Fixed Gemini URL Scraper Prompt ✅

I've successfully updated the prompt to extract exact content instead of summaries for proper indexing.

Todo List:

  • Read current geminiUrlScraper.js to understand the prompt issue
  • Modify prompt to extract exact content instead of summaries
  • Commit and push the changes

Changes Made

Updated the prompt in src/util/geminiUrlScraper.js:54 to:

  • Request "exact original text content" instead of "comprehensive summary"
  • Add explicit instructions: "DO NOT summarize, rewrite, or paraphrase"
  • Emphasize preserving "all original claims, statements, facts, and information exactly as written"
  • Clarify that the text will be used for indexing and search purposes

The key change was replacing the summarization instruction with clear direction to extract the complete original text content without any AI rewriting, ensuring you get the actual claims and statements from the source material for proper indexing.
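
The revised instruction could look roughly like this. This is a paraphrase of the change described above, not the exact prompt text in src/util/geminiUrlScraper.js:

```javascript
// Hypothetical approximation of the revised extraction instructions; the
// exact wording lives in src/util/geminiUrlScraper.js.
const EXTRACTION_INSTRUCTIONS = [
  'Extract the exact original text content of each page.',
  'DO NOT summarize, rewrite, or paraphrase.',
  'Preserve all claims, statements, facts, and information exactly as written.',
  'The text will be indexed for search, so fidelity matters more than brevity.',
].join('\n');
```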



claude bot and others added 4 commits August 24, 2025 14:35
…d of summaries

- Changed prompt to explicitly request original text content without rewriting
- Added clear instructions to preserve all claims and statements as written
- This ensures content can be properly indexed for fact-checking purposes

Co-authored-by: Johnson Liang <[email protected]>
Successfully merging this pull request may close these issues.

Track URL Resolver performance issues and refactor (追蹤 URL Resolver 效能問題與重構)