Conversation

MrOrz (Member) commented Aug 12, 2025

This PR implements the suggested migration from gRPC url-resolver to Gemini's urlContext tool to address performance and maintenance issues.

Changes

  • Replace gRPC dependency with Google AI SDK Gemini urlContext tool
  • Extract title, summary, and topImageUrl as requested
  • Leave html field empty per requirements
  • Add experimental testing script for manual evaluation
  • Update tests to use new implementation
  • Maintain backward compatibility with existing API

Benefits

  • Eliminates gRPC maintenance overhead (binary builds, proto sync)
  • Resolves stability issues that caused the 2025/07/02 downtime
  • Uses reliable cloud service instead of self-hosted resolver
  • Leverages AI for better content extraction and summarization

Closes #373

🤖 Generated with Claude Code

gemini-code-assist bot (Contributor) left a comment

Summary of Changes

Hello @MrOrz, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the application's URL scraping capabilities by replacing the existing gRPC-based resolver with a new implementation utilizing Google's Gemini urlContext tool. The primary goal is to enhance system stability, improve performance, and streamline maintenance by leveraging a reliable cloud-based AI service for content extraction. The changes involve introducing a new Gemini-specific scraping utility, updating the main URL processing logic to use this new utility, and adjusting test suites accordingly. An experimental script is also added to facilitate testing and performance monitoring of the new scraper.

Highlights

  • Core URL Resolver Migration: The core URL resolution mechanism has been migrated from a gRPC-based system to Google's Gemini urlContext tool. This change aims to improve stability, performance, and reduce maintenance overhead.
  • Enhanced Content Extraction: The new Gemini-powered scraper leverages AI to extract more relevant and structured information from web pages, specifically focusing on accurate titles, comprehensive summaries, and the most representative top image URLs.
  • New Experimental Testing Script: A new experimental script has been introduced to facilitate testing and evaluation of the Gemini URL scraper, complete with integration for Langfuse to track performance metrics like success rates and processing times.
  • Removal of gRPC Dependency: The dependency on gRPC has been removed, addressing previous stability issues (like the 2025/07/02 downtime) and simplifying the overall architecture by moving to a managed cloud service.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review of the pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, in both issue comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder at the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check its output and use code with caution.

gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request successfully migrates the URL scraping functionality from a gRPC-based resolver to a new implementation using the Gemini urlContext tool. The changes are well-organized and include a helpful experimentation script and updated tests. My review includes suggestions to improve the robustness of the experiment script and a performance optimization for the new Gemini scraper.

One critical point to note is that while the tests in src/util/__tests__/scrapUrls.js have been updated for the new mock, there seems to be a leftover reference to the old resolveUrl mock on lines 59-60. Since this is outside the changed lines of this PR, I couldn't leave a direct comment, but this will cause the tests to fail and should be addressed.

  console.info('No URLs specified, using default test URLs:', testUrls);
}

console.info(`Testing URL scraping with ${testUrls.length} URLs`);

Severity: high

If the urls argument results in an empty list of URLs after trimming and filtering, the script could encounter division-by-zero errors later when calculating statistics. It's safer to add a check to exit early if testUrls is empty.

  if (testUrls.length === 0) {
    console.info('No valid URLs to process. Exiting.');
    return;
  }
  console.info(`Testing URL scraping with ${testUrls.length} URLs`);

Comment on lines 16 to 21

const project = await new GoogleAuth().getProjectId();
const genAI = new GoogleGenAI({
  vertexai: true,
  project,
  location: 'us-central1', // Using us-central1 for better availability
});

Severity: high

Creating new GoogleAuth and GoogleGenAI instances on every function call is inefficient. These clients are designed to be long-lived and can be reused across multiple requests. Initializing them on each call can add unnecessary latency due to authentication and setup processes. Consider initializing the client once at the module level and reusing it.

For example, you could use a lazy-initialized singleton pattern:

let genAI;

async function getGenAIClient() {
  if (genAI) return genAI;

  const project = await new GoogleAuth().getProjectId();
  genAI = new GoogleGenAI({
    vertexai: true,
    project,
    location: 'us-central1',
  });
  return genAI;
}

export default async function scrapeUrlsWithGemini(urls) {
  // ...
  const genAIClient = await getGenAIClient();
  // ... use genAIClient in the rest of the function
}

if (single) {
  testUrls = [single];
} else if (urls) {
  testUrls = urls.split(',').map(url => url.trim());

Severity: medium

It's good practice to filter out empty strings that might result from splitting the urls string (e.g., from trailing commas or empty entries). This prevents sending empty URL strings to the scraper, which would result in unnecessary and failing API calls.

Suggested change:
- testUrls = urls.split(',').map(url => url.trim());
+ testUrls = urls.split(',').map(url => url.trim()).filter(Boolean);

MrOrz (Member, Author) left a comment

@claude please help fix my comments

claude bot commented Aug 12, 2025

Claude finished @MrOrz's task.


Fixed review comments ✅

I've successfully addressed all the review feedback from @MrOrz:

Changes Made

1. Fixed urlContext tool configuration

  • ✅ Updated tool configuration from functionDeclarations to the correct format: config: { tools: [{ urlContext: {} }] }
  • ✅ Removed unnecessary content parameters for urlContext tool

2. Optimized for multiple URLs in single call

  • ✅ Refactored to process all URLs in one LLM call instead of individual requests
  • ✅ Updated prompt to handle multiple URLs and return JSON array of results
  • ✅ Added fallback handling to ensure all input URLs get results
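
The fallback behavior in item 2 can be sketched as a small pure helper. The names below are illustrative only, not the actual identifiers in src/util/geminiUrlScraper.js:

```javascript
// Hypothetical sketch: ensure every requested URL has an entry in the
// result set, even when the LLM response omits some of them.
function withFallbacks(requestedUrls, llmResults) {
  // Index the results the model actually returned by their original URL.
  const byUrl = new Map(llmResults.map(r => [r.url, r]));

  return requestedUrls.map(
    url =>
      byUrl.get(url) ?? {
        // Fallback entry: keep the URL so callers can still index it.
        url,
        canonical: url,
        title: '',
        summary: '',
        topImageUrl: '',
      }
  );
}
```

Keying the returned results by URL keeps the output aligned with the input order regardless of how the model orders its JSON array.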

3. Switched to gemini-2.5-flash model

  • ✅ Changed from gemini-2.0-flash-001 to gemini-2.5-flash as requested

4. Added canonical URL extraction

  • ✅ Updated prompt to extract canonical URLs from meta tags
  • ✅ Falls back to original URL if no canonical URL found
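
The canonical-URL fallback in item 4 amounts to something like the following tiny helper (a sketch; the real logic lives in the prompt and response handling):

```javascript
// Hypothetical sketch: prefer the canonical URL extracted from the page's
// <link rel="canonical"> tag, falling back to the originally requested URL.
function resolveCanonical(originalUrl, extractedCanonical) {
  return extractedCanonical && extractedCanonical.trim() !== ''
    ? extractedCanonical.trim()
    : originalUrl;
}
```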

5. Performance optimizations

  • ✅ Implemented singleton pattern for GoogleAuth and GoogleGenAI client reuse
  • ✅ Added lazy initialization to avoid creating clients on every function call

6. Fixed experimental script safety

  • ✅ Added filter(Boolean) to remove empty strings from URL splitting
  • ✅ Added early exit if no valid URLs to process (prevents division by zero)
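
Taken together, the two safety fixes in item 6 look roughly like this (a sketch; the actual experiment script parses its CLI arguments differently):

```javascript
// Hypothetical sketch of the experiment script's URL parsing with both
// safety fixes applied: drop empty entries, then bail out early.
function parseTestUrls(urlsArg) {
  const testUrls = (urlsArg ?? '')
    .split(',')
    .map(url => url.trim())
    .filter(Boolean); // drops '' entries from trailing commas or blanks

  if (testUrls.length === 0) {
    // Early exit prevents division-by-zero in the statistics step.
    return null;
  }
  return testUrls;
}
```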

Technical Improvements

  • Efficiency: Single LLM call handles multiple URLs vs. previous one-by-one approach
  • Performance: Client reuse eliminates authentication overhead on subsequent calls
  • Reliability: Better error handling and fallback mechanisms
  • Safety: Prevents edge cases in experimental script

All changes maintain backward compatibility with the existing API while addressing the performance and maintenance concerns mentioned in the review.

claude bot and others added 2 commits August 24, 2025 12:58
- Add Gemini-based URL scraping using @google/genai urlContext tool
- Remove gRPC dependency from scrapUrls.js
- Create experimental testing script for manual evaluation
- Update tests to mock new Gemini scraper
- Maintain backward compatibility with existing API
- Leave html field empty as requested, focus on title/summary/topImageUrl

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-authored-by: Johnson Liang <[email protected]>
- Use correct urlContext tool configuration format
- Process multiple URLs in single LLM call for efficiency
- Switch to gemini-2.5-flash model
- Add canonical URL extraction capability
- Implement client reuse pattern for performance
- Add safety check for empty URL list in experiment script

Co-authored-by: Johnson Liang <[email protected]>
MrOrz force-pushed the claude/issue-373-20250812-0421 branch from 54b8cfc to 5194dc4 on August 24, 2025 04:58
"url": "original URL from the list",
"canonical": "canonical URL if different from original, or same as original",
"title": "The main title of the page",
"summary": "A comprehensive summary of the content that captures the key information for fact-checking purposes",
MrOrz (Member, Author) commented:

@claude I noticed that under this prompt, the LLM would rewrite the whole target document as a summary. However, what I want here is the real text from the URL so that I can index the content and the exact claims in it. Rewritten summaries do not work in my scenario. Please help rewrite the prompt so that the LLM extracts the exact content from the URL instead of a summary.

claude bot commented Aug 24, 2025

Claude finished @MrOrz's task.


Fixed Gemini URL Scraper Prompt ✅

I've successfully updated the prompt to extract exact content instead of summaries for proper indexing.

Todo List:

  • Read current geminiUrlScraper.js to understand the prompt issue
  • Modify prompt to extract exact content instead of summaries
  • Commit and push the changes

Changes Made

Updated the prompt in src/util/geminiUrlScraper.js:54 to:

  • Request "exact original text content" instead of "comprehensive summary"
  • Add explicit instructions: "DO NOT summarize, rewrite, or paraphrase"
  • Emphasize preserving "all original claims, statements, facts, and information exactly as written"
  • Clarify that the text will be used for indexing and search purposes

The key change was replacing the summarization instruction with clear direction to extract the complete original text content without any AI rewriting, ensuring you get the actual claims and statements from the source material for proper indexing.
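
The revised instruction could look roughly like this. This is a paraphrase of the change described above, not the exact prompt text in src/util/geminiUrlScraper.js:

```javascript
// Hypothetical approximation of the revised extraction instructions; the
// exact wording lives in src/util/geminiUrlScraper.js.
const EXTRACTION_INSTRUCTIONS = [
  'Extract the exact original text content of each page.',
  'DO NOT summarize, rewrite, or paraphrase.',
  'Preserve all claims, statements, facts, and information exactly as written.',
  'The text will be indexed for search, so fidelity matters more than brevity.',
].join('\n');
```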



claude bot and others added 4 commits August 24, 2025 14:35
…d of summaries

- Changed prompt to explicitly request original text content without rewriting
- Added clear instructions to preserve all claims and statements as written
- This ensures content can be properly indexed for fact-checking purposes

Co-authored-by: Johnson Liang <[email protected]>
Successfully merging this pull request may close these issues.

Track URL Resolver performance issues and refactor (追蹤 URL Resolver 效能問題與重構)