Replace gRPC URL resolver with Gemini urlContext implementation #377
base: master
Conversation
Summary of Changes
Hello @MrOrz, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly refactors the application's URL scraping capabilities by replacing the existing gRPC-based resolver with a new implementation that uses Google's Gemini urlContext tool. The primary goal is to enhance system stability, improve performance, and streamline maintenance by leveraging a reliable cloud-based AI service for content extraction. The changes introduce a new Gemini-specific scraping utility, update the main URL processing logic to use it, and adjust the test suites accordingly. An experimental script is also added to facilitate testing and performance monitoring of the new scraper.
Highlights
- Core URL Resolver Migration: The core URL resolution mechanism has been migrated from a gRPC-based system to Google's Gemini urlContext tool. This change aims to improve stability and performance and to reduce maintenance overhead.
- Enhanced Content Extraction: The new Gemini-powered scraper leverages AI to extract more relevant and structured information from web pages, specifically focusing on accurate titles, comprehensive summaries, and the most representative top image URLs (see the sketch after this list).
- New Experimental Testing Script: A new experimental script has been introduced to facilitate testing and evaluation of the Gemini URL scraper, complete with integration for Langfuse to track performance metrics like success rates and processing times.
- Removal of gRPC Dependency: The dependency on gRPC has been removed, addressing previous stability issues (like the 2025/07/02 downtime) and simplifying the overall architecture by moving to a managed cloud service.
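To make the extracted structure concrete, here is a hypothetical sketch of what one scraper result could look like, inferred from these highlights and the prompt fields quoted later in this thread; the actual field names in the merged code may differ:

```js
// Hypothetical result shape (an assumption based on this thread, not the merged schema).
const exampleResult = {
  url: 'https://example.com/article', // original URL from the input list
  canonical: 'https://example.com/article', // canonical URL if the page declares one
  title: 'The main title of the page',
  summary: 'Extracted page text used for indexing',
  topImageUrl: 'https://example.com/og-image.jpg', // most representative image on the page
};
```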
Code Review
This pull request successfully migrates the URL scraping functionality from a gRPC-based resolver to a new implementation using the Gemini urlContext tool. The changes are well-organized and include a helpful experimentation script and updated tests. My review includes suggestions to improve the robustness of the experiment script and a performance optimization for the new Gemini scraper.
One critical point to note: while the tests in src/util/__tests__/scrapUrls.js have been updated for the new mock, there seems to be a leftover reference to the old resolveUrl mock on lines 59-60 of that file. Since this is outside the changed lines of this PR, I couldn't leave a direct comment, but it will cause the tests to fail and should be addressed; a possible mock swap is sketched below.
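As an illustration only, the swap might look like the following Jest mock; the module path and return shape here are assumptions, since the actual test file is not shown in this thread:

```js
// Hypothetical sketch: mock the new Gemini scraper instead of the old resolveUrl.
// The real path and result fields in src/util/__tests__/scrapUrls.js may differ.
jest.mock('../geminiUrlScraper', () => ({
  __esModule: true,
  default: jest.fn(async (urls) =>
    urls.map((url) => ({
      url,
      canonical: url,
      title: 'Mock title',
      summary: 'Mock content',
      topImageUrl: '',
    }))
  ),
}));
```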
src/scripts/experimentUrlScraper.ts

```ts
  console.info('No URLs specified, using default test URLs:', testUrls);
}

console.info(`Testing URL scraping with ${testUrls.length} URLs`);
```
If the urls argument results in an empty list of URLs after trimming and filtering, the script could encounter division-by-zero errors later when calculating statistics. It's safer to add a check to exit early if testUrls is empty:
```ts
if (testUrls.length === 0) {
  console.info('No valid URLs to process. Exiting.');
  return;
}

console.info(`Testing URL scraping with ${testUrls.length} URLs`);
```
src/util/geminiUrlScraper.js (Outdated)
```js
const project = await new GoogleAuth().getProjectId();
const genAI = new GoogleGenAI({
  vertexai: true,
  project,
  location: 'us-central1', // Using us-central1 for better availability
});
```
Creating new GoogleAuth and GoogleGenAI instances on every function call is inefficient. These clients are designed to be long-lived and can be reused across multiple requests; initializing them on each call adds unnecessary latency from authentication and setup. Consider initializing the client once at the module level and reusing it.
For example, you could use a lazy-initialized singleton pattern:
```js
let genAI;

async function getGenAIClient() {
  if (genAI) return genAI;

  const project = await new GoogleAuth().getProjectId();
  genAI = new GoogleGenAI({
    vertexai: true,
    project,
    location: 'us-central1',
  });
  return genAI;
}

export default async function scrapeUrlsWithGemini(urls) {
  // ...
  const genAIClient = await getGenAIClient();
  // ... use genAIClient in the rest of the function
}
```
src/scripts/experimentUrlScraper.ts (Outdated)
```ts
if (single) {
  testUrls = [single];
} else if (urls) {
  testUrls = urls.split(',').map(url => url.trim());
```
It's good practice to filter out empty strings that might result from splitting the urls string (e.g., from trailing commas or empty entries). This prevents sending empty URL strings to the scraper, which would result in unnecessary and failing API calls.
```diff
- testUrls = urls.split(',').map(url => url.trim());
+ testUrls = urls.split(',').map(url => url.trim()).filter(Boolean);
```
@claude please help fix my comments
Claude finished @MrOrz's task — Fixed review comments ✅

I've addressed all the review feedback from @MrOrz:

Changes Made
1. Fixed the urlContext tool configuration format
2. Optimized to process multiple URLs in a single LLM call
3. Switched to the gemini-2.5-flash model
4. Added canonical URL extraction
5. Applied performance optimizations (client reuse)
6. Added safety checks to the experimental script

Technical Improvements
All changes maintain backward compatibility with the existing API while addressing the performance and maintenance concerns raised in the review. A rough sketch of what items 1-3 imply is shown below.
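For reference, a minimal sketch of how the urlContext tool configuration and single-call batching might look with the @google/genai SDK; the exact prompt, option shapes, and response handling in the merged code are assumptions here, not confirmed by this thread:

```js
import { GoogleGenAI } from '@google/genai';

// Hypothetical sketch based on this review thread; the merged implementation may differ.
async function scrapeWithUrlContext(genAI, urls) {
  const response = await genAI.models.generateContent({
    model: 'gemini-2.5-flash', // model mentioned in item 3 above
    // One prompt covering all URLs, so a single LLM call handles the whole batch (item 2).
    contents: `Extract the title, content, and top image URL for each of: ${urls.join(', ')}`,
    config: {
      tools: [{ urlContext: {} }], // enables Gemini's URL context tool (item 1)
    },
  });
  return response.text; // expected to contain the structured per-URL results
}
```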
- Add Gemini-based URL scraping using @google/genai urlContext tool
- Remove gRPC dependency from scrapUrls.js
- Create experimental testing script for manual evaluation
- Update tests to mock new Gemini scraper
- Maintain backward compatibility with existing API
- Leave html field empty as requested, focus on title/summary/topImageUrl

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-authored-by: Johnson Liang <[email protected]>
- Use correct urlContext tool configuration format
- Process multiple URLs in single LLM call for efficiency
- Switch to gemini-2.5-flash model
- Add canonical URL extraction capability
- Implement client reuse pattern for performance
- Add safety check for empty URL list in experiment script

Co-authored-by: Johnson Liang <[email protected]>
Force-pushed from 54b8cfc to 5194dc4.
src/util/geminiUrlScraper.js (Outdated)
"url": "original URL from the list", | ||
"canonical": "canonical URL if different from original, or same as original", | ||
"title": "The main title of the page", | ||
"summary": "A comprehensive summary of the content that captures the key information for fact-checking purposes", |
@claude I noticed that under this prompt, the LLM would rewrite the whole target document as a summary. However, what I want here is the real text from the URL, so that I can index the content and the exact claims in it. Rewritten summaries do not work in my scenario. Please help rewrite the prompt so that the LLM extracts the exact content from the URL instead of a summary.
Claude finished @MrOrz's task
Fixed Gemini URL Scraper Prompt ✅
I've successfully updated the prompt to extract exact content instead of summaries for proper indexing.
Todo List:
- Read current geminiUrlScraper.js to understand the prompt issue
- Modify prompt to extract exact content instead of summaries
- Commit and push the changes
Changes Made
Updated the prompt in src/util/geminiUrlScraper.js:54 to:
- Request "exact original text content" instead of "comprehensive summary"
- Add explicit instructions: "DO NOT summarize, rewrite, or paraphrase"
- Emphasize preserving "all original claims, statements, facts, and information exactly as written"
- Clarify that the text will be used for indexing and search purposes
The key change was replacing the summarization instruction with clear direction to extract the complete original text content without any AI rewriting, ensuring you get the actual claims and statements from the source material for proper indexing.
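As an illustration, the revised instruction might read something like the following; this is a hypothetical paraphrase of the change described above, not the literal prompt in the file:

```js
// Hypothetical paraphrase of the revised extraction instruction;
// the literal wording in src/util/geminiUrlScraper.js may differ.
const extractionInstruction = `
For each URL, return the exact original text content of the page.
DO NOT summarize, rewrite, or paraphrase.
Preserve all original claims, statements, facts, and information exactly as written.
The extracted text will be used for indexing and search, so fidelity matters more than brevity.
`;
```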
…d of summaries

- Changed prompt to explicitly request original text content without rewriting
- Added clear instructions to preserve all claims and statements as written
- This ensures content can be properly indexed for fact-checking purposes

Co-authored-by: Johnson Liang <[email protected]>
This PR implements the suggested migration from gRPC url-resolver to Gemini's urlContext tool to address performance and maintenance issues.
Changes
Benefits
Closes #373
🤖 Generated with Claude Code