Add curl fallback for problematic HTTP requests by twardoch · Pull Request #5 · twardoch/twars-url2md

twardoch · 2025-06-25T00:01:17Z

Summary

ensure each file records its path via this_file
fallback to using curl when reqwest fails to fetch a page or returns an unexpected MIME type

Testing

cargo test --all-features

https://chatgpt.com/codex/tasks/task_e_685b338d8af8832aa54a598b37a01850

Summary by Sourcery

Provide a system curl fallback for HTML fetching to handle reqwest failures, timeouts, and unexpected MIME types

New Features:

Add fetch_html_with_curl function to retrieve and parse HTML via the system curl command

Enhancements:

Modify fetch_html to fall back to curl on reqwest errors, timeouts, or non-HTML content and log warnings instead of returning errors

sourcery-ai · 2025-06-25T00:01:22Z

Reviewer's Guide

Enhance HTML fetching by adding a fallback to the system’s curl binary whenever reqwest fails (due to errors, timeouts, or unexpected content types), and encapsulating that logic in a new helper function.

Sequence diagram for HTML fetch with reqwest and curl fallback

sequenceDiagram
    participant Caller
    participant fetch_html
    participant reqwest
    participant fetch_html_with_curl
    participant curl

    Caller->>fetch_html: fetch_html(client, url)
    fetch_html->>reqwest: client.get(url).send()
    alt reqwest succeeds and returns HTML
        reqwest-->>fetch_html: HTML response
        fetch_html-->>Caller: HTML body
    else reqwest fails (error, timeout, or non-HTML)
        fetch_html->>fetch_html_with_curl: fetch_html_with_curl(url)
        fetch_html_with_curl->>curl: spawn curl process
        curl-->>fetch_html_with_curl: curl output
        alt curl returns HTML
            fetch_html_with_curl-->>fetch_html: HTML body
            fetch_html-->>Caller: HTML body
        else curl fails or non-HTML
            fetch_html_with_curl-->>fetch_html: error
            fetch_html-->>Caller: error
        end
    end

Class diagram for new HTML fetch logic with curl fallback

classDiagram
    class fetch_html {
        +async fn fetch_html(client: &Client, url: &str) -> Result<String>
    }
    class fetch_html_with_curl {
        +async fn fetch_html_with_curl(url: &str) -> Result<String>
    }
    fetch_html --> fetch_html_with_curl : fallback on error/timeout/non-HTML

File-Level Changes

Change	Details	Files
Inject fallback calls in fetch_html when reqwest errors, times out, or sees a non-HTML MIME type	Replace error returns on reqwest failures with tracing::warn and a call to fetch_html_with_curl Replace timeout error handling with tracing::warn and fallback invocation Replace non-HTML content-type error with a fallback invocation	`src/html.rs`
Introduce async helper fetch_html_with_curl to perform HTML fetch via system curl	Spawn curl with flags for redirects, headers, and UA Capture stdout, split headers and body, validate content-type Return body on success or propagate detailed errors	`src/html.rs`

Tips and commands

Interacting with Sourcery

Trigger a new review: Comment @sourcery-ai review on the pull request.
Continue discussions: Reply directly to Sourcery's review comments.
Generate a GitHub issue from a review comment: Ask Sourcery to create an
issue from a review comment by replying to it. You can also reply to a
review comment with @sourcery-ai issue to create an issue from it.
Generate a pull request title: Write @sourcery-ai anywhere in the pull
request title to generate a title at any time. You can also comment
@sourcery-ai title on the pull request to (re-)generate the title at any time.
Generate a pull request summary: Write @sourcery-ai summary anywhere in
the pull request body to generate a PR summary at any time exactly where you
want it. You can also comment @sourcery-ai summary on the pull request to
(re-)generate the summary at any time.
Generate reviewer's guide: Comment @sourcery-ai guide on the pull
request to (re-)generate the reviewer's guide at any time.
Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
pull request to resolve all Sourcery comments. Useful if you've already
addressed all the comments and don't want to see them anymore.
Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
request to dismiss all existing Sourcery reviews. Especially useful if you
want to start fresh with a new review - don't forget to comment
@sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

Enable or disable review features such as the Sourcery-generated pull request
summary, the reviewer's guide, and others.
Change the review language.
Add, remove or edit custom review instructions.
Adjust other review settings.

Getting Help

Contact our support team for questions or feedback.
Visit our documentation for detailed guides and information.
Keep in touch with the Sourcery team by following us on X/Twitter, LinkedIn or GitHub.

qodo-code-review · 2025-06-25T00:01:41Z

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 No relevant tests
🔒 Security concerns Command injection: The curl command construction uses direct string formatting with the URL parameter without proper sanitization. A malicious URL could potentially inject additional curl arguments or commands, leading to command injection vulnerabilities.
⚡ Recommended focus areas for review Error Handling The curl fallback function has basic error handling but may not handle all edge cases properly. The header parsing logic assumes specific format and could fail with malformed responses. async fn fetch_html_with_curl(url: &str) -> Result<String> { let output = Command::new("curl") .arg("-L") .arg("-s") .arg("-D") .arg("-") .arg("-H") .arg(format!("User-Agent: {}", crate::USER_AGENT_STRING)) .arg("-H") .arg("Accept: text/html") .arg(url) .output() .await .context("Failed to spawn curl process")?; if !output.status.success() { return Err(anyhow::anyhow!( "curl exited with status {}", output.status )); } let resp = String::from_utf8_lossy(&output.stdout); let parts: Vec<&str> = resp.splitn(2, "\r\n\r\n").collect(); if parts.len() != 2 { return Err(anyhow::anyhow!("Unexpected curl output")); } let headers = parts[0].to_lowercase(); if !headers.contains("content-type: text/html") { return Err(anyhow::anyhow!("Not an HTML page (curl)")); } Ok(parts[1].to_string()) } Dependency Risk The fallback relies on system curl being available, which may not be guaranteed on all systems. No validation is performed to check if curl exists before attempting to use it. let output = Command::new("curl") .arg("-L") .arg("-s") .arg("-D") .arg("-") .arg("-H") .arg(format!("User-Agent: {}", crate::USER_AGENT_STRING)) .arg("-H") .arg("Accept: text/html") .arg(url) .output() .await .context("Failed to spawn curl process")?; Content Validation The curl response parsing uses simple string splitting and case-insensitive header checking which may not be robust enough for all HTTP response formats and could miss valid HTML content. let parts: Vec<&str> = resp.splitn(2, "\r\n\r\n").collect(); if parts.len() != 2 { return Err(anyhow::anyhow!("Unexpected curl output")); } let headers = parts[0].to_lowercase(); if !headers.contains("content-type: text/html") { return Err(anyhow::anyhow!("Not an HTML page (curl)")); } Ok(parts[1].to_string())

qodo-code-review · 2025-06-25T00:01:44Z

CI Feedback 🧐

A test triggered by this PR failed. Here is an AI-generated analysis of the failure:

Action: Test Suite
Failed stage: Check formatting [❌]
Failure summary: The action failed because the code formatting check (`cargo fmt --check`) detected formatting issues in the Rust code. Specifically: • In `/home/runner/work/twars-url2md/twars-url2md/src/html.rs` at line 254: A `tracing::warn!` macro call needs to be reformatted to split arguments across multiple lines • In `/home/runner/work/twars-url2md/twars-url2md/src/html.rs` at line 413: An `anyhow::anyhow!` macro call needs to be reformatted to a single line The process completed with exit code 1, indicating the formatting check failed.
Relevant error logs: 1: ##[group]Operating System 2: Ubuntu ... 148: RUSTFLAGS: -D warnings 149: targets: 150: components: rustfmt, clippy 151: ##[endgroup] 152: ##[group]Run : set $CARGO_HOME 153: �[36;1m: set $CARGO_HOME�[0m 154: �[36;1mecho CARGO_HOME=${CARGO_HOME:-"$HOME/.cargo"} >> $GITHUB_ENV�[0m 155: shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 156: env: 157: CARGO_TERM_COLOR: always 158: RUSTFLAGS: -D warnings 159: ##[endgroup] 160: ##[group]Run : install rustup if needed 161: �[36;1m: install rustup if needed�[0m 162: �[36;1mif ! command -v rustup &>/dev/null; then�[0m 163: �[36;1m curl --proto '=https' --tlsv1.2 --retry 10 --retry-connrefused --location --silent --show-error --fail https://sh.rustup.rs \| sh -s -- --default-toolchain none -y�[0m 164: �[36;1m echo "$CARGO_HOME/bin" >> $GITHUB_PATH�[0m ... 234: �[36;1m if rustc +stable --version --verbose \| grep -q '^release: 1\.6[89]\.'; then�[0m 235: �[36;1m touch "/home/runner/work/_temp"/.implicit_cargo_registries_crates_io_protocol \|\| true�[0m 236: �[36;1m echo CARGO_REGISTRIES_CRATES_IO_PROTOCOL=sparse >> $GITHUB_ENV�[0m 237: �[36;1m elif rustc +stable --version --verbose \| grep -q '^release: 1\.6[67]\.'; then�[0m 238: �[36;1m touch "/home/runner/work/_temp"/.implicit_cargo_registries_crates_io_protocol \|\| true�[0m 239: �[36;1m echo CARGO_REGISTRIES_CRATES_IO_PROTOCOL=git >> $GITHUB_ENV�[0m 240: �[36;1m fi�[0m 241: �[36;1mfi�[0m 242: shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 243: env: 244: CARGO_TERM_COLOR: always 245: RUSTFLAGS: -D warnings 246: CARGO_HOME: /home/runner/.cargo 247: CARGO_INCREMENTAL: 0 248: ##[endgroup] 249: ##[group]Run : work around spurious network errors in curl 8.0 250: �[36;1m: work around spurious network errors in curl 8.0�[0m 251: �[36;1m# https://rust-lang.zulipchat.com/#narrow/stream/246057-t-cargo/topic/timeout.20investigation�[0m ... 322: Cache hit for: v0-rust-test-Linux-x64-e3397364-8bd239b5 323: Received 213909504 of 411071766 (52.0%), 203.6 MBs/sec 324: Received 411071766 of 411071766 (100.0%), 199.9 MBs/sec 325: Cache Size: ~392 MB (411071766 B) 326: [command]/usr/bin/tar -xf /home/runner/work/_temp/b2c7eeb8-4caf-4c5d-b5be-e400f551fb88/cache.tzst -P -C /home/runner/work/twars-url2md/twars-url2md --use-compress-program unzstd 327: Cache restored successfully 328: Restored from cache key "v0-rust-test-Linux-x64-e3397364-8bd239b5" full match: true. 329: ##[group]Run cargo fmt --check 330: �[36;1mcargo fmt --check�[0m 331: shell: /usr/bin/bash -e {0} 332: env: 333: CARGO_TERM_COLOR: always 334: RUSTFLAGS: -D warnings 335: CARGO_HOME: /home/runner/.cargo 336: CARGO_INCREMENTAL: 0 337: CACHE_ON_FAILURE: false 338: ##[endgroup] ... 358: return fetch_html_with_curl(url).await; 359: } 360: }; 361: Diff in /home/runner/work/twars-url2md/twars-url2md/src/html.rs:254: 362: // Skip non-HTML content 363: if !content_type.contains("text/html") { 364: - tracing::warn!("Unexpected content type '{}' for {} via reqwest. Falling back to curl", content_type, url); 365: + tracing::warn!( 366: + "Unexpected content type '{}' for {} via reqwest. Falling back to curl", 367: + content_type, 368: + url 369: + ); 370: return fetch_html_with_curl(url).await; 371: } 372: Diff in /home/runner/work/twars-url2md/twars-url2md/src/html.rs:413: 373: .context("Failed to spawn curl process")?; 374: if !output.status.success() { 375: - return Err(anyhow::anyhow!( 376: - "curl exited with status {}", 377: - output.status 378: - )); 379: + return Err(anyhow::anyhow!("curl exited with status {}", output.status)); 380: } 381: let resp = String::from_utf8_lossy(&output.stdout); 382: ##[error]Process completed with exit code 1. 383: Post job cleanup.

sourcery-ai

Hey @twardoch - I've reviewed your changes - here's some feedback:

Check for the curl binary’s availability before falling back and return a clear error if it’s missing.
Wrap the spawned curl process in a timeout (e.g., via tokio::time::timeout) to avoid hanging indefinitely.
Improve HTTP response parsing in fetch_html_with_curl to handle LF-only delimiters or consider using a proper HTTP parser.

Prompt for AI Agents

Please address the comments from this code review:
## Overall Comments
- Check for the `curl` binary’s availability before falling back and return a clear error if it’s missing.
- Wrap the spawned `curl` process in a timeout (e.g., via `tokio::time::timeout`) to avoid hanging indefinitely.
- Improve HTTP response parsing in `fetch_html_with_curl` to handle LF-only delimiters or consider using a proper HTTP parser.

## Individual Comments

### Comment 1
<location> `src/html.rs:401` </location>
<code_context>
+/// platforms where reqwest/rustls has trouble negotiating TLS. The function
+/// attempts to fetch the given URL and returns the HTML body if successful.
+async fn fetch_html_with_curl(url: &str) -> Result<String> {
+    let output = Command::new("curl")
+        .arg("-L")
+        .arg("-s")
</code_context>

<issue_to_address>
No timeout is set for the curl subprocess, risking indefinite hangs.

Add a timeout to the subprocess to avoid indefinite waits if curl hangs.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨

_{Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.}

sourcery-ai · 2025-06-25T00:02:10Z

+/// platforms where reqwest/rustls has trouble negotiating TLS. The function
+/// attempts to fetch the given URL and returns the HTML body if successful.
+async fn fetch_html_with_curl(url: &str) -> Result<String> {
+    let output = Command::new("curl")


issue (bug_risk): No timeout is set for the curl subprocess, risking indefinite hangs.

Add a timeout to the subprocess to avoid indefinite waits if curl hangs.

qodo-code-review · 2025-06-25T00:02:18Z

PR Code Suggestions ✨

Explore these optional code suggestions:

Category	Suggestion	Impact
General	Include stderr in error messages Include stderr output in the error message to provide more context about why curl failed. This will help with debugging network issues or invalid URLs. src/html.rs [415-420] if !output.status.success() { + let stderr = String::from_utf8_lossy(&output.stderr); return Err(anyhow::anyhow!( - "curl exited with status {}", - output.status + "curl exited with status {}: {}", + output.status, + stderr )); } Apply / Chat Suggestion importance[1-10]: 8 __ Why: This is a valuable suggestion for improving debuggability. When the `curl` command fails, including its `stderr` output in the error message provides crucial context for diagnosing the root cause, such as network issues, certificate problems, or incorrect URLs.	Medium
General	Fix content type validation logic The content-type validation is too strict and will reject valid HTML responses with charset specifications or other parameters. Use a more flexible check that accepts any HTML content type. src/html.rs [427-430] let headers = parts[0].to_lowercase(); -if !headers.contains("content-type: text/html") { +if !headers.contains("content-type:") \|\| !headers.contains("text/html") { return Err(anyhow::anyhow!("Not an HTML page (curl)")); } Apply / Chat Suggestion importance[1-10]: 7 __ Why: The suggestion correctly points out that checking for the exact string `"content-type: text/html"` is too strict and would fail for valid headers like `Content-Type: text/html; charset=utf-8`. The `improved_code` provides a more flexible check, improving the robustness of the `curl` fallback logic.	Medium
More

Fix HTTP fetching with curl fallback

2cdc477

twardoch added the codex label Jun 25, 2025 — with ChatGPT Codex Connector

qodo-code-review Bot added Review effort 3/5 Possible security concern labels Jun 25, 2025

sourcery-ai Bot reviewed Jun 25, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add curl fallback for problematic HTTP requests#5

Add curl fallback for problematic HTTP requests#5
twardoch wants to merge 1 commit intomainfrom
w4mcc4-codex/fix-url-fetching-issue-on-macos-and-other-systems

twardoch commented Jun 25, 2025 •

edited by sourcery-ai Bot

Loading

Uh oh!

sourcery-ai Bot commented Jun 25, 2025 •

edited

Loading

Interacting with Sourcery

Customizing Your Experience

Getting Help

Uh oh!

qodo-code-review Bot commented Jun 25, 2025

Uh oh!

qodo-code-review Bot commented Jun 25, 2025

Uh oh!

sourcery-ai Bot left a comment

Uh oh!

sourcery-ai Bot Jun 25, 2025

Uh oh!

qodo-code-review Bot commented Jun 25, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

twardoch commented Jun 25, 2025 • edited by sourcery-ai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Testing

Summary by Sourcery

Uh oh!

sourcery-ai Bot commented Jun 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviewer's Guide

Sequence diagram for HTML fetch with reqwest and curl fallback

Class diagram for new HTML fetch logic with curl fallback

File-Level Changes

Interacting with Sourcery

Customizing Your Experience

Getting Help

Uh oh!

qodo-code-review Bot commented Jun 25, 2025

PR Reviewer Guide 🔍

Uh oh!

qodo-code-review Bot commented Jun 25, 2025

CI Feedback 🧐

Uh oh!

sourcery-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

sourcery-ai Bot Jun 25, 2025

Choose a reason for hiding this comment

Uh oh!

qodo-code-review Bot commented Jun 25, 2025

PR Code Suggestions ✨

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

twardoch commented Jun 25, 2025 •

edited by sourcery-ai Bot

Loading

sourcery-ai Bot commented Jun 25, 2025 •

edited

Loading