Skip to content

Add curl fallback for problematic HTTP requests#5

Open
twardoch wants to merge 1 commit intomainfrom
w4mcc4-codex/fix-url-fetching-issue-on-macos-and-other-systems
Open

Add curl fallback for problematic HTTP requests#5
twardoch wants to merge 1 commit intomainfrom
w4mcc4-codex/fix-url-fetching-issue-on-macos-and-other-systems

Conversation

@twardoch
Copy link
Copy Markdown
Owner

@twardoch twardoch commented Jun 25, 2025

Summary

  • ensure each file records its path via this_file
  • fallback to using curl when reqwest fails to fetch a page or returns an unexpected MIME type

Testing

  • cargo test --all-features

https://chatgpt.com/codex/tasks/task_e_685b338d8af8832aa54a598b37a01850

Summary by Sourcery

Provide a system curl fallback for HTML fetching to handle reqwest failures, timeouts, and unexpected MIME types

New Features:

  • Add fetch_html_with_curl function to retrieve and parse HTML via the system curl command

Enhancements:

  • Modify fetch_html to fall back to curl on reqwest errors, timeouts, or non-HTML content and log warnings instead of returning errors

@sourcery-ai
Copy link
Copy Markdown

sourcery-ai Bot commented Jun 25, 2025

Reviewer's Guide

Enhance HTML fetching by adding a fallback to the system’s curl binary whenever reqwest fails (due to errors, timeouts, or unexpected content types), and encapsulating that logic in a new helper function.

Sequence diagram for HTML fetch with reqwest and curl fallback

sequenceDiagram
    participant Caller
    participant fetch_html
    participant reqwest
    participant fetch_html_with_curl
    participant curl

    Caller->>fetch_html: fetch_html(client, url)
    fetch_html->>reqwest: client.get(url).send()
    alt reqwest succeeds and returns HTML
        reqwest-->>fetch_html: HTML response
        fetch_html-->>Caller: HTML body
    else reqwest fails (error, timeout, or non-HTML)
        fetch_html->>fetch_html_with_curl: fetch_html_with_curl(url)
        fetch_html_with_curl->>curl: spawn curl process
        curl-->>fetch_html_with_curl: curl output
        alt curl returns HTML
            fetch_html_with_curl-->>fetch_html: HTML body
            fetch_html-->>Caller: HTML body
        else curl fails or non-HTML
            fetch_html_with_curl-->>fetch_html: error
            fetch_html-->>Caller: error
        end
    end
Loading

Class diagram for new HTML fetch logic with curl fallback

classDiagram
    class fetch_html {
        +async fn fetch_html(client: &Client, url: &str) -> Result<String>
    }
    class fetch_html_with_curl {
        +async fn fetch_html_with_curl(url: &str) -> Result<String>
    }
    fetch_html --> fetch_html_with_curl : fallback on error/timeout/non-HTML
Loading

File-Level Changes

Change Details Files
Inject fallback calls in fetch_html when reqwest errors, times out, or sees a non-HTML MIME type
  • Replace error returns on reqwest failures with tracing::warn and a call to fetch_html_with_curl
  • Replace timeout error handling with tracing::warn and fallback invocation
  • Replace non-HTML content-type error with a fallback invocation
src/html.rs
Introduce async helper fetch_html_with_curl to perform HTML fetch via system curl
  • Spawn curl with flags for redirects, headers, and UA
  • Capture stdout, split headers and body, validate content-type
  • Return body on success or propagate detailed errors
src/html.rs

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

@qodo-code-review
Copy link
Copy Markdown

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 No relevant tests
🔒 Security concerns

Command injection:
The curl command construction uses direct string formatting with the URL parameter without proper sanitization. A malicious URL could potentially inject additional curl arguments or commands, leading to command injection vulnerabilities.

⚡ Recommended focus areas for review

Error Handling

The curl fallback function has basic error handling but may not handle all edge cases properly. The header parsing logic assumes specific format and could fail with malformed responses.

async fn fetch_html_with_curl(url: &str) -> Result<String> {
    let output = Command::new("curl")
        .arg("-L")
        .arg("-s")
        .arg("-D")
        .arg("-")
        .arg("-H")
        .arg(format!("User-Agent: {}", crate::USER_AGENT_STRING))
        .arg("-H")
        .arg("Accept: text/html")
        .arg(url)
        .output()
        .await
        .context("Failed to spawn curl process")?;

    if !output.status.success() {
        return Err(anyhow::anyhow!(
            "curl exited with status {}",
            output.status
        ));
    }

    let resp = String::from_utf8_lossy(&output.stdout);
    let parts: Vec<&str> = resp.splitn(2, "\r\n\r\n").collect();
    if parts.len() != 2 {
        return Err(anyhow::anyhow!("Unexpected curl output"));
    }
    let headers = parts[0].to_lowercase();
    if !headers.contains("content-type: text/html") {
        return Err(anyhow::anyhow!("Not an HTML page (curl)"));
    }
    Ok(parts[1].to_string())
}
Dependency Risk

The fallback relies on system curl being available, which may not be guaranteed on all systems. No validation is performed to check if curl exists before attempting to use it.

let output = Command::new("curl")
    .arg("-L")
    .arg("-s")
    .arg("-D")
    .arg("-")
    .arg("-H")
    .arg(format!("User-Agent: {}", crate::USER_AGENT_STRING))
    .arg("-H")
    .arg("Accept: text/html")
    .arg(url)
    .output()
    .await
    .context("Failed to spawn curl process")?;
Content Validation

The curl response parsing uses simple string splitting and case-insensitive header checking which may not be robust enough for all HTTP response formats and could miss valid HTML content.

let parts: Vec<&str> = resp.splitn(2, "\r\n\r\n").collect();
if parts.len() != 2 {
    return Err(anyhow::anyhow!("Unexpected curl output"));
}
let headers = parts[0].to_lowercase();
if !headers.contains("content-type: text/html") {
    return Err(anyhow::anyhow!("Not an HTML page (curl)"));
}
Ok(parts[1].to_string())

@qodo-code-review
Copy link
Copy Markdown

CI Feedback 🧐

A test triggered by this PR failed. Here is an AI-generated analysis of the failure:

Action: Test Suite

Failed stage: Check formatting [❌]

Failure summary:

The action failed because the code formatting check (cargo fmt --check) detected formatting issues
in the Rust code. Specifically:
• In /home/runner/work/twars-url2md/twars-url2md/src/html.rs at line
254: A tracing::warn! macro call needs to be reformatted to split arguments across multiple lines

In /home/runner/work/twars-url2md/twars-url2md/src/html.rs at line 413: An anyhow::anyhow! macro
call needs to be reformatted to a single line
The process completed with exit code 1, indicating the
formatting check failed.

Relevant error logs:
1:  ##[group]Operating System
2:  Ubuntu
...

148:  RUSTFLAGS: -D warnings
149:  targets: 
150:  components: rustfmt, clippy
151:  ##[endgroup]
152:  ##[group]Run : set $CARGO_HOME
153:  �[36;1m: set $CARGO_HOME�[0m
154:  �[36;1mecho CARGO_HOME=${CARGO_HOME:-"$HOME/.cargo"} >> $GITHUB_ENV�[0m
155:  shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
156:  env:
157:  CARGO_TERM_COLOR: always
158:  RUSTFLAGS: -D warnings
159:  ##[endgroup]
160:  ##[group]Run : install rustup if needed
161:  �[36;1m: install rustup if needed�[0m
162:  �[36;1mif ! command -v rustup &>/dev/null; then�[0m
163:  �[36;1m  curl --proto '=https' --tlsv1.2 --retry 10 --retry-connrefused --location --silent --show-error --fail https://sh.rustup.rs | sh -s -- --default-toolchain none -y�[0m
164:  �[36;1m  echo "$CARGO_HOME/bin" >> $GITHUB_PATH�[0m
...

234:  �[36;1m  if rustc +stable --version --verbose | grep -q '^release: 1\.6[89]\.'; then�[0m
235:  �[36;1m    touch "/home/runner/work/_temp"/.implicit_cargo_registries_crates_io_protocol || true�[0m
236:  �[36;1m    echo CARGO_REGISTRIES_CRATES_IO_PROTOCOL=sparse >> $GITHUB_ENV�[0m
237:  �[36;1m  elif rustc +stable --version --verbose | grep -q '^release: 1\.6[67]\.'; then�[0m
238:  �[36;1m    touch "/home/runner/work/_temp"/.implicit_cargo_registries_crates_io_protocol || true�[0m
239:  �[36;1m    echo CARGO_REGISTRIES_CRATES_IO_PROTOCOL=git >> $GITHUB_ENV�[0m
240:  �[36;1m  fi�[0m
241:  �[36;1mfi�[0m
242:  shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
243:  env:
244:  CARGO_TERM_COLOR: always
245:  RUSTFLAGS: -D warnings
246:  CARGO_HOME: /home/runner/.cargo
247:  CARGO_INCREMENTAL: 0
248:  ##[endgroup]
249:  ##[group]Run : work around spurious network errors in curl 8.0
250:  �[36;1m: work around spurious network errors in curl 8.0�[0m
251:  �[36;1m# https://rust-lang.zulipchat.com/#narrow/stream/246057-t-cargo/topic/timeout.20investigation�[0m
...

322:  Cache hit for: v0-rust-test-Linux-x64-e3397364-8bd239b5
323:  Received 213909504 of 411071766 (52.0%), 203.6 MBs/sec
324:  Received 411071766 of 411071766 (100.0%), 199.9 MBs/sec
325:  Cache Size: ~392 MB (411071766 B)
326:  [command]/usr/bin/tar -xf /home/runner/work/_temp/b2c7eeb8-4caf-4c5d-b5be-e400f551fb88/cache.tzst -P -C /home/runner/work/twars-url2md/twars-url2md --use-compress-program unzstd
327:  Cache restored successfully
328:  Restored from cache key "v0-rust-test-Linux-x64-e3397364-8bd239b5" full match: true.
329:  ##[group]Run cargo fmt --check
330:  �[36;1mcargo fmt --check�[0m
331:  shell: /usr/bin/bash -e {0}
332:  env:
333:  CARGO_TERM_COLOR: always
334:  RUSTFLAGS: -D warnings
335:  CARGO_HOME: /home/runner/.cargo
336:  CARGO_INCREMENTAL: 0
337:  CACHE_ON_FAILURE: false
338:  ##[endgroup]
...

358:  return fetch_html_with_curl(url).await;
359:  }
360:  };
361:  Diff in /home/runner/work/twars-url2md/twars-url2md/src/html.rs:254:
362:  // Skip non-HTML content
363:  if !content_type.contains("text/html") {
364:  -        tracing::warn!("Unexpected content type '{}' for {} via reqwest. Falling back to curl", content_type, url);
365:  +        tracing::warn!(
366:  +            "Unexpected content type '{}' for {} via reqwest. Falling back to curl",
367:  +            content_type,
368:  +            url
369:  +        );
370:  return fetch_html_with_curl(url).await;
371:  }
372:  Diff in /home/runner/work/twars-url2md/twars-url2md/src/html.rs:413:
373:  .context("Failed to spawn curl process")?;
374:  if !output.status.success() {
375:  -        return Err(anyhow::anyhow!(
376:  -            "curl exited with status {}",
377:  -            output.status
378:  -        ));
379:  +        return Err(anyhow::anyhow!("curl exited with status {}", output.status));
380:  }
381:  let resp = String::from_utf8_lossy(&output.stdout);
382:  ##[error]Process completed with exit code 1.
383:  Post job cleanup.

Copy link
Copy Markdown

@sourcery-ai sourcery-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @twardoch - I've reviewed your changes - here's some feedback:

  • Check for the curl binary’s availability before falling back and return a clear error if it’s missing.
  • Wrap the spawned curl process in a timeout (e.g., via tokio::time::timeout) to avoid hanging indefinitely.
  • Improve HTTP response parsing in fetch_html_with_curl to handle LF-only delimiters or consider using a proper HTTP parser.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Check for the `curl` binary’s availability before falling back and return a clear error if it’s missing.
- Wrap the spawned `curl` process in a timeout (e.g., via `tokio::time::timeout`) to avoid hanging indefinitely.
- Improve HTTP response parsing in `fetch_html_with_curl` to handle LF-only delimiters or consider using a proper HTTP parser.

## Individual Comments

### Comment 1
<location> `src/html.rs:401` </location>
<code_context>
+/// platforms where reqwest/rustls has trouble negotiating TLS. The function
+/// attempts to fetch the given URL and returns the HTML body if successful.
+async fn fetch_html_with_curl(url: &str) -> Result<String> {
+    let output = Command::new("curl")
+        .arg("-L")
+        .arg("-s")
</code_context>

<issue_to_address>
No timeout is set for the curl subprocess, risking indefinite hangs.

Add a timeout to the subprocess to avoid indefinite waits if curl hangs.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment thread src/html.rs
/// platforms where reqwest/rustls has trouble negotiating TLS. The function
/// attempts to fetch the given URL and returns the HTML body if successful.
async fn fetch_html_with_curl(url: &str) -> Result<String> {
let output = Command::new("curl")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (bug_risk): No timeout is set for the curl subprocess, risking indefinite hangs.

Add a timeout to the subprocess to avoid indefinite waits if curl hangs.

@qodo-code-review
Copy link
Copy Markdown

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
General
Include stderr in error messages

Include stderr output in the error message to provide more context about why
curl failed. This will help with debugging network issues or invalid URLs.

src/html.rs [415-420]

 if !output.status.success() {
+    let stderr = String::from_utf8_lossy(&output.stderr);
     return Err(anyhow::anyhow!(
-        "curl exited with status {}",
-        output.status
+        "curl exited with status {}: {}",
+        output.status,
+        stderr
     ));
 }
  • Apply / Chat
Suggestion importance[1-10]: 8

__

Why: This is a valuable suggestion for improving debuggability. When the curl command fails, including its stderr output in the error message provides crucial context for diagnosing the root cause, such as network issues, certificate problems, or incorrect URLs.

Medium
Fix content type validation logic

The content-type validation is too strict and will reject valid HTML responses
with charset specifications or other parameters. Use a more flexible check that
accepts any HTML content type.

src/html.rs [427-430]

 let headers = parts[0].to_lowercase();
-if !headers.contains("content-type: text/html") {
+if !headers.contains("content-type:") || !headers.contains("text/html") {
     return Err(anyhow::anyhow!("Not an HTML page (curl)"));
 }
  • Apply / Chat
Suggestion importance[1-10]: 7

__

Why: The suggestion correctly points out that checking for the exact string "content-type: text/html" is too strict and would fail for valid headers like Content-Type: text/html; charset=utf-8. The improved_code provides a more flexible check, improving the robustness of the curl fallback logic.

Medium
  • More

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant