Skip to content

Conversation

dmpetrov
Copy link
Member

@dmpetrov dmpetrov commented Sep 12, 2025

Code is Claude AI generated.

It enables:

read_storage("https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-01.parquet")
read_parquet("https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-01.parquet")
(
    dc.read_values(url=[
        "d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-01.parquet",
        "d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-02.parquet",
    ])
    .map(file=lambda url: dc.File(source="https://", path=url), output=dc.File)
    .to_storage("output/", placement="filename")
)

Summary by Sourcery

Add HTTPClient to support read-only HTTP/HTTPS file operations and integrate it into the generic Client dispatch logic.

New Features:

  • Register HTTPClient for http and https protocols in Client.get_implementation
  • Implement HTTPClient using fsspec HTTPFileSystem for URL parsing, path resolution, file info conversion, and directory listing over HTTP/HTTPS

Tests:

  • Add comprehensive unit tests for HTTPClient covering protocol detection, URL splitting, URI handling, path resolution, file info mapping, directory fetching, and error handling

Summary by Sourcery

Enable read-only HTTP/HTTPS file operations by adding HTTPClient and HTTPSClient implementations using fsspec’s HTTPFileSystem, integrating them into the generic Client dispatch, and covering the functionality with unit tests.

New Features:

  • Implement HTTPClient and HTTPSClient for read-only HTTP/HTTPS access using fsspec HTTPFileSystem
  • Register HTTP and HTTPS protocols in Client.get_implementation for automatic client dispatch
  • Support URL splitting, path resolution, and metadata conversion for HTTP resources

Tests:

  • Add unit tests for protocol detection, URL splitting, root URL detection, from_name behavior, full path construction, and upload error for HTTPClient

@dmpetrov dmpetrov marked this pull request as draft September 12, 2025 01:07
Copy link
Contributor

sourcery-ai bot commented Sep 12, 2025

Reviewer's Guide

Introduces HTTP/HTTPS support by registering new HTTPClient and HTTPSClient implementations using fsspec’s HTTPFileSystem, handling URL parsing, path resolution, and read-only operations, and verifies behavior with unit tests.

File-Level Changes

Change Details Files
Register HTTP and HTTPS protocols in generic client dispatch
  • Import HTTPClient and HTTPSClient in get_implementation
  • Add protocol checks for HTTPClient and HTTPSClient
  • Raise NotImplementedError for unsupported protocols
src/datachain/client/fsspec.py
Implement HTTPClient for read-only HTTP/HTTPS access
  • Define HTTPClient with FS_CLASS, PREFIX, and protocol
  • Implement filesystem creation, URL splitting, URI construction, and path resolution
  • Map remote file info to internal File objects
  • Enforce read-only operations and override directory listing
  • Create HTTPSClient subclass overriding protocol and PREFIX
src/datachain/client/http.py
Add comprehensive unit tests for HTTPClient behavior
  • Test protocol detection and dispatch
  • Validate URL splitting, root URL checks, and full path generation
  • Verify from_name behavior for HTTP and HTTPS
  • Ensure upload raises NotImplementedError
tests/unit/test_client_http.py

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

Copy link
Contributor

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey there - I've reviewed your changes and they look great!

Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments

### Comment 1
<location> `src/datachain/client/http.py:133` </location>
<code_context>
+            # Normal case: prepend protocol and name
+            base_url = f"{self.protocol}://{self.name}"
+
+        if rel_path:
+            # Ensure single slash between base and path
+            if not base_url.endswith("/") and not rel_path.startswith("/"):
+                base_url += "/"
+            full_url = base_url + rel_path
+        else:
+            full_url = base_url
</code_context>

<issue_to_address>
URL concatenation may result in double slashes or missing slashes.

Using string concatenation for URLs can lead to incorrect formatting. Prefer urllib.parse.urljoin for reliable URL construction.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
            base_url = f"{self.protocol}://{self.name}"

        if rel_path:
            # Ensure single slash between base and path
            if not base_url.endswith("/") and not rel_path.startswith("/"):
                base_url += "/"
            full_url = base_url + rel_path
        else:
            full_url = base_url

        return full_url
=======
            base_url = f"{self.protocol}://{self.name}"

        if rel_path:
            # Use urljoin for reliable URL construction
            from urllib.parse import urljoin
            full_url = urljoin(base_url + "/", rel_path)
        else:
            full_url = base_url

        return full_url
>>>>>>> REPLACE

</suggested_fix>

### Comment 2
<location> `src/datachain/client/http.py:212` </location>
<code_context>
+        if use_cache and (cache_path := self.cache.get_path(file)):
+            return open(cache_path, mode="rb")
+        
+        assert not file.location
+        # Don't pass version to fs.open for HTTP
+        return FileWrapper(
</code_context>

<issue_to_address>
The assertion on file.location may be too strict and could cause runtime errors.

If file.location is set, this assertion will raise an exception. Consider handling this scenario more gracefully or providing a clearer error message.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
        assert not file.location
        # Don't pass version to fs.open for HTTP
        return FileWrapper(
            self.fs.open(self.get_full_path(file.get_path_normalized())),
            cb or (lambda x: None),
        )
=======
        if file.location:
            raise ValueError(
                f"Cannot open file with location set: {file.location}. "
                "HTTP/HTTPS files must not have a location specified."
            )
        # Don't pass version to fs.open for HTTP
        return FileWrapper(
            self.fs.open(self.get_full_path(file.get_path_normalized())),
            cb or (lambda x: None),
        )
>>>>>>> REPLACE

</suggested_fix>

### Comment 3
<location> `tests/unit/test_client_http.py:93` </location>
<code_context>
+    def test_get_full_path(self):
</code_context>

<issue_to_address>
Consider adding a test for malformed or invalid URLs in get_full_path.

Adding tests for malformed or invalid URLs will help verify that get_full_path handles errors correctly and fails gracefully.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment on lines 131 to 141
base_url = f"{self.protocol}://{self.name}"

if rel_path:
# Ensure single slash between base and path
if not base_url.endswith("/") and not rel_path.startswith("/"):
base_url += "/"
full_url = base_url + rel_path
else:
full_url = base_url

return full_url
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: URL concatenation may result in double slashes or missing slashes.

Using string concatenation for URLs can lead to incorrect formatting. Prefer urllib.parse.urljoin for reliable URL construction.

Suggested change
base_url = f"{self.protocol}://{self.name}"
if rel_path:
# Ensure single slash between base and path
if not base_url.endswith("/") and not rel_path.startswith("/"):
base_url += "/"
full_url = base_url + rel_path
else:
full_url = base_url
return full_url
base_url = f"{self.protocol}://{self.name}"
if rel_path:
# Use urljoin for reliable URL construction
from urllib.parse import urljoin
full_url = urljoin(base_url + "/", rel_path)
else:
full_url = base_url
return full_url

Comment on lines 212 to 217
assert not file.location
# Don't pass version to fs.open for HTTP
return FileWrapper(
self.fs.open(self.get_full_path(file.get_path_normalized())),
cb or (lambda x: None),
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (bug_risk): The assertion on file.location may be too strict and could cause runtime errors.

If file.location is set, this assertion will raise an exception. Consider handling this scenario more gracefully or providing a clearer error message.

Suggested change
assert not file.location
# Don't pass version to fs.open for HTTP
return FileWrapper(
self.fs.open(self.get_full_path(file.get_path_normalized())),
cb or (lambda x: None),
)
if file.location:
raise ValueError(
f"Cannot open file with location set: {file.location}. "
"HTTP/HTTPS files must not have a location specified."
)
# Don't pass version to fs.open for HTTP
return FileWrapper(
self.fs.open(self.get_full_path(file.get_path_normalized())),
cb or (lambda x: None),
)

Comment on lines 93 to 102
def test_get_full_path(self):
"""Test full path construction"""
cache = Mock(spec=Cache)
client = HTTPClient("example.com", {}, cache, protocol="https")

assert client.get_full_path("file.txt") == "https://example.com/file.txt"
assert client.get_full_path("path/to/file.txt") == "https://example.com/path/to/file.txt"
assert client.get_full_path("") == "https://example.com"

# Test case when path contains full domain (e.g., File with source="https://")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (testing): Consider adding a test for malformed or invalid URLs in get_full_path.

Adding tests for malformed or invalid URLs will help verify that get_full_path handles errors correctly and fails gracefully.

filename = parsed.path.split("/")[-1] if parsed.path else "file"

# Create file info for this single file
file = self.info_to_file(info, prefix if prefix else filename)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (code-quality): Replace if-expression with or (or-if-exp-identity)

Suggested change
file = self.info_to_file(info, prefix if prefix else filename)
file = self.info_to_file(info, prefix or filename)


ExplanationHere we find ourselves setting a value if it evaluates to True, and otherwise
using a default.

The 'After' case is a bit easier to read and avoids the duplication of
input_currency.

It works because the left-hand side is evaluated first. If it evaluates to
true then currency will be set to this and the right-hand side will not be
evaluated. If it evaluates to false the right-hand side will be evaluated and
currency will be set to DEFAULT_CURRENCY.

Construct full HTTP/HTTPS URL from relative path.
Note: version_id is ignored as HTTP doesn't support versioning.
"""
if version_id:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (code-quality): We've found these issues:

This uses the HTTPFileSystem's HTML parsing capabilities.
"""
full_url = self.get_full_path(prefix)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (code-quality): We've found these issues:

Copy link

cloudflare-workers-and-pages bot commented Sep 12, 2025

Deploying datachain-documentation with  Cloudflare Pages  Cloudflare Pages

Latest commit: 7fa3d28
Status: ✅  Deploy successful!
Preview URL: https://e86d3d94.datachain-documentation.pages.dev
Branch Preview URL: https://http-client.datachain-documentation.pages.dev

View logs

Copy link

codecov bot commented Sep 12, 2025

Codecov Report

❌ Patch coverage is 60.20408% with 39 lines in your changes missing coverage. Please review.
✅ Project coverage is 87.37%. Comparing base (7175f12) to head (7fa3d28).

Files with missing lines Patch % Lines
src/datachain/client/http.py 57.60% 31 Missing and 8 partials ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1331      +/-   ##
==========================================
- Coverage   87.56%   87.37%   -0.20%     
==========================================
  Files         156      157       +1     
  Lines       14493    14590      +97     
  Branches     2081     2096      +15     
==========================================
+ Hits        12691    12748      +57     
- Misses       1331     1362      +31     
- Partials      471      480       +9     
Flag Coverage Δ
datachain 87.31% <60.20%> (-0.20%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
src/datachain/client/fsspec.py 92.30% <100.00%> (-0.23%) ⬇️
src/datachain/client/http.py 57.60% <57.60%> (ø)
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@dmpetrov dmpetrov marked this pull request as ready for review September 12, 2025 08:01
Copy link
Contributor

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey there - I've reviewed your changes and they look great!

Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments

### Comment 1
<location> `src/datachain/client/http.py:26` </location>
<code_context>
+        fs.invalidate_cache()
+        return cast("HTTPFileSystem", fs)
+
+    @classmethod
+    def from_name(
+        cls,
+        name: str,
+        cache: "Cache",
+        kwargs: dict[str, Any],
+    ) -> "HTTPClient":
+        # Determine protocol from the name if it includes it
+        parsed = urlparse(name)
+
+        # Extract just the host/path part without protocol
+        if parsed.scheme:
+            name = parsed.netloc + parsed.path
+
+        return cls(name, kwargs, cache)
+
+    @classmethod
</code_context>

<issue_to_address>
Parsing logic in from_name may lead to unexpected results for URLs with query/fragment.

Currently, query and fragment components are omitted when constructing the name. Please confirm if this is intentional, or update the logic to include these components if needed.
</issue_to_address>

### Comment 2
<location> `src/datachain/client/http.py:93` </location>
<code_context>
+    def get_full_path(self, rel_path: str, version_id: Optional[str] = None) -> str:
</code_context>

<issue_to_address>
Heuristic for domain detection in get_full_path may misclassify some relative paths.

The current logic may incorrectly treat valid file names containing a dot as domains. Please refine the detection or clarify the expected input format.
</issue_to_address>

### Comment 3
<location> `src/datachain/client/http.py:133` </location>
<code_context>
+    def info_to_file(self, v: dict[str, Any], path: str) -> File:
</code_context>

<issue_to_address>
Fallback to current time for last_modified may mask upstream issues.

Consider logging a warning or surfacing the error when last_modified parsing fails, rather than silently defaulting to the current time.

Suggested implementation:

```python
    def info_to_file(self, v: dict[str, Any], path: str) -> File:
        """
        Convert HTTP file info to DataChain File object.
        """
        import logging
        logger = logging.getLogger(__name__)

        # Extract ETag if available (remove quotes if present)
        etag = v.get("ETag", "").strip('"')

        # Parse last modified time
        last_modified = v.get("last_modified")
        if last_modified:
            if isinstance(last_modified, str):

```

```python
            if isinstance(last_modified, str):
                try:
                    # Try parsing RFC 1123 format
                    from email.utils import parsedate_to_datetime
                    last_modified_dt = parsedate_to_datetime(last_modified)
                    last_modified = last_modified_dt.timestamp()
                except Exception as e:
                    logger.warning(
                        "Failed to parse last_modified '%s' for path '%s': %s. Falling back to current time.",
                        last_modified, path, e
                    )
                    import time
                    last_modified = time.time()

```
</issue_to_address>

### Comment 4
<location> `src/datachain/client/http.py:209` </location>
<code_context>
+        # Don't pass version_id to HTTP filesystem
+        return await self.fs._get_file(lpath, rpath, callback=callback)
+
+    async def _fetch_dir(self, prefix: str, pbar, result_queue) -> set[str]:
+        """
+        Override to reject directory listing for HTTP/HTTPS.
+        HTTP doesn't support directory listing in a standard way.
+        """
+        full_url = self.get_full_path(prefix)
+
+        # HTTP/HTTPS doesn't support directory listing - always raise error
+        raise NotImplementedError(f"Cannot download file from {full_url}")
+
+
</code_context>

<issue_to_address>
Error message in _fetch_dir may be misleading about operation type.

Update the error message to specify that directory listing is not supported for HTTP/HTTPS, rather than referring to file download.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
        # HTTP/HTTPS doesn't support directory listing - always raise error
        raise NotImplementedError(f"Cannot download file from {full_url}")
=======
        # HTTP/HTTPS doesn't support directory listing - always raise error
        raise NotImplementedError(f"Directory listing is not supported for HTTP/HTTPS at {full_url}")
>>>>>>> REPLACE

</suggested_fix>

### Comment 5
<location> `tests/unit/test_client_http.py:18` </location>
<code_context>
+    assert client_class == HTTPClient
+
+
+def test_split_url_https():
+    """Test URL splitting for HTTPS URLs"""
+    domain, path = HTTPClient.split_url("https://example.com/path/to/file.txt")
+    assert domain == "example.com"
+    assert path == "path/to/file.txt"
+
+
</code_context>

<issue_to_address>
Missing test for split_url with query and fragment components.

Add tests for URLs containing query parameters and fragments to verify correct handling by split_url.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
def test_split_url_https():
    """Test URL splitting for HTTPS URLs"""
    domain, path = HTTPClient.split_url("https://example.com/path/to/file.txt")
    assert domain == "example.com"
    assert path == "path/to/file.txt"
=======
def test_split_url_https():
    """Test URL splitting for HTTPS URLs"""
    domain, path = HTTPClient.split_url("https://example.com/path/to/file.txt")
    assert domain == "example.com"
    assert path == "path/to/file.txt"

def test_split_url_with_query():
    """Test URL splitting for URLs with query parameters"""
    domain, path = HTTPClient.split_url("https://example.com/path/to/file.txt?foo=bar&baz=qux")
    assert domain == "example.com"
    assert path == "path/to/file.txt"

def test_split_url_with_fragment():
    """Test URL splitting for URLs with fragment components"""
    domain, path = HTTPClient.split_url("https://example.com/path/to/file.txt#section1")
    assert domain == "example.com"
    assert path == "path/to/file.txt"

def test_split_url_with_query_and_fragment():
    """Test URL splitting for URLs with both query and fragment components"""
    domain, path = HTTPClient.split_url("https://example.com/path/to/file.txt?foo=bar#section1")
    assert domain == "example.com"
    assert path == "path/to/file.txt"
>>>>>>> REPLACE

</suggested_fix>

### Comment 6
<location> `tests/unit/test_client_http.py:25` </location>
<code_context>
+    assert path == "path/to/file.txt"
+
+
+def test_is_root_url():
+    """Test root URL detection"""
+    assert HTTPClient.is_root_url("https://example.com")
+    assert HTTPClient.is_root_url("https://example.com/")
+    assert not HTTPClient.is_root_url("https://example.com/path")
+    assert not HTTPClient.is_root_url("https://example.com?query=1")
+    assert not HTTPClient.is_root_url("https://example.com#fragment")
+
+
</code_context>

<issue_to_address>
Missing test for is_root_url with HTTP (not HTTPS) URLs.

Please add tests for 'http://' URLs to ensure is_root_url works correctly for both HTTP and HTTPS.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
def test_is_root_url():
    """Test root URL detection"""
    assert HTTPClient.is_root_url("https://example.com")
    assert HTTPClient.is_root_url("https://example.com/")
    assert not HTTPClient.is_root_url("https://example.com/path")
    assert not HTTPClient.is_root_url("https://example.com?query=1")
    assert not HTTPClient.is_root_url("https://example.com#fragment")
=======
def test_is_root_url():
    """Test root URL detection"""
    # HTTPS URLs
    assert HTTPClient.is_root_url("https://example.com")
    assert HTTPClient.is_root_url("https://example.com/")
    assert not HTTPClient.is_root_url("https://example.com/path")
    assert not HTTPClient.is_root_url("https://example.com?query=1")
    assert not HTTPClient.is_root_url("https://example.com#fragment")
    # HTTP URLs
    assert HTTPClient.is_root_url("http://example.com")
    assert HTTPClient.is_root_url("http://example.com/")
    assert not HTTPClient.is_root_url("http://example.com/path")
    assert not HTTPClient.is_root_url("http://example.com?query=1")
    assert not HTTPClient.is_root_url("http://example.com#fragment")
>>>>>>> REPLACE

</suggested_fix>

### Comment 7
<location> `tests/unit/test_client_http.py:34` </location>
<code_context>
+    assert not HTTPClient.is_root_url("https://example.com#fragment")
+
+
+def test_from_name_with_https():
+    """Test creating client from HTTPS URL"""
+    cache = Mock(spec=Cache)
+    client = HTTPSClient.from_name("https://example.com/path", cache, {})
+    assert client.protocol == "https"
+    assert client.name == "example.com/path"
+    assert client.PREFIX == "https://"
+
+
</code_context>

<issue_to_address>
Missing test for from_name with URLs lacking protocol.

Add tests for from_name methods using URLs without a protocol (e.g., 'example.com/path') to confirm default protocol assignment.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
def test_from_name_with_https():
    """Test creating client from HTTPS URL"""
    cache = Mock(spec=Cache)
    client = HTTPSClient.from_name("https://example.com/path", cache, {})
    assert client.protocol == "https"
    assert client.name == "example.com/path"
    assert client.PREFIX == "https://"
=======
def test_from_name_with_https():
    """Test creating client from HTTPS URL"""
    cache = Mock(spec=Cache)
    client = HTTPSClient.from_name("https://example.com/path", cache, {})
    assert client.protocol == "https"
    assert client.name == "example.com/path"
    assert client.PREFIX == "https://"

def test_from_name_without_protocol():
    """Test creating client from URL without protocol"""
    cache = Mock(spec=Cache)
    client = HTTPSClient.from_name("example.com/path", cache, {})
    assert client.protocol == "https"
    assert client.name == "example.com/path"
    assert client.PREFIX == "https://"
>>>>>>> REPLACE

</suggested_fix>

### Comment 8
<location> `tests/unit/test_client_http.py:52` </location>
<code_context>
+    assert client.PREFIX == "http://"
+
+
+def test_get_full_path_http():
+    """Test full path construction for HTTP"""
+    cache = Mock(spec=Cache)
+    client = HTTPClient("example.com:8080", {}, cache)
+
+    assert client.get_full_path("file.txt") == "http://example.com:8080/file.txt"
+
+
</code_context>

<issue_to_address>
Missing test for get_full_path with empty rel_path and with rel_path containing slashes.

Please add tests for get_full_path with an empty rel_path and with rel_path values that start with or contain multiple slashes to verify correct URL handling.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
def test_get_full_path_http():
    """Test full path construction for HTTP"""
    cache = Mock(spec=Cache)
    client = HTTPClient("example.com:8080", {}, cache)

    assert client.get_full_path("file.txt") == "http://example.com:8080/file.txt"
=======
def test_get_full_path_http():
    """Test full path construction for HTTP"""
    cache = Mock(spec=Cache)
    client = HTTPClient("example.com:8080", {}, cache)

    # Basic file path
    assert client.get_full_path("file.txt") == "http://example.com:8080/file.txt"

    # Empty rel_path
    assert client.get_full_path("") == "http://example.com:8080/"

    # rel_path starts with a slash
    assert client.get_full_path("/file.txt") == "http://example.com:8080/file.txt"

    # rel_path contains multiple slashes
    assert client.get_full_path("dir//file.txt") == "http://example.com:8080/dir//file.txt"
    assert client.get_full_path("/dir//file.txt") == "http://example.com:8080/dir//file.txt"
>>>>>>> REPLACE

</suggested_fix>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment on lines 26 to 35
@classmethod
def create_fs(cls, **kwargs) -> HTTPFileSystem:
# Configure HTTPFileSystem options
kwargs.setdefault("simple_links", True)
kwargs.setdefault("same_scheme", True)
kwargs.setdefault("cache_type", "bytes")

# HTTPFileSystem doesn't support version_aware, remove it if present
kwargs.pop("version_aware", None)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: Parsing logic in from_name may lead to unexpected results for URLs with query/fragment.

Currently, query and fragment components are omitted when constructing the name. Please confirm if this is intentional, or update the logic to include these components if needed.

Comment on lines 93 to 102
def get_full_path(self, rel_path: str, version_id: Optional[str] = None) -> str:
"""
Construct full HTTP/HTTPS URL from relative path.
Note: version_id is ignored as HTTP doesn't support versioning.
"""
if version_id:
# HTTP doesn't support versioning, ignore it silently
pass

if self.name.startswith(("http://", "https://")):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: Heuristic for domain detection in get_full_path may misclassify some relative paths.

The current logic may incorrectly treat valid file names containing a dot as domains. Please refine the detection or clarify the expected input format.

Comment on lines 133 to 142
def info_to_file(self, v: dict[str, Any], path: str) -> File:
"""
Convert HTTP file info to DataChain File object.
"""
# Extract ETag if available (remove quotes if present)
etag = v.get("ETag", "").strip('"')

# Parse last modified time
last_modified = v.get("last_modified")
if last_modified:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (bug_risk): Fallback to current time for last_modified may mask upstream issues.

Consider logging a warning or surfacing the error when last_modified parsing fails, rather than silently defaulting to the current time.

Suggested implementation:

    def info_to_file(self, v: dict[str, Any], path: str) -> File:
        """
        Convert HTTP file info to DataChain File object.
        """
        import logging
        logger = logging.getLogger(__name__)

        # Extract ETag if available (remove quotes if present)
        etag = v.get("ETag", "").strip('"')

        # Parse last modified time
        last_modified = v.get("last_modified")
        if last_modified:
            if isinstance(last_modified, str):
            if isinstance(last_modified, str):
                try:
                    # Try parsing RFC 1123 format
                    from email.utils import parsedate_to_datetime
                    last_modified_dt = parsedate_to_datetime(last_modified)
                    last_modified = last_modified_dt.timestamp()
                except Exception as e:
                    logger.warning(
                        "Failed to parse last_modified '%s' for path '%s': %s. Falling back to current time.",
                        last_modified, path, e
                    )
                    import time
                    last_modified = time.time()

Comment on lines 216 to 217
# HTTP/HTTPS doesn't support directory listing - always raise error
raise NotImplementedError(f"Cannot download file from {full_url}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: Error message in _fetch_dir may be misleading about operation type.

Update the error message to specify that directory listing is not supported for HTTP/HTTPS, rather than referring to file download.

Suggested change
# HTTP/HTTPS doesn't support directory listing - always raise error
raise NotImplementedError(f"Cannot download file from {full_url}")
# HTTP/HTTPS doesn't support directory listing - always raise error
raise NotImplementedError(f"Directory listing is not supported for HTTP/HTTPS at {full_url}")

Comment on lines 18 to 22
def test_split_url_https():
"""Test URL splitting for HTTPS URLs"""
domain, path = HTTPClient.split_url("https://example.com/path/to/file.txt")
assert domain == "example.com"
assert path == "path/to/file.txt"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (testing): Missing test for split_url with query and fragment components.

Add tests for URLs containing query parameters and fragments to verify correct handling by split_url.

Suggested change
def test_split_url_https():
"""Test URL splitting for HTTPS URLs"""
domain, path = HTTPClient.split_url("https://example.com/path/to/file.txt")
assert domain == "example.com"
assert path == "path/to/file.txt"
def test_split_url_https():
"""Test URL splitting for HTTPS URLs"""
domain, path = HTTPClient.split_url("https://example.com/path/to/file.txt")
assert domain == "example.com"
assert path == "path/to/file.txt"
def test_split_url_with_query():
"""Test URL splitting for URLs with query parameters"""
domain, path = HTTPClient.split_url("https://example.com/path/to/file.txt?foo=bar&baz=qux")
assert domain == "example.com"
assert path == "path/to/file.txt"
def test_split_url_with_fragment():
"""Test URL splitting for URLs with fragment components"""
domain, path = HTTPClient.split_url("https://example.com/path/to/file.txt#section1")
assert domain == "example.com"
assert path == "path/to/file.txt"
def test_split_url_with_query_and_fragment():
"""Test URL splitting for URLs with both query and fragment components"""
domain, path = HTTPClient.split_url("https://example.com/path/to/file.txt?foo=bar#section1")
assert domain == "example.com"
assert path == "path/to/file.txt"

Comment on lines 25 to 31
def test_is_root_url():
"""Test root URL detection"""
assert HTTPClient.is_root_url("https://example.com")
assert HTTPClient.is_root_url("https://example.com/")
assert not HTTPClient.is_root_url("https://example.com/path")
assert not HTTPClient.is_root_url("https://example.com?query=1")
assert not HTTPClient.is_root_url("https://example.com#fragment")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (testing): Missing test for is_root_url with HTTP (not HTTPS) URLs.

Please add tests for 'http://' URLs to ensure is_root_url works correctly for both HTTP and HTTPS.

Suggested change
def test_is_root_url():
"""Test root URL detection"""
assert HTTPClient.is_root_url("https://example.com")
assert HTTPClient.is_root_url("https://example.com/")
assert not HTTPClient.is_root_url("https://example.com/path")
assert not HTTPClient.is_root_url("https://example.com?query=1")
assert not HTTPClient.is_root_url("https://example.com#fragment")
def test_is_root_url():
"""Test root URL detection"""
# HTTPS URLs
assert HTTPClient.is_root_url("https://example.com")
assert HTTPClient.is_root_url("https://example.com/")
assert not HTTPClient.is_root_url("https://example.com/path")
assert not HTTPClient.is_root_url("https://example.com?query=1")
assert not HTTPClient.is_root_url("https://example.com#fragment")
# HTTP URLs
assert HTTPClient.is_root_url("http://example.com")
assert HTTPClient.is_root_url("http://example.com/")
assert not HTTPClient.is_root_url("http://example.com/path")
assert not HTTPClient.is_root_url("http://example.com?query=1")
assert not HTTPClient.is_root_url("http://example.com#fragment")

Comment on lines 34 to 40
def test_from_name_with_https():
"""Test creating client from HTTPS URL"""
cache = Mock(spec=Cache)
client = HTTPSClient.from_name("https://example.com/path", cache, {})
assert client.protocol == "https"
assert client.name == "example.com/path"
assert client.PREFIX == "https://"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (testing): Missing test for from_name with URLs lacking protocol.

Add tests for from_name methods using URLs without a protocol (e.g., 'example.com/path') to confirm default protocol assignment.

Suggested change
def test_from_name_with_https():
"""Test creating client from HTTPS URL"""
cache = Mock(spec=Cache)
client = HTTPSClient.from_name("https://example.com/path", cache, {})
assert client.protocol == "https"
assert client.name == "example.com/path"
assert client.PREFIX == "https://"
def test_from_name_with_https():
"""Test creating client from HTTPS URL"""
cache = Mock(spec=Cache)
client = HTTPSClient.from_name("https://example.com/path", cache, {})
assert client.protocol == "https"
assert client.name == "example.com/path"
assert client.PREFIX == "https://"
def test_from_name_without_protocol():
"""Test creating client from URL without protocol"""
cache = Mock(spec=Cache)
client = HTTPSClient.from_name("example.com/path", cache, {})
assert client.protocol == "https"
assert client.name == "example.com/path"
assert client.PREFIX == "https://"

Comment on lines 52 to 57
def test_get_full_path_http():
"""Test full path construction for HTTP"""
cache = Mock(spec=Cache)
client = HTTPClient("example.com:8080", {}, cache)

assert client.get_full_path("file.txt") == "http://example.com:8080/file.txt"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (testing): Missing test for get_full_path with empty rel_path and with rel_path containing slashes.

Please add tests for get_full_path with an empty rel_path and with rel_path values that start with or contain multiple slashes to verify correct URL handling.

Suggested change
def test_get_full_path_http():
"""Test full path construction for HTTP"""
cache = Mock(spec=Cache)
client = HTTPClient("example.com:8080", {}, cache)
assert client.get_full_path("file.txt") == "http://example.com:8080/file.txt"
def test_get_full_path_http():
"""Test full path construction for HTTP"""
cache = Mock(spec=Cache)
client = HTTPClient("example.com:8080", {}, cache)
# Basic file path
assert client.get_full_path("file.txt") == "http://example.com:8080/file.txt"
# Empty rel_path
assert client.get_full_path("") == "http://example.com:8080/"
# rel_path starts with a slash
assert client.get_full_path("/file.txt") == "http://example.com:8080/file.txt"
# rel_path contains multiple slashes
assert client.get_full_path("dir//file.txt") == "http://example.com:8080/dir//file.txt"
assert client.get_full_path("/dir//file.txt") == "http://example.com:8080/dir//file.txt"

Construct full HTTP/HTTPS URL from relative path.
Note: version_id is ignored as HTTP doesn't support versioning.
"""
if version_id:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (code-quality): We've found these issues:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant