Http client #1331

dmpetrov · 2025-09-12T01:07:07Z

Code is Claude AI generated.

It enables:

read_storage("https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-01.parquet")

read_parquet("https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-01.parquet")

(
    dc.read_values(url=[
        "d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-01.parquet",
        "d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-02.parquet",
    ])
    .map(file=lambda url: dc.File(source="https://", path=url), output=dc.File)
    .to_storage("output/", placement="filename")
)

Summary by Sourcery

Add HTTPClient to support read-only HTTP/HTTPS file operations and integrate it into the generic Client dispatch logic.

New Features:

Register HTTPClient for http and https protocols in Client.get_implementation
Implement HTTPClient using fsspec HTTPFileSystem for URL parsing, path resolution, file info conversion, and directory listing over HTTP/HTTPS

Tests:

Add comprehensive unit tests for HTTPClient covering protocol detection, URL splitting, URI handling, path resolution, file info mapping, directory fetching, and error handling

Summary by Sourcery

Enable read-only HTTP/HTTPS file operations by adding HTTPClient and HTTPSClient implementations using fsspec’s HTTPFileSystem, integrating them into the generic Client dispatch, and covering the functionality with unit tests.

New Features:

Implement HTTPClient and HTTPSClient for read-only HTTP/HTTPS access using fsspec HTTPFileSystem
Register HTTP and HTTPS protocols in Client.get_implementation for automatic client dispatch
Support URL splitting, path resolution, and metadata conversion for HTTP resources

Tests:

Add unit tests for protocol detection, URL splitting, root URL detection, from_name behavior, full path construction, and upload error for HTTPClient

sourcery-ai · 2025-09-12T01:07:13Z

Reviewer's Guide

Introduces HTTP/HTTPS support by registering new HTTPClient and HTTPSClient implementations using fsspec’s HTTPFileSystem, handling URL parsing, path resolution, and read-only operations, and verifies behavior with unit tests.

File-Level Changes

Change	Details	Files
Register HTTP and HTTPS protocols in generic client dispatch	Import HTTPClient and HTTPSClient in get_implementation Add protocol checks for HTTPClient and HTTPSClient Raise NotImplementedError for unsupported protocols	`src/datachain/client/fsspec.py`
Implement HTTPClient for read-only HTTP/HTTPS access	Define HTTPClient with FS_CLASS, PREFIX, and protocol Implement filesystem creation, URL splitting, URI construction, and path resolution Map remote file info to internal File objects Enforce read-only operations and override directory listing Create HTTPSClient subclass overriding protocol and PREFIX	`src/datachain/client/http.py`
Add comprehensive unit tests for HTTPClient behavior	Test protocol detection and dispatch Validate URL splitting, root URL checks, and full path generation Verify from_name behavior for HTTP and HTTPS Ensure upload raises NotImplementedError	`tests/unit/test_client_http.py`

Tips and commands

Interacting with Sourcery

Trigger a new review: Comment @sourcery-ai review on the pull request.
Continue discussions: Reply directly to Sourcery's review comments.
Generate a GitHub issue from a review comment: Ask Sourcery to create an
issue from a review comment by replying to it. You can also reply to a
review comment with @sourcery-ai issue to create an issue from it.
Generate a pull request title: Write @sourcery-ai anywhere in the pull
request title to generate a title at any time. You can also comment
@sourcery-ai title on the pull request to (re-)generate the title at any time.
Generate a pull request summary: Write @sourcery-ai summary anywhere in
the pull request body to generate a PR summary at any time exactly where you
want it. You can also comment @sourcery-ai summary on the pull request to
(re-)generate the summary at any time.
Generate reviewer's guide: Comment @sourcery-ai guide on the pull
request to (re-)generate the reviewer's guide at any time.
Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
pull request to resolve all Sourcery comments. Useful if you've already
addressed all the comments and don't want to see them anymore.
Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
request to dismiss all existing Sourcery reviews. Especially useful if you
want to start fresh with a new review - don't forget to comment
@sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

Enable or disable review features such as the Sourcery-generated pull request
summary, the reviewer's guide, and others.
Change the review language.
Add, remove or edit custom review instructions.
Adjust other review settings.

Getting Help

Contact our support team for questions or feedback.
Visit our documentation for detailed guides and information.
Keep in touch with the Sourcery team by following us on X/Twitter, LinkedIn or GitHub.

sourcery-ai

Hey there - I've reviewed your changes and they look great!

Prompt for AI Agents

Please address the comments from this code review:
## Individual Comments

### Comment 1
<location> `src/datachain/client/http.py:133` </location>
<code_context>
+            # Normal case: prepend protocol and name
+            base_url = f"{self.protocol}://{self.name}"
+
+        if rel_path:
+            # Ensure single slash between base and path
+            if not base_url.endswith("/") and not rel_path.startswith("/"):
+                base_url += "/"
+            full_url = base_url + rel_path
+        else:
+            full_url = base_url
</code_context>

<issue_to_address>
URL concatenation may result in double slashes or missing slashes.

Using string concatenation for URLs can lead to incorrect formatting. Prefer urllib.parse.urljoin for reliable URL construction.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
            base_url = f"{self.protocol}://{self.name}"

        if rel_path:
            # Ensure single slash between base and path
            if not base_url.endswith("/") and not rel_path.startswith("/"):
                base_url += "/"
            full_url = base_url + rel_path
        else:
            full_url = base_url

        return full_url
=======
            base_url = f"{self.protocol}://{self.name}"

        if rel_path:
            # Use urljoin for reliable URL construction
            from urllib.parse import urljoin
            full_url = urljoin(base_url + "/", rel_path)
        else:
            full_url = base_url

        return full_url
>>>>>>> REPLACE

</suggested_fix>

### Comment 2
<location> `src/datachain/client/http.py:212` </location>
<code_context>
+        if use_cache and (cache_path := self.cache.get_path(file)):
+            return open(cache_path, mode="rb")
+        
+        assert not file.location
+        # Don't pass version to fs.open for HTTP
+        return FileWrapper(
</code_context>

<issue_to_address>
The assertion on file.location may be too strict and could cause runtime errors.

If file.location is set, this assertion will raise an exception. Consider handling this scenario more gracefully or providing a clearer error message.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
        assert not file.location
        # Don't pass version to fs.open for HTTP
        return FileWrapper(
            self.fs.open(self.get_full_path(file.get_path_normalized())),
            cb or (lambda x: None),
        )
=======
        if file.location:
            raise ValueError(
                f"Cannot open file with location set: {file.location}. "
                "HTTP/HTTPS files must not have a location specified."
            )
        # Don't pass version to fs.open for HTTP
        return FileWrapper(
            self.fs.open(self.get_full_path(file.get_path_normalized())),
            cb or (lambda x: None),
        )
>>>>>>> REPLACE

</suggested_fix>

### Comment 3
<location> `tests/unit/test_client_http.py:93` </location>
<code_context>
+    def test_get_full_path(self):
</code_context>

<issue_to_address>
Consider adding a test for malformed or invalid URLs in get_full_path.

Adding tests for malformed or invalid URLs will help verify that get_full_path handles errors correctly and fails gracefully.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨

_{Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.}

sourcery-ai · 2025-09-12T01:08:32Z

src/datachain/client/http.py

+            base_url = f"{self.protocol}://{self.name}"
+
+        if rel_path:
+            # Ensure single slash between base and path
+            if not base_url.endswith("/") and not rel_path.startswith("/"):
+                base_url += "/"
+            full_url = base_url + rel_path
+        else:
+            full_url = base_url
+
+        return full_url


suggestion: URL concatenation may result in double slashes or missing slashes.

Using string concatenation for URLs can lead to incorrect formatting. Prefer urllib.parse.urljoin for reliable URL construction.

Suggested change

base_url = f"{self.protocol}://{self.name}"

if rel_path:

# Ensure single slash between base and path

if not base_url.endswith("/") and not rel_path.startswith("/"):

base_url += "/"

full_url = base_url + rel_path

else:

full_url = base_url

return full_url

base_url = f"{self.protocol}://{self.name}"

if rel_path:

# Use urljoin for reliable URL construction

from urllib.parse import urljoin

full_url = urljoin(base_url + "/", rel_path)

else:

full_url = base_url

return full_url

sourcery-ai · 2025-09-12T01:08:32Z

src/datachain/client/http.py

+        assert not file.location
+        # Don't pass version to fs.open for HTTP
+        return FileWrapper(
+            self.fs.open(self.get_full_path(file.get_path_normalized())),
+            cb or (lambda x: None),
+        )


suggestion (bug_risk): The assertion on file.location may be too strict and could cause runtime errors.

If file.location is set, this assertion will raise an exception. Consider handling this scenario more gracefully or providing a clearer error message.

Suggested change

assert not file.location

# Don't pass version to fs.open for HTTP

return FileWrapper(

self.fs.open(self.get_full_path(file.get_path_normalized())),

cb or (lambda x: None),

)

if file.location:

raise ValueError(

f"Cannot open file with location set: {file.location}. "

"HTTP/HTTPS files must not have a location specified."

)

# Don't pass version to fs.open for HTTP

return FileWrapper(

self.fs.open(self.get_full_path(file.get_path_normalized())),

cb or (lambda x: None),

)

sourcery-ai · 2025-09-12T01:08:33Z

tests/unit/test_client_http.py

+    def test_get_full_path(self):
+        """Test full path construction"""
+        cache = Mock(spec=Cache)
+        client = HTTPClient("example.com", {}, cache, protocol="https")
+
+        assert client.get_full_path("file.txt") == "https://example.com/file.txt"
+        assert client.get_full_path("path/to/file.txt") == "https://example.com/path/to/file.txt"
+        assert client.get_full_path("") == "https://example.com"
+
+        # Test case when path contains full domain (e.g., File with source="https://")


suggestion (testing): Consider adding a test for malformed or invalid URLs in get_full_path.

Adding tests for malformed or invalid URLs will help verify that get_full_path handles errors correctly and fails gracefully.

sourcery-ai · 2025-09-12T01:08:33Z

src/datachain/client/http.py

+                    filename = parsed.path.split("/")[-1] if parsed.path else "file"
+
+                    # Create file info for this single file
+                    file = self.info_to_file(info, prefix if prefix else filename)


suggestion (code-quality): Replace if-expression with or (or-if-exp-identity)

Suggested change

file = self.info_to_file(info, prefix if prefix else filename)

file = self.info_to_file(info, prefix or filename)

Explanation
Here we find ourselves setting a value if it evaluates to True, and otherwise
using a default.

The 'After' case is a bit easier to read and avoids the duplication of
input_currency.

It works because the left-hand side is evaluated first. If it evaluates to
true then currency will be set to this and the right-hand side will not be
evaluated. If it evaluates to false the right-hand side will be evaluated and
currency will be set to DEFAULT_CURRENCY.

sourcery-ai · 2025-09-12T01:08:33Z

src/datachain/client/http.py

+        Construct full HTTP/HTTPS URL from relative path.
+        Note: version_id is ignored as HTTP doesn't support versioning.
+        """
+        if version_id:


issue (code-quality): We've found these issues:

Remove redundant conditional (remove-redundant-if)

Lift return into if (lift-return-into-if)

sourcery-ai · 2025-09-12T01:08:33Z

src/datachain/client/http.py

+        This uses the HTTPFileSystem's HTML parsing capabilities.
+        """
+        full_url = self.get_full_path(prefix)
+


issue (code-quality): We've found these issues:

Remove redundant exceptions from an except clause [×2] (remove-redundant-exception)

Replace length-one exception tuple with exception [×2] (simplify-single-exception-tuple)

for more information, see https://pre-commit.ci

cloudflare-workers-and-pages · 2025-09-12T01:11:25Z

Deploying datachain-documentation with Cloudflare Pages

Latest commit:	`7fa3d28`
Status:	✅ Deploy successful!
Preview URL:	https://e86d3d94.datachain-documentation.pages.dev
Branch Preview URL:	https://http-client.datachain-documentation.pages.dev

View logs

codecov · 2025-09-12T06:08:49Z

Codecov Report

❌ Patch coverage is 60.20408% with 39 lines in your changes missing coverage. Please review.
✅ Project coverage is 87.37%. Comparing base (7175f12) to head (7fa3d28).

Files with missing lines	Patch %	Lines
src/datachain/client/http.py	57.60%	31 Missing and 8 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1331      +/-   ##
==========================================
- Coverage   87.56%   87.37%   -0.20%     
==========================================
  Files         156      157       +1     
  Lines       14493    14590      +97     
  Branches     2081     2096      +15     
==========================================
+ Hits        12691    12748      +57     
- Misses       1331     1362      +31     
- Partials      471      480       +9

Flag	Coverage Δ
datachain	`87.31% <60.20%> (-0.20%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
src/datachain/client/fsspec.py	`92.30% <100.00%> (-0.23%)`	⬇️
src/datachain/client/http.py	`57.60% <57.60%> (ø)`

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

for more information, see https://pre-commit.ci

sourcery-ai

Hey there - I've reviewed your changes and they look great!

Prompt for AI Agents

Please address the comments from this code review:
## Individual Comments

### Comment 1
<location> `src/datachain/client/http.py:26` </location>
<code_context>
+        fs.invalidate_cache()
+        return cast("HTTPFileSystem", fs)
+
+    @classmethod
+    def from_name(
+        cls,
+        name: str,
+        cache: "Cache",
+        kwargs: dict[str, Any],
+    ) -> "HTTPClient":
+        # Determine protocol from the name if it includes it
+        parsed = urlparse(name)
+
+        # Extract just the host/path part without protocol
+        if parsed.scheme:
+            name = parsed.netloc + parsed.path
+
+        return cls(name, kwargs, cache)
+
+    @classmethod
</code_context>

<issue_to_address>
Parsing logic in from_name may lead to unexpected results for URLs with query/fragment.

Currently, query and fragment components are omitted when constructing the name. Please confirm if this is intentional, or update the logic to include these components if needed.
</issue_to_address>

### Comment 2
<location> `src/datachain/client/http.py:93` </location>
<code_context>
+    def get_full_path(self, rel_path: str, version_id: Optional[str] = None) -> str:
</code_context>

<issue_to_address>
Heuristic for domain detection in get_full_path may misclassify some relative paths.

The current logic may incorrectly treat valid file names containing a dot as domains. Please refine the detection or clarify the expected input format.
</issue_to_address>

### Comment 3
<location> `src/datachain/client/http.py:133` </location>
<code_context>
+    def info_to_file(self, v: dict[str, Any], path: str) -> File:
</code_context>

<issue_to_address>
Fallback to current time for last_modified may mask upstream issues.

Consider logging a warning or surfacing the error when last_modified parsing fails, rather than silently defaulting to the current time.

Suggested implementation:

```python
    def info_to_file(self, v: dict[str, Any], path: str) -> File:
        """
        Convert HTTP file info to DataChain File object.
        """
        import logging
        logger = logging.getLogger(__name__)

        # Extract ETag if available (remove quotes if present)
        etag = v.get("ETag", "").strip('"')

        # Parse last modified time
        last_modified = v.get("last_modified")
        if last_modified:
            if isinstance(last_modified, str):

```

```python
            if isinstance(last_modified, str):
                try:
                    # Try parsing RFC 1123 format
                    from email.utils import parsedate_to_datetime
                    last_modified_dt = parsedate_to_datetime(last_modified)
                    last_modified = last_modified_dt.timestamp()
                except Exception as e:
                    logger.warning(
                        "Failed to parse last_modified '%s' for path '%s': %s. Falling back to current time.",
                        last_modified, path, e
                    )
                    import time
                    last_modified = time.time()

```
</issue_to_address>

### Comment 4
<location> `src/datachain/client/http.py:209` </location>
<code_context>
+        # Don't pass version_id to HTTP filesystem
+        return await self.fs._get_file(lpath, rpath, callback=callback)
+
+    async def _fetch_dir(self, prefix: str, pbar, result_queue) -> set[str]:
+        """
+        Override to reject directory listing for HTTP/HTTPS.
+        HTTP doesn't support directory listing in a standard way.
+        """
+        full_url = self.get_full_path(prefix)
+
+        # HTTP/HTTPS doesn't support directory listing - always raise error
+        raise NotImplementedError(f"Cannot download file from {full_url}")
+
+
</code_context>

<issue_to_address>
Error message in _fetch_dir may be misleading about operation type.

Update the error message to specify that directory listing is not supported for HTTP/HTTPS, rather than referring to file download.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
        # HTTP/HTTPS doesn't support directory listing - always raise error
        raise NotImplementedError(f"Cannot download file from {full_url}")
=======
        # HTTP/HTTPS doesn't support directory listing - always raise error
        raise NotImplementedError(f"Directory listing is not supported for HTTP/HTTPS at {full_url}")
>>>>>>> REPLACE

</suggested_fix>

### Comment 5
<location> `tests/unit/test_client_http.py:18` </location>
<code_context>
+    assert client_class == HTTPClient
+
+
+def test_split_url_https():
+    """Test URL splitting for HTTPS URLs"""
+    domain, path = HTTPClient.split_url("https://example.com/path/to/file.txt")
+    assert domain == "example.com"
+    assert path == "path/to/file.txt"
+
+
</code_context>

<issue_to_address>
Missing test for split_url with query and fragment components.

Add tests for URLs containing query parameters and fragments to verify correct handling by split_url.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
def test_split_url_https():
    """Test URL splitting for HTTPS URLs"""
    domain, path = HTTPClient.split_url("https://example.com/path/to/file.txt")
    assert domain == "example.com"
    assert path == "path/to/file.txt"
=======
def test_split_url_https():
    """Test URL splitting for HTTPS URLs"""
    domain, path = HTTPClient.split_url("https://example.com/path/to/file.txt")
    assert domain == "example.com"
    assert path == "path/to/file.txt"

def test_split_url_with_query():
    """Test URL splitting for URLs with query parameters"""
    domain, path = HTTPClient.split_url("https://example.com/path/to/file.txt?foo=bar&baz=qux")
    assert domain == "example.com"
    assert path == "path/to/file.txt"

def test_split_url_with_fragment():
    """Test URL splitting for URLs with fragment components"""
    domain, path = HTTPClient.split_url("https://example.com/path/to/file.txt#section1")
    assert domain == "example.com"
    assert path == "path/to/file.txt"

def test_split_url_with_query_and_fragment():
    """Test URL splitting for URLs with both query and fragment components"""
    domain, path = HTTPClient.split_url("https://example.com/path/to/file.txt?foo=bar#section1")
    assert domain == "example.com"
    assert path == "path/to/file.txt"
>>>>>>> REPLACE

</suggested_fix>

### Comment 6
<location> `tests/unit/test_client_http.py:25` </location>
<code_context>
+    assert path == "path/to/file.txt"
+
+
+def test_is_root_url():
+    """Test root URL detection"""
+    assert HTTPClient.is_root_url("https://example.com")
+    assert HTTPClient.is_root_url("https://example.com/")
+    assert not HTTPClient.is_root_url("https://example.com/path")
+    assert not HTTPClient.is_root_url("https://example.com?query=1")
+    assert not HTTPClient.is_root_url("https://example.com#fragment")
+
+
</code_context>

<issue_to_address>
Missing test for is_root_url with HTTP (not HTTPS) URLs.

Please add tests for 'http://' URLs to ensure is_root_url works correctly for both HTTP and HTTPS.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
def test_is_root_url():
    """Test root URL detection"""
    assert HTTPClient.is_root_url("https://example.com")
    assert HTTPClient.is_root_url("https://example.com/")
    assert not HTTPClient.is_root_url("https://example.com/path")
    assert not HTTPClient.is_root_url("https://example.com?query=1")
    assert not HTTPClient.is_root_url("https://example.com#fragment")
=======
def test_is_root_url():
    """Test root URL detection"""
    # HTTPS URLs
    assert HTTPClient.is_root_url("https://example.com")
    assert HTTPClient.is_root_url("https://example.com/")
    assert not HTTPClient.is_root_url("https://example.com/path")
    assert not HTTPClient.is_root_url("https://example.com?query=1")
    assert not HTTPClient.is_root_url("https://example.com#fragment")
    # HTTP URLs
    assert HTTPClient.is_root_url("http://example.com")
    assert HTTPClient.is_root_url("http://example.com/")
    assert not HTTPClient.is_root_url("http://example.com/path")
    assert not HTTPClient.is_root_url("http://example.com?query=1")
    assert not HTTPClient.is_root_url("http://example.com#fragment")
>>>>>>> REPLACE

</suggested_fix>

### Comment 7
<location> `tests/unit/test_client_http.py:34` </location>
<code_context>
+    assert not HTTPClient.is_root_url("https://example.com#fragment")
+
+
+def test_from_name_with_https():
+    """Test creating client from HTTPS URL"""
+    cache = Mock(spec=Cache)
+    client = HTTPSClient.from_name("https://example.com/path", cache, {})
+    assert client.protocol == "https"
+    assert client.name == "example.com/path"
+    assert client.PREFIX == "https://"
+
+
</code_context>

<issue_to_address>
Missing test for from_name with URLs lacking protocol.

Add tests for from_name methods using URLs without a protocol (e.g., 'example.com/path') to confirm default protocol assignment.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
def test_from_name_with_https():
    """Test creating client from HTTPS URL"""
    cache = Mock(spec=Cache)
    client = HTTPSClient.from_name("https://example.com/path", cache, {})
    assert client.protocol == "https"
    assert client.name == "example.com/path"
    assert client.PREFIX == "https://"
=======
def test_from_name_with_https():
    """Test creating client from HTTPS URL"""
    cache = Mock(spec=Cache)
    client = HTTPSClient.from_name("https://example.com/path", cache, {})
    assert client.protocol == "https"
    assert client.name == "example.com/path"
    assert client.PREFIX == "https://"

def test_from_name_without_protocol():
    """Test creating client from URL without protocol"""
    cache = Mock(spec=Cache)
    client = HTTPSClient.from_name("example.com/path", cache, {})
    assert client.protocol == "https"
    assert client.name == "example.com/path"
    assert client.PREFIX == "https://"
>>>>>>> REPLACE

</suggested_fix>

### Comment 8
<location> `tests/unit/test_client_http.py:52` </location>
<code_context>
+    assert client.PREFIX == "http://"
+
+
+def test_get_full_path_http():
+    """Test full path construction for HTTP"""
+    cache = Mock(spec=Cache)
+    client = HTTPClient("example.com:8080", {}, cache)
+
+    assert client.get_full_path("file.txt") == "http://example.com:8080/file.txt"
+
+
</code_context>

<issue_to_address>
Missing test for get_full_path with empty rel_path and with rel_path containing slashes.

Please add tests for get_full_path with an empty rel_path and with rel_path values that start with or contain multiple slashes to verify correct URL handling.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
def test_get_full_path_http():
    """Test full path construction for HTTP"""
    cache = Mock(spec=Cache)
    client = HTTPClient("example.com:8080", {}, cache)

    assert client.get_full_path("file.txt") == "http://example.com:8080/file.txt"
=======
def test_get_full_path_http():
    """Test full path construction for HTTP"""
    cache = Mock(spec=Cache)
    client = HTTPClient("example.com:8080", {}, cache)

    # Basic file path
    assert client.get_full_path("file.txt") == "http://example.com:8080/file.txt"

    # Empty rel_path
    assert client.get_full_path("") == "http://example.com:8080/"

    # rel_path starts with a slash
    assert client.get_full_path("/file.txt") == "http://example.com:8080/file.txt"

    # rel_path contains multiple slashes
    assert client.get_full_path("dir//file.txt") == "http://example.com:8080/dir//file.txt"
    assert client.get_full_path("/dir//file.txt") == "http://example.com:8080/dir//file.txt"
>>>>>>> REPLACE

</suggested_fix>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨

_{Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.}

sourcery-ai · 2025-09-12T08:04:22Z

src/datachain/client/http.py

+    @classmethod
+    def create_fs(cls, **kwargs) -> HTTPFileSystem:
+        # Configure HTTPFileSystem options
+        kwargs.setdefault("simple_links", True)
+        kwargs.setdefault("same_scheme", True)
+        kwargs.setdefault("cache_type", "bytes")
+
+        # HTTPFileSystem doesn't support version_aware, remove it if present
+        kwargs.pop("version_aware", None)
+


issue: Parsing logic in from_name may lead to unexpected results for URLs with query/fragment.

Currently, query and fragment components are omitted when constructing the name. Please confirm if this is intentional, or update the logic to include these components if needed.

sourcery-ai · 2025-09-12T08:04:22Z

src/datachain/client/http.py

+    def get_full_path(self, rel_path: str, version_id: Optional[str] = None) -> str:
+        """
+        Construct full HTTP/HTTPS URL from relative path.
+        Note: version_id is ignored as HTTP doesn't support versioning.
+        """
+        if version_id:
+            # HTTP doesn't support versioning, ignore it silently
+            pass
+
+        if self.name.startswith(("http://", "https://")):


suggestion: Heuristic for domain detection in get_full_path may misclassify some relative paths.

The current logic may incorrectly treat valid file names containing a dot as domains. Please refine the detection or clarify the expected input format.

sourcery-ai · 2025-09-12T08:04:22Z

src/datachain/client/http.py

+    def info_to_file(self, v: dict[str, Any], path: str) -> File:
+        """
+        Convert HTTP file info to DataChain File object.
+        """
+        # Extract ETag if available (remove quotes if present)
+        etag = v.get("ETag", "").strip('"')
+
+        # Parse last modified time
+        last_modified = v.get("last_modified")
+        if last_modified:


suggestion (bug_risk): Fallback to current time for last_modified may mask upstream issues.

Consider logging a warning or surfacing the error when last_modified parsing fails, rather than silently defaulting to the current time.

Suggested implementation:

def info_to_file(self, v: dict[str, Any], path: str) -> File: """ Convert HTTP file info to DataChain File object. """ import logging logger = logging.getLogger(__name__) # Extract ETag if available (remove quotes if present) etag = v.get("ETag", "").strip('"') # Parse last modified time last_modified = v.get("last_modified") if last_modified: if isinstance(last_modified, str):

if isinstance(last_modified, str): try: # Try parsing RFC 1123 format from email.utils import parsedate_to_datetime last_modified_dt = parsedate_to_datetime(last_modified) last_modified = last_modified_dt.timestamp() except Exception as e: logger.warning( "Failed to parse last_modified '%s' for path '%s': %s. Falling back to current time.", last_modified, path, e ) import time last_modified = time.time()

sourcery-ai · 2025-09-12T08:04:22Z

src/datachain/client/http.py

+        # HTTP/HTTPS doesn't support directory listing - always raise error
+        raise NotImplementedError(f"Cannot download file from {full_url}")


suggestion: Error message in _fetch_dir may be misleading about operation type.

Update the error message to specify that directory listing is not supported for HTTP/HTTPS, rather than referring to file download.

Suggested change

# HTTP/HTTPS doesn't support directory listing - always raise error

raise NotImplementedError(f"Cannot download file from {full_url}")

# HTTP/HTTPS doesn't support directory listing - always raise error

raise NotImplementedError(f"Directory listing is not supported for HTTP/HTTPS at {full_url}")

sourcery-ai · 2025-09-12T08:04:23Z

tests/unit/test_client_http.py

+def test_split_url_https():
+    """Test URL splitting for HTTPS URLs"""
+    domain, path = HTTPClient.split_url("https://example.com/path/to/file.txt")
+    assert domain == "example.com"
+    assert path == "path/to/file.txt"


suggestion (testing): Missing test for split_url with query and fragment components.

Add tests for URLs containing query parameters and fragments to verify correct handling by split_url.

Suggested change

def test_split_url_https():

"""Test URL splitting for HTTPS URLs"""

domain, path = HTTPClient.split_url("https://example.com/path/to/file.txt")

assert domain == "example.com"

assert path == "path/to/file.txt"

def test_split_url_https():

"""Test URL splitting for HTTPS URLs"""

domain, path = HTTPClient.split_url("https://example.com/path/to/file.txt")

assert domain == "example.com"

assert path == "path/to/file.txt"

def test_split_url_with_query():

"""Test URL splitting for URLs with query parameters"""

domain, path = HTTPClient.split_url("https://example.com/path/to/file.txt?foo=bar&baz=qux")

assert domain == "example.com"

assert path == "path/to/file.txt"

def test_split_url_with_fragment():

"""Test URL splitting for URLs with fragment components"""

domain, path = HTTPClient.split_url("https://example.com/path/to/file.txt#section1")

assert domain == "example.com"

assert path == "path/to/file.txt"

def test_split_url_with_query_and_fragment():

"""Test URL splitting for URLs with both query and fragment components"""

domain, path = HTTPClient.split_url("https://example.com/path/to/file.txt?foo=bar#section1")

assert domain == "example.com"

assert path == "path/to/file.txt"

sourcery-ai · 2025-09-12T08:04:23Z

tests/unit/test_client_http.py

+def test_is_root_url():
+    """Test root URL detection"""
+    assert HTTPClient.is_root_url("https://example.com")
+    assert HTTPClient.is_root_url("https://example.com/")
+    assert not HTTPClient.is_root_url("https://example.com/path")
+    assert not HTTPClient.is_root_url("https://example.com?query=1")
+    assert not HTTPClient.is_root_url("https://example.com#fragment")


suggestion (testing): Missing test for is_root_url with HTTP (not HTTPS) URLs.

Please add tests for 'http://' URLs to ensure is_root_url works correctly for both HTTP and HTTPS.

Suggested change

def test_is_root_url():

"""Test root URL detection"""

assert HTTPClient.is_root_url("https://example.com")

assert HTTPClient.is_root_url("https://example.com/")

assert not HTTPClient.is_root_url("https://example.com/path")

assert not HTTPClient.is_root_url("https://example.com?query=1")

assert not HTTPClient.is_root_url("https://example.com#fragment")

def test_is_root_url():

"""Test root URL detection"""

# HTTPS URLs

assert HTTPClient.is_root_url("https://example.com")

assert HTTPClient.is_root_url("https://example.com/")

assert not HTTPClient.is_root_url("https://example.com/path")

assert not HTTPClient.is_root_url("https://example.com?query=1")

assert not HTTPClient.is_root_url("https://example.com#fragment")

# HTTP URLs

assert HTTPClient.is_root_url("http://example.com")

assert HTTPClient.is_root_url("http://example.com/")

assert not HTTPClient.is_root_url("http://example.com/path")

assert not HTTPClient.is_root_url("http://example.com?query=1")

assert not HTTPClient.is_root_url("http://example.com#fragment")

sourcery-ai · 2025-09-12T08:04:23Z

tests/unit/test_client_http.py

+def test_from_name_with_https():
+    """Test creating client from HTTPS URL"""
+    cache = Mock(spec=Cache)
+    client = HTTPSClient.from_name("https://example.com/path", cache, {})
+    assert client.protocol == "https"
+    assert client.name == "example.com/path"
+    assert client.PREFIX == "https://"


suggestion (testing): Missing test for from_name with URLs lacking protocol.

Add tests for from_name methods using URLs without a protocol (e.g., 'example.com/path') to confirm default protocol assignment.

Suggested change

def test_from_name_with_https():

"""Test creating client from HTTPS URL"""

cache = Mock(spec=Cache)

client = HTTPSClient.from_name("https://example.com/path", cache, {})

assert client.protocol == "https"

assert client.name == "example.com/path"

assert client.PREFIX == "https://"

def test_from_name_with_https():

"""Test creating client from HTTPS URL"""

cache = Mock(spec=Cache)

client = HTTPSClient.from_name("https://example.com/path", cache, {})

assert client.protocol == "https"

assert client.name == "example.com/path"

assert client.PREFIX == "https://"

def test_from_name_without_protocol():

"""Test creating client from URL without protocol"""

cache = Mock(spec=Cache)

client = HTTPSClient.from_name("example.com/path", cache, {})

assert client.protocol == "https"

assert client.name == "example.com/path"

assert client.PREFIX == "https://"

sourcery-ai · 2025-09-12T08:04:23Z

tests/unit/test_client_http.py

+def test_get_full_path_http():
+    """Test full path construction for HTTP"""
+    cache = Mock(spec=Cache)
+    client = HTTPClient("example.com:8080", {}, cache)
+
+    assert client.get_full_path("file.txt") == "http://example.com:8080/file.txt"


suggestion (testing): Missing test for get_full_path with empty rel_path and with rel_path containing slashes.

Please add tests for get_full_path with an empty rel_path and with rel_path values that start with or contain multiple slashes to verify correct URL handling.

Suggested change

def test_get_full_path_http():

"""Test full path construction for HTTP"""

cache = Mock(spec=Cache)

client = HTTPClient("example.com:8080", {}, cache)

assert client.get_full_path("file.txt") == "http://example.com:8080/file.txt"

def test_get_full_path_http():

"""Test full path construction for HTTP"""

cache = Mock(spec=Cache)

client = HTTPClient("example.com:8080", {}, cache)

# Basic file path

assert client.get_full_path("file.txt") == "http://example.com:8080/file.txt"

# Empty rel_path

assert client.get_full_path("") == "http://example.com:8080/"

# rel_path starts with a slash

assert client.get_full_path("/file.txt") == "http://example.com:8080/file.txt"

# rel_path contains multiple slashes

assert client.get_full_path("dir//file.txt") == "http://example.com:8080/dir//file.txt"

assert client.get_full_path("/dir//file.txt") == "http://example.com:8080/dir//file.txt"

sourcery-ai · 2025-09-12T08:04:23Z

src/datachain/client/http.py

+        Construct full HTTP/HTTPS URL from relative path.
+        Note: version_id is ignored as HTTP doesn't support versioning.
+        """
+        if version_id:


issue (code-quality): We've found these issues:

Remove redundant conditional (remove-redundant-if)

Lift return into if (lift-return-into-if)

dmpetrov added 4 commits September 11, 2025 17:11

AI first pass

4675962

AI 2nd pass

17b3a00

AI pass 3

bdfe8a8

AI pass 4

13ca0ca

dmpetrov marked this pull request as draft September 12, 2025 01:07

sourcery-ai bot reviewed Sep 12, 2025

View reviewed changes

[pre-commit.ci] auto fixes from pre-commit.com hooks

8b14a76

for more information, see https://pre-commit.ci

dmpetrov added 2 commits September 11, 2025 22:58

clean up tests

e36fc88

Merge branch 'main' into http_client

7788195

dmpetrov and others added 9 commits September 11, 2025 23:14

clean up

6195e19

[pre-commit.ci] auto fixes from pre-commit.com hooks

3ee9d71

for more information, see https://pre-commit.ci

no dir lists

42afd36

rm backup file

a86bd05

Resolve modify/delete: drop backup file

5acd7e1

simplify test

57b3130

rm async

a083be5

rm await

c614538

tests

c142cae

dmpetrov marked this pull request as ready for review September 12, 2025 08:01

sourcery-ai bot reviewed Sep 12, 2025

View reviewed changes

dmpetrov added 2 commits September 12, 2025 01:30

unit tests

d528a33

remove obvious comments

7fa3d28

	file = self.info_to_file(info, prefix if prefix else filename)
	file = self.info_to_file(info, prefix or filename)

		# HTTP/HTTPS doesn't support directory listing - always raise error
		raise NotImplementedError(f"Cannot download file from {full_url}")

Http client #1331

Are you sure you want to change the base?

Http client #1331

Conversation

dmpetrov commented Sep 12, 2025 • edited by sourcery-ai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by Sourcery

Summary by Sourcery

Uh oh!

sourcery-ai bot commented Sep 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviewer's Guide

File-Level Changes

Interacting with Sourcery

Customizing Your Experience

Getting Help

Uh oh!

sourcery-ai bot left a comment

Choose a reason for hiding this comment

Uh oh!

sourcery-ai bot Sep 12, 2025

Choose a reason for hiding this comment

Uh oh!

sourcery-ai bot Sep 12, 2025

Choose a reason for hiding this comment

Uh oh!

sourcery-ai bot Sep 12, 2025

Choose a reason for hiding this comment

Uh oh!

sourcery-ai bot Sep 12, 2025

Choose a reason for hiding this comment

Uh oh!

sourcery-ai bot Sep 12, 2025

Choose a reason for hiding this comment

Uh oh!

sourcery-ai bot Sep 12, 2025

Choose a reason for hiding this comment

Uh oh!

cloudflare-workers-and-pages bot commented Sep 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Deploying datachain-documentation with Cloudflare Pages

Uh oh!

codecov bot commented Sep 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

sourcery-ai bot left a comment

Choose a reason for hiding this comment

Uh oh!

sourcery-ai bot Sep 12, 2025

Choose a reason for hiding this comment

Uh oh!

sourcery-ai bot Sep 12, 2025

Choose a reason for hiding this comment

Uh oh!

sourcery-ai bot Sep 12, 2025

Choose a reason for hiding this comment

Uh oh!

sourcery-ai bot Sep 12, 2025

Choose a reason for hiding this comment

Uh oh!

sourcery-ai bot Sep 12, 2025

Choose a reason for hiding this comment

Uh oh!

sourcery-ai bot Sep 12, 2025

Choose a reason for hiding this comment

Uh oh!

sourcery-ai bot Sep 12, 2025

Choose a reason for hiding this comment

Uh oh!

sourcery-ai bot Sep 12, 2025

Choose a reason for hiding this comment

Uh oh!

sourcery-ai bot Sep 12, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

dmpetrov commented Sep 12, 2025 •

edited by sourcery-ai bot

Loading

sourcery-ai bot commented Sep 12, 2025 •

edited

Loading

cloudflare-workers-and-pages bot commented Sep 12, 2025 •

edited

Loading

codecov bot commented Sep 12, 2025 •

edited

Loading