fix(api): confine /tasks/file reads to OUTPUT_DIR (path traversal)#223

Open
Osamaali313 wants to merge 2 commits into
zai-org:mainfrom
Osamaali313:fix/file-endpoint-path-traversal

@Osamaali313

Problem (arbitrary file read / path traversal)

GET /api/v1/tasks/file (read_file in apps/backend/app/api/tasks.py) returns the contents of any path the client supplies, after checking only that it exists and is a file:

@router.get("/file")
async def read_file(path: str):
    ...
    file_path = Path(path)
    if not file_path.exists(): ...   # responds 404
    if not file_path.is_file(): ...  # responds 400
    with open(file_path, "rb") as f:
        content = f.read()        # served back (image bytes, or JSON text for other types)

There is no confinement to an allowed directory, so the endpoint reads and returns any file the service process can access:

GET /api/v1/tasks/file?path=/etc/passwd
GET /api/v1/tasks/file?path=C:\Windows\win.ini
GET /api/v1/tasks/file?path=../../some/other/file

For a self-hostable service this is an arbitrary file read — config files, source, SQLite DB, keys, etc. become readable by anyone who can reach the API.
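
To make the gap concrete, here is a minimal sketch of why the two existing checks do not confine anything. It is not the project's code: `naive_can_read` is a hypothetical helper that mirrors only the `exists()` and `is_file()` checks, and `output_dir` is a temporary stand-in for `OUTPUT_DIR`.

```python
import tempfile
from pathlib import Path


def naive_can_read(path: str) -> bool:
    """Mirror the original checks: the path exists and is a regular file."""
    file_path = Path(path)
    return file_path.exists() and file_path.is_file()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    output_dir = root / "output"      # hypothetical stand-in for OUTPUT_DIR
    output_dir.mkdir()
    secret = root / "secret.txt"      # lives outside output_dir
    secret.write_text("not meant to be served")

    # An absolute out-of-tree path passes both checks.
    absolute_ok = naive_can_read(str(secret))

    # A "../" traversal that starts inside output_dir passes as well.
    traversal_ok = naive_can_read(str(output_dir / ".." / "secret.txt"))
```

Both `absolute_ok` and `traversal_ok` come out true, because neither check compares the path against an allowed directory.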

Fix

Confine reads to settings.OUTPUT_DIR, which is the only location this app writes the files this endpoint is meant to serve:

  • uploads default to OUTPUT_DIR (upload_file_manager.save_to_path),
  • task artifacts (the OCR result images/markdown referenced via this endpoint) live in OUTPUT_DIR/<task_id>/ (pipeline_flow).

The requested path is resolved (collapsing ../ and symlinks) and required to be under OUTPUT_DIR; otherwise it returns 403.
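
The resolve-then-contain rule can be sketched in isolation as follows. This is an illustration under assumptions, not the endpoint itself: `is_confined` is a hypothetical helper, `output_dir` is a temporary stand-in for `OUTPUT_DIR`, and the HTTP layer (the 403 response) is left out.

```python
import tempfile
from pathlib import Path


def is_confined(requested: str, base_dir: Path) -> bool:
    """Return True only if `requested` resolves to a location under `base_dir`."""
    base = base_dir.resolve()
    # resolve() makes the path absolute, collapses "../" and follows symlinks.
    target = Path(requested).resolve()
    return target.is_relative_to(base)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    output_dir = root / "output"      # hypothetical stand-in for OUTPUT_DIR
    (output_dir / "task1").mkdir(parents=True)
    inside = output_dir / "task1" / "img.png"
    inside.write_bytes(b"png")
    outside = root / "secret.txt"
    outside.write_text("secret")

    inside_allowed = is_confined(str(inside), output_dir)
    outside_allowed = is_confined(str(outside), output_dir)
    traversal_allowed = is_confined(
        str(output_dir / ".." / "secret.txt"), output_dir
    )
```

Resolving both sides before comparing matters: comparing the raw strings would accept `<OUTPUT_DIR>/../secret.txt`, since it starts with the base directory. Note that this sketch resolves a relative `requested` against the process working directory.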

Validation

Verified the confinement decision against the real serving logic:

| request | before (exists && is_file) | after (confined to OUTPUT_DIR) |
|---|---|---|
| legit file under OUTPUT_DIR/<task_id>/img.png | served ✅ | served ✅ |
| ?path=<file outside OUTPUT_DIR> | served | 403 ✅ |
| ?path=<OUTPUT_DIR>/../../secret (traversal) | served/blocked inconsistently | 403 ✅ |
| ?path=/etc/hostname (or C:\Windows\win.ini) | served | 403 ✅ |

Legitimate use (the markdown image URLs this endpoint backs) is unchanged; only out-of-tree paths are rejected.

The GET /api/v1/tasks/file endpoint (read_file) returned the contents of
any client-supplied path after only checking that it exists and is a file,
with no confinement to an allowed directory. A request such as
`/api/v1/tasks/file?path=/etc/passwd` (or a "../" traversal) would read and
return any file readable by the service process -- an arbitrary file read.

Resolve the requested path (collapsing "../" and symlinks) and require it
to live under settings.OUTPUT_DIR -- the only location task outputs and
uploads are written (uploads default to OUTPUT_DIR; task artifacts live in
OUTPUT_DIR/<task_id>/). Legitimate files served by this endpoint (the OCR
result images/markdown) are unaffected; out-of-tree paths now return 403.
Copilot AI review requested due to automatic review settings June 15, 2026 17:22

Copilot AI left a comment

Contributor

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR hardens the read_file API by preventing arbitrary file reads, confining accessible paths to the configured OUTPUT_DIR.

Changes:

  • Resolve and validate requested file paths against settings.OUTPUT_DIR
  • Return 400 for invalid paths and 403 for paths outside the allowed directory

Comment on lines +114 to +131
    # Confine reads to the configured output directory. Resolve the
    # requested path (collapsing any "../" traversal and symlinks) and
    # require it to live under OUTPUT_DIR; otherwise an arbitrary `path`
    # (e.g. "/etc/passwd") would be read and returned, exposing any file
    # readable by the service.
    base_dir = Path(settings.OUTPUT_DIR).resolve()
    try:
        file_path = Path(path).resolve()
    except (OSError, ValueError, RuntimeError):
        raise HTTPException(
            status_code=status.HTTP_400_BAD_REQUEST,
            detail="Invalid path",
        )
    if not file_path.is_relative_to(base_dir):
        raise HTTPException(
            status_code=status.HTTP_403_FORBIDDEN,
            detail="Access denied: path is outside the allowed directory",
        )

Per review: Path(path).resolve() resolved a relative `path` against the
process CWD, so a legitimate OUTPUT_DIR-relative request (e.g.
"task1/img.png") would be rejected. Resolve non-absolute paths under
OUTPUT_DIR before the containment check -- preserving the intended
semantics while still blocking traversal and absolute out-of-tree paths.
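
A minimal sketch of the adjustment this commit message describes, under assumptions: `resolve_under` is a hypothetical helper (the actual change in the follow-up commit is not shown here), and `output_dir` is a temporary stand-in for `OUTPUT_DIR`. Non-absolute paths are joined to the base directory before resolution, and the containment check then applies to every input.

```python
import tempfile
from pathlib import Path
from typing import Optional


def resolve_under(path: str, base_dir: Path) -> Optional[Path]:
    """Return the resolved path if it lies under `base_dir`, else None."""
    base = base_dir.resolve()
    candidate = Path(path)
    if not candidate.is_absolute():
        # Interpret relative requests against the base, not the process CWD.
        candidate = base / candidate
    resolved = candidate.resolve()
    return resolved if resolved.is_relative_to(base) else None


with tempfile.TemporaryDirectory() as tmp:
    output_dir = Path(tmp) / "output"   # hypothetical stand-in for OUTPUT_DIR
    output_dir.mkdir()
    base = output_dir.resolve()
    expected = base / "task1" / "img.png"

    relative_hit = resolve_under("task1/img.png", output_dir)
    absolute_hit = resolve_under(str(expected), output_dir)
    traversal_hit = resolve_under("../../secret", output_dir)
    absolute_miss = resolve_under("/etc/passwd", output_dir)
```

The join happens before `resolve()`, so a relative request that climbs out with `../` is still collapsed and then rejected by the same containment check.
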
@Osamaali313

Author

Thanks @copilot — addressed in 62ade70.

Relative paths vs CWD: good catch. Non-absolute paths are now joined to OUTPUT_DIR before resolution, so an OUTPUT_DIR-relative request like task1/img.png is honored instead of being resolved against the process CWD and rejected. Traversal and absolute out-of-tree paths still return 403. Verified:

| request | result |
|---|---|
| absolute path under OUTPUT_DIR (what the app emits) | served |
| task1/img.png (relative, under OUTPUT_DIR) | served |
| ../../secret, /etc/passwd | 403 |

Path.is_relative_to (3.9+): this backend declares requires-python = ">=3.12" (apps/backend/pyproject.toml), so is_relative_to is always available and no 3.8 fallback is needed. Happy to switch to the relative_to + try/except ValueError pattern if you'd prefer it regardless.
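
For reference, the `relative_to` + `try/except ValueError` alternative mentioned above could look roughly like this. It is a sketch, not code from this PR: `is_under` is a hypothetical helper, the example paths are made up, and both arguments are assumed to be already resolved. `PurePosixPath` is used only so the example behaves the same on any platform.

```python
from pathlib import PurePath, PurePosixPath


def is_under(target: PurePath, base: PurePath) -> bool:
    """Containment check that does not rely on is_relative_to()."""
    try:
        # relative_to() raises ValueError when `target` is not below `base`.
        target.relative_to(base)
    except ValueError:
        return False
    return True


base = PurePosixPath("/srv/output")     # hypothetical resolved OUTPUT_DIR

inside = is_under(PurePosixPath("/srv/output/task1/img.png"), base)
outside = is_under(PurePosixPath("/etc/passwd"), base)
# The comparison is per path component, so a sibling directory that merely
# shares a string prefix with the base is not treated as contained.
sibling = is_under(PurePosixPath("/srv/output-old/img.png"), base)
```

Both forms give the same answer for resolved paths; the choice is mainly about which Python versions need to be supported.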
