docs(skills): cover the new spark function namespace

timsaucer · claude · timsaucer · commit e2eceb40c0fc · 2026-05-30T09:45:03.000-04:00
`functions.spark` mirrors `pyspark.sql.functions` and now ships on this
branch. Update every skill that references the function surface:

- skills/datafusion_python/SKILL.md (user-facing): add an import
  reference, a Core Abstractions row, and a "Spark-Compatible Functions"
  subsection listing coverage by category, the SQL-vs-DataFrame usage
  (`enable_spark_functions`), and the divergent-semantics table
  (concat NULL, round HALF_UP, trunc) so callers know which namespace
  to pick.
- .ai/skills/check-upstream/SKILL.md: new area for the `datafusion-spark`
  crate with the coverage policy (parity with pyspark, extras allowed
  when positional pyspark calls still work). Hygiene check also now
  spans `functions/spark.py`'s `__all__`.
- .ai/skills/audit-skill-md/SKILL.md: add `functions.spark` to the
  surface table and a `spark-functions` scope so this audit also
  validates the new subsection and divergent-semantics table.
- .ai/skills/make-pythonic/SKILL.md: explicit scope note that the
  spark namespace is a deliberate pyspark mirror — generic native-type
  coercion does not apply there. Path references updated to the new
  `functions/__init__.py` module layout.
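
The `__all__` hygiene check that now also covers `functions/spark.py`
can be sketched roughly as below. This is a minimal illustration, not
the skill's actual tooling: the helper names (`declared_all`,
`missing_from_all`) and the `SAMPLE` module text are hypothetical, and
it assumes `__all__` is a plain module-level list literal.

```python
import ast
import re

# Same pattern the check-upstream skill greps for: top-level lowercase defs.
DEF_PATTERN = re.compile(r"^def ([a-z_][a-z0-9_]*)\(", re.MULTILINE)


def declared_all(source: str) -> set[str]:
    """Return the names in a module-level `__all__ = [...]` assignment."""
    for node in ast.parse(source).body:
        if isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name) and target.id == "__all__":
                    return set(ast.literal_eval(node.value))
    return set()


def missing_from_all(source: str) -> list[str]:
    """Public top-level functions that are defined but absent from `__all__`."""
    defined = {
        name for name in DEF_PATTERN.findall(source) if not name.startswith("_")
    }
    return sorted(defined - declared_all(source))


# Hypothetical module text standing in for functions/spark.py.
SAMPLE = '''
__all__ = ["sha2", "crc32"]

def sha2(col, numBits): ...
def crc32(col): ...
def xxhash64(*cols): ...
def _helper(): ...
'''

print(missing_from_all(SAMPLE))
```

Running the same helper over each of the two module files would give
one report per file, matching the "in each file" wording of the check.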

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/.ai/skills/audit-skill-md/SKILL.md b/.ai/skills/audit-skill-md/SKILL.md
@@ -48,7 +48,8 @@ exposed at the package root), include it.
 | `SessionContext` | `python/datafusion/context.py` | "Data Loading" |
 | `DataFrame` | `python/datafusion/dataframe.py` | "DataFrame Operations Quick Reference", "Executing and Collecting Results", "Idiomatic Patterns" |
 | `Expr` | `python/datafusion/expr.py` | "Expression Building", "Common Pitfalls" |
-| `functions` | `python/datafusion/functions.py` | "Available Functions (Categorized)", scattered uses throughout |
+| `functions` | `python/datafusion/functions/__init__.py` | "Available Functions (Categorized)", scattered uses throughout |
+| `functions.spark` | `python/datafusion/functions/spark.py` | "Available Functions (Categorized)" → "Spark-Compatible Functions" subsection |
 | Top-level helpers (`col`, `lit`, `WindowFrame`, ...) | `python/datafusion/__init__.py` | "Import Conventions", "Core Abstractions" |
 
 ## Scope argument
@@ -61,7 +62,8 @@ is given or `all` is specified, audit every area.
 | `session-context` | `SessionContext` methods and the "Data Loading" section |
 | `dataframe` | `DataFrame` methods and the operations / executing / patterns sections |
 | `expr` | `Expr` methods/operators and the "Expression Building" section |
-| `functions` | `functions.py` `__all__` and the "Available Functions (Categorized)" section |
+| `functions` | `functions/__init__.py` `__all__` and the "Available Functions (Categorized)" section |
+| `spark-functions` | `functions/spark.py` `__all__`, the "Spark-Compatible Functions" subsection, and the divergent-semantics table |
 | `patterns` | "Idiomatic Patterns" section — confirm patterns still match recommended style |
 | `pitfalls` | "Common Pitfalls" — confirm each pitfall still reproduces, drop ones fixed upstream |
 | `version-notes` | Cross-check version annotations (see below) |
@@ -123,7 +125,11 @@ For each function name, method name, or import shown in `SKILL.md`, verify it
 still exists in the current API:
 
 - Function names mentioned in prose or in the categorized list should appear
-  in `python/datafusion/functions.py`'s `__all__`.
+  in `python/datafusion/functions/__init__.py`'s `__all__`.
+- Spark function names mentioned in the "Spark-Compatible Functions"
+  subsection should appear in `python/datafusion/functions/spark.py`'s
+  `__all__`. Also confirm the divergent-semantics table still matches the
+  current spark vs. main signatures.
 - Method calls in code blocks should resolve against the current class.
 - Imports (`from datafusion import ...`) should succeed against the current
   `__init__.py`.
diff --git a/.ai/skills/check-upstream/SKILL.md b/.ai/skills/check-upstream/SKILL.md
@@ -209,18 +209,58 @@ These upstream FFI types have been reviewed and do not need to be independently
    - FFI example in `examples/datafusion-ffi-example/`
    - Type appears in union type hints where accepted
 
-### 8. `__all__` Hygiene (functions.py)
+### 8. Spark-Compatible Functions (`datafusion-spark` crate)
+
+**Upstream source of truth:**
+- Crate source: https://github.com/apache/datafusion/tree/main/datafusion/spark/src
+- Rust docs: https://docs.rs/datafusion-spark/latest/datafusion_spark/
+
+**Where they are exposed in this project:**
+- Python API: `python/datafusion/functions/spark.py` — each function wraps
+  a call to `datafusion._internal.functions.spark`; the public surface is
+  the module's `__all__` list.
+- Rust bindings: `crates/core/src/spark_functions.rs` — `#[pyfunction]`
+  definitions registered via `init_module()` and re-exported under
+  `datafusion._internal.functions.spark`.
+
+**Coverage policy:** The spark namespace mirrors
+`pyspark.sql.functions` parameter names and shapes exactly so pyspark
+callers can paste code unchanged. Extras over pyspark are permitted as
+long as positional pyspark calls still work — for example, the spark
+`avg` / `try_sum` / `collect_list` / `collect_set` retain the
+`distinct`/`filter`/`order_by`/`null_treatment` kwargs from the main
+namespace while pyspark's single-positional form continues to work.
+
+**How to check:**
+1. Fetch the upstream `datafusion-spark` function list from the crate
+   source under `datafusion/spark/src/function/` (each subdirectory is a
+   category: `string/`, `math/`, `datetime/`, etc.). The crate's
+   `function.rs` collects all `ScalarUDF` factories.
+2. Cross-reference against `pyspark.sql.functions` for the public-facing
+   shape — pyspark is the contract this namespace is matching.
+3. Compare against the functions listed in
+   `python/datafusion/functions/spark.py`'s `__all__`. A function is
+   covered if it exists in the Python `spark` namespace, even if it
+   aliases another function's Rust binding.
+4. Report functions that are missing from the Python spark namespace.
+
+**Cross-cutting reference:** The longer-form roadmap for spark coverage
+lives in `PYSPARK_ALIGNMENT_PLAN.md` (root of repo). Use it as the source
+of truth for which gaps are intentionally deferred vs. ready to land.
+
+### 9. `__all__` Hygiene (functions/__init__.py and functions/spark.py)
 
 Independent of upstream parity, also flag public `def` symbols in
-`python/datafusion/functions.py` that are missing from the module's
-`__all__`. These are functions a user can call but that do not show up in
+`python/datafusion/functions/__init__.py` **and** `python/datafusion/functions/spark.py`
+that are missing from that file's `__all__`. These are functions a user
+can call but that do not show up in
 `from datafusion.functions import *`, in tab-completion against the
 namespace, or in generated API docs — typically an oversight rather than
 an intentional omission.
 
 **How to check:**
-1. Grep for `^def ([a-z_][a-z0-9_]*)\(` in `python/datafusion/functions.py`
-   to enumerate every public function definition.
+1. Grep for `^def ([a-z_][a-z0-9_]*)\(` in each file to enumerate every
+   public function definition.
 2. Read the `__all__` list at the top of the same file.
 3. Report any function in (1) that is not in (2). Skip private helpers
    (names starting with `_`).
diff --git a/.ai/skills/make-pythonic/SKILL.md b/.ai/skills/make-pythonic/SKILL.md
@@ -29,9 +29,30 @@ You are improving the datafusion-python API to feel more natural to Python users
 
 **Core principle:** A Python user should be able to write `split_part(col("a"), ",", 2)` instead of `split_part(col("a"), lit(","), lit(2))` when the arguments are contextually obvious literals.
 
+## Scope: `functions` vs `functions.spark`
+
+This skill targets the **default `datafusion.functions` namespace** (file:
+`python/datafusion/functions/__init__.py`). Do **not** apply pythonic
+coercion to `python/datafusion/functions/spark.py` — that namespace is a
+deliberate mirror of `pyspark.sql.functions`, so its parameter names,
+order, and types must match pyspark exactly. Adding `Expr | int` style
+unions there would diverge from the pyspark contract callers rely on.
+
+Two exceptions where pythonic-style additions in `functions.spark` are
+still on-brand:
+- **Pyspark itself accepts a native type.** Pyspark's `format_string`
+  takes `format: str | Column`; the spark wrapper already auto-promotes a
+  plain `str` to a literal — keep parity.
+- **Strictly additive optional kwargs** that pyspark also has (e.g.
+  `like(escapeChar=...)`). These belong in the `PYSPARK_ALIGNMENT_PLAN.md`
+  follow-up PRs, not in a make-pythonic pass.
+
+If the user explicitly scopes to "spark", validate parity with pyspark
+rather than applying generic coercion.
+
 ## How to Identify Candidates
 
-The user may specify a scope via `$ARGUMENTS`. If no scope is given or "all" is specified, audit all functions in `python/datafusion/functions.py`.
+The user may specify a scope via `$ARGUMENTS`. If no scope is given or "all" is specified, audit all functions in `python/datafusion/functions/__init__.py`.
 
 For each function, determine if any parameter can accept native Python types by evaluating **two complementary signals**:
 
@@ -309,7 +330,7 @@ For each function being updated:
 
 ### Step 1: Analyze the Function
 
-1. Read the current Python function signature in `python/datafusion/functions.py`
+1. Read the current Python function signature in `python/datafusion/functions/__init__.py`
 2. Read the Rust binding in `crates/core/src/functions.rs`
 3. Optionally check the upstream DataFusion docs for the function
 4. Determine which category (A, B, or C) applies to each parameter
@@ -346,7 +367,7 @@ dfn.functions.left(dfn.col("a"), 3)
 
 After making changes, run the doctests to verify:
 ```bash
-python -m pytest --doctest-modules python/datafusion/functions.py -v
+python -m pytest --doctest-modules python/datafusion/functions/__init__.py -v
 ```
 
 ## Coercion Helper Pattern
diff --git a/skills/datafusion_python/SKILL.md b/skills/datafusion_python/SKILL.md
@@ -26,12 +26,14 @@ can interoperate with DataFusion.
 | `DataFrame` | Lazy query builder. Each method returns a new DataFrame. | Returned by context methods |
 | `Expr` | Expression tree node (column ref, literal, function call, ...). | `from datafusion import col, lit` |
 | `functions` | 290+ built-in scalar, aggregate, and window functions. | `from datafusion import functions as F` |
+| `functions.spark` | PySpark-compatible function surface (parameter names match `pyspark.sql.functions`). | `from datafusion.functions import spark` |
 
 ## Import Conventions
 
 ```python
 from datafusion import SessionContext, col, lit
 from datafusion import functions as F
+from datafusion.functions import spark   # only when porting pyspark code
 ```
 
 ## Data Loading
@@ -762,3 +764,58 @@ F.left(col("c_phone"), lit(2))                # prefix shortcut
 
 **Other**: `in_list`, `order_by`, `alias`, `col`, `encode`, `decode`,
 `to_hex`, `to_char`, `uuid`, `version`, `bit_length`, `octet_length`
+
+### Spark-Compatible Functions
+
+A separate `datafusion.functions.spark` namespace mirrors the
+`pyspark.sql.functions` API for callers porting code from PySpark.
+
+```python
+from datafusion.functions import spark
+```
+
+Use it for DataFrame work; for SQL, register the Spark UDFs first:
+
+```python
+ctx = SessionContext()
+ctx.enable_spark_functions()                  # makes Spark UDFs visible to SQL
+ctx.sql("SELECT sha2('hello', 256)").show()
+```
+
+Coverage spans aggregate (`avg`, `try_sum`, `collect_list`, `collect_set`),
+array (`array`, `array_contains`, `array_repeat`, `shuffle`, `slice`,
+`size`), bitmap, bitwise (`shiftleft`, `shiftright`, `shiftrightunsigned`,
+`bit_get`, `bit_count`, `bitwise_not`), datetime (`add_months`,
+`date_add`, `date_sub`, `date_diff`, `date_trunc`, `time_trunc`, `trunc`,
+`next_day`, `from_utc_timestamp`, `to_utc_timestamp`, `unix_date`,
+`unix_micros`/`millis`/`seconds`, `make_interval`, `make_dt_interval`),
+hash (`crc32`, `sha1`, `sha2`, `xxhash64`), JSON (`json_tuple`),
+map (`map_from_arrays`, `map_from_entries`, `str_to_map`), math
+(`abs`, `ceil`, `floor`, `round`, `expm1`, `factorial`, `hex`,
+`modulus`/`pmod`, `rint`, `unhex`, `width_bucket`, `csc`/`sec`,
+`negative`, `bin`), string (`ascii`, `base64`/`unbase64`, `char`,
+`concat`, `elt`, `like`/`ilike`, `length`, `luhn_check`, `format_string`,
+`space`, `substring`, `soundex`, `is_valid_utf8`/`make_valid_utf8`),
+URL (`parse_url`/`try_parse_url`, `url_decode`/`url_encode`,
+`try_url_decode`), and conditional (`if_`, `spark_cast`).
+
+The full list is in the API reference and in the `__all__` of
+`python/datafusion/functions/spark.py`.
+
+**Semantic divergences vs the default namespace.** Functions that exist in
+both `functions` and `functions.spark` may behave differently:
+
+| Function | Default `functions` | `functions.spark` |
+|---|---|---|
+| `concat` | NULL inputs treated as empty | NULL inputs propagate to NULL |
+| `round` | HALF_EVEN (banker's) | HALF_UP |
+| `trunc` | Numeric truncation | Date truncation |
+| `substring` | 1-indexed | 1-indexed (parity) |
+
+Pick the namespace whose semantics match your intent. Both namespaces can
+be imported side by side; `enable_spark_functions()` only affects SQL.
+
+**Parameter names match pyspark exactly.** The spark namespace uses
+pyspark parameter names (`col`, `str`, `numBits`, `partToExtract`, ...) so
+you can paste pyspark code and keep keyword arguments working. The default
+namespace keeps DataFusion's parameter names.