
Commit e2eceb4

timsaucer and claude committed
docs(skills): cover the new spark function namespace
`functions.spark` mirrors `pyspark.sql.functions` and now ships on this branch. Update every skill that references the function surface:

- skills/datafusion_python/SKILL.md (user-facing): add an import reference, a Core Abstractions row, and a "Spark-Compatible Functions" subsection listing coverage by category, the SQL-vs-DataFrame usage (`enable_spark_functions`), and the divergent-semantics table (concat NULL, round HALF_UP, trunc) so callers know which namespace to pick.
- .ai/skills/check-upstream/SKILL.md: new area for the `datafusion-spark` crate with the coverage policy (parity with pyspark, extras allowed when positional pyspark calls still work). Hygiene check also now spans `functions/spark.py`'s `__all__`.
- .ai/skills/audit-skill-md/SKILL.md: add `functions.spark` to the surface table and a `spark-functions` scope so this audit also validates the new subsection and divergent-semantics table.
- .ai/skills/make-pythonic/SKILL.md: explicit scope note that the spark namespace is a deliberate pyspark mirror; generic native-type coercion does not apply there.

Path references updated to the new `functions/__init__.py` module layout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent f4b5119 commit e2eceb4

4 files changed

Lines changed: 135 additions & 11 deletions


.ai/skills/audit-skill-md/SKILL.md

Lines changed: 9 additions & 3 deletions
@@ -48,7 +48,8 @@ exposed at the package root), include it.
 | `SessionContext` | `python/datafusion/context.py` | "Data Loading" |
 | `DataFrame` | `python/datafusion/dataframe.py` | "DataFrame Operations Quick Reference", "Executing and Collecting Results", "Idiomatic Patterns" |
 | `Expr` | `python/datafusion/expr.py` | "Expression Building", "Common Pitfalls" |
-| `functions` | `python/datafusion/functions.py` | "Available Functions (Categorized)", scattered uses throughout |
+| `functions` | `python/datafusion/functions/__init__.py` | "Available Functions (Categorized)", scattered uses throughout |
+| `functions.spark` | `python/datafusion/functions/spark.py` | "Available Functions (Categorized)" → "Spark-Compatible Functions" subsection |
 | Top-level helpers (`col`, `lit`, `WindowFrame`, ...) | `python/datafusion/__init__.py` | "Import Conventions", "Core Abstractions" |
 
 ## Scope argument
@@ -61,7 +62,8 @@ is given or `all` is specified, audit every area.
 | `session-context` | `SessionContext` methods and the "Data Loading" section |
 | `dataframe` | `DataFrame` methods and the operations / executing / patterns sections |
 | `expr` | `Expr` methods/operators and the "Expression Building" section |
-| `functions` | `functions.py` `__all__` and the "Available Functions (Categorized)" section |
+| `functions` | `functions/__init__.py` `__all__` and the "Available Functions (Categorized)" section |
+| `spark-functions` | `functions/spark.py` `__all__`, the "Spark-Compatible Functions" subsection, and the divergent-semantics table |
 | `patterns` | "Idiomatic Patterns" section: confirm patterns still match recommended style |
 | `pitfalls` | "Common Pitfalls": confirm each pitfall still reproduces, drop ones fixed upstream |
 | `version-notes` | Cross-check version annotations (see below) |
@@ -123,7 +125,11 @@ For each function name, method name, or import shown in `SKILL.md`, verify it
 still exists in the current API:
 
 - Function names mentioned in prose or in the categorized list should appear
-  in `python/datafusion/functions.py`'s `__all__`.
+  in `python/datafusion/functions/__init__.py`'s `__all__`.
+- Spark function names mentioned in the "Spark-Compatible Functions"
+  subsection should appear in `python/datafusion/functions/spark.py`'s
+  `__all__`. Also confirm the divergent-semantics table still matches the
+  current spark vs. main signatures.
 - Method calls in code blocks should resolve against the current class.
 - Imports (`from datafusion import ...`) should succeed against the current
   `__init__.py`.
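
The `__all__` membership check described in this hunk can be mechanized. The following is a minimal sketch under stated assumptions, not part of the commit: it reads a module-level `__all__` with the standard-library `ast` module (so the module is never imported) and reports documented names that are not exported. The helper names and the sample source string are hypothetical; a real audit would read `python/datafusion/functions/spark.py` from disk instead.

```python
import ast


def read_dunder_all(source: str) -> list[str]:
    """Return the names in a module-level `__all__ = [...]` assignment."""
    for node in ast.parse(source).body:
        if isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name) and target.id == "__all__":
                    # literal_eval also accepts the AST node of a list/tuple literal.
                    return list(ast.literal_eval(node.value))
    return []


def names_missing_from_all(documented: list[str], source: str) -> list[str]:
    """Names mentioned in the docs that the module does not export."""
    exported = set(read_dunder_all(source))
    return sorted(name for name in documented if name not in exported)


# Hypothetical stand-in for the contents of functions/spark.py.
SAMPLE_MODULE = '''
__all__ = ["concat", "round", "sha2"]
'''

# Names as they might appear in the "Spark-Compatible Functions" subsection.
DOCUMENTED = ["concat", "round", "sha2", "trunc"]

print(names_missing_from_all(DOCUMENTED, SAMPLE_MODULE))  # prints ['trunc']
```

This sketch only handles a plain `__all__ = [...]` literal; an `__all__` that is built up dynamically would need a different approach.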

.ai/skills/check-upstream/SKILL.md

Lines changed: 45 additions & 5 deletions
@@ -209,18 +209,58 @@ These upstream FFI types have been reviewed and do not need to be independently
 - FFI example in `examples/datafusion-ffi-example/`
 - Type appears in union type hints where accepted
 
-### 8. `__all__` Hygiene (functions.py)
+### 8. Spark-Compatible Functions (`datafusion-spark` crate)
+
+**Upstream source of truth:**
+- Crate source: https://github.com/apache/datafusion/tree/main/datafusion/spark/src
+- Rust docs: https://docs.rs/datafusion-spark/latest/datafusion_spark/
+
+**Where they are exposed in this project:**
+- Python API: `python/datafusion/functions/spark.py`, where each function wraps
+  a call to `datafusion._internal.functions.spark`; the public surface is
+  the module's `__all__` list.
+- Rust bindings: `crates/core/src/spark_functions.rs`, with `#[pyfunction]`
+  definitions registered via `init_module()` and re-exported under
+  `datafusion._internal.functions.spark`.
+
+**Coverage policy:** The spark namespace mirrors
+`pyspark.sql.functions` parameter names and shapes exactly so pyspark
+callers can paste code unchanged. Extras over pyspark are permitted as
+long as positional pyspark calls still work; for example, the spark
+`avg` / `try_sum` / `collect_list` / `collect_set` retain the
+`distinct`/`filter`/`order_by`/`null_treatment` kwargs from the main
+namespace while pyspark's single-positional form continues to work.
+
+**How to check:**
+1. Fetch the upstream `datafusion-spark` function list from the crate
+   source under `datafusion/spark/src/function/` (each subdirectory is a
+   category: `string/`, `math/`, `datetime/`, etc.). The crate's
+   `function.rs` collects all `ScalarUDF` factories.
+2. Cross-reference against `pyspark.sql.functions` for the public-facing
+   shape; pyspark is the contract this namespace is matching.
+3. Compare against the functions listed in
+   `python/datafusion/functions/spark.py`'s `__all__`. A function is
+   covered if it exists in the Python `spark` namespace, even if it
+   aliases another function's Rust binding.
+4. Report functions that are missing from the Python spark namespace.
+
+**Cross-cutting reference:** The longer-form roadmap for spark coverage
+lives in `PYSPARK_ALIGNMENT_PLAN.md` (root of repo). Use it as the source
+of truth for which gaps are intentionally deferred vs. ready to land.
+
+### 9. `__all__` Hygiene (functions/__init__.py and functions/spark.py)
 
 Independent of upstream parity, also flag public `def` symbols in
-`python/datafusion/functions.py` that are missing from the module's
-`__all__`. These are functions a user can call but that do not show up in
+`python/datafusion/functions/__init__.py` **and** `python/datafusion/functions/spark.py`
+that are missing from that file's `__all__`. These are functions a user
+can call but that do not show up in
 `from datafusion.functions import *`, in tab-completion against the
 namespace, or in generated API docs, typically an oversight rather than
 an intentional omission.
 
 **How to check:**
-1. Grep for `^def ([a-z_][a-z0-9_]*)\(` in `python/datafusion/functions.py`
-   to enumerate every public function definition.
+1. Grep for `^def ([a-z_][a-z0-9_]*)\(` in each file to enumerate every
+   public function definition.
 2. Read the `__all__` list at the top of the same file.
 3. Report any function in (1) that is not in (2). Skip private helpers
    (names starting with `_`).
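
The three hygiene steps above can be sketched as a small script. This is an illustrative sketch, not tooling that ships with the project: it applies the same regular expression the skill names, reads `__all__` with `ast`, and skips private helpers. The function name `defs_missing_from_all` and the sample module text are hypothetical.

```python
import ast
import re

# Same pattern as step 1; MULTILINE makes ^ match at the start of each line.
PUBLIC_DEF = re.compile(r"^def ([a-z_][a-z0-9_]*)\(", re.MULTILINE)


def defs_missing_from_all(source: str) -> list[str]:
    """Top-level public functions that are absent from the module's __all__."""
    # Step 1: enumerate top-level defs; the pattern also matches _private
    # names, so those are filtered out here (step 3's "skip private helpers").
    defined = [n for n in PUBLIC_DEF.findall(source) if not n.startswith("_")]

    # Step 2: read the literal __all__ list without importing the module.
    exported: set[str] = set()
    for node in ast.parse(source).body:
        if isinstance(node, ast.Assign) and any(
            isinstance(t, ast.Name) and t.id == "__all__" for t in node.targets
        ):
            exported = set(ast.literal_eval(node.value))

    # Step 3: report the difference.
    return sorted(name for name in defined if name not in exported)


# Hypothetical module text standing in for functions/spark.py.
SAMPLE = '''
__all__ = ["sha2", "crc32"]

def sha2(col, numBits):
    ...

def crc32(col):
    ...

def xxhash64(*cols):
    ...

def _helper(x):
    ...
'''

print(defs_missing_from_all(SAMPLE))  # prints ['xxhash64']
```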

.ai/skills/make-pythonic/SKILL.md

Lines changed: 24 additions & 3 deletions
@@ -29,9 +29,30 @@ You are improving the datafusion-python API to feel more natural to Python users
 
 **Core principle:** A Python user should be able to write `split_part(col("a"), ",", 2)` instead of `split_part(col("a"), lit(","), lit(2))` when the arguments are contextually obvious literals.
 
+## Scope: `functions` vs `functions.spark`
+
+This skill targets the **default `datafusion.functions` namespace** (file:
+`python/datafusion/functions/__init__.py`). Do **not** apply pythonic
+coercion to `python/datafusion/functions/spark.py`; that namespace is a
+deliberate mirror of `pyspark.sql.functions`, so its parameter names,
+order, and types must match pyspark exactly. Adding `Expr | int` style
+unions there would diverge from the pyspark contract callers rely on.
+
+Two exceptions where pythonic-style additions in `functions.spark` are
+still on-brand:
+- **Pyspark itself accepts a native type.** Pyspark's `format_string`
+  takes `format: str | Column`; the spark wrapper already auto-promotes a
+  plain `str` to a literal, so keep parity.
+- **Strictly additive optional kwargs** that pyspark also has (e.g.
+  `like(escapeChar=...)`). These belong in the [PYSPARK_ALIGNMENT_PLAN.md]
+  follow-up PRs, not in a make-pythonic pass.
+
+If the user explicitly scopes to "spark", validate parity with pyspark
+rather than applying generic coercion.
+
 ## How to Identify Candidates
 
-The user may specify a scope via `$ARGUMENTS`. If no scope is given or "all" is specified, audit all functions in `python/datafusion/functions.py`.
+The user may specify a scope via `$ARGUMENTS`. If no scope is given or "all" is specified, audit all functions in `python/datafusion/functions/__init__.py`.
 
 For each function, determine if any parameter can accept native Python types by evaluating **two complementary signals**:
 
@@ -309,7 +330,7 @@ For each function being updated:
 
 ### Step 1: Analyze the Function
 
-1. Read the current Python function signature in `python/datafusion/functions.py`
+1. Read the current Python function signature in `python/datafusion/functions/__init__.py`
 2. Read the Rust binding in `crates/core/src/functions.rs`
 3. Optionally check the upstream DataFusion docs for the function
 4. Determine which category (A, B, or C) applies to each parameter
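
When the scope is the spark namespace, the analysis step becomes a parity check rather than a coercion audit. One way such a check might look is sketched below, assuming a hand-maintained list of pyspark parameter names per function. The wrappers here are hypothetical stand-ins, not the real `functions.spark` implementations, and `parity_problems` is an illustrative helper name. The rule it encodes is the coverage policy from this commit: the leading parameters must match pyspark by name and order, and any extras must be optional so positional pyspark calls still work.

```python
import inspect


def parity_problems(wrapper, pyspark_params):
    """Describe how `wrapper`'s signature departs from the pyspark contract."""
    params = list(inspect.signature(wrapper).parameters.values())
    expected = list(pyspark_params)
    problems = []

    # The leading parameters must match pyspark's names in order.
    leading = [p.name for p in params[: len(expected)]]
    if leading != expected:
        problems.append(f"leading parameters {leading} do not match {expected}")

    # Extras are allowed only if callers may omit them.
    for extra in params[len(expected):]:
        variadic = extra.kind in (
            inspect.Parameter.VAR_POSITIONAL,
            inspect.Parameter.VAR_KEYWORD,
        )
        if extra.default is inspect.Parameter.empty and not variadic:
            problems.append(f"extra parameter {extra.name!r} has no default")
    return problems


# Hypothetical stand-ins for wrappers in functions/spark.py.
def sha2(col, numBits): ...
def avg(col, distinct=False, filter=None): ...
def sha2_renamed(expr, num_bits): ...


print(parity_problems(sha2, ["col", "numBits"]))          # prints []
print(parity_problems(avg, ["col"]))                      # prints []
print(len(parity_problems(sha2_renamed, ["col", "numBits"])))  # prints 1
```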
@@ -346,7 +367,7 @@ dfn.functions.left(dfn.col("a"), 3)
 
 After making changes, run the doctests to verify:
 ```bash
-python -m pytest --doctest-modules python/datafusion/functions.py -v
+python -m pytest --doctest-modules python/datafusion/functions/__init__.py -v
 ```
 
 ## Coercion Helper Pattern

skills/datafusion_python/SKILL.md

Lines changed: 57 additions & 0 deletions
@@ -26,12 +26,14 @@ can interoperate with DataFusion.
 | `DataFrame` | Lazy query builder. Each method returns a new DataFrame. | Returned by context methods |
 | `Expr` | Expression tree node (column ref, literal, function call, ...). | `from datafusion import col, lit` |
 | `functions` | 290+ built-in scalar, aggregate, and window functions. | `from datafusion import functions as F` |
+| `functions.spark` | PySpark-compatible function surface (parameter names match `pyspark.sql.functions`). | `from datafusion.functions import spark` |
 
 ## Import Conventions
 
 ```python
 from datafusion import SessionContext, col, lit
 from datafusion import functions as F
+from datafusion.functions import spark  # only when porting pyspark code
 ```
 
 ## Data Loading
@@ -762,3 +764,58 @@ F.left(col("c_phone"), lit(2)) # prefix shortcut
 
 **Other**: `in_list`, `order_by`, `alias`, `col`, `encode`, `decode`,
 `to_hex`, `to_char`, `uuid`, `version`, `bit_length`, `octet_length`
+
+### Spark-Compatible Functions
+
+A separate `datafusion.functions.spark` namespace mirrors the
+`pyspark.sql.functions` API for callers porting code from PySpark.
+
+```python
+from datafusion.functions import spark
+```
+
+Use it for DataFrame work; for SQL, register the Spark UDFs first:
+
+```python
+ctx = SessionContext()
+ctx.enable_spark_functions()  # makes Spark UDFs visible to SQL
+ctx.sql("SELECT sha2('hello', 256)").show()
+```
+
+Coverage spans aggregate (`avg`, `try_sum`, `collect_list`, `collect_set`),
+array (`array`, `array_contains`, `array_repeat`, `shuffle`, `slice`,
+`size`), bitmap, bitwise (`shiftleft`, `shiftright`, `shiftrightunsigned`,
+`bit_get`, `bit_count`, `bitwise_not`), datetime (`add_months`,
+`date_add`, `date_sub`, `date_diff`, `date_trunc`, `time_trunc`, `trunc`,
+`next_day`, `from_utc_timestamp`, `to_utc_timestamp`, `unix_date`,
+`unix_micros`/`unix_millis`/`unix_seconds`, `make_interval`, `make_dt_interval`),
+hash (`crc32`, `sha1`, `sha2`, `xxhash64`), JSON (`json_tuple`),
+map (`map_from_arrays`, `map_from_entries`, `str_to_map`), math
+(`abs`, `ceil`, `floor`, `round`, `expm1`, `factorial`, `hex`,
+`modulus`/`pmod`, `rint`, `unhex`, `width_bucket`, `csc`/`sec`,
+`negative`, `bin`), string (`ascii`, `base64`/`unbase64`, `char`,
+`concat`, `elt`, `like`/`ilike`, `length`, `luhn_check`, `format_string`,
+`space`, `substring`, `soundex`, `is_valid_utf8`/`make_valid_utf8`),
+URL (`parse_url`/`try_parse_url`, `url_decode`/`url_encode`,
+`try_url_decode`), and conditional (`if_`, `spark_cast`).
+
+The full list is in the API reference; see
+`python/datafusion/functions/spark.py`.
+
+**Semantic divergences vs the default namespace.** Functions that exist in
+both `functions` and `functions.spark` may behave differently:
+
+| Function | Default `functions` | `functions.spark` |
+|---|---|---|
+| `concat` | NULL inputs treated as empty | NULL inputs propagate to NULL |
+| `round` | HALF_EVEN (banker's) | HALF_UP |
+| `trunc` | Numeric truncation | Date truncation |
+| `substring` | 1-indexed | 1-indexed (parity) |
+
+Pick the namespace whose semantics match your intent. Both can stay imported
+side by side; `enable_spark_functions()` only affects SQL.
+
+**Parameter names match pyspark exactly.** The spark namespace uses
+pyspark parameter names (`col`, `str`, `numBits`, `partToExtract`, ...) so
+you can paste pyspark code and keep keyword arguments working. The default
+namespace keeps DataFusion's parameter names.
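
To make the `round` and `concat` rows of the divergent-semantics table concrete, the sketch below models the two behaviours in plain Python with the standard-library `decimal` module. It does not call DataFusion or PySpark, and the helper names are hypothetical; it only illustrates what the HALF_EVEN vs HALF_UP and NULL-as-empty vs NULL-propagating labels in the table mean, with `None` standing in for SQL NULL.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP
from typing import Optional


def round_decimal(value: str, places: int, mode: str) -> Decimal:
    """Round a decimal string to `places` digits using the given rounding mode."""
    quantum = Decimal(10) ** -places  # e.g. places=2 gives Decimal('0.01')
    return Decimal(value).quantize(quantum, rounding=mode)


def concat_null_as_empty(*parts: Optional[str]) -> str:
    """Model of the table's default-namespace row: NULL behaves like ''."""
    return "".join("" if p is None else p for p in parts)


def concat_null_propagating(*parts: Optional[str]) -> Optional[str]:
    """Model of the table's spark-namespace row: any NULL makes the result NULL."""
    if any(p is None for p in parts):
        return None
    return "".join(parts)


# Ties go to the even neighbour under HALF_EVEN and away from zero under HALF_UP.
print(round_decimal("2.5", 0, ROUND_HALF_EVEN), round_decimal("2.5", 0, ROUND_HALF_UP))
# prints: 2 3
print(concat_null_as_empty("a", None, "b"), concat_null_propagating("a", None, "b"))
# prints: ab None
```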
