docs(skills): cover the new spark function namespace

`functions.spark` mirrors `pyspark.sql.functions` and now ships on this
branch. Update every skill that references the function surface:
- skills/datafusion_python/SKILL.md (user-facing): add an import
reference, a Core Abstractions row, and a "Spark-Compatible Functions"
subsection listing coverage by category, the SQL-vs-DataFrame usage
(`enable_spark_functions`), and the divergent-semantics table
(concat NULL, round HALF_UP, trunc) so callers know which namespace
to pick.
- .ai/skills/check-upstream/SKILL.md: new area for the `datafusion-spark`
crate with the coverage policy (parity with pyspark, extras allowed
when positional pyspark calls still work). Hygiene check also now
spans `functions/spark.py`'s `__all__`.
- .ai/skills/audit-skill-md/SKILL.md: add `functions.spark` to the
surface table and a `spark-functions` scope so this audit also
validates the new subsection and divergent-semantics table.
- .ai/skills/make-pythonic/SKILL.md: explicit scope note that the
spark namespace is a deliberate pyspark mirror — generic native-type
coercion does not apply there. Path references updated to the new
`functions/__init__.py` module layout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

.ai/skills/make-pythonic/SKILL.md: 24 additions & 3 deletions
@@ -29,9 +29,30 @@ You are improving the datafusion-python API to feel more natural to Python users
 
 **Core principle:** A Python user should be able to write `split_part(col("a"), ",", 2)` instead of `split_part(col("a"), lit(","), lit(2))` when the arguments are contextually obvious literals.
 
+## Scope: `functions` vs `functions.spark`
+
+This skill targets the **default `datafusion.functions` namespace** (file:
+`python/datafusion/functions/__init__.py`). Do **not** apply pythonic
+coercion to `python/datafusion/functions/spark.py` — that namespace is a
+deliberate mirror of `pyspark.sql.functions`, so its parameter names,
+order, and types must match pyspark exactly. Adding `Expr | int` style
+unions there would diverge from the pyspark contract callers rely on.
+
+Two exceptions where pythonic-style additions in `functions.spark` are
+still on-brand:
+- **Pyspark itself accepts a native type.** Pyspark's `format_string`
+takes `format: str | Column`; the spark wrapper already auto-promotes a
+plain `str` to a literal — keep parity.
+- **Strictly additive optional kwargs** that pyspark also has (e.g.
+`like(escapeChar=...)`). These belong in the [PYSPARK_ALIGNMENT_PLAN.md]
+follow-up PRs, not in a make-pythonic pass.
+
+If the user explicitly scopes to "spark", validate parity with pyspark
+rather than applying generic coercion.
+
 ## How to Identify Candidates
 
-The user may specify a scope via `$ARGUMENTS`. If no scope is given or "all" is specified, audit all functions in `python/datafusion/functions.py`.
+The user may specify a scope via `$ARGUMENTS`. If no scope is given or "all" is specified, audit all functions in `python/datafusion/functions/__init__.py`.
 
 For each function, determine if any parameter can accept native Python types by evaluating **two complementary signals**:
 
@@ -309,7 +330,7 @@ For each function being updated:
 
 ### Step 1: Analyze the Function
 
-1. Read the current Python function signature in `python/datafusion/functions.py`
+1. Read the current Python function signature in `python/datafusion/functions/__init__.py`
 2. Read the Rust binding in `crates/core/src/functions.rs`
 3. Optionally check the upstream DataFusion docs for the function
 4. Determine which category (A, B, or C) applies to each parameter