Skip to content

Commit af38866

Browse files
timsaucerclaude
andauthored
feat: expose lambda and higher-order array functions (#1561)
* feat: expose lambda and higher-order array functions Add a Pythonic API for DataFusion's higher-order array functions and the lambda expressions they consume. - Rust: lambda_, lambda_var, array_transform, and array_any_match pyfunctions, plus a ResolveLambdaVariables analyzer rule so expression-builder plans (which emit unresolved lambda variables) resolve before optimization. - Python: array_transform / array_any_match (with list_transform, any_match, list_any_match aliases) accept either a Python callable or an explicit lambda built with lambda_ / lambda_var. Callables are introspected so their parameter names become the lambda parameters. - Tests and docs (expressions guide + agent skill), noting v1 limits: lambda expressions are not serializable, and SQL arrow syntax needs the DuckDB dialect. * test: fold lambda tests into pytest parameterization Combine the eight higher-order function result tests into a single parametrized test_higher_order_function_results, and the two to_lambda rejection tests into test_to_lambda_rejects_invalid_arg. Each case keeps a readable id via pytest.param. Co-Authored-By: Claude <noreply@anthropic.com> * feat: expose array_filter higher-order function Add array_filter, the remaining lambda-based higher-order array function in DataFusion (alongside the already-exposed array_transform and array_any_match). Includes the list_filter alias matching upstream, tests, and documentation in the expressions guide and skill. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: emphasize lambda terminology, trim skill lambda section Lead user-facing array-lambda docs with "lambda function" instead of "higher-order function," which is less recognizable to users. Drop the alias list, serialization caveat, and DuckDB-dialect note from the skill to keep it lean; those details already live in the docstrings. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: broaden SQL lambda dialect coverage Other dialects (ClickHouse, Snowflake, Databricks) also enable lambda parsing via sqlparser-rs. Document the full set and recommend the ``lambda x: x`` keyword form, since DuckDB will drop the ``x -> x`` arrow form in v2.1. Parametrize the SQL test over the four dialects using the keyword syntax. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude <noreply@anthropic.com>
1 parent 3d4c56c commit af38866

8 files changed

Lines changed: 534 additions & 1 deletion

File tree

crates/core/src/analyzer.rs

Lines changed: 59 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,59 @@
1+
// Licensed to the Apache Software Foundation (ASF) under one
2+
// or more contributor license agreements. See the NOTICE file
3+
// distributed with this work for additional information
4+
// regarding copyright ownership. The ASF licenses this file
5+
// to you under the Apache License, Version 2.0 (the
6+
// "License"); you may not use this file except in compliance
7+
// with the License. You may obtain a copy of the License at
8+
//
9+
// http://www.apache.org/licenses/LICENSE-2.0
10+
//
11+
// Unless required by applicable law or agreed to in writing,
12+
// software distributed under the License is distributed on an
13+
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
14+
// KIND, either express or implied. See the License for the
15+
// specific language governing permissions and limitations
16+
// under the License.
17+
18+
//! Analyzer rules layered on top of DataFusion's defaults.
19+
20+
use datafusion::common::Result;
21+
use datafusion::common::config::ConfigOptions;
22+
use datafusion::logical_expr::LogicalPlan;
23+
use datafusion::optimizer::AnalyzerRule;
24+
25+
/// Resolve [`LambdaVariable`] references into bound lambda parameters.
26+
///
27+
/// DataFusion's SQL planner resolves lambda variables inline as it plans a
28+
/// higher-order function call, so SQL-built plans never carry unresolved
29+
/// variables. Plans assembled programmatically through the Python expression
30+
/// builder (e.g. `array_transform(col("xs"), lambda_(["v"], lambda_var("v")))`)
31+
/// do carry them, and nothing in the default analyzer resolves them. This rule
32+
/// runs [`LogicalPlan::resolve_lambda_variables`] so both construction paths
33+
/// reach the optimizer with bound lambdas.
34+
///
35+
/// [`LambdaVariable`]: datafusion::logical_expr::expr::LambdaVariable
36+
#[derive(Debug)]
37+
pub struct ResolveLambdaVariables {}
38+
39+
impl ResolveLambdaVariables {
40+
pub fn new() -> Self {
41+
Self {}
42+
}
43+
}
44+
45+
impl Default for ResolveLambdaVariables {
46+
fn default() -> Self {
47+
Self::new()
48+
}
49+
}
50+
51+
impl AnalyzerRule for ResolveLambdaVariables {
52+
fn analyze(&self, plan: LogicalPlan, _config: &ConfigOptions) -> Result<LogicalPlan> {
53+
plan.resolve_lambda_variables().map(|t| t.data)
54+
}
55+
56+
fn name(&self) -> &str {
57+
"resolve_lambda_variables"
58+
}
59+
}

crates/core/src/context.rs

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -396,6 +396,7 @@ impl PySessionContext {
396396
.with_config(config)
397397
.with_runtime_env(runtime)
398398
.with_default_features()
399+
.with_analyzer_rule(Arc::new(crate::analyzer::ResolveLambdaVariables::new()))
399400
.build();
400401
let ctx = Arc::new(SessionContext::new_with_state(session_state));
401402
Ok(PySessionContext {

crates/core/src/functions.rs

Lines changed: 45 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -159,6 +159,44 @@ fn array_slice(array: PyExpr, begin: PyExpr, end: PyExpr, stride: Option<PyExpr>
159159
.into()
160160
}
161161

162+
/// Create a lambda expression from a list of parameter names and a body
163+
/// expression. The body should reference the parameters via [`lambda_var`].
164+
/// Exposed to Python as `lambda_` because `lambda` is a reserved keyword.
165+
#[pyfunction]
166+
#[pyo3(name = "lambda_")]
167+
fn py_lambda(params: Vec<String>, body: PyExpr) -> PyExpr {
168+
datafusion::logical_expr::lambda(params, body.into()).into()
169+
}
170+
171+
/// Create an unresolved lambda variable reference by name. The owning
172+
/// higher-order function resolves it against its lambda parameters during
173+
/// planning.
174+
#[pyfunction]
175+
fn lambda_var(name: String) -> PyExpr {
176+
datafusion::logical_expr::lambda_var(name).into()
177+
}
178+
179+
/// Higher-order function: apply `transform` (a lambda) to each element of
180+
/// `array`, returning a new array of the results.
181+
#[pyfunction]
182+
fn array_transform(array: PyExpr, transform: PyExpr) -> PyExpr {
183+
datafusion::functions_nested::expr_fn::array_transform(array.into(), transform.into()).into()
184+
}
185+
186+
/// Higher-order function: return true if any element of `array` satisfies
187+
/// `predicate` (a lambda returning a boolean).
188+
#[pyfunction]
189+
fn array_any_match(array: PyExpr, predicate: PyExpr) -> PyExpr {
190+
datafusion::functions_nested::expr_fn::array_any_match(array.into(), predicate.into()).into()
191+
}
192+
193+
/// Higher-order function: keep the elements of `array` for which `predicate`
194+
/// (a lambda returning a boolean) is true, returning a new filtered array.
195+
#[pyfunction]
196+
fn array_filter(array: PyExpr, predicate: PyExpr) -> PyExpr {
197+
datafusion::functions_nested::expr_fn::array_filter(array.into(), predicate.into()).into()
198+
}
199+
162200
/// Computes a binary hash of the given data. type is the algorithm to use.
163201
/// Standard algorithms are md5, sha224, sha256, sha384, sha512, blake2s, blake2b, and blake3.
164202
// #[pyfunction(value, method)]
@@ -1082,6 +1120,13 @@ pub(crate) fn init_module(m: &Bound<'_, PyModule>) -> PyResult<()> {
10821120
m.add_wrapped(wrap_pyfunction!(encode))?;
10831121
m.add_wrapped(wrap_pyfunction!(decode))?;
10841122

1123+
// Lambda / higher-order functions
1124+
m.add_wrapped(wrap_pyfunction!(py_lambda))?;
1125+
m.add_wrapped(wrap_pyfunction!(lambda_var))?;
1126+
m.add_wrapped(wrap_pyfunction!(array_transform))?;
1127+
m.add_wrapped(wrap_pyfunction!(array_any_match))?;
1128+
m.add_wrapped(wrap_pyfunction!(array_filter))?;
1129+
10851130
// Array Functions
10861131
m.add_wrapped(wrap_pyfunction!(array_append))?;
10871132
m.add_wrapped(wrap_pyfunction!(array_concat))?;

crates/core/src/lib.rs

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -27,6 +27,7 @@ use mimalloc::MiMalloc;
2727
use pyo3::prelude::*;
2828

2929
#[allow(clippy::borrow_deref_ref)]
30+
pub mod analyzer;
3031
pub mod catalog;
3132
pub mod codec;
3233
pub mod common;

docs/source/user-guide/common-operations/expressions.rst

Lines changed: 49 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -145,6 +145,55 @@ This function returns a new array with the elements repeated.
145145
146146
In this example, the `repeated_array` column will contain `[[1, 2, 3], [1, 2, 3]]`.
147147

148+
Lambda functions
149+
----------------
150+
151+
Some array functions take a *lambda function*: a small function that runs once
152+
per element. :py:func:`~datafusion.functions.array_transform` maps a lambda over
153+
every element, :py:func:`~datafusion.functions.array_filter` keeps the elements
154+
for which a predicate lambda is true, and
155+
:py:func:`~datafusion.functions.array_any_match` returns whether any element
156+
satisfies a predicate lambda. (Functions that take another function as an
157+
argument are sometimes called *higher-order* functions.)
158+
159+
The simplest way to supply a lambda is a Python ``lambda``. Its parameter names
160+
become the lambda parameters, and its return value becomes the body.
161+
162+
.. ipython:: python
163+
164+
from datafusion import SessionContext, col
165+
from datafusion import functions as f
166+
167+
ctx = SessionContext()
168+
df = ctx.from_pydict({"a": [[1, 2, 3], [4, 5]]})
169+
df.select(f.array_transform(col("a"), lambda v: v * 2).alias("doubled"))
170+
df.select(f.array_filter(col("a"), lambda v: v > 2).alias("big_only"))
171+
df.select(f.array_any_match(col("a"), lambda v: v > 3).alias("has_big"))
172+
173+
If you need explicit control over parameter names, build the lambda with
174+
:py:func:`~datafusion.functions.lambda_` and reference its parameters with
175+
:py:func:`~datafusion.functions.lambda_var`. The following is equivalent to the
176+
``array_transform`` call above.
177+
178+
.. ipython:: python
179+
180+
from datafusion import lit
181+
182+
double_fn = f.lambda_(["v"], f.lambda_var("v") * lit(2))
183+
df.select(f.array_transform(col("a"), double_fn).alias("doubled"))
184+
185+
.. note::
186+
187+
Lambda expressions cannot yet be serialized: calling
188+
:py:meth:`~datafusion.expr.Expr.to_bytes` or pickling an expression that
189+
contains a lambda raises ``Lambda not implemented``. SQL lambda syntax is
190+
only parsed by dialects that support lambdas; set
191+
``datafusion.sql_parser.dialect`` to one of ``DuckDB``, ``ClickHouse``,
192+
``Snowflake``, or ``Databricks``. Both arrow syntax (``x -> x * 2``) and
193+
keyword syntax (``lambda x: x * 2``) parse. DuckDB will drop the arrow
194+
form in v2.1, so prefer ``lambda x: x * 2`` for forward compatibility.
195+
The Python expression builder shown above works regardless of dialect.
196+
148197

149198
Testing membership in a list
150199
----------------------------

0 commit comments

Comments
 (0)