Conversation


@aniketpalu (Contributor) commented Nov 13, 2025

What this PR does / why we need it:

  • Add support for entity_df=None in SparkOfflineStore.get_historical_features when start_date/end_date are supplied.
    • Builds a derived entity set via UNION DISTINCT over the entity columns of each FeatureView within the window (pruning on date_partition_column when present), casts entity values to STRING, and uses a literal as-of timestamp for the point-in-time (PIT) joins.
    • Why: enables simple time-window backfills without requiring an entity_df.
  • Type fixes in spark.py: pass KeysView[str] for entity_df_columns.
    • Why: satisfies mypy and keeps the PIT join builder signature consistent.
  • spark_source.py: ensure a SparkSession exists in remote threads.
    • Why: prevents remote-mode failures caused by a missing active session.
  • arrow_error_handler.py: re-raise non-Feast exceptions on the server.
    • Why: ensures Arrow Flight returns actionable errors instead of None.
  • Test: add a unit test for non-entity retrieval (Spark session mocked).
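To make the first bullet concrete, here is a minimal sketch of how a derived entity set could be assembled as SQL when entity_df is None: one SELECT per FeatureView source filtered to the window, combined with UNION DISTINCT, entity values cast to STRING, and end_date used as the literal as-of timestamp. The names here (WindowedSource, build_entity_set_sql) are illustrative stand-ins, not the actual Feast API from this PR.

```python
# Illustrative sketch only: a pure-Python builder for the derived entity set
# query described in the PR summary. Class and function names are hypothetical.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WindowedSource:
    table: str                                   # fully qualified source table
    join_keys: List[str]                         # entity columns of the FeatureView
    timestamp_field: str                         # event timestamp column
    date_partition_column: Optional[str] = None  # for partition pruning, if any


def build_entity_set_sql(
    sources: List[WindowedSource], start_date: str, end_date: str
) -> str:
    """Build a UNION DISTINCT query yielding (entity..., event_timestamp)."""
    selects = []
    for src in sources:
        # Cast every entity column to STRING so heterogeneous key types union cleanly.
        entity_cols = ", ".join(
            f"CAST({k} AS STRING) AS {k}" for k in src.join_keys
        )
        predicates = [
            f"{src.timestamp_field} >= TIMESTAMP '{start_date}'",
            f"{src.timestamp_field} <= TIMESTAMP '{end_date}'",
        ]
        if src.date_partition_column:
            # Prune partitions before filtering on the raw timestamp column.
            predicates.append(
                f"{src.date_partition_column} BETWEEN "
                f"DATE '{start_date[:10]}' AND DATE '{end_date[:10]}'"
            )
        selects.append(
            f"SELECT {entity_cols}, TIMESTAMP '{end_date}' AS event_timestamp "
            f"FROM {src.table} WHERE {' AND '.join(predicates)}"
        )
    # UNION DISTINCT de-duplicates entities that appear in several FeatureViews.
    return "\nUNION DISTINCT\n".join(selects)
```

The literal `end_date` timestamp in the SELECT list is what lets the existing PIT join builder treat every derived entity row as an as-of query at the end of the window.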

Which issue(s) this PR fixes:

RHOAIENG-38644

Misc

@aniketpalu requested a review from a team as a code owner November 13, 2025 08:57

@jyejare (Contributor) left a comment


Reviewed at a very high level; looks good, but could you share the run logs?

@aniketpalu requested a review from jyejare November 25, 2025 07:22

@aniketpalu (Author) commented Nov 25, 2025

> Reviewed at a very high level; looks good, but could you share the run logs?

Feature Repo used: https://github.com/aniketpalu/feast_examples/tree/main/feature_repo_spark
Testing script: https://github.com/aniketpalu/feast_examples/blob/2d5cef285996f59818ee4d19c595c0fb2efed4da/test_historical_retrieval.py

Attempted to retrieve historical features using entity_df:

📁 Data in Parquet file: 100 records
📅 Timestamp range: 2025-10-13 00:31:16.754605 to 2025-11-12 00:31:16.754605
👥 User IDs range: 1 to 100

📊 Entity data for feature retrieval:
user_id event_timestamp
0 1 2025-11-11 01:31:16.754605
1 2 2025-10-20 01:31:16.754605
2 3 2025-11-06 01:31:16.754605

🔍 Retrieving features for 3 entities...
/Users/apaluska/Work/feast/sdk/python/feast/infra/offline_stores/contrib/spark_offline_store/spark_source.py:96: RuntimeWarning: The spark data source API is an experimental feature in alpha development. This API is unstable and it could and most probably will be changed in the future.
warnings.warn(
/Users/apaluska/Work/feast/sdk/python/feast/infra/offline_stores/contrib/spark_offline_store/spark.py:167: RuntimeWarning: The spark offline store is an experimental feature in alpha development. Some functionality may still be unstable so functionality can change in the future.
warnings.warn(
WARNING: Using incubator modules: jdk.incubator.vector
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/11/25 12:57:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

✅ Historical features retrieved:
📊 Result shape: (3, 4)

📋 Results:
user_id event_timestamp avg_transaction_amount total_transactions
0 1 2025-11-11 07:01:16.754605 773.62 60
1 2 2025-10-20 07:01:16.754605 84.12 369
2 3 2025-11-06 07:01:16.754605 472.25 382
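For reference, the entity_df fed into the baseline run above can be reconstructed roughly like this; the column names follow Feast's convention (join key plus event_timestamp), and the exact timestamps are taken from the printed entity data, truncated to whole seconds here for brevity.

```python
# Approximate reconstruction of the baseline entity_df shown in the log above.
# Sub-second precision is dropped; this is a sketch, not the original script.
from datetime import datetime

import pandas as pd

entity_df = pd.DataFrame(
    {
        "user_id": [1, 2, 3],
        "event_timestamp": [
            datetime(2025, 11, 11, 1, 31, 16),
            datetime(2025, 10, 20, 1, 31, 16),
            datetime(2025, 11, 6, 1, 31, 16),
        ],
    }
)
```

Each row asks "what were this user's features as of this timestamp?", which is why the baseline result has exactly one output row per entity row.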

(Failed) Attempt to retrieve historical features using start_date & end_date before this change:

📁 Data in Parquet file: 100 records
📅 Data timestamp range: 2025-10-13 00:31:16.754605 to 2025-11-12 00:31:16.754605
📅 Query date range: 2025-10-23 00:31:16.754605 to 2025-11-12 00:31:16.754605

🔍 Retrieving features for date range (no entity_df)...

/Users/apaluska/Work/feast/sdk/python/feast/infra/offline_stores/contrib/spark_offline_store/spark_source.py:96: RuntimeWarning: The spark data source API is an experimental feature in alpha development. This API is unstable and it could and most probably will be changed in the future.
warnings.warn(
❌ Error in date range-based retrieval: SparkOfflineStore.get_historical_features() got an unexpected keyword argument 'start_date'
Traceback (most recent call last):
  File "/Users/apaluska/Work/EAP Demo/feast_example_minimal/test_historical_retrieval.py", line 159, in <module>
    historical_features = fs.get_historical_features(
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/apaluska/Work/feast/sdk/python/feast/feature_store.py", line 1218, in get_historical_features
    job = provider.get_historical_features(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/apaluska/Work/feast/sdk/python/feast/infra/passthrough_provider.py", line 469, in get_historical_features
    job = self.offline_store.get_historical_features(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: SparkOfflineStore.get_historical_features() got an unexpected keyword argument 'start_date'

(Successful) Attempt to retrieve historical features using start_date & end_date:

📁 Data in Parquet file: 100 records
📅 Data timestamp range: 2025-10-13 00:31:16.754605 to 2025-11-12 00:31:16.754605
📅 Query date range: 2025-10-23 00:31:16.754605 to 2025-11-12 00:31:16.754605

🔍 Retrieving features for date range (no entity_df)...
/Users/apaluska/Work/feast/sdk/python/feast/infra/offline_stores/contrib/spark_offline_store/spark_source.py:96: RuntimeWarning: The spark data source API is an experimental feature in alpha development. This API is unstable and it could and most probably will be changed in the future.
warnings.warn(
/Users/apaluska/Work/feast/sdk/python/feast/infra/offline_stores/contrib/spark_offline_store/spark.py:169: RuntimeWarning: The spark offline store is an experimental feature in alpha development. Some functionality may still be unstable so functionality can change in the future.
warnings.warn(

✅ Historical features retrieved:
📊 Result shape: (66, 4)

📋 First 30 rows (entity_ts is the literal as-of timestamp, i.e. end_date):
user_id entity_ts avg_transaction_amount total_transactions
0 25 2025-11-12 00:31:16.754605 NaN NaN
1 7 2025-11-12 00:31:16.754605 NaN NaN
2 77 2025-11-12 00:31:16.754605 494.04 127.0
3 3 2025-11-12 00:31:16.754605 472.25 382.0
4 57 2025-11-12 00:31:16.754605 NaN NaN
5 45 2025-11-12 00:31:16.754605 163.35 405.0
6 100 2025-11-12 00:31:16.754605 NaN NaN
7 35 2025-11-12 00:31:16.754605 51.20 113.0
8 96 2025-11-12 00:31:16.754605 NaN NaN
9 19 2025-11-12 00:31:16.754605 917.33 136.0
10 29 2025-11-12 00:31:16.754605 NaN NaN
11 22 2025-11-12 00:31:16.754605 NaN NaN
12 97 2025-11-12 00:31:16.754605 NaN NaN
13 81 2025-11-12 00:31:16.754605 NaN NaN
14 31 2025-11-12 00:31:16.754605 NaN NaN
15 83 2025-11-12 00:31:16.754605 315.60 313.0
16 40 2025-11-12 00:31:16.754605 305.82 267.0
17 82 2025-11-12 00:31:16.754605 NaN NaN
18 24 2025-11-12 00:31:16.754605 811.43 88.0
19 73 2025-11-12 00:31:16.754605 NaN NaN
20 74 2025-11-12 00:31:16.754605 NaN NaN
21 89 2025-11-12 00:31:16.754605 NaN NaN
22 95 2025-11-12 00:31:16.754605 65.17 464.0
23 48 2025-11-12 00:31:16.754605 NaN NaN
24 1 2025-11-12 00:31:16.754605 773.62 60.0
25 50 2025-11-12 00:31:16.754605 NaN NaN
26 59 2025-11-12 00:31:16.754605 921.00 426.0
27 86 2025-11-12 00:31:16.754605 NaN NaN
28 21 2025-11-12 00:31:16.754605 376.88 99.0
29 27 2025-11-12 00:31:16.754605 691.39 447.0

💡 Explanation:

  • event_timestamp: the point in time at which features are queried (the query timestamp)
  • created_timestamp_in_source: when the feature row was actually created/updated in the source data
  • When using start_date/end_date without an entity_df, Feast uses end_date as the event_timestamp
  • This is the intended semantics: "What were the features as of 2025-11-12 00:31:16.754605?"
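The semantics above (and the NaN rows in the result table) can be emulated with a toy as-of lookup: every derived entity is queried at end_date, and a feature value joins only if that entity has a qualifying source row at or before the as-of timestamp; entities without one surface as NaN/None. This is a deliberately simplified model of the PIT join, ignoring TTLs and created_timestamp tie-breaking, and the function name is hypothetical.

```python
# Toy emulation of the as-of semantics described above. Not Feast code:
# a simplified per-entity "latest value at or before as_of" lookup.
import datetime as dt
from typing import Dict, List, Optional, Tuple


def pit_lookup(
    rows: List[Tuple[int, dt.datetime, float]],  # (user_id, event_ts, value)
    as_of: dt.datetime,
) -> Dict[int, Optional[float]]:
    """Return, per entity, the latest value at or before the as-of timestamp."""
    latest: Dict[int, Tuple[dt.datetime, float]] = {}
    entities = set()
    for user_id, ts, value in rows:
        entities.add(user_id)  # entity appears in the window regardless of join
        # Keep only the most recent row per entity that is not in the future
        # relative to the as-of timestamp.
        if ts <= as_of and (user_id not in latest or ts > latest[user_id][0]):
            latest[user_id] = (ts, value)
    # Entities with no qualifying row get None, which pandas renders as NaN.
    return {e: (latest[e][1] if e in latest else None) for e in sorted(entities)}
```

Under this model, a user whose only feature rows postdate the as-of timestamp still appears in the output (it was in the derived entity set) but with None features, matching the sparse table above.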


@jyejare commented Dec 2, 2025

@aniketpalu Did we validate that the join works across multiple feature views in a feature service?
