
Commit 123693a

sfc-gh-anavalos and Snowflake Authors authored

Project import generated by Copybara. (#112)

GitOrigin-RevId: 09f3289f4581ba7d81e145e9593ffcda9233f4bf
Co-authored-by: Snowflake Authors <[email protected]>

1 parent 3cbf8f1 · commit 123693a

File tree: 187 files changed, +8781 / −4167 lines


.github/workflows/jira_issue.yml

Lines changed: 3 additions & 2 deletions
```diff
@@ -40,8 +40,9 @@ jobs:
           summary: ${{ github.event.issue.title }}
           description: |
             ${{ github.event.issue.body }} \\ \\ _Created from GitHub Action_ for ${{ github.event.issue.html_url }}
-          # Assign triage-ml-platform-dl and set "Data Platform: ML Engineering" component.
-          fields: '{"customfield_11401":{"id":"14538"}, "assignee":{"id":"639020ab3c26ca7fa0d6eb3f"},"components":[{"id":"16520"}]}'
+          # Assign triage-ml-platform-dl and set "ML Platform" component (19112).
+          # See https://snowflakecomputing.atlassian.net/rest/api/2/project/SNOW/components for component information.
+          fields: '{"customfield_11401":{"id":"14538"}, "assignee":{"id":"639020ab3c26ca7fa0d6eb3f"},"components":[{"id":"19112"}]}'

       - name: Update GitHub Issue
         uses: ./jira/gajira-issue-update
```
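Note: component IDs such as 19112 can be listed from the Jira REST endpoint referenced in the new comment. A minimal sketch, assuming valid Jira credentials (the endpoint shape is standard Jira REST v2; the auth values here are placeholders):

```python
import requests

# Placeholder credentials; the URL is the one cited in the workflow comment.
resp = requests.get(
    "https://snowflakecomputing.atlassian.net/rest/api/2/project/SNOW/components",
    auth=("user@example.com", "api-token"),
)
resp.raise_for_status()
for component in resp.json():
    # Each entry carries the numeric id used in the "components" field above.
    print(component["id"], component["name"])
```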

CHANGELOG.md

Lines changed: 47 additions & 1 deletion
````diff
@@ -1,6 +1,52 @@
 # Release History

-## 1.5.4
+## 1.6.0
+
+### Bug Fixes
+
+- Modeling: `SimpleImputer` can impute integer columns with integer values.
+- Registry: Fix an issue when providing a pandas DataFrame whose index does not start from 0 as the input to
+  `ModelVersion.run`.
+
+### New Features
+
+- Feature Store: Add overloads so APIs accept both an object and a name/version. Impacted APIs include `read_feature_view()`,
+  `refresh_feature_view()`, `get_refresh_history()`, `resume_feature_view()`, `suspend_feature_view()`, `delete_feature_view()`.
+- Feature Store: Add docstring inline examples for all public APIs.
+- Feature Store: Add new utility class `ExampleHelper` to help load source data and simplify public notebooks.
+- Registry: Option to `enable_explainability` when registering XGBoost models as a pre-PuPr feature.
+- Feature Store: Add new API `update_entity()`.
+- Registry: Option to `enable_explainability` when registering CatBoost models as a pre-PuPr feature.
+- Feature Store: Add new argument `warehouse` to the `FeatureView` constructor to override the default warehouse. Also add
+  a new column `warehouse` to the output of `list_feature_views()`.
+- Registry: Add support for logging a model from a model version.
+- Modeling: Distributed Hyperparameter Optimization now announces its GA refresh version. The latest memory-efficient version
+  no longer has the 10 GB training limitation on datasets. To turn it off, run:
+
+  ```python
+  from snowflake.ml.modeling._internal.snowpark_implementations import (
+      distributed_hpo_trainer,
+  )
+  distributed_hpo_trainer.ENABLE_EFFICIENT_MEMORY_USAGE = False
+  ```
+
+- Registry: Option to `enable_explainability` when registering LightGBM models as a pre-PuPr feature.
+
+### Behavior Changes
+
+- Feature Store: Change some positional parameters to keyword arguments in the following APIs:
+  - `Entity()`: `desc`.
+  - `FeatureView()`: `timestamp_col`, `refresh_freq`, `desc`.
+  - `FeatureStore()`: `creation_mode`.
+  - `update_entity()`: `desc`.
+  - `register_feature_view()`: `block`, `overwrite`.
+  - `list_feature_views()`: `entity_name`, `feature_view_name`.
+  - `get_refresh_history()`: `verbose`.
+  - `retrieve_feature_values()`: `spine_timestamp_col`, `exclude_columns`, `include_feature_view_timestamp_col`.
+  - `generate_training_set()`: `save_as`, `spine_timestamp_col`, `spine_label_cols`, `exclude_columns`,
+    `include_feature_view_timestamp_col`.
+  - `generate_dataset()`: `version`, `spine_timestamp_col`, `spine_label_cols`, `exclude_columns`,
+    `include_feature_view_timestamp_col`, `desc`, `output_type`.
+
+## 1.5.4 (2024-07-11)

 ### Bug Fixes

````
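Note: the Feature Store entries above change call sites in two ways: lookup APIs now accept either a `FeatureView` object or a name/version pair, and several parameters become keyword-only. A minimal sketch of the new calling conventions, assuming an existing Snowpark `session` (signatures inferred from the changelog, not verified against the release):

```python
from snowflake.ml.feature_store import CreationMode, FeatureStore

# creation_mode is now keyword-only.
fs = FeatureStore(
    session=session,  # assumes an existing Snowpark session
    database="MY_DB",
    name="MY_SCHEMA",
    default_warehouse="MY_WH",
    creation_mode=CreationMode.CREATE_IF_NOT_EXIST,
)

# New overload: pass a name/version pair directly...
df = fs.read_feature_view("MY_FEATURE_VIEW", "V1")

# ...or pass the FeatureView object itself, as before.
fv = fs.get_feature_view("MY_FEATURE_VIEW", "V1")
df = fs.read_feature_view(fv)
```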

ci/conda_recipe/meta.yaml

Lines changed: 1 addition & 1 deletion
```diff
@@ -17,7 +17,7 @@ build:
   noarch: python
 package:
   name: snowflake-ml-python
-  version: 1.5.4
+  version: 1.6.0
 requirements:
   build:
     - python
```

ci/targets/quarantine/prod3.txt

Lines changed: 0 additions & 1 deletion
```diff
@@ -2,4 +2,3 @@
 //tests/integ/snowflake/ml/registry:model_registry_snowservice_integ_test
 //tests/integ/snowflake/ml/model:spcs_llm_model_integ_test
 //tests/integ/snowflake/ml/extra_tests:xgboost_external_memory_training_test
-//tests/integ/snowflake/ml/lineage:lineage_integ_test
```

codegen/build_file_autogen.py

Lines changed: 2 additions & 2 deletions
```diff
@@ -14,7 +14,7 @@
 from absl import app

 from codegen import sklearn_wrapper_autogen as swa
-from snowflake.ml.snowpark_pandas import imports
+from snowflake.ml._internal.snowpark_pandas import imports


 @dataclass(frozen=True)
@@ -188,7 +188,7 @@ def get_snowpark_pandas_test_build_file_content(module: imports.ModuleInfo, modu
     return (
         'load("//codegen:codegen_rules.bzl", "autogen_snowpark_pandas_tests")\n'
         f'load("//{module_root_dir}:estimators_info.bzl", "snowpark_pandas_estimator_info_list")\n'
-        'package(default_visibility = ["//snowflake/ml/snowpark_pandas"])\n'
+        'package(default_visibility = ["//snowflake/ml/_internal/snowpark_pandas"])\n'
         "\nautogen_snowpark_pandas_tests(\n"
         f'    module = "{module.module_name}",\n'
         f'    module_root_dir = "{module_root_dir}",\n'
```
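Note: to make the string-template change concrete, here is a hedged sketch of what the updated function renders, using made-up module values (the generated content's closing lines are not shown in this hunk, so the trailing `)` is an assumption):

```python
# Hypothetical inputs, mirroring the f-strings in the diff above.
module_name = "sklearn.preprocessing"
module_root_dir = "snowflake/ml/modeling/preprocessing"

content = (
    'load("//codegen:codegen_rules.bzl", "autogen_snowpark_pandas_tests")\n'
    f'load("//{module_root_dir}:estimators_info.bzl", "snowpark_pandas_estimator_info_list")\n'
    'package(default_visibility = ["//snowflake/ml/_internal/snowpark_pandas"])\n'
    "\nautogen_snowpark_pandas_tests(\n"
    f'    module = "{module_name}",\n'
    f'    module_root_dir = "{module_root_dir}",\n'
    ")\n"  # assumed closing; not part of this hunk
)
print(content)  # the BUILD file text written for the module
```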

codegen/codegen_rules.bzl

Lines changed: 1 addition & 1 deletion
```diff
@@ -178,7 +178,7 @@ def autogen_snowpark_pandas_tests(module, module_root_dir, snowpark_pandas_estim
         name = "{}_snowpark_pandas_test".format(e.normalized_class_name),
         srcs = [":generate_test_snowpark_pandas_{}".format(e.normalized_class_name)],
         deps = [
-            "//snowflake/ml/snowpark_pandas:snowpark_pandas_lib",
+            "//snowflake/ml/_internal/snowpark_pandas:snowpark_pandas_lib",
             "//snowflake/ml/utils:connection_params",
         ],
         compatible_with_snowpark = False,
```

codegen/sklearn_wrapper_generator.py

Lines changed: 21 additions & 0 deletions
```diff
@@ -205,6 +205,18 @@ def _is_data_module_obj(class_object: Tuple[str, type]) -> bool:
         """
         return class_object[1].__module__ == "sklearn.preprocessing._data"

+    @staticmethod
+    def _is_preprocessing_module_obj(class_object: Tuple[str, type]) -> bool:
+        """Check if the given class belongs to the SKLearn preprocessing module.
+
+        Args:
+            class_object: Meta class object which needs to be checked.
+
+        Returns:
+            True if the class belongs to `sklearn.preprocessing` module, otherwise False.
+        """
+        return class_object[1].__module__.startswith("sklearn.preprocessing")
+
     @staticmethod
     def _is_cross_decomposition_module_obj(class_object: Tuple[str, type]) -> bool:
         """Check if the given class belongs to the SKLearn cross_decomposition module.
@@ -675,6 +687,7 @@ def _populate_flags(self) -> None:
         self._is_cross_decomposition_module_obj = WrapperGeneratorFactory._is_cross_decomposition_module_obj(
             self.class_object
         )
+        self._is_preprocessing_module_obj = WrapperGeneratorFactory._is_preprocessing_module_obj(self.class_object)
         self._is_regressor = WrapperGeneratorFactory._is_regressor_obj(self.class_object)
         self._is_classifier = WrapperGeneratorFactory._is_classifier_obj(self.class_object)
         self._is_meta_estimator = WrapperGeneratorFactory._is_meta_estimator_obj(self.class_object)
@@ -1014,6 +1027,14 @@ def generate(self) -> "SklearnWrapperGenerator":
         if "random_state" in self.original_init_signature.parameters.keys():
             self.test_estimator_input_args_list.append("random_state=0")

+        # Our preprocessing classes don't support sparse features
+        if "sparse" in self.original_init_signature.parameters.keys() and self._is_preprocessing_module_obj:
+            self.test_estimator_input_args_list.append("sparse=False")
+
+        # For the case of KBinsDiscretizer, we need to set encode to ordinal
+        # if "encode" in self.original_init_signature.parameters.keys() and self._is_preprocessing_module_obj:
+        #     self.test_estimator_input_args_list.append("encode='ordinal'")
+
         if (
             "max_iter" in self.original_init_signature.parameters.keys()
             and not self._is_hist_gradient_boosting_regressor
```
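Note: the new flag hinges entirely on the class's `__module__` string. A standalone illustration of what the predicate matches, using stock scikit-learn classes (not part of this commit):

```python
from typing import Tuple

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler


def is_preprocessing_module_obj(class_object: Tuple[str, type]) -> bool:
    # Same predicate as _is_preprocessing_module_obj above: any submodule of
    # sklearn.preprocessing matches (e.g. _data, _encoders, _discretization).
    return class_object[1].__module__.startswith("sklearn.preprocessing")


print(is_preprocessing_module_obj(("MinMaxScaler", MinMaxScaler)))          # True
print(is_preprocessing_module_obj(("LinearRegression", LinearRegression)))  # False
```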

codegen/snowpark_pandas_autogen_test_template.py_template

Lines changed: 51 additions & 35 deletions
```diff
@@ -16,10 +16,10 @@ import pytest
 from typing import Any, Dict, List, Optional, Tuple, Union
 from absl.testing.absltest import TestCase, main
 {transform.test_snowpark_pandas_imports}
-# from snowflake.ml.beta import snowpark_pandas
+# from snowflake.ml import snowpark_pandas
 from snowflake.ml.utils.connection_params import SnowflakeLoginOptions
 from snowflake.snowpark import Session
-# from snowflake.snowpark.modin import pandas as SnowparkPandas
+# from snowflake.snowpark.modin import pandas as snowpark_pandas

 _INFERENCE = "INFERENCE"
 _EXPECTED = "EXPECTED"
@@ -35,7 +35,7 @@ class DatasetType(enum.Enum):
 class {transform.test_class_name}(TestCase):
     def setUp(self) -> None:
         """Creates Snowpark and Snowflake environments for testing."""
-        self._session = Session.builder.configs(SnowflakeLoginOptions("sfc")).create()
+        self._session = Session.builder.configs(SnowflakeLoginOptions()).create()

     def tearDown(self) -> None:
         self._session.close()
@@ -114,12 +114,12 @@ class {transform.test_class_name}(TestCase):
         # inference_methods.remove("transform")  # underlying estimators have no method 'transform'
         # if Sk{transform.original_class_name}.__name__ == "LocalOutlierFactor" and not reg.novelty:
         #     inference_methods.remove("predict")
-
+
         # for m in inference_methods:
         #     if callable(getattr(reg, m, None)):
         #         res = getattr(reg, m)(dataset)
-        # TODO(hayu): Remove the output manipulation as the results should be exactly the same as sklearn.
-        #         if isinstance(res, SnowparkPandas.DataFrame) or isinstance(res, pd.DataFrame):
+        #         # TODO(hayu): Remove the output manipulation as the results should be exactly the same as sklearn.
+        #         if isinstance(res, snowpark_pandas.DataFrame) or isinstance(res, pd.DataFrame):
         #             arr = res.to_numpy()
         #         elif isinstance(res, list):
         #             arr = np.array(res)
@@ -128,14 +128,14 @@ class {transform.test_class_name}(TestCase):
         #         if arr.ndim == 2 and arr.shape[1] == 1:
         #             arr = arr.flatten()
         #         if len(arr.shape) == 3:
-        # VotingClassifier will return results of shape (n_classifiers, n_samples, n_classes)
-        # when voting = "soft" and flatten_transform = False. We can't handle unflatten transforms,
-        # so we ignore flatten_transform flag and flatten the results. We need flatten sklearn results
-        # also to compare with snowflake results.
+        #             # VotingClassifier will return results of shape (n_classifiers, n_samples, n_classes)
+        #             # when voting = "soft" and flatten_transform = False. We can't handle unflatten transforms,
+        #             # so we ignore flatten_transform flag and flatten the results. We need flatten sklearn results
+        #             # also to compare with snowflake results.
         #             arr = np.hstack(arr)  # type: ignore[arg-type]
         #         elif len(arr.shape) == 1:
-        # Sometimes sklearn returns results as 1D array of shape (n_samples,), but snowflake always returns
-        # response as 2D array of shape (n_samples, 1). Flatten the snowflake response to compare results.
+        #             # Sometimes sklearn returns results as 1D array of shape (n_samples,), but snowflake always returns
+        #             # response as 2D array of shape (n_samples, 1). Flatten the snowflake response to compare results.
         #             arr = arr.flatten()
         #         output[_INFERENCE].append(arr)

@@ -152,7 +152,7 @@ class {transform.test_class_name}(TestCase):
         # for m in expected_methods:
         #     if callable(getattr(reg, m, None)):
         #         res = getattr(reg, m)(dataset)
-        #         if isinstance(res, SnowparkPandas.DataFrame) or isinstance(res, pd.DataFrame):
+        #         if isinstance(res, snowpark_pandas.DataFrame) or isinstance(res, pd.DataFrame):
         #             arr = res.to_numpy()
         #         elif isinstance(res, list):
         #             arr = np.array(res)
@@ -161,8 +161,8 @@ class {transform.test_class_name}(TestCase):
         #         if arr.ndim == 2 and arr.shape[1] == 1:
         #             arr = arr.flatten()
         #         if isinstance(arr, list):
-        # In case of multioutput estimators predict_proba, decision_function, etc., returns a list of
-        # ndarrays as output. We need to concatenate them to compare with snowflake output.
+        #             # In case of multioutput estimators predict_proba, decision_function, etc., returns a list of
+        #             # ndarrays as output. We need to concatenate them to compare with snowflake output.
         #             arr = np.concatenate(arr, axis=1)
         #         elif len(arr.shape) == 1:
         #             # Sometimes sklearn returns results as 1D array of shape (n_samples,), but snowflake always returns
@@ -189,14 +189,18 @@ class {transform.test_class_name}(TestCase):

         # reg = Sk{transform.original_class_name}({transform.test_estimator_input_args})

+        # # Special handle for label encoder: sklearn label encoder fit method only accept fit(y),
+        # # but our SnowML API would treat it as fit(X)
+        # _is_label_encoder = reg.__class__.__name__ == "LabelEncoder"
+
         # input_df_pandas, input_cols, label_col = self._get_test_dataset(
         #     sklearn_obj=reg,
         #     add_sample_weight_col=use_weighted_dataset
         # )
-        # input_df_snowpark_pandas = SnowparkPandas.DataFrame(input_df_pandas)
+        # input_df_snowpandas = snow_pd.DataFrame(input_df_pandas)

         # pd_X, pd_y = input_df_pandas[input_cols], input_df_pandas[label_col].squeeze()
-        # snow_X, snow_y = input_df_snowpark_pandas[input_cols], input_df_snowpark_pandas[label_col].squeeze()
+        # snow_X, snow_y = input_df_snowpandas[input_cols], input_df_snowpandas[label_col].squeeze()
         # pd_args = {{
         #     'X': pd_X,
         #     'y': pd_y,
@@ -205,21 +209,23 @@
         #     'X': snow_X,
         #     'y': snow_y,
         # }}
-        # if use_weighted_dataset:
+
+        # # SnowML preprocessing class currently doesn't support sample weight
+        # if use_weighted_dataset and not {transform._is_preprocessing_module_obj}:
         #     pd_args['sample_weight'] = input_df_pandas["SAMPLE_WEIGHT"].squeeze()
-        #     snow_args['sample_weight'] = input_df_snowpark_pandas["SAMPLE_WEIGHT"].squeeze()
+        #     snow_args['sample_weight'] = input_df_snowpandas["SAMPLE_WEIGHT"].squeeze()

         # pd_score_args = snow_score_args = None
         # if callable(getattr(reg, "score", None)):
         #     pd_score_args = copy.deepcopy(pd_args)
         #     snow_score_args = copy.deepcopy(snow_args)
         #     score_argspec = inspect.getfullargspec(reg.score)
-        # Some classes that has sample_weight argument in fit() but not in score().
+        #     # Some classes that has sample_weight argument in fit() but not in score().
        #     if use_weighted_dataset and 'sample_weight' not in score_argspec.args:
         #         del pd_score_args['sample_weight']
         #         del snow_score_args['sample_weight']

-        # Some classes have different arg name in score: X -> X_test
+        #     # Some classes have different arg name in score: X -> X_test
         #     if "X_test" in score_argspec.args:
         #         pd_score_args['X_test'] = pd_score_args.pop('X')
         #         snow_score_args['X_test'] = snow_score_args.pop('X')
@@ -229,24 +235,34 @@
         #     pd_args['Y'] = pd_args.pop('y')
         #     snow_args['Y'] = snow_args.pop('y')

-        # pandas
-        # pd_output = self._compute_output(reg, pd_args, input_df_pandas[input_cols], pd_score_args)
+        # # pandas
+        # if _is_label_encoder:
+        #     pd_output = self._compute_output(reg, {{'y': input_df_pandas[label_col]}}, input_df_pandas[label_col], None)
+        # else:
+        #     pd_output = self._compute_output(reg, pd_args, input_df_pandas[input_cols], pd_score_args)

-        # snowpark_pandas
+        # # snowpandas
         # snowpark_pandas.init()

+        # # Integrate with native distributed preprocessing methods
         # snow_reg = Sk{transform.original_class_name}({transform.test_estimator_input_args})
         # args = snow_args if training == DatasetType.SNOWPARK_PANDAS else pd_args
         # dataset, score_args = (
-        #     (input_df_snowpark_pandas[input_cols], snow_score_args) if inference == DatasetType.SNOWPARK_PANDAS
+        #     (input_df_snowpandas[input_cols], snow_score_args) if inference == DatasetType.SNOWPARK_PANDAS
         #     else (input_df_pandas[input_cols], pd_score_args)
         # )
-        # snow_output = self._compute_output(snow_reg, args, dataset, score_args)
+        # if _is_label_encoder:
+        #     if training == DatasetType.SNOWPARK_PANDAS:
+        #         snow_output = self._compute_output(reg, {{'X': input_df_snowpandas[label_col]}}, input_df_snowpandas[label_col], None)
+        #     else:
+        #         snow_output = self._compute_output(reg, {{'y': input_df_pandas[label_col]}}, input_df_pandas[label_col], None)
+        # else:
+        #     snow_output = self._compute_output(snow_reg, args, dataset, score_args)

         # for pd_arr, snow_arr in zip(pd_output[_INFERENCE], snow_output[_INFERENCE]):
         #     snow_arr = snow_arr.astype(pd_arr.dtype)  # type: ignore[union-attr]
-        # TODO(snandamuri): HistGradientBoostingRegressor is returning different results in different envs.
-        # Needs further debugging.
+        #     # TODO(snandamuri): HistGradientBoostingRegressor is returning different results in different envs.
+        #     # Needs further debugging.
         #     if {transform._is_hist_gradient_boosting_regressor}:
         #         num_diffs = (~np.isclose(snow_arr, pd_arr)).sum()
         #         num_example = pd_arr.shape[0]
@@ -282,13 +298,13 @@
         #     use_weighted_dataset=False
         # )

-    def _is_weighted_dataset_supported(self, klass: type) -> bool:
-        is_weighted_dataset_supported = False
-        for m in inspect.getmembers(klass):
-            if inspect.isfunction(m[1]) and m[0] == "fit":
-                argspec = inspect.getfullargspec(m[1])
-                is_weighted_dataset_supported = True if "sample_weight" in argspec.args else False
-        return is_weighted_dataset_supported
+    # def _is_weighted_dataset_supported(self, klass: type) -> bool:
+    #     is_weighted_dataset_supported = False
+    #     for m in inspect.getmembers(klass):
+    #         if inspect.isfunction(m[1]) and m[0] == "fit":
+    #             argspec = inspect.getfullargspec(m[1])
+    #             is_weighted_dataset_supported = True if "sample_weight" in argspec.args else False
+    #     return is_weighted_dataset_supported

     # def test_weighted_datasets_snow_snow(self) -> None:
     #     if self._is_weighted_dataset_supported(Sk{transform.original_class_name}):
```
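Note: the new `_is_label_encoder` branch above exists because scikit-learn's `LabelEncoder` is fit on the label vector `y`, unlike most transformers, which are fit on a feature matrix `X`. A standalone illustration of that asymmetry (plain scikit-learn, not this repo's wrappers):

```python
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# LabelEncoder.fit takes y (a 1-D label array), not X.
le = LabelEncoder().fit(["cat", "dog", "cat"])
print(list(le.transform(["dog", "cat"])))  # [1, 0]

# Most preprocessing transformers are fit on a 2-D feature matrix X.
scaler = MinMaxScaler().fit([[0.0], [10.0]])
print(scaler.transform([[5.0]]))  # [[0.5]]
```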

snowflake/cortex/BUILD.bazel

Lines changed: 20 additions & 0 deletions
```diff
@@ -29,6 +29,25 @@ py_library(
     srcs = ["_sse_client.py"],
 )

+py_library(
+    name = "classify_text",
+    srcs = ["_classify_text.py"],
+    deps = [
+        ":util",
+        "//snowflake/ml/_internal:telemetry",
+    ],
+)
+
+py_test(
+    name = "classify_text_test",
+    srcs = ["classify_text_test.py"],
+    deps = [
+        ":classify_text",
+        ":test_util",
+        "//snowflake/ml/utils:connection_params",
+    ],
+)
+
 py_library(
     name = "complete",
     srcs = ["_complete.py"],
@@ -140,6 +159,7 @@ py_library(
         "__init__.py",
     ],
     deps = [
+        ":classify_text",
         ":complete",
         ":extract_answer",
         ":sentiment",
```

snowflake/cortex/__init__.py

Lines changed: 2 additions & 0 deletions
```diff
@@ -1,10 +1,12 @@
+from snowflake.cortex._classify_text import ClassifyText
 from snowflake.cortex._complete import Complete, CompleteOptions
 from snowflake.cortex._extract_answer import ExtractAnswer
 from snowflake.cortex._sentiment import Sentiment
 from snowflake.cortex._summarize import Summarize
 from snowflake.cortex._translate import Translate

 __all__ = [
+    "ClassifyText",
     "Complete",
     "CompleteOptions",
     "ExtractAnswer",
```
