Custom features #47

adku1173 · 2025-12-12T14:44:12Z

This pull request significantly refactors the feature handling and TensorFlow serialization pipeline for datasets, introducing a more modular and extensible approach to feature definition and collection. The main improvements include a new system for default feature instantiation, streamlined feature collection building, and enhanced support for variable-length and complex features in TensorFlow records.

Feature handling and extensibility:

Introduced a new method get_default_features and corresponding _get_default_feature_{name} hooks in dataset configs, enabling modular and extensible default feature instantiation based on feature names. This replaces the previous hardcoded logic and allows for easier customization and extension of features. [1] [2]
Refactored the feature collection builder (BaseFeatureCollectionBuilder) to accept a list of BaseFeatureCatalog instances, automatically registering feature functions and their TensorFlow encoding/shape/dtype using the new infer_tf_encoding utility. [1] [2] [3]
Removed the large, custom get_feature_collection implementation from experimental.py, delegating feature instantiation to the new default feature mechanism.
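
The name-based hook convention described above can be sketched roughly as follows. This is a minimal illustration, not the code from this pull request: `SketchConfig`, `SketchSubConfig`, the two hooks, and the dictionaries they return are hypothetical stand-ins, and the real `get_default_features` in `ConfigBase` may take different arguments.

```python
class SketchConfig:
    """Hypothetical config class; not the ConfigBase from this pull request."""

    def get_default_features(self, feature_names, f=None, num=0):
        """Build one feature per name by calling its _get_default_feature_<name> hook."""
        features = []
        for name in feature_names:
            hook = getattr(self, f"_get_default_feature_{name}", None)
            if hook is None:
                raise ValueError(f"no default feature registered for {name!r}")
            features.append(hook(f=f, num=num))
        return features

    # The hooks return plain dicts here; real hooks would return feature objects.
    def _get_default_feature_loc(self, **kwargs):
        return {"name": "loc", **kwargs}

    def _get_default_feature_csm(self, **kwargs):
        return {"name": "csm", **kwargs}


class SketchSubConfig(SketchConfig):
    """A subclass customizes a single feature by overriding only its hook."""

    def _get_default_feature_csm(self, **kwargs):
        return {"name": "csm", "variant": "custom", **kwargs}


features = SketchSubConfig().get_default_features(["loc", "csm"], f=1000, num=3)
```

Under these assumptions, supporting a new feature name only requires adding a method with the matching name; the dispatcher itself stays unchanged.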

TensorFlow serialization improvements:

Improved handling of variable-length and complex features in the TFRecord writer and parser: features with multiple None dimensions are now tracked and their shapes are stored in the TFRecord, enabling correct parsing and reshaping at load time. Complex features are stored as float pairs and reconstructed during parsing. [1] [2]
The TFRecord parser now dynamically handles dense/sparse tensors and associated shape keys, ensuring robust support for features with dynamic shapes and complex types.
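
The idea of storing complex values as float pairs together with the original shape can be sketched in plain Python. This is an illustration under stated assumptions rather than the writer or parser from this pull request: `encode_complex`, `decode_complex`, and the `csm_shape` key are hypothetical names, nested lists stand in for arrays, and inputs are assumed to be non-empty and rectangular.

```python
def encode_complex(nested):
    """Flatten a rectangular nested list of complex numbers into float pairs plus its shape."""
    shape = []
    level = nested
    while isinstance(level, list) and level:
        shape.append(len(level))
        level = level[0]

    flat = []

    def walk(item):
        if isinstance(item, list):
            for sub in item:
                walk(sub)
        else:
            flat.extend((item.real, item.imag))  # one (real, imag) pair per value

    walk(nested)
    return flat, shape


def decode_complex(flat, shape):
    """Rebuild the nested list from interleaved float pairs and the stored shape."""
    values = [complex(re, im) for re, im in zip(flat[0::2], flat[1::2])]
    # Regroup from the innermost dimension outwards.
    for size in reversed(shape[1:]):
        values = [values[i:i + size] for i in range(0, len(values), size)]
    return values


csm = [[1 + 2j, 3 - 1j, 0j], [2j, 5 + 0j, -1 - 1j]]
flat, shape = encode_complex(csm)
record = {"csm": flat, "csm_shape": shape}  # hypothetical key naming
restored = decode_complex(record["csm"], record["csm_shape"])
```

Storing the shape next to the flattened data is what makes the reshape at load time possible when the dimensions are not known in advance.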

Codebase cleanup:

Removed unused imports and legacy feature collection builder classes from experimental.py, simplifying the code and reducing redundancy. [1] [2]
Minor fixes and improvements, such as correcting the extraction of source locations in LocFeature.

Overall, these changes make the feature pipeline more flexible, maintainable, and compatible with advanced TensorFlow serialization needs.

Most important changes:

Feature handling and extensibility

Added get_default_features and _get_default_feature_{name} hooks to enable modular, name-based feature instantiation in dataset configs, replacing hardcoded feature logic. [1] [2]
Refactored BaseFeatureCollectionBuilder to accept a list of BaseFeatureCatalog instances, automatically registering feature functions and TensorFlow encoding/shape/dtype mappings using infer_tf_encoding. [1] [2] [3]
Removed the custom feature collection builder and legacy feature logic from experimental.py, delegating feature instantiation to the new default feature mechanism.

TensorFlow serialization improvements

Enhanced TFRecord writing and parsing to handle variable-length and complex features, including dynamic shape tracking and correct reconstruction of complex types. [1] [2]

Codebase cleanup

Removed unused imports and made minor corrections, such as fixing source location extraction in LocFeature. [1] [2]

Copilot

Pull request overview

This pull request refactors the feature handling and TensorFlow serialization system for acoustic datasets, introducing a more modular architecture for defining and processing features. The changes replace hardcoded feature logic with a dynamic system based on convention-based method naming (_get_default_feature_{name}), and enhance TFRecord serialization to properly handle variable-length and complex-valued features.

Key changes:

Introduced get_default_features method in ConfigBase enabling dynamic feature instantiation via _get_default_feature_{name} hooks
Refactored BaseFeatureCollectionBuilder to accept feature catalog instances and automatically infer TensorFlow encoding/dtype/shape mappings
Enhanced TFRecord writer and parser to track shapes of variable-length features and properly encode/decode complex values

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 9 comments.

File	Description
src/acoupipe/writer.py	Added `infer_tf_encoding` function, refactored encoding functions to handle array-like inputs, and implemented shape tracking for variable-length features in `WriteTFRecord`
src/acoupipe/datasets/features.py	Added `shape` trait to `BaseFeatureCatalog`, fixed source location extraction in `LocFeature`, removed `BaseFeatureCollection.__init__`, and refactored `BaseFeatureCollectionBuilder` to use feature catalogs with automatic mapper registration
src/acoupipe/datasets/base.py	Added `get_default_features` infrastructure to `ConfigBase`, updated `get_feature_collection` to support both custom and default features, and enhanced TFRecord parser to handle dynamic shapes and complex values
src/acoupipe/datasets/synthetic.py	Simplified `get_feature_collection` to use new default feature system, added `_get_default_feature_*` methods for all supported features, and removed legacy `DatasetSyntheticFeatureCollectionBuilder` class
src/acoupipe/datasets/experimental.py	Removed custom `get_feature_collection` and `MIRACLEFeatureCollectionBuilder`, added MIRACLE-specific override methods for source strength and targetmap features
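
A dtype-driven encoder lookup of the kind attributed to `infer_tf_encoding` might look roughly like the sketch below. It is hypothetical: `infer_encoding_sketch` is not the function from `writer.py`, it works on Python types instead of numpy dtypes, and the trailing axis of length 2 for complex values is an assumption based on the "float pairs" description. The names `bytes_list`, `int64_list`, and `float_list` are the three value fields of `tf.train.Feature`.

```python
def infer_encoding_sketch(dtype, shape):
    """Return (feature_field, stored_shape) for a Python type and a shape tuple.

    None entries in the shape mark dimensions of unknown length.
    """
    shape = tuple(shape)
    if dtype in (bytes, str):
        return "bytes_list", shape
    if dtype in (int, bool):
        return "int64_list", shape
    if dtype is float:
        return "float_list", shape
    if dtype is complex:
        # Assumption: each complex value becomes a (real, imag) float pair.
        return "float_list", shape + (2,)
    raise TypeError(f"unsupported dtype: {dtype!r}")


csm_encoding = infer_encoding_sketch(complex, (None, None))
flag_encoding = infer_encoding_sketch(bool, ())
```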

Copilot · 2025-12-12T14:50:35Z

src/acoupipe/datasets/base.py

+        # get features with varying length to handle them correctly in the TFRecord writer
+        shape_features = []
+        for feature, shape in feature_collection.feature_tf_shape_mapper.items():
+            # if more then one None in shape, we have a varying length feature


The comment incorrectly states "if more then one None" but it should be "if more than one None".

Suggested change

# if more then one None in shape, we have a varying length feature

# if more than one None in shape, we have a varying length feature
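
The rule stated in that comment can be sketched in isolation as follows. The mapper contents are hypothetical example shapes, not values from the repository; only the "more than one None" criterion comes from the diff.

```python
# Hypothetical shape mapper: feature name -> TensorFlow shape, None = unknown length.
feature_tf_shape_mapper = {
    "idx": (),
    "loc": (3, None),
    "csm": (None, None, None),
}

# Features with more than one None dimension need their shape stored in the record,
# because a flat value list cannot be reshaped unambiguously otherwise.
shape_features = [
    name
    for name, shape in feature_tf_shape_mapper.items()
    if tuple(shape).count(None) > 1
]
```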

Copilot · 2025-12-12T14:50:35Z

src/acoupipe/writer.py

+                    encoded_features = self._encode_sample(features, encoders)
                    example = tf.train.Example(features=tf.train.Features(feature=encoded_features))
                    # Serialize to string and write on the file
                    writer.write(example.SerializeToString())


The removed code includes writer.flush() and writer.close() calls that are no longer present. While the context manager (with statement) will automatically close the writer, the explicit flush() call ensured data was written to disk immediately after each sample. Without it, data may be buffered and only written at the end, which could be problematic if the process is interrupted. Consider whether the flush behavior is needed for robustness.

Suggested change

writer.write(example.SerializeToString())

writer.write(example.SerializeToString())

writer.flush()
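
The trade-off can be illustrated with an ordinary buffered file standing in for the TFRecord writer. That substitution is an assumption made so the sketch runs without TensorFlow; `tf.io.TFRecordWriter` likewise provides `flush()` and `close()` and can be used in a `with` statement.

```python
import os
import tempfile

records = [b"sample-0", b"sample-1", b"sample-2"]
sizes_after_flush = []

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "records.bin")
    with open(path, "wb") as writer:  # leaving the with block closes the writer
        for record in records:
            writer.write(record)
            writer.flush()  # hand buffered bytes to the OS after every sample
            sizes_after_flush.append(os.path.getsize(path))
    final_size = os.path.getsize(path)
```

With the per-sample flush, the file grows after every record, so an interrupted run keeps the samples written so far. Without it, the data still reaches the file when the `with` block exits, at the cost of possibly losing buffered samples on a crash.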

Copilot · 2025-12-12T14:50:36Z

src/acoupipe/writer.py

+        """
+        Decide which encoder function, TF dtype and TF shape to use
+        for a feature with given numpy / Python dtype and shape.
+


Missing blank line before the Parameters section in the docstring. According to PEP 257 and numpydoc conventions, there should be a blank line separating the summary/description from the Parameters section.

Suggested change

Copilot · 2025-12-12T14:50:36Z

src/acoupipe/datasets/experimental.py

+
+    def _get_targetmap_feature(self, strength_type, **kwargs):  # noqa ARG002
+        fdim = self._get_fdim(kwargs['f'])
+        freq_data = self.freq_data if strength_type == 'analytic' else (self.fft_obs_spectra if self.mode == 'welch' else self.freq_data)


At roughly 130 characters plus indentation, this line exceeds both PEP 8's 79-character limit and the looser limits of up to 120 characters that many projects adopt. The nested ternary expression is also hard to read. Consider breaking it into multiple lines or extracting the logic into a separate variable for clarity.

Suggested change

freq_data = self.freq_data if strength_type == 'analytic' else (self.fft_obs_spectra if self.mode == 'welch' else self.freq_data)

if strength_type == 'analytic':
    freq_data = self.freq_data
elif self.mode == 'welch':
    freq_data = self.fft_obs_spectra
else:
    freq_data = self.freq_data
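
A quick way to gain confidence in such a rewrite is to compare both forms over all relevant inputs. The sketch below does that with free functions and placeholder strings instead of the real attributes; the function names and the sample values for `strength_type` and `mode` are hypothetical. It also includes a more compact variant, which is a possible further simplification and not part of the suggestion above.

```python
from itertools import product


def pick_nested(strength_type, mode, freq_data, fft_obs_spectra):
    """Original nested conditional expression."""
    return freq_data if strength_type == "analytic" else (fft_obs_spectra if mode == "welch" else freq_data)


def pick_branches(strength_type, mode, freq_data, fft_obs_spectra):
    """Suggested if/elif/else form."""
    if strength_type == "analytic":
        return freq_data
    elif mode == "welch":
        return fft_obs_spectra
    else:
        return freq_data


def pick_compact(strength_type, mode, freq_data, fft_obs_spectra):
    """Compact form: only one combination selects the FFT spectra."""
    if strength_type != "analytic" and mode == "welch":
        return fft_obs_spectra
    return freq_data


cases = list(product(["analytic", "estimated"], ["welch", "other"]))
all_agree = all(
    pick_nested(s, m, "F", "S") == pick_branches(s, m, "F", "S") == pick_compact(s, m, "F", "S")
    for s, m in cases
)
```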

Copilot · 2025-12-12T14:50:36Z

src/acoupipe/datasets/synthetic.py

+        default_features = self.config.get_default_features(default_feature_names, f, num)
+        builder = BaseFeatureCollectionBuilder(features=default_features + custom_features)
+        builder.add_custom(self.config.get_prepare_func()) # add prepare function
+        return builder.build() # finally build the feature collection


Only one space separates the code from these inline comments. According to PEP 8, inline comments should be separated from the statement by at least two spaces and should start with # followed by a single space. Both comments should therefore be preceded by two spaces.

Suggested change

return builder.build() # finally build the feature collection

return builder.build()  # finally build the feature collection

Copilot · 2025-12-12T14:50:37Z

src/acoupipe/writer.py

-        if isinstance(value, type(tf.constant(0))):
-            value = value.numpy()  # BytesList won't unpack a string from an EagerTensor.
-        return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
+        """bytes_list from a bytes or str or array-like of those."""


The docstring has a grammatical error. It should be "bytes_list from bytes, str, or array-like of those" (with commas for proper list formatting).

Suggested change

"""bytes_list from a bytes or str or array-like of those."""

"""bytes_list from bytes, str, or array-like of those."""

Copilot · 2025-12-12T14:50:37Z

src/acoupipe/writer.py

    def int_list_feature(value):
-        """Return an int64_list from a list od int values."""
-        return tf.train.Feature(int64_list=tf.train.Int64List(value=value.reshape(-1)))
+        """int64_list from scalar or array-like of ints/bools."""


As with the bytes encoder docstring, the accepted input types should be listed with commas for grammatical correctness.

Suggested change

"""int64_list from scalar or array-like of ints/bools."""

"""int64_list from a scalar, int, bool, or array-like of ints or bools."""

Copilot · 2025-12-12T14:50:37Z

src/acoupipe/datasets/base.py

+        builder = BaseFeatureCollectionBuilder(features=feature_instances)
+        if hasattr(self.config, 'get_prepare_func'):
+            builder.add_custom(self.config.get_prepare_func()) # add prepare function
+        return builder.build() # finally build the feature collection


Only one space separates the code from these inline comments. According to PEP 8, inline comments should be separated from the statement by at least two spaces and should start with # followed by a single space. Both comments should therefore be preceded by two spaces.

Suggested change

return builder.build() # finally build the feature collection

return builder.build()  # finally build the feature collection

Copilot · 2025-12-12T14:50:38Z

src/acoupipe/datasets/synthetic.py

+        )
+
+    def _get_targetmap_feature(self, strength_type, **kwargs):  # noqa ARG002
+        freq_data = self.freq_data if strength_type == 'analytic' else (self.fft_obs_spectra if self.mode == 'welch' else self.freq_data)


At roughly 130 characters plus indentation, this line exceeds both PEP 8's 79-character limit and the looser limits of up to 120 characters that many projects adopt. The nested ternary expression is also hard to read. Consider breaking it into multiple lines or extracting the logic into a separate variable for clarity.

Suggested change

freq_data = self.freq_data if strength_type == 'analytic' else (self.fft_obs_spectra if self.mode == 'welch' else self.freq_data)

if strength_type == 'analytic':
    freq_data = self.freq_data
elif self.mode == 'welch':
    freq_data = self.fft_obs_spectra
else:
    freq_data = self.freq_data

adku1173 requested a review from Copilot December 12, 2025 14:44

Copilot started reviewing on behalf of adku1173 December 12, 2025 14:44

Copilot AI reviewed Dec 12, 2025

adku1173 added 7 commits December 28, 2025 21:46

allow custom features

7d467a4

wip

66cefff

wip

bf7a05e

custom features synthetic

22d80fc

appply changes to experimental

3f38f45

fix

beb164c

fix ps convolve

5ab8c3e

adku1173 force-pushed the custom_features branch from 97011c8 to 5ab8c3e Compare December 28, 2025 20:47
