@adku1173
Owner

This pull request significantly refactors the feature handling and TensorFlow serialization pipeline for datasets, introducing a more modular and extensible approach to feature definition and collection. The main improvements include a new system for default feature instantiation, streamlined feature collection building, and enhanced support for variable-length and complex features in TensorFlow records.

Feature handling and extensibility:

  • Introduced a new method get_default_features and corresponding _get_default_feature_{name} hooks in dataset configs, enabling modular and extensible default feature instantiation based on feature names. This replaces the previous hardcoded logic and allows for easier customization and extension of features. [1] [2]
  • Refactored the feature collection builder (BaseFeatureCollectionBuilder) to accept a list of BaseFeatureCatalog instances, automatically registering feature functions and their TensorFlow encoding/shape/dtype using the new infer_tf_encoding utility. [1] [2] [3]
  • Removed the large, custom get_feature_collection implementation from experimental.py, delegating feature instantiation to the new default feature mechanism.
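The name-based hook mechanism described above can be sketched roughly as follows. This is a minimal illustration, not the acoupipe implementation: the class names, the dictionary return values, and the error handling are assumptions. Only the method names get_default_features and _get_default_feature_{name} come from the pull request.

```python
class ConfigSketch:
    """Hypothetical stand-in for a dataset config (not the real ConfigBase)."""

    def get_default_features(self, names, f=None, num=0):
        """Look up one hook per feature name and call it."""
        features = []
        for name in names:
            # Convention-based dispatch: "loc" -> self._get_default_feature_loc
            hook = getattr(self, f"_get_default_feature_{name}", None)
            if hook is None:
                raise ValueError(f"no default feature named {name!r}")
            features.append(hook(f=f, num=num))
        return features

    # Hypothetical hooks; plain dicts stand in for feature catalog instances.
    def _get_default_feature_loc(self, **kwargs):
        return {"name": "loc", **kwargs}

    def _get_default_feature_csm(self, **kwargs):
        return {"name": "csm", **kwargs}


class ExperimentalConfigSketch(ConfigSketch):
    """A subclass customizes a single feature by overriding its hook."""

    def _get_default_feature_loc(self, **kwargs):
        return {"name": "loc", "source": "measured", **kwargs}


base = ConfigSketch().get_default_features(["loc", "csm"], f=1000.0)
custom = ExperimentalConfigSketch().get_default_features(["loc"])
```

With this layout, supporting a new feature means adding one method, and a dataset-specific config only overrides the hooks that differ.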

TensorFlow serialization improvements:

  • Improved handling of variable-length and complex features in the TFRecord writer and parser: features with multiple None dimensions are now tracked and their shapes are stored in the TFRecord, enabling correct parsing and reshaping at load time. Complex features are stored as float pairs and reconstructed during parsing. [1] [2]
  • The TFRecord parser now dynamically handles dense/sparse tensors and associated shape keys, ensuring robust support for features with dynamic shapes and complex types.
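To illustrate the idea of storing complex values as float pairs together with their shape, here is a small sketch in plain Python. It does not use TensorFlow, and the record layout (interleaved real and imaginary parts, a "shape" entry next to the data) is an assumption made for illustration. The actual key names and layout used by the writer and parser are not shown in this conversation.

```python
def encode_complex(rows):
    """Flatten a 2-D list of complex numbers into float pairs plus a shape."""
    shape = [len(rows), len(rows[0]) if rows else 0]
    floats = []
    for row in rows:
        for z in row:
            # Each complex value becomes two floats: real part, imaginary part.
            floats.extend((z.real, z.imag))
    return {"data": floats, "shape": shape}


def decode_complex(record):
    """Rebuild the 2-D list of complex numbers from floats and stored shape."""
    floats = record["data"]
    flat = [complex(re, im) for re, im in zip(floats[0::2], floats[1::2])]
    n_rows, n_cols = record["shape"]
    # Without the stored shape, the flat list could not be reshaped reliably.
    return [flat[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]


sample = [[1 + 2j, 3 - 4j, 0j], [5j, -1 + 0j, 2.5 + 0.5j]]
record = encode_complex(sample)
restored = decode_complex(record)
```

The stored shape matters because a flattened variable-length feature cannot otherwise be reshaped unambiguously at load time.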

Codebase cleanup:

  • Removed unused imports and legacy feature collection builder classes from experimental.py, simplifying the code and reducing redundancy. [1] [2]
  • Minor fixes and improvements, such as correcting the extraction of source locations in LocFeature.

Overall, these changes make the feature pipeline more flexible, maintainable, and compatible with advanced TensorFlow serialization needs.


Most important changes:

Feature handling and extensibility

  • Added get_default_features and _get_default_feature_{name} hooks to enable modular, name-based feature instantiation in dataset configs, replacing hardcoded feature logic. [1] [2]
  • Refactored BaseFeatureCollectionBuilder to accept a list of BaseFeatureCatalog instances, automatically registering feature functions and TensorFlow encoding/shape/dtype mappings using infer_tf_encoding. [1] [2] [3]
  • Removed the custom feature collection builder and legacy feature logic from experimental.py, delegating feature instantiation to the new default feature mechanism.

TensorFlow serialization improvements

  • Enhanced TFRecord writing and parsing to handle variable-length and complex features, including dynamic shape tracking and correct reconstruction of complex types. [1] [2]

Codebase cleanup

  • Removed unused imports and made minor corrections, such as fixing source location extraction in LocFeature. [1] [2]


Copilot AI left a comment

Pull request overview

This pull request refactors the feature handling and TensorFlow serialization system for acoustic datasets, introducing a more modular architecture for defining and processing features. The changes replace hardcoded feature logic with a dynamic system based on convention-based method naming (_get_default_feature_{name}), and enhance TFRecord serialization to properly handle variable-length and complex-valued features.

Key changes:

  • Introduced get_default_features method in ConfigBase enabling dynamic feature instantiation via _get_default_feature_{name} hooks
  • Refactored BaseFeatureCollectionBuilder to accept feature catalog instances and automatically infer TensorFlow encoding/dtype/shape mappings
  • Enhanced TFRecord writer and parser to track shapes of variable-length features and properly encode/decode complex values
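As a rough picture of what inferring an encoding from a dtype could involve, the sketch below maps Python types to the three list kinds that tf.train.Feature offers (bytes_list, float_list, int64_list). The function name, its signature, and the choice to append a trailing dimension of 2 for complex values are assumptions. The real infer_tf_encoding signature and return value are not shown in this pull request conversation.

```python
def infer_encoding_sketch(dtype, shape):
    """Hypothetical helper: pick a tf.train.Feature list kind for a dtype."""
    shape = tuple(shape)
    if dtype in (int, bool):
        return "int64_list", shape
    if dtype is float:
        return "float_list", shape
    if dtype is complex:
        # Assumption: complex values are stored as (real, imag) float pairs,
        # modeled here as an extra trailing dimension of size 2.
        return "float_list", shape + (2,)
    if dtype in (str, bytes):
        return "bytes_list", shape
    raise TypeError(f"unsupported dtype: {dtype!r}")


kind_int = infer_encoding_sketch(int, (None,))
kind_complex = infer_encoding_sketch(complex, (None, 64, 64))
```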

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 9 comments.

File | Description
src/acoupipe/writer.py | Added infer_tf_encoding function, refactored encoding functions to handle array-like inputs, and implemented shape tracking for variable-length features in WriteTFRecord
src/acoupipe/datasets/features.py | Added shape trait to BaseFeatureCatalog, fixed source location extraction in LocFeature, removed BaseFeatureCollection.__init__, and refactored BaseFeatureCollectionBuilder to use feature catalogs with automatic mapper registration
src/acoupipe/datasets/base.py | Added get_default_features infrastructure to ConfigBase, updated get_feature_collection to support both custom and default features, and enhanced TFRecord parser to handle dynamic shapes and complex values
src/acoupipe/datasets/synthetic.py | Simplified get_feature_collection to use new default feature system, added _get_default_feature_* methods for all supported features, and removed legacy DatasetSyntheticFeatureCollectionBuilder class
src/acoupipe/datasets/experimental.py | Removed custom get_feature_collection and MIRACLEFeatureCollectionBuilder, added MIRACLE-specific override methods for source strength and targetmap features

# get features with varying length to handle them correctly in the TFRecord writer
shape_features = []
for feature, shape in feature_collection.feature_tf_shape_mapper.items():
# if more then one None in shape, we have a varying length feature
Copilot AI Dec 12, 2025

The code comment contains a typo: "if more then one None" should read "if more than one None".

Suggested change
# if more then one None in shape, we have a varying length feature
# if more than one None in shape, we have a varying length feature
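The selection rule in that code comment can be sketched on its own as follows. The mapper contents are made up for illustration, and find_shape_features is a hypothetical helper name. Only the rule itself (more than one None in the shape marks a feature whose shape must be stored) comes from the snippet under review.

```python
def find_shape_features(shape_mapper):
    """Return names of features whose shape has more than one None entry."""
    return [
        name
        for name, shape in shape_mapper.items()
        if sum(dim is None for dim in shape) > 1
    ]


# Hypothetical feature-name -> TF-shape mapping.
example_mapper = {
    "loc": (3, None),                   # one unknown dimension
    "csm": (None, 64, 64),              # one unknown dimension
    "sourcemap": (None, None, None),    # several unknown dimensions
}
shape_features = find_shape_features(example_mapper)
```

A single unknown dimension can be recovered from the flat length when the other dimensions are fixed, which is presumably why only shapes with several None entries need to be tracked.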

Copilot uses AI. Check for mistakes.
encoded_features = self._encode_sample(features, encoders)
example = tf.train.Example(features=tf.train.Features(feature=encoded_features))
# Serialize to string and write on the file
writer.write(example.SerializeToString())
Copilot AI Dec 12, 2025

The removed code includes writer.flush() and writer.close() calls that are no longer present. While the context manager (with statement) will automatically close the writer, the explicit flush() call ensured data was written to disk immediately after each sample. Without it, data may be buffered and only written at the end, which could be problematic if the process is interrupted. Consider whether the flush behavior is needed for robustness.

Suggested change
writer.write(example.SerializeToString())
writer.write(example.SerializeToString())
writer.flush()

"""
Decide which encoder function, TF dtype and TF shape to use
for a feature with given numpy / Python dtype and shape.
Copilot AI Dec 12, 2025

Missing blank line before the Parameters section in the docstring. According to PEP 257 and numpydoc conventions, there should be a blank line separating the summary/description from the Parameters section.
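For reference, a numpydoc-style docstring with the blank line in place might look like the sketch below. The function name, parameter descriptions, and return value are illustrative assumptions. Only the summary sentence is adapted from the docstring under review.

```python
def infer_encoding_doc_sketch(dtype, shape):
    """Decide which encoder function, TF dtype and TF shape to use.

    Parameters
    ----------
    dtype : type
        numpy or Python dtype of the feature (assumed parameter).
    shape : tuple
        Shape of the feature; None marks a variable dimension (assumed).

    Returns
    -------
    tuple
        The inputs, echoed back; a placeholder for the real result.
    """
    return dtype, tuple(shape)
```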

Suggested change


def _get_targetmap_feature(self, strength_type, **kwargs): # noqa ARG002
    fdim = self._get_fdim(kwargs['f'])
    freq_data = self.freq_data if strength_type == 'analytic' else (self.fft_obs_spectra if self.mode == 'welch' else self.freq_data)
Copilot AI Dec 12, 2025

This line is well beyond the 79-character limit that PEP 8 recommends (teams may agree on up to 99), and the nested conditional expression is hard to read. Consider splitting it into an explicit if/elif/else block or extracting the condition into a named variable.

Suggested change
freq_data = self.freq_data if strength_type == 'analytic' else (self.fft_obs_spectra if self.mode == 'welch' else self.freq_data)
if strength_type == 'analytic':
    freq_data = self.freq_data
elif self.mode == 'welch':
    freq_data = self.fft_obs_spectra
else:
    freq_data = self.freq_data
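Since two of the three branches select the same value, the logic could also be collapsed into a single named condition. The sketch below uses a free function with hypothetical arguments in place of the instance attributes and checks that it agrees with the original nested expression. It is one possible alternative, not the change adopted in the pull request.

```python
def pick_freq_data(strength_type, mode, freq_data, fft_obs_spectra):
    """Select the FFT spectra only for non-analytic strength in welch mode."""
    use_fft_spectra = strength_type != "analytic" and mode == "welch"
    return fft_obs_spectra if use_fft_spectra else freq_data


def pick_freq_data_original(strength_type, mode, freq_data, fft_obs_spectra):
    """The nested conditional from the reviewed line, kept for comparison."""
    return freq_data if strength_type == "analytic" else (
        fft_obs_spectra if mode == "welch" else freq_data
    )


# Compare both versions over every combination of the two inputs.
# "estimated" and "other" are placeholder values for the non-matching cases.
results = [
    (
        pick_freq_data(s, m, "freq", "fft"),
        pick_freq_data_original(s, m, "freq", "fft"),
    )
    for s in ("analytic", "estimated")
    for m in ("welch", "other")
]
```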

default_features = self.config.get_default_features(default_feature_names, f, num)
builder = BaseFeatureCollectionBuilder(features=default_features + custom_features)
builder.add_custom(self.config.get_prepare_func()) # add prepare function
return builder.build() # finally build the feature collection
Copilot AI Dec 12, 2025

Inline comment spacing: according to PEP 8, an inline comment should be separated from the statement by at least two spaces and should start with # followed by a single space. The comments # add prepare function and # finally build the feature collection should follow that format.

Suggested change
return builder.build() # finally build the feature collection
return builder.build()  # finally build the feature collection

if isinstance(value, type(tf.constant(0))):
value = value.numpy() # BytesList won't unpack a string from an EagerTensor.
return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
"""bytes_list from a bytes or str or array-like of those."""
Copilot AI Dec 12, 2025

The docstring has a grammatical error. It should be "bytes_list from bytes, str, or array-like of those" (with commas for proper list formatting).

Suggested change
"""bytes_list from a bytes or str or array-like of those."""
"""bytes_list from bytes, str, or array-like of those."""
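What "bytes, str, or array-like of those" could mean in practice is sketched below as a normalization step that produces a flat list of bytes objects, the form tf.train.BytesList expects for its value argument. The helper name and the UTF-8 encoding choice are assumptions. The refactored encoder in the pull request may handle these cases differently.

```python
def to_bytes_list(value):
    """Hypothetical helper: normalize bytes, str, or a sequence of them."""
    # A single bytes or str value is treated as a one-element sequence.
    if isinstance(value, (bytes, str)):
        value = [value]
    result = []
    for item in value:
        if isinstance(item, str):
            result.append(item.encode("utf-8"))  # assumed encoding
        elif isinstance(item, bytes):
            result.append(item)
        else:
            raise TypeError(f"expected bytes or str, got {type(item).__name__}")
    return result
```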

def int_list_feature(value):
    """Return an int64_list from a list od int values."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=value.reshape(-1)))
    """int64_list from scalar or array-like of ints/bools."""
Copilot AI Dec 12, 2025

The docstring could name the accepted input types explicitly. For grammatical correctness, separate the list items with commas: "a scalar, int, bool, or array-like".

Suggested change
"""int64_list from scalar or array-like of ints/bools."""
"""int64_list from a scalar, int, bool, or array-like of ints or bools."""

builder = BaseFeatureCollectionBuilder(features=feature_instances)
if hasattr(self.config, 'get_prepare_func'):
    builder.add_custom(self.config.get_prepare_func()) # add prepare function
return builder.build() # finally build the feature collection
Copilot AI Dec 12, 2025

Inline comment spacing: according to PEP 8, an inline comment should be separated from the statement by at least two spaces and should start with # followed by a single space. The comments # add prepare function and # finally build the feature collection should follow that format.

Suggested change
return builder.build() # finally build the feature collection
return builder.build()  # finally build the feature collection

)

def _get_targetmap_feature(self, strength_type, **kwargs): # noqa ARG002
    freq_data = self.freq_data if strength_type == 'analytic' else (self.fft_obs_spectra if self.mode == 'welch' else self.freq_data)
Copilot AI Dec 12, 2025

This line is well beyond the 79-character limit that PEP 8 recommends (teams may agree on up to 99), and the nested conditional expression is hard to read. Consider splitting it into an explicit if/elif/else block or extracting the condition into a named variable.

Suggested change
freq_data = self.freq_data if strength_type == 'analytic' else (self.fft_obs_spectra if self.mode == 'welch' else self.freq_data)
if strength_type == 'analytic':
    freq_data = self.freq_data
elif self.mode == 'welch':
    freq_data = self.fft_obs_spectra
else:
    freq_data = self.freq_data
