Custom features #47
base: dev
Conversation
Pull request overview
This pull request refactors the feature handling and TensorFlow serialization system for acoustic datasets, introducing a more modular architecture for defining and processing features. The changes replace hardcoded feature logic with a dynamic system based on convention-based method naming (_get_default_feature_{name}), and enhance TFRecord serialization to properly handle variable-length and complex-valued features.
Key changes:
- Introduced `get_default_features` method in `ConfigBase` enabling dynamic feature instantiation via `_get_default_feature_{name}` hooks
- Refactored `BaseFeatureCollectionBuilder` to accept feature catalog instances and automatically infer TensorFlow encoding/dtype/shape mappings
- Enhanced TFRecord writer and parser to track shapes of variable-length features and properly encode/decode complex values
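The convention-based lookup described above can be sketched in a few lines. This is a minimal illustration of name-based dispatch via `getattr`, not AcouPipe's actual implementation; the class, feature names, and return values here are hypothetical.

```python
# Sketch: resolve "_get_default_feature_<name>" hooks by naming convention.
# ConfigSketch and its feature hooks are illustrative stand-ins.
class ConfigSketch:
    def _get_default_feature_loc(self, **kwargs):
        return ("loc", kwargs)

    def _get_default_feature_csm(self, **kwargs):
        return ("csm", kwargs)

    def get_default_features(self, names, **kwargs):
        """Instantiate default features for the given feature names."""
        features = []
        for name in names:
            hook = getattr(self, f"_get_default_feature_{name}", None)
            if hook is None:
                raise ValueError(f"no default feature named {name!r}")
            features.append(hook(**kwargs))
        return features
```

A subclass can then add or override features simply by defining another `_get_default_feature_<name>` method, which matches the PR's description of the MIRACLE-specific overrides.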
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| src/acoupipe/writer.py | Added infer_tf_encoding function, refactored encoding functions to handle array-like inputs, and implemented shape tracking for variable-length features in WriteTFRecord |
| src/acoupipe/datasets/features.py | Added shape trait to BaseFeatureCatalog, fixed source location extraction in LocFeature, removed BaseFeatureCollection.__init__, and refactored BaseFeatureCollectionBuilder to use feature catalogs with automatic mapper registration |
| src/acoupipe/datasets/base.py | Added get_default_features infrastructure to ConfigBase, updated get_feature_collection to support both custom and default features, and enhanced TFRecord parser to handle dynamic shapes and complex values |
| src/acoupipe/datasets/synthetic.py | Simplified get_feature_collection to use new default feature system, added _get_default_feature_* methods for all supported features, and removed legacy DatasetSyntheticFeatureCollectionBuilder class |
| src/acoupipe/datasets/experimental.py | Removed custom get_feature_collection and MIRACLEFeatureCollectionBuilder, added MIRACLE-specific override methods for source strength and targetmap features |
```python
# get features with varying length to handle them correctly in the TFRecord writer
shape_features = []
for feature, shape in feature_collection.feature_tf_shape_mapper.items():
    # if more then one None in shape, we have a varying length feature
```
Copilot AI · Dec 12, 2025
The comment incorrectly states "if more then one None" but it should be "if more than one None".
```diff
-    # if more then one None in shape, we have a varying length feature
+    # if more than one None in shape, we have a varying length feature
```
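The shape-tracking idea behind this hunk — flatten a variable-length feature and store its runtime shape in a companion entry so it can be reshaped at parse time — can be sketched independently of TensorFlow. The field-naming scheme (`<name>_shape`) and function names below are illustrative assumptions, not the actual writer API.

```python
# Sketch: store a variable-length 2-D feature as flat values plus a
# companion "<name>_shape" entry, and rebuild it when parsing.
def serialize_varlen(name, array_2d):
    rows = len(array_2d)
    cols = len(array_2d[0]) if rows else 0
    flat = [x for row in array_2d for x in row]  # row-major flattening
    return {name: flat, f"{name}_shape": [rows, cols]}

def parse_varlen(name, record):
    rows, cols = record[f"{name}_shape"]
    flat = record[name]
    # reshape flat values back into rows of length `cols`
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]
```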
```python
encoded_features = self._encode_sample(features, encoders)
example = tf.train.Example(features=tf.train.Features(feature=encoded_features))
# Serialize to string and write on the file
writer.write(example.SerializeToString())
```
The removed code includes writer.flush() and writer.close() calls that are no longer present. While the context manager (with statement) will automatically close the writer, the explicit flush() call ensured data was written to disk immediately after each sample. Without it, data may be buffered and only written at the end, which could be problematic if the process is interrupted. Consider whether the flush behavior is needed for robustness.
```diff
 writer.write(example.SerializeToString())
+writer.flush()
```
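The robustness concern can be illustrated without TensorFlow: a buffered writer only guarantees data on disk after a flush, so flushing per sample limits loss if the process is interrupted. The length-prefixed `serialize()` below is a hypothetical stand-in for `example.SerializeToString()`, loosely mimicking record framing.

```python
import struct

def serialize(sample: bytes) -> bytes:
    # length-prefixed record: 8-byte little-endian length, then payload
    return struct.pack("<Q", len(sample)) + sample

def write_records(path, samples, flush_every=1):
    """Write samples to a binary file, flushing every `flush_every` records."""
    with open(path, "wb") as writer:
        for i, sample in enumerate(samples, start=1):
            writer.write(serialize(sample))
            if i % flush_every == 0:
                writer.flush()  # push buffered bytes to the OS after each sample
```

With `tf.io.TFRecordWriter` the trade-off is the same: the context manager closes (and thereby flushes) the writer at the end, but an explicit per-sample `flush()` bounds how much buffered data an interruption can lose, at some throughput cost.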
| """ | ||
| Decide which encoder function, TF dtype and TF shape to use | ||
| for a feature with given numpy / Python dtype and shape. | ||
Missing blank line before the Parameters section in the docstring. According to PEP 257 and numpydoc conventions, there should be a blank line separating the summary/description from the Parameters section.
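For reference, a numpydoc-conformant version of such a docstring would look like the sketch below. The parameter names and the trivial body are assumptions for illustration, not the actual `infer_tf_encoding` signature; the point is only the blank line separating the summary from the `Parameters` section.

```python
# Hypothetical numpydoc-style docstring showing the blank line the review
# asks for between the summary and the Parameters section.
def infer_tf_encoding(dtype, shape):
    """
    Decide which encoder function, TF dtype and TF shape to use
    for a feature with given numpy / Python dtype and shape.

    Parameters
    ----------
    dtype : type
        Numpy or Python dtype of the feature values.
    shape : tuple of int or None
        Shape of the feature; ``None`` entries mark variable dimensions.
    """
    return dtype, shape  # placeholder body for illustration only
```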
```python
def _get_targetmap_feature(self, strength_type, **kwargs): # noqa ARG002
    fdim = self._get_fdim(kwargs['f'])
    freq_data = self.freq_data if strength_type == 'analytic' else (self.fft_obs_spectra if self.mode == 'welch' else self.freq_data)
```
The line length exceeds readability standards and PEP 8 guidelines (79-120 characters recommended). This nested ternary expression is hard to read. Consider breaking it into multiple lines or extracting the logic into a separate variable for clarity.
```diff
-freq_data = self.freq_data if strength_type == 'analytic' else (self.fft_obs_spectra if self.mode == 'welch' else self.freq_data)
+if strength_type == 'analytic':
+    freq_data = self.freq_data
+elif self.mode == 'welch':
+    freq_data = self.fft_obs_spectra
+else:
+    freq_data = self.freq_data
```
```python
default_features = self.config.get_default_features(default_feature_names, f, num)
builder = BaseFeatureCollectionBuilder(features=default_features + custom_features)
builder.add_custom(self.config.get_prepare_func()) # add prepare function
return builder.build() # finally build the feature collection
```
Missing space after the `#` in inline comments. According to PEP 8, inline comments should be separated by at least two spaces from the code and should start with a `#` followed by a single space. These comments should be formatted as `# add prepare function` and `# finally build the feature collection`.
```diff
-return builder.build() # finally build the feature collection
+return builder.build()  # finally build the feature collection
```
```python
if isinstance(value, type(tf.constant(0))):
    value = value.numpy() # BytesList won't unpack a string from an EagerTensor.
return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
"""bytes_list from a bytes or str or array-like of those."""
```
The docstring has a grammatical error. It should be "bytes_list from bytes, str, or array-like of those" (with commas for proper list formatting).
| """bytes_list from a bytes or str or array-like of those.""" | |
| """bytes_list from bytes, str, or array-like of those.""" |
```python
def int_list_feature(value):
    """Return an int64_list from a list od int values."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=value.reshape(-1)))
"""int64_list from scalar or array-like of ints/bools."""
```
The docstring should use "a bytes or str" or "bytes or str" consistently. Also, for grammatical correctness with list items, use commas: "bytes, str, or array-like".
| """int64_list from scalar or array-like of ints/bools.""" | |
| """int64_list from a scalar, int, bool, or array-like of ints or bools.""" |
```python
builder = BaseFeatureCollectionBuilder(features=feature_instances)
if hasattr(self.config, 'get_prepare_func'):
    builder.add_custom(self.config.get_prepare_func()) # add prepare function
return builder.build() # finally build the feature collection
```
Missing space after the `#` in inline comments. According to PEP 8, inline comments should be separated by at least two spaces from the code and should start with a `#` followed by a single space. These comments should be formatted as `# add prepare function` and `# finally build the feature collection`.
```diff
-return builder.build() # finally build the feature collection
+return builder.build()  # finally build the feature collection
```
```python
)

def _get_targetmap_feature(self, strength_type, **kwargs): # noqa ARG002
    freq_data = self.freq_data if strength_type == 'analytic' else (self.fft_obs_spectra if self.mode == 'welch' else self.freq_data)
```
The line length exceeds readability standards and PEP 8 guidelines (79-120 characters recommended). This nested ternary expression is hard to read. Consider breaking it into multiple lines or extracting the logic into a separate variable for clarity.
```diff
-freq_data = self.freq_data if strength_type == 'analytic' else (self.fft_obs_spectra if self.mode == 'welch' else self.freq_data)
+if strength_type == 'analytic':
+    freq_data = self.freq_data
+elif self.mode == 'welch':
+    freq_data = self.fft_obs_spectra
+else:
+    freq_data = self.freq_data
```
Force-pushed 97011c8 to 5ab8c3e
This pull request significantly refactors the feature handling and TensorFlow serialization pipeline for datasets, introducing a more modular and extensible approach to feature definition and collection. The main improvements include a new system for default feature instantiation, streamlined feature collection building, and enhanced support for variable-length and complex features in TensorFlow records.

Feature handling and extensibility:
- Introduced `get_default_features` and corresponding `_get_default_feature_{name}` hooks in dataset configs, enabling modular and extensible default feature instantiation based on feature names. This replaces the previous hardcoded logic and allows for easier customization and extension of features. [1] [2]
- Refactored the feature collection builder (`BaseFeatureCollectionBuilder`) to accept a list of `BaseFeatureCatalog` instances, automatically registering feature functions and their TensorFlow encoding/shape/dtype using the new `infer_tf_encoding` utility. [1] [2] [3]
- Removed the custom `get_feature_collection` implementation from `experimental.py`, delegating feature instantiation to the new default feature mechanism.

TensorFlow serialization improvements:
- `None` dimensions are now tracked and their shapes are stored in the TFRecord, enabling correct parsing and reshaping at load time. Complex features are stored as float pairs and reconstructed during parsing. [1] [2]

Codebase cleanup:
- Removed the legacy `MIRACLEFeatureCollectionBuilder` from `experimental.py`, simplifying the code and reducing redundancy. [1] [2]
- Fixed source location extraction in `LocFeature`.

Overall, these changes make the feature pipeline more flexible, maintainable, and compatible with advanced TensorFlow serialization needs.
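The "complex features stored as float pairs" scheme can be sketched in plain Python: each complex value is split into a `(real, imag)` pair when serializing and rejoined when parsing. The function names below are illustrative, not the actual writer/parser API.

```python
# Sketch: interleave real/imag parts so complex values fit a float list,
# then reconstruct them when parsing.
def encode_complex(values):
    flat = []
    for z in values:
        flat.extend((z.real, z.imag))  # one float pair per complex value
    return flat

def decode_complex(flat):
    # pair up every other element: [re0, im0, re1, im1, ...]
    return [complex(re, im) for re, im in zip(flat[0::2], flat[1::2])]
```

In the TFRecord context the flat float list would go into a `float_list` feature, with the original shape recovered from the stored shape entry described above.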