Implement MDIO Dataset builder to create in-memory instance of schemas.v1.dataset.Dataset #568

Merged: 23 commits, Jul 10, 2025

Conversation

@dmitriyrepin dmitriyrepin commented Jul 2, 2025

Implement MDIO Dataset builder API

Name                                                   Stmts   Miss Branch BrPart  Cover   Missing
--------------------------------------------------------------------------------------------------
src/mdio/schemas/v1/dataset_builder.py                   154      5     58      2    97%   55, 103-104, 322-323

Example of usage:

def make_campos_3d_dataset() -> Dataset:
    """Create in-memory campos_3d dataset."""
    ds = MDIODatasetBuilder(
        "campos_3d",
        attributes=UserAttributes(attributes={
            "textHeader": [
                "C01 .......................... ",
                "C02 .......................... ",
                "C03 .......................... ",
            ],
            "foo": "bar"
        }))

    # Add dimensions
    ds.add_dimension("inline", 256)
    ds.add_dimension("crossline", 512)
    ds.add_dimension("depth", 384)
    # Add dimension coordinates (1D, one per dimension)
    ds.add_coordinate("inline", dimensions=["inline"], data_type=ScalarType.UINT32)
    ds.add_coordinate("crossline", dimensions=["crossline"], data_type=ScalarType.UINT32)
    ds.add_coordinate("depth", dimensions=["depth"], data_type=ScalarType.FLOAT64,
                      metadata_info=[
                          AllUnits(units_v1=LengthUnitModel(length=LengthUnitEnum.METER))
                      ])
    # Add non-dimension coordinates (2D, defined over inline and crossline)
    ds.add_coordinate(
        "cdp-x",
        dimensions=["inline", "crossline"],
        data_type=ScalarType.FLOAT32,
        metadata_info=[
            AllUnits(units_v1=LengthUnitModel(length=LengthUnitEnum.METER))]
    )
    ds.add_coordinate(
        "cdp-y",
        dimensions=["inline", "crossline"],
        data_type=ScalarType.FLOAT32,
        metadata_info=[
            AllUnits(units_v1=LengthUnitModel(length=LengthUnitEnum.METER))]
    )

    # Add image variable
    ds.add_variable(
        name="image",
        dimensions=["inline", "crossline", "depth"],
        data_type=ScalarType.FLOAT32,
        compressor=Blosc(algorithm="zstd"),
        coordinates=["cdp-x", "cdp-y"],
        metadata_info=[
            ChunkGridMetadata(
                chunk_grid=RegularChunkGrid(
                    configuration=RegularChunkShape(chunk_shape=[128, 128, 128]))
            ),
            StatisticsMetadata(
                stats_v1=SummaryStatistics(
                    count=100,
                    sum=1215.1,
                    sumSquares=125.12,
                    min=5.61,
                    max=10.84,
                    histogram=CenteredBinHistogram(
                        binCenters=[1, 2], counts=[10, 15]),
                )
            ),
            UserAttributes(
                attributes={"fizz": "buzz", "UnitSystem": "Canonical"}),
        ])
    # Add velocity variable
    ds.add_variable(
        name="velocity",
        dimensions=["inline", "crossline", "depth"],
        data_type=ScalarType.FLOAT16,
        coordinates=["cdp-x", "cdp-y"],
        metadata_info=[
            ChunkGridMetadata(
                chunk_grid=RegularChunkGrid(
                    configuration=RegularChunkShape(chunk_shape=[128, 128, 128]))
            ),
            AllUnits(units_v1=SpeedUnitModel(
                speed=SpeedUnitEnum.METER_PER_SECOND)),
        ],
    )
    # Add inline-optimized image variable
    ds.add_variable(
        name="image_inline",
        long_name="inline optimized version of 3d_stack",
        dimensions=["inline", "crossline", "depth"],
        data_type=ScalarType.FLOAT32,
        compressor=Blosc(algorithm="zstd"),
        coordinates=["cdp-x", "cdp-y"],
        metadata_info=[
            ChunkGridMetadata(
                chunk_grid=RegularChunkGrid(
                    configuration=RegularChunkShape(chunk_shape=[4, 512, 512]))
            )]
    )
    # Add headers variable with structured dtype
    ds.add_variable(
        name="image_headers",
        dimensions=["inline", "crossline"],
        data_type=StructuredType(
            fields=[
                StructuredField(name="cdp-x", format=ScalarType.FLOAT32),
                StructuredField(name="cdp-y", format=ScalarType.FLOAT32),
                StructuredField(name="inline", format=ScalarType.UINT32),
                StructuredField(name="crossline", format=ScalarType.UINT32),
            ]
        ),
        coordinates=["cdp-x", "cdp-y"],
    )
    return ds.build()
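
For readers unfamiliar with the builder pattern used in the example above, here is a minimal sketch of the general idea: collect dimensions, coordinates, and variables step by step, reject references to names that were never declared, and produce the final object only in `build()`. `TinyDatasetBuilder` and its plain-dict output are hypothetical stand-ins invented for illustration. They are not the `MDIODatasetBuilder` implementation, and the real class may validate differently.

```python
from dataclasses import dataclass, field


@dataclass
class TinyDatasetBuilder:
    """Hypothetical stand-in for illustration; not the MDIO implementation."""

    name: str
    dims: dict = field(default_factory=dict)
    coords: dict = field(default_factory=dict)
    variables: dict = field(default_factory=dict)

    def _check_dims(self, dimensions):
        # Every referenced dimension must have been declared first.
        unknown = [d for d in dimensions if d not in self.dims]
        if unknown:
            raise ValueError(f"unknown dimensions: {unknown}")

    def add_dimension(self, name, size):
        self.dims[name] = size
        return self  # returning self allows call chaining

    def add_coordinate(self, name, dimensions):
        self._check_dims(dimensions)
        self.coords[name] = list(dimensions)
        return self

    def add_variable(self, name, dimensions, coordinates=()):
        self._check_dims(dimensions)
        missing = [c for c in coordinates if c not in self.coords]
        if missing:
            raise ValueError(f"unknown coordinates: {missing}")
        self.variables[name] = {
            "dimensions": list(dimensions),
            "coordinates": list(coordinates),
        }
        return self

    def build(self):
        # Return copies so later builder calls cannot mutate the result.
        return {
            "name": self.name,
            "dimensions": dict(self.dims),
            "coordinates": dict(self.coords),
            "variables": dict(self.variables),
        }


ds = TinyDatasetBuilder("campos_3d")
ds.add_dimension("inline", 256).add_dimension("crossline", 512)
ds.add_coordinate("cdp-x", dimensions=["inline", "crossline"])
ds.add_variable("image_headers", dimensions=["inline", "crossline"], coordinates=["cdp-x"])
dataset = ds.build()
```

Validating names at each `add_*` call, rather than only in `build()`, surfaces a typo in a dimension or coordinate name at the line that introduced it.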

@tasansal tasansal added enhancement New feature or request v1 labels Jul 7, 2025
@tasansal tasansal linked an issue Jul 7, 2025 that may be closed by this pull request
Collaborator

@tasansal tasansal left a comment

Thanks Dmitriy! I left some notes; most of them ask for modifications driven by some design choices. Let me know what you think.

Also, I haven't reviewed the unit tests yet in case things need to change. I will do that once we finalize the implementation.

@tasansal tasansal moved this to In progress in mdio-python 1.0.0 release Jul 7, 2025
@tasansal tasansal self-requested a review July 9, 2025 14:01
Collaborator

@tasansal tasansal left a comment

@dmitriyrepin I think it looks good except for a few minor comments. However, may I ask for one thing?

Can we please simplify the unit tests? Some of them are too low-level and can be folded into higher-level tests and removed entirely. We should reduce the lines of code in the tests and remove the very low-level ones that are already covered by higher-level functions, unless they are absolutely necessary.

@tasansal tasansal moved this from In progress to In review in mdio-python 1.0.0 release Jul 9, 2025
@dmitriyrepin
Author

Can we please simplify the unit tests? Some of them are too low-level and can be folded into higher-level tests

I have extracted the common test functionality into the following functions:

  • validate_builder()
  • validate_coordinate()
  • validate_variable()

I think the tests are now much more readable.
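
To illustrate the kind of refactoring described above, here is a minimal sketch of a shared validation helper that replaces repeated per-field assertions in individual tests. The signature, the `FakeVariable` class, and the string data types are assumptions made up for this sketch. The actual `validate_variable()` in this PR may take different arguments and check different fields.

```python
from dataclasses import dataclass, field


@dataclass
class FakeVariable:
    """Hypothetical stand-in for a schema variable; not an MDIO class."""

    name: str
    dimensions: list
    data_type: str
    coordinates: list = field(default_factory=list)


def validate_variable(variables, name, dims, dtype, coords=()):
    """Find one variable by name and check its key fields in a single place.

    Hypothetical signature for illustration only.
    """
    matches = [v for v in variables if v.name == name]
    assert len(matches) == 1, f"expected exactly one variable named {name!r}"
    var = matches[0]
    assert var.dimensions == list(dims), f"{name}: unexpected dimensions"
    assert var.data_type == dtype, f"{name}: unexpected data type"
    assert var.coordinates == list(coords), f"{name}: unexpected coordinates"
    return var  # returned so a test can make extra, case-specific checks


variables = [
    FakeVariable("cdp-x", ["inline", "crossline"], "float32"),
    FakeVariable("image", ["inline", "crossline", "depth"], "float32", ["cdp-x"]),
]

# One call per variable replaces several low-level asserts in each test.
image = validate_variable(
    variables,
    name="image",
    dims=["inline", "crossline", "depth"],
    dtype="float32",
    coords=["cdp-x"],
)
```

A helper like this keeps each test focused on the scenario it builds, while the field-by-field checks live in one place.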

@tasansal tasansal self-requested a review July 10, 2025 13:16
Collaborator

@tasansal tasansal left a comment

LGTM. The pre-commit check, which covers formatting, linting, and best practices, is failing. Ready to merge once that is resolved.

@dmitriyrepin
Author

LGTM. The pre-commit check, which covers formatting, linting, and best practices, is failing. Ready to merge once that is resolved.

Done:

nox > pre-commit run --all-files --hook-stage=manual --show-diff-on-failure
[INFO] Initializing environment for https://github.com/jsh9/pydoclint.
[INFO] Initializing environment for https://github.com/pre-commit/mirrors-prettier.
[INFO] Initializing environment for https://github.com/pre-commit/mirrors-prettier:[email protected].
[INFO] Installing environment for https://github.com/jsh9/pydoclint.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
[INFO] Installing environment for https://github.com/pre-commit/mirrors-prettier.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
pydoclint................................................................Passed
Format code with Ruff....................................................Passed
Lint code with Ruff......................................................Passed
Check for added large files..............................................Passed
Check Toml...............................................................Passed
Check Yaml...............................................................Passed
Fix End of Files.........................................................Passed
Trim Trailing Whitespace.................................................Passed
prettier.................................................................Passed
nox > Session pre-commit was successful.

codecov bot commented Jul 10, 2025

Codecov Report

Attention: Patch coverage is 96.51376% with 19 lines in your changes missing coverage. Please review.

Project coverage is 90.40%. Comparing base (0195bb1) to head (1904dee).
Report is 153 commits behind head on v1.

Files with missing lines                 Patch %   Lines
--------------------------------------------------------------------------------
tests/unit/v1/helpers.py                 87.87%    6 Missing and 6 partials ⚠️
src/mdio/schemas/v1/dataset_builder.py   95.42%    5 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##               v1     #568      +/-   ##
==========================================
+ Coverage   84.32%   90.40%   +6.08%     
==========================================
  Files          46       70      +24     
  Lines        2194     3587    +1393     
  Branches      305      237      -68     
==========================================
+ Hits         1850     3243    +1393     
+ Misses        301      291      -10     
- Partials       43       53      +10     
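
The percentages in the table above can be reproduced from the Hits, Misses, and Partials rows. The sketch below assumes coverage is computed as hits / (hits + misses + partials) and truncated (not rounded) to two decimals, which is consistent with the figures shown here. `coverage_percent` is a hypothetical helper name, not part of any Codecov tooling.

```python
def coverage_percent(hits: int, misses: int, partials: int) -> str:
    """Return coverage as a percentage string truncated to two decimals.

    Assumes coverage = hits / (hits + misses + partials).
    """
    total = hits + misses + partials
    # Integer arithmetic in hundredths of a percent avoids float rounding.
    hundredths = hits * 10_000 // total
    return f"{hundredths // 100}.{hundredths % 100:02d}"


# Values taken from the "Coverage Diff" table (base = v1, head = #568).
base = coverage_percent(hits=1850, misses=301, partials=43)
head = coverage_percent(hits=3243, misses=291, partials=53)
print(f"base {base}%  head {head}%")
```

Note that hits + misses + partials equals the Lines row in both columns (2194 and 3587), so partially covered lines count against the total.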

@tasansal tasansal self-requested a review July 10, 2025 15:50
@tasansal tasansal merged commit 90d31a1 into TGSAI:v1 Jul 10, 2025
10 checks passed
@github-project-automation github-project-automation bot moved this from In review to Done in mdio-python 1.0.0 release Jul 10, 2025
Labels: enhancement (New feature or request), v1
Project status: Done
Linked issue: MDIO v1 In Memory Dataset Manifest Builder
2 participants