Skip to content

JSON LD parsing fails when multiple strings given to FileSet.includes #698

@brendon-boldt

Description

@brendon-boldt

FileSet(..., includes=["xyz"]) works but FileSet(..., includes=["xyz", "wxy"]) does not work, failing with:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "REDACTED/util/croissant.py", line 154, in <module>
    test()
  File "REDACTED/util/croissant.py", line 141, in test
    dataset = mlc.Dataset(jsonld="croissant.json")
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<string>", line 6, in __init__
  File "REDACTED/mlcroissant/_src/datasets.py", line 70, in __post_init__
    self.metadata = Metadata.from_file(ctx=ctx, file=self.jsonld)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "REDACTED/mlcroissant/_src/structure_graph/nodes/metadata.py", line 429, in from_file
    return cls.from_json(ctx=ctx, json_=json_)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "REDACTED/mlcroissant/_src/structure_graph/nodes/metadata.py", line 439, in from_json
    jsonld = expand_jsonld(json_, ctx=ctx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "REDACTED/mlcroissant/_src/core/json_ld.py", line 231, in expand_jsonld
    recursively_populate_jsonld(entry_node, id_to_node)
  File "REDACTED/mlcroissant/_src/core/json_ld.py", line 171, in recursively_populate_jsonld
    value = [recursively_populate_jsonld(child, id_to_node) for child in value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "REDACTED/mlcroissant/_src/core/json_ld.py", line 161, in recursively_populate_jsonld
    return recursively_populate_jsonld(entry_node, id_to_node)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "REDACTED/mlcroissant/_src/core/json_ld.py", line 171, in recursively_populate_jsonld
    value = [recursively_populate_jsonld(child, id_to_node) for child in value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "REDACTED/mlcroissant/_src/core/json_ld.py", line 164, in recursively_populate_jsonld
    for key, value in entry_node.copy().items():
                      ^^^^^^^^^^^^^^^
AttributeError: 'str' object has no attribute 'copy'

presumably due to

value = [recursively_populate_jsonld(child, id_to_node) for child in value]

trying to recurse into the array of strings by running recursively_populate_jsonld on strs when it seems to be intended for dicts only. One simple fix would be to exit early from recusrively_populate_jsonld is entry_node is not a dict, but maybe this is just masking a deeper problem.

If you want me to open a PR or something, let me know. Thanks.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions