Updated data ingestion to use Spanner API #16
Conversation
…namespace prefixes were inserted into the database.
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly enhances the data ingestion pipeline by integrating directly with the Google Cloud Spanner API, moving away from the SQLAlchemy ORM for graph data writes. This shift is designed to improve the efficiency and scalability of handling large volumes of graph data. Additionally, it introduces improved data modeling for edge values, better handling of identifiers, and a new command-line tool for database table management, streamlining development and operational workflows.
Code Review
This pull request refactors the data ingestion process to use the native Google Cloud Spanner API instead of SQLAlchemy for improved performance and batching capabilities, and introduces a new CLI command to drop database tables. However, the new implementation introduces several security and reliability concerns: the ingestion process is vulnerable to Denial of Service (DoS) attacks due to improper handling of non-string property values and incorrect data types when interacting with the Spanner SDK, and a flaw in the batching logic can lead to silent data loss for nodes with a large number of edges. Additionally, a critical issue with an interactive prompt in the service layer could block execution, and there are medium-severity issues related to a bug in data processing, logging practices, and code cleanliness. Addressing these issues is crucial for the robustness and availability of the ingestion pipeline.
```python
if len(val_bytes) > OBJECT_VALUE_MAX_LENGTH:
    import base64

    return base64.b64encode(val_bytes).decode("utf-8")
```
The get_edge_val function returns a base64-encoded string (str) for the object_bytes column when the original value exceeds the maximum index key length. However, the object_bytes column is defined as a LargeBinary (Spanner BYTES) in the EdgeModel. The Google Cloud Spanner Python client library expects bytes for BYTES columns. Passing a str will likely cause a TypeError during the batch insertion, leading to a crash and failure of the ingestion process for any data containing long strings.
Suggested change:

```diff
-    return base64.b64encode(val_bytes).decode("utf-8")
+    return base64.b64encode(val_bytes)
```
+1, i'm not sure why we'd decode the object_bytes value
I'll look into this as a follow up, but it appears to be writing the value properly to spanner 🤷
…e.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…e.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…e.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
```python
# Remove all CURIE namespaces before storing the node id
subject_id = strip_namespace(graph_node.id)
types = [strip_namespace(t) for t in types]
return NodeModel(
```
long term when we revisit the code: it might be helpful to be more explicit for what this NodeModel is, maybe DbNodeModel? to differentiate from all of the other ways "Node" is used in datacommons.
```python
val_truncated = val_bytes[:OBJECT_VALUE_MAX_LENGTH].decode(
    "utf-8", errors="ignore"
)
return val_truncated
```
do we need to worry about hashing at all for this value? if we're just truncating, is there possibility for index key collision?
Good call- added a todo to look into this. Maybe we can truncate the value and append it to a hash of the entire byte array.
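One way the TODO from this thread could be implemented, sketched under assumptions (the 1024-byte budget, SHA-256, and the `:` separator are all illustrative choices, not the PR's design):

```python
import hashlib

OBJECT_VALUE_MAX_LENGTH = 1024  # assumed index key budget

def truncate_with_hash(val_bytes: bytes) -> str:
    """Truncate a long value while avoiding index key collisions.

    Two long values sharing the same first OBJECT_VALUE_MAX_LENGTH bytes
    would otherwise truncate to identical index keys; appending a digest
    of the full payload keeps the keys distinct.
    """
    digest = hashlib.sha256(val_bytes).hexdigest()  # 64 hex chars
    keep = OBJECT_VALUE_MAX_LENGTH - len(digest) - 1  # room for separator
    prefix = val_bytes[:keep].decode("utf-8", errors="ignore")
    return f"{prefix}:{digest}"
```

The resulting key never exceeds the budget, and distinct payloads with a shared prefix map to distinct keys.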
```diff
-    object_value = sa.Column(Text(), nullable=True)
+    object_value = sa.Column(String(OBJECT_VALUE_MAX_LENGTH), nullable=True)
+    object_bytes = sa.Column(sa.LargeBinary(), nullable=True)
+    object_hash = sa.Column(String(64), primary_key=True, nullable=True)
```
what is the object_hash used for again?
I am not sure- @n-h-diaz do you know what this field is for in the spanner db?
```python
# Insert nodes and edges in batches
success_count = 0
try:
    for node_model_batch in node_model_batches:
```
what happens if nodeA points to nodeB which is in a later batch and not in the graph yet. I think this might cause an error trying to write nodeA in the first batch because the edge pointing to nodeB would say that nodeB wasn't found in the node table yet?
If that seems plausible, we can mark as a todo to handle this case. Or ignore if that's not actually an issue.
Good call- marking this as a todo for now.
this could apply to both nodes that are in later batches, and also nodes that are in remote knowledge graphs and not defined locally
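A simple way to sidestep the ordering problem described in this thread, as a sketch (function and table names are illustrative; the PR may resolve the TODO differently): commit every node batch before any edge batch, so an edge can never reference a locally defined node that has not been written yet. Nodes in remote knowledge graphs would still need separate handling.

```python
def ingest(node_batches, edge_batches, insert_batch):
    """Two-phase ingest: all nodes first, then all edges.

    `insert_batch(table, rows)` stands in for the Spanner batch write;
    the table names here are illustrative.
    """
    success_count = 0
    for node_batch in node_batches:  # phase 1: every local node exists first
        insert_batch("nodes", node_batch)
        success_count += len(node_batch)
    for edge_batch in edge_batches:  # phase 2: edges are now safe to write
        insert_batch("edges", edge_batch)
        success_count += len(edge_batch)
    return success_count
```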
Refactors graph node ingestion to use native Spanner batch mutations for performance, handles Spanner's index key limits with a bytes storage pattern, and adds tests for GraphService with mocked SQLAlchemy and Spanner clients.
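For reference, the native batch-mutation pattern the PR moves to looks roughly like the sketch below. The table and column names are illustrative, `BATCH_SIZE` is an assumed tuning value, and `database` is assumed to be a `google.cloud.spanner` Database object created elsewhere:

```python
BATCH_SIZE = 500  # assumed per-commit row budget, kept under Spanner's mutation limit

def chunk(rows, size=BATCH_SIZE):
    """Split rows so each Spanner commit stays within the mutation limit."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]

def write_nodes(database, rows):
    """Write node rows via native batch mutations, one commit per chunk."""
    for batch_rows in chunk(rows):
        # database.batch() is the google-cloud-spanner Batch context
        # manager; it commits the buffered mutations on exit.
        with database.batch() as batch:
            batch.insert_or_update(
                table="Node",
                columns=("subject_id", "name"),
                values=batch_rows,
            )
```

Compared with per-row ORM inserts, each commit here carries a whole chunk of mutations, which is the main source of the throughput gain the PR description claims.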