Add ES-based metadata retrieval #84
Merged
Changes from all commits (16 commits):
- 40a0eeb  housekeeping (shuchenliu)
- 1bdc550  add enum for t1 indices (shuchenliu)
- f663c11  add enum for t1 indices (shuchenliu)
- 554945d  add ES-based metadata extraction for t1 (shuchenliu)
- 3abf1f7  add tests for metadata retrieval and operation generation (shuchenliu)
- cd878cd  fix coverage test (shuchenliu)
- 76dcd21  add OperationNodes merging logic (shuchenliu)
- 1c3c5a8  add Operation merging logic (shuchenliu)
- ddaa338  Merge remote-tracking branch 'origin/main' into kg-meta (shuchenliu)
- a45fe1f  add simple hashing function for operations (shuchenliu)
- cd73bec  update driver to use merge_operations (shuchenliu)
- 0c6f3b4  skip metadata assertion for now (shuchenliu)
- ff39c73  remove ttl (shuchenliu)
- 81bdfe7  add method to get indices from ES covered by the production alias dyn… (shuchenliu)
- 64a40c2  Merge branch 'main' into kg-meta (shuchenliu)
- 4a59633  fix linting (shuchenliu)
Files changed:
- src/retriever/data_tiers/tier_1/elasticsearch/constraints/attributes/meta_info.py: 4 changes (3 additions & 1 deletion)
- ...retriever/data_tiers/tier_1/elasticsearch/constraints/attributes/ops/handle_comparison.py: 2 changes (1 addition & 1 deletion)
- src/retriever/data_tiers/tier_1/elasticsearch/constraints/attributes/ops/handle_match.py: 2 changes (1 addition & 1 deletion)
- ..._1/elasticsearch/constraints/qualifier.py → ...earch/constraints/qualifiers/qualifier.py: 2 changes (1 addition & 1 deletion)
- File renamed without changes.
New file, 236 additions:

```python
from collections import defaultdict
from copy import deepcopy
from typing import Any

import ormsgpack
from elasticsearch import AsyncElasticsearch
from loguru import logger as log

from retriever.config.general import CONFIG
from retriever.data_tiers.utils import (
    generate_operation,
    get_simple_op_hash,
    parse_dingo_metadata_unhashed,
)
from retriever.types.dingo import DINGOMetadata
from retriever.types.metakg import Operation, OperationNode, UnhashedOperation
from retriever.types.trapi import BiolinkEntity, Infores, MetaAttributeDict
from retriever.utils.redis import REDIS_CLIENT
from retriever.utils.trapi import hash_hex

T1MetaData = dict[str, Any]

CACHE_KEY = "TIER1_META"


async def get_t1_indices(
    client: AsyncElasticsearch,
) -> list[str]:
    """Fetch the list of indices from ES covered by the production alias."""
    resp = await client.indices.resolve_index(
        name=CONFIG.tier1.elasticsearch.index_name
    )
    if "aliases" not in resp:
        raise Exception(
            f"Failed to get indices from ES: {CONFIG.tier1.elasticsearch.index_name}"
        )

    backing_indices: list[str] = []
    for a in resp.get("aliases", []):
        if a["name"] == "dingo":
            backing_indices.extend(a["indices"])

    return backing_indices
```
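For illustration, the resolve_index response that get_t1_indices walks has roughly the shape below; the index and alias names are invented, and only the "aliases"/"name"/"indices" fields actually read by the function are shown.

```python
# Hypothetical resolve_index response (illustrative names only).
resp = {
    "indices": [],
    "aliases": [
        {"name": "dingo", "indices": ["dingo-2024-01", "dingo-2024-02"]},
        {"name": "other-alias", "indices": ["unrelated-index"]},
    ],
    "data_streams": [],
}

# The same filtering logic as get_t1_indices, applied to the sample response.
backing_indices: list[str] = []
for a in resp.get("aliases", []):
    if a["name"] == "dingo":
        backing_indices.extend(a["indices"])

print(backing_indices)  # ['dingo-2024-01', 'dingo-2024-02']
```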
```python
async def save_metadata_cache(key: str, payload: T1MetaData) -> None:
    """Wrapper to persist metadata in Redis."""
    await REDIS_CLIENT.set(
        hash_hex(hash(key)),
        ormsgpack.packb(payload),
        compress=True,
    )


async def read_metadata_cache(key: str) -> T1MetaData | None:
    """Wrapper to retrieve persisted metadata from Redis."""
    metadata_pack = await REDIS_CLIENT.get(hash_hex(hash(key)), compressed=True)
    if metadata_pack is not None:
        return ormsgpack.unpackb(metadata_pack)

    return None


def extract_metadata_entries_from_blob(
    blob: T1MetaData, indices: list[str]
) -> list[T1MetaData]:
    """Extract a list of metadata entries from the raw blob."""
    meta_entries: list[T1MetaData] = list(
        filter(
            None,
            [blob[index_name].get("graph") for index_name in indices],
        )
    )

    return meta_entries


async def retrieve_metadata_from_es(
    es_connection: AsyncElasticsearch, indices_alias: str
) -> T1MetaData:
    """Retrieve prefetched metadata from Elasticsearch index mappings."""
    mappings = await es_connection.indices.get_mapping(index=indices_alias)
    tier1_indices = await get_t1_indices(es_connection)

    # here we pull one metadata entry per backing index, instead of a single blob

    meta: T1MetaData = defaultdict(dict)
    for index_name in tier1_indices:
        raw = mappings[index_name]["mappings"]["_meta"]
        keys = ["graph", "release"]

        for key in keys:
            blob = raw.get(key)
            if blob:
                meta[index_name].update({key: blob})

    if not meta:
        raise ValueError("No metadata retrieved from Elasticsearch.")

    return meta
```
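To make the mapping traversal concrete, here is a sketch of the per-index _meta block that retrieve_metadata_from_es reads and the dict it builds from it; the index name, release value, and graph payload below are placeholders rather than real DINGO metadata.

```python
from collections import defaultdict

# Hypothetical get_mapping response for a single backing index.
mappings = {
    "dingo-2024-01": {
        "mappings": {
            "_meta": {
                "graph": {"example": "opaque DINGO graph metadata"},
                "release": "2024-01",
            }
        }
    }
}

# Mirrors the extraction loop above for one index.
meta: dict[str, dict] = defaultdict(dict)
for key in ["graph", "release"]:
    blob = mappings["dingo-2024-01"]["mappings"]["_meta"].get(key)
    if blob:
        meta["dingo-2024-01"].update({key: blob})

print(meta["dingo-2024-01"]["release"])  # 2024-01
```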
```python
RETRY_LIMIT = 3


async def get_t1_metadata(
    es_connection: AsyncElasticsearch | None, indices_alias: str, retries: int = 0
) -> T1MetaData | None:
    """Orchestrate retrieval of Tier 1 metadata, preferring the Redis cache."""
    meta_blob = await read_metadata_cache(CACHE_KEY)
    if not meta_blob:
        try:
            if es_connection is None:
                raise ValueError(
                    "Invalid Elasticsearch connection. Driver must be initialized and connected."
                )
            meta_blob = await retrieve_metadata_from_es(es_connection, indices_alias)
            await save_metadata_cache(CACHE_KEY, meta_blob)
        except ValueError as e:
            # if retries are exhausted or the ES connection is invalid, return None
            if retries == RETRY_LIMIT or str(e).startswith(
                "Invalid Elasticsearch connection"
            ):
                return None
            return await get_t1_metadata(es_connection, indices_alias, retries + 1)

    log.success("DINGO Metadata retrieved!")
    return meta_blob


def hash_meta_attribute(attr: MetaAttributeDict) -> int:
    """Hash a MetaAttributeDict for de-duplication."""
    keys = [
        "attribute_type_id",
        "attribute_source",
        "original_attribute_names",
        "constraint_use",
        "constraint_name",
    ]
    values: list[Any] = []
    for key in keys:
        val: list[str] | None = attr.get(key)
        if isinstance(val, list):
            values.append(tuple(val))
        else:
            values.append(val)
    return hash(tuple(values))
```
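A small sanity check of the hashing helper, run in the context of the module above and assuming MetaAttributeDict behaves like a plain dict at runtime (the function only reads it via .get()); the attribute values are invented for the example.

```python
# Two attribute entries with identical fields (hypothetical values).
attr_a = {
    "attribute_type_id": "biolink:publications",
    "attribute_source": "infores:example-source",
    "original_attribute_names": ["publications"],
    "constraint_use": None,
    "constraint_name": None,
}
attr_b = dict(attr_a)

# Equal field values hash identically, so dedupe_nodes keeps only one of them.
assert hash_meta_attribute(attr_a) == hash_meta_attribute(attr_b)
```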
```python
def merge_nodes(
    nodes: dict[BiolinkEntity, OperationNode],
    curr_nodes: dict[BiolinkEntity, OperationNode],
    infores: Infores,
) -> dict[BiolinkEntity, OperationNode]:
    """Merge OperationNodes generated."""
    for category, node in curr_nodes.items():
        # Category not seen before → initialize
        if category not in nodes:
            nodes[category] = deepcopy(node)
            continue

        existing = nodes[category]
        # Merge prefixes
        existing.prefixes[infores].extend(node.prefixes[infores])
        # Merge attributes
        existing.attributes[infores].extend(node.attributes[infores])

    return nodes


def dedupe_nodes(
    nodes: dict[BiolinkEntity, OperationNode], infores: Infores
) -> dict[BiolinkEntity, OperationNode]:
    """De-duplicate OperationNodes generated."""
    for current in nodes.values():
        current.prefixes[infores] = list(set(current.prefixes[infores]))

        seen_attr: set[int] = set()
        attrs: list[MetaAttributeDict] = current.attributes[infores]
        deduped: list[MetaAttributeDict] = []
        for attr in attrs:
            hash_code = hash_meta_attribute(attr)
            if hash_code not in seen_attr:
                deduped.append(attr)
                seen_attr.add(hash_code)

        current.attributes[infores] = deduped

    return nodes


def merge_operations(ops_unhashed: list[UnhashedOperation]) -> list[Operation]:
    """Merge duplicate operations."""
    seen_op = dict[str, Operation]()
    operations = list[Operation]()

    for op in ops_unhashed:
        op_hash = get_simple_op_hash(op)
        if op_hash not in seen_op:
            operation = generate_operation(op, op_hash)
            operations.append(operation)
            seen_op[op_hash] = operation
        else:
            # needs merging if seen
            hashed_op = seen_op[op_hash]
            if hashed_op.attributes is not None and op.attributes is not None:
                hashed_op.attributes.extend(op.attributes)
            if hashed_op.qualifiers is not None and op.qualifiers is not None:
                hashed_op.qualifiers.update(op.qualifiers)

    # ignoring access_metadata for now

    return operations


async def generate_operations(
    meta_entries: list[T1MetaData],
) -> tuple[list[Operation], dict[BiolinkEntity, OperationNode]]:
    """Generate operations and associated nodes based on metadata provided."""
    infores = Infores(CONFIG.tier1.backend_infores)

    operations_unhashed: list[UnhashedOperation] = []
    nodes: dict[BiolinkEntity, OperationNode] = {}

    for meta_entry in meta_entries:
        curr_ops, curr_nodes = parse_dingo_metadata_unhashed(
            DINGOMetadata(**meta_entry), 1, infores
        )
        operations_unhashed.extend(curr_ops)
        nodes = merge_nodes(nodes, curr_nodes, infores)

    operations = merge_operations(operations_unhashed)
    nodes = dedupe_nodes(nodes, infores)

    log.success(f"Parsed {infores} as a Tier 1 resource.")
    return operations, nodes
```
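A minimal sketch of how these pieces might be wired together by a driver. Only the function names come from the new module above; the Elasticsearch endpoint, the "dingo" alias argument, and the refresh_tier1_metakg wrapper are assumptions for illustration, and the module's import path is omitted since it is not shown in this diff.

```python
import asyncio

from elasticsearch import AsyncElasticsearch


async def refresh_tier1_metakg() -> None:
    # Hypothetical endpoint; a real driver would take this from configuration.
    es = AsyncElasticsearch("http://localhost:9200")
    try:
        # Cached-or-fetched metadata blob, keyed by backing index.
        blob = await get_t1_metadata(es, indices_alias="dingo")
        if blob is None:
            return  # cache miss and ES retrieval failed after retries

        indices = await get_t1_indices(es)
        entries = extract_metadata_entries_from_blob(blob, indices)
        operations, nodes = await generate_operations(entries)
        print(f"{len(operations)} operations, {len(nodes)} node categories")
    finally:
        await es.close()


if __name__ == "__main__":
    asyncio.run(refresh_tier1_metakg())
```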