Feat: add origins in update db script #1409

Open: wants to merge 3 commits into main

@MarceloRobert (Collaborator) commented Aug 20, 2025

Description

The update_db script, which moves data from kcidb to a separate db, can take an optional origins filter so that we avoid querying unnecessary data. In addition, the current script still tries to honor the foreign key constraint by limiting results to those that have related data in the target db; that should be optional.

Changes

  • Some minor improvements to the command (args organization, typing)
  • Added the --related-data-only argument, making the filter for related data optional
  • Added the --origins argument, which receives a comma-separated string of origins to filter by, and updated the queries accordingly
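
As a rough illustration of the --origins handling described above, the sketch below splits a comma-separated value and builds a parameterized SQL fragment in the same placeholder style the script uses for related ids. The function names `parse_origins` and `build_origin_condition`, the default column name `origin`, and the sample origin values are assumptions made for this example; they are not taken from the PR.

```python
from typing import Optional


def parse_origins(raw: Optional[str]) -> list[str]:
    """Split a comma-separated --origins value into a clean list (hypothetical helper)."""
    if not raw:
        return []
    # Drop surrounding whitespace and ignore empty entries such as "a,,b".
    return [part.strip() for part in raw.split(",") if part.strip()]


def build_origin_condition(origins: list[str], column: str = "origin") -> str:
    """Return a parameterized SQL fragment, or "" when no origin filter applies."""
    if not origins:
        return ""
    # One %s placeholder per origin; the values are passed separately as query params.
    placeholders = ",".join(["%s"] * len(origins))
    return f"AND {column} IN ({placeholders})"


# Sample values only; any origin strings would work the same way.
origins = parse_origins("maestro, redhat,,")
origin_condition = build_origin_condition(origins)
```

Keeping the values out of the SQL string and passing them as parameters avoids quoting problems with user-supplied origin names.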

How to test

Run the update_db command with any interval and test the --related-data-only argument (when set, the command should update fewer items) and the --origins argument. Combine origins to check that the counts are right (you can also check in kcidb directly if you want).

Closes #1401

@MarceloRobert MarceloRobert self-assigned this Aug 20, 2025
@MarceloRobert MarceloRobert force-pushed the feat/add-origins-in-populate-db branch from a61c5ea to 164b4ea Compare August 21, 2025 13:03
@MarceloRobert MarceloRobert marked this pull request as ready for review August 21, 2025 13:17
Comment on lines 474 to 509
if self.related_data_only:
    existing_issues_ids = set(
        Issues.objects.using(self.dashboard_conn_name).values_list(
            "id", flat=True
        )
    )

    if len(existing_issues_ids) == 0:
        return []
    issue_id_placeholders = ",".join(["%s"] * len(existing_issues_ids))
    related_condition = f"AND issue_id IN ({issue_id_placeholders})"
Collaborator

Besides the query, this code seems to be duplicated many times. What do you think?

Collaborator

Can you refactor this into a wrapper, maybe?

Collaborator Author

I can see that I could use a function to replace

issue_id_placeholders = ",".join(["%s"] * len(existing_issues_ids))
related_condition = f"AND issue_id IN ({issue_id_placeholders})"

but the rest seems like it would need to stay. Are you thinking of something else, or is that it?

Collaborator

I left another comment about it

Comment on lines +59 to +51
self.origins: list[str]
self.origin_condition: str
Collaborator

Why not initialize them or make them Optional, the same as the other fields?

Collaborator Author

I figured that I didn't really need to initialize them or use Optional if the first thing the command does is assign those variables. Do you think it's better with the assignment or without it?

Collaborator

start_interval, end_interval, and related_data_only were also assigned as the first thing, and you still initialized them
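
For context on the thread above, here is a small sketch of how the two styles behave at runtime. The `Command` class is a hypothetical stand-in, not the actual management command, and the values passed to `handle` are samples. A bare annotation such as `self.origins: list[str]` records the intended type for type checkers but does not create the attribute, so reading it before the first assignment raises `AttributeError`.

```python
from typing import Optional


class Command:
    """Hypothetical stand-in for the management command class."""

    def __init__(self) -> None:
        # Bare annotation: documents the type, but creates no attribute.
        self.origins: list[str]
        # Annotated and initialized: the attribute exists right away.
        self.start_interval: Optional[str] = None

    def handle(self, origins: list[str], start_interval: str) -> None:
        # First thing the command does: assign the real values.
        self.origins = origins
        self.start_interval = start_interval


cmd = Command()
print(hasattr(cmd, "origins"))         # False until handle() runs
print(hasattr(cmd, "start_interval"))  # True
cmd.handle(["maestro"], "5 hours")     # sample values
print(cmd.origins)
```

Either style works if every code path assigns the fields before reading them; initializing (or using Optional) mainly buys consistency with the other fields and a safer default.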

Since we removed the foreign key constraints, we don't need to limit the selected data to the related ones. With this option we can select if we want to fetch only related items or not.
@MarceloRobert MarceloRobert force-pushed the feat/add-origins-in-populate-db branch from 164b4ea to e0fb298 Compare August 21, 2025 22:28
Comment on lines +305 to +322
if self.related_data_only:
    checkout_ids = set(
        Checkouts.objects.using(self.dashboard_conn_name)
        .filter(
            field_timestamp__gte=self.start_timestamp,
            field_timestamp__lte=self.end_timestamp,
        )
        .values_list("id", flat=True)
    )

    if len(checkout_ids) == 0:
        return []
    checkout_id_placeholders = ",".join(["%s"] * len(checkout_ids))
    related_condition = (
        f"AND builds.checkout_id IN ({checkout_id_placeholders})"
    )
Collaborator

You can create a method that returns a tuple of a string and a set.

Collaborator

That way you avoid duplicating code: just take as parameters an array with the ids and the field for related_condition (e.g. builds.checkout_id).
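
One possible shape for the method suggested in this thread is sketched below. The name `build_related_condition` and its signature are assumptions for illustration, and it is written as a plain function taking the ids as an argument so it runs without Django; in the command it could be a method fed by the `values_list("id", flat=True)` queries shown above.

```python
from typing import Iterable


def build_related_condition(ids: Iterable[str], field: str) -> tuple[str, set[str]]:
    """Return (SQL fragment, unique ids) for an IN filter on `field`.

    Hypothetical helper: the fragment is "" when there are no ids, so the
    caller can decide to return early, as the duplicated blocks do today.
    """
    unique_ids = set(ids)
    if not unique_ids:
        return "", unique_ids
    # One %s placeholder per id; the ids themselves go in the query params.
    placeholders = ",".join(["%s"] * len(unique_ids))
    return f"AND {field} IN ({placeholders})", unique_ids


# Sample ids only; duplicates collapse because the helper builds a set.
related_condition, checkout_ids = build_related_condition(
    ["c1", "c2", "c1"], "builds.checkout_id"
)
```

The same call with `"build_id"` or `"issue_id"` as the field would cover the other two blocks flagged in this review.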

Comment on lines +399 to +412
if self.related_data_only:
    existing_build_ids = set(
        Builds.objects.using(self.dashboard_conn_name)
        .filter(
            field_timestamp__gte=self.start_timestamp,
            field_timestamp__lte=self.end_timestamp,
        )
        .values_list("id", flat=True)
    )

    if len(existing_build_ids) == 0:
        return []
    build_id_placeholders = ",".join(["%s"] * len(existing_build_ids))
    related_condition = f"AND build_id IN ({build_id_placeholders})"
Collaborator

This method can be used here.

Comment on lines +499 to +509
if self.related_data_only:
    existing_issues_ids = set(
        Issues.objects.using(self.dashboard_conn_name).values_list(
            "id", flat=True
        )
    )

    if len(existing_issues_ids) == 0:
        return []
    issue_id_placeholders = ",".join(["%s"] * len(existing_issues_ids))
    related_condition = f"AND issue_id IN ({issue_id_placeholders})"
Collaborator

and here

Development

Successfully merging this pull request may close these issues.

Specify origins on populate_db script
2 participants