Conversation

@brian-kim31 (Member)

⚡️ Optimization

PR Description

The bulk inserts in update_temp_data_dictionary_table() and create_temp_field_values_table() were using pg_hook.run() in loops, which meant one INSERT per row. For large scan reports with thousands of records, this was slow and could cause DAG timeouts.
This PR switches to psycopg2.extras.execute_values() to batch all rows into a single INSERT, which is much faster and avoids the timeout issues.
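For context, a minimal sketch of the batched approach described above, assuming `pg_hook` is an Airflow PostgresHook; the table and column names here are illustrative, not the PR's actual ones:

```python
from psycopg2.extras import execute_values

def bulk_insert(pg_hook, rows):
    # Batch all rows into one INSERT ... VALUES %s statement instead of
    # issuing one INSERT per row via pg_hook.run().
    conn = pg_hook.get_conn()
    cursor = conn.cursor()
    try:
        execute_values(
            cursor,
            "INSERT INTO temp_field_values (field_name, value) VALUES %s",
            rows,
            page_size=1000,  # execute_values still sends rows in pages of this size
        )
        conn.commit()
    finally:
        cursor.close()
        conn.close()
```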

Related Issues or other material

Related #1226
Closes #1226

Screenshots, example outputs/behaviour etc.

Below is a screenshot of the new time it takes to upload the scan report and data dictionary.

Screenshot 2025-11-26 at 17 54 53

@brian-kim31 brian-kim31 self-assigned this Nov 26, 2025
@brian-kim31 brian-kim31 changed the title Improve speed of writing to DB Enhanced scan report upload speed Nov 26, 2025
@prquinlan (Contributor) left a comment


Comments added to ensure we have a mechanism to manage large inputs and a batching strategy.

conn = pg_hook.get_conn()
cursor = conn.cursor()
try:
    execute_values(
Contributor:

It would be good to have some sanity checking and batching on this input; at the moment I think it could be of any size, and we probably want batches of a certain size.

Previously pg_hook.insert_rows did this automatically, so we would want something similar to ensure this is robust.
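A rough sketch of the kind of batching being asked for here - chunking the input before it reaches the database, with an illustrative batch size:

```python
from psycopg2.extras import execute_values

BATCH_SIZE = 1000  # illustrative; Airflow's insert_rows defaults to commit_every=1000

def insert_in_batches(cursor, sql, rows, batch_size=BATCH_SIZE):
    # Split a potentially very large input into fixed-size batches so a single
    # call never tries to send an unbounded number of rows at once.
    for start in range(0, len(rows), batch_size):
        execute_values(cursor, sql, rows[start:start + batch_size])
```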

conn = pg_hook.get_conn()
cursor = conn.cursor()
try:
    execute_values(
Contributor:

Same here as well.

conn = pg_hook.get_conn()
cursor = conn.cursor()
try:
    execute_values(
Member:

I see this replaces insert_rows with a direct psycopg2 call. Could you clarify why we're bypassing the Airflow hook rather than tuning its batching via the insert_rows parameters?

I thought that's where we left it on Tuesday - so I'm curious why you have moved on to reimplementing Airflow's logic?

Using psycopg2 directly adds a second DB access pattern, and we should weigh the trade-off we are accepting by going down to the driver level.

Member Author (@brian-kim31):

I tried fast_executemany yesterday, but it wasn't working. I did some research and found that you cannot use fast_executemany with Airflow's PostgresHook because it is a pyodbc feature intended for SQL Server, while the PostgresHook uses psycopg2, which does not support it.

Member:

Ah, that's interesting - what do you mean by "it's not working"?

I'm surprised as this is from the Postgres provider documentation: https://airflow.apache.org/docs/apache-airflow-providers-postgres/stable/_api/airflow/providers/postgres/hooks/postgres/index.html

And is very much in the Postgres code: https://airflow.apache.org/docs/apache-airflow-providers-postgres/stable/_modules/airflow/providers/postgres/hooks/postgres.html#PostgresHook.insert_rows

And in psycopg2: https://www.psycopg.org/docs/extras.html#fast-execution-helpers

Do you have an error message?

There is also more than fast_executemany to tune - so understanding what is not working before we drop down to the driver level would be good.

@brian-kim31 (Member Author)

@AndyRae @prquinlan I have decided to simplify the logic by keeping the initial implementation. I will use fast_executemany=True, which allows for optimized bulk execution, and set commit_every=3000 to write more rows per batch and reduce the number of round trips. I have tested this locally, and the performance has improved: it took 2 minutes 27 seconds.
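Roughly, that call would look like the sketch below; the table and column names are illustrative, not the PR's actual ones:

```python
pg_hook.insert_rows(
    table="temp_field_values",              # illustrative name
    rows=rows,
    target_fields=["field_name", "value"],  # illustrative columns
    commit_every=3000,       # commit in batches of 3000 rows to cut round trips
    fast_executemany=True,   # optimized bulk execution, as described above
)
```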

I would suggest we test it on dev and see the outcome.

Screenshot 2025-11-27 at 13 19 35

@prquinlan (Contributor)

Yeah, I think this is the best way. I expect you'll see more improvement when you move to dev, as the latency of each round trip is higher.

Maybe it's also worth trying different commit_every values locally first - if you increase it, do you get incremental improvements?

@brian-kim31 (Member Author)

> Yeah, I think this is the best way. I expect you'll see more improvement when you move to dev, as the latency of each round trip is higher.
>
> Maybe it's also worth trying different commit_every values locally first - if you increase it, do you get incremental improvements?

Yes, that is possible. The upload is taking place on dev as we speak. I am monitoring it to see how long it takes to complete, then we can see if we should increase commit_every.

@prquinlan (Contributor)

> > Yeah, I think this is the best way. I expect you'll see more improvement when you move to dev, as the latency of each round trip is higher.
> > Maybe it's also worth trying different commit_every values locally first - if you increase it, do you get incremental improvements?
>
> Yes, that is possible. The upload is taking place on dev as we speak. I am monitoring it to see how long it takes to complete, then we can see if we should increase commit_every.

Perfect.

I would try different batch sizes locally to find a good setting, then test on dev.
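For that local comparison, a rough timing harness might look like this (names are illustrative):

```python
import time

for commit_every in (1000, 3000, 6000, 12000):
    start = time.monotonic()
    pg_hook.insert_rows(
        table="temp_field_values",  # illustrative
        rows=rows,
        commit_every=commit_every,
        fast_executemany=True,
    )
    print(f"commit_every={commit_every}: {time.monotonic() - start:.1f}s")
```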

@brian-kim31 (Member Author) commented Nov 27, 2025

On trying with a commit_every of 3000, it took 59 minutes. No dramatic change. I'm going to try 6000 and see the result.

Locally, trying higher commit sizes doesn't result in lower times either; the timings fluctuate from run to run, but all are under 3 minutes.

@prquinlan @AndyRae

Screenshot 2025-11-27 at 14 59 07

@AndyRae (Member) commented Nov 27, 2025

> On trying with a commit_every of 3000, it took 59 minutes. No dramatic change. I'm going to try 6000 and see the result.

That is interesting - what do the Airflow logs say is being run on the database?

@brian-kim31 (Member Author)

> > On trying with a commit_every of 3000, it took 59 minutes. No dramatic change. I'm going to try 6000 and see the result.
>
> That is interesting - what do the Airflow logs say is being run on the database?

The logs say everything was successful, and it was written with a batch size of 3000 as instructed.

"value_description",
],
fast_executemany=True,
commit_every=0, # commit every row to avoid transaction overhead
Member:

It'd be helpful to have this as an environment variable in config, so we can tune it through configuration rather than a redeployment.
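A sketch of that suggestion, using a hypothetical variable name: read the batch size from the environment with a sensible default, so it can be tuned per environment without a redeployment.

```python
import os

# BULK_INSERT_COMMIT_EVERY is a hypothetical setting name, not an existing one.
COMMIT_EVERY = int(os.environ.get("BULK_INSERT_COMMIT_EVERY", "3000"))

pg_hook.insert_rows(
    table="temp_field_values",  # illustrative
    rows=rows,
    commit_every=COMMIT_EVERY,
    fast_executemany=True,
)
```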



Development

Successfully merging this pull request may close these issues.

Optimize bulk inserts in scan report processing to prevent timeouts
