From d0e173faed503d8087d6abbd606367ba901ddede Mon Sep 17 00:00:00 2001
From: Carlo Costino
Date: Mon, 11 Aug 2025 15:24:47 -0400
Subject: [PATCH 1/2] Add ADR 0015 (improving report generation) and update ADR listing

This changeset adds a new ADR, 0015 (Improve API stability when processing job data and generating reports), and updates the ADR log in the ADR README.

Signed-off-by: Carlo Costino
---
 docs/adrs/0015-async-report-generation.md | 48 +++++++++++++++++++++++
 docs/adrs/README.md | 10 ++++-
 2 files changed, 57 insertions(+), 1 deletion(-)
 create mode 100644 docs/adrs/0015-async-report-generation.md

diff --git a/docs/adrs/0015-async-report-generation.md b/docs/adrs/0015-async-report-generation.md
new file mode 100644
index 000000000..fe266ea0f
--- /dev/null
+++ b/docs/adrs/0015-async-report-generation.md
@@ -0,0 +1,48 @@
+# Improve API stability when processing job data and generating reports
+
+Status: Accepted
+Date: 11 August 2025
+
+### Context
+
+We're currently facing a scaling issue with our app that is leading to instability under increased load. At the moment, we have a job cache that is intended to help with the performance of generating reports based on all of the message sending jobs (both one-offs and batch sends). The idea was to cache the information being sent by these jobs so it could be quickly pulled when viewing the job status and/or 1, 3, 5, and 7 day reports.
+
+We also have the added constraint of not storing any PII in the system, meaning we do not have any PII in any DB or Redis instances, hence the decision to opt for some kind of in-memory cache within the app process for a short period (maximum of 7 days).
+
+This cache was ultimately implemented as a [`multiprocessing.managers.SyncManager.dict`](https://docs.python.org/3/library/multiprocessing.html#multiprocessing.managers.SyncManager.dict) - a proxy to a dictionary object that could be shared between processes.
However, this has proven problematic within our existing application architecture and environment, and is prone to failure and instability after a short period of use.
+
+We've tried thinking through other ways we might be able to make this work, but they all point to the need for a single app instance to serve as the entry point to the shared object, which will not work within the Cloud Foundry-based environment our system is hosted in.
+
+### Decision
+
+At this point, we will move away from the shared dictionary cache and instead rely solely on using asynchronous tasks with Celery to perform these actions. In other words, instead of trying to cache anything, we will simply let Celery tasks handle the workloads and poll their status so that once they're finished, we update the UI to let the user know that their requested data is ready.
+
+The message sending jobs already function this way, but the report generation will need to change slightly to be entirely task based instead of trying to read and utilize the job cache first. Additionally, we'll need a bit of UI work to provide an area or overlay that informs the user that their request report(s) is being generated and once it is, update it to provide a link to download the generated report.
+
+This approach is nearly identical to how [FEC.gov](https://fec.gov/) handles its data export feature, which are analogous to our 1, 3, 5, and 7 day reports. You can see an example of how this works by [going here](https://www.fec.gov/data/receipts/?data_type=processed&two_year_transaction_period=2024&min_date=10%2F12%2F2024&max_date=10%2F12%2F2024) and clicking on the `Export` button toward the top right.
+
+The full implementation is largely found in [this module within the FEC API code base](https://github.com/fecgov/openFEC/blob/develop/webservices/tasks/download.py).
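The submit-and-poll flow described in the Decision can be sketched with standard-library stand-ins. This is an illustration only, not the Notify or FEC code: `generate_report` and `poll_until_done` are hypothetical names, and `concurrent.futures` stands in for Celery so the example is self-contained.

```python
# Sketch of "hand the work to a task, poll until it's done, then update the UI".
# In the real app this would be a Celery task whose result the UI polls;
# here a thread pool plays the broker/worker role so the sketch is runnable.
import time
from concurrent.futures import Future, ThreadPoolExecutor


def generate_report(service_id: str, days: int) -> dict:
    # Placeholder for the real work: query the notification data for the
    # window, build the CSV, and upload it somewhere the UI can link to.
    time.sleep(0.1)  # simulate a long-running report build
    return {"service_id": service_id, "days": days, "status": "finished"}


def poll_until_done(future: Future, interval: float = 0.05) -> dict:
    # The UI-facing loop: keep checking task state, then surface the result.
    while not future.done():
        time.sleep(interval)
    return future.result()


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(generate_report, "service-123", 7)
        report = poll_until_done(pending)
        print(report["status"])  # -> finished
```

The key property is that no state is shared between app instances: the only coordination is the task's own status, which is exactly what the ADR is trading the shared dictionary for.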
+
+We did try shifting the creation and management of the shared dictionary for the job cache to the main application instances and this initially improved things, but it fell over fast once a few reports were generated. A host of connection refused errors started cropping up, and any action that touched the job cache, including sending messages, resulted in an error being displayed to users (even though the action still succeeded in the case of sending messages).
+
+Further investigation into using anything with [`multiprocessing`](https://docs.python.org/3/library/multiprocessing.html) showed that many other moving parts would need to be introduced to use it properly, which would greatly increase the complexity of our application architecture and introduce more brittle moving parts.
+
+### Consequences
+
+By making this switch to using Celery itself for our longer-running actions performed by users, we trade off some performance for reduced complexity and stability. Asynchronous tasks aren't necessarily simple either, but it's a known paradigm and by simply leveraging them and not trying to do additional processing for performance gains, we can go with a tried and true approach.
+
+We will need to do some additional UI work to support the switch to using tasks directly, but this is minimal in nature compared to trying to fully support a shared cache solely within a running app process(es).
+
+However, we will gain a significant amount of application stability and resiliency without having to increase our resource usage by taking this approach. We will also buy ourselves time and breathing room to take another look at the performance afterward and see what we can do to improve the report generation in the future with other approaches to how we process the data under the hood.
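For context on the failure mode described above, here is a minimal standard-library sketch of the `SyncManager.dict` pattern the ADR is abandoning (names are illustrative; the real job cache was more involved). Every access to the proxy is a round trip to the single manager process that owns the dictionary, which is why app instances that lose sight of that process surface connection-refused errors:

```python
# Minimal sketch of the shared-dictionary approach being moved away from.
# The Manager starts ONE server process holding the real dict; every other
# process only gets a proxy that talks to it over a socket/pipe.
from multiprocessing import Manager, Process


def record_job(shared_cache, job_id, payload):
    # Runs in a worker process; the assignment below is proxied over to the
    # manager process that owns the actual dictionary.
    shared_cache[job_id] = payload


def demo() -> str:
    with Manager() as manager:
        cache = manager.dict()  # SyncManager.dict proxy
        worker = Process(target=record_job, args=(cache, "job-1", "cached"))
        worker.start()
        worker.join()
        return cache["job-1"]  # valid only while the manager process lives


if __name__ == "__main__":
    print(demo())
```

This works within one process tree, but across independently scheduled Cloud Foundry app instances there is no such shared parent, which is the architectural mismatch the ADR describes.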
+ +### Author + +@ccostino + +### Stakeholders + +@ccostino, @CathyBeil + +### Next Steps + +Already implemented diff --git a/docs/adrs/README.md b/docs/adrs/README.md index 0e621c582..0d1d92292 100644 --- a/docs/adrs/README.md +++ b/docs/adrs/README.md @@ -180,7 +180,15 @@ top!). | ADR | TITLE | CURRENT STATUS | IMPLEMENTED | LAST MODIFIED | |:------------------------------------------------------------:|:-------------------------------------------------------------------------------------------------:|:--------------:|:-----------:|:-------------:| -| [ADR-0009](./0009-adr-implement-backstopjs-to-improve-qa.md) | [Use backstopJS for QA Improvement within Admin Project](./0006-use-for-dependency-management.md) | Accepted | No | 08/27/2024 | +| [ADR-0015](./0015-async-report-generation.md) | [Improve API stability when processing job data and generating reports](./0015-async-report-generation.md) | Accepted | Yes | 08/11/2025 | +| [ADR-0014](./0014-adr-localize-notifications-python-client.md) | [Bring in the notifications-python-client directly to the API and Admin apps](./0014-adr-localize-notifications-python-client.md) | Accepted | Yes | 06/10/2025 | +| [ADR-0013](./0013-log-debug-tags.md) | [Use of debug search tags for cloudwatch logs](./0013-log-debug-tags.md) | Accepted | Yes | 02/27/2025 | +| [ADR-0012](./0012-adr-report-generation.md) | [Optimize processing of delivery receipts](./0012-adr-report-generation.md) | Accepted | Yes | 02/12/2025 | +| [ADR-0011](./0011-adr-delivery-receipts-updates.md) | [Optimize processing of delivery receipts](./0011-adr-delivery-receipts-updates.md) | Accepted | Yes | 01/22/2025 | +| [ADR-0010](./0010-adr-celery-pool-support-best-practice.md) | [Make best use of celery worker pools](./0010-adr-celery-pool-support-best-practice.md) | Accepted | Yes | 01/07/2025 | +| [ADR-0009](./0009-adr-implement-backstopjs-to-improve-qa.md) | [Use backstopJS for QA Improvement within Admin 
Project](./0009-adr-implement-backstopjs-to-improve-qa.md) | Accepted | Yes | 08/27/2024 |
+| [ADR-0008](./0008-adr-handle-paid-quotas-at-the-organization-level.md) | [Handle paid quotas at the organization level](./0008-adr-handle-paid-quotas-at-the-organization-level.md) | Accepted | No | 09/30/2023 |
+| [ADR-0007](./0007-adr-manage-total-record-retention-dynamically.md) | [Manage total record retention dynamically](./0007-adr-manage-total-record-retention-dynamically.md) | Accepted | Yes | 11/07/2023 |
 | [ADR-0006](./0006-use-for-dependency-management.md) | [Use `poetry` for Dependency Management](./0006-use-for-dependency-management.md) | Accepted | Yes | 09/08/2023 |
 | [ADR-0005](./0005-agreement-data-model.md) | [Agreement info in data model](./0005-agreement-data-model.md) | Accepted | No | 07/05/2023 |
 | [ADR-0004](./0004-designing-pilot-content-visibility.md) | [Designing Pilot Content Visibility](./0004-designing-pilot-content-visibility.md) | Proposed | No | 06/20/2023 |

From 54c801d8f1ba166b3217f7d351bf20e529a8d6a9 Mon Sep 17 00:00:00 2001
From: Carlo Costino
Date: Tue, 12 Aug 2025 16:12:50 -0400
Subject: [PATCH 2/2] Updated the contents of the ADR to be accurate

Thanks @terrazoon for pointing these items out!

Signed-off-by: Carlo Costino
---
 docs/adrs/0015-async-report-generation.md | 14 +++++---------
 1 file changed, 5 insertions(+), 9 deletions(-)

diff --git a/docs/adrs/0015-async-report-generation.md b/docs/adrs/0015-async-report-generation.md
index fe266ea0f..322e9eb64 100644
--- a/docs/adrs/0015-async-report-generation.md
+++ b/docs/adrs/0015-async-report-generation.md
@@ -15,13 +15,9 @@ We've tried thinking through other ways we might be able to make this work, but
 
 ### Decision
 
-At this point, we will move away from the shared dictionary cache and instead rely solely on using asynchronous tasks with Celery to perform these actions.
In other words, instead of trying to cache anything, we will simply let Celery tasks handle the workloads and poll their status so that once they're finished, we update the UI to let the user know that their requested data is ready.
+At this point, we will move away from the shared dictionary cache and convert it to a regular Python dictionary that is regenerated at app startup and refreshed every 30 minutes; this will continue to be used as a cache for phone numbers and personalization. Furthermore, the reports themselves will be switched to being regenerated each night via asynchronous tasks. The generated reports are stored in an S3 bucket for a maximum of 7 days.
-
-The message sending jobs already function this way, but the report generation will need to change slightly to be entirely task based instead of trying to read and utilize the job cache first. Additionally, we'll need a bit of UI work to provide an area or overlay that informs the user that their request report(s) is being generated and once it is, update it to provide a link to download the generated report.
-
-This approach is nearly identical to how [FEC.gov](https://fec.gov/) handles its data export feature, which are analogous to our 1, 3, 5, and 7 day reports. You can see an example of how this works by [going here](https://www.fec.gov/data/receipts/?data_type=processed&two_year_transaction_period=2024&min_date=10%2F12%2F2024&max_date=10%2F12%2F2024) and clicking on the `Export` button toward the top right.
-
-The full implementation is largely found in [this module within the FEC API code base](https://github.com/fecgov/openFEC/blob/develop/webservices/tasks/download.py).
+
+Additionally, we'll need a bit of UI work to update the links to the reports and some additional language to indicate that they are regenerated every 24 hours and won't contain any of the current day's send information until the following day.
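The revised decision's per-process dictionary could look roughly like the sketch below. The names, the loader callback, and the wiring into app startup are assumptions for illustration; only the 30-minute refresh interval comes from the ADR.

```python
# Hypothetical sketch of the per-process cache: a plain dict built at app
# startup and rebuilt in place once it is older than the refresh interval.
# What gets loaded (phone numbers, personalization) is up to the real loader.
import time

REFRESH_SECONDS = 30 * 60  # 30-minute refresh interval, per the ADR


class RefreshingCache:
    def __init__(self, loader, refresh_seconds: float = REFRESH_SECONDS):
        self._loader = loader
        self._refresh_seconds = refresh_seconds
        self._data = loader()  # regenerated at app startup
        self._loaded_at = time.monotonic()

    def get(self, key, default=None):
        # Rebuild the whole dict once it is stale, then serve from memory.
        if time.monotonic() - self._loaded_at >= self._refresh_seconds:
            self._data = self._loader()
            self._loaded_at = time.monotonic()
        return self._data.get(key, default)
```

Because the dictionary lives inside each app process, nothing is shared between instances; each one pays its own refresh cost, which is the simplicity-for-freshness trade the ADR accepts.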
 We did try shifting the creation and management of the shared dictionary for the job cache to the main application instances and this initially improved things, but it fell over fast once a few reports were generated. A host of connection refused errors started cropping up, and any action that touched the job cache, including sending messages, resulted in an error being displayed to users (even though the action still succeeded in the case of sending messages).
 
@@ -29,11 +25,11 @@ Further investigation into using anything with [`multiprocessing`](https://docs.
 
 ### Consequences
 
-By making this switch to using Celery itself for our longer-running actions performed by users, we trade off some performance for reduced complexity and stability. Asynchronous tasks aren't necessarily simple either, but it's a known paradigm and by simply leveraging them and not trying to do additional processing for performance gains, we can go with a tried and true approach.
+By making this switch to using Celery itself for our longer-running actions performed by users, we trade off some data freshness for improved performance and stability. Asynchronous tasks aren't necessarily simple either, but it's a known paradigm and by simply leveraging them and not trying to do additional processing for performance gains, we can go with a tried and true approach.
 
-We will need to do some additional UI work to support the switch to using tasks directly, but this is minimal in nature compared to trying to fully support a shared cache solely within a running app process(es).
+We will need to do some additional UI work to support the switch to using the nightly-generated reports, but this is minimal compared to trying to fully support a shared cache solely within the running app processes.
 
-However, we will gain a significant amount of application stability and resiliency without having to increase our resource usage by taking this approach.
We will also buy ourselves time and breathing room to take another look at the performance afterward and see what we can do to improve the report generation in the future with other approaches to how we process the data under the hood. +Again, we will gain a significant amount of application stability and resiliency without having to increase our resource usage by taking this approach. We will also buy ourselves time and breathing room to take another look at the performance afterward and see what we can do to improve the report generation in the future with other approaches to how we process the data under the hood. ### Author