
Commit e64dc2a

Merge pull request #2 from apoplexi24/sqllite-cache
Added post for sqlitecache
2 parents 855bdf5 + de1f25a commit e64dc2a

File tree

8 files changed: +171 -8 lines changed

_config.yml

Lines changed: 8 additions & 8 deletions

@@ -9,17 +9,17 @@ theme: jekyll-theme-chirpy
 lang: en

 # Change to your timezone › https://kevinnovak.github.io/Time-Zone-Picker
-timezone: Asia/Shanghai
+timezone: Asia/Kolkata

 # jekyll-seo-tag settings › https://github.com/jekyll/jekyll-seo-tag/blob/master/docs/usage.md
 # ↓ --------------------------

-title: Chirpy # the main title
+title: TechRant by Apoplexi # the main title

-tagline: A text-focused Jekyll theme # it will display as the subtitle
+tagline: This blog is an amalgamation of my experiences in tech, unfiltered and unedited. # it will display as the subtitle

 description: >- # used by seo meta and the atom feed
-  A minimal, responsive and feature-rich Jekyll theme for technical writing.
+  A tech blog by the self proclaimed tech guru, Sir Apoplexi.

 # Fill in the protocol & hostname for your site.
 # E.g. 'https://username.github.io', note that it does not end with a '/'.
@@ -35,10 +35,10 @@ social:
   # Change to your full name.
   # It will be displayed as the default author of the posts and the copyright owner in the Footer
   name: apoplexi24
-  email: example@domain.com # change to your email address
+  email: shivandanasharma6@gmail.com # change to your email address
   links:
     # The first element serves as the copyright owner's link
-    - https://twitter.com/username # change to your Twitter homepage
+    - https://twitter.com/apoplexi24 # change to your Twitter homepage
     - https://github.com/apoplexi24 # change to your GitHub homepage
     # Uncomment below to add more social links
     # - https://www.facebook.com/username
@@ -59,7 +59,7 @@ webmaster_verifications:
 # Web Analytics Settings
 analytics:
   google:
-    id: # fill in your Google Analytics ID
+    id: G-NCTGJDH96B # fill in your Google Analytics ID
   goatcounter:
     id: # fill in your GoatCounter ID
   umami:
@@ -95,7 +95,7 @@ theme_mode: # [light | dark]
 # will be added to all media resources (site avatar, posts' images, audio and video files) paths starting with '/'
 #
 # e.g. 'https://cdn.com'
-cdn: "https://chirpy-img.netlify.app"
+# cdn: "https://chirpy-img.netlify.app"

 # the avatar on sidebar, support local or CORS resources
 avatar: "/commons/avatar.jpg"

_posts/2025-03-21-sqlitecache.md

Lines changed: 163 additions & 0 deletions

---
title: Setting up SQLite as cache (and why it was a bad idea)
author: apoplexi24
date: 2025-03-21 10:00:00 +0800
categories: [Development, Database]
tags: [sqlite, caching, python, fastapi, optimization]
pin: false
math: true
mermaid: false
image:
  path: https://cdn.jsdelivr.net/gh/apoplexi24/blog-assets@main/sqlite-cache/img/sqlite-fragmentation-2.jpg
  alt: SQLite meme showing its limitations
---

Cache is every backend developer's wet dream. Having your variables loaded onto your RAM is very convenient indeed, but sometimes the RAM does become a bottleneck when you become overambitious and try to cache everything. It's like that one time you tried to memorize the entire dictionary – noble, but ultimately leading to a headache and a newfound appreciation for Ctrl+F. I, too, fell victim to the siren song of "instantaneous data access," and decided that SQLite, the plucky little database that could, would be my caching savior.

Spoiler alert: it wasn't *all* sunshine and rainbows. In fact, it was more like a partly cloudy day with a high chance of existential dread and a few "why did I do this to myself?" moments of contemplation. This note chronicles my journey into the heart of caching darkness, armed with nothing but good intentions, questionable assumptions, and a database engine that's probably still judging me.

<img src="https://cdn.jsdelivr.net/gh/apoplexi24/blog-assets@main/sqlite-cache/img/sqlite-gaussianbell.webp" alt="SQLite Meme" />
_When SQLite seems like the perfect solution... at first_

## The Beginning of the End

It all starts with non-technical stakeholders (it always does). Our accountants, who are well versed in Excel, wanted to open a .dbf file that comes from a vendor.

> A .dbf file is a database file, commonly associated with the dBASE database management system. It's a file format that stores structured data in a table format, with rows and columns, similar to a spreadsheet or a database table.
{: .prompt-info }

Under normal circumstances, our good old reliable MS Excel can open the damned .dbf file. But then again, every good thing has a 'but' attached to it. The 'but' in MS Excel's case is the memory and computation required to open big files. Oh, and how can I forget, Excel cannot handle files with more than 1,048,576 rows[^1], making it nearly impossible to work with big data in Excel (who would have known that Excel is not a database **/s**).

So my manager aptly suggested we use a database connector to MS Access, so that we could pull the relevant data into MS Excel when needed. But greed gets to the best of us all, as the stakeholder wanted to load the entire file with all rows in memory for aggregation, to see all of it at once in one place.

## The Project (In Brief, Hopefully)

I'll dive straight into the data aspect of the project and save the nuances for another post.
This was the tech stack I used to deploy the endpoint:
* FastAPI for backend
* Nginx for reverse proxy
* HTMX for frontend

### Reasons to Cache
* Improve the loading time of files rather than reading them from disk every time
* Store uploaded files on disk and load only the current file into the cache
* Eliminate re-uploading of files, as it takes the most time in the process

One of the features the stakeholders wanted was fast load times. The issue was that loading the file into a pandas dataframe every time the data is needed would take a lot of time. To overcome that, one could just use a global variable that is loaded once per session, and then make copies in each function so that the dataframe held in the global variable stays untouched.

```python
##### rest of the code here ########
# yeah I like type hinting the obvious, bite me
currently_loaded_dataframe: pd.DataFrame = None
filtered_dataframe: pd.DataFrame = None


@app.post('/apply_filter_on_table')
async def post_endpoint(req: Request):
    global currently_loaded_dataframe, filtered_dataframe
    req_as_dict: dict = await req.json()
    # work on a copy so the globally cached dataframe stays untouched
    copy_df: pd.DataFrame = currently_loaded_dataframe.copy()
    ####### filter operations on copy_df based on req params #######
    filtered_dataframe = copy_df
    # defining api response here
    return response
```

This seemed like a good enough implementation for one user on one browser at a time, and good enough for a POC, since we don't want to prematurely optimize and over-engineer things. If I wanted to scale it up, I would use a FIFO ordered dictionary with a fixed number of keys:

```python
from collections import OrderedDict


class CustomFifoCache(OrderedDict):
    def __init__(self, capacity):
        super().__init__()
        self._capacity = capacity

    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        if len(self) > self._capacity:
            self.popitem(last=False)  # yeet the oldest item (FIFO)


loaded_dataframe_dict: CustomFifoCache = CustomFifoCache(capacity=3)


@app.post('/upload_file')
async def upload_large_file(file_id: str = Form(...),
                            file: UploadFile = File(...)):
    global loaded_dataframe_dict
    destination = os.path.join(BASE_FILES_DIR, file_id + '.dbf')
    async with aiofiles.open(destination, "wb") as out_file:
        while content := await file.read(1024 * 1024):  # Read in 1MB chunks.
            await out_file.write(content)
    try:
        df = await load_dbf_to_dataframe(destination)
        loaded_dataframe_dict[file_id] = df
    except Exception as e:
        print("error while loading dbf to df: ", e)

    url = f"/dashboard?file_id={file_id}"
    response = RedirectResponse(url=url, status_code=302)
    response.headers["HX-Redirect"] = url

    return response
```
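
The upload endpoint above calls a `load_dbf_to_dataframe` helper whose body I'm skipping here. A minimal sketch of one way such a helper could be written, assuming the `dbfread` package and Python 3.9+ for `asyncio.to_thread` (the actual implementation may differ):

```python
import asyncio

import pandas as pd
from dbfread import DBF  # assumed third-party reader for .dbf files


def _read_dbf(path: str) -> pd.DataFrame:
    # DBF() iterates over records as dict-like rows; DataFrame() materializes them.
    return pd.DataFrame(iter(DBF(path, load=False)))


async def load_dbf_to_dataframe(path: str) -> pd.DataFrame:
    # Run the blocking parse in a worker thread so the event loop stays responsive.
    return await asyncio.to_thread(_read_dbf, path)
```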

## Gunicorn - The Winged Horse of Asynchronous Agony

Wrong. I had forgotten the simple fact that the backend is managed by Gunicorn.

<img src="https://cdn.jsdelivr.net/gh/apoplexi24/blog-assets@main/sqlite-cache/img/error2black.png" alt="Error Message" />
_The moment of realization_

> Gunicorn 'Green Unicorn' is a Python WSGI HTTP Server for UNIX. It's a pre-fork worker model. The Gunicorn server is broadly compatible with various web frameworks, simply implemented, light on server resources, and fairly speedy.
{: .prompt-info }

Gunicorn spun up 25 uvicorn workers (the usual $2 \times \text{num\_cores} + 1$ sizing) to do its bidding. That means each request gets routed to one of those 25 uvicorn workers based on request traffic, while every worker holds its own copy of the global variable. The file I was uploading was being updated only in the uvicorn worker that received the request, and the change was never propagated to the other workers.
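
To make the failure mode concrete, here is a minimal sketch (not the project code) of a FastAPI app whose module-level dict silently diverges across workers when run under Gunicorn with multiple uvicorn workers:

```python
# Run with something like:
#   gunicorn demo:app -w 4 -k uvicorn.workers.UvicornWorker
import os

from fastapi import FastAPI

app = FastAPI()
uploaded_files: dict = {}  # every worker process gets its own private copy


@app.post("/upload/{file_id}")
async def upload(file_id: str):
    # Only the worker that handled this request sees the new entry.
    uploaded_files[file_id] = "loaded"
    return {"worker_pid": os.getpid(), "known_files": list(uploaded_files)}


@app.get("/files")
async def list_files():
    # A different worker may answer this request and report an empty dict,
    # even though an upload "succeeded" a moment ago.
    return {"worker_pid": os.getpid(), "known_files": list(uploaded_files)}
```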

<img src="https://cdn.jsdelivr.net/gh/apoplexi24/blog-assets@main/sqlite-cache/img/gunicorn-flowchart.png" alt="Gunicorn Workers" />
_Multiple workers, multiple problems_

## SQLite - A Batman or Bane?

SQLite prides itself on being about 35% faster than the filesystem[^2], which means it **should** be blazing fast. But wait, what about the space-time trade-off in data storage (or in any algorithm in the world)? If we saved some time, there must be a space trade-off somewhere, and that is something we cannot afford: the runtime by itself is already consuming RAM like there is no tomorrow.

## Major Issues with using SQLite as Cache

A good SQLite database architecture makes use of indices for fast querying. If a filter is applied on a column that is not indexed, the entire table is loaded into memory rather than the query using an index offset[^3]. This means that any time a filter lands on a column I did not intend the users to query on, the ETA jumps from "3 minutes" to "1000 hours", like I am downloading a badly seeded file over a P2P torrent.
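
A quick way to see this behaviour is SQLite's `EXPLAIN QUERY PLAN`. A small sketch with a made-up `trades` table (the table and column names are hypothetical, not the real schema): without an index the planner does a full table scan, and with one it switches to an index search.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (id INTEGER PRIMARY KEY, account TEXT, amount REAL)")

# Filter on an unindexed column: the plan is a full table scan.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM trades WHERE account = ?", ("ACC-1",)
).fetchall()
print(plan)  # detail column reads something like 'SCAN trades'

# Add an index and the same filter becomes an index search.
conn.execute("CREATE INDEX idx_trades_account ON trades(account)")
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM trades WHERE account = ?", ("ACC-1",)
).fetchall()
print(plan)  # something like 'SEARCH trades USING INDEX idx_trades_account (account=?)'
```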

<img src="https://cdn.jsdelivr.net/gh/apoplexi24/blog-assets@main/sqlite-cache/img/sqlite-fragmentation.jpg" alt="Download Time" />
_Actual footage of what goes on in data science teams_

Let's delve into the real reasons why SQLite as a cache turned my coding utopia into a debugging dystopia:

1. **The Indexing Illusion:** Each unindexed filter query became a full table scan, turning my "blazing fast" cache into a digital molasses swamp. It was like inviting Usain Bolt to a race and then making him run through quicksand.

2. **Memory Hogging (Again):** SQLite decided to helpfully load entire tables into memory during those unindexed queries. My server started sweating more than I do during a production deployment.

3. **Concurrency Conundrums:** SQLite is surprisingly good with concurrent _reads_. But throw in a few _writes_, and things get dicey.

4. **The "It's a Feature, Not a Bug" Fallacy:** The persistence of SQLite gave me a false sense of security and made me less inclined to implement proper cache invalidation.

5. **Overhead of SQL operations:** Even with the indices, there is an overhead in converting dataframe operations to SQL and shuttling the data in and out of the database (see the sketch after this list).
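
For that last point, a rough sketch (with made-up data, not the project's) of the round trip a dataframe takes through SQLite compared to a plain in-memory filter:

```python
import sqlite3
import time

import numpy as np
import pandas as pd

# A synthetic million-row dataframe standing in for the .dbf contents.
df = pd.DataFrame({
    "account": np.random.choice(["A", "B", "C"], 1_000_000),
    "amount": np.random.rand(1_000_000),
})

conn = sqlite3.connect(":memory:")

start = time.perf_counter()
df.to_sql("trades", conn, index=False, if_exists="replace")                  # dataframe -> SQL
via_sqlite = pd.read_sql("SELECT * FROM trades WHERE account = 'A'", conn)   # SQL -> dataframe
print(f"via sqlite: {time.perf_counter() - start:.3f}s")

start = time.perf_counter()
in_memory = df[df["account"] == "A"].copy()                                  # plain pandas filter
print(f"in memory:  {time.perf_counter() - start:.3f}s")
```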

## Conclusion

> Maybe a "perfect caching system" was all the friends we made along the way.
{: .prompt-tip }

Even though the process was harrowing, I managed to make it work.

<img src="https://cdn.jsdelivr.net/gh/apoplexi24/blog-assets@main/sqlite-cache/img/prod-abomination.png" alt="Prod Abomination" />

_An abomination of code, if I must say so_

> Ugly Working Software >>> Fancy Software that doesn't work
{: .prompt-warning }

The management was pleased by this maneuver. I had eliminated a poor soul's daily 6 hours of splitting the file just to work on it on a slow Windows server. Everyone's happy, except the FastAPI service running on the VM with the mammoth responsibility of handling huge data.

Ultimately, there's nothing wrong with using a file, a global variable, Redis, or SQLite as a cache; it all depends on the use case, hardware restrictions, application architecture and, most importantly, the mental sanity of the developer. A developer should ideally choose the simplest approach, and if it works, it works. Optimize only when necessary, or else you will never finish the project, ever.

## References

[^1]: [MS Excel Specifications and Limits](https://support.microsoft.com/en-us/office/excel-specifications-and-limits-1672b34d-7043-467e-8e27-269d656771c3)
[^2]: [SQLite Faster than Filesystem](https://www.sqlite.org/fasterthanfs.html)
[^3]: [Why You Shouldn't Use SQLite](https://www.hendrik-erz.de/post/why-you-shouldnt-use-sqlite)
