Rust custom Binary #42

howesteve · 2025-04-05T19:41:46Z

Hello,

Amazing project here, thanks. I have some doubts about the rust integration - I know it's not documented yet, so I got nowhere else to refer to.

Shouldn't trailbase::AppState be generic or accepting params? How are we supposed to use our own variables in handlers when AppState is closed? For instance, if we have a cache, a database handler... how to share them for handlers?
Why don't you use a connection pool instead of the current implementation, i.e. single connection + crossbeam? There will be severe concurrency issues with this design.
Rebuild times are so slow! I mean, do we really have to wait for all those C dependencies to resolve on every compile? I mean, change one line, and wait 10min for every build...

ignatz · 2025-04-06T08:49:12Z

Maintainer

Thanks!

I have some doubts about the rust integration

You're spot on. I will happily extend your list, so we can address the issues one by one.

For context, I didn't want to publish any crates. I was and am worried that:

The APIs are in extremely rough shape, just the bare minimum for what's needed to support internal use-cases with no focus on consistency etc.
I didn't want to advertise framework use very much at all: TrailBase Crate #3. Understandably, early users were users coming from PocketBase and were familiar with its paradigms.
Besides the questionable experience current APIs offer, I'm genuinely concerned that maintaining stable framework APIs will slow down its internal development until things crystallize more. My plan was to stabilize the feature set of the stand-alone binary first, polish and then refactor the APIs into something someone might actually want to use. Ideally, there would be a strict separation between internal APIs and public APIs, instead of everything just mushed together.
I think there's a big difference between folks forking and going off on their own vs published crates which promise semantic versioning, ...
All that said, I clearly didn't stick with my initial plan and caved (TrailBase Crate #3), which led to your unpleasant experience... I'm sorry

I'm very interested in making things better. I did go back on my initial stance because in this early state, TB is mostly interesting to enthusiasts and enthusiasts are the most valuable users in terms of feedback and shaping the product. There's a bit of tension, since enthusiasts are more inclined to framework-use, while longer-term I would expect more users to rely on the JS/TS runtime, since it's a lot more approachable, can support a rich ecosystem, is still very fast and doesn't suffer from atrocious compile times.

Sorry for the long blurb, just felt like I owe you the context. Now to your specific points:

Shouldn't trailbase::AppState be generic or accepting params? How are we supposed to use our own variables in handlers when AppState is closed? For instance, if we have a cache, a database handler... how to share them for handlers?

Yes. Ideally there would be a stricter separation between public, stable and internal APIs rather than everything bunched together on AppState with a generic AppState::get_user_state().
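One way to picture the `AppState::get_user_state()` idea, as a minimal std-only sketch. The `AppState` type below is a hypothetical stand-in, not TrailBase's actual struct, and `with_user_state` is an invented name. It stores one application-defined value behind `Any` and hands it back by type:

```rust
use std::any::Any;
use std::sync::Arc;

/// Hypothetical stand-in for a framework-owned state object.
#[derive(Clone)]
pub struct AppState {
    // Framework-internal fields (connection, config, ...) would live here.
    user_state: Option<Arc<dyn Any + Send + Sync>>,
}

impl AppState {
    pub fn new() -> Self {
        AppState { user_state: None }
    }

    /// Attaches one application-defined value, e.g. a cache or a client handle.
    pub fn with_user_state<T: Any + Send + Sync>(mut self, value: T) -> Self {
        self.user_state = Some(Arc::new(value));
        self
    }

    /// Returns the user state if one was attached and it has type `T`.
    pub fn get_user_state<T: Any + Send + Sync>(&self) -> Option<&T> {
        self.user_state.as_deref()?.downcast_ref::<T>()
    }
}
```

A handler would then call something like `state.get_user_state::<MyCache>()`. The type-erased lookup keeps `AppState` non-generic, at the cost of a runtime type check instead of a compile-time one.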

Rebuild times are so slow! I mean, do we really have to wait for all those C dependencies to resolve on every compile? I mean, change one line, and wait 10min for every build...

That's another reason why I think stricter splitting would be beneficial. Yes, TrailBase has a bunch of dependencies and I don't love the compile-times either. I haven't done enough profiling to say exactly what contributes how much but you seem to think it's the C-library dependencies. How did you conclude that? FWIW, I do think you're probably right, i.e. the linking stage is probably the choke point and given V8 makes up 70+% of the binary size it will probably contribute significantly. I've given the dependencies several sweeps trying to eliminate as many as possible but there isn't much low-hanging fruit left, especially since the transitive dependencies of a few crates make up the major chunk.

FWIW, even w/o restructuring, I do see nice improvements by using cranelift and mold.

Why don't you use a connection pool instead of the current implementation, i.e. single connection + crossbeam? There will be severe concurrency issues with this design.

I've changed the execution model in the past, this is what I settled on after extensively benchmarking various options from pools over locking, ... (https://github.com/ignatz/libsql_bench). A pool may help for very very read-heavy workloads but is quickly overshadowed by write lock contention. From all the things I tried, the current implementation performed the most consistently. The existing TrailBase benchmarks do test high-concurrency writes and high-concurrency read workloads. When you say "severe concurrency issues", I'm guessing that maybe you're thinking of the pathological case: mostly reads with some very expensive queries.

FWIW, the current setup and async APIs allow us to evolve the execution model as we identify better compromises. I've certainly toyed with a more involved RW-lock approach with concurrent reads and serialized writes.
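For readers unfamiliar with the "single connection + queue" model discussed here, a rough std-only sketch follows. It is not TrailBase's implementation: `FakeConnection` and `ConnectionHandle` are hypothetical names, `std::sync::mpsc` stands in for crossbeam, and the blocking `call` stands in for an async API. The point is only that one thread owns the connection and every caller's job is serialized through a channel:

```rust
use std::sync::mpsc;
use std::thread;

/// Stand-in for a real SQLite connection; it only counts executed statements.
pub struct FakeConnection {
    executed: u64,
}

impl FakeConnection {
    pub fn execute(&mut self, _sql: &str) -> u64 {
        self.executed += 1;
        self.executed
    }
}

/// A job is a closure that runs on the worker with exclusive connection access.
type Job = Box<dyn FnOnce(&mut FakeConnection) + Send>;

#[derive(Clone)]
pub struct ConnectionHandle {
    jobs: mpsc::Sender<Job>,
}

impl ConnectionHandle {
    /// Spawns the worker thread that owns the connection.
    pub fn spawn() -> Self {
        let (jobs, inbox) = mpsc::channel::<Job>();
        thread::spawn(move || {
            let mut conn = FakeConnection { executed: 0 };
            // Jobs run one at a time, in arrival order.
            for job in inbox {
                job(&mut conn);
            }
        });
        ConnectionHandle { jobs }
    }

    /// Runs `f` on the worker and blocks only until its result comes back.
    pub fn call<T, F>(&self, f: F) -> T
    where
        T: Send + 'static,
        F: FnOnce(&mut FakeConnection) -> T + Send + 'static,
    {
        let (reply, result) = mpsc::channel();
        let job: Job = Box::new(move |conn: &mut FakeConnection| {
            let _ = reply.send(f(conn));
        });
        self.jobs.send(job).expect("worker thread has stopped");
        result.recv().expect("worker dropped the reply channel")
    }
}

/// Eight threads issue 100 statements each through one shared handle.
pub fn run_concurrent_writers() -> u64 {
    let handle = ConnectionHandle::spawn();
    let writers: Vec<_> = (0..8)
        .map(|_| {
            let handle = handle.clone();
            thread::spawn(move || {
                for _ in 0..100 {
                    handle.call(|conn| conn.execute("INSERT INTO t VALUES (1)"));
                }
            })
        })
        .collect();
    for writer in writers {
        writer.join().unwrap();
    }
    handle.call(|conn| conn.executed)
}
```

Each `call` occupies the worker only for the duration of its closure, so whatever the caller does between two calls runs outside the connection.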

howesteve · 2025-04-07T02:43:42Z

Author

Thanks for such a receptive, detailed answer. You don't have to be sorry for anything.

I can see your focus is to deliver an alternative to PocketBase, with a nice end-user experience, while mine is more of an "easier server boilerplate", with JS and permissions as a bonus.

My comments about it:

About the Axum AppState issue

There is a SubState functionality in axum, it might be a good solution.
Without customizing Axum and its routes, there is a world of tower middleware the user does not have access to. For instance, rate limiting? Blacklisting IPs? Tracing subscriber integration? You can't just integrate them all, so the user has to have access not only to AppState, but also to the axum router.

About the single conn vs pool issue

I didn't know about the benchmarks. Nice stuff. I also love some benchmarking, but it can be misleading. I saw a lot of tests with smart pointers, thread local, mutex, etc. but all using a single conn. Did you actually try a pool? I mean, like r2d2 or deadpool, with let's say, 20 connections? I saw no tests with pools. Write lock contention will happen regardless of using a pool or single conn - that's sqlite's weak point and everybody knows about it. Why didn't you use criterion's group benchmarks for those?
What happens if the single connection crashes somehow or takes a long time to finish a query? It will choke the whole server, right? That's the problem with single connections.
How well will the single connection behave outside the current workflow, inside a user-defined axum handler? I mean, imagine in a handler the user reads the db; fetches a url; updates the cache; etc. You'll lock the connection for the whole duration of the handler. Can you see this is impractical?
Yes, I meant how will it behave in highly concurrent scenarios and with slower queries? Think: larger datasets (or no limit/offset), CTE queries, vacuum, unindexed tables (not your fault, but trailbase will be blamed), where there is a single conn to handle every request and the axum server has to wait for that single conn to be available? A single conn might look interesting in synthetic benchmarks with very minimal/fast queries, but I don't see that working well in production. A benchmark that simulates real traffic, with all sorts of operations, would be more revealing than just tight-looping a single operation at a time. This is especially true with sqlite, which is inherently sync by nature (with great relief using WAL mode).

About the compile times/linking

Yes, there should be a split, otherwise developing like this is impractical. Or linking to dynamic libs - I'm guessing you probably don't want that and prefer a single static file.
Answering your question, I know it's C compiling simply because it took 10min every time and I looked at my task manager to see what was happening and saw several "cc" commands running.

Btw, a couple more suggestions.

I see a task queue is one of the things you want to integrate. Good thing. Apalis is a great queue, and actively developed. That could be an option. But, use a separate db to enhance concurrency.
You might want to strip then upx your distro binary. It goes from 99mb to 31mb on my amd64 platform.

ignatz Apr 7, 2025
Maintainer

About the Axum AppState issue

* There is a [SubState](https://docs.rs/axum/latest/axum/extract/struct.State.html#substates) functionality in axum, it might be a good solution.

* Without customizing Axum and its routes, there is a world of tower middleware the user does not have access to. For instance, rate limiting? Blacklisting IPs? Tracing subscriber integration? You can't just integrate them all, so the user has to have access not only to AppState, but also to the axum router.

Ack. For the middleware specifically, it's probably going to be a mix, i.e. TB will install a reasonable default set and let you more or less configure it but it would be prudent for the framework use to also be able to add your own.

About the single conn vs pool issue

* I didn't know about the benchmarks. Nice stuff. I also love some benchmarking, but it can be misleading. I saw a lot of tests with smart pointers, thread local, mutex, etc. but all using a single conn. Did you actually try a pool? I mean, like r2d2 or deadpool, with let's say, 20 connections? I saw no tests with pools. Write lock contention will happen regardless of using a pool or single conn - that's sqlite's weak point and everybody knows about it. Why didn't you use criterion's group benchmarks for those?

Lot to dissect:

I'm comfortable with calling thread-local a pool (maybe you're not). You have one connection per thread and since the API is synchronous, one couldn't benefit from more connections.
Write lock contention, or at least the cost thereof, is very different over a shared connection vs many connections. In the latter case you have N tasks more or less eagerly (depending on your retry policy) try to acquire a file lock. In the shared connection case you don't have contention in the same way, since you're merely serializing the writes.
Do you think using criterion would change the results? I agree that it would make sense to use criterion if one wanted to turn this into a continuous performance setup.
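The difference between contending for a lock and serializing writes can be illustrated with a toy model. This is only a sketch under loud assumptions: a `Mutex` polled via `try_lock` with a fixed back-off stands in for many connections retrying on SQLite's write lock, a channel stands in for the shared connection, and all names and constants are made up. It counts failed lock attempts rather than measuring real cost:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{mpsc, Arc, Mutex};
use std::thread;
use std::time::Duration;

const WRITERS: u64 = 4;
const WRITES_PER_WRITER: u64 = 50;

/// Many "connections": each writer polls for the lock and backs off when busy.
/// Returns (rows written, failed lock attempts).
pub fn contended_writes() -> (u64, u64) {
    let table = Arc::new(Mutex::new(0u64));
    let busy = Arc::new(AtomicU64::new(0));
    let handles: Vec<_> = (0..WRITERS)
        .map(|_| {
            let (table, busy) = (Arc::clone(&table), Arc::clone(&busy));
            thread::spawn(move || {
                for _ in 0..WRITES_PER_WRITER {
                    loop {
                        if let Ok(mut rows) = table.try_lock() {
                            *rows += 1;
                            break;
                        }
                        // Lock held by another writer: count it, retry later.
                        busy.fetch_add(1, Ordering::Relaxed);
                        thread::sleep(Duration::from_micros(50));
                    }
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    let rows = *table.lock().unwrap();
    (rows, busy.load(Ordering::Relaxed))
}

/// One shared "connection": writers enqueue, a single consumer applies writes.
/// There is no lock to retry on, so the second value is always 0.
pub fn serialized_writes() -> (u64, u64) {
    let (tx, rx) = mpsc::channel::<u64>();
    let handles: Vec<_> = (0..WRITERS)
        .map(|_| {
            let tx = tx.clone();
            thread::spawn(move || {
                for _ in 0..WRITES_PER_WRITER {
                    tx.send(1).unwrap();
                }
            })
        })
        .collect();
    drop(tx); // the consumer loop ends once every writer is done
    let rows: u64 = rx.iter().sum();
    for handle in handles {
        handle.join().unwrap();
    }
    (rows, 0)
}
```

Both variants end with the same number of rows. The number of failed attempts in the first one varies from run to run and depends entirely on the retry policy, which is the point being made above.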

* What happens if the single connection crashes somehow or takes a long time to finish a query? It will choke the whole server, right? That's the problem with single connections.

What does crash mean in this context? I'm not aware of this sqlite failure mode. If there was an actual issue, e.g. a segmentation fault, the server dying is IMHO WAI.

* How well will the single connection behave outside the current workflow, inside a user-defined axum handler? I mean, imagine in a handler the user reads the db; fetches a url; updates the cache; etc. You'll lock the connection for the whole duration of the handler. Can you see this is impractical?

It would be impractical, however that's not what the implementation does (unless you push all work onto the worker). If you use e.g. conn.query, the connection worker only handles the single query. The continuation, e.g. fetching urls in your example, would run outside of it.

* Yes, I meant how will it behave in highly concurrent scenarios and with slower queries? Think: larger datasets (or no limit/offset), CTE queries, vacuum, unindexed tables (not your fault, but trailbase will be blamed), where there is a single conn to handle every request and the axum server has to wait for that single conn to be available? A single conn might look interesting in synthetic benchmarks with very minimal/fast queries, but I don't see that working well in production. A benchmark that simulates real traffic, with all sorts of operations, would be more revealing than just tight-looping a single operation at a time. This is especially true with sqlite, which is inherently sync by nature (with great relief using WAL mode).

I'm all for improving the benchmarks. That's also what I meant by saying the current setup lets us iterate transparently. Accounting for expensive reads specifically makes a lot of sense. Some of your examples, e.g. vacuuming, would have the same outcome with multiple connections.

A single conn might look interesting in synthetic benchmarks with very minimal/fast queries, but I don't see that working well in production. A benchmark that simulates real traffic, with all sorts of operations, would be more revealing than just tight-looping a single operation at a time.

Agreed. Similarly for pools. As soon as you mix in a non-trivial amount of writes, you quickly get a significant long-tail. I think the best we can do: define a representative query mix and then optimize for that. If a queue ends up winning: great. Based on my experiments it's just not as obvious to me. The good news: we can experiment with this w/o affecting user code.

About the compile times/linking

* Yes, there should be a split, otherwise developing like this is impractical. Or linking to dynamic libs - I'm guessing you probably don't want that and prefer a single static file.

For folks building their own binary, it should be on them to decide how it's linked. I do think that it makes sense for the reference binary to be statically linked (to be nitpicky: only Linux has a stable syscall API).

* Answering your question, I know it's C compiling simply because it took 10min every time and I looked at my task manager to see what was happening and saw several "cc" commands running.

Could you share the specific command lines? I assumed that we're talking incremental builds. I wouldn't expect any C-code to get recompiled on incremental builds. Re-linked yes.

Btw, a couple more suggestions.

* I see a  task queue is one of the things you want to integrate. Good thing. Apalis is a great queue, and actively developed. That could be an option. But, use a separate db to enhance concurrency.

Thanks, will take a look

You might want to strip then upx your distro binary. It goes from 99mb to 31mb on my amd64 platform.

I'm open to the idea of additionally offering a stripped binary, I'm just not sure it's the best default trade-off. IMHO, optimizing binary size is more relevant if you manage many deployments and/or short-lived processes. Did you run into any issues with the current size?

ignatz · 2025-04-07T21:30:51Z

Maintainer

Quick update, I updated the synthetic benchmarks to also incorporate r2d2 as well as a TB flavor with multiple connections: https://github.com/ignatz/libsql_bench/blob/pool/vendor/trailbase-sqlite/src/connection.rs#L59 . I'm certainly keen to increase read concurrency. Should probably improve the benchmark first to be a bit less synthetic.

ignatz · 2025-04-08T15:05:45Z

Maintainer

Also did a quick PoC for the TB server with multiple connections: https://github.com/trailbaseio/trailbase/tree/multiconn

ignatz · 2025-04-08T20:48:06Z

Maintainer

Regarding build times, just moved the JS assets into a separate crate to reduce false-positive rebuilds, making the build in many cases go from:

[build-time screenshot (before) not preserved]

to:

[build-time screenshot (after) not preserved]

At this point, the build is very much dominated by link times. Building w/o v8 saves roughly a third. But the best course of action for now is probably to use mold.

ignatz Apr 17, 2025
Maintainer

Coming back to build times, I ran a study. It turns out linking is minuscule, thus using ldd vs mold hardly makes a difference.

It turns out that the build times are completely dominated by "fat" LTO. Using:

"thin" LTO + 16 codegen-units improves my release build times by ~71%
LTO "off" + 16 codegen-units + cranelift improves my release build times by 88%

See #46 (comment), for the study
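For anyone who wants to try the "thin" LTO variant in their own binary crate, the settings above map onto standard Cargo profile keys. A sketch, with the values taken from the numbers quoted above. The cranelift variant is not shown, since as far as I know it additionally needs a nightly toolchain:

```toml
# Cargo.toml of the binary crate (sketch, not TrailBase's actual profile)
[profile.release]
lto = "thin"        # instead of "fat" LTO
codegen-units = 16  # more units allow more parallel codegen
```

The trade-off is the usual one: faster builds in exchange for potentially less optimized code, so the shipped binary may still want a separate, slower profile.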

howesteve · 2025-04-10T18:17:24Z

Author

Hi there. Sorry for my absence. Crazy week here. Let's try to comment some of your points.

I agree the best strategy is to add some reasonable built-in middleware and let users plug in whatever they need.
About the TL vs pool: I know from a process standpoint they both return a connection, however as we discussed, multiple connections will help mitigate the case for slow queries, which the TL/single conn cannot, that's the whole point.
If you want to simulate slow queries to add to the benchmarks, the following might be interesting for a continuous benchmark for the project. I do get very different results using criterion and regular loop/Instant-based benchmarks.
1) A delay function:

// Requires rusqlite's "functions" feature.
use rusqlite::{functions::FunctionFlags, types::Null, Connection};
use std::thread::sleep;
use std::time::Duration;

// Registers a scalar SQL function `sleep(ms)` that blocks the calling connection.
fn add_sleep_function(conn: &Connection) -> rusqlite::Result<()> {
    conn.create_scalar_function(
        "sleep", // SQL function name
        1,       // Number of arguments
        // Not flagged SQLITE_DETERMINISTIC: the side effect should run on every call.
        FunctionFlags::SQLITE_UTF8,
        |ctx| {
            let ms = ctx.get::<i64>(0)?; // Get 1st arg as milliseconds
            sleep(Duration::from_millis(ms.max(0) as u64)); // Negative values sleep 0 ms
            Ok(Null) // Return SQL NULL explicitly
        },
    )
}

then call it:

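let conn = Connection::open(":memory:")?;
add_sleep_function(&conn)?;
// `execute` returns an error for statements that produce rows, so read the row instead.
conn.query_row("SELECT sleep(1000);", [], |_row| Ok(()))?;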

or 2) a slow recursive query:

WITH RECURSIVE delay(n) AS (
  SELECT 1
  UNION ALL
  SELECT n+1 FROM delay WHERE n < 10000000  -- Adjust n if needed
)
SELECT count(*) FROM delay;

or 3) a join between two huge unindexed tables. I know this is not the correct way to do it, but someone will, and blame TB because of some parallel query which does not return.

Then run any of these options in a bench, in parallel, using a thread pool using a TL/single connection and you'll see what I mean about the concurrent issues. I'm sure you already get the point. If we have other connections available in pool, there will be mitigation of concurrency issues.
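The effect described here can be reproduced without a database. In the following std-only sketch, `thread::sleep` stands in for a slow query and each worker thread stands in for one connection. `run_queries` is a made-up helper, not part of any benchmark in this thread:

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;
use std::time::{Duration, Instant};

/// Runs `queries` simulated queries of `query_time` each on `workers` threads
/// and returns the wall-clock time until all of them have finished.
pub fn run_queries(workers: usize, queries: usize, query_time: Duration) -> Duration {
    let (tx, rx) = mpsc::channel::<Duration>();
    let rx = Arc::new(Mutex::new(rx));
    let start = Instant::now();
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let rx = Arc::clone(&rx);
            thread::spawn(move || loop {
                // Hold the queue lock only while taking the next job.
                let job = rx.lock().unwrap().recv();
                match job {
                    Ok(duration) => thread::sleep(duration), // the "slow query"
                    Err(_) => break,                         // queue closed and drained
                }
            })
        })
        .collect();
    for _ in 0..queries {
        tx.send(query_time).unwrap();
    }
    drop(tx);
    for handle in handles {
        handle.join().unwrap();
    }
    start.elapsed()
}
```

With one worker, four 40 ms queries take at least 160 ms because they queue up behind each other. With four workers they overlap. This says nothing about write contention, where extra connections do not help in the same way.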

Some notes about benchmarks in general

I think if you're into benchmarks and this is a big selling point of the product, they should be more comprehensive to include parallel reads and writes, joins, slow queries, tables with more columns, etc. so to simulate a more realistic workflow;
read-heavy workflows (ex: 99% reads, 1% inserts), read-moderate workflows (ex: 90% reads, 10% writes), write-heavy, and so on. Each of them will bring its own surprises due to the model sqlite was coded in, and due to the pooling model you choose in the end, and it's not so obvious or precise until you really bench each of them and use those benches both to guide development and to prove your selling points. Otherwise development could go in directions that aren't ideal.
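A benchmark harness needs a reproducible way to produce such mixes. One possible sketch: `Op` and `workload` are hypothetical names, and a small linear congruential generator is used so that the same seed always yields the same sequence:

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Op {
    Read,
    Write,
}

/// Deterministic workload of `total` operations, of which roughly
/// `read_percent` percent are reads, interleaved pseudo-randomly.
pub fn workload(total: usize, read_percent: u64, seed: u64) -> Vec<Op> {
    let mut state = seed;
    (0..total)
        .map(|_| {
            // One step of a 64-bit linear congruential generator.
            state = state
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
            // Use the high bits, which are better distributed than the low ones.
            if (state >> 33) % 100 < read_percent {
                Op::Read
            } else {
                Op::Write
            }
        })
        .collect()
}
```

For example, `workload(10_000, 99, 1)` would model the read-heavy case and `workload(10_000, 90, 1)` the read-moderate one. Each `Op` would then be mapped to an actual query by the benchmark.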

About criterion, I like it. It's a great tool for benchmarks and covers the math/metrics very well; it does warm-ups, discards outliers, compares with the last run, etc. It's quite a bit more accurate than a simple Instant-based test. I personally have some sqlite benchmarks I made using criterion, and I might post them as a template later if you want, just as an example, but I'm pretty sure you can come up with something better. Btw criterion has group/throughput modes which could be very useful for this.

Another nice selling point from a marketing point of view could be comparing performance against PostgreSQL on a per-machine basis. Sqlite wins by a large margin in this scenario, yet 90% of developers have no idea. There are rust embedded pgsql distros you could use in benches, again, if you are into the benchmarking thing as a selling point and comparing TB with other solutions.

I saw the r2d2 new benchmark, performance looks similar (just a bit slower) - to TL/TL3. There are other pools as well, but I don't see the pool as a bottleneck compared to I/O itself.

Optimizing binary size is optional, as I said. I'm much more into building my own binary, so the official distro size does not concern me personally at all. But 1/3 of the size looks good for deployment, especially on several machines.

About the compile times - I couldn't test yet. This version has a lot of dependencies that weren't there last time - prettier, prettier-astro, dart, etc. I guess there is a new example which uses them and the deps aren't handled automatically yet; I'll fix them and try later.

ignatz Apr 10, 2025
Maintainer

Hey Steve,

Thanks for taking the time, your input is super valuable.

About the compiling times - I couldn't test still. This version has a lot of dependencies that weren't there the last time - prettier, prettier-astro, dart, etc. I guess there is a new example which uses them and deps aren't still handled automatically, I'll fix them and try later.

You should only need rust, pnpm, proto, and maybe a few system libs installed. All the other deps like dart, prettier, ... shouldn't be needed for building TrailBase. These are build deps, example deps, client deps, ... If you run into any issue, just let me know.

I think if you're into benchmarks and this is a big selling point of the product, they should be more comprehensive to include parallel reads and writes, joins, slow queries, tables with more columns, ...

About criterion, I like it .It's a great tool for benchmarks and covers the math/metrics very well; it does warm ups, discards outliers, compares with the last run, ...

Agreed, all very good points

Another nice selling point from a marketing standpoint of view could be comparing performance against PostgreSQL on a per-machine basis...

You mean comparing postgres and postgres client vs Sqlite-in-Trailbase or were you thinking really just the databases? Otherwise w/o the Inter-process it might be a bit apples to oranges, similarly postgres w/o an API layer with ACL/parsing/... overheads. Comparing TrailBase to SupaBase might be most apples to apples in this regard. What do you think?

About the TL vs pool: I know from a process standpoint they both return a connection, however as we discussed, multiple connections will help mitigate the case for slow queries, which the TL/single conn cannot, that's the whole point.

I'm a bit surprised. TL will return a connection per thread. If your Tokio runtime has N threads, you have N connections.

Also thank you for the specific examples, they are a great starting point. Using criterion also makes a lot of sense, since it was built for this specifically.

howesteve Apr 11, 2025
Author

I'm a bit surprised. TL will return a connection per thread. If your Tokio runtime has N threads, you have N connections.
I might have misunderstood something. Thread-local storage can cache a connection per thread, but that does not scale well if you spawn many short-lived tasks. Axum uses tokio tasks, and tasks != threads. Only one connection per thread means under-utilization and load imbalance.
I'm guessing the concurrency benchmarks will show this clearly.

There are typically many tasks and few threads in an axum server. So thread-local storage won't scale: only one connection per thread means under-utilization and load imbalance.
Futures (as returned by axum) can hop threads, especially within .await chains. Thread-local state becomes fragile and non-deterministic.

There are also other advantages to using pool managers, such as:

They are async-aware, which TL is not.
Connection reuse across threads/tasks.
Increasing the concurrency (i.e. having like 50 connections open)
Instrumentation
Better idle connection management
etc.

Again, I think more real world benchmarking will be revealing here.

Also, I suggest benchmarking with wrk/oha/bombardier to check how many reads you can do through the web server - not just internally using Instant loops - and increasing the concurrency to 100, 1000, and you'll easily see what I mean when using thread local. Remember the database access will be managed under axum, thereby using tokio's tasks, and synthetic benchmarking using loops to read/write the database will be much more revealing of rusqlite's speed than TB's.

ignatz Apr 11, 2025
Maintainer

Really love that we're getting down into the detail here. Thanks

There are typically many tasks and few threads in an axum server. So thread-local storage won't scale: only one connection per thread means under-utilization and load imbalance.

By default the underlying tokio runtime will run on an executor pool of N threads, where N is the number of cores in your system. As you allude to, there's work stealing so pending tasks can be executed by any thread addressing under-utilization/load-imbalance.

I'm not sure what you mean by "won't scale", since your machine can't run more than N threads in parallel. Note that sqlite is synchronous, so any executor thread querying SQL is "fully utilized" and won't pick up any other tasks in the meantime. Right now, sqlite execution is off the main runtime. For example, if TB was using an R2D2 thread-pool instead (or thread-local) sqlite could clog up the entire runtime preventing other tasks from making progress.

Futures (as returned by axum) can hop threads, especially within .await chains. Thread-local state becomes fragile and non-deterministic.

It's not deterministic, which connection any task will get but the same can be said for pools. It would not only be fragile but broken if you depended on a specific connection but I'm not sure why that would be the case. Rust's Send marker trait usually does a pretty good job at protecting you from any issues when using state across .await yield points.

EDIT: after reading the discussion again, I'm starting to think that you're concerned that for

let connection = state.pool.get_connection();
connection.query("SELECT 4");
do_work().await;
connection.query("SELECT 2");

both queries should be guaranteed to hit the same connection. Why is that? In TB this would look more like:

state.conn().query("SELECT 4").await;
do_work().await;
state.conn().query("SELECT 2").await;

Where both queries may use different connections

They are async-aware, which TL is not.

Not quite sure what you mean by "async-aware". Do you mean: "provides async APIs"? R2D2 is pretty sync and async agnostic (it's blocking threads and using condvars for efficient wake-ups under the hood). I'd call TB's current implementation async aware because it doesn't clog up the runtime 🤷‍♀️

Connection reuse across threads/tasks.

How relevant is this for sqlite?

Increasing the concurrency (i.e. having like 50 connections open)

Not sure what this would do if you have 16 executors in your runtime all blocked on sync sqlite. You have concurrency (pending tasks) but no place to run them in a way that would increase parallelism further (even if all threads are just waiting for an sqlite write lock).

Instrumentation

👍 if the pool implements it. Something TB should have whether it's thread-local, pool, or potato-based. We may even be able to do more than just instrumenting the use of connections in a pool, e.g. individual queries.

Better idle connection management

I could see this if you have 100 open postgres connections all doing their own heartbeating, ... . Not sure what this would do for sqlite.

Again, I think more real world benchmarking will be revealing here.

Agreed (still think thread-local and pools are virtually the same :) )

Also, I suggest try to benchmark using wrk/oha/bombardier to check how many reads you can do through the web server - not just internally using Instant loops - and go increasing the concurrency number to 100, 1000 and you'll easily see what I mean when using thread local. Remember the database access will be managed under axum, thereby using tokio's tasks and synthetic benchmarking using loops to read/write the database will be much more revealing of rusqlite's speed than TB's.

Agreed. That said, that's IMHO what the TrailBase benchmarks are doing: https://trailbase.io/reference/benchmarks. Implementation detail of specific client aside, I'm hammering TB through the web interface with 64 (arbitrary number) parallel connections.

howesteve · 2025-04-11T00:58:21Z

Author

I should have explained the PostgreSQL thing better.

I mean performance-wise app (TB/sqlite) vs app (PostgreSQL) performance/cost. The macro benchmarking, not the micro benchmark.

Many developers believe that for their application to be fast, they need PostgreSQL. Which btw is indeed an incredible database. But I don't know if you ever tried Supabase: it's really, really slow, I mean, it was like 200-400 reqs/sec, because there is the network overhead. Compare with the ~10-20k req/sec you can get from sqlite/simple queries using TB and there is your selling point: on a 5 dollar vps, you get 25-100x the speed of PostgreSQL, at 1/5 of the cost (supabase is US$ 25/month), without having to manage servers, and no external dependencies, no database managing, just your TB binary. We're talking huge cuts in server expenses, no dependency on a third party (that's a dealbreaker for me), easier managing.

Of course if your application is huge, or you need consistency, stored procedures, specific features, PostgreSQL might be needed, but that's another story. Managed PostgreSQL has better performance than supabase, but nothing that comes near the numbers a simple sqlite server can reach.

But as I was saying, if your app can manage, say, 10k req/sec, 99% of the applications don't need anything else, right? 10240 × 60 × 60 × 24 ≈ 884 million reqs/day. On a budget 5 dollar server. How many apps do you know that need more than this? And one can easily scale this up vertically on a better server, or add sharding to sqlite (see turso, libsql, litefs). Mileage can vary depending on the application and server for sure. But I dare someone to beat the cost/performance of sqlite for this kind of server.
Firebase is even worse than supabase, I'm not even going into it.
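The daily-volume estimate above is simply a sustained request rate multiplied by the number of seconds in a day. Reading the figure as 10240 × 60 × 60 × 24, a tiny helper (name made up for illustration) reproduces it:

```rust
/// Requests served in one day at a sustained rate of `per_second`.
pub fn requests_per_day(per_second: u64) -> u64 {
    const SECONDS_PER_DAY: u64 = 60 * 60 * 24; // 86_400
    per_second * SECONDS_PER_DAY
}
```

At 10,240 req/sec this gives 884,736,000 requests per day, and at an even 10,000 req/sec it gives 864,000,000. Real traffic is rarely sustained at peak rate, so this is an upper bound rather than a forecast.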

I have made a server infrastructure similar to TB, using axum and sqlite, and I get 100k reads per second, 30 writes/sec (synthetic) on a 13-year-old CPU. I mean, the speed is ridiculous. One difference between my app's server and TB is that I've used cached session IDs instead of JWTs, to avoid the overhead of decoding the JWT on every request. There is a small gain in this, you can benchmark if you want. Since sqlite is usually single served, I see no need for JWTs. Besides, JWTs are kinda hard to revoke (you have to either wait for them to expire or blacklist them). All of these issues are gone using a HashMap(session_id_, session).
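The session-ID approach mentioned here could look roughly like the following. This is a std-only sketch, not the author's actual server code. `Session`, `SessionStore` and the field names are assumptions, and expiry handling is left out:

```rust
use std::collections::HashMap;
use std::sync::RwLock;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Session {
    pub user_id: u64,
}

/// In-memory session cache keyed by an opaque session ID.
#[derive(Default)]
pub struct SessionStore {
    sessions: RwLock<HashMap<String, Session>>,
}

impl SessionStore {
    pub fn insert(&self, session_id: &str, session: Session) {
        self.sessions
            .write()
            .unwrap()
            .insert(session_id.to_owned(), session);
    }

    /// Per-request lookup: a shared read lock plus one hash lookup.
    pub fn get(&self, session_id: &str) -> Option<Session> {
        self.sessions.read().unwrap().get(session_id).cloned()
    }

    /// Revocation is a plain removal; returns whether the session existed.
    pub fn revoke(&self, session_id: &str) -> bool {
        self.sessions.write().unwrap().remove(session_id).is_some()
    }
}
```

Revoking takes effect on the very next lookup, which is the property being contrasted with JWTs. The flip side is that the store only works within a single process unless it is backed by something shared.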

Back on topic - Pocketbase and TB are supabase/firebase killers. People still didn't realize that.

So here is your selling point: do some good benchmarking, show the application is resilient/trustable and a cost-performance analysis against competition, and you'll see TB usage skyrocketing. And if you want to really raise a flag, benchmark TB against a guest supabase account, and do the performance vs cost analysis. You'll get something like, "TB is 100-500x cheaper than supabase hosting on the same hardware, without the lock in.".

ignatz Apr 11, 2025
Maintainer

But I don't know if you ever tried Supabase: it's really, really slow, I mean, it was like 200-400 reqs/sec, because there is the network overhead. Compare with the ~10-20k req/sec you can get from sqlite/simple queries using TB and there is your selling point: on a 5 dollar vps, you get 25-100x the speed of PostgreSQL, at 1/5 of the cost (supabase is US$ 25/month), without having to manage servers, and no external dependencies, no database managing, just your TB binary. We're talking huge cuts in server expenses, no dependency on a third party (that's a dealbreaker for me), easier managing.

Agreed with everything. That's a very succinct way of putting the intentions behind TB 👍

Re supabase: https://trailbase.io/reference/benchmarks.

Of course if your application is huge, or you need consistency, stored procedures, specific features, PostgreSQL might be needed, but that's another story.

TB supports stored procedures :). Distribution is not off the table even with sqlite :)

And one can easily scale this up vertically on a better server, or add sharding to sqlite (see turso, libsql, litefs).

Happy to discuss this too. IMHO, sharding and scaling are orthogonal. I'm a bit skeptical of how well the million separate DBs approach works in practice. It means: no foreign keys, no joining across tables, ... . For most users this is not a viable option. Silly example: web store where every customer is in their own DB. In other words, the data needs to partition naturally and you shouldn't bring any business incentive to break the silo :)

Firebase is even worse than supabase, I'm not even going into it.

And since it's not open source, we'll never know how inefficient it is 😅

I've used cached session IDs instead of JWTs, to prevent the overhead of decoding the JWT on every request. There is a small gain in this, you can benchmark if you want. Since sqlite is usually served from a single machine, I see no need for JWTs. Besides, JWTs are kinda hard to revoke (you either have to wait for them to expire or blacklist them). All of these issues are gone using a HashMap(session_id, session).

Most importantly session IDs are simpler and simplicity is generally a good thing in auth. There are benefits to JWTs like metadata and non-local validation. We can benchmark but decoding a JWT is local and shouldn't be very expensive, i.e. an acl check against the DB is probably more expensive.

FWIW, TB implements revocation. It has a revocation list for when you refresh your token. Tokens are generally "short"-lived. That said, the revocation list is not currently exposed (TBD). In an emergency with revoked keys still living too long, you can also roll the keys (requiring all users to re-auth).

And if you want to really raise a flag, benchmark TB against a guest supabase account, and do the performance vs cost analysis. You'll get something as, "TB is 100-500x cheaper than supabase hosting on the same hardware, without the lock in."

I've only done it locally but I like your proposal. Would be interesting to see in the context of a 5$ VPS vs whatever supabase chooses. FWIW, I really like Supabase and performance isn't everything. As you point out, most apps are fine with whatever :)

ignatz · 2025-04-12T11:19:36Z

ignatz
Apr 12, 2025
Maintainer

Quick update: inspired by our discussion here, I added some of the benchmarks from the separate repo into the trailbase-sqlite crate with a criterion harness. More representative setups with a broader set of queries are still TBD.

I'm bringing this up because I wanted to document a limitation (which I've run into in the past and blissfully keep forgetting about). The default criterion sampling approach doesn't work very reliably for the current setup because we heavily rely on sustained FS performance. Concretely, sqlite's in-memory DBs don't support WAL-mode so they're not a great representation, i.e. we have to write to some filesystem. If we write to temp files (which is what we currently do), we're non-hermetic, subject to hysteresis from the FS, SSD, ... . In other words, running the benchmark a few times gives quite a spread and the order in which we run the individual benches also affects the outcome.

Ideally, we'd write to an in-memory fs, however I'm not aware of an easy, portable way of making this work. We could shell out but then benchmarks would only run on Linux (which may be fine). Alternatively, we could just settle for sqlite's in-memory mode as representative enough as an optimization target.
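One low-effort middle ground, sketched below with the standard library only: prefer a RAM-backed directory when one is available and fall back to the regular temp dir otherwise. This assumes `/dev/shm` is a tmpfs mount, which is common on Linux but not guaranteed; the function names and the file naming scheme are hypothetical.

```rust
use std::env;
use std::path::{Path, PathBuf};

/// Picks a directory for benchmark database files.
/// Assumption: on Linux, /dev/shm is usually a tmpfs (RAM-backed) mount.
pub fn bench_db_dir() -> PathBuf {
    let shm = Path::new("/dev/shm");
    if cfg!(target_os = "linux") && shm.is_dir() {
        shm.to_path_buf()
    } else {
        // Portable fallback: the OS temp dir, which is usually disk-backed.
        env::temp_dir()
    }
}

/// Builds a per-process database path (hypothetical naming scheme).
pub fn bench_db_path(name: &str) -> PathBuf {
    bench_db_dir().join(format!("{name}-{}.db", std::process::id()))
}
```

SQLite would still go through a real filesystem and maintain a WAL this way; only the backing store changes, so results on the fallback path remain subject to the variance described above.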

3 replies

howesteve Apr 12, 2025
Author

I think benchmarks should not be in-memory, except for getting insights/bottlenecks about the implementation. I mean, come on, "our performance for in-memory dbs is like this" - can't go this way. I can't remember any database benchmarks that are run in-memory (except, of course, for memory-only dbs).

Real benchmarks should mimic production server situations, i.e. on-disk writes. And yes, that will vary according to the hardware, os, ram, cpu, ssd/nvme, etc. Same as in real life; if you have a faster server, you have faster performance. Disk-saving ops will happen at the end one way or another, right? At least if you care about durability at all. And that's why benchmarks always publish the hardware they ran on.

However I do understand your concern about isolating TB's performance from the SSD's. You can provide a flag for running on-disk or in-memory, for providing insights and showing the user the bottleneck could be his disk/OS/ram limitations and not TB itself. Which I believe is what concerns you right now, right?

Numbers are quite different using on-disk or in-memory databases of course (can tell from real life benching experience). Benches will also be quite different depending on the durability chosen (pragma sync flavors).

For on-disk vs in-memory benches, I set up my pragmas like this:

#[inline]
fn get_r2d2_pool() -> Pool<SqliteConnectionManager> {
  let pool = R2D2_POOL
    .get_or_init(|| {
      // Pragmas are applied to every connection the pool opens.
      let manager = SqliteConnectionManager::file(DB_FILE).with_init(|c| {
        if DB_FILE == ":memory:" {
          // In-memory: nothing to persist, so skip journaling and syncing.
          c.execute_batch("PRAGMA journal_mode=off; PRAGMA synchronous=off;")
        } else {
          // On disk: WAL journal with synchronous=normal.
          c.execute_batch("PRAGMA journal_mode=WAL; PRAGMA synchronous=normal;")
        }
      });
      Mutex::new(Pool::builder().build(manager).unwrap())
    })
    .lock()
    .unwrap();
  // Cloning the pool is cheap: clones share the same underlying connections.
  pool.clone()
}

Anyway, you don't need WAL for :memory: based sqlite databases, it makes no sense since there is no durability.

Finally, if you really want to do it, there are some crates for in-memory filesystems, like rust-vfs. I'm not sure if I would go that way, and I've never used them, I'm just citing here.

ignatz Apr 12, 2025
Maintainer

Agreed. It's hard to argue that a benchmark should be as representative as reasonably possible. You may have misunderstood my point. My point isn't about absolute performance, otherwise, sure, just use an in-memory b-tree representation. My point was solely about variance and fairness.

Some filesystems more than others (e.g. journaled, CoW, ...) have a lot going on. SSDs have a lot going on with limited caches, re-balancing, flushes, other background tasks... . As you're running the benchmarks you're getting a lot of run-to-run variance (many false positive regressions/improvements with criterion's default analysis, currently +/- 10+%) and hysteresis effects, i.e. order of runs, ultimate number of iterations, ... skew in relative performance between different setups in the same run.

My proposed "ideal" solution was to use an OS/FUSE-level RAM FS to reduce the impact of cache effects and have a very simple FS w/o much hysteresis. This is different from SQLite's in-memory mode, since you'd still be writing to an FS, managing a WAL, ... it would be much closer to production, but arguably less so than a persistent FS, which is slower, thus changing timing. That's also true for different hardware (as you pointed out, e.g. NVMe vs SATA SSD vs HDD are jumps of probably a similar order of magnitude). The main point is that memory isn't cached and RAM FSs are simple, leading to more run-to-run consistency (hopefully) and more fairness between subsequently executed benchmark runs: worker vs pool vs potato. And at least one TrailBase production setup is using a RAM FS: the demo :).

EDIT: I just quickly wanted to highlight that priorities also depend a bit on the context, e.g. are you optimizing a specific version for a specific system or comparing different execution models? Then authenticity is critical. Are you monitoring performance/regressions over time? Then low-variance is most important. Ultimately, nobody would investigate a regression on top of noisy data.
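To illustrate the regression-monitoring case: with run-to-run variance around +/- 10%, a reported change of a few percent can't be told apart from noise. A rough sketch of that check follows; the names and the fixed noise band are assumptions for illustration, not criterion's actual analysis (criterion has its own `noise_threshold` setting for a similar purpose).

```rust
/// Relative change of `current` vs `baseline`, e.g. mean time per iteration.
pub fn relative_change(baseline: f64, current: f64) -> f64 {
    (current - baseline) / baseline
}

#[derive(Debug, PartialEq)]
pub enum Verdict {
    Regressed,
    Improved,
    WithinNoise,
}

/// Flags a change only if it exceeds the run-to-run noise band
/// (`noise` is a fraction, e.g. 0.10 for +/- 10%).
pub fn classify(baseline: f64, current: f64, noise: f64) -> Verdict {
    let change = relative_change(baseline, current);
    if change > noise {
        Verdict::Regressed
    } else if change < -noise {
        Verdict::Improved
    } else {
        Verdict::WithinNoise
    }
}
```

The wider the noise band has to be, the larger a real regression must get before it becomes visible, which is why low variance matters most when tracking performance over time.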

Finally, if you really want to do it, there are some crates for in-memory filesystems, like rust-vfs. I'm not sure if I would go that way, and I've never used them, I'm just citing here.

They're not an option. It would need to be an OS/FUSE level FS, so that sqlite can interact with it. The crates you mention require you to use their own filesystem abstractions, which sqlite won't.

ignatz Apr 12, 2025
Maintainer

To make this a little less theoretical:

Insert results:

trailbase-sqlite insert time:   [3.2943 ms 3.3128 ms 3.3323 ms]
                        change: [+3.5227% +4.3640% +5.2052%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

shared/locked rusqlite insert
                        time:   [8.4207 ms 8.8639 ms 9.2938 ms]
                        change: [-20.647% -16.537% -12.241%] (p = 0.00 < 0.05)
                        Performance has improved.

TL/pool rusqlite insert time:   [10.475 ms 11.079 ms 11.714 ms]
                        change: [+2.4388% +9.4472% +16.415%] (p = 0.01 < 0.05)
                        Performance has regressed.

read results:

trailbase-sqlite read   time:   [8.2653 µs 8.3555 µs 8.4464 µs]
                        change: [+1.6472% +11.090% +25.292%] (p = 0.05 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe

shared/locked rusqlite read
                        time:   [16.683 µs 17.422 µs 18.014 µs]
                        change: [-8.7921% -2.0315% +4.7088%] (p = 0.57 > 0.05)
                        No change in performance detected.

TL/pool rusqlite read   time:   [1.1964 µs 1.2048 µs 1.2130 µs]
                        change: [+2.1478% +4.7796% +7.3050%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

and mixed (some fast reads, some slow reads, and some writes):

trailbase-sqlite mixed  time:   [19.954 µs 20.370 µs 20.710 µs]
                        change: [-2.6034% +1.5417% +6.1964%] (p = 0.50 > 0.05)
                        No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

shared/locked rusqlite mixed
                        time:   [40.545 µs 42.959 µs 45.345 µs]
                        change: [-13.833% -9.4122% -5.1490%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  10 (10.00%) low mild
  1 (1.00%) high severe

TL/pool rusqlite mixed  time:   [456.25 µs 534.48 µs 633.11 µs]
                        change: [-22.257% -3.4665% +21.366%] (p = 0.76 > 0.05)
                        No change in performance detected.
Found 11 outliers among 100 measurements (11.00%)
  8 (8.00%) high mild
  3 (3.00%) high severe

howesteve · 2025-04-13T02:05:38Z

howesteve
Apr 13, 2025
Author

Ah ok, you're right about rust-vfs. I guess tmpfs could be a better solution for isolating fs/os instabilities, but as you previously pointed out, it's Linux-only. I guess it's fair for benchmarking.

I don't know what queries produced these numbers, but they look quick.

Btw I usually set up criterion to run benches in groups and show throughput, like:

  let mut group = c.benchmark_group("pools");
  group.measurement_time(std::time::Duration::from_secs(10)); // Run for 10 seconds
  group.sample_size(5000);
  group.throughput(Throughput::Elements(1));
  group.bench_function("tx inserts", benchmark_tx_inserts);
  group.bench_function("(single prep) reads", benchmark_prep_reads);
  group.bench_function("(rusqlite cached) prep reads ", benchmark_prep_reads_cached);
  group.bench_function("unprep reads", benchmark_unprep_reads);
  // ...
  group.finish();

It outputs like this:

rusqlite/prep inserts (rusqlite cached)
                        time:   [27.702 µs 29.091 µs 30.532 µs]
                        thrpt:  [32.753 Kelem/s 34.375 Kelem/s 36.099 Kelem/s]
Found 401 outliers among 5000 measurements (8.02%)
  15 (0.30%) high mild
  386 (7.72%) high severe

I prefer this format better than plain timings (just saying, see what you prefer). Actually throughput (i.e. reqs/sec) is the final number we like to see and compare and see at the end if the solution is a good cost/benefit.
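For reference, the throughput line is just the reciprocal of the time line, scaled by the element count passed to `Throughput::Elements`. A small sketch of the conversion (the function name is made up):

```rust
/// Converts a per-iteration time in microseconds into elements per second,
/// given how many elements each iteration processes.
pub fn throughput_per_sec(time_us: f64, elements_per_iter: u64) -> f64 {
    elements_per_iter as f64 * 1_000_000.0 / time_us
}
```

Plugging in the output above: 29.091 µs per iteration with 1 element gives roughly 34.375 Kelem/s. Note that the bounds swap places: the lower time bound (27.702 µs) corresponds to the upper throughput bound (36.099 Kelem/s).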

It would be nice, as I said, to have a benchmark comparing TB's performance against a PostgreSQL server under the same hardware. Or PB. That would mitigate any differences from hardware, fs, etc. Actually, as I said before, it would be wonderful to see the benchmark run on a cheap VPS, comparing TB/sqlite against PB/sqlite, pgsql, or other solutions, because that would be the typical case after all, right? A web server in a VPS. The cost of TL/pooling, it's not that it's negligible, but it will be a small fraction of this benchmark.

Looking at your numbers, no doubt TL is (much!) faster than a pool, as your benches are showing, but what about performance in parallel, high concurrency, slow queries? That would be the bottleneck that a pool should fix. I don't know what exact queries produced the values above, but I'm curious about your numbers. This is getting even more interesting.

5 replies

ignatz Apr 13, 2025
Maintainer

I don't know what queries produced these numbers, but they look quick.

Benchmarks are here: https://github.com/trailbaseio/trailbase/blob/main/trailbase-sqlite/benches/benchmark.rs#L263. I agree that they should probably be more varied, with some even slower.

Btw I usually set up criterion to run benches in groups and show throughput, like:

  let mut group = c.benchmark_group("pools");
  group.measurement_time(std::time::Duration::from_secs(10)); // Run for 10 seconds
  group.sample_size(5000);
  group.throughput(Throughput::Elements(1));
  group.bench_function("tx inserts", benchmark_tx_inserts);
  group.bench_function("(single prep) reads", benchmark_prep_reads);
  group.bench_function("(rusqlite cached) prep reads ", benchmark_prep_reads_cached);
  group.bench_function("unprep reads", benchmark_unprep_reads);
  // ...
  group.finish();

It outputs like this:

rusqlite/prep inserts (rusqlite cached)
                        time:   [27.702 µs 29.091 µs 30.532 µs]
                        thrpt:  [32.753 Kelem/s 34.375 Kelem/s 36.099 Kelem/s]
Found 401 outliers among 5000 measurements (8.02%)
  15 (0.30%) high mild
  386 (7.72%) high severe

I prefer this format better than plain timings (just saying, see what you prefer). Actually throughput (i.e. reqs/sec) is the final number we like to see and compare and see at the end if the solution is a good cost/benefit.

Thanks, I'll look into it. I haven't played around too much with criterion yet, so very welcome.

It would be nice, as I said, to have a benchmark comparing TB's performance against a PostgreSQL server under the same hardware.

Agreed

Or PB. That would mitigate any differences from hardware, fs, etc.

https://trailbase.io/reference/benchmarks has naive insert/read benchmarks against PB, but also looking at latency and footprint.

Actually, as I said before, it would be wonderful to see the benchmark run on a cheap VPS, comparing TB/sqlite against PB/sqlite, pgsql, or other solutions, because that would be the typical case after all, right? A web server in a VPS.

Agreed. Just being pragmatic here. I'm trying to focus on the most critical things first, e.g. the risk of slow reads clobbering tail latency. We can always go broader.

The cost of TL/pooling, it's not that it's negligible, but it will be a small fraction of this benchmark.

I'm not sure I fully got this but I'd say it will massively depend on your setup. In my experience and to my delight, I've seen sqlite being so fast, that it actually can have a significant impact.

Looking at your numbers, no doubt TL is (much!) faster than a pool, as your benches are showing, but what about performance in parallel, high concurrency, slow queries? That would be the bottleneck that a pool should fix. I don't know what exact queries produced the values above, but I'm curious about your numbers. This is getting even more interesting.

As we discussed in the other thread, I'd argue that the TL setup is the ideal pool as long as:

1. Tasks cannot hold onto connections (which I increasingly believe is what you're thinking of)
2. Queries are executed off the main runtime, which I feel very strongly about. Otherwise, slow writes (and reads in the single worker case) can not only block further DB access but also any task from making progress. In other words: no sync I/O on an async runtime.

Maybe I misunderstood but given these two points, I don't see how a pool would improve parallelism or concurrency.

EDIT: Quick addendum, the TL/pool in the benchmark violates (2), which gives it a leg up at the expense of not being suitable for production. It's meant as a point of reference.

ignatz Apr 13, 2025
Maintainer

I applied your suggestions: grouping and throughput. Let me share some results also for a trailbase-sqlite with multiple workers:

QueryMix/trailbase-sqlite (1 thread)
                        time:   [23.837 µs 23.903 µs 23.970 µs]
                        thrpt:  [41.719 Kelem/s 41.835 Kelem/s 41.952 Kelem/s]
                 change:
                        time:   [+2.2626% +5.9804% +10.389%] (p = 0.00 < 0.05)
                        thrpt:  [-9.4114% -5.6429% -2.2126%]
                        Performance has regressed.
Found 28 outliers among 500 measurements (5.60%)
  5 (1.00%) low mild
  13 (2.60%) high mild
  10 (2.00%) high severe
QueryMix/trailbase-sqlite (2 threads)
                        time:   [52.285 µs 53.767 µs 55.299 µs]
                        thrpt:  [18.084 Kelem/s 18.599 Kelem/s 19.126 Kelem/s]
                 change:
                        time:   [-11.281% -5.0555% +1.1884%] (p = 0.12 > 0.05)
                        thrpt:  [-1.1745% +5.3247% +12.715%]
                        No change in performance detected.
Found 30 outliers among 500 measurements (6.00%)
  20 (4.00%) high mild
  10 (2.00%) high severe
QueryMix/trailbase-sqlite (4 threads)
                        time:   [101.47 µs 104.67 µs 108.07 µs]
                        thrpt:  [9.2529 Kelem/s 9.5540 Kelem/s 9.8552 Kelem/s]
                 change:
                        time:   [-7.8664% -1.9637% +4.2604%] (p = 0.53 > 0.05)
                        thrpt:  [-4.0863% +2.0030% +8.5381%]
                        No change in performance detected.
Found 37 outliers among 500 measurements (7.40%)
  26 (5.20%) high mild
  11 (2.20%) high severe
Benchmarking QueryMix/trailbase-sqlite (8 threads): Warming up for 3.0000 s
Warning: Unable to complete 500 samples in 10.0s. You may wish to increase target time to 18.9s, enable flat sampling, or reduce sample count to 250.
QueryMix/trailbase-sqlite (8 threads)
                        time:   [247.55 µs 256.86 µs 266.52 µs]
                        thrpt:  [3.7521 Kelem/s 3.8932 Kelem/s 4.0396 Kelem/s]
                 change:
                        time:   [-5.5950% +2.7532% +11.445%] (p = 0.53 > 0.05)
                        thrpt:  [-10.270% -2.6794% +5.9266%]
                        No change in performance detected.
Found 39 outliers among 500 measurements (7.80%)
  26 (5.20%) high mild
  13 (2.60%) high severe
QueryMix/locked-rusqlite
                        time:   [38.287 µs 38.535 µs 38.781 µs]
                        thrpt:  [25.786 Kelem/s 25.951 Kelem/s 26.118 Kelem/s]
                 change:
                        time:   [-2.4117% -0.1887% +1.7657%] (p = 0.88 > 0.05)
                        thrpt:  [-1.7350% +0.1890% +2.4713%]
                        No change in performance detected.
Found 29 outliers among 500 measurements (5.80%)
  5 (1.00%) low mild
  18 (3.60%) high mild
  6 (1.20%) high severe
QueryMix/TL-rusqlite    time:   [1.4422 ms 1.5147 ms 1.5904 ms]
                        thrpt:  [628.78  elem/s 660.19  elem/s 693.41  elem/s]
                 change:
                        time:   [+1.7905% +8.9149% +16.578%] (p = 0.01 < 0.05)
                        thrpt:  [-14.221% -8.1852% -1.7590%]
                        Performance has regressed.
Found 11 outliers among 500 measurements (2.20%)
  8 (1.60%) high mild
  3 (0.60%) high severe

Unless I'm holding it very wrong (which isn't unlikely), this supports my points:

Variance is through the roof,
and it's really hard to beat the single worker as soon as you mix in some writes except for maybe extremely egregious reads.

That said, I think "egregious reads" can/could be enough of a problem (as you point out) that it's worthwhile to account for them (it would also likely help extremely read-heavy workloads). Just need to find a reasonable compromise. Now we have a bench to play with... thanks to you and this discussion 👍

howesteve Apr 13, 2025
Author

Those are very good numbers. I like those criterion group/throughput benchmarks much better. I agree about reasonable compromise.

My concurrency concerns, I'll try to put in other words, is this.

A TL pool means a conn per thread.
Threads are limited, usually set up one per core. Usually axum uses num_cpus::get(), so let's suppose 8 threads for an 8-core CPU. That will limit the TL pool to 8 db connections, right? Tokio spawns many tasks, but not many threads - threads will stay limited at 8. Even if there are 1000 concurrent tasks running in tokio, the number of threads will remain the same, and so will the number of connections.
If queries are slow, concurrency will suffer because the connections are limited and stay busy in those queries. To simulate that, run:

$ wrk -d 5 -t 1 -c 100 http://localhost:3000/

... against your TB server reading from 50ms queries (the sqlite sleep() function I sent an example of above is an easy way of simulating them). You'll end up with something like 1000ms/50ms * 8 = 160 reqs/sec. There goes concurrency - because the connections are locked busy in slow queries. It gets even worse if there are writes mixed in, due to sqlite's one-writer nature, and that's usually the real world situation.

A regular pool is by no means a miracle and has its overhead, but if you have a larger pool - ex: 50 conns - you'll get nearly 1000/50 * 50 = ~1000 reqs/sec. So that's what I mean by "concurrency mitigation".
So for sure, in synthetic benchmarks with very fast microsecond queries, TL will win because for obvious reasons it has lower overhead compared to a pool such as R2D2. But if queries are slower, connections will be stuck. That's my whole point.
A conventional pool can easily open 40, 100 connections if wanted. That will mitigate the cost of slow queries locking the limited conns.
If you could increase the number of TL conns, that would be the best of worlds, since TL retrieval time is fast. Or perhaps an alternative pool implementation.

I saw the queries for the new benchmark, didn't realize they are published. They look better than the previous ones, even having joins, but are still kinda synthetic. I'm not criticizing, just pointing out there are places for both synthetic and real world benchmarks, and usually synthetic ones are good for finding implementation bottlenecks, but real world ones are what the end user wants to see.

I know - benchmarking is a lot of work and has its nuances, but at the end, it's what proves your point and your product. So at some point, you'll probably have to use a real world sql database dump for benches, and do deletes, selects, updates, slow queries, fast queries, aggregation queries - all concurrently, etc. so as to simulate what a real db server gets at the end.
Otherwise the product will end up being optimized only for synthetic benchmarks.

ignatz Apr 13, 2025
Maintainer

My concurrency concerns, I'll try to put in other words, is this.
* A TL pool means a conn per thread.

Yes.

* Threads are limited, usually set up one per core. Usually axum uses num_cpus::get(), so let's suppose 8 threads for an 8-core CPU. That will limit the TL pool to 8 db connections, right?

Right.

Tokio spawns many tasks, but not many threads - threads will stay limited at 8. Even if there are 1000 concurrent tasks running in tokio, the threads number will remain the same, and so will the connections number.

Yes.

* If queries are slow, concurrency will suffer because the connections are limited and stay busy in those queries. To simulate that, run:

Even if you had an R2D2 pool with 32 connections, 8 tasks out of your thousand running an SQLite query synchronously on the 8 threads, means that your other 992 tasks will have to wait. They simply won't run because you have no more available threads to run them on. You're not starved by the number of connections but by the number of threads.

That's why I'm so adamant about moving the sync-IO off the main runtime. That's what trailbase-sqlite does.

Let's say trailbase-sqlite has 1 worker thread, then all tasks not yielding to sqlite can continue to make progress, while the ones that do are "awaiting", i.e. blocked.

Let's say trailbase-sqlite has 16 worker threads, then you can run up to 16 parallel read queries, while none of the threads on the main runtime get locked up. Since there is a 1-to-1 correspondence between sync-IO and parallelism, each worker thread may have its own connection.

A larger number of connections in a pool would only be beneficial if tasks were to hold on to them, e.g. by having a reference to a connection object, while not running queries. You had an example before, about querying something, doing something else and then putting it back in the cache. That's exactly what you would end up doing with R2D2. Maybe that's just the way you got used to? In trailbase-sqlite's model you don't hold onto connections. You submit a query, which gets executed on a thread with its private connection and the result gets sent back.
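A stripped-down sketch of that submit-and-reply model, using only std threads and channels. This is not trailbase-sqlite's actual code: `FakeConn` stands in for a real connection, `DbHandle` is a made-up name, and `call` blocks where an async version would await a oneshot channel instead.

```rust
use std::sync::mpsc;
use std::thread;

/// Stand-in for a real database connection (hypothetical).
pub struct FakeConn {
    pub queries_run: u64,
}

/// A job is a closure that runs against the worker's private connection.
type Job = Box<dyn FnOnce(&mut FakeConn) + Send + 'static>;

/// Handle for submitting work; cheap to clone and hand to many tasks.
#[derive(Clone)]
pub struct DbHandle {
    sender: mpsc::Sender<Job>,
}

impl DbHandle {
    /// Spawns one worker thread that owns its connection exclusively.
    pub fn spawn() -> Self {
        let (sender, receiver) = mpsc::channel::<Job>();
        thread::spawn(move || {
            let mut conn = FakeConn { queries_run: 0 };
            // Runs until every sender has been dropped.
            for job in receiver {
                job(&mut conn);
            }
        });
        DbHandle { sender }
    }

    /// Submits a query and waits for the result to be sent back.
    pub fn call<T, F>(&self, query: F) -> T
    where
        T: Send + 'static,
        F: FnOnce(&mut FakeConn) -> T + Send + 'static,
    {
        let (reply_tx, reply_rx) = mpsc::channel();
        let job: Job = Box::new(move |conn: &mut FakeConn| {
            // Ignore the send error if the caller stopped waiting.
            let _ = reply_tx.send(query(conn));
        });
        self.sender.send(job).expect("worker thread has stopped");
        reply_rx.recv().expect("worker dropped the reply")
    }
}
```

Callers only ever hold a `DbHandle`, never the connection itself, so there is nothing for a task to hang onto between queries.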

$ wrk -d 5 -t 1 -c 100 http://localhost:3000/

... against your TB server reading from 50ms queries (the sqlite sleep() function I sent an example of above is an easy way of simulating them). You'll end up with something like 1000ms/50ms * 8 = 160 reqs/sec. There goes concurrency - because the connections are locked busy in slow queries. It gets even worse if there are writes mixed in, due to sqlite's one-writer nature, and that's usually the real world situation.

Yes, same goes for a setup where you have 8 threads and 500 connections as long as you query on the main runtime.

* A regular pool is by no means a miracle and has its overhead, but if you have a larger pool - ex: 50 conns - you'll get nearly 1000/50 * 50 = ~1000 reqs/sec. So that's what I mean by "concurrency mitigation".

no

* So for sure, in synthetic benchmarks with very fast microsecond queries, TL will win because for obvious reasons it has lower overhead compared to a pool such as R2D2. But if queries are slower, connections will be stuck. That's my whole point.

Understood. My point: on 8 threads you saturate on 8 parallel queries processed over 8 or 50 connections.

* A conventional pool can easily open 40, 100 connections if wanted. That will mitigate the cost of slow queries locking the limited conns.

no. If you have 8 threads and 50 connections and 8 tasks are processing slow queries, no other task will get a go.
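To put the saturation argument into numbers, here's a back-of-the-envelope sketch. The function name is made up, and the model is deliberately crude: it assumes queries run synchronously on the runtime's own threads and ignores scheduling overhead as well as SQLite's single-writer lock.

```rust
/// Upper bound on requests/sec when queries run synchronously on the
/// runtime's threads: each in-flight query needs a thread AND a connection.
pub fn max_reqs_per_sec(threads: u32, connections: u32, query_ms: u32) -> u32 {
    let parallel_queries = threads.min(connections);
    parallel_queries * 1000 / query_ms
}
```

With 50ms queries, 8 threads and 8 connections give 160 reqs/sec, and 8 threads with 50 connections still give 160. Only raising the thread count (or the real parallelism available on the machine) moves the ceiling.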

* If you could increase the number of TL conns, that would be the best of worlds, since TL retrieval time is fast. Or perhaps an alternative pool implementation.

It would make no difference unless tasks hold onto connections while not using them.

I saw the queries for the new benchmark, didn't realize they are published. They look better than the previous ones, even having joins, but are still kinda synthetic. I'm not criticizing, just pointing out there are places for both synthetic and real world benchmarks, and usually synthetic ones are good for finding implementation bottlenecks, but real world ones are what the end user wants to see.

I know - benchmarking is a lot of work and has its nuances, but at the end, it's what proves your point and your product. So at some point, you'll probably have to use a real world sql database dump for benches, and do deletes, selects, updates, slow queries, fast queries, aggregation queries - all concurrently, etc. so as to simulate what a real db server gets at the end. Otherwise the product will end up being optimized only for synthetic benchmarks.

Agreed. They are very synthetic and happy to massage them into a more varied more representative set. Arguably, any arbitrarily defined set of queries whether defined or captured from a production workload are synthetic with regards to your own workload. Playing devil's advocate, over-optimizing for one specific real query-set has the same issues. I'd be totally fine with making the number of worker threads/connections (and other parameters) configurable to tune for your own workload.

ignatz Apr 17, 2025
Maintainer

I've been thinking a bit more about how to better get my point across, maybe also for other readers...

The following two cases are equivalent in terms of parallelism:

You have a pool with enough connections for each task to grab one (could be thousands) to have ready when it's their turn to execute on a thread.
You don't let tasks grab connections but whenever they get to execute on a thread, there happens to be a ready connection for them to borrow and use.

In both cases there's no scarcity of connections. If SQLite used async IO, there might be added concurrency, but in the sync case TL connections have the advantage that they cannot run out. I'm also not sure if R2D2 plays nice with the tokio runtime, because if it were to run out of connections it may block the thread, preventing it from picking up other tasks.
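The second case can be sketched with a plain `thread_local!`. Again, `FakeConn` is a hypothetical stand-in for a real connection and the helper names are made up; the point is only that a connection is borrowed for the duration of a closure and that the number of connections follows the number of threads.

```rust
use std::cell::RefCell;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Counts how many connections were opened in total.
pub static OPENED: AtomicUsize = AtomicUsize::new(0);

/// Stand-in for a real database connection (hypothetical).
pub struct FakeConn {
    pub id: usize,
}

impl FakeConn {
    fn open() -> Self {
        FakeConn { id: OPENED.fetch_add(1, Ordering::SeqCst) }
    }
}

thread_local! {
    // Lazily opened the first time a thread runs a query.
    static CONN: RefCell<FakeConn> = RefCell::new(FakeConn::open());
}

/// Borrows the calling thread's connection for the duration of `f` only;
/// nothing can hold on to it afterwards.
pub fn with_conn<T>(f: impl FnOnce(&mut FakeConn) -> T) -> T {
    CONN.with(|conn| {
        let mut guard = conn.borrow_mut();
        f(&mut *guard)
    })
}

/// Runs `queries_per_thread` queries on each of `threads` threads and
/// returns how many connections got opened along the way.
pub fn connections_opened(threads: usize, queries_per_thread: usize) -> usize {
    let before = OPENED.load(Ordering::SeqCst);
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            thread::spawn(move || {
                for _ in 0..queries_per_thread {
                    with_conn(|conn| conn.id);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    OPENED.load(Ordering::SeqCst) - before
}
```

However many queries each thread runs, it opens exactly one connection, and a thread that gets to run always finds its connection free.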

howesteve · 2025-04-13T19:08:58Z

howesteve
Apr 13, 2025
Author

Btw you can override the TL limitations with the likes of

#[tokio::main(flavor = "multi_thread", worker_threads = N)]
async fn main() {
    // app
}

Or with a custom runtime:

tokio::runtime::Builder::new_multi_thread()
    .worker_threads(N)
    .enable_all()
    .build()
    .unwrap()
    .block_on(async {
        axum::Server::bind(&addr).serve(app.into_make_service()).await.unwrap();
    });

but I'm not sure this would be a good way performance-wise.

1 reply

ignatz Apr 13, 2025
Maintainer

Btw you can override the TL limitations with the likes of

You can certainly tell a tokio runtime how many threads it should have but the above considerations remain the same. The parallelism isn't limited by thread-local connections (unless you allocate them to tasks) but by the number of threads or the actual number of execution units in your machine.

ignatz · 2025-04-14T10:09:35Z

ignatz
Apr 14, 2025
Maintainer

Coming back to the original request before we went deep down the optimization path (which has been very productive), you were asking about custom state and middleware. I just broke up the API to allow for this: https://github.com/trailbaseio/trailbase/blob/dev/examples/custom-binary/src/main.rs#L9.

I'm by no means convinced that this is ideal (just as I think the public AppState API is a hot mess) but at least it should unblock your use-cases.

ignatz · 2025-04-15T12:08:13Z

ignatz
Apr 15, 2025
Maintainer

The upstream version now allows for parallel reads. Currently the parallelism is limited to between 2 and 4 threads, though happy to make this configurable in the future. Thanks for pushing.
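For illustration only, one way such a bound could be expressed. How TrailBase actually derives its number is an assumption here; only the 2 to 4 range comes from the note above, and the function name is hypothetical.

```rust
use std::thread;

/// Picks the number of read workers: one per available core,
/// but kept within the range [2, 4].
pub fn read_worker_count() -> usize {
    let cores = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    cores.clamp(2, 4)
}
```

Making this configurable would then amount to replacing the fixed bounds with user-supplied ones.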

ignatz · 2025-04-18T06:54:46Z

ignatz
Apr 18, 2025
Maintainer

Still working through this backlog, just looked at apalis as a queue system based on your recommendation - thanks. Looks great, I'll tentatively give it a shot. It mostly looks pretty modular. I hope I can provide my own SQL integration for consistency but also to avoid sqlx.

0 replies

howesteve · 2025-04-18T14:47:46Z

howesteve
Apr 18, 2025
Author

Ok, I had a very busy week; I took a quick look at the new release, congrats. Nice to see the parallelism improvement. I've yet to test the new version. I was building a criterion benchmark to demonstrate my point about concurrency issues, but maybe it won't be needed, and time has been a real issue lately.

About apalis - yeah, I don't like sqlx very much either (not for sqlite, I mean), but there are trade-offs in this case. I'll just leave my opinion, since I've messed with these in the past and you might benefit from that.

First thing: yes, as you know, sqlx is not as performant as some claim, and it increases compilation times; I'm sure you know all those issues quite well. I'm much more into rusqlite - I think we're in the same boat here.

I think (at least how I do in my projects) the queue's database file should be separate from the main database file. There is no need for them to be in the same db.

It improves concurrency (remember sqlite's one-writer lock per file). You won't want the queue and the main database fighting over the same write lock.
It's easier to maintain the queue in a separate file, including migrations (which you might not fully control, depending on how you implement it). For instance, one could just delete or migrate the queue.db file and not worry about the main db's data.
There would be no table naming conflicts between TB and apalis.
No migration issues
Easier to implement a separate worker-only process
Easier to distribute queues for high volume processing

When using 2 separate databases as suggested above, if one for any reason wants to work with both databases in the same query (ex: join a jobs table with a main db users table) it's often easier to use a sqlite "ATTACH DATABASE" statement, and namespace the queue db. Much, much cleaner. Just check the sqlite's "attach database" reference. So that wouldn't be an issue and makes the approach plausible.
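
A small sketch of the "ATTACH DATABASE" approach, written as std-only Rust that just builds the SQL text. The file name `queue.db`, the schema alias `queue`, and the `jobs`/`users` tables and columns are hypothetical; the statements would be run through whatever SQLite client is in use:

```rust
/// Builds an `ATTACH DATABASE` statement that namespaces the file at `path`
/// under `schema`. Assumes `schema` is a trusted, plain identifier.
fn attach_statement(path: &str, schema: &str) -> String {
    // SQLite string literals escape a single quote by doubling it.
    let escaped = path.replace('\'', "''");
    format!("ATTACH DATABASE '{escaped}' AS {schema};")
}

/// A cross-database join once the queue file is attached as `queue`.
/// `main` is SQLite's built-in name for the primary database.
const JOBS_WITH_USERS: &str = "SELECT u.email, j.status \
     FROM queue.jobs AS j \
     JOIN main.users AS u ON u.id = j.user_id;";
```

After running `attach_statement("queue.db", "queue")` on a connection to the main database, a query like `JOBS_WITH_USERS` can read from both files in one statement.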
About apalis sqlx implementation

For a built-in queue, sqlx's performance, even if subpar compared to rusqlite, should be good enough.

It does not usually need as much concurrency as the main db
If using a separate queue db, I think you could just use the apalis sqlx implementation so that you do not lose time porting it to rusqlite and/or maintaining it in the future. But you might prefer doing the port. I'm just saying. Surely an apalis rusqlite port would be welcome (including by myself).

Advantages:

Not creating/maintaining an apalis-rusqlite port, just accept what they already offer
No linking issues (see below)

Problems:

Main one: libsqlite3-sys conflicts, which are a pain in the a** - both sqlx and rusqlite fighting for a suitable compatible libsqlite3-sys in the same binary.
Minor: some bloating (sqlx + rusqlite in the same file)
Minor: some small performance loss in the queue by using sqlx instead of rusqlite. As I said, shouldn't be a big issue for most applications.
Minor: increased compile times due to sqlx and its macro system.

Apalis Rusqlite implementation

Advantages:

Better performance, compatibility overall, and compiling times.

Problems:

Creating and maintaining an apalis-rusqlite crate. Apalis is still pre 1.0 and there could be significant changes. They seem to know what they do, but you never know.

So it's really a trade-off: use the default sqlx impl or do a rusqlite port?

1 reply

ignatz Apr 18, 2025
Maintainer

Hey, thanks for the considered response, much appreciated.

I think (at least how I do in my projects) the queue's database file should be separate from the main database file. There is no need for them to be in the same db.

Yeah, that was my intuition too.

When using 2 separate databases as suggested above, if one for any reason wants to work with both databases in the same query (ex: join a jobs table with a main db users table) it's often easier to use a sqlite "ATTACH DATABASE" statement, and namespace the queue db. Much, much cleaner. Just check the sqlite's "attach database" reference. So that wouldn't be an issue and makes the approach plausible.

Ack. I'm not too concerned. Just foreign keys would work.

About apalis sqlx implementation...

Main one: libsqlite3-sys conflicts, which are a pain in the a** - both sqlx and rusqlite fighting for a suitable compatible libsqlite3-sys in the same binary.

The linking is my main and most practical concern. Everything else is negotiable as you point out.

Creating and maintaining an apalis-rusqlite crate. Apalis is still pre 1.0 and there could be significant changes. They seem to know what they do, but you never know.

You're too late :) - I had already forked it: https://github.com/trailbaseio/trailbase/tree/dev/trailbase-apalis

So it's really a trade-off: use the default sqlx impl or do a rusqlite port?

Not unless we'd also downgrade to an older version of sqlite at the whim of sqlx.
