How we built a fair multi-tenant queuing system
published on 2024/01/23
As we scaled to thousands of users who have triggered billions of functions to run, we learned a great deal about the challenges developing a platform built on queues.
Attempting to build a system like this with any off-the-shelf queuing system is quickly going to become a struggle. Here are some of the key challenges that you will encounter:
- Fairness. You send 5,000 events per second. Not hard to manage, but in a single queue your jobs will block other users' work from running. That's not fair. One user should not be able to block another's work.
- Multi-tenancy. Most queues operate on a "worker" model: you specify which queues a worker should listen to, and the amount of jobs that worker can handle. Spinning up new workers for each function is tedious and cumbersome. It's made worse given that development branches are ephemeral and bursty.
- Contention. With thousands of customers running thousands of step functions containing parallel steps, at any given time there are millions jobs available for work. In typical queueing systems, workers will all contend for the earliest job available, leading to lots of spinning and wasted effort.
- Concurrency. Customizing concurrency at a function level across distributed workers is not possible in other queueing systems. Typically, concurrency management is implemented within the worker polling logic. Within Inngest, you can set multiple concurrency settings on a single function using one line of code. This creates virtual queues for managing runs.
- Read/write load. Handling tens of thousands of jobs per second leads to hundreds of thousands of requests per second: enqueues, "locks" (we'll get to that), dequeues all lead to heavy load on the backing infrastructure that runs the queue.
- Infrastructure. The queue needs to be reliable and fault-tolerant, which means we need distributed systems built in (which also impacts latency).
- Observability. Most queues are opaque, with little or basic observability out of the box. It's hard to build in real-time observability for step functions beyond the backlog: eg. number of functions waiting for other events, or a histogram of wall time vs execution time.
- Customizability. It's difficult and incredibly time intensive to extend basic queueing systems with advanced functionality. Customizations include: changing backoffs function-by-function, cancelling the backlog based off of job data, batching, debounce, smart indexing, and step function parallelism.