River: A fast, robust job queue for Go and Postgres

360 points | 8 months ago | brandur.org
bgentry8 months ago

Hi HN, I'm one of the authors of River along with Brandur. We've been working on this library for a few months and thought it was about time we got it out into the world.

Transactional job queues have been a recurring theme throughout my career as a backend and distributed systems engineer at Heroku, Opendoor, and Mux. Despite the problems with non-transactional queues being well understood, I keep encountering them. I wrote a bit about them in our docs: https://riverqueue.com/docs/transactional-enqueueing
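
For anyone new to the pattern, here's a minimal sketch of transactional enqueueing with pgx, using a made-up `users` table and a generic `jobs` table rather than River's actual schema or API. The point is simply that the domain write and the job row commit or roll back together, so a job can never reference data that was rolled back.

    package jobqueue

    import (
        "context"

        "github.com/jackc/pgx/v5"
    )

    // signUp inserts a user and enqueues a "welcome email" job in one
    // transaction: either both rows commit or neither does, so a worker can
    // never pick up a job whose underlying user row was rolled back.
    func signUp(ctx context.Context, conn *pgx.Conn, email string) error {
        tx, err := conn.Begin(ctx)
        if err != nil {
            return err
        }
        defer tx.Rollback(ctx) // no-op once Commit succeeds

        var userID int64
        if err := tx.QueryRow(ctx,
            `INSERT INTO users (email) VALUES ($1) RETURNING id`, email,
        ).Scan(&userID); err != nil {
            return err
        }

        if _, err := tx.Exec(ctx,
            `INSERT INTO jobs (kind, args) VALUES ('welcome_email', jsonb_build_object('user_id', $1::bigint))`,
            userID,
        ); err != nil {
            return err
        }

        return tx.Commit(ctx)
    }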

Ultimately I want to help engineers focus their time on building a reliable product, not chasing down distributed systems edge cases. I think most people underestimate just how far you can get with this model—most systems will never outgrow its scaling constraints, and the rest are generally better off not worrying about these problems until they truly need to.

Please check out the website and docs for more info. We have a lot more coming but first we want to iron out the API design with the community and get some feedback on what features people are most excited for. https://riverqueue.com/

radicalbyte8 months ago

I don't know why people even use libraries in those languages - assuming you stick to a database engine you understand well, the (main)-database-as-queue pattern is trivial to implement. Any time spent writing code is quickly won back by not having to debug weird edge cases, and sometimes you can highly optimize what you're doing (for example, it becomes easy to migrate jobs which are data-dominated to the DB server, which can cut processing time by 2-3 orders of magnitude).

It's particularly suited to use cases such as background jobs, workflows, or other operations which occur within your application, and it scales well enough for what 99.9999% of us will be doing.

AtlasBarfed8 months ago

I thought the first rule of queues was never use a database as a queue.

22c8 months ago

The "problem" is that a database as a queue doesn't infinitely scale, but nothing infinitely scales. If you're starting to reach the limits of Postgres-as-a-queue then you've either done something very wrong or very right to get to that point.

It's probably more correct to say that the engineering effort required to make a Postgres-as-a-queue scale horizontally is a lot more than the engineering effort required to make a dedicated queueing service scale horizontally. The trade-off is that you're typically going to have to start scaling horizontally much sooner with your dedicated queuing service than with a Postgres database.

The argument for Postgres-as-a-queue is that you might be able to get to market much quicker, which can be significantly more important than how well you can scale down the track.

TheProTip8 months ago

Really cool. I'm working on a .NET project where I've also adopted a "single dependency" stance, that dependency being Postgres. I'm pretty thrilled to see I'm not the only one lol!

I plan to use Orleans to handle a lot of the heavy HA/scale lifting. It can likely stand in for Redis in a lot of cache use cases (in some non-obvious ways), and I'm anticipating writing a Postgres stream provider for it when the time comes. Will likely end up writing a Postgres job queue as well, so will definitely check out River for inspiration.

A lot of Postgres drivers, including the de facto .NET driver Npgsql, support logical decoding these days, which unlocks a ton of exciting use cases and patterns via log processing.

dangoodmanUT8 months ago

How do you look at models like temporal.io (a service in front of the DB) and go-workflows (direct to DB) in comparison? It seems like this is more of a step back towards a traditional queue like asynq, which is the model the industry is moving away from toward the Temporal model.

ramchip8 months ago

Personally I've found Temporal very limited as a queue: no FIFO ordering (!), no priorities, no routing, etc. It's also very complex when you just want async jobs, and more specialized than a DB or message broker, which can be used for many other things. I think there's a place for both.

bgentry8 months ago

I don't think these approaches are necessarily mutually exclusive. There are some great things that can be layered on top of the foundation we've built, including workflows. The best part about doing this is that you can maintain full transactionality within a single primary data store and not introduce another 3rd party or external service into your availability equation.

cloverich8 months ago

So excited y'all created this. Through a few job changes I've been exposed to the most popular background job systems in Rails, Python, JS, etc., and have been shocked at how underappreciated their limitations are relative to what you get out of the box with relational systems. Often I see a lot of DIY add-ons to help close the gaps, but it's a lot of work and often still missing tons of edge cases and useful functionality. I always felt going the other way, starting with a relational db where many of those needs come for free, would make more sense for most start-ups, internal tooling, and smaller-scale businesses.

Thank you for this work, I look forward to taking it for a (real) test drive!

endorphine8 months ago

How does this compare to https://github.com/vgarvardt/gue?

csarva8 months ago

Not familiar with either project, but it seems gue is a fork of the author's previous project, https://github.com/bgentry/que-go

bgentry8 months ago

Yes, there's a note in the readme to that effect, although I don't think they bear much resemblance anymore. que-go was an experiment I hacked up on a plane ride 9 years ago and thought was worth sharing. I was never happy with its technical design: holding a transaction for the duration of a job severely limits the available use cases and worsens bloat issues. It was also never something I intended to continue developing alongside my other priorities at the time, and I should have made that clearer in the project's readme from the start.

tombh8 months ago

At the bottom of the page on riverqueue.com it appears there's a screenshot of a UI. But I can't seem to find any docs about it. Am I missing something or is it just not available yet?

mosen8 months ago

Looks like it’s underway:

> We're hard at work on more advanced features including a self-hosted web interface. Sign up to get updates on our progress.

bgentry8 months ago

The UI isn't quite ready for outside consumption yet but it is being worked on. I would love to hear more about what you'd like to see in it if you want to share.

codegeek8 months ago

If you could build a UI similar to Hangfire [0] or Laravel Horizon [1], that would be awesome.

[0] https://hangfire.io

[1] https://github.com/laravel/horizon

fithisux8 months ago

An Airflow for Gophers?

justinclift8 months ago

Does it do job completion notification?

Along the lines of:

    _, err := river.Execute(context.Background(), j) // Enqueue the job, and wait for completion
    if err != nil {
        log.Fatalf("Unable to execute job: %s", err)
    }
    log.Printf("Job completed")

Does that make sense?

bgentry8 months ago

It sounds like you're looking to be able to find out when a job has finished working, no matter which node it was run on. No, River does not have a mechanism for that today. It's definitely something we've talked about though.

linux26478 months ago

Is there a minimum version of Postgres needed to use this? I’m having trouble finding that information in the docs

victorbjorklund8 months ago

Looks great. For people wondering whether Postgres really is a good choice for a job queue, I can recommend checking out Oban in Elixir, which has been running in production for many years: https://github.com/sorentwo/oban

Benchmark: peaks at around 17,699 jobs/sec for one queue on one node. Probably covers most apps.

https://getoban.pro/articles/one-million-jobs-a-minute-with-...

bgentry8 months ago

Oban is fantastic and has been a huge source of inspiration for us, showing what is possible in this space. In fact I think during my time at Distru we were one of Parker's first customers with Oban Web / Pro :)

We've also had a lot of experience with other libraries like Que (https://github.com/que-rb/que) and Sidekiq (https://sidekiq.org/), which have certainly influenced us over the years.

sorentwo8 months ago

The very first paying Pro customer, as a matter of fact =)

You said back then that you planned on pursuing a Go client; now, four years later, here we are. River looks excellent, and the blog post does a fantastic job explaining all the benefits of job queues in Postgres.

sorentwo8 months ago

> Work in a transaction has other benefits too. Postgres’ NOTIFY respects transactions, so the moment a job is ready to work a job queue can wake a worker to work it, bringing the mean delay before work happens down to the sub-millisecond level.

Oban just went the opposite way, removing the use of database triggers for insert notifications and moving them into the application layer instead[1]. The prevalence of poolers like pgbouncer, which prevent NOTIFY from ever triggering, plus the extra db load of trigger handling, made it not worth it.

[1]: https://github.com/sorentwo/oban/commit/7688651446a76d766f39...
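
For context, the LISTEN side of that wake-up mechanism is only a few lines with pgx. The channel name below is made up, and as noted above a real worker still needs a polling fallback, since poolers like pgbouncer in transaction mode prevent notifications from getting through.

    package jobqueue

    import (
        "context"
        "log"

        "github.com/jackc/pgx/v5"
    )

    // waitForJobs blocks until a NOTIFY arrives on the (made-up) channel, then
    // returns so the caller can run its usual fetch query. Because NOTIFY is
    // transactional, the wake-up only arrives after the inserting transaction
    // has committed and the job row is actually visible.
    func waitForJobs(ctx context.Context, conn *pgx.Conn) error {
        if _, err := conn.Exec(ctx, `LISTEN job_inserted`); err != nil {
            return err
        }
        n, err := conn.WaitForNotification(ctx)
        if err != nil {
            return err // e.g. context cancelled
        }
        log.Printf("woken up on %s (payload %q)", n.Channel, n.Payload)
        return nil
    }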

bgentry8 months ago

That makes a lot of sense. I've had the thought a few times that the NOTIFY overhead could get overwhelming in a high-throughput queue, but haven't yet had an opportunity to verify this or experiment with a mechanism for reducing it.

joker6668 months ago

Also pgbouncer.

latchkey8 months ago

If I were going to do my own job queue, I'd implement it more like GCP Tasks [0].

It is such a better model for the majority of queues. All you're doing is storing a message, hitting an HTTP endpoint, and deleting the message on success. This makes it so much easier to scale, reason about, and test task execution.

Update: since multiple people seem confused: I'm talking about the implementation of a job queue system, not suggesting that they use the GCP Tasks product. That said, I would have just used GCP Tasks too (assuming the use case dictated it; it's a fantastic and rock-solid product).

[0] https://cloud.google.com/tasks
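
For anyone unfamiliar with the push model being described, the worker side is just an HTTP handler, and the success signal is the status code: a 2xx response lets the queue delete the message, anything else gets retried. A rough sketch, with a made-up endpoint and payload shape:

    package main

    import (
        "encoding/json"
        "log"
        "net/http"
    )

    type emailTask struct {
        UserID int64  `json:"user_id"`
        Email  string `json:"email"`
    }

    func main() {
        // The queue POSTs each task to this endpoint. Any non-2xx response is
        // treated as a failure and the message is retried with backoff; a 2xx
        // response lets the queue delete the message.
        http.HandleFunc("/tasks/send-email", func(w http.ResponseWriter, r *http.Request) {
            var t emailTask
            if err := json.NewDecoder(r.Body).Decode(&t); err != nil {
                http.Error(w, err.Error(), http.StatusBadRequest)
                return
            }
            if err := sendEmail(t); err != nil {
                http.Error(w, err.Error(), http.StatusInternalServerError)
                return
            }
            w.WriteHeader(http.StatusNoContent)
        })
        log.Fatal(http.ListenAndServe(":8080", nil))
    }

    func sendEmail(t emailTask) error {
        log.Printf("sending email to %s (user %d)", t.Email, t.UserID)
        return nil
    }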

brandur8 months ago

There's a lot to be said about the correctness benefits of a transactional model.

The trouble with hitting an HTTP API to queue a task is: what if it fails, or what if you're not sure whether it failed? You can continue to retry in-band (although there's a definite latency disadvantage to doing so), but say you eventually give up: you can't be sure that no jobs were queued for which you didn't get a proper ack. In practice, this leads to a lot of uncertainty around the edges, and operators having to reconcile things manually.

There's definite scaling benefits to throwing tasks into Google's limitless compute power, but there's a lot of cases where a smaller, more correct queue is plenty of power, especially where Postgres is already the database of choice.

andrewstuart8 months ago

HTTP APIs are ideal for message queues with Postgres.

The request to get a message returns a token that identifies this receive.

You use that token to delete the message when you are done.

Jobs that don’t succeed after N retries get marked as dead and go into the dead letter list.

This is the way AWS SQS works; it's tried and true.

latchkey8 months ago

> what if it fails, or what if you're not sure about whether it failed?

This is covered in the GCP Tasks documentation.

> There's definite scaling benefits to throwing tasks into Google's limitless compute power, but there's a lot of cases where a smaller, more correct queue is plenty of power, especially where Postgres is already the database of choice.

My post was talking about what I would implement if I was doing my own queue, as the authors were. Not about using GCP Tasks.

politician8 months ago

Do you know that brandur's been writing about Postgres job queues since at least 2017? Cut him some slack.

https://brandur.org/job-drain

https://news.ycombinator.com/item?id=15294722

jbverschoor8 months ago

>> Timeouts: for all HTTP Target task handlers the default timeout is 10 minutes, with a maximum of 30 minutes.

Good luck with a long running batch.

latchkey8 months ago

If you're going to implement your own queue, you can make it run for however long you want.

Again, I'm getting downvoted. The whole point of my comment isn't about using GCP Tasks; it's about what I would do if I were going to implement my own queue system, like the author did.

By the way, that 30-minute limitation can be worked around with checkpoints or by breaking the task up into smaller chunks, which isn't a bad idea to do anyway. I've seen long-running tasks cause all sorts of downstream problems when they fail and then take forever to run again.

jbverschoor8 months ago

Well, you can't really... If you're gonna use HTTP and expect a response, you're gonna be in for a fun ride. You'll have to deal with timeout settings for:

  - http libraries
  - webservers
  - application servers
  - load balancers
  - reverse proxy servers
  - the cloud platform you're running on
  - waf

It might be alright for smaller "tasks", but not for "jobs".

latchkey8 months ago

Have you ever used Cloud Tasks?

fierro8 months ago

That's pretty fundamentally different, no? One requires you to build a distributed system with more than one component, leveraging GCP Tasks APIs. The second is just a library doing some bookkeeping inside your main datastore.

gregwebs8 months ago

We are looking right now to use a stable PG job queue built in Go. We have found 2 already existing ones:

* neoq: https://github.com/acaloiaro/neoq

* gue: https://github.com/vgarvardt/gue

Neoq is new and we found it to have some features (like scheduling tasks) that were attractive. The maintainer has also been responsive to fixing our bug reports and addressing our concerns as we try it out.

Gue has been around for a while and is probably serving its users well.

Looking forward to trying out River now. I do wonder if neoq and river might be better off joining forces.

acaloiar8 months ago

Hey thanks for the plug!

River's professional-looking website makes me think there are more commercial ambitions behind River than neoq, but maybe I'm wrong. I do think they're very similar options, but neoq has no commercial ambitions. I simply set out to create a good, open-source, queue-agnostic solution for Go users.

I'm actually switching its license to BSD or MIT this week to highlight that fact.

surprisetalk8 months ago

I love PG job queues!

They’re surprisingly easy to implement in plain SQL:

[1] https://taylor.town/pg-task

The nice thing about this implementation is that you can query within the same transaction window.

rockwotj8 months ago

Agreed. Shortwave [1] is built completely on this, but with the added layer of a per-user leasing system on top of the tasks. So you only need `SKIP LOCKED` to grab a lease; then you can grab as many tasks as you want and process them in bulk. It allows higher throughput of tasks, and was also required for the use case, as the leases were tied to a user and tasks for a single user must be processed in order.

[1]: https://www.shortwave.com/

bennyp1018 months ago

Nice, I've been using graphile-worker [0] for a while now, and it handles our needs perfectly, so I can totally see why you want something in the go world.

Just skimming the docs, can you add a job directly via the DB? So a native trigger could add a job in? Or does it have to go via a client?

[0] https://worker.graphile.org/

sorentwo8 months ago

The number of features lifted directly from Oban[1] is astounding, considering there isn't any attribution in the announcement post or the repo.

Starting with the project's tagline, "Robust job processing in Elixir", let's see what else:

  - The same job states, including the British spelling for `cancelled`
  - Snoozing and cancelling jobs inline
  - The prioritization system
  - Tracking where jobs were attempted in an attempted_by column
  - Storing a list of errors inline on the job
  - The same check constraints and the same compound indexes
  - Almost the entire table schema, really
  - Unique jobs with the exact same option names
  - Table-backed leadership election

Please give some credit where it's due.

[1]: https://github.com/sorentwo/oban

bgentry8 months ago

Hi Parker, I'm genuinely sorry it comes across as though we lifted this stuff directly from Oban. I do mean it when I say that Oban has been a huge inspiration, particularly around its technical design and clean UX.

Some of what you've mentioned are cases where we surveyed a variety of our favorite job engines and concluded that we thought Oban's way was superior, whereas for others we cycled through a few different implementations before ultimately landing in a similar place. I'm not quite sure what to say on the spelling of "cancelled" though; I've always written it that way and can't help but read "canceled" like "concealed" in my head :)

As I think I mentioned when we first chatted years ago this has been a hobby interest of mine for many years so when a new database queue library pops up I tend to go see how it works. We've been in a bit of a mad dash trying to get this ready for release and didn't even think about crediting the projects that inspired us, but I'll sync with Brandur and make sure we can figure out the right way to do that.

I really appreciate you raising your concerns here and would love to talk further if you'd like. I just sent you an email to reconnect.

throwawaymaths8 months ago

Ok so maybe just put it on the github readme? "Inspired by Oban, and X, and Y..."

JFC. One line of code you don't even have to test.

Thaxll8 months ago

As if Oban invented anything; queues on an RDBMS are not a new concept at all. Oban is 4 years old; do you know how many DB-backed queues were created in the last 10 years?

I don't see Sidekiq credited on the main page of Oban.

sorentwo8 months ago

I’d argue strongly that Oban did invent things, including parts of the underlying structure used in River, and the authors agree that it was a heavy influence.

While there is no overlap in technology or structure with Sidekiq, the original Oban announcement on the ElixirForum mentions it along with all of the direct influences:

https://elixirforum.com/t/oban-reliable-and-observable-job-p...

hamandcheese8 months ago

Not sure if you meant to compare to Sidekiq (which uses Redis), but delayed-job and que are both projects in the Ruby ecosystem that stretch back much more than 4 years and leverage relational databases as well.

kamikaz1k8 months ago

…how else do you spell cancelled? With one l? Wow, learning this on my keyboard as I type this…

jchw8 months ago

Data point: I am American and I would not spell 'cancelled' any other way. I don't think it is strictly British.

Groxx8 months ago

As far as I've seen, Americans are just plain inconsistent on this spelling.

Cancelled has nice pairing with cancellation, canceled can be typed nicely without any repeated finger use on qwerty, both clearly mean the same thing and aren't confused with something else... I say let the battles begin, and may the best speling win.

Referer pains me though.

jchw8 months ago

To be fair, "referer" is just a misspelling, I don't think it was ever accepted as a correct spelling by a large number of people. I'm sure you know the backstory though.

chuckhend8 months ago

Awesome! Seems like this would be a lot easier to work with, and perhaps more performant, than Skye's pg-queue? Queue workload is a lot like OLTP, which, IMO, makes Postgres great for it (but does require some extra tuning).

Unlike https://github.com/tembo-io/pgmq, a project we've been working on at Tembo, many queue projects still require you to run and manage a process external to the database, like a background worker. Or they ship as a client library and live in your application, which limits the languages you can choose to work with. PGMQ is a pure SQL API, so any language that can connect to Postgres can use it.

RedShift18 months ago

All these job queue implementations do the same thing, right: SELECT ... FOR UPDATE SKIP LOCKED? Why does every programming language need its own variant?

rockwotj8 months ago

To work with each language's drivers

RedShift18 months ago

But it's the same thing every time. Turn autocommit off, run the SELECT, commit, repeat? Or am I missing something?

ddorian438 months ago

You just do the update with SKIP LOCKED and work the job until it finishes or time runs out, or you keep updating the job's "liveness" as you go. (This works with autocommit on or off.)
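
For readers following along, here's a sketch of the simplest variant being described, against a generic `jobs` table with made-up column names. Note that this version holds the transaction (and row lock) for the duration of the job, which is exactly the trade-off bgentry mentions above about que-go; libraries like River and Oban instead mark the job as running and commit before working it.

    package jobqueue

    import (
        "context"
        "errors"
        "fmt"

        "github.com/jackc/pgx/v5"
    )

    // workOne claims the oldest available job with FOR UPDATE SKIP LOCKED,
    // works it, and marks it completed, all in a single transaction. If the
    // worker dies mid-job, the row lock is released and another worker can
    // claim the job again.
    func workOne(ctx context.Context, conn *pgx.Conn, work func(args []byte) error) error {
        tx, err := conn.Begin(ctx)
        if err != nil {
            return err
        }
        defer tx.Rollback(ctx)

        var (
            id   int64
            args []byte
        )
        err = tx.QueryRow(ctx, `
            SELECT id, args
            FROM jobs
            WHERE state = 'available'
            ORDER BY id
            LIMIT 1
            FOR UPDATE SKIP LOCKED`).Scan(&id, &args)
        if errors.Is(err, pgx.ErrNoRows) {
            return nil // queue is empty
        } else if err != nil {
            return err
        }

        if err := work(args); err != nil {
            return fmt.Errorf("job %d failed: %w", id, err)
        }

        if _, err := tx.Exec(ctx, `UPDATE jobs SET state = 'completed' WHERE id = $1`, id); err != nil {
            return err
        }
        return tx.Commit(ctx)
    }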

rubenfiszel8 months ago

Looks cool, and thanks for sharing. Founder of windmill.dev here, an open-source, extremely fast workflow engine to run jobs in ts, py, go, and sh, whose most important piece, the queue, is also just rust + postgresql (and mostly the FOR UPDATE SKIP LOCKED).

I'd be curious to compare performance once you guys are comfortable with that; we run ours openly and every day at: https://github.com/windmill-labs/windmill/tree/benchmarks

I wasn't aware of the skip-B-tree-splits and REINDEX CONCURRENTLY tricks, but I'm curious what you index in your jobs that makes use of those. We mostly rely on the tag/queue_name (which has small cardinality), scheduled_for, and a running boolean, which don't seem like a good fit for b-trees.
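
One common answer to the low-cardinality concern is a partial index that covers only the claimable rows, so the boolean never has to be a key column over the whole table. A sketch using the column names from the comment above (not any particular project's schema):

    package jobqueue

    import (
        "context"

        "github.com/jackc/pgx/v5"
    )

    // createClaimableIndex builds a partial index covering only rows a worker
    // could still claim, so the fetch query's scan stays small even with a
    // large backlog of finished or running rows.
    func createClaimableIndex(ctx context.Context, conn *pgx.Conn) error {
        // CREATE INDEX CONCURRENTLY cannot run inside a transaction block.
        _, err := conn.Exec(ctx, `
            CREATE INDEX CONCURRENTLY IF NOT EXISTS queue_claimable_idx
            ON queue (tag, scheduled_for)
            WHERE NOT running`)
        return err
    }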

sa468 months ago

I wrote our own little Go and Postgres job queue similar in spirit. Some tricks we used:

- Use FOR NO KEY UPDATE instead of FOR UPDATE so you don't block inserts into tables with a foreign key relationship with the job table (a sketch follows below). [1]

- We parallelize workers by tenant_id but process a single tenant sequentially. I didn't see anything in the docs about that use case; it might be worth some design time.

[1]: https://www.migops.com/blog/select-for-update-and-its-behavi...
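
A sketch of the first trick above, against the same generic `jobs` table as the earlier sketches; the only change from a normal claim query is the lock strength.

    package jobqueue

    import (
        "context"

        "github.com/jackc/pgx/v5"
    )

    // claimJob is a normal SKIP LOCKED claim except for the lock strength:
    // FOR NO KEY UPDATE takes a weaker lock than FOR UPDATE, so other
    // transactions can still insert rows that reference this job's id through
    // a foreign key while the job is being worked.
    func claimJob(ctx context.Context, tx pgx.Tx) (id int64, args []byte, err error) {
        err = tx.QueryRow(ctx, `
            SELECT id, args
            FROM jobs
            WHERE state = 'available'
            ORDER BY id
            LIMIT 1
            FOR NO KEY UPDATE SKIP LOCKED`).Scan(&id, &args)
        return id, args, err
    }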

orefalo8 months ago

Interesting, I would have thought a solution like https://temporal.io/ would be more appropriate for these use cases.

a job queue might just be the tip of the use cases iceberg... isn't it?

in the end it's a pub/sub - I use nats.io workers for this.

arf, just read a few comments along these same lines down below.

meowtimemania8 months ago

The main benefit of something like River is simplicity. With River, your application might consist of just two components: a database and a code server. Such an architecture is really easy to test, develop, debug, and deploy.

Adding temporal.io means introducing a third component. More components usually means more complexity, and more complexity makes it more difficult to test, develop, debug, and deploy.

As with everything, it's all about tradeoffs.

kromem8 months ago

Nice!

A few years ago I wrote my own in house distributed job queue and scheduler in Go on top of Postgres and would have been very happy if a library like this had existed before.

The two really are a great pair for this use case at most small to medium scales, and it's awesome to see someone putting a library out there publicly doing it - great job!!

JoshGlazebrook8 months ago

I didn't really see this feature, but I think another good one would be a way to schedule a future job that is not periodic, i.e. "schedule this job in 1 hr", where it's either not enqueued or not available to be consumed until (at least) the scheduled time.

bgentry8 months ago

You've found an underdocumented feature, but in fact River does already do what you're asking for! Check out `ScheduledAt` on the `InsertOpts`: https://pkg.go.dev/github.com/riverqueue/river#InsertOpts

I'll try to work this into the higher level docs website later today with an example :)
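
For anyone searching for this later, here's a rough sketch of what that can look like; `SendReminderArgs` and the exact client call shape are assumptions on my part, but `ScheduledAt` on `InsertOpts` is the documented piece:

    // SendReminderArgs is a placeholder job args type; ScheduledAt holds the
    // job back so workers don't pick it up until (at least) the given time.
    _, err := riverClient.Insert(ctx, SendReminderArgs{UserID: 123}, &river.InsertOpts{
        ScheduledAt: time.Now().Add(1 * time.Hour),
    })
    if err != nil {
        log.Fatalf("insert failed: %s", err)
    }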

m-a-r-c-e-l8 months ago

Nice problem observation.

One solution is the outbox pattern:

https://microservices.io/patterns/data/transactional-outbox....

sotraw8 months ago

If you are on Kafka already, there is an alternative to schedule a job without PG [0]

[0] https://www.wgtwo.com/blog/kafka-timers/

andrewstuart8 months ago

The thing is there are now vast numbers of queue solutions.

What's the goal for the project? Is it to be commercial? If so, you face massive headwinds, because it's so incredibly easy to implement a queue now.

molszanski8 months ago

Would love to see an SQLite driver

bojanz8 months ago

This looks like a great effort and I am looking forward to trying it out.

I am a bit confused by the choice of the LGPL 3.0 license. It requires one to dynamically link the library to avoid the GPL's virality, but in a language like Go that statically links everything, it becomes impossible to satisfy the requirements of the license, unless we ignore what it says and focus just on its spirit. I see that was discussed previously by the community in posts such as these [1][2][3].

I am assuming that bgentry and brandur have strong thoughts on the topic since they avoided the default Go license choice of BSD/MIT, so I'd love to hear more.

[1] https://www.makeworld.space/2021/01/lgpl_go.html [2] https://golang-nuts.narkive.com/41XkIlzJ/go-lgpl-and-static-... [3] https://softwareengineering.stackexchange.com/questions/1790...

bgentry8 months ago

Hi bojanz, to be honest we were not well informed enough on the licensing nuances. I appreciate you sharing these links, please tune into this GitHub issue where we'll give updates soon and make sure any ambiguity is resolved. https://github.com/riverqueue/river/issues/47

youerbt8 months ago

Job queues in an RDBMS are always so controversial here on HN, which is kinda sad. Not everybody needs insane scale or whatever else dedicated solutions offer. Not to mention that if you already have an RDBMS lying around, you don't have to pay for extra complexity.

hipadev238 months ago

What a strange design. If a job is dependent on an extant transaction, then perhaps the job should run in the same code that initiated the transaction instead of an outside job queue?

Also, you pass the data a job needs to run as part of the job payload. Then you don't have the "data doesn't exist" issue.

teraflop8 months ago

It's not strange at all to me. The job is "transactional" in the sense that it depends on the transaction, and should be triggered iff the transaction commits. That doesn't mean it should run inside the transaction (especially since long-running transactions are terrible for performance).

Passing around the job's data separately means that now you're storing two copies, which means you're creating a point where things can get out of sync.

hipadev238 months ago

> should be triggered iff the transaction commits

Agreed. Which is why the design doesn't make any sense. Because in the scenario presented they're starting a job during a transaction.

teraflop8 months ago

I don't understand what you mean. The job is "created" as part of the transaction, so it only becomes visible (and hence eligible to be executed) when the transaction commits.

eximius8 months ago

That part is somewhat poorly explained. That is a motivating example of why having your job queue system be separate from your system of record can be bad.

e.g.,

1. Application starts transaction
2. Application updates DB state (business details)
3. Application enqueues job in Redis
4. Redis jobworkers pick up job
5. Redis jobworkers error out
6. Application commits transaction

This motivates placing the job-worker state in the same transaction, whereas non-DB-based job queues have issues like this.

Chris9118 months ago

The job is queued as part of the transaction. It is executed by a worker outside the scope of the transaction.

j458 months ago

Maybe it's not designed for that or for all use cases, and that can make sense.

Personally, I need long running jobs.

brandur8 months ago

Author here.

Wanting to offload heavy work to a background job is about as old a best practice as exists in modern software engineering.

This is especially important for the kind of API and/or web development that a large number of people on this site are involved in. By offloading expensive work, you take that work out-of-band of the request that generated it, making that request faster and providing a far superior user experience.

Example: User sign-up where you want to send a verification email. Talking to a foreign API like Mailgun might be a 100 ms to multisecond (worst case scenario) operation — why make the user wait on that? Instead, send it to the background, and give them a tight < 100 ms sign up experience that's so fast that for all intents and purposes, it feels instant.

stouset8 months ago

GP isn’t taking umbrage with the concept of needing to offload work to a background process.

hipadev238 months ago

> Wanting to offload heavy work to a background job is absolute as old of a best practice as exists in modern software engineering.

Yes. I am intimately familiar with background jobs. In fact I've been using them long enough to know, without hesitation, that you don't use a relational database as your job queue.

lazyant8 months ago

> I've been using them long enough to know, without hesitation, that you don't use a relational database as your job queue.

I'm also very familiar with jobs and have used the usual tools like Redis and RMQ, but I wouldn't make a blanket statement like that. There are people using RDBMSs as queues in prod, so we have some counter-examples. I wouldn't mind at all getting rid of another system (not just one server, but the cluster of RMQ/Redis you need for HA). If there's a big risk in using pg as a backend for a task queue, I'm all ears.

toolz8 months ago

as far as I'm aware the most popular job queue library in elixir depends on postgres and has performance characteristics that cover the vast majority of background processing needs I've come across.

I wonder if maybe you've limited yourself by assuming relational DBs only have features for relational data. That isn't the case now, and really hasn't been for quite some time.

qaq8 months ago

Postgres-based job queues work fine if you have, say, 10K transactions per second and jobs that on average don't take significant time to complete (things will run fine on a fairly modest instance). They also give guarantees that traditional job queues do not.

qaq8 months ago

The job is not dependent on an extant transaction. The bookkeeping of job state runs in the same transaction as your domain state manipulation, so you will never get into a situation where the job's domain mutation committed but the job's state failed to update to completed.

maherbeg8 months ago

I think you may be misunderstanding the design here. The transaction for initiating the job is only for queuing. The dequeue and execution of the job happens in a separate process.

The example on the home page makes this clear: a user is created and a job is created at the same time. This ensures that the job is queued up with the user creation. If any part of that initial transaction fails, then the job queuing doesn't actually happen.

zackkitzmiller8 months ago

I agree. This design is incredibly strange, and seems to throw away basically all distributed systems knowledge. I'm glad folks are playing with different ideas, but this one seems off.

eximius8 months ago

No, this is a fairly common pattern called having an "outbox", where the emission/enqueueing of your event/message/job is tied to the transaction completion of the relevant domain data.

We use this to ensure Kafka events are only emitted when a process succeeds; this is very similar.

iskela8 months ago

So when the business data transaction commits, a notify event is raised and a job row is inserted. An out-of-band job broker listens for the notify event on the job table, or polls the table skipping locked rows, and takes work for processing?

eximius8 months ago

Basically.

For our particular use case, I think we're actually not using notify events. We just insert rows into the outbox table, and the poller re-emits them as Kafka events and deletes successfully emitted events from the table.
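
A minimal sketch of that kind of poller loop, with a made-up `outbox` table and a caller-supplied publish function standing in for the Kafka producer:

    package jobqueue

    import (
        "context"
        "time"

        "github.com/jackc/pgx/v5"
    )

    // pollOutbox drains the outbox on a fixed interval: claim a batch of rows
    // with SKIP LOCKED, publish each one, and delete only the rows that were
    // published successfully. Rows whose publish failed stay put and are
    // retried on a later pass.
    func pollOutbox(ctx context.Context, conn *pgx.Conn, publish func(payload []byte) error) error {
        for {
            tx, err := conn.Begin(ctx)
            if err != nil {
                return err
            }

            rows, err := tx.Query(ctx, `
                SELECT id, payload FROM outbox
                ORDER BY id
                LIMIT 100
                FOR UPDATE SKIP LOCKED`)
            if err != nil {
                tx.Rollback(ctx)
                return err
            }

            var published []int64
            for rows.Next() {
                var (
                    id      int64
                    payload []byte
                )
                if err := rows.Scan(&id, &payload); err != nil {
                    rows.Close()
                    tx.Rollback(ctx)
                    return err
                }
                if err := publish(payload); err != nil {
                    continue // leave the row for the next pass
                }
                published = append(published, id)
            }
            rows.Close()

            if len(published) > 0 {
                if _, err := tx.Exec(ctx, `DELETE FROM outbox WHERE id = ANY($1)`, published); err != nil {
                    tx.Rollback(ctx)
                    return err
                }
            }
            if err := tx.Commit(ctx); err != nil {
                return err
            }

            select {
            case <-ctx.Done():
                return ctx.Err()
            case <-time.After(time.Second):
            }
        }
    }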

youerbt8 months ago

Either the business data and the job are committed, or neither is. Then, as you write, a worker can pick it up, either by polling or by listening for an event. A bonus, from an implementation perspective, is that if a worker selects the row FOR UPDATE (locking the job so others can't pick it up) and dies, Postgres will eventually release the lock, making the job available to other workers.

throwawaaarrgh8 months ago

Sorry if I'm late to the party, but has anyone told newbie developers that relational databases make very poor queues? We found that out like 15 years ago. (It's not about scalability, though that is a significant problem which this post hand-waves away.)

whalesalad8 months ago

A lot has changed in 15 years.

throwawaaarrgh8 months ago

Like?