How makes serverless background jobs possible


Featured on Hashnode is an open-source framework for building background jobs in your existing codebase. Right now we have support for Next.js and serverless. We’re adding support for more JavaScript platforms soon.

This post will explain how it works and dive into some technical detail. Hold on to your butts.

First of all, why would I need something like this?!

Let’s say your web app is hosted on a serverless platform like Vercel, Cloudflare or Netlify. You want to perform a background job – a series of operations that take some time, can fail, and must be retried.

Without a lot of extra work, serverless is not a good fit for this problem. Serverless functions have timeouts which can be as low as 10 seconds. You must send a response with some data within that time, otherwise the job fails. You also want to keep users informed of what’s happening while the job runs.

How is this problem normally solved?

Most teams end up creating a separate service for background jobs that doesn’t run on serverless. This approach works well but introduces significant work and ongoing maintenance.

You must:

  • Create API endpoints on both sides – to send data back and forth. You really want strong type safety as well.

  • Store the state of Runs so you can report the status and recover when servers go down (e.g. during deployments).

  • Be able to retrieve the state of Runs for displaying to users.

  • Add logging so you can debug when things go wrong.

  • Be able to rerun failed jobs, either from the start with the same input data or by retrying the last failed task and continuing from there.

  • And of course, you need to write the code for each job and deploy it.

How is built?

Here’s a helicopter view before we dive into a real example in detail.

Architectural overview

Your servers

You write background jobs alongside your normal code, using the SDKs. You can access your database, existing code, whatever you normally do. It’s just code.

But the twist is: if you want to make something retryable or logged to the dashboard, you wrap it in a Task. We make this easy for APIs by providing Integrations (we’ve already done it for you). More on that in a bit.

You’ll need to use an adaptor (like our Next.js one) that creates an API endpoint so we can send messages back to your servers.

The API, dashboard and Postgres

The API triggers Jobs, manages Tasks and saves the state of all Runs. It also allows you to get the current state of Jobs.

The dashboard is a great UI your whole team can use to view all your Jobs and Runs (logs), and to retry or cancel things.

The glorious dashboard

Postgres is used both as a store of state for Runs/Tasks and for the Job queue (we use Graphile Worker). is fully open source and can be self-hosted. We have a cloud product too.

Let’s check out an example Job: GitHub issue reminders

When a new GitHub issue is created, if no one has acted on it after 24 hours, assign it to someone and send a Slack reminder.

Here’s the code:

import { client } from "@/trigger";
import { Github, events } from "";
import { Slack } from "";
import { Job } from "";

const github = new Github({
  id: "github-api-key",
  //this token doesn't leave your servers
  token: process.env.GITHUB_API_KEY!,
});

//this uses OAuth (setup in the dashboard)
const slack = new Slack({ id: "slack" });

new Job(client, {
  id: "new-github-issue-reminder",
  name: "New GitHub issue reminder",
  version: "0.1.0",
  //include the integrations so they can be used in run()
  integrations: { github, slack },
  //this is what causes run() to fire
  trigger: github.triggers.repo({
    event: events.onIssueOpened,
    owner: "triggerdotdev",
    repo: "",
  }),
  //where the magic happens
  run: async (payload, io, ctx) => {
    //delay for 24 hours (or 60 seconds in development)
    const delayDuration =
      ctx.environment.type === "DEVELOPMENT" ? 60 : 60 * 60 * 24;
    await io.wait("wait 24 hours", delayDuration);

    const issue = await io.github.getIssue("get issue", {
      owner: payload.repository.owner.login,
      issueNumber: payload.issue.number,
    });

    //if the issue has had no activity
    if (issue.updated_at === payload.issue.updated_at) {
      await io.slack.postMessage("Slack reminder", {
        text: `Issue needs attention: <${issue.html_url}|${issue.title}>`,
        channel: "C04GWUTDC3W",
      });

      //assign it to someone, in this case… Matt
      await io.github.addIssueAssignees("add assignee", {
        owner: payload.repository.owner.login,
        issueNumber: payload.issue.number,
        assignees: ["matt-aitken"],
      });
    }
  },
});
There's a YouTube walkthrough of how to create this Job from start to finish.

Job registration

When you run our CLI during local development or when you deploy, your Jobs will get registered with the platform. This makes us aware of them so we can trigger them to start.

There are currently three types of Triggers:

  1. Event Triggers – define the name of an event and expected payload, then send a matching event to trigger the Job(s).

  2. Scheduled – either use a CRON pattern or an interval that you want a Job to run at.

  3. Webhooks – subscribe to changes on another service using their API.

We’re going to dig into webhooks in detail because it’s the most interesting Trigger and is used in our example above.

How Job registration works

  1. You start an endpoint refresh (either by running the CLI dev command or automatically as part of your deployment setup).

  2. The absolute URL for your Trigger endpoint is updated.

  3. A request to your endpoint is made with the INDEX_ENDPOINT action.

  4. Data about all your Jobs, Sources, Dynamic Triggers and Dynamic Sources is sent back.

  5. Jobs are registered: new Jobs are created, old Jobs are updated. Any Integrations that don’t exist are created (let’s assume for simplicity that the Slack OAuth Integration with id slack has already been set up).

  6. Sources are registered – a source for the GitHub triggerdotdev/ repo with the issues event doesn’t exist, so it needs to be created and registered. If it existed and the config had changed it would be updated.

  7. Registering webhooks uses a Job. We kick this off by creating records, then sending an event to your server.

  8. The internal registration Job starts (this Job is defined inside the GitHub Integration). It uses the GitHub API to register the webhook and passes that data back to the API.

  9. The source is updated and is ready to receive webhook events.
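Steps 3 to 5 boil down to "ask the endpoint what Jobs it has, then create or update them". Here's a minimal sketch of that idea. Everything in it is an assumption made for illustration: the JobMetadata shape, the indexEndpoint and registerJobs functions, and the in-memory store are hypothetical, and the real request and response formats aren't shown in this post.

```typescript
// Hypothetical metadata shape for a Job; the real format is not shown in this post.
type JobMetadata = { id: string; name: string; version: string };

// Your server: answer the INDEX_ENDPOINT action with metadata for every defined Job.
function indexEndpoint(jobs: JobMetadata[]): { jobs: JobMetadata[] } {
  return { jobs };
}

// Platform side: create Jobs it has not seen before, update the ones it has.
function registerJobs(
  store: Record<string, JobMetadata>,
  indexed: JobMetadata[]
): { created: string[]; updated: string[] } {
  const created: string[] = [];
  const updated: string[] = [];
  for (const job of indexed) {
    if (Object.prototype.hasOwnProperty.call(store, job.id)) {
      updated.push(job.id);
    } else {
      created.push(job.id);
    }
    store[job.id] = job;
  }
  return { created, updated };
}

// An in-memory stand-in for the Jobs table.
const jobStore: Record<string, JobMetadata> = {};

// First refresh: the Job is new, so it gets created.
const firstRefresh = registerJobs(
  jobStore,
  indexEndpoint([
    { id: "new-github-issue-reminder", name: "New GitHub issue reminder", version: "0.1.0" },
  ]).jobs
);

// Second refresh with a bumped version: the same Job id gets updated.
const secondRefresh = registerJobs(
  jobStore,
  indexEndpoint([
    { id: "new-github-issue-reminder", name: "New GitHub issue reminder", version: "0.2.0" },
  ]).jobs
);
```

Keying on the Job id is what lets a refresh run safely on every deploy: the same id always lands on the same record.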

Triggering a Run

Let’s dig into the details of how this Job gets triggered and a Run starts.

How Run triggering works

  1. Someone creates an issue in your GitHub repo.

  2. GitHub sends a webhook to the URL that was registered on

  3. An EventRecord is created, which stores the payload and metadata in the database (used for logging and retries).

  4. All the Jobs which subscribe to this webhook event are found (it can be more than one).

  5. A Job Run record is created in the database, for each Job and environment combo (e.g. dev and prod).

  6. Any Integrations are prepared, so they can be used in the run() function. Slack uses OAuth so the latest token is retrieved from the secret store.

  7. A request with the EXECUTE_JOB action is made to your Trigger API endpoint.

  8. The run() function is called on your servers.

In the (very likely) scenario where the Run doesn’t complete in a single go (e.g. a timeout is hit, a wait is used, your server goes down…) then steps 6 onwards are repeated.
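Steps 3 to 5 are a fan-out: one stored event can produce several Runs. A rough sketch of that, assuming hypothetical record shapes (JobSubscription, EventRecord fields, the QUEUED status and the createRuns function are all made up for this example, not the real schema):

```typescript
// Hypothetical record shapes, for illustration only.
type Environment = "dev" | "prod";
type JobSubscription = { jobId: string; eventName: string; environments: Environment[] };
type EventRecord = { id: string; name: string; payload: unknown };
type JobRun = { jobId: string; environment: Environment; eventId: string; status: "QUEUED" };

// Find every Job subscribed to the event and create one Run per Job and environment.
function createRuns(event: EventRecord, subscriptions: JobSubscription[]): JobRun[] {
  const runs: JobRun[] = [];
  for (const sub of subscriptions) {
    if (sub.eventName !== event.name) continue;
    for (const environment of sub.environments) {
      runs.push({ jobId: sub.jobId, environment, eventId: event.id, status: "QUEUED" });
    }
  }
  return runs;
}

const subscriptions: JobSubscription[] = [
  { jobId: "new-github-issue-reminder", eventName: "issues.opened", environments: ["dev", "prod"] },
  { jobId: "label-new-issues", eventName: "issues.opened", environments: ["prod"] },
  { jobId: "deploy-on-push", eventName: "push", environments: ["prod"] },
];

const issueOpened: EventRecord = { id: "evt_1", name: "issues.opened", payload: {} };

// Two Jobs match the event; one of them runs in two environments, so three Runs are created.
const runs = createRuns(issueOpened, subscriptions);
```

Each Run keeps the id of the event that caused it, which is what makes the stored payload usable for logging and retries.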

Tasks, resuming, retrying and idempotent-ness

Inside the run() function you can just use normal code. The code isn’t sent to our servers so if you need to perform a super secret operation like accessing very private data from your database you can do that.

It's classified

run() is called at least once

It’s critical to understand that the run() function will very likely be called more than once. So anything that has side effects, like mutating something, should be idempotent, wrapped in a Task, or both.


Excuse me?

Idempotent is a fancy word that has a disappointingly low Scrabble score (15 points). It means no matter how many times you call something (with the same inputs) the result will be the same. A function that sets name = "Rick Astley" in your database is idempotent. But doing rickrollCount += 1 in your database is not because each time you call it, the result is different from the previous times.
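Here's the same distinction as a tiny sketch. The Row type and both functions are hypothetical stand-ins for database writes, nothing more:

```typescript
// Hypothetical in-memory "database" row, used only for illustration.
type Row = { name: string; rickrollCount: number };

// Idempotent: calling it once or many times leaves the row in the same state.
function setName(row: Row, name: string): Row {
  return { ...row, name };
}

// Not idempotent: every call produces a different result from the previous one.
function incrementRickrolls(row: Row): Row {
  return { ...row, rickrollCount: row.rickrollCount + 1 };
}

const initialRow: Row = { name: "", rickrollCount: 0 };

// Setting the name twice gives the same row as setting it once.
const namedOnce = setName(initialRow, "Rick Astley");
const namedTwice = setName(namedOnce, "Rick Astley");

// Incrementing twice does not: the count is 1 after one call and 2 after two.
const countedOnce = incrementRickrolls(initialRow);
const countedTwice = incrementRickrolls(countedOnce);
```

If run() gets replayed, the first kind of operation is harmless and the second kind quietly double counts.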


Tasks are very useful for several reasons, and we strongly recommend you use them.

  1. Once they have succeeded (i.e. not thrown an error) the result is stored so they won’t run again. The result is simply retrieved from the database.

  2. This success storage/retrieval makes them idempotent most of the time. But not fully on their own. If an error is thrown or a network failure happens that Task will get retried (by default). So if you’ve done something that isn’t idempotent you will get unwanted side effects.

  3. You can configure the retrying behaviour and if the Task does fail it will be retried. The default behaviour is retrying with exponential back off.

  4. When you create a Task you are given an idempotency key that you can use. Some APIs support these, like Stripe.

  5. Tasks are logged to the dashboard which gives you great visibility into your Jobs.
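To make "retrying with exponential back off" concrete, here's a small sketch of how the delay between attempts could be calculated. The option names (maxAttempts, baseDelayMs, factor, maxDelayMs) and the numbers are assumptions for this example, not the SDK's actual settings or defaults.

```typescript
// Hypothetical retry settings; these names are illustrative, not the SDK's.
type RetryOptions = {
  maxAttempts: number; // total attempts allowed, including the first one
  baseDelayMs: number; // delay after the first failure
  factor: number; // multiplier applied after each further failure
  maxDelayMs: number; // upper bound on any single delay
};

// Given how many attempts have failed so far, return the delay before the
// next attempt, or undefined when there are no attempts left.
function backoffDelayMs(failedAttempts: number, opts: RetryOptions): number | undefined {
  if (failedAttempts >= opts.maxAttempts) return undefined;
  const delay = opts.baseDelayMs * Math.pow(opts.factor, failedAttempts - 1);
  return Math.min(delay, opts.maxDelayMs);
}

const retryOptions: RetryOptions = {
  maxAttempts: 5,
  baseDelayMs: 1000,
  factor: 2,
  maxDelayMs: 5000,
};

// Delays after the 1st to 5th failure: 1000, 2000, 4000, 5000 (capped), then undefined (give up).
const delays = [1, 2, 3, 4, 5].map((n) => backoffDelayMs(n, retryOptions));
```

Doubling the delay gives a struggling API (or a rate limit) room to recover, and the cap stops a long-running Job from waiting forever between attempts.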

Now you know where all my great programming jokes come from…

Tasks always have a key as the first parameter. This uniquely identifies that Task inside a Run. When the Job is re-run, the key is used to look up the previous result. Think of it a bit like the React key in a map.
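A stripped-down sketch of that lookup, assuming an in-memory cache. The TaskCache class and its runTask method are hypothetical (the real results are stored in the database via the API), but the key-based behaviour is the point:

```typescript
// Hypothetical, simplified Task cache: real results are persisted via the API.
type CachedTask = { status: "COMPLETED"; output: unknown };

class TaskCache {
  private tasks: Record<string, CachedTask> = {};

  // Run `fn` only if no completed Task exists for `key`; otherwise replay the stored output.
  runTask<T>(key: string, fn: () => T): T {
    if (Object.prototype.hasOwnProperty.call(this.tasks, key)) {
      return this.tasks[key].output as T;
    }
    const output = fn(); // if this throws, nothing is stored, so the Task can be retried
    this.tasks[key] = { status: "COMPLETED", output };
    return output;
  }
}

let sideEffects = 0;
const taskCache = new TaskCache();

// Simulates run() being called twice and reaching the same Task key both times.
function runOnce(): string {
  return taskCache.runTask("send welcome email", () => {
    sideEffects += 1;
    return "sent";
  });
}

const firstResult = runOnce();
const secondResult = runOnce(); // served from the cache; the callback is not run again
```

This is also why keys need to be unique within a Run: two different Tasks sharing a key would hand each other's results back.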


You can install Integration packages in your project and you can also create your own. They expose useful API calls as Tasks which have been configured to give a great experience.

The most important properties from the request/response are highlighted in the dashboard.

The run() in detail

You were warned it was going to go deep. Here goes:

How a Run works

  1. The Job is prepared, as per the previous diagram.

  2. The run() function gets called.

  3. The io.wait function is called on your server. The API is called and the Task is created in the database with the WAITING status. This causes the SDK to throw a special ResumeWithTaskError which stops the run from continuing. On the API side the continue time is scheduled. After the wait is over the Run is ready to continue. The Task status is updated.

  4. The Job is prepared again. Because this isn’t the first time, the state of any existing Tasks is sent as well.

  5. run() is called again. Any Tasks that were received are added to the cache.

  6. io.wait() function is called again. This time there is a Task in the cache with the key "wait 24 hours". That Task is COMPLETED so we can continue.

  7. run() continues to the next code.

  8. io.github.getIssue is called, which is from the Integration. A request is made to the API and a PENDING Task is created. The underlying github.getIssue code mostly just wraps the rest.issues.get call from GitHub’s official SDK in runTask(). In this example, the request hits the GitHub API rate limit (a 429). When an error is thrown from runTask(), it is retried by default: a RetryWithTaskError is thrown (which stops the run from continuing), the Task is updated and the continue time is scheduled.

9–15. By this point you should get the idea of how the run() function works. Messages are sent back and forth, re-running happens and reliability is achieved.

When the run() function gets to the end the result is sent back to the API and it's set to COMPLETE.
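If the throw-and-replay dance is hard to picture, here's a toy version of steps 2 to 7 that runs entirely in memory. It's a sketch under heavy assumptions: the ResumeWithTaskError class here is a stand-in with a made-up taskKey field, wait and execute are hypothetical, and the "24 hours" passes instantly because the loop just marks the Task as COMPLETED.

```typescript
// Hypothetical stand-in for the SDK's control-flow error.
class ResumeWithTaskError extends Error {
  readonly taskKey: string;
  constructor(taskKey: string) {
    super("resume later: " + taskKey);
    this.name = "ResumeWithTaskError";
    this.taskKey = taskKey;
  }
}

function isResume(err: unknown): err is ResumeWithTaskError {
  return err instanceof Error && err.name === "ResumeWithTaskError";
}

type TaskStatus = "WAITING" | "COMPLETED";
type TaskState = Record<string, TaskStatus | undefined>;

const runLog: string[] = [];

// Like io.wait: the first call records a WAITING Task and throws to stop the run;
// a later call finds the Task COMPLETED and simply returns.
function wait(tasks: TaskState, key: string): void {
  if (tasks[key] === "COMPLETED") return;
  tasks[key] = "WAITING";
  throw new ResumeWithTaskError(key);
}

// The user's run() function: plain code, replayed from the top on every call.
function run(tasks: TaskState): string {
  runLog.push("run() called");
  wait(tasks, "wait 24 hours");
  runLog.push("after wait");
  return "done";
}

// Platform side: call run(); when it is interrupted, finish the wait and call it again.
function execute(): { result: string; attempts: number } {
  const tasks: TaskState = {};
  for (let attempts = 1; attempts <= 10; attempts++) {
    try {
      return { result: run(tasks), attempts };
    } catch (err) {
      if (!isResume(err)) throw err;
      // Pretend the scheduled time has passed and mark the Task as done.
      tasks[err.taskKey] = "COMPLETED";
    }
  }
  throw new Error("too many attempts");
}

// run() is called twice: once up to the wait, once all the way through.
const outcome = execute();
```

Notice that run() never "pauses". It gets thrown out of and started again from the top, and the stored Task state is what lets it skip straight past the work it already did.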

Too Long; Already Finished Reading provides resilient Background Jobs for serverless environments.

  • The API and SDK allow you to write Jobs in your codebase that run on your servers.

  • We’re Postgres maximalists, like Supabase.

  • The state of a Run and its Tasks/Subtasks are stored.

  • The run() is called multiple times. This can be caused by waits, errors, server timeouts, network failures, server outages…

  • Making sure your code is idempotent is important. This is also the case in a lot of situations outside of background jobs.

  • Tasks are important as they create resume points for rerunning and make Runs easier to debug in the dashboard.

You don’t need to understand how it works to start writing background jobs. But hopefully this was a fun deep dive. If you are excited by this, we’d love for you to give a try (cloud or self-hosted) or you can contribute to the project.