Tap does a lot, so I wanted to write up some quick notes on how I've been playing around with it today to help others get going. My use case is backfilling custom lexicons and filtering the firehose down to them. Example: let's say you're building an atproto app whose collection NSIDs live under xyz.atpoke.*. You want the historical records from across the Atmosphere plus any future ones, along with proof that it was really that user who made each record (the authenticated part of atproto). Tap solves that for you easily and quickly.

daniel 🫠
@dholms.xyz

finally landed it! Tap is your all-in-one sync tool for the Atmosphere: webhooks, backfill, filtering, signaling collections, no cbor/msts/signatures/cursors. this thing's got it all! give it a go and let me know what you think & if you run into any issues docs.bsky.app/blog/introdu...

Introducing Tap: Repository Synchronization Made Simple | Bluesky


Just about every app built on AT needs data from a repository at some point. For many use cases – feed generators, labelers, bots – streaming live data through a Relay or Jetstream works well. But som...


https://docs.bsky.app/blog/introducing-tap

These are quick notes I wrote while playing with Tap tonight to help you get going. They may or may not be best practice.

1. Install Go if you haven't already (see Go's install directions).

2. Install Tap with

go install github.com/bluesky-social/indigo/cmd/tap@latest

This should give you an executable named tap that you can run. If it doesn't, check the bin folder in your Go install directory. Mine was at ~/go/bin.

3. Run tap with

TAP_SIGNAL_COLLECTION=xyz.atpoke.graph.poke TAP_COLLECTION_FILTERS=xyz.atpoke.graph.poke,xyz.atpoke.* tap run

A breakdown of those env variables:

  • TAP_SIGNAL_COLLECTION - Tells Tap to ask a relay for every repo in the Atmosphere that has a record in the xyz.atpoke.graph.poke collection. Each repo it finds is added to the list of repos to sync, and its matching collections are backfilled on run. Pick a record you expect everyone who has used your atproto app to have, like a profile record created on sign up.

  • TAP_COLLECTION_FILTERS - Tells Tap which collections to keep, both when backfilling a tracked repo and when watching the firehose for future records. Set it to a comma-separated list of collection NSIDs. A wildcard like xyz.atpoke.* covers every collection under that prefix.
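To make the filter semantics concrete, here is a minimal sketch of how a comma-separated filter list with a trailing wildcard could be matched against a collection NSID. The helper names matchesFilter and matchesAny are hypothetical and written for illustration only. This is not Tap's actual implementation, and Tap's exact wildcard rules may differ.

```typescript
// Hypothetical helpers for illustration only. This is NOT Tap's code.
// A filter is either an exact NSID or a prefix ending in ".*".
function matchesFilter(collection: string, filter: string): boolean {
  if (filter.endsWith('.*')) {
    // 'xyz.atpoke.*' becomes the prefix 'xyz.atpoke.' (trailing dot kept),
    // so 'xyz.atpokes.foo' does not match by accident.
    const prefix = filter.slice(0, -1)
    return collection.startsWith(prefix)
  }
  return collection === filter
}

// Accepts the same comma-separated format used in TAP_COLLECTION_FILTERS.
function matchesAny(collection: string, filters: string): boolean {
  return filters
    .split(',')
    .map((f) => f.trim())
    .filter((f) => f.length > 0)
    .some((f) => matchesFilter(collection, f))
}
```

With the value from the command above, matchesAny('xyz.atpoke.graph.poke', 'xyz.atpoke.graph.poke,xyz.atpoke.*') is true, while a collection such as app.bsky.feed.post is not matched.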

And that's it. It will take a while and log a lot of output while it backfills to a SQLite db named tap.db.

To consume the backfill and live events, either set a TAP_WEBHOOK_URL env variable, in which case Tap sends a POST request to that URL for each event, or connect to Tap's Jetstream-like WebSocket interface. Each event has a live flag showing whether it came from a backfill or from the firehose.
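As a rough sketch of what handling one of these events might look like, the function below turns a record event into a log line that says whether it came from the backfill or the firehose. The TapRecordEvent shape and its field names are my assumption of the payload, and describeRecordEvent is a hypothetical helper, so check the Tap readme for the real event format before relying on it.

```typescript
// ASSUMED event shape for illustration. Verify field names against the Tap readme.
interface TapRecordEvent {
  id: number
  type: 'record'
  record: {
    live: boolean // assumed: true for firehose events, false for backfill
    did: string
    collection: string
    rkey: string
    action: 'create' | 'update' | 'delete'
    record?: unknown // the record body, absent on deletes
  }
}

// Hypothetical helper: builds a one-line summary of a record event.
function describeRecordEvent(evt: TapRecordEvent): string {
  const { live, did, collection, rkey, action } = evt.record
  const source = live ? 'firehose' : 'backfill'
  return `[${source}] ${action} at://${did}/${collection}/${rkey}`
}
```

A webhook handler would parse the POST body as JSON, pass it to something like this, and then store or index the record.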

You can use the new @atproto/tap TypeScript library to consume the WebSocket, but be prepared: once events play through and are acknowledged, they will not be sent again. I put a hack in this example showing where you can set the ack method to a no-op function so you can rerun the script while developing.

import { Tap, SimpleIndexer } from '@atproto/tap'

const tap = new Tap('http://localhost:2480')

const indexer = new SimpleIndexer()

indexer.identity(async (evt) => {
    console.log(`${evt.did} updated identity: ${evt.handle} (${evt.status})`)
})


indexer.record(async (evt, opts) => {

    const uri = `at://${evt.did}/${evt.collection}/${evt.rkey}`
    if (evt.action === 'create' || evt.action === 'update') {
        console.log(`${evt.action}: ${uri}`)
    } else {
        console.log(`deleted: ${uri}`)
    }
    // Uncomment to NOT acknowledge the event to the Tap server, so you can rerun the script during development
    // opts.ack = () => console.log('"acknowledged"')
})


indexer.error((err) => console.error(err))

const channel = tap.channel(indexer)
channel.start()

In plain language, what we did above

1. We give Tap a collection we know everyone who uses our atproto app has, like sh.tangled.actor.profile. This is TAP_SIGNAL_COLLECTION.

2. Tap calls com.atproto.sync.listReposByCollection on a relay to get a list of repos that have records in that collection.

3. Tap downloads each repo's export (its .car file) and reads through it, saving any record in that collection. While reading the export it also picks up whatever you have set in TAP_COLLECTION_FILTERS. For instance, if you set sh.tangled.*, it backfills the records of every collection in the repo with that NSID prefix. This is all saved to a local db, which is SQLite by default but can be changed to something else, like Postgres, by setting the TAP_DATABASE_URL variable.

4. While that is happening, Tap is also listening to the firehose for new records.
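Step 2 can be sketched by hand to see what Tap is doing for you. The snippet below builds the com.atproto.sync.listReposByCollection query and pages through the results. The relay host is a placeholder you would replace, the function names are hypothetical, and the response shape (a repos array of objects with a did, plus an optional cursor) is how I understand the lexicon, so treat it as an assumption.

```typescript
// Builds the XRPC query URL. `relay` is a placeholder base URL you supply.
function listReposUrl(relay: string, collection: string, cursor?: string): string {
  const url = new URL('/xrpc/com.atproto.sync.listReposByCollection', relay)
  url.searchParams.set('collection', collection)
  if (cursor) url.searchParams.set('cursor', cursor)
  return url.toString()
}

// Pages through the results and collects every DID.
// ASSUMED response shape: { cursor?: string, repos: { did: string }[] }
async function listRepos(relay: string, collection: string): Promise<string[]> {
  const dids: string[] = []
  let cursor: string | undefined
  do {
    const res = await fetch(listReposUrl(relay, collection, cursor))
    if (!res.ok) throw new Error(`listReposByCollection failed: ${res.status}`)
    const page = (await res.json()) as { cursor?: string; repos: { did: string }[] }
    for (const repo of page.repos) dids.push(repo.did)
    cursor = page.cursor // undefined on the last page ends the loop
  } while (cursor)
  return dids
}
```

Tap does this discovery for you and then goes on to download and verify each repo, which is the part you really don't want to write yourself.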

Some notes on the firehose part:

  • Tap saves and sends events by repo, as in the user's DID. You can also add a repo manually with a POST to Tap's /repos/add endpoint. This also triggers a backfill using your previously set env variables.

  • If Tap sees a new record go by for TAP_SIGNAL_COLLECTION from a user it is not tracking, it sends the record along and starts tracking that user. That is NOT the case for TAP_COLLECTION_FILTERS: a record matching only those filters does not start tracking a repo, but Tap still saves those collections as long as the repo is already being "tracked" as previously mentioned.
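Adding a repo by hand could look something like the sketch below. It assumes a local Tap instance on port 2480 with no admin password, and it assumes the endpoint takes a JSON body with a dids array, so double check the readme for the exact request format. buildAddReposRequest and addRepos are hypothetical helper names.

```typescript
// Builds the request for Tap's /repos/add endpoint.
// ASSUMED body shape: { "dids": ["did:..."] }
function buildAddReposRequest(tapUrl: string, dids: string[]) {
  return {
    url: new URL('/repos/add', tapUrl).toString(),
    init: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ dids }),
    },
  }
}

// Sends the request. Tap should then start tracking and backfilling those repos.
async function addRepos(tapUrl: string, dids: string[]): Promise<void> {
  const { url, init } = buildAddReposRequest(tapUrl, dids)
  const res = await fetch(url, init)
  if (!res.ok) throw new Error(`adding repos failed: ${res.status}`)
}
```

For example, addRepos('http://localhost:2480', ['did:example:alice']) would ask a local instance to track a single placeholder DID.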

There are some other really cool things about Tap as well, like identity handling, which you can find in the readme.

Railway template

@dholms.xyz made some directions on how to deploy with Railway. I took those further and made a template, so ideally you just have to set the collections you want to filter, set a password, and set up the domain, and you are good to go! This example also uses Postgres as the DB so you can connect and explore that.

Railway is a paid service, but the "Deploy on Railway" button should give you $20 in credits with my referral code. This also gives me 15% if you end up spending anything in the future. It should also be noted that Tap runs against the full firehose and downloads repo exports. This can be network intensive, and Railway does charge for egress. I backfilled 75 repos and 50k records while streaming them to my computer, and it cost me $0.10. Your mileage may vary, and the cost can grow a lot depending on what you are backfilling. So please watch your usage and set usage limits, because this can be a lot if you try to backfill something like Bluesky.

Example showing what you need to set following what we did above

On deploy, you may have to click on the tap service, go to settings, and make sure the public networking domain uses port 2480 and has either a generated domain or a custom domain.

The client setup for Tap is a bit different here since it has a password:

const tap = new Tap('https://{your tap service public domain}', {adminPassword: 'topsecret'})