Last.fm To The Cloud: How We Migrated Our API, Live

Ben XO
9 min read · Nov 12, 2020

Please note: Last.fm is hiring engineers! Skip to the end for how to apply.

A cloud with the AudioScrobbler logo, and a ladder leading up to it.

One of the more difficult challenges we faced when moving the Last.fm infrastructure into the cloud was migrating our live API traffic, which amounts to tens of thousands of requests per second, with no disruption to the millions of users who rely on it every day. The aim was to make the change in a way that users would not even notice. But, for several reasons, this is tricky!

Structure of our API service

The API speaks to many of the same backends as the web pages on the site, but it has a few aspects which are specific to the API itself:

  • a database of API keys and applications
  • a cache, accessed thousands of times per second, to support API Sessions (i.e. the data which enables users to be logged in within an app)
  • a cache used for rate limiting.

The vast majority of this traffic comes through a single URL: ws.audioscrobbler.com.

Our API Filter application deals with all of the methods on our API page, as well as the legacy Scrobble API, which is still actively used by some third party apps. It arbitrates between anonymous and logged-in features, and it has to accept users’ scrobbles quickly and reliably, whilst rejecting API abuse and ensuring that the service provided to apps is fast and fair.

The starting point.

Whilst deploying a second copy of all of these services is easy enough, the problem is ensuring seamless continuity from the point of view of users who are actively using the service at the time of the transition.

Deploy to cloud.

From the outside, the main visible change is that the DNS entry for ws.audioscrobbler.com has to change to a new IP address. But those familiar with DNS will know that a change like this is not instantaneous. For some period of time — perhaps even days — traffic will still be sent via the old address. So, both the old and new deployments have to serve the same data, and a log-in through one entrypoint must create a session that works at either entrypoint. Failure to share the session cache could cause a user to be unexpectedly logged out of the app they’re using — perhaps more than once.

Key Challenges

One of the key challenges to overcome is that while both the caching layer we use (Redis) and the database (Postgres) support replication, which keeps two copies in different locations in sync, neither of them supports having more than one Primary instance: writes can only go to one of them. This means that, at first, the new deployment must be configured to access the database and cache of the old deployment.

Postgres

Let’s deal first with Postgres. Fortunately for us, all API keys and application data exist in the Redis cache, and so the consequence of turning Postgres off entirely is limited: for a period of time, you would not be able to register a new API key (as a developer), but other than that there is no user-visible impact. The strategy we chose here was to make the API database read-only, then export the data and import it into the new instance in the cloud, before pointing both new and old API stacks to the new database. The migration time here was under an hour: not so bad.
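As an illustration of that first step, here is one way a Postgres database can be flipped to read-only for the duration of a migration; the host, database and user names here are hypothetical, not our actual setup:

    # Make new connections to the API database default to read-only
    # (hypothetical database name "api"; run as a superuser)
    psql -h old-db-host -U postgres -c \
      "ALTER DATABASE api SET default_transaction_read_only = on;"

    # Existing connections keep their old setting, so recycle the app's
    # connection pool before starting the export.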

One point to note about dumping and restoring Postgres from your own deployment into a managed system (such as Google Cloud SQL) is that you probably want the database users to be managed by Google Cloud. That means they will not be part of your database dump, and the login credentials will likely be different. So you have to study the options to pg_dump to make sure you are dumping the structure and data, but nothing extra such as the user accounts.
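As a sketch (again with hypothetical host and database names), a dump that keeps the structure and data but drops the ownership and privilege statements tied to your own user accounts might look like this:

    # --no-owner and --no-acl omit ALTER OWNER and GRANT/REVOKE statements
    pg_dump -h old-db-host -U postgres --no-owner --no-acl \
      --format=custom --file=api.dump api

    # Restore into the managed instance using its own credentials
    pg_restore -h cloudsql-host -U postgres --no-owner --no-acl \
      --dbname=api api.dump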

You may need to make some manual additions to the process to ensure the correct permissions on your tables — it’s worth examining all the queries your apps do to check you have all the required permissions covered!
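For example, if the application connects as its own role (a hypothetical api_app user here), the grants may need recreating by hand after the restore:

    # Recreate grants for the application role on the new instance
    psql -h cloudsql-host -U postgres -d api -c \
      "GRANT USAGE ON SCHEMA public TO api_app;
       GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO api_app;
       GRANT USAGE, SELECT ON ALL SEQUENCES IN SCHEMA public TO api_app;"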

It’s also extremely valuable to do a dry run, with a detailed set of repeatable instructions, to find out how long this process will take. Do it twice — sometimes, you’ll need to add extra steps to tear down data you created the first time, to ensure the process is truly repeatable at the time you need it to go smoothly.

Read only database — no new API keys possible during this configuration
Dump and restore the database (less than 1 hour)
Use the cloud database directly — API app creation re-enabled

Redis

Redis, however, is another story. With tens of thousands of writes per second, it would not be possible to take it offline; and even the time taken to deploy a configuration change (were we to try to switch to the new Redis “instantaneously” with a configuration change to the app) could cause some users’ scrobbles to be lost, as they would be unable to create a session to submit them through. That would not be acceptable!

Tailoring the approach to the types of traffic we serve

At this point it’s worth examining the types of API traffic we receive, and adjusting the strategy in a more nuanced way.

Our API traffic breaks down broadly into three kinds:

  • API requests to fetch information
  • API requests where a user is logging in, creating a session (perhaps for scrobbling or other logged-in features)
  • Scrobbles.

Scrobbles themselves also break down into two kinds:

  • Those which come through our recommended scrobbling API (Scrobble 2.0), which are part of a single long lived session
  • Scrobbles from the Scrobble 1 API, which, due to the way third party apps are typically implemented, often create a new session on a scrobble-by-scrobble basis.

What this means for our session cache is that we actually have two kinds of sessions: long-lived ones (which can be active for months or even years) — these are used by apps such as the Last.fm Android app — and short lived sessions which are created for a single use and then expire a short time later.

We decided that by splitting the cache into two caches — long lived, and short lived — we could migrate the two with different strategies. In fact, the short lived cache doesn’t even need to be kept in sync with replication. A brand new cache can be created just for the new deployment, and scrobbles coming in will be picked up seamlessly.
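To illustrate the distinction (with made-up key names and lifetimes, not our real schema): short-lived sessions are written with an expiry, so an empty cache simply fills up as new sessions are created, while long-lived sessions have no expiry and genuinely need to be copied across.

    # Short-lived session: stored with a TTL, so nothing old needs migrating
    redis-cli -h short-session-cache SET "session:abc123" "<session data>" EX 3600

    # Long-lived session: no expiry, so this cache must be replicated to the cloud
    redis-cli -h long-session-cache SET "session:def456" "<session data>"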

The Rate Limiting cache is ready to go — it’s not necessary to replicate it.
At this point we split the session cache into short and long lived versions.

Scrobble 1 only uses this type of cache, so we could divert Scrobble 1 traffic first, independently of the rest of the API: a seamless change with no loss of data.

For the long-lived session cache, we set up Redis with replication. Only one of the replicas can be written to, but swapping the roles is quick (although not seamless). The only user-visible impact would be that during the few minutes in which we made the new Redis the primary and changed the configuration of the API endpoint to use it, a user would not be able to log in to apps successfully; but if they were already logged in, they would not notice any change.

One complication with our Redis is its size: around 30 GB. By default, Redis will save the entire dataset to disk before transferring a copy to a new replica. This takes time; it’s not uncommon for the first replication to take as long as half an hour.

Redis also will not do partial transfers to new replicas (it always does a full sync), so you must ensure a few things first, otherwise replication will never succeed:

  • Both the Primary and the Replica must have long enough timeouts set to wait for the transfer without assuming the other has timed out.
  • You must set repl-timeout high enough, and repl-backlog-size large enough, to account for any changes which happen in the meantime. (Moving the short lived keys to a separate Redis helped considerably with this; a sketch of these settings follows below.)
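Here is a sketch of those settings, with hypothetical host names and deliberately generous values (the right numbers depend on your dataset size and write rate):

    # Allow plenty of time for the initial ~30 GB sync, on both ends
    redis-cli -h old-primary   CONFIG SET repl-timeout 3600
    redis-cli -h cloud-replica CONFIG SET repl-timeout 3600

    # Make the backlog large enough to hold every write made during the sync,
    # so the replica can catch up without triggering another full transfer
    redis-cli -h old-primary CONFIG SET repl-backlog-size 1073741824   # 1 GiB, in bytes

    # Then point the cloud instance at the on-premise primary
    redis-cli -h cloud-replica REPLICAOF old-primary 6379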

For high availability, we use a Redis HA setup managed by Redis Sentinel. The cloud deployment of this required a little trickery to make the active primary behave as a replica of the Redis we were bootstrapping from. This meant scaling it down to a single instance in GKE, so that Sentinel didn’t try to “correct” the problem by switching it back to being a replica of a different (empty) instance in GKE; for this we used kubectl scale --replicas=1 to temporarily turn off the other replicas. Only after the entire process is complete (and this Redis is the only active instance) can it be scaled back up.
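In Kubernetes terms, the dance looks roughly like this (hypothetical resource names, not our actual manifests):

    # Scale the GKE Redis down to a single pod so Sentinel has nothing to fail over to
    kubectl scale statefulset redis-session --replicas=1

    # Make that lone pod a replica of the on-premise primary we're bootstrapping from
    kubectl exec redis-session-0 -- redis-cli REPLICAOF old-primary.internal 6379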

Of course, in this state it’s fragile — you must ensure that the link between your premises and your cloud deployment does not go down during the migration.

The final steps

This short-lived cache doesn’t need replication but the long-lived cache does.
At this point we can send all Scrobble 1 traffic straight to cloud with a DNS change!

With these changes in place, all that remained was to set the old deployment to forward all traffic to the new deployment, and the migration was complete!

Reroute all API traffic from one load balancer to the other. (Logging into apps temporarily disabled)
Break the replication and make the cloud Redis read/write. Normal service restored!

At this point, after breaking the replication from your original site, don’t forget to scale your Redis back up to a sensible number (we chose 3 replicas in different zones for redundancy).
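With the same hypothetical names as before, the final promotion is the reverse of the bootstrap:

    # Promote the cloud Redis: stop replicating and start accepting writes
    kubectl exec redis-session-0 -- redis-cli REPLICAOF NO ONE

    # Bring the replica count back up for redundancy
    kubectl scale statefulset redis-session --replicas=3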

Change the DNS for the main API endpoint and wait for the traffic to migrate naturally.

With the DNS for the main API endpoint now changed, traffic will gradually migrate over to the new endpoint over the next few days.
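You can watch the cut-over happen from the outside, for example by checking which address (and remaining TTL) a resolver currently returns:

    # Query the API hostname and show only the answer section
    dig ws.audioscrobbler.com A +noall +answer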

Lessons learned

We learned three important lessons from this process.

The first is that many common database and cache applications are not easy to work with in a cross-site configuration. Databases which support multiple primaries would have made this process much easier. There are proxies we could have deployed to duplicate traffic, but these add yet more layers of complexity, and each of those proxies would itself need production-level alerting, monitoring and graphing before it could be deployed safely.

The second lesson is that sometimes it pays to make some small modifications to the application to support unusual situations you face. The change to split the cache into two types was simple — and it’s very important to know the properties and lifetimes of the types of data your app stores.

The third lesson is that diagrams speak a thousand words. We used diagrams like the ones in this blog post throughout the migration to help the team understand what needed to be done, in what order, and with what implications. It’s easy to configure things for the final setup, but the short-term steps in between, which ensure continuity, are much harder to explain without pictures to refer to!

Special Thanks

Special thanks to Matty Finnie (our lead designer) for upgrading the diagrams in this blog post!

Interested in working on problems like this?

Last.fm is currently looking for engineers to help us solve problems and develop new features for our complex, service-oriented system that serves millions of users every day. If you’d like to apply, send your CV and a cover letter to jobs@last.fm!

Ben XO

Director Engineering @lastfm / D&B DJ on @bassdrive since 2001 (live show tracklist: @bdxposure, podcast: http://t.co/uKi8i128CW)