Last updated Aug 13, 2024

Getting Started with DROID


“There are only two hard things in Computer Science: cache invalidation and naming things.” - Phil Karlton

DROID (Distributed Read Optimised Inter-region Datastore) is a highly reliable, highly available, read-optimised, cross-region platform for metadata distribution. The DROID sidecar runs alongside your application and provides reliable, low-latency access to a metadata cache, with invalidation handled for you.

DROID is an evolution of the Tenant Context Service (TCS) infrastructure to support metadata distribution beyond the original Tenant Platform use case.

With DROID we are building on and evolving the proven TCS infrastructure to allow more teams and systems to leverage the high availability and low latency of TCS as a general read-optimized metadata caching platform.

DROID is an evolution of a currently running platform, TCS, not a new one. We are building on a foundational high-scale system that has already solved the problem of metadata distribution at scale for Tenant Platform.

Even without DROID, several of the key pieces described below (especially sharding) have already been built as TCS usage continues to grow. This means the difference between DROID and organic TCS evolution is actually quite small.

Why do this?

The Context team have many years of experience operating a multi-region, tier 0 metadata distribution system at scale. TCS was built to solve a hard problem - reliable distribution and cache invalidation for tenant metadata. This problem is not local to Tenant Platform, and Atlassian as a whole has needed a system to solve it. Our vision is to have the DROID sidecar running on every node, providing reliable, fast access to metadata.

Rather than every team trying to solve this, we aim to open up our metadata distribution platform so teams can easily onboard and benefit as we evolve our system.

Several features we are building as part of DROID (for example TCS sharding, dynamic entity types) are also needed for TCS as our usage more than doubles every 12 months. The gap between what we need to evolve TCS organically and what we need for DROID is actually not that large. Extended spikes have validated the size of this gap. TCS is already a critical piece of the Atlassian cloud platform and is a natural fit to grow to a more general platform offering.

DROID High Level Plan

We have been working on DROID with a focus on building out the foundations of the system, spiking new ingestion methods and working with early adopters.

DROID has been progressing while the team continues to support TCS as a tier-0 system and deliver other non-DROID programs like Fedramp, DaRes, Sliver, External User Security, UPP and Bitbucket/Trello Admin Hub Integration.

To help us focus on TCS/DROID we formed a new team called Tenant Catalogue, which supports and evolves the current Cloud Provisioner → Tenant Catalogue → Transformer Service pipeline.

Refactoring TCS invalidation engine

We started this journey with a refactor of the core of TCS’s invalidation engine, moving it from SQS to S3. This was necessary due to scaling limits and cost constraints.

Internal Record type sharding

A key part of DROID will be the ability to shard by record type. This is needed to limit the blast radius of a misbehaving or overloaded record type. To set this up we have separated record types in code.
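To illustrate why sharding by record type limits the impact of a noisy record type, here is a minimal sketch. It is not the TCS/DROID implementation: the `RecordTypeShards` class, the per-shard request budget and the record type names ("tenant", "feature-flags") are all hypothetical and exist only for this example.

```python
from collections import defaultdict


class ShardOverloaded(Exception):
    """Raised when a record type's shard has used up its request budget."""


class RecordTypeShards:
    """Hypothetical router: each record type gets its own store and budget."""

    def __init__(self, budget_per_shard):
        self.budget_per_shard = budget_per_shard
        self.used = defaultdict(int)      # record type -> requests served so far
        self.stores = defaultdict(dict)   # record type -> isolated key/value store

    def put(self, record_type, key, value):
        self.stores[record_type][key] = value

    def get(self, record_type, key):
        # The budget is tracked per record type, so one noisy type
        # cannot consume capacity that belongs to another.
        if self.used[record_type] >= self.budget_per_shard:
            raise ShardOverloaded(record_type)
        self.used[record_type] += 1
        return self.stores[record_type].get(key)


shards = RecordTypeShards(budget_per_shard=2)
shards.put("tenant", "t1", {"region": "us-east-1"})
shards.put("feature-flags", "f1", {"enabled": True})

# Exhaust the budget of one (hypothetical) record type...
shards.get("feature-flags", "f1")
shards.get("feature-flags", "f1")
try:
    shards.get("feature-flags", "f1")
    noisy_rejected = False
except ShardOverloaded:
    noisy_rejected = True

# ...while the other record type is still served.
tenant = shards.get("tenant", "t1")
```

In this sketch the overloaded record type is rejected once its own budget is spent, and reads for the other record type are unaffected. A real system would isolate far more than a counter (storage, compute, invalidation traffic), but the principle is the same.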

Two new forms of metadata ingestion

Background

Prior to DROID the only ingestion path for metadata to TCS was via the pipeline:


Cloud Provisioner → Tenant Catalogue → Transformer Service → TCS

with this metadata forming part of the Catalogue Service Record. This pattern has served us well and will continue to be the ingestion pipeline for CSR metadata.

However, this method has the following limitations on wider usage:

  • The pipeline requires a heavy integration phase to onboard, involving several Tenant Platform teams: Cloud Provisioning (CP), Tenant Catalogue (TC) and Context. Operational incidents can also require these teams to coordinate to diagnose and resolve.
  • The architecture of TCS means it can potentially operate at a far greater scale (especially when we implement sharding, see below). The current ingestion pipeline will not support this, as the CP and TC systems are appropriately designed with the scale of the CSR in mind. Refactoring the CP and TC systems to handle this scale would be cost prohibitive and would involve work from several more teams, and we would still need to deliver the scaling pieces of the DROID program in TCS.
  • Teams have requested metadata be added to the CSR just so it can be served by TCS. This has bloated the CSR with keys that really should not be present, increasing the surface area and operational load on the Cloud Provisioning and Tenant Catalogue teams.
  • We have had to reject integrations where the metadata does not make sense to be in the CSR but TCS would have been an appropriate distribution system.

New methods for ingesting metadata

For DROID we have built two methods to ingest metadata - Read-through and External Ingestion.

Integration Type | Overview
Read-through | TCS loads metadata from external backing stores (REST endpoints) in case of cache misses.
External Ingestion | Metadata Providers publish source records to StreamHub for DROID to listen to, transform and store in DynamoDB. Cache misses will be loaded from DynamoDB.

Which system to use will depend on the needs, data size and access patterns of the systems we are integrating with.
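The difference between the two methods can be sketched roughly as follows. This is an illustrative model only, not the DROID implementation: the class names, the in-memory dictionaries standing in for DynamoDB and the local cache, the callback standing in for a REST call, and the choice to drop the cached copy when a new source record arrives are all assumptions made for the example.

```python
class ReadThroughCache:
    """On a cache miss, load from a backing store and keep the result."""

    def __init__(self, load_from_backing_store):
        self._load = load_from_backing_store  # stand-in for a REST call
        self._cache = {}

    def get(self, key):
        if key not in self._cache:
            self._cache[key] = self._load(key)
        return self._cache[key]


class ExternalIngestion:
    """Providers publish records; misses are served from a durable store."""

    def __init__(self, transform):
        self._transform = transform
        self._durable = {}  # stand-in for DynamoDB
        self._cache = {}

    def on_source_record(self, key, source_record):
        # Called once per published source record (a stand-in for
        # listening to StreamHub): transform, then store durably.
        self._durable[key] = self._transform(source_record)
        self._cache.pop(key, None)  # assumption: drop any stale cached copy

    def get(self, key):
        if key not in self._cache:
            # Raises KeyError if no provider ever published this key.
            self._cache[key] = self._durable[key]
        return self._cache[key]


backend_calls = []


def fake_rest_lookup(key):
    # Hypothetical backing store; records each call so reuse is visible.
    backend_calls.append(key)
    return {"key": key, "plan": "standard"}


read_through = ReadThroughCache(fake_rest_lookup)
first = read_through.get("site-1")
second = read_through.get("site-1")  # served from cache, no second lookup

ingestion = ExternalIngestion(transform=lambda r: {"plan": r["plan"].lower()})
ingestion.on_source_record("site-2", {"plan": "PREMIUM"})
ingested = ingestion.get("site-2")
```

In the read-through model the data stays with the owning system and is only fetched on demand, whereas in the external ingestion model the provider pushes data ahead of time and reads never reach the provider.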

Both of these methods allow us to bypass the CP → Catalogue Record pipeline. DROID allows TCS to distribute metadata for systems without requiring CP or TC integration, which should greatly simplify onboarding.

Both methods are in early beta with selected systems. Onboarding is currently a high-touch “white glove” process while we build and learn from the first couple of integrations.
