Last updated Aug 13, 2024

Integration types

Before you start

DROID is a distributed caching store with extremely low latency and high availability. There are trade-offs baked into its design that suit certain use cases but not others. To onboard to DROID successfully, it's vital to understand both DROID's characteristics and your use case's requirements, and whether the two are really a good fit.

Do your homework first - DROID 101

Prepare a questionnaire and share it with Context - Integration squad

This is the time to fill in your DROID questionnaire on a best-effort basis. Don't be put off by how extensive the questions are; they cover everything that might be relevant to a DROID integration. Don't spend a huge amount of time gathering data either. The idea is just to collect enough relevant data to tell us:

  • Whether DROID is truly suitable for your use case
  • Whether you’re better off with read-through or external ingestion flavor of DROID
  • Whether your scale warrants a dedicated shard in DROID

When your questionnaire is finished, please share it with Context - Integration squad (you can find us in the #help-tcs Slack channel).

Schedule a catchup with DROID team and have your use case reviewed

Give us some time to digest your questionnaire. Meanwhile, schedule a catchup (between 30 minutes and an hour) so we can discuss your use case in more detail and answer any questions you may have.

Integration Type Comparison

A comparison of the two approaches for ingesting metadata into TCS while skipping the provisioning path: their pros and cons and when to recommend each approach.

What is it?

Read-through: TCS acts as a cache proxy to another service (the source service). Our WebServers call the source service on cache misses. We keep a copy of the last response we received and serve it to consumers. The source service has the option to send invalidation events to TCS via StreamHub.

External Ingestion: Data providers send their data to TCS using StreamHub events (live) and Wrench (bulk). TCS ingests the data and writes it into DDB, and TCS WebServers are used to look it up from DDB. TCS sends invalidation events when DDB writes change data.
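To make the read-through flow above concrete, here is a minimal sketch. All names (CacheStore, SourceServiceClient, ReadThroughCache) are hypothetical and are not the actual TCS/DROID API; the sketch only illustrates the cache-miss path and where StreamHub-driven invalidation would plug in.

```java
import java.util.Optional;

// Hypothetical interfaces -- not the real TCS/DROID API, just an
// illustration of the read-through flow described above.
interface CacheStore {
    Optional<String> get(String key);
    void put(String key, String value);
    void invalidate(String key); // driven by optional StreamHub invalidation events
}

interface SourceServiceClient {
    Optional<String> fetch(String key); // REST call to the source service
}

class ReadThroughCache {
    private final CacheStore cache;
    private final SourceServiceClient source;

    ReadThroughCache(CacheStore cache, SourceServiceClient source) {
        this.cache = cache;
        this.source = source;
    }

    // On a hit we serve the last response we stored; on a miss the
    // WebServer calls the source service, keeps a copy, and returns it.
    Optional<String> lookup(String key) {
        Optional<String> cached = cache.get(key);
        if (cached.isPresent()) {
            return cached;
        }
        Optional<String> fresh = source.fetch(key);
        fresh.ifPresent(value -> cache.put(key, value));
        return fresh;
    }
}
```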

How does it work?

Read-through: The Context team will work with you to define contracts covering
  1. how TCS fetches data from your service on a cache miss, and
  2. (optional) how your service informs TCS about changes in the source data.

External Ingestion: Your service sends events to StreamHub whenever a record is created or updated. The AVI, schema and throughput of the events must be agreed upon by both your team and the Context team.
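As a rough illustration of the external ingestion side, the sketch below shows the kind of create/update event a producer might publish. The event type, field names and publisher interface are assumptions for the example only; the real AVI, schema and throughput are agreed with the Context team during onboarding.

```java
import java.time.Instant;
import java.util.Map;

// Hypothetical event shape for external ingestion. The real AVI, schema and
// field names are agreed with the Context team per integration; this only
// illustrates the "publish on create/update" pattern.
record EntityChangedEvent(
        String avi,                 // event type identifier agreed during onboarding
        String recordKey,           // key the record will be looked up by in TCS
        int schemaVersion,          // bump when the payload schema changes
        Instant occurredAt,         // when the record was created or updated
        Map<String, Object> payload // record body that TCS will ingest into DDB
) {}

class RecordProducer {
    // Placeholder for the producing service's StreamHub publisher; the real
    // client and its API are owned by that service and not shown here.
    interface StreamHubPublisher {
        void publish(EntityChangedEvent event);
    }

    private final StreamHubPublisher publisher;

    RecordProducer(StreamHubPublisher publisher) {
        this.publisher = publisher;
    }

    // Send an event whenever a record is created or updated.
    void onRecordUpserted(String key, Map<String, Object> record) {
        publisher.publish(new EntityChangedEvent(
                "example.team.entity.changed", // example AVI only
                key, 1, Instant.now(), record));
    }
}
```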

Requirements

Read-through: If you choose to onboard your service as the backing store for a read-through cache, there are strict requirements to make sure that your integration with DROID meets our standards in terms of availability, reliability and performance. These include, but are not limited to:
  1. Service tiers: your service must be tier 0 or tier 1.
  2. Your service must be present in at least 2 regions.
  3. Your service must be able to provide DROID with source records via a REST endpoint according to our contract. There must be published and reasonable SLOs/SLIs for the availability, reliability and latency of that endpoint.

External Ingestion: If you choose the external ingestion path, your service must provide DROID with a means to bootstrap the whole dataset. There are strict requirements on the availability, reliability and scale of that mechanism so that, when necessary (e.g. during an incident), we can restore the whole dataset in a timely manner.
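The interfaces below sketch what the two requirements above could look like from the providing team's side: a keyed REST lookup for read-through, and a full-dataset scan for the external ingestion bootstrap. Names and shapes are illustrative assumptions only; the actual contracts are defined with the Context team.

```java
import java.util.Iterator;
import java.util.Optional;

// Illustrative contracts only; actual endpoint shapes, auth and SLOs are
// agreed with the Context team during onboarding.

// Read-through: the source service exposes a keyed lookup, e.g.
// GET /records/{key} -> 200 with the record, 404 when absent. It must be
// served from at least 2 regions and meet published SLOs/SLIs.
interface SourceRecordEndpoint {
    Optional<SourceRecord> getRecord(String key);
}

// External ingestion: the provider supplies a way to stream the whole
// dataset so it can be re-bootstrapped in a timely manner when needed.
interface DatasetBootstrap {
    Iterator<SourceRecord> scanAll();
}

record SourceRecord(String key, byte[] body, long version) {}
```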

Features

  • Invalidation: Read-through: yes, via StreamHub. External Ingestion: yes.
  • Batch Flushing: Read-through: not needed. External Ingestion: yes.
  • Data Transformation: Read-through: no transformation. External Ingestion: yes.
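To illustrate the "Data Transformation" row for external ingestion, here is a sketch of what a transformer could look like. The IngestionTransformer interface is hypothetical; the actual extension point lives inside TCS and is defined together with the Context team.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical transformer hook for external ingestion; not the real TCS
// extension point, just an illustration of reshaping an event payload
// into the record that gets written to DDB.
interface IngestionTransformer {
    Map<String, Object> transform(Map<String, Object> eventPayload);
}

class DropInternalFieldsTransformer implements IngestionTransformer {
    @Override
    public Map<String, Object> transform(Map<String, Object> eventPayload) {
        Map<String, Object> out = new HashMap<>(eventPayload);
        out.keySet().removeIf(k -> k.startsWith("internal_")); // example rule only
        return Map.copyOf(out);
    }
}
```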

Best suited for

  • Data shape: Read-through: no transformation. External Ingestion: transformers supported.
  • Record size: Read-through: up to the source service and the max_item_size of our L2 caches (set to 1MB as of Jan 5, 2024). External Ingestion: the maximum is the minimum of the StreamHub event size limit, the DynamoDB record size limit and the SQS record size limit, roughly 250KB as of Jan 5, 2024.
  • Dataset "liveness" / access pattern: Read-through: suits use cases with large potential datasets where most records are never accessed. External Ingestion: suits use cases where most records may be used at some point.
  • Frequency of change: once we've built invalidation, the main relevance is that this affects the load on the data source service.
  • Traffic: slow.
  • Producer Integration Tier: Read-through: must be tier 0 or 1 (access reliability/performance depends on the source datasource). External Ingestion: can integrate with all service tiers (ingestion freshness depends on the StreamHub supplier).
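For the record size row, the ~250KB figure for external ingestion comes from taking the minimum of the limits along the ingestion path. The sketch below shows that arithmetic; the StreamHub limit used here is an assumption (check the current platform docs), while the DynamoDB (400KB) and SQS (256KB) limits are the published AWS values.

```java
// Illustrative arithmetic behind the ~250KB external ingestion limit:
// the effective maximum record size is the smallest limit in the path.
class ExternalIngestionLimits {
    static final long STREAMHUB_EVENT_LIMIT = 256 * 1024; // assumed; verify against current StreamHub docs
    static final long DYNAMODB_ITEM_LIMIT   = 400 * 1024; // DynamoDB item size limit
    static final long SQS_MESSAGE_LIMIT     = 256 * 1024; // SQS message size limit

    static long maxRecordSize() {
        return Math.min(STREAMHUB_EVENT_LIMIT,
                Math.min(DYNAMODB_ITEM_LIMIT, SQS_MESSAGE_LIMIT));
    }

    public static void main(String[] args) {
        // Prints 262144 bytes (256KB), i.e. roughly the 250KB quoted above.
        System.out.println("Effective max record size: " + maxRecordSize() + " bytes");
    }
}
```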

Pros and Cons

Read-through
Pros
  • Least effort to integrate if you already have a source service.
Cons
  • Needs a highly available data source service, since consumers expect a tier 0 service from TCS.
  • In the event of an incident in the backing service, the TCS team will be paged for an incident outside our control.

External Ingestion
Pros
  • Peace of mind. Just send a StreamHub event whenever your data changes; DROID takes care of transforming, ingesting, distributing and invalidating it.
Cons
  • More effort to integrate with DROID, especially the bulk ingestion path.
  • Probably not a huge deal, but ingesting via StreamHub adds 2-3 seconds (p99) on top of the usual TCS ingestion delay.

Existing use cases

Read-through / External Ingestion
  • Config Registry
  • Access Narrowing
  • Identity license checks
  • DLP - Data classification tags

Sharding

Read-through: If the source service is sharded in a traditional sense, TCS may not be able to reach out to the right shard to get the necessary data for your keys.

Availability requirements on other team’s data-supplying service

Read-through:
  • Explicit communication of expectations that our service is no better than what the data source provides. If that is not acceptable, the data source needs to be uplifted to tier 0.
  • The Context team and the team providing the data must work out a model of shared ownership in terms of maintenance, alerts, incident management, etc. for the integration to succeed.

External Ingestion:
  • The Context team and the team providing the data must work out a model of shared ownership in terms of maintenance, alerts, incident management, etc. for the integration to succeed.

Impact on DROID

Read-through:

Effect on our SLOs / performance

Availability:
  • Cold cache (when we do not have the data): we can only provide the same tier as the data source service.
  • To do multi-region, they'd need to provide a single Micros alias backed by multiple regions.
  • Once it's in our cache, can we give tier 0? Maybe; it's not quite the same because we don't store it in DynamoDB.
  • Future: considering a shared memcache.

Performance / Latency:
Similar, except:
  • Cold cache (when we do not have the data) will be significantly slower and dependent on source datastore latency (i.e. the sidecar latency distribution is more skewed: some requests very fast, cold ones slower than normal TCS).
  • The AN datastore is basically DynamoDB; no info yet on performance.

Effect on our BAU / KTLO load
  • The new invalidation capability increases the TCS server code base, so an increase in KTLO is expected.

On call/ alerts/ incidents

Read-through: We will get paged if their datasource service is having problems. How do we prevent this?
  • A contract is needed between the consuming teams and the source datastore team.
  • Set up their alerting as part of onboarding.
  • Education.
  • There is still some risk here; we should be conservative in adopting this widely until we prove this model can scale.

External Ingestion: We will get paged if the ingestion StreamHub event source is having problems (probably less immediate impact than a read-through cache being down). How do we prevent this?

Isolation/ effect on other parallel TCS use cases

Read-through:
  • We currently have only partial separation (still on the same WebServer / same Undertow thread pool).
  • We need to separate external read-through cases from the rest of the flow (separate WebServers).
  • WebServer separation is required for us to be comfortable putting this in production, to avoid risk to existing TCS consumers.

External Ingestion:
  • Same story; possibly slightly less risk because no 3rd-party datasource service is involved.

Scaling limitations

Potential risks associated with onboarding additional traffic

Read-through:
  • Without separation, there is a risk of requests timing out and cache misses.
  • Mitigation: dedicate x% of the traffic to AN.
  • Note: this means they don't get scaling; our nodes will not scale from their traffic alone.
  • There are two kinds of data:
      • one is suitable for bulk ingestion (tenanted / app list)
      • one is only suitable for read-through (sparse)
      • they want both through read-through; we think the former is less suitable.

Scope of work

