Last updated Sep 24, 2024

Integration types

Before you start

DROID is a distributed caching store with extremely low latency and high availability. There are trade-offs baked into its design that suit certain use cases but not others. To be onboarded to DROID successfully, it’s vital to understand both DROID’s characteristics and your use case’s requirements, and whether the two are really a good fit.

Do your homework first - DROID 101

Prepare a questionnaire and share it with Context - Integration squad

Now is the time to fill in your DROID questionnaire on a best-effort basis. Don’t be put off by the number of questions; they cover everything that might be relevant to a DROID integration. Equally, don’t spend a huge amount of time gathering data. The idea is just to collect enough relevant data to tell us:

  • Whether DROID is truly suitable for your use case
  • Whether you’re better off with the External Ingestion or the Read-through flavor of DROID
  • Whether your scale warrants a dedicated shard in DROID

When your questionnaire is finished, please share it with Context - Integration squad (you can find us in the #help-tcs Slack channel).

Schedule a catchup with the DROID team and have your use case reviewed

Give us some time to digest your questionnaire. In the meantime, schedule a catchup (30 minutes to an hour) so we can discuss your use case in more detail and answer any questions you may have.

Integration Assessment Criteria

DROID use cases are currently assessed on a case-by-case basis. We need to review your use case to ensure it’s a good fit for DROID, and we will only onboard suitable use cases. Explicit areas of assessment include, but are not limited to:

Cost friendliness
Cost awareness is currently a hot topic, and we need to ensure that your use case is cost-effective on DROID. For larger use cases, a budget must be allocated up front.

Data access and production patterns
DROID needs highly cacheable data, i.e. data that is updated infrequently but read frequently.

No plain-text UGC or PII data
Please refer to our Data Classification Policy.

Integration Type Comparison

A comparison of the two approaches to ingesting metadata into TCS while skipping the provisioning path: their pros and cons, and when to recommend each.

What is it?

External Ingestion
Data producers send their data to TCS using StreamHub events (live) and Wrench (bulk). TCS ingests the data and writes it into DynamoDB (DDB), and TCS web servers serve lookups from DDB. TCS sends invalidation events when DDB writes change data.

Read-through
TCS acts as a cache proxy in front of another service (the source service). Our web servers call the source service on cache misses. We maintain a copy of the last response data we received and serve it to the consumer. The source service has the option to send invalidation events to TCS via StreamHub.
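
To make the Read-through flow concrete, here is a minimal sketch in Java. The names (ReadThroughCache, SourceServiceClient) are illustrative assumptions, not real TCS APIs.

```java
// Minimal sketch of the Read-through flow. ReadThroughCache and
// SourceServiceClient are illustrative names, not real TCS APIs.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

interface SourceServiceClient {
    String fetch(String key); // call the source service for the current record
}

class ReadThroughCache {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final SourceServiceClient source;

    ReadThroughCache(SourceServiceClient source) {
        this.source = source;
    }

    // Serve the cached copy; on a miss, fetch from the source service and
    // keep the last response we received.
    String get(String key) {
        return cache.computeIfAbsent(key, source::fetch);
    }

    // Invoked when the source service sends an invalidation event via StreamHub.
    void invalidate(String key) {
        cache.remove(key);
    }
}
```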

How does it work?

External Ingestion
Your service sends events to StreamHub whenever a record is created or updated. The AVI, schema and throughput of the events must be agreed upon by both your team and the Context team.

Read-through
The Context team will work with you to define contracts covering:
  1. how TCS fetches data from your service on a cache miss, and
  2. (optional) how your service informs TCS about changes in source data (see the sketch after this list).
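
As a rough illustration of these two contracts, the sketch below shows one possible shape for the cache-miss fetch and the change notification. The interface, record types and field names are assumptions for illustration; the real contract is defined with the Context team during onboarding.

```java
// Illustrative sketch only: the endpoint shape, event shape and field names
// are assumptions, not the real TCS/StreamHub contract.
interface ReadThroughContracts {

    // Contract 1: cache-miss fetch. TCS calls the source service (e.g. over
    // REST) and expects the current version of the record back.
    SourceRecord fetchRecord(String key);

    // Contract 2 (optional): change notification. The source service publishes
    // an event via StreamHub, which TCS turns into a cache invalidation.
    void onSourceChange(ChangeEvent event);

    record SourceRecord(String key, byte[] payload, long version) {}
    record ChangeEvent(String key, long version, long changedAtEpochMillis) {}
}
```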

Requirements

External Ingestion
If you choose the External Ingestion path, your service must provide DROID with a means to bootstrap the whole dataset. There are strict requirements on the availability, reliability and scale of that mechanism so that, when necessary (e.g. during incidents), we can restore the whole dataset in a timely manner. (One possible shape for such a bootstrap API is sketched after the Read-through requirements below.)

Read-through
If you choose to onboard your service as the backing store for a Read-through cache, there are strict requirements to make sure that your integration with DROID meets our standards in terms of availability, reliability and performance. These include, but are not limited to:
  1. Service tiers: your service must be tier 1 or tier 0.
  2. Your service must be present in at least 2 regions.
  3. Your service must be able to provide DROID with source records via a REST endpoint according to our contract. There must be published and reasonable SLOs/SLIs for the availability, reliability and latency of that endpoint.
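
One way the External Ingestion bootstrap requirement could be satisfied is a paginated bulk-export API, sketched below. The names and cursor-based pagination style are assumptions, not an actual DROID contract; the real mechanism is agreed with the Context team.

```java
// Hypothetical bulk-export API for bootstrapping the whole dataset.
// Names and pagination style are assumptions, not a real DROID contract.
import java.util.List;

interface BulkExport {

    // Return one page of the full dataset; pass a null cursor for the first
    // page. This endpoint must be available and scalable enough that DROID
    // can restore the entire dataset in a timely manner during incidents.
    Page export(String cursor, int pageSize);

    record Page(List<ExportRecord> records, String nextCursor /* null when done */) {}
    record ExportRecord(String key, byte[] payload) {}
}
```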

Features

Feature               External Ingestion   Read-through
Invalidation          Yes                  Yes, via StreamHub
Batch flushing        Yes                  Not needed
Data transformation   Yes                  No transformation
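
As a hint of what “Data transformation: Yes” means on the External Ingestion side, a transformer runs during ingestion, before the record is written to DynamoDB. The interface below is a hypothetical sketch, not a real TCS API.

```java
// Hypothetical transformer hook applied during External Ingestion, before the
// record is written to DynamoDB. Illustrative only, not a real TCS API.
@FunctionalInterface
interface IngestionTransformer {
    byte[] transform(String key, byte[] sourcePayload);
}
```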

Best suited for

Data shape
  External Ingestion: transformers supported.
  Read-through: no transformation.

Record size
  External Ingestion: the maximum is the minimum of the StreamHub event size limit, the DynamoDB record size limit and the SQS record size limit (roughly 250KB as of Jan 5, 2024).
  Read-through: up to the source service, capped by the max_item_size of our L2 caches (set to 1MB as of Jan 5, 2024).

Dataset “liveness” / access pattern
  External Ingestion: suits use cases where most records may be used at some point.
  Read-through: suits use cases with large potential datasets where most records are never accessed.

Frequency of change
  Once invalidation is built, the main relevance is that frequent changes increase the load on the data source service.

Traffic
  Slow.

Producer integration tier
  External Ingestion: can integrate with all service tiers (ingestion freshness depends on the StreamHub supplier).
  Read-through: must be tier 0~1 (access reliability/performance depends on the data source).

Pros and Cons

External Ingestion
Pros:
  • Peace of mind: just emit a StreamHub event whenever your data changes, and DROID takes care of transformation, ingestion, distribution and invalidation.
Cons:
  • More effort to integrate with DROID, especially the bulk ingestion path.
  • Probably not a huge deal, but ingesting via StreamHub adds 2-3 seconds (p99) on top of the usual TCS ingestion delay.

Read-through
Pros:
  • Least effort to integrate if you already have a source service.
Cons:
  • Needs a highly available data source service; consumers expect a tier 0 service from TCS.
  • In the event of an incident in the backing service, the TCS team will be paged for an incident outside of our control.

Existing use cases

External Ingestion / Read-through
  • Identity license checks
  • DLP - Data classification tags
  • Config Registry
  • Access Narrowing

Sharding

Read-through
If the source service is sharded in the traditional sense, TCS may not be able to reach the right shard to fetch the necessary data for your keys.

Availability requirements on other team’s data-supplying service

External Ingestion
  • The Context team and the team providing data must work out a model of shared ownership in terms of maintenance, alerts, incident management, etc. for the integration to succeed.

Read-through
  • Explicit communication of expectations: our service can be no better than what the data source provides. If that is not acceptable, the data source needs to be uplifted to tier 0.
  • The Context team and the team providing data must work out a model of shared ownership in terms of maintenance, alerts, incident management, etc. for the integration to succeed.

Impact on DROID

Read-through

Effect on our SLOs / performance

Availability:
  • With a cold cache (when we do not have the data), we can only provide the same tier as the data source service.
  • To support multi-region, the source service would need to provide a single Micros alias backed by multiple regions.
  • Once the data is in our cache, can we give tier 0? Maybe; it’s not quite the same, because we don’t store it in Dynamo.
  • Future: we are considering a shared memcache.

Performance / latency:
Similar to External Ingestion, except:
  • A cold cache (when we do not have the data) will be significantly slower and dependent on source datastore latency (i.e. the sidecar latency distribution is more skewed: some requests very fast, cold ones slower than normal TCS).
  • The AN datastore is basically Dynamo; no info yet on performance.

Effect on our BAU / KTLO load
The new invalidation capability increases the TCS server code base, so an increase in KTLO is expected.

On call/ alerts/ incidents

External Ingestion
The DROID team will get paged if the ingestion StreamHub event source is having problems. (Probably less immediate impact than a Read-through cache being down.)

Read-through
The DROID team will get paged if the data source service is having problems. To prevent this we need:
  • A robust contract between consuming teams and the source data team.
  • Producers to set up their alerting as part of onboarding.
  • Both consumers and producers to be educated on DROID’s capabilities.
  • To be conservative in adopting this widely until it’s proven to scale.

Isolation/ effect on other parallel TCS use cases

External Ingestion
Same story as Read-through below; possibly slightly less risk, since no third-party data source service is involved.

Read-through
We currently have only partial separation (still on the same web servers / same undertow thread pool). We need to separate external Read-through cases from the rest of the flow (separate web servers). Web server separation is required for us to be comfortable putting this in production, to avoid risk to existing TCS consumers.

Scaling limitations

Potential risks associated with onboarding additional traffic

Without separation, there is a risk of requests timing out and cache misses.

Mitigation: dedicate x% of the traffic to AN.

Note: this means they don’t get scaling; our nodes will not scale from their traffic alone.

Two kinds of data:
  • one is suitable for bulk ingestion (tenanted / app list)
  • one is only suitable for read-through (sparse)
  • they want both through read-through; we think the former is less suitable

