DROID is a distributed caching store with extremely low latency and high availability. There are trade-offs baked into its design that suit certain use cases but not others. To onboard to DROID successfully, it's vital to understand both DROID's characteristics and your use case's requirements, and whether the two are really a good fit.
Our one-pager is a good starting point. It will familiarise you with DROID terminology and the principles of where and how your integration will interact with the DROID pipeline.
Now is the time to fill in your DROID questionnaire on a best-effort basis. Don't be put off by how extensive the questions are; they cover everything that might be relevant to a DROID integration. Don't spend a lot of time gathering data either. The idea is just to collect enough relevant data to inform us of:
When your questionnaire is finished, please share it with the Context - Integration squad (you can find us in the #help-tcs Slack channel).
Give us some time to digest your questionnaire. Meanwhile, schedule a catch-up (30 minutes to an hour) so we can discuss your use case in more detail and answer any questions you may have.
DROID use cases are currently assessed on a case-by-case basis. We need to review your use case to ensure it's a good fit for DROID, and we will only onboard suitable use cases. Explicit areas of assessment include, but are not limited to:
Cost friendliness
Cost awareness is currently a hot topic, and we need to ensure that your use case is cost-effective on DROID. For larger use cases, a budget must be allocated up front.
Data access and production patterns
DROID needs highly cacheable data, i.e. data that is infrequently updated but frequently read.
No plain text UGC or PII data
Please refer to our Data Classification Policy.
A comparison of two approaches to ingesting metadata into TCS that skip the provisioning path: their pros and cons, and when to recommend each approach.
| External Ingestion | Read-through |
| --- | --- |
| Data producers send their data to TCS using StreamHub events (live) and Wrench (bulk). TCS ingests the data and writes it into DynamoDB (DDB), and TCS web servers look it up from DDB. TCS sends invalidation events when DDB writes change data. | TCS acts as a cache proxy to another service (the source service). Our web servers call the source service on cache misses. We maintain a copy of the last response data we received and send it to the consumer. The source service has the option to send invalidation events to TCS via StreamHub. |
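To make the two data paths concrete, here is a minimal, hypothetical sketch in Java. The interface and method names are illustrative assumptions, not the real TCS or StreamHub APIs.

```java
// Hypothetical sketch only: names and signatures are illustrative, not the real TCS/StreamHub APIs.

/** External Ingestion: the producer pushes every change; TCS owns the stored copy in DynamoDB. */
interface ExternalIngestionProducer {
    /** Publish a create/update event to the agreed StreamHub topic whenever a record changes. */
    void publishRecordChanged(String recordKey, byte[] serializedRecord);
}

/** Read-through: TCS calls the source service only on a cache miss and re-serves the last response. */
interface ReadThroughSource {
    /** Key-addressable lookup used by TCS web servers on a cache miss. */
    byte[] fetchByKey(String recordKey);
}
```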
| External Ingestion | Read-through |
| --- | --- |
| Your service sends events to StreamHub whenever a record is created or updated. The AVI, schema and throughput of the events must be agreed upon by both your team and the Context team. | The Context team will work with you to define contracts for the integration. |
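For illustration only, an agreed change event might carry fields along these lines. The exact schema, AVI and throughput are precisely what gets settled in the contract, so treat everything below as an assumption.

```java
// Illustrative only: one possible shape for the agreed change event.
// Field names and the versioning scheme are assumptions to be agreed with the Context team.
record RecordChangedEvent(
        String recordKey,      // primary key of the record in the producer's system
        long version,          // monotonically increasing version, used to discard stale events
        String operation,      // e.g. "CREATE" or "UPDATE"
        byte[] payload,        // serialized record body
        long occurredAtMillis  // producer-side timestamp
) {}
```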
| External Ingestion | Read-through |
| --- | --- |
| If you choose the External Ingestion path, your service must provide DROID with a means to bootstrap the whole dataset. There are strict requirements regarding the availability, reliability and scale of this method so that, when necessary (e.g. during incidents), we can restore the whole dataset in a timely manner. | If you choose to onboard your service as the backing store for a Read-through cache, there are strict requirements to make sure that your integration with DROID meets our standards in terms of availability, reliability and performance. These include, but are not limited to: |
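As a rough illustration of the External Ingestion bootstrap requirement, the sketch below assumes a paginated, scan-style export; the actual mechanism (e.g. Wrench bulk files versus an API) and its availability targets are agreed per integration, and all names here are made up.

```java
import java.util.List;

// A minimal sketch, assuming a paginated scan-style bulk export. All names are illustrative.
interface DatasetBootstrapSource {

    /** One exported record: the key plus its serialized body. */
    record ExportedRecord(String recordKey, byte[] payload) {}

    /** One page of records plus an opaque cursor; a null cursor means the scan is complete. */
    record Page(List<ExportedRecord> records, String nextCursor) {}

    /**
     * Return up to {@code limit} records starting at {@code cursor} (null for the first page).
     * This must be reliable and fast enough that the whole dataset can be replayed during an incident.
     */
    Page scan(String cursor, int limit);
}
```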
|  | External Ingestion | Read-through |
| --- | --- | --- |
| Invalidation | Yes | Yes, via StreamHub |
| Batch Flushing | Yes | Not needed |
| Data Transformation | Yes | No transformation |
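For the Read-through column, the invalidation event a source service sends via StreamHub can be very small, since no payload is needed. The shape below is a hypothetical example, not a defined schema.

```java
// Hypothetical invalidation event for the Read-through path: the source service only tells TCS
// which key changed; TCS drops its cached copy and re-fetches on the next read.
record CacheInvalidationEvent(
        String recordKey,       // key whose cached value should be discarded
        long occurredAtMillis   // producer-side timestamp, useful for ignoring late or duplicate events
) {}
```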
|  | External Ingestion | Read-through |
| --- | --- | --- |
| Data shape | Transformers supported | No transformation |
| Record size | Max is determined as the min of (StreamHub event size limit, DynamoDB record size limit, SQS record size limit), ~250KB as of Jan 5, 2024 | Up to the source service and the max_item_size of our L2 caches (set to 1MB as of Jan 5, 2024) |
| Dataset “liveness” / access pattern | Suits use cases where most records may be used at some point | Suits use cases with large potential datasets where most records are never accessed |
| Frequency of change |  | Once we've built invalidation, the main relevance is that this affects the load on the data source service |
| Traffic | Slow |  |
| Producer Integration Tier | Can integrate with all service tiers (ingestion freshness dependent on the StreamHub supplier) | Must be tier 0~1 (access reliability/performance dependent on the data source) |
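The record-size row boils down to a min-of-limits calculation, shown below as a pre-publish check. The individual limits are placeholders (the DynamoDB and SQS values are the public AWS limits; the StreamHub value is assumed); only the effective ~250KB and 1MB figures come from the table above, as of Jan 5, 2024.

```java
import java.util.stream.IntStream;

// Sketch of a pre-publish size check. Individual limits are placeholders; only the effective
// ~250 KB (External Ingestion) and 1 MB (Read-through) figures come from the table above.
final class RecordSizeGuard {
    static final int STREAMHUB_EVENT_LIMIT_BYTES = 250 * 1024; // assumed
    static final int DYNAMODB_ITEM_LIMIT_BYTES   = 400 * 1024; // public DynamoDB item limit
    static final int SQS_MESSAGE_LIMIT_BYTES     = 256 * 1024; // public SQS message limit

    /** Effective External Ingestion limit: the smallest limit anywhere in the pipeline. */
    static final int EXTERNAL_INGESTION_MAX_BYTES = IntStream
            .of(STREAMHUB_EVENT_LIMIT_BYTES, DYNAMODB_ITEM_LIMIT_BYTES, SQS_MESSAGE_LIMIT_BYTES)
            .min()
            .getAsInt();

    /** Read-through records are bounded by the L2 cache max_item_size (1 MB as of Jan 5, 2024). */
    static final int READ_THROUGH_MAX_BYTES = 1024 * 1024;

    static void checkFitsExternalIngestion(byte[] serializedRecord) {
        if (serializedRecord.length > EXTERNAL_INGESTION_MAX_BYTES) {
            throw new IllegalArgumentException("Record of " + serializedRecord.length
                    + " bytes exceeds the External Ingestion limit of " + EXTERNAL_INGESTION_MAX_BYTES + " bytes");
        }
    }
}
```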
|  | External Ingestion | Read-through |
| --- | --- | --- |
| Pros |  |  |
| Cons |  |  |
| External Ingestion | Read-through |
| --- | --- |
|  | If the source service is sharded in a traditional sense, TCS may not be able to reach the right shard to fetch the necessary data for your keys. |
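One way a sharded source service can still back a Read-through cache is to hide its shard routing behind a single key-addressable endpoint, so TCS never needs to know which shard holds a key. A hedged sketch, with entirely illustrative names:

```java
// Sketch only: TCS knows nothing but the record key, so the shard must be derivable from the key
// alone, inside the source service. Names and the hash-based routing are illustrative assumptions.
interface ShardAwareSource {

    /** Lookup against one shard; internal to the source service. */
    byte[] fetchFromShard(int shardId, String recordKey);

    /** The single key-addressable entry point TCS would call. */
    default byte[] fetchByKey(String recordKey, int shardCount) {
        int shardId = Math.floorMod(recordKey.hashCode(), shardCount); // deterministic key -> shard mapping
        return fetchFromShard(shardId, recordKey);
    }
}
```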
|  | External Ingestion | Read-through |
| --- | --- | --- |
| Effect on our SLOs / performance |  | Availability: a cold cache (when we do not have the data) can only provide the same tier as the data source service. To go multi-region, the owner would need to provide a single Micros alias backed by multiple regions. Once the data is in our cache, can we give Tier 0? Maybe; it's not quite the same because we don't store it in Dynamo. Future: considering shared memcache. Performance / latency: similar, except that cold-cache requests (when we do not have the data) will be significantly slower and dependent on source datastore latency (i.e. the sidecar latency distribution is more skewed: some requests very fast, cold ones slower than normal TCS). The AN datastore is basically Dynamo; no info yet on its performance. |
| Effect on our BAU / KTLO load |  | The new invalidation capability increases the TCS server code base, so an increase in KTLO is expected. |
| External Ingestion | Read-through |
| --- | --- |
| The DROID team will get paged if the ingestion StreamHub event source is having problems. (Probably less immediate impact than a Read-through cache being down.) | The DROID team will get paged if the data source service is having problems. To prevent this we need: |
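The overview earlier notes that TCS keeps a copy of the last response it received from the source service. Purely as an illustration of how that helps keep source-service incidents from paging the DROID team, here is a minimal timeout-plus-stale-fallback sketch; it is an assumption-laden toy, not the real TCS implementation.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.concurrent.*;

// Toy sketch of "serve the last copy we received" when the source service is slow or down.
final class ReadThroughWithStaleFallback {

    interface Source { byte[] fetchByKey(String key) throws Exception; }

    private final ConcurrentMap<String, byte[]> lastKnownGood = new ConcurrentHashMap<>();
    private final ExecutorService pool = Executors.newCachedThreadPool();

    Optional<byte[]> get(String key, Source source, Duration timeout) {
        Future<byte[]> call = pool.submit(() -> source.fetchByKey(key));
        try {
            byte[] fresh = call.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
            lastKnownGood.put(key, fresh);   // remember the latest successful response
            return Optional.of(fresh);
        } catch (Exception e) {
            call.cancel(true);               // timed out or failed: fall back to the stored copy, if any
            return Optional.ofNullable(lastKnownGood.get(key));
        }
    }
}
```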
| External Ingestion | Read-through |
| --- | --- |
| Same story; possibly slightly less risk because no third-party data source service is involved. | We currently have only partial separation (still on the same web server / same Undertow thread pool). We need to separate external Read-through cases from the rest of the flow (separate web servers). Web server separation is required for us to be comfortable putting this in production, to avoid risk to existing TCS consumers. |
Potential risks associated with onboarding additional traffic
| External Ingestion | Read-through |
| --- | --- |
| Without separation, there is a risk of requests timing out and cache misses. Mitigation: dedicate x% of the traffic to AN. Note: this means they don't get scaling; our nodes will not scale from their traffic alone. Two kinds of data: |  |