DROID is a distributed caching store with extremely low latency and high availability. There are trade-offs baked into its design that suit certain use cases but not others. To be onboarded to DROID successfully, it’s vital to understand both DROID’s characteristics and your use case’s requirements, and whether the two are really a good fit.
Our one-pager is a good starting point. It will familiarize you with DROID terminology and the principles of where and how your integration will interact with the DROID pipeline.
Now is the time to fill in your DROID questionnaire on a best-effort basis. Don’t be put off by how extensive the questions are; they cover everything that might be relevant to a DROID integration. Don’t spend too much time gathering data either. The idea is just to capture enough relevant data to inform us of:
When your questionnaire is finished, please share it with the Context - Integration squad (you can find us in the #help-tcs Slack channel).
Give us some time to digest your questionnaire. Meanwhile, schedule a catchup (between 30 minutes and an hour) so we can discuss your use case in more detail and answer any questions you may have.
A comparison of two approaches for ingesting metadata into TCS while skipping the provisioning path: their pros and cons, and when to recommend each approach.
Read-through | External Ingestion |
---|---|
TCS is used as a cache proxy in front of another service (the source service). Our WebServers call the source service on cache misses; we maintain a copy of the last response data we received and send it to the consumer. The source service has the option to send invalidation events to TCS via StreamHub. | Data providers send their data to TCS using StreamHub events (live) and Wrench (bulk). TCS ingests the data and writes it into DDB, and TCS WebServers serve lookups from DDB. TCS sends invalidation events when DDB writes change data. |
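To make the read-through flow more concrete, here is a minimal sketch in Java of the lookup-and-invalidate behaviour described above. All names (SourceServiceClient, ReadThroughCache, etc.) are hypothetical; the real TCS WebServer code and client APIs differ.

```java
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Assumption: a client for the source service that returns the record for a key, if any.
interface SourceServiceClient {
    Optional<String> fetch(String key);
}

class ReadThroughCache {
    private final ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();
    private final SourceServiceClient source;

    ReadThroughCache(SourceServiceClient source) {
        this.source = source;
    }

    /** Serve from the cache; on a miss, call the source service and keep the last response. */
    Optional<String> lookup(String key) {
        String cached = cache.get(key);
        if (cached != null) {
            return Optional.of(cached);
        }
        Optional<String> fromSource = source.fetch(key);
        fromSource.ifPresent(value -> cache.put(key, value));
        return fromSource;
    }

    /** Applied when the source service sends an invalidation event via StreamHub. */
    void invalidate(String key) {
        cache.remove(key);
    }
}
```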
Read-through | External Ingestion |
---|---|
The Context team will work with you to define some contracts regarding your integration. | Your service sends events to StreamHub whenever a record is created or updated. The AVI, schema, and throughput of the events must be agreed upon by both your team and the Context team. |
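As a rough illustration of the kind of event contract that ends up being agreed, the sketch below models a record-changed event as a plain Java record. The field names are assumptions for illustration only; the actual AVI, schema, and throughput limits are agreed with the Context team.

```java
// Hypothetical shape of a record-changed event sent to StreamHub.
// The real AVI and schema are agreed between your team and the Context team.
import java.time.Instant;

record RecordChangedEvent(
        String key,          // identifier of the record that changed
        String operation,    // e.g. "CREATED" or "UPDATED"
        String payload,      // serialized record data (schema agreed up front)
        Instant occurredAt   // when the change happened, for ordering/freshness
) {}
```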
Read-through | External Ingestion |
---|---|
If you choose to onboard your service as the backing store for a read-through cache, there are strict requirements to make sure that your integration with DROID meets our standards in terms of availability, reliability, and performance. These include, but are not limited to: | If you choose the external ingestion path, your service must provide DROID with a means to bootstrap the whole dataset. There are strict requirements regarding the availability, reliability, and scale of that method, so that when necessary (e.g. during incidents) we can restore the whole dataset in a timely manner. |
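To illustrate what a bootstrap capability could look like, the sketch below pages through the entire dataset and hands it off in fixed-size batches to a bulk sink. DatasetReader, BulkSink, and the batch size are hypothetical; the actual bulk mechanism (e.g. Wrench) and its contract are agreed during onboarding.

```java
// Hypothetical full-dataset bootstrap: page through every record and
// push fixed-size batches to a bulk ingestion sink.
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

record DatasetRecord(String key, String payload) {}   // illustrative record shape

interface DatasetReader {        // assumption: provider-side iterator over the whole dataset
    Iterator<DatasetRecord> readAll();
}

interface BulkSink {             // assumption: stand-in for the agreed bulk ingestion mechanism
    void writeBatch(List<DatasetRecord> batch);
}

class DatasetBootstrapper {
    private static final int BATCH_SIZE = 500;   // illustrative only

    void bootstrap(DatasetReader reader, BulkSink sink) {
        List<DatasetRecord> batch = new ArrayList<>(BATCH_SIZE);
        Iterator<DatasetRecord> all = reader.readAll();
        while (all.hasNext()) {
            batch.add(all.next());
            if (batch.size() == BATCH_SIZE) {
                sink.writeBatch(List.copyOf(batch));
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            sink.writeBatch(List.copyOf(batch));  // flush the final partial batch
        }
    }
}
```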
 | Read-through | External Ingestion |
---|---|---|
Invalidation | Yes, via StreamHub | Yes |
Batch Flushing | Not needed | Yes |
Data Transformation | No transformation | Yes |
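To make the “Data Transformation: Yes” row concrete for the external ingestion path, the sketch below shows what a transformer hook could look like: a pure function from the provider’s record shape to the stored shape. The interface and record names are assumptions; TCS’s actual transformer mechanism may differ.

```java
// Hypothetical transformer applied during external ingestion:
// maps the provider's raw record into the shape stored in DDB.
@FunctionalInterface
interface RecordTransformer<I, O> {
    O transform(I providerRecord);
}

// Illustrative shapes: keep only the fields TCS needs and normalize the key casing.
record ProviderRecord(String id, String displayName, String internalNotes) {}
record StoredRecord(String key, String displayName) {}

class ExampleTransformer implements RecordTransformer<ProviderRecord, StoredRecord> {
    @Override
    public StoredRecord transform(ProviderRecord providerRecord) {
        return new StoredRecord(providerRecord.id().toLowerCase(), providerRecord.displayName());
    }
}
```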
 | Read-through | External Ingestion |
---|---|---|
Data shape | No transformation | Transformers supported |
Record size | Up to the source service and the max_item_size of our L2 caches (set to 1MB as of Jan 5, 2024) | Max is the min of (StreamHub event size limit, DynamoDB record size limit, SQS record size limit), ~250KB as of Jan 5, 2024 |
Dataset “liveness” / access pattern | Suits use cases with large potential datasets where most records are never accessed | Suits use cases where most records may be used at some point |
Frequency of change | Once we’ve built invalidation, the main relevance is that this affects the load on the source service | |
Traffic | Slow | |
Producer Integration Tier | Must be tier 0~1 (access reliability/performance is dependent on the source datastore) | Can integrate with all service tiers (ingestion freshness is dependent on the StreamHub supplier) |
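Because the external ingestion limit is the minimum of several underlying limits, it can be worth validating record size on the producer side before publishing. The constants below are taken from the figures quoted in the table above (as of Jan 5, 2024) and should be re-checked against the current limits; the class itself is only an illustrative sketch.

```java
// Illustrative pre-publish size check for both integration paths.
// The exact limits are assumptions taken from the comparison table above
// and should be confirmed with the Context team before relying on them.
import java.nio.charset.StandardCharsets;

final class RecordSizeCheck {
    // Approximate limits as of Jan 5, 2024.
    private static final int MAX_EXTERNAL_INGESTION_BYTES = 250 * 1024;  // min of StreamHub/DynamoDB/SQS limits
    private static final int MAX_READ_THROUGH_BYTES = 1024 * 1024;       // L2 cache max_item_size

    static boolean fitsExternalIngestion(String serializedRecord) {
        return serializedRecord.getBytes(StandardCharsets.UTF_8).length <= MAX_EXTERNAL_INGESTION_BYTES;
    }

    static boolean fitsReadThrough(String serializedRecord) {
        return serializedRecord.getBytes(StandardCharsets.UTF_8).length <= MAX_READ_THROUGH_BYTES;
    }
}
```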
 | Read-through | External Ingestion |
---|---|---|
Pros |  |  |
Cons |  |  |
Read-through | External Ingestion |
---|---|
If the source service is sharded in a traditional sense, TCS may not be able to reach the right shard to get the necessary data for your keys. | |
 | Read-through | External Ingestion |
---|---|---|
Effect on our SLOs / performance | Availability: when the cache is cold (we do not have the data), we can only provide the same tier as the source service. To do multi-region, the provider would need to supply a single Micros alias backed by multiple regions. Once the data is in our cache, can we give Tier 0? Maybe; it’s not quite the same because we don’t store it in Dynamo. Future: considering shared memcache. Performance/latency: similar, except that cold-cache requests (when we do not have the data) will be significantly slower and dependent on source datastore latency (i.e. the sidecar latency distribution is more skewed: some requests very fast, cold ones slower than normal TCS). The AN datastore is basically Dynamo; no info yet on performance. | |
Effect on our BAU / KTLO load | The new invalidation capability increases the TCS server code base, so yes, an increase in KTLO is expected. | |
Read-through | External Ingestion |
---|---|
We will get paged if the source service is having problems. How do we prevent this? | We will get paged if the ingestion StreamHub event source is having problems (probably less immediate impact than the read-through cache being down). How do we prevent this? |
Read-through | External Ingestion |
---|---|
Currently we have only partial separation (still on the same WebServer / same Undertow thread pool). We need to separate external read-through cases from the rest of the flow (separate WebServers). WebServer separation is required for us to be comfortable putting this in production, to avoid risk to existing TCS consumers. | Same story; possibly slightly less risk because no 3rd-party datasource service is involved. |
Potential risks associated with onboarding additional traffic
Read-through | External Ingestion |
---|---|
Without separation, there is a risk of requests timing out and cache misses. Mitigation: dedicate x% of the traffic to AN. Note: this means they don’t get scaling; our nodes will not scale from their traffic alone. Two kinds of data: | |
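The “dedicate x% of the traffic” mitigation above is essentially a capacity partition. One minimal way to sketch that idea is a semaphore bulkhead that caps the concurrent requests a single consumer can occupy; this is an illustrative assumption, not how TCS WebServers actually implement traffic separation.

```java
// Illustrative bulkhead: cap the share of concurrent capacity a single
// consumer (e.g. newly onboarded traffic) can occupy, so it cannot
// starve existing TCS consumers. Names and numbers are assumptions.
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

class ConsumerBulkhead {
    private final Semaphore permits;

    ConsumerBulkhead(int totalCapacity, double dedicatedFraction) {
        // e.g. totalCapacity = worker threads, dedicatedFraction = the x% reserved for this consumer
        this.permits = new Semaphore((int) Math.max(1, totalCapacity * dedicatedFraction));
    }

    /** Run the request if a permit is available; otherwise shed the load fast. */
    <T> T execute(Supplier<T> request, Supplier<T> onRejected) {
        if (!permits.tryAcquire()) {
            return onRejected.get();   // reject instead of queueing behind other consumers
        }
        try {
            return request.get();
        } finally {
            permits.release();
        }
    }
}
```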