Dashboard | Link |
---|---|
TCS Sidecar Details (user experience) | https://atlassian.signalfx.com/#/dashboard/FpodAP5A4AA |
Tenant Context Service (cache usage, backing store latency, errors) | https://atlassian.signalfx.com/#/dashboard/FpodAEpAwAA |
In most cases, the sidecar dashboard best measures user experience. The sidecar has its own caching and makes concurrent requests to several TCSs, so problems seen for a single TCS region don’t necessarily map to customer impact. However, not all ingestion metrics are visible to sidecars, so review both dashboards.
Where a backing store returns a stale version on invalidation, TCS will attempt to self-heal in the background, within reasonable constraints.
For up to 700 keys, TCS will retry twice, at t = eventReceivedTime + 40s and t = eventReceivedTime + 50s.
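As a rough illustration of the retry timing described above, the sketch below computes when the two re-checks would fire for a given event. The function and constant names are hypothetical and are not taken from the TCS codebase; only the 40s and 50s offsets come from this page.

```python
from datetime import datetime, timedelta, timezone

# Offsets (in seconds after the event is received) as stated on this page.
# The constant name is illustrative only, not a TCS identifier.
RETRY_OFFSETS_SECONDS = (40, 50)


def retry_times(event_received_time):
    """Return the times at which a stale key would be re-checked."""
    return [event_received_time + timedelta(seconds=s) for s in RETRY_OFFSETS_SECONDS]


if __name__ == "__main__":
    received = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
    for t in retry_times(received):
        print(t.isoformat())
```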
Invalidation event processing is asynchronous, so reported invalidation latencies are an upper bound only; actual latency is lower. This is because the time from invalidation receipt to successful validation is measured by polling, with a ~10 second interval.
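To see why a polled measurement is an upper bound, consider a simplified model in which checks run at fixed multiples of the polling interval after the event is received. This is an assumption made for illustration; the actual TCS polling schedule may differ, and `reported_latency` is a hypothetical helper, not a TCS function.

```python
import math


def reported_latency(actual_latency_seconds, poll_interval_seconds=10.0):
    """Latency as observed by a poller that checks every poll_interval_seconds.

    Assumes (for illustration only) that polls happen at exact multiples of
    the interval, so success is first observed at the next poll tick.
    """
    ticks = math.ceil(actual_latency_seconds / poll_interval_seconds)
    return ticks * poll_interval_seconds


if __name__ == "__main__":
    for actual in (3.2, 10.0, 12.5):
        print(actual, "->", reported_latency(actual))
```

Under this model the reported value never falls below the actual latency and overstates it by less than one polling interval.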
The latency of a successful invalidation is published as: tenant-context-service.invalidation-check.success-time.upper_99 (.median and other aggregations are also available). Note: this doesn’t include Streamhub latency or sidecar invalidation; see below for those. A count of successful events is also published: tenant-context-service.invalidation-check.success.count
If the expected version has not been received after 60 seconds, TCS will cease retrying and emit a failure metric: tenant-context-service.invalidation-check.failure.count
In this case, the stale value will be cached by TCS until either it expires, or a background reload is triggered. See RecordType.java for configured expiry and refresh periods.
The Tenant Context Service dashboard has a chart tracking verification of invalidation events, by record type. When failures occur, they appear as histogram counts overlaid on the total number of events. This can be filtered using dashboard overrides.
TCS doesn’t currently provide an end-to-end metric for invalidation delay. To validate health of read-through invalidation, review the metrics below.
To estimate ‘total’ invalidation delay, sum the metrics, e.g. Streamhub delay + TCS invalidation + max(backing store latency, sidecar invalidation delay). Note that a slow or stale backing store response may significantly increase the delay, if retries are required.
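The estimate above can be written as a small helper. This is a minimal sketch: the function and parameter names are hypothetical, all inputs are assumed to be in the same unit (seconds here), and the values would be read manually from the dashboards rather than from any API.

```python
def estimate_total_invalidation_delay(
    streamhub_delay,
    tcs_invalidation,
    backing_store_latency,
    sidecar_invalidation_delay,
):
    """Sum the delays, taking the larger of backing store latency and sidecar delay.

    Does not account for extra delay added by retries after a stale
    backing store response.
    """
    return (
        streamhub_delay
        + tcs_invalidation
        + max(backing_store_latency, sidecar_invalidation_delay)
    )


if __name__ == "__main__":
    # Example figures are made up for illustration, not measured values.
    print(estimate_total_invalidation_delay(1.0, 10.0, 0.5, 2.0))
```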
Please let our team know if you’ll configure alerts against any of these charts/metrics, as they could change without notice.
Invalidation Delay