Last updated Sep 24, 2024

Observability and Monitoring

Metrics and Dashboard

  • TCS Sidecar Details (user experience): https://atlassian.signalfx.com/#/dashboard/FpodAP5A4AA
  • Tenant Context Service (cache usage, backing store latency, errors): https://atlassian.signalfx.com/#/dashboard/FpodAEpAwAA

In most cases, the sidecar dashboard best measures user experience. The sidecar has its own caching and makes concurrent requests to several TCSs - so problems seen for a single TCS region don’t necessarily map to customer impact. However, not all ingestion metrics are visible to sidecars, requiring review of both dashboards.

Anomaly detection

It is highly recommended to set up anomaly detection for your integration. Please see our tutorial on how to set up anomaly detection for more information.

Detecting inconsistent versions and invalidation problems

Where a backing store returns a stale version on invalidation, TCS will attempt to self-heal (in the background) within reasonable constraints. For up to 700 keys, TCS will retry twice: at t=eventReceivedTime+40s and t=eventReceivedTime+50s.
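The retry schedule above can be sketched as follows. This is illustrative only, not TCS source; the class and method names are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Sketch of the self-heal retry schedule described above (names are
// illustrative, not TCS's actual implementation). For up to 700 keys,
// two background retries are scheduled relative to the time the
// invalidation event was received.
public class SelfHealSchedule {
    static final int MAX_SELF_HEAL_KEYS = 700;
    static final List<Duration> RETRY_OFFSETS =
            List.of(Duration.ofSeconds(40), Duration.ofSeconds(50));

    static List<Instant> retryTimes(Instant eventReceivedTime, int staleKeyCount) {
        if (staleKeyCount > MAX_SELF_HEAL_KEYS) {
            return List.of(); // too many keys: no background self-heal
        }
        return RETRY_OFFSETS.stream()
                .map(eventReceivedTime::plus)
                .toList();
    }
}
```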

Invalidation event processing is asynchronous, so reported invalidation latencies are an upper bound only; actual latency is lower. This is because the check from invalidation receipt to successful validation uses polling, with a ~10 second interval.
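Why polling makes this an upper bound can be shown in a few lines. This is a toy model, not TCS code: the measured latency is the first ~10-second poll tick at or after the true latency, so it can overshoot by up to one poll interval.

```java
// Illustrative only: polling rounds the observed invalidation latency up
// to the next poll tick, so the reported value is an upper bound on the
// actual latency (overshoot of up to ~10s with a 10s interval).
public class PollingUpperBound {
    static final long POLL_INTERVAL_MS = 10_000;

    static long measuredLatencyMs(long actualLatencyMs) {
        // Number of poll ticks needed before the change is observed.
        long polls = (actualLatencyMs + POLL_INTERVAL_MS - 1) / POLL_INTERVAL_MS;
        return polls * POLL_INTERVAL_MS;
    }

    public static void main(String[] args) {
        System.out.println(measuredLatencyMs(3_000));  // 10000: rounded up a full tick
        System.out.println(measuredLatencyMs(12_500)); // 20000: observed on the second tick
    }
}
```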

Latency of successful invalidation is reported as: tenant-context-service.invalidation-check.success-time.upper_99 (also .median and other aggregations). Note: this doesn’t include Streamhub latency or sidecar invalidation - see below for those. A count of successful events is also published: tenant-context-service.invalidation-check.success.count

If the expected version has not been received after 60 seconds, TCS will cease retrying and emit a failure metric: tenant-context-service.invalidation-check.failure.count

In this case, the stale value will be cached by TCS until either it expires, or a background reload is triggered. See RecordType.java for configured expiry and refresh periods.
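The success/give-up behaviour described above can be summarised as a small state check. This is a sketch with hypothetical names, not TCS source:

```java
import java.time.Duration;
import java.time.Instant;

// Sketch (not TCS source) of the verification behaviour described above:
// polling stops 60s after the event was received, after which a failure
// metric would be emitted and the stale value remains cached until it
// expires or a background reload is triggered.
public class InvalidationCheck {
    static final Duration GIVE_UP_AFTER = Duration.ofSeconds(60);

    enum Outcome { SUCCESS, STILL_CHECKING, FAILURE }

    static Outcome check(Instant eventReceivedTime, Instant now, boolean expectedVersionSeen) {
        if (expectedVersionSeen) {
            return Outcome.SUCCESS;      // emit ...invalidation-check.success.count
        }
        if (Duration.between(eventReceivedTime, now).compareTo(GIVE_UP_AFTER) >= 0) {
            return Outcome.FAILURE;      // emit ...invalidation-check.failure.count
        }
        return Outcome.STILL_CHECKING;   // poll again (~10s interval)
    }
}
```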

The Tenant Context Service dashboard has a chart tracking verification of invalidation events, by record-type. When failures occur, they appear as histogram counts overlaid on the total event count. This can be filtered using dashboard overrides.

invalidation-events-received-no-verification-failures.png

Invalidation events received, no verification failures

verification-failures-appearing-on-cache-verification-chart.png

Verification failures appearing on the Cache Verification chart

Monitoring invalidation delay

To monitor DROID's Read-through invalidation delay, please visit the Read-through dashboard and review the following charts:

Please let our team know if you plan to configure alerts against any of these charts/metrics, as they may change without notice.

TCS Dashboard Invalidation Metrics

  • Read-through Cache Invalidation Events → latency (p50) streamhub ingestion
    • Measures period from your Streamhub write to TCS initially consuming the event
  • TCS Invalidation Delay
    • Measures the average time for an event to propagate from initial ingestion to the other TCS EC2 instances
    • Does not include any loading of the keys, if that occurs
    • Internal invalidation events accumulate updates for several record types, so they cannot be filtered to a single record type
  • Read-through Backing Store Latency
    • P50 + P99 latency observed by TCS when calling backing store API

read-through-invalidation-events-chart.png

Read-through invalidation events chart

internal-invalidation-delay-chart.png

TCS ‘internal’ invalidation delay

latency-chart.png

Latency p50 + p90, by environment and record type

TCS Sidecar Dashboard Invalidation Metrics

Invalidation Delay

  • Measures from TCS service to TCS Sidecar
  • Can be filtered by service and TCS parent region
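Since no single metric covers the whole path (the success-time metric above excludes Streamhub and sidecar latency), a rough end-to-end estimate is the sum of the three per-stage charts. A minimal sketch, with placeholder figures rather than real measurements:

```java
import java.time.Duration;

// Rough composition of end-to-end invalidation delay from the per-stage
// charts above. Figures used below are placeholders, not real measurements.
public class EndToEndDelay {
    static Duration endToEnd(Duration streamhubIngestion,
                             Duration tcsInternalDelay,
                             Duration sidecarDelay) {
        // Each stage is measured by a separate chart; no single metric
        // covers the whole path, so the stages must be added manually.
        return streamhubIngestion.plus(tcsInternalDelay).plus(sidecarDelay);
    }

    public static void main(String[] args) {
        Duration total = endToEnd(Duration.ofSeconds(2),   // Streamhub ingestion (p50)
                                  Duration.ofSeconds(1),   // TCS internal invalidation delay
                                  Duration.ofSeconds(3));  // sidecar invalidation delay
        System.out.println(total); // PT6S
    }
}
```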

tcs-sidecar-invalidation-delay.png

TCS Sidecar Invalidation delay (can be filtered by service and TCS region)
