Last updated Sep 24, 2024

Observability and Monitoring

Metrics and Dashboard

  • TCS Sidecar Details (user experience): https://atlassian.signalfx.com/#/dashboard/FpodAP5A4AA
  • Tenant Context Service (cache usage, backing store latency, errors): https://atlassian.signalfx.com/#/dashboard/FpodAEpAwAA

In most cases, the sidecar dashboard best measures user experience. The sidecar has its own caching and makes concurrent requests to several TCSs - so problems seen for a single TCS region don’t necessarily map to customer impact. However, not all ingestion metrics are visible to sidecars, requiring review of both dashboards.

Anomaly detection

It is highly recommended to set up anomaly detection for your integration. Please see our tutorial on how to set up anomaly detection for more information.

Detecting inconsistent versions and invalidation problems

Where a backing store returns a stale version on invalidation, TCS will attempt to self-heal (in the background) within reasonable constraints. For up to 700 keys, TCS will retry twice: at t=eventReceivedTime+40s and t=eventReceivedTime+50s.
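The retry schedule above can be sketched as follows. This is illustrative only, not TCS source; the class and method names are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Sketch of the self-heal retry schedule described above (names are
// illustrative, not TCS's actual implementation). For up to 700 keys,
// two background retries are scheduled relative to the time the
// invalidation event was received.
public class SelfHealSchedule {
    static final int MAX_SELF_HEAL_KEYS = 700;
    static final List<Duration> RETRY_OFFSETS =
            List.of(Duration.ofSeconds(40), Duration.ofSeconds(50));

    static List<Instant> retryTimes(Instant eventReceivedTime, int staleKeyCount) {
        if (staleKeyCount > MAX_SELF_HEAL_KEYS) {
            return List.of(); // too many keys: no background self-heal
        }
        return RETRY_OFFSETS.stream()
                .map(eventReceivedTime::plus)
                .toList();
    }
}
```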

Invalidation event processing is asynchronous, so reported invalidation latencies are an upper bound only; actual latency is lower. This is because the check from invalidation receipt to successful validation uses polling, with a ~10 second interval.
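Why polling makes this an upper bound can be shown in a few lines. This is a toy model, not TCS code: the measured latency is the first ~10-second poll tick at or after the true latency, so it can overshoot by up to one poll interval.

```java
// Illustrative only: polling rounds the observed invalidation latency up
// to the next poll tick, so the reported value is an upper bound on the
// actual latency (overshoot of up to ~10s with a 10s interval).
public class PollingUpperBound {
    static final long POLL_INTERVAL_MS = 10_000;

    static long measuredLatencyMs(long actualLatencyMs) {
        // Number of poll ticks needed before the change is observed.
        long polls = (actualLatencyMs + POLL_INTERVAL_MS - 1) / POLL_INTERVAL_MS;
        return polls * POLL_INTERVAL_MS;
    }

    public static void main(String[] args) {
        System.out.println(measuredLatencyMs(3_000));  // 10000: rounded up a full tick
        System.out.println(measuredLatencyMs(12_500)); // 20000: observed on the second tick
    }
}
```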

Latency of successful invalidation is reported as: tenant-context-service.invalidation-check.success-time.upper_99 (also .median and other aggregations). Note: this doesn’t include Streamhub latency or sidecar invalidation - see below for those. A count of successful events is also published: tenant-context-service.invalidation-check.success.count

If the expected version has not been received after 60 seconds, TCS will cease retrying and emit a failure metric: tenant-context-service.invalidation-check.failure.count

In this case, the stale value will be cached by TCS until either it expires, or a background reload is triggered. See RecordType.java for configured expiry and refresh periods.
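The success/give-up behaviour described above can be summarised as a small state check. This is a sketch with hypothetical names, not TCS source:

```java
import java.time.Duration;
import java.time.Instant;

// Sketch (not TCS source) of the verification behaviour described above:
// polling stops 60s after the event was received, after which a failure
// metric would be emitted and the stale value remains cached until it
// expires or a background reload is triggered.
public class InvalidationCheck {
    static final Duration GIVE_UP_AFTER = Duration.ofSeconds(60);

    enum Outcome { SUCCESS, STILL_CHECKING, FAILURE }

    static Outcome check(Instant eventReceivedTime, Instant now, boolean expectedVersionSeen) {
        if (expectedVersionSeen) {
            return Outcome.SUCCESS;      // emit ...invalidation-check.success.count
        }
        if (Duration.between(eventReceivedTime, now).compareTo(GIVE_UP_AFTER) >= 0) {
            return Outcome.FAILURE;      // emit ...invalidation-check.failure.count
        }
        return Outcome.STILL_CHECKING;   // poll again (~10s interval)
    }
}
```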

The Tenant Context Service dashboard has a chart tracking verification of invalidation events, by record-type. When failures occur, they appear as histogram counts overlaid on the total event count. This can be filtered using dashboard overrides.

invalidation-events-received-no-verification-failures.png

Invalidation events received, no verification failures

verification-failures-appearing-on-cache-verification-chart.png

Verification failures appearing on the Cache Verification chart

Monitoring invalidation delay

To monitor DROID's Read-through invalidation delay, please visit the Read-through dashboard and review the following charts:

Please let our team know if you plan to configure alerts against any of these charts/metrics, as they may change without notice.

TCS Dashboard Invalidation Metrics

  • Read-through Cache Invalidation Events → latency (p50) streamhub ingestion
    • Measures period from your Streamhub write to TCS initially consuming the event
  • TCS Invalidation Delay
    • Measures the average time for an event to propagate from initial ingestion to the other TCS EC2 instances
    • Does not include any loading of the keys, if that occurs
    • Internal invalidation events accumulate updates for several record types, so they cannot be filtered to a single record type
  • Read-through Backing Store Latency
    • P50 + P99 latency observed by TCS when calling backing store API

read-through-invalidation-events-chart.png

Read-through invalidation events chart

internal-invalidation-delay-chart.png

TCS ‘internal’ invalidation delay

latency-chart.png

Latency p50 + p90, by environment and record type

TCS Sidecar Dashboard Invalidation Metrics

Invalidation Delay

  • Measures from TCS service to TCS Sidecar
  • Can be filtered by service and TCS parent region
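Since no single metric covers the whole path (the success-time metric above excludes Streamhub and sidecar latency), a rough end-to-end estimate is the sum of the three per-stage charts. A minimal sketch, with placeholder figures rather than real measurements:

```java
import java.time.Duration;

// Rough composition of end-to-end invalidation delay from the per-stage
// charts above. Figures used below are placeholders, not real measurements.
public class EndToEndDelay {
    static Duration endToEnd(Duration streamhubIngestion,
                             Duration tcsInternalDelay,
                             Duration sidecarDelay) {
        // Each stage is measured by a separate chart; no single metric
        // covers the whole path, so the stages must be added manually.
        return streamhubIngestion.plus(tcsInternalDelay).plus(sidecarDelay);
    }

    public static void main(String[] args) {
        Duration total = endToEnd(Duration.ofSeconds(2),   // Streamhub ingestion (p50)
                                  Duration.ofSeconds(1),   // TCS internal invalidation delay
                                  Duration.ofSeconds(3));  // sidecar invalidation delay
        System.out.println(total); // PT6S
    }
}
```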

tcs-sidecar-invalidation-delay.png

TCS Sidecar Invalidation delay (can be filtered by service and TCS region)
