Last updated Aug 13, 2024

Observability and Monitoring

Metrics and Dashboards

  • TCS Sidecar Details (user experience): https://atlassian.signalfx.com/#/dashboard/FpodAP5A4AA
  • Tenant Context Service (cache usage, backing store latency, errors): https://atlassian.signalfx.com/#/dashboard/FpodAEpAwAA

In most cases, the sidecar dashboard is the best measure of user experience. The sidecar has its own caching and makes concurrent requests to several TCS regions, so problems seen in a single TCS region don’t necessarily translate to customer impact. However, not all ingestion metrics are visible to sidecars, so both dashboards may need to be reviewed.

Detecting inconsistent versions and invalidation problems

Where a backing store returns a stale version on invalidation, TCS will attempt to self-heal (in the background) within reasonable constraints. For up to 700 keys, TCS will retry twice, at t=eventReceivedTime+40s and t=eventReceivedTime+50s.
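As an illustration only, the self-heal behaviour amounts to scheduling two fixed-offset re-checks from the event receipt time. The sketch below is a minimal, hypothetical rendering of that schedule; the class and method names (InvalidationRetrier, recheckKeys) are not the actual TCS implementation.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of the retry schedule described above: for an invalidation event
// received at eventReceivedTime, re-check the backing store at +40s and +50s,
// but only when at most 700 keys are affected. Names are hypothetical.
public class InvalidationRetrier {
    private static final List<Duration> RETRY_OFFSETS =
            List.of(Duration.ofSeconds(40), Duration.ofSeconds(50));
    private static final int MAX_KEYS = 700; // self-heal only applies up to this many keys

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public void scheduleRetries(Instant eventReceivedTime, List<String> staleKeys) {
        if (staleKeys.size() > MAX_KEYS) {
            return; // too many keys; rely on normal expiry / background refresh instead
        }
        for (Duration offset : RETRY_OFFSETS) {
            long delayMs = Duration.between(Instant.now(), eventReceivedTime.plus(offset)).toMillis();
            scheduler.schedule(() -> recheckKeys(staleKeys), Math.max(delayMs, 0), TimeUnit.MILLISECONDS);
        }
    }

    private void recheckKeys(List<String> keys) {
        // Hypothetical: reload each key from the backing store and compare versions.
    }
}
```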

Invalidation event processing is asynchronous, and reported invalidation latencies are an upper bound only; actual latency is lower. This is because the check from invalidation receipt to successful verification uses polling with a ~10 second interval, so the reported figure can overstate the true latency by up to roughly 10 seconds.

The latency of successful invalidation is published as tenant-context-service.invalidation-check.success-time.upper_99 (the median and other aggregations are also available). Note: this does not include Streamhub latency or sidecar invalidation; see below for those. A count of successful events is also published: tenant-context-service.invalidation-check.success.count

If the expected version has not been received after 60 seconds, TCS will cease retrying and emit a failure metric: tenant-context-service.invalidation-check.failure.count
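For reference, these metrics amount to a timer for successful verifications plus success/failure counters. The sketch below assumes a Dropwizard-style MetricRegistry and is purely illustrative; the actual metrics wiring inside TCS may differ.

```java
import com.codahale.metrics.Counter;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;

import java.util.concurrent.TimeUnit;

// Sketch only: how the invalidation-check metrics named above could be emitted
// with a Dropwizard-style registry. The wrapper class is hypothetical.
public class InvalidationCheckMetrics {
    private final Timer successTime;
    private final Counter successCount;
    private final Counter failureCount;

    public InvalidationCheckMetrics(MetricRegistry registry) {
        successTime = registry.timer("tenant-context-service.invalidation-check.success-time");
        successCount = registry.counter("tenant-context-service.invalidation-check.success.count");
        failureCount = registry.counter("tenant-context-service.invalidation-check.failure.count");
    }

    // Record a successful verification and how long it took from event receipt.
    public void recordSuccess(long elapsedMillis) {
        successTime.update(elapsedMillis, TimeUnit.MILLISECONDS);
        successCount.inc();
    }

    // Record that the expected version never arrived within the 60-second window.
    public void recordFailure() {
        failureCount.inc();
    }
}
```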

In this case, the stale value will be cached by TCS until either it expires or a background reload is triggered. See RecordType.java for the configured expiry and refresh periods.
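Expiry and refresh periods are defined per record type. The enum below is a hypothetical sketch of what that configuration might look like; the record type names and durations are placeholders, so consult RecordType.java in the TCS source for the real values.

```java
import java.time.Duration;

// Hypothetical sketch of per-record-type cache configuration. The record types
// and durations are illustrative only; see RecordType.java for the actual
// expiry and background-refresh periods.
public enum RecordType {
    TENANT(Duration.ofHours(24), Duration.ofHours(1)),
    LICENSE(Duration.ofHours(12), Duration.ofMinutes(30));

    private final Duration expiry;          // how long a (possibly stale) value may remain cached
    private final Duration refreshInterval; // how often a background reload is triggered

    RecordType(Duration expiry, Duration refreshInterval) {
        this.expiry = expiry;
        this.refreshInterval = refreshInterval;
    }

    public Duration expiry() {
        return expiry;
    }

    public Duration refreshInterval() {
        return refreshInterval;
    }
}
```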

The Tenant Context Service dashboard has a chart tracking verification of invalidation events, by record type. When failures occur, they appear as histogram counts overlaid on the total event count. This can be filtered using dashboard overrides.

invalidation-events-received-no-verification-failures.png

Invalidation events received, no verification failures

verification-failures-appearing-on-cache-verification-chart.png

Verification failures appearing on the Cache Verification chart

Estimating invalidation delay

TCS doesn’t currently provide an end-to-end metric for invalidation delay. To validate the health of read-through invalidation, review the metrics below. To estimate the ‘total’ invalidation delay, sum the components, e.g. Streamhub delay + TCS invalidation + max(backing store latency, sidecar invalidation delay). Note that a slow or stale backing store response may significantly increase the delay if retries are required.
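As a rough worked example, the estimate can be computed by summing the observed component latencies. All figures in the sketch below are placeholders; substitute the values you read from the dashboard charts listed in the sections that follow.

```java
import java.time.Duration;

// Rough worked example of the estimate described above:
//   total ≈ Streamhub delay + TCS internal invalidation delay
//           + max(backing store latency, sidecar invalidation delay)
// All figures are placeholders; read the real values from the dashboards.
public class InvalidationDelayEstimate {
    public static void main(String[] args) {
        Duration streamhubDelay = Duration.ofSeconds(2);       // Streamhub ingestion latency (p50)
        Duration tcsInvalidation = Duration.ofSeconds(5);      // TCS internal invalidation delay
        Duration backingStoreLatency = Duration.ofMillis(300); // read-through backing store latency
        Duration sidecarInvalidation = Duration.ofSeconds(3);  // sidecar invalidation delay

        Duration slowerTail = backingStoreLatency.compareTo(sidecarInvalidation) > 0
                ? backingStoreLatency
                : sidecarInvalidation;

        Duration total = streamhubDelay.plus(tcsInvalidation).plus(slowerTail);
        System.out.println("Estimated total invalidation delay: " + total.toMillis() + " ms");
        // Note: a slow or stale backing store response that triggers the +40s/+50s
        // re-checks can push this estimate up significantly.
    }
}
```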

Please let our team know if you plan to configure alerts against any of these charts/metrics, as they may change without notice.

TCS Dashboard Invalidation Metrics

  • Read-through Cache Invalidation Events → latency (p50) streamhub ingestion
    • Measures the period from your Streamhub write until TCS first consumes the event
  • TCS Invalidation Delay
    • Measures the average time from initial ingestion until the event reaches the other TCS EC2 instances
    • Does not include loading of the keys (if they are loaded at all)
    • Internal invalidation events accumulate updates for several record types, so this cannot be filtered to your record type
  • Read-through Backing Store Latency
    • P50 and P99 latency observed by TCS when calling the backing store API

read-through-invalidation-events-chart.png

Read-through invalidation events chart

internal-invalidation-delay-chart.png

TCS ‘internal’ invalidation delay

latency-chart.png

Latency p50 + p90, by environment and record type

TCS Sidecar Dashboard Invalidation Metrics

Invalidation Delay

  • Measures the delay from the TCS service to the TCS Sidecar
  • Can be filtered by service and TCS parent region

tcs-sidecar-invalidation-delay.png

TCS Sidecar Invalidation delay (can be filtered by service and TCS region)
