Last updated Sep 24, 2024

Investigating alerts triggered by your DROID detectors

This page provides guidance on investigating alerts generated by the detectors you set up using our Setup anomaly detection tutorial. If you haven't done so already, we recommend going through that tutorial first to set up anomaly detection for your integration.

Backing store API fetch errors

These alerts are generated by the detector you set up here.

Example alert title

Alert title: "Backing store API fetch errors attributed to service my-service (Production)"

Cause

DROID has encountered one or more errors when trying to fetch records from your Read-through service's backing store API.

Errors can occur for a number of reasons, such as DROID being unable to reach your service (e.g. a transient network issue on either end) or your service returning an unexpected response.
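
As a first check, it can be useful to verify from outside DROID that your backing store API is reachable and returns a well-formed response. The sketch below is a minimal Python example; the base URL, endpoint path and expected fields are placeholders, so substitute your Read-through service's actual contract.

# Minimal sanity check against a backing store API.
# The URL, endpoint path and expected fields below are placeholders;
# replace them with your Read-through service's actual contract.
import requests

BASE_URL = "https://my-service.example.com"   # placeholder
RECORD_ID = "some-known-record-id"            # a record you expect to exist

try:
    resp = requests.get(f"{BASE_URL}/records/{RECORD_ID}", timeout=5)
    print("HTTP status:", resp.status_code)
    resp.raise_for_status()
    body = resp.json()
    # Check for the fields DROID expects (placeholder field names).
    missing = [f for f in ("id", "value") if f not in body]
    if missing:
        print("Response is missing expected fields:", missing)
    else:
        print("Response looks well-formed:", body)
except (requests.exceptions.RequestException, ValueError) as exc:
    # Covers connection errors, timeouts, non-2xx statuses and invalid JSON.
    print("Fetch failed:", exc)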

It is recommended to:

Backing store API latency too high

These alerts are generated by the detector you set up here.

Example alert title

Alert title: "Backing store API latency is too high attributed to service my-service (Production)"

Cause

DROID has detected that the mean latency of your Read-through service's backing store API has exceeded the threshold set in your detector.

High latency can be caused by a number of factors, such as high load on your service or cross-region network latency.
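
To see whether the latency is reproducible from your side, you can time a few requests against the backing store API and compare the mean with the threshold configured in your detector. The sketch below is illustrative only; the URL is a placeholder, and client-side timings will differ from DROID's measurements, especially if DROID calls your service from a different region.

# Rough client-side latency probe for the backing store API.
# The URL is a placeholder; client-side timings are only indicative.
import time
import statistics
import requests

URL = "https://my-service.example.com/records/some-known-record-id"  # placeholder
SAMPLES = 10

latencies_ms = []
for _ in range(SAMPLES):
    start = time.perf_counter()
    requests.get(URL, timeout=10)
    latencies_ms.append((time.perf_counter() - start) * 1000)

print(f"mean latency: {statistics.mean(latencies_ms):.1f} ms")
print(f"max latency:  {max(latencies_ms):.1f} ms")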

It is recommended to:

Invalidation event processing errors

These alerts are generated by the detector you set up here.

Example alert title

Alert title: "Invalidation event processing errors attributed to service my-service (Production)"

Cause

DROID has encountered one or more errors when trying to process cache invalidation events sent by your service via Streamhub.

Although Streamhub validates the event schema at the source, DROID performs some additional validation that is not possible via Streamhub, so it is still possible for part of the payload to be malformed.

It is recommended to check the following:

To help find the root cause, you can also check the Tenant Context Service's logs for the exact error encountered using the following Splunk query:

`micros_tenant-context-service` env=prod* logger_name="*.StreamhubReceiver" contextMap.streamhubEventId="YOUR_STREAMHUB_EVENT_ID"
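
If the logs point to a malformed payload, it can also help to validate events on your side before they are published. The sketch below is a minimal example; the required fields and types are placeholders, so align them with the schema your integration actually uses.

# Local pre-publish validation of an invalidation event payload.
# The required fields below are placeholders; align them with the
# schema your integration registered with DROID/Streamhub.
REQUIRED_FIELDS = {"entityId": str, "entityType": str, "timestamp": str}

def validate_event(event: dict) -> list[str]:
    """Return a list of human-readable problems; empty means the event looks OK."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in event:
            problems.append(f"missing field: {field}")
        elif not isinstance(event[field], expected_type):
            problems.append(
                f"field {field} has type {type(event[field]).__name__}, "
                f"expected {expected_type.__name__}"
            )
    return problems

# Example usage with a deliberately malformed event.
event = {"entityId": 123, "entityType": "page"}  # wrong type, missing timestamp
for problem in validate_event(event):
    print(problem)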

Invalidation event traffic anomalies

These alerts are generated by the detector you set up here.

Example alert title

This detector can generate alerts for both abnormal growth and abnormal drops in ingestion traffic.

For abnormal growth in ingestion traffic, you may see an alert titled: "Abnormal growth in invalidation events received (>50%) attributed to my-service (Production)". For abnormal drops in ingestion traffic, you may see an alert titled: "Abnormal decrease in invalidation events received (<50%) attributed to my-service (Production)".

Cause

The detector checks the number of invalidation events sent by your service over the past day and compares it to the number of events sent on the same day one week ago. If the variance is greater than the threshold set in your detector (e.g. 50%), an alert is generated.
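
For a concrete sense of how the comparison works, here is a small worked example of the week-over-week check with a 50% threshold. The event counts are made-up numbers for illustration only; the real counts come from DROID's ingestion metrics.

# Week-over-week comparison as performed by the detector (illustrative only).
# The counts below are made-up; DROID derives them from its ingestion metrics.
THRESHOLD = 0.50  # 50%, as configured in the detector

events_today = 45_000      # invalidation events received in the past day
events_week_ago = 120_000  # events received on the same day one week ago

change = (events_today - events_week_ago) / events_week_ago
print(f"week-over-week change: {change:+.0%}")

if abs(change) > THRESHOLD:
    direction = "growth" if change > 0 else "decrease"
    print(f"abnormal {direction}: exceeds the {THRESHOLD:.0%} threshold, alert fires")
else:
    print("within threshold, no alert")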

It is important that you verify if the growth or drop in traffic was anticipated.

If the growth or drop in traffic was expected:

  • Communicate this to the DROID team, as it may have cost implications for DROID.
  • It's fine for traffic patterns to change over time, but it's important to review the thresholds set in your detectors to ensure they are still relevant and to avoid unnecessary alerts.

If the growth or drop in traffic was not expected:

Additional resources

Here are some additional resources that may help you in your investigation:
