Last updated Sep 24, 2024

Setting up anomaly detection for your integration

As a DROID producer, it is important to be aware of potential issues with your integration, not least because they can impact consumers of your data. This tutorial will walk you through setting up and configuring anomaly detection for your Read-through integration, utilising SignalFx detectors and the recommended Config-as-Code solution, Terraform.

At the end of this tutorial, you will have the following detectors set up:

  • A detector tracking backing store API anomalies (errors and high latency) in the production environment.
  • A detector tracking invalidation event processing errors in the production environment (optional).
  • A detector tracking invalidation event traffic anomalies in the production environment (optional).

Before you begin

Before diving in, please ensure you have the following:

If you choose to opt out of cache invalidations, you can skip creating the detectors related to cache invalidations (steps 4 & 5).

Step 1: Define required variables for notifications

To receive notifications via OpsGenie and Slack for alerts triggered by the detectors, you will need to define the following variables (a declaration sketch follows the list):

  • opsgenie_credential_id - The OpsGenie credential ID for your team.
  • opsgenie_team_name - The OpsGenie team name for your team.
  • opsgenie_responder_id - The OpsGenie responder ID for your team.
  • slack_credential_id - The Slack credential ID for your team.
  • slack_alerts_channel - The Slack channel to send your team's alerts to.
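
If you pass these in as module inputs, a minimal declaration sketch could look like the following (the types and descriptions are illustrative; adjust them to match your module):

variable "opsgenie_credential_id" {
  description = "The OpsGenie credential ID for your team."
  type        = string
}

variable "opsgenie_team_name" {
  description = "The OpsGenie team name for your team."
  type        = string
}

variable "opsgenie_responder_id" {
  description = "The OpsGenie responder ID for your team."
  type        = string
}

variable "slack_credential_id" {
  description = "The Slack credential ID for your team."
  type        = string
}

variable "slack_alerts_channel" {
  description = "The Slack channel to send your team's alerts to."
  type        = string
}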

Utilising these variables, you can define the notification channels as a list in your Terraform configuration, which will be used by the detectors later on. This list can be passed in as a variable to your Terraform module, or defined directly in your Terraform detectors configuration file, as below:

locals {
  alert_notifications = [
    "Opsgenie,${var.opsgenie_credential_id},${var.opsgenie_team_name},${var.opsgenie_responder_id},Team",
    "Team,${var.signalfx_team_id}",
    "Slack,${var.slack_credential_id},${var.slack_alerts_channel}",
  ]
}
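
Each entry in alert_notifications is expected to follow the SignalFx Terraform provider's notification string format (for example, Opsgenie,{credentialId},{responderName},{responderId},{responderType} and Slack,{credentialId},{channelName}); if in doubt, check the signalfx_detector resource documentation for the exact fields.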

Step 2: Define required variables for detectors

Now, let's define the variables that are specific to the detectors you will be setting up (a declaration sketch follows the list):

  • signalfx_team_id - The SignalFx team ID associated with your team.
  • alert_notifications - A list of notification channels to be used for alerts; see the example in the previous step.
  • output_record_types - A list of the output record types for your integration, which will be served by TCS, e.g. ["APP_DEFINITIONS_IN_CONTEXT", "APP_DEFINITION"].
  • producer_service_name - The name of your service sending data to DROID, e.g. "pipeline-tester-service".
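
For reference, here is a minimal sketch of how output_record_types and producer_service_name could be declared as locals, matching how the detector examples below reference them (the values shown are the illustrative examples from the list above); alert_notifications and the SignalFx team ID are referenced as module variables (var.) in those examples:

locals {
  # Output record types served by TCS for your integration (illustrative values)
  output_record_types   = ["APP_DEFINITIONS_IN_CONTEXT", "APP_DEFINITION"]

  # The name of your service sending data to DROID (illustrative value)
  producer_service_name = "pipeline-tester-service"
}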

Additionally, you will need to define a helper local that can be used within the SignalFx detector programs (SignalFlow) to filter on the output record types for your integration.

locals {
  # Output record types - SignalFlow-friendly filter values
  output_record_types_filter_values = join(", ", [for record_type in local.output_record_types: "'${record_type}'"])
}
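
With the example record types above, output_record_types_filter_values renders to 'APP_DEFINITIONS_IN_CONTEXT', 'APP_DEFINITION', which can be interpolated directly into the SignalFlow filter() calls used by the detector programs below.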

Step 3: Create the SignalFx detector resource for backing store API anomalies

The first detector we will create is for tracking backing store API anomalies in the production environment. This detector will be triggered whenever DROID fails to fetch data from your backing store API, or when DROID experiences high latency when fetching data from it.

The outlined example below is a starting point for the detector configuration. Alerts will be triggered if:

  • DROID encounters any fetch errors to your API over a one-hour window.
  • DROID experiences an average latency above 5000ms over a one-minute window, for at least two-thirds of a five-minute period.

The thresholds can be adjusted to better suit your requirements.
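
If you would rather not hard-code these values inside the SignalFlow program below, a minimal sketch (assuming you interpolate them the same way as the other locals) is to lift them into locals first, mirroring the approach used for the traffic anomaly detector in step 5:

locals {
  # Maximum acceptable mean latency for backing store fetches, in milliseconds
  backing_store_latency_threshold_ms = 5000

  # Fraction of the five-minute window that must breach the threshold before alerting
  backing_store_latency_at_least = 0.67
}

You would then reference these in the detect() calls, for example threshold(${local.backing_store_latency_threshold_ms}) and at_least=${local.backing_store_latency_at_least}.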

resource "signalfx_detector" "DROID_read_through_backing_store_api_anomalies_production" {
  name = "DROID Read Through - Backing Store API anomalies - ${local.producer_service_name} - Production"
  teams = [var.signalfx_team_id]
  time_range        = 3600
  show_data_markers = true
  program_text = <<EOF
# Backing store fetch errors
BackingStoreApiFetchErrorsOver1hFilter = filter('recordType', ${local.output_record_types_filter_values}) and filter('environment_type', 'prod') and filter('micros_service_id', 'tcs-web-*', 'tenant-context-service')
BackingStoreApiFetchErrorsOver1h = data('*.readThrough.backing-store-fetch-error.count', filter=BackingStoreApiFetchErrorsOver1hFilter, rollup='sum', extrapolation='zero').sum(over='1h').sum(by=['perimeter', 'recordType']).publish('BackingStoreApiFetchErrorsOver1h')
# Backing store high latency
BackingStoreApiLatencyMeanOver1mFilter = filter('recordType', ${local.output_record_types_filter_values}) and filter('environment_type', 'prod') and filter('micros_service_id', 'tcs-web-*', 'tenant-context-service')
BackingStoreApiLatencyMeanOver1m = data('*.readThrough.latency.mean', filter=BackingStoreApiLatencyMeanOver1mFilter).mean(over='1m').mean(by=['perimeter', 'recordType']).publish('BackingStoreApiLatencyMeanOver1m')
# Alert triggers - Prod
detect(when(BackingStoreApiFetchErrorsOver1h > threshold(0))).publish('Backing store API fetch errors attributed to service ${local.producer_service_name} (Production)')
detect(when(BackingStoreApiLatencyMeanOver1m > threshold(5000), lasting='5m', at_least=0.67)).publish('Backing store API latency is too high attributed to service ${local.producer_service_name} (Production)')
EOF

  rule {
    detect_label       = "Backing store API fetch errors attributed to service ${local.producer_service_name} (Production)"
    description        = "The value of readThrough.backing-store-fetch-error.count is above 0."
    runbook_url        = "https://developer.atlassian.com/platform/droid/read-through-producers/investigate-alerts-triggered-by-detectors/#backing-store-api-fetch-errors"
    severity           = "Warning"
    notifications      = var.alert_notifications
    parameterized_body = "{{#if anomalous}}\n\tRule \"{{{ruleName}}}\" in detector \"{{{detectorName}}}\" triggered at {{dateTimeFormat timestamp format=\"full\"}}.\n{{else}}\n\tRule \"{{{ruleName}}}\" in detector \"{{{detectorName}}}\" cleared at {{dateTimeFormat timestamp format=\"full\"}}.\n{{/if}}\n\n{{#if anomalous}}\nTriggering condition: {{{readableRule}}}\n{{/if}}\n\n{{#if anomalous}}Signal value for prod errors: {{inputs.BackingStoreApiFetchErrorsOver1h.value}}\n{{else}}Current signal value for prod errors: {{inputs.BackingStoreApiFetchErrorsOver1h.value}}\n{{/if}}\n\n{{#notEmpty dimensions}}\nSignal details:\n{{{dimensions}}}\n{{/notEmpty}}\n\n{{#if anomalous}}\n{{#if runbookUrl}}Runbook: {{{runbookUrl}}}{{/if}}\n{{#if tip}}Tip: {{{tip}}}{{/if}}\n{{/if}}environment_type:prod\nproducer_service_name:${local.producer_service_name}\n"
  }

  rule {
    detect_label       = "Backing store API latency is too high attributed to service ${local.producer_service_name} (Production)"
    description        = "The mean value of readThrough.latency is above 5000ms for the last minute."
    runbook_url        = "https://developer.atlassian.com/platform/droid/read-through-producers/investigate-alerts-triggered-by-detectors/#backing-store-api-latency-too-high"
    severity           = "Warning"
    notifications      = var.alert_notifications
    parameterized_body = "{{#if anomalous}}\n\tRule \"{{{ruleName}}}\" in detector \"{{{detectorName}}}\" triggered at {{dateTimeFormat timestamp format=\"full\"}}.\n{{else}}\n\tRule \"{{{ruleName}}}\" in detector \"{{{detectorName}}}\" cleared at {{dateTimeFormat timestamp format=\"full\"}}.\n{{/if}}\n\n{{#if anomalous}}\nTriggering condition: {{{readableRule}}}\n{{/if}}\n\n{{#if anomalous}}Signal value for mean latency: {{inputs.BackingStoreApiLatencyMeanOver1m.value}}\n{{else}}Current signal value for mean latency: {{inputs.BackingStoreApiLatencyMeanOver1m.value}}\n{{/if}}\n\n{{#notEmpty dimensions}}\nSignal details:\n{{{dimensions}}}\n{{/notEmpty}}\n\n{{#if anomalous}}\n{{#if runbookUrl}}Runbook: {{{runbookUrl}}}{{/if}}\n{{#if tip}}Tip: {{{tip}}}{{/if}}\n{{/if}}environment_type:prod\nproducer_service_name:${local.producer_service_name}\n"
  }
}

(OPTIONAL) Step 4: Create the SignalFx detector resource for invalidation event processing errors

Next, we will create the SignalFx detector resource for tracking invalidation event processing errors. This detector will trigger an alert whenever DROID runs into invalidation event parsing errors attributed to your service and transformer.

The outlined example below is a starting point for the detector configuration, which will trigger an alert as soon as any error has been detected within a one-hour window. The thresholds can be adjusted to better suit your requirements.

resource "signalfx_detector" "DROID_read_through_invalidation_errors_production" {
  name = "DROID Read Through - Invalidation Errors - ${local.producer_service_name} - Production"
  teams = [var.signalfx_team_id]
  time_range        = 3600
  show_data_markers = true
  program_text = <<EOF
# Read-through invalidation processing errors
ReadThroughInvalidationProcessingErrorsFilter = filter('ingestionSource', '*${local.producer_service_name}') and filter('environment_type', 'prod')
ReadThroughInvalidationProcessingErrorsOver1h = data('tenant-context-service.read-through-cache.invalidation-processing-error.count', filter=ReadThroughInvalidationProcessingErrorsFilter, rollup='sum', extrapolation='zero').sum(over='1h').sum(by=['perimeter']).publish('ReadThroughInvalidationProcessingErrorsOver1h')
# Alert triggers - Prod
detect(when(ReadThroughInvalidationProcessingErrorsOver1h > threshold(0))).publish('Invalidation processing errors attributed to service ${local.producer_service_name} (Production)')

EOF

  rule {
    detect_label       = "Invalidation processing errors attributed to service ${local.producer_service_name} (Production)"
    description        = "The value of tenant-context-service.read-through-cache.invalidation-processing-error is above 0."
    runbook_url        = "https://developer.atlassian.com/platform/droid/read-through-producers/investigate-alerts-triggered-by-detectors/#invalidation-event-processing-errors"
    severity           = "Warning"
    notifications      = var.alert_notifications
    parameterized_body = "{{#if anomalous}}\n\tRule \"{{{ruleName}}}\" in detector \"{{{detectorName}}}\" triggered at {{dateTimeFormat timestamp format=\"full\"}}.\n{{else}}\n\tRule \"{{{ruleName}}}\" in detector \"{{{detectorName}}}\" cleared at {{dateTimeFormat timestamp format=\"full\"}}.\n{{/if}}\n\n{{#if anomalous}}\nTriggering condition: {{{readableRule}}}\n{{/if}}\n\n{{#if anomalous}}Signal value for prod errors: {{inputs.ReadThroughInvalidationProcessingErrorsOver1h.value}}\n{{else}}Current signal value for prod errors: {{inputs.ReadThroughInvalidationProcessingErrorsOver1h.value}}\n{{/if}}\n\n{{#notEmpty dimensions}}\nSignal details:\n{{{dimensions}}}\n{{/notEmpty}}\n\n{{#if anomalous}}\n{{#if runbookUrl}}Runbook: {{{runbookUrl}}}{{/if}}\n{{#if tip}}Tip: {{{tip}}}{{/if}}\n{{/if}}environment_type:prod\nservice_name: tenant-context-service\nproducer_service_name:${local.producer_service_name}\n"
  }
}

(OPTIONAL) Step 5: Create the SignalFx detector resource for invalidation event traffic anomalies

The final detector we will create is for tracking invalidation event traffic anomalies in the production environment. This detector will trigger an alert whenever DROID detects a growth or decline of more than 50% in invalidation events sent by your service via the LIVE pipeline (StreamHub) over a one-day period, compared to the same day one week ago. Please note that the 50% threshold is just an example and should be adjusted to suit your integration's expected traffic patterns.

Why is this important? It is crucial to be aware of any sudden changes in the number of entities ingested, as this can have cost implications for both your service and DROID.

The outlined example below is a starting point for the detector configuration. It compares the number of invalidation events received over the past day with the number received over the same day one week ago, and triggers an alert if the variance stays above the defined threshold for a thirty-minute period. The thresholds can be adjusted to better suit your requirements.

locals {
  # Thresholds for the percentage change in invalidation events received
  invalidation_events_variance_threshold = 50.0
  # Lower bound for the number of invalidation events received (if the number of entities ingested was too low to begin with, we don't want to trigger an alert)
  # This can be set to the expected minimum number of invalidation events received in a day.
  invalidation_event_traffic_minimum = 0
}

resource "signalfx_detector" "DROID_read_through_invalidation_event_rate_anomaly_production" {
  name = "[SAMPLE] DROID Read Through - Invalidation Event Rate Anomaly - ${local.producer_service_name} - Production"
  teams = [var.signalfx_tlt_services_team_id]
  time_range        = 3600
  show_data_markers = true
  program_text = <<EOF
# Invalidations received (current & previous week)
ReadThroughInvalidationEventsReceivedFilter = filter('recordType', ${local.output_record_types_filter_values}) and filter('environment_type', 'prod')
ReadThroughInvalidationEventsReceivedCurrent = data('tenant-context-service.read-through-cache.invalidation-received.count', filter=ReadThroughInvalidationEventsReceivedFilter, rollup='sum', extrapolation='zero').sum(over='1d').sum(by=['perimeter', 'recordType']).publish('ReadThroughInvalidationEventsReceivedCurrent')
ReadThroughInvalidationEventsReceivedPreviousWeek = data('tenant-context-service.read-through-cache.invalidation-received.count', filter=ReadThroughInvalidationEventsReceivedFilter, rollup='sum', extrapolation='zero').sum(over='1d').sum(by=['perimeter', 'recordType']).timeshift('1w').publish('ReadThroughInvalidationEventsReceivedPreviousWeek')
# Calculate the percentage variance
PercentageVarianceFromPreviousWeek = ((((ReadThroughInvalidationEventsReceivedCurrent - ReadThroughInvalidationEventsReceivedPreviousWeek) / ReadThroughInvalidationEventsReceivedPreviousWeek) * 100) if ReadThroughInvalidationEventsReceivedPreviousWeek > ${local.invalidation_event_traffic_minimum} else 0).publish(label='PercentageVarianceFromPreviousWeek')
# Alert triggers - Prod
detect(when(PercentageVarianceFromPreviousWeek > ${local.invalidation_events_variance_threshold}, lasting='30m')).publish("Abnormal growth in invalidation events received (>${local.invalidation_events_variance_threshold}%) attributed to ${local.producer_service_name} (Production)")
detect(when(PercentageVarianceFromPreviousWeek < -${local.invalidation_events_variance_threshold}, lasting='30m')).publish("Abnormal decrease in invalidation events received (<${local.invalidation_events_variance_threshold}%) attributed to ${local.producer_service_name} (Production)")

EOF

  rule {
    detect_label       = "Abnormal growth in invalidation events received (>${local.invalidation_events_variance_threshold}%) attributed to ${local.producer_service_name} (Production)"
    description        = "An abnormal growth in invalidation events received compared to last week (>${local.invalidation_events_variance_threshold}%) detected for the past 30 minutes."
    runbook_url        = "https://developer.atlassian.com/platform/droid/read-through-producers/investigate-alerts-triggered-by-detectors/#invalidation-event-traffic-anomalies"
    severity           = "Warning"
    notifications      = var.alert_notifications
    parameterized_body = "{{#if anomalous}}\n\tRule \"{{{ruleName}}}\" in detector \"{{{detectorName}}}\" triggered at {{dateTimeFormat timestamp format=\"full\"}}.\n{{else}}\n\tRule \"{{{ruleName}}}\" in detector \"{{{detectorName}}}\" cleared at {{dateTimeFormat timestamp format=\"full\"}}.\n{{/if}}\n\n{{#if anomalous}}\nTriggering condition: {{{readableRule}}}\n{{/if}}\n\n{{#if anomalous}}Signal value for PercentageVarianceFromPreviousWeek: {{inputs.PercentageVarianceFromPreviousWeek.value}}\n{{else}}Current signal value for PercentageVarianceFromPreviousWeek: {{inputs.PercentageVarianceFromPreviousWeek.value}}\n{{/if}}\n\n{{#notEmpty dimensions}}\nSignal details:\n{{{dimensions}}}\n{{/notEmpty}}\n\n{{#if anomalous}}\n{{#if runbookUrl}}Runbook: {{{runbookUrl}}}{{/if}}\n{{#if tip}}Tip: {{{tip}}}{{/if}}\n{{/if}}environment_type:prod\nservice_name: tenant-context-service\nproducer_service_name:${local.producer_service_name}\n"
  }

  rule {
    detect_label       = "Abnormal decrease in invalidation events received (<${local.invalidation_events_variance_threshold}%) attributed to ${local.producer_service_name} (Production)"
    description        = "An abnormal decrease in invalidation events received compared to last week (<${local.invalidation_events_variance_threshold}%) detected for the past 30 minutes."
    runbook_url        = "https://developer.atlassian.com/platform/droid/read-through-producers/investigate-alerts-triggered-by-detectors/#invalidation-event-traffic-anomalies"
    severity           = "Warning"
    notifications      = var.alert_notifications
    parameterized_body = "{{#if anomalous}}\n\tRule \"{{{ruleName}}}\" in detector \"{{{detectorName}}}\" triggered at {{dateTimeFormat timestamp format=\"full\"}}.\n{{else}}\n\tRule \"{{{ruleName}}}\" in detector \"{{{detectorName}}}\" cleared at {{dateTimeFormat timestamp format=\"full\"}}.\n{{/if}}\n\n{{#if anomalous}}\nTriggering condition: {{{readableRule}}}\n{{/if}}\n\n{{#if anomalous}}Signal value for PercentageVarianceFromPreviousWeek: {{inputs.PercentageVarianceFromPreviousWeek.value}}\n{{else}}Current signal value for PercentageVarianceFromPreviousWeek: {{inputs.PercentageVarianceFromPreviousWeek.value}}\n{{/if}}\n\n{{#notEmpty dimensions}}\nSignal details:\n{{{dimensions}}}\n{{/notEmpty}}\n\n{{#if anomalous}}\n{{#if runbookUrl}}Runbook: {{{runbookUrl}}}{{/if}}\n{{#if tip}}Tip: {{{tip}}}{{/if}}\n{{/if}}environment_type:prod\nservice_name: tenant-context-service\nproducer_service_name:${local.producer_service_name}\n"
  }
}

Next steps

Now that you have set up anomaly detection for your integration, you are ready to deploy the detectors to SignalFx via Sauron. Once deployed, you can further tweak your thresholds utilising historical data in SignalFx's detector view.

For more information on how to investigate and resolve alerts triggered by these detectors, please refer to the Investigate alerts triggered by detectors guide.

Detector reference

You can view a full working example of the detectors covered in this tutorial in our source-code repository here.
