Last updated Sep 24, 2024

Setting up anomaly detection for your integration

As a DROID producer, it is important to be aware of potential issues with your integration, as they can impact consumers of your data. This tutorial walks you through setting up and configuring anomaly detection for your External Ingestion integration, utilising SignalFx detectors and the recommended Config-as-Code solution, Terraform.

At the end of this tutorial, you will have the following detectors set up:

  • A detector that alerts on any ingestion & transformation errors in the production environment.
  • A detector that tracks ingestion traffic anomalies in the production environment.

Before you begin

Before diving in, please ensure you have the following:

Step 1: Define required variables for notifications

To receive notifications for alerts triggered by the detectors via Opsgenie and Slack, you will need to define the following variables (example declarations follow this list):

  • opsgenie_credential_id - The Opsgenie credential ID for your team.
  • opsgenie_team_name - The Opsgenie team name for your team.
  • opsgenie_responder_id - The Opsgenie responder ID for your team.
  • slack_credential_id - The Slack credential ID for your team.
  • slack_alerts_channel - The name of your team's Slack alerts channel.
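
If you pass these values into your Terraform module as input variables, the declarations might look like the following sketch. The names simply mirror the list above; adjust types and descriptions to your module's conventions:

variable "opsgenie_credential_id" {
  type        = string
  description = "Opsgenie credential ID used for SignalFx notifications."
}

variable "opsgenie_team_name" {
  type        = string
  description = "Opsgenie team name used for SignalFx notifications."
}

variable "opsgenie_responder_id" {
  type        = string
  description = "Opsgenie responder ID for your team."
}

variable "slack_credential_id" {
  type        = string
  description = "Slack credential ID used for SignalFx notifications."
}

variable "slack_alerts_channel" {
  type        = string
  description = "Slack channel that receives the alerts."
}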

Utilising these variables, you can define the notification channels as a list in your Terraform configuration, which the detectors will use later on. This list can be passed in as a variable to your Terraform module, or defined directly in your Terraform detectors configuration file, as shown below:

locals {
  alert_notifications = [
    "Opsgenie,${var.opsgenie_credential_id},${var.opsgenie_team_name},${var.opsgenie_responder_id},Team",
    "Team,${var.signalfx_team_id}",
    "Slack,${var.slack_credential_id},${var.slack_alerts_channel}",
  ]
}

Step 2: Define required variables for detectors

Now, let's define the variables specific to the detectors you will be setting up; the sketch after this list shows one way to wire them into the locals referenced by the detector configurations:

  • signalfx_team_id - The SignalFx team ID associated with your team.
  • alert_notifications - A list of notification channels to be used for alerts; see the example in the previous step.
  • entity_types - A list of the source entity types that you are sending to DROID, e.g. ["PIPELINE_TESTER"].
  • producer_service_name - The name of your service sending data to DROID, e.g. "pipeline-tester-service".
  • transformer_name - The Java class name of your transformer as it appears in Transformer Service, e.g. "PipelineTesterRecordTransformer".
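
The detector configurations in the following steps reference these values through locals such as local.producer_service_name, local.transformer_name, and local.entity_types_filter_values, where the last is the list of entity types rendered as the comma-separated, quoted values that SignalFlow's filter() expects. The snippet below is one possible way to derive those locals from the variables above; adjust it to match how your module passes values around:

locals {
  producer_service_name = var.producer_service_name
  transformer_name      = var.transformer_name

  # Render ["PIPELINE_TESTER"] as "'PIPELINE_TESTER'" so the value can be
  # interpolated directly into a SignalFlow filter() call.
  entity_types_filter_values = join(", ", [for t in var.entity_types : "'${t}'"])
}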

Step 3: Create the SignalFx detector resource for ingestion & transformation errors

Next, we will create the SignalFx detector resource for tracking ingestion & transformation errors in the production environment. This detector will trigger an alert whenever DROID runs into ingestion parsing errors or transformation failures attributed to your service and transformer.

The example below is a starting point for the detector configuration; it will trigger an alert as soon as any error is detected within a one-hour window. Adjust the thresholds to better suit your requirements.

resource "signalfx_detector" "DROID_external_ingestion_errors_production" {
  name              = "DROID External Ingestion - Ingestion errors - ${local.producer_service_name} (Production)"
  teams             = [var.signalfx_team_id]
  time_range        = 3600
  show_data_markers = true

  program_text = <<-EOF
# Ingestion Parsing errors
IngestionParsingErrorsProdOver1hFilter = filter('ingestionSource', '*${local.producer_service_name}') and filter('ingestionPipeline', 'EXTERNAL_LIVE') and filter('environment_type', 'prod')
IngestionParsingErrorsProdOver1h = data('transformerservice.record.parsing.errors.count', filter=IngestionParsingErrorsProdOver1hFilter, rollup='sum', extrapolation='zero').sum(over='1h').sum(by=['perimeter']).publish(label='IngestionParsingErrorsProdOver1h')
# Transformation errors
TransformerFailureProdOver1hFilter = filter('transformer', '*${local.transformer_name}*') and filter('entityType', ${local.entity_types_filter_values}) and filter('result', 'error') and filter('pipeline', 'EXTERNAL_LIVE', 'EXTERNAL_BATCH') and filter('micros_service_id', 'transformerservice') and filter('environment_type', 'prod')
TransformerFailuresProdOver1h = data('logback.events.count', filter=TransformerFailureProdOver1hFilter, rollup='sum', extrapolation='zero').sum(over='1h').sum(by=['perimeter']).publish(label='TransformerFailuresProdOver1h')
# Alert triggers - Prod
detect(when(IngestionParsingErrorsProdOver1h > threshold(0))).publish('Ingestion parsing errors attributed to service ${local.producer_service_name} (Production)')
detect(when(TransformerFailuresProdOver1h > threshold(0))).publish('Transformation failures attributed to transformer ${local.transformer_name} (Production)')
  EOF

  rule {
    detect_label       = "Ingestion parsing errors attributed to service ${local.producer_service_name} (Production)"
    description        = "The value of transformerservice.record.parsing.errors.count is above 0."
    runbook_url        = "https://developer.atlassian.com/platform/droid/external-ingestion-producers/investigate-alerts-triggered-by-detectors/#ingestion-error-alerts"
    severity           = "Warning"
    notifications      = var.alert_notifications
    parameterized_body = "{{#if anomalous}}\n\tRule \"{{{ruleName}}}\" in detector \"{{{detectorName}}}\" triggered at {{dateTimeFormat timestamp format=\"full\"}}.\n{{else}}\n\tRule \"{{{ruleName}}}\" in detector \"{{{detectorName}}}\" cleared at {{dateTimeFormat timestamp format=\"full\"}}.\n{{/if}}\n\n{{#if anomalous}}\nTriggering condition: {{{readableRule}}}\n{{/if}}\n\n{{#if anomalous}}Signal value for prod errors: {{inputs.IngestionParsingErrorsProdOver1h.value}}\n{{else}}Current signal value for prod errors: {{inputs.IngestionParsingErrorsProdOver1h.value}}\n{{/if}}\n\n{{#notEmpty dimensions}}\nSignal details:\n{{{dimensions}}}\n{{/notEmpty}}\n\n{{#if anomalous}}\n{{#if runbookUrl}}Runbook: {{{runbookUrl}}}{{/if}}\n{{#if tip}}Tip: {{{tip}}}{{/if}}\n{{/if}}environment_type:prod\nservice_name: transformerservice\nproducer_service_name:${local.producer_service_name}\n"
  }

  rule {
    detect_label       = "Transformation failures attributed to transformer ${local.transformer_name} (Production)"
    description        = "The value of logback.events.count with result 'error' is above 0."
    runbook_url        = "https://developer.atlassian.com/platform/droid/external-ingestion-producers/investigate-alerts-triggered-by-detectors/#transformation-error-alerts"
    severity           = "Warning"
    notifications      = var.alert_notifications
    parameterized_body = "{{#if anomalous}}\n\tRule \"{{{ruleName}}}\" in detector \"{{{detectorName}}}\" triggered at {{dateTimeFormat timestamp format=\"full\"}}.\n{{else}}\n\tRule \"{{{ruleName}}}\" in detector \"{{{detectorName}}}\" cleared at {{dateTimeFormat timestamp format=\"full\"}}.\n{{/if}}\n\n{{#if anomalous}}\nTriggering condition: {{{readableRule}}}\n{{/if}}\n\n{{#if anomalous}}Signal value for prod errors: {{inputs.TransformerFailuresProdOver1h.value}}\n{{else}}Current signal value for prod errors: {{inputs.TransformerFailuresProdOver1h.value}}\n{{/if}}\n\n{{#notEmpty dimensions}}\nSignal details:\n{{{dimensions}}}\n{{/notEmpty}}\n\n{{#if anomalous}}\n{{#if runbookUrl}}Runbook: {{{runbookUrl}}}{{/if}}\n{{#if tip}}Tip: {{{tip}}}{{/if}}\n{{/if}}environment_type:prod\nservice_name: transformerservice\nproducer_service_name:${local.producer_service_name}\n"
  }
}

Step 4: Create the SignalFx detector resource for ingestion traffic anomalies

The second detector we will create tracks ingestion traffic anomalies in the production environment. This detector will trigger an alert whenever DROID detects growth or decline of greater than 50% in ingestion entities sent by your service via the LIVE pipeline (StreamHub) over a one-day period, compared to the same day one week ago. Please note that the 50% threshold is just an example and should be adjusted to suit your integration's expected traffic patterns.

Why is this important? It is crucial to be aware of any sudden changes in the number of entities ingested, as this can have cost implications for both your service and DROID.

The example below is a starting point for the detector configuration. It compares the number of entities ingested over the past day with the number ingested over the same day one week ago, and triggers an alert if the variance stays above the defined threshold for thirty minutes. Adjust the thresholds to better suit your requirements.

locals {
  # Thresholds for the percentage change in ingestion entities received
  entity_ingestion_traffic_variance_threshold = 50.0
  # Lower bound for the number of entities ingested (if the number of entities ingested was too low to begin with, we don't want to trigger an alert)
  # This can be set to the expected minimum number of entities ingested in a day.
  entity_ingestion_traffic_minimum = 0
}

resource "signalfx_detector" "DROID_external_ingestion_traffic_anomaly_production" {
  name              = "DROID External Ingestion - Ingestion traffic anomaly - Production"
  teams             = [var.signalfx_team_id]
  time_range        = 3600
  show_data_markers = true

  program_text = <<-EOF
# Entities ingested (current & previous week)
EntitiesIngestedFilter = filter('entityType', ${local.entity_types_filter_values}) and filter('micros_group', 'WebServer') and filter('environment_type', 'prod')
EntitiesIngestedCurrent = data('transformerservice.ingestion.delay.count', filter=EntitiesIngestedFilter, rollup='sum', extrapolation='zero').sum(over='1d').sum(by=['perimeter', 'entityType']).publish(label='EntitiesIngestedCurrent')
EntitiesIngestedPreviousWeek = data('transformerservice.ingestion.delay.count', filter=EntitiesIngestedFilter, rollup='sum', extrapolation='zero').sum(over='1d').sum(by=['perimeter', 'entityType']).timeshift('1w').publish(label='EntitiesIngestedPreviousWeek')
# Calculate the percentage variance
PercentageVarianceFromPreviousWeek = ((((EntitiesIngestedCurrent - EntitiesIngestedPreviousWeek) / EntitiesIngestedPreviousWeek) * 100) if EntitiesIngestedPreviousWeek > ${local.entity_ingestion_traffic_minimum} else 0).publish(label='PercentageVarianceFromPreviousWeek')
# Alert trigger - Prod
detect(when(PercentageVarianceFromPreviousWeek > ${local.entity_ingestion_traffic_variance_threshold}, lasting='30m')).publish("Abnormal growth in ingestion traffic (>${local.entity_ingestion_traffic_variance_threshold}%) attributed to ${local.producer_service_name} (Production)")
detect(when(PercentageVarianceFromPreviousWeek < -${local.entity_ingestion_traffic_variance_threshold}, lasting='30m')).publish("Abnormal decrease in ingestion traffic (<${local.entity_ingestion_traffic_variance_threshold}%) attributed to ${local.producer_service_name} (Production)")
  EOF

  rule {
    detect_label       = "Abnormal growth in ingestion traffic (>${local.entity_ingestion_traffic_variance_threshold}%) attributed to ${local.producer_service_name} (Production)"
    description        = "An abnormal growth in ingestion traffic compared to last week (>${local.entity_ingestion_traffic_variance_threshold}%) detected for the past 30 minutes."
    runbook_url        = "https://developer.atlassian.com/platform/droid/external-ingestion-producers/investigate-alerts-triggered-by-detectors/#ingestion-traffic-anomalies"
    severity           = "Warning"
    notifications      = var.alert_notifications
    parameterized_body = "{{#if anomalous}}\n\tRule \"{{{ruleName}}}\" in detector \"{{{detectorName}}}\" triggered at {{dateTimeFormat timestamp format=\"full\"}}.\n{{else}}\n\tRule \"{{{ruleName}}}\" in detector \"{{{detectorName}}}\" cleared at {{dateTimeFormat timestamp format=\"full\"}}.\n{{/if}}\n\n{{#if anomalous}}\nTriggering condition: {{{readableRule}}}\n{{/if}}\n\n{{#if anomalous}}Signal value for PercentageVarianceFromPreviousWeek: {{inputs.PercentageVarianceFromPreviousWeek.value}}\n{{else}}Current signal value for PercentageVarianceFromPreviousWeek: {{inputs.PercentageVarianceFromPreviousWeek.value}}\n{{/if}}\n\n{{#notEmpty dimensions}}\nSignal details:\n{{{dimensions}}}\n{{/notEmpty}}\n\n{{#if anomalous}}\n{{#if runbookUrl}}Runbook: {{{runbookUrl}}}{{/if}}\n{{#if tip}}Tip: {{{tip}}}{{/if}}\n{{/if}}environment_type:prod\nservice_name: transformerservice\nproducer_service_name:${local.producer_service_name}\n"
  }

  rule {
    detect_label       = "Abnormal decrease in ingestion traffic (<${local.entity_ingestion_traffic_variance_threshold}%) attributed to ${local.producer_service_name} (Production)"
    description        = "An abnormal decrease in ingestion traffic compared to last week (<${local.entity_ingestion_traffic_variance_threshold}%) detected for the past 30 minutes."
    runbook_url        = "https://developer.atlassian.com/platform/droid/external-ingestion-producers/investigate-alerts-triggered-by-detectors/#ingestion-traffic-anomalies"
    severity           = "Warning"
    notifications      = var.alert_notifications
    parameterized_body = "{{#if anomalous}}\n\tRule \"{{{ruleName}}}\" in detector \"{{{detectorName}}}\" triggered at {{dateTimeFormat timestamp format=\"full\"}}.\n{{else}}\n\tRule \"{{{ruleName}}}\" in detector \"{{{detectorName}}}\" cleared at {{dateTimeFormat timestamp format=\"full\"}}.\n{{/if}}\n\n{{#if anomalous}}\nTriggering condition: {{{readableRule}}}\n{{/if}}\n\n{{#if anomalous}}Signal value for PercentageVarianceFromPreviousWeek: {{inputs.PercentageVarianceFromPreviousWeek.value}}\n{{else}}Current signal value for PercentageVarianceFromPreviousWeek: {{inputs.PercentageVarianceFromPreviousWeek.value}}\n{{/if}}\n\n{{#notEmpty dimensions}}\nSignal details:\n{{{dimensions}}}\n{{/notEmpty}}\n\n{{#if anomalous}}\n{{#if runbookUrl}}Runbook: {{{runbookUrl}}}{{/if}}\n{{#if tip}}Tip: {{{tip}}}{{/if}}\n{{/if}}environment_type:prod\nservice_name: transformerservice\nproducer_service_name:${local.producer_service_name}\n"
  }

}

Next steps

Now that you have set up anomaly detection for your integration, you are ready to deploy the detectors to SignalFx via Sauron. Once deployed, you can further tweak your thresholds utilising historical data in SignalFx's detector view.
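
If you expect to tune the thresholds over time, one option is to expose them as input variables instead of hard-coded locals, so adjustments don't require editing the detector definitions. A minimal sketch, using an illustrative variable name:

variable "entity_ingestion_traffic_variance_threshold" {
  type        = number
  description = "Percentage change in ingested entities that triggers the traffic anomaly alert."
  default     = 50.0
}

locals {
  entity_ingestion_traffic_variance_threshold = var.entity_ingestion_traffic_variance_threshold
}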

For more information on how to investigate and resolve alerts triggered by these detectors, please refer to the Investigate alerts triggered by detectors guide.

Detector reference

You can view a full working example of the detectors covered in this tutorial in our source-code repository here.
