As a DROID producer, it is important to be aware of potential issues with your integration, as they can impact consumers of your data. This tutorial will walk you through setting up and configuring anomaly detection for your External Ingestion integration, utilising SignalFx detectors and the recommended Config-as-Code solution, Terraform.
At the end of this tutorial, you will have the following detectors set up:

- An ingestion & transformation errors detector, which alerts on ingestion parsing errors and transformation failures attributed to your service and transformer.
- An ingestion traffic anomaly detector, which alerts on abnormal week-over-week growth or decline in the ingestion entities sent by your service.
Before diving in, please ensure you have the following:
In order to receive notifications for alerts triggered by the detectors via OpsGenie and Slack, you will need to define the following variables:
- opsgenie_credential_id - The OpsGenie credential ID for your team.
- opsgenie_team_id - The OpsGenie team ID for your team.
- opsgenie_responder_id - The OpsGenie responder ID for your team.
- slack_credential_id - The Slack credential ID for your team.
- slack_alerts_channel - The Slack channel your team uses for alerts.

Utilising these variables, you can define the notification channels as a list in your Terraform configuration file, which will be used by the detectors later on. This list can be passed in as a variable to your Terraform module, or defined directly in your Terraform detectors configuration file, like below:
```hcl
locals {
  alert_notifications = [
    "Opsgenie,${var.opsgenie_credential_id},${var.opsgenie_team_name},${var.opsgenie_responder_id},Team",
    "Team,${var.signalfx_team_id}",
    "Slack,${var.slack_credential_id},${var.slack_alerts_channel}",
  ]
}
```
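Note that the detector examples later in this tutorial reference this list as var.alert_notifications. If you define it as a local, as in the snippet above, reference it as local.alert_notifications in the detector rules instead, or expose the same list through a variable named alert_notifications.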
Now, let's define the variables specific to the detectors you will be setting up:

- signalfx_team_id - The SignalFx team ID associated with your team.
- alert_notifications - A list of notification channels to be used for alerts; see the example in the previous step.
- entity_types - A list of source entity types that you are sending to DROID, e.g. ["PIPELINE_TESTER"].
- producer_service_name - The name of your service sending data to DROID, e.g. "pipeline-tester-service".
- transformer_name - The Java class name of your transformer as-is in Transformer Service, e.g. "PipelineTesterRecordTransformer".
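The detector configurations below reference these values through Terraform locals such as local.producer_service_name, local.transformer_name, and local.entity_types_filter_values (a comma-separated list of quoted entity types interpolated into the SignalFlow filter() calls). The exact mapping depends on how your module is structured; a minimal sketch, assuming the variables above are passed straight into your configuration, might look like this:

```hcl
locals {
  # Mirror the input variables so the detector program_text can reference them as locals.
  producer_service_name = var.producer_service_name
  transformer_name      = var.transformer_name

  # Render ["PIPELINE_TESTER"] as "'PIPELINE_TESTER'" (and multiple types as "'A', 'B'")
  # so the value can be interpolated directly into a SignalFlow filter(...) call.
  entity_types_filter_values = join(", ", formatlist("'%s'", var.entity_types))
}
```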
Next, we will create the SignalFx detector resource for tracking ingestion & transformation errors in the production environment. This detector will trigger an alert whenever DROID runs into ingestion parsing errors or transformation failures attributed to your service and transformer.
The outlined example below is a starting point for the detector configuration, which will trigger an alert as soon as any error has been detected within a one-hour window. The thresholds can be adjusted to better suit your requirements.
```hcl
resource "signalfx_detector" "DROID_external_ingestion_errors_production" {
  name              = "DROID External Ingestion - Ingestion errors - ${local.producer_service_name} (Production)"
  teams             = [var.signalfx_team_id]
  time_range        = 3600
  show_data_markers = true

  program_text = <<-EOF
    # Ingestion Parsing errors
    IngestionParsingErrorsProdOver1hFilter = filter('ingestionSource', '*${local.producer_service_name}') and filter('ingestionPipeline', 'EXTERNAL_LIVE') and filter('environment_type', 'prod')
    IngestionParsingErrorsProdOver1h = data('transformerservice.record.parsing.errors.count', filter=IngestionParsingErrorsProdOver1hFilter, rollup='sum', extrapolation='zero').sum(over='1h').sum(by=['perimeter']).publish(label='IngestionParsingErrorsProdOver1h')

    # Transformation errors
    TransformerFailureProdOver1hFilter = filter('transformer', '*${local.transformer_name}*') and filter('entityType', ${local.entity_types_filter_values}) and filter('result', 'error') and filter('pipeline', 'EXTERNAL_LIVE', 'EXTERNAL_BATCH') and filter('micros_service_id', 'transformerservice') and filter('environment_type', 'prod')
    TransformerFailuresProdOver1h = data('logback.events.count', filter=TransformerFailureProdOver1hFilter, rollup='sum', extrapolation='zero').sum(over='1h').sum(by=['perimeter']).publish(label='TransformerFailuresProdOver1h')

    # Alert triggers - Prod
    detect(when(IngestionParsingErrorsProdOver1h > threshold(0))).publish('Ingestion parsing errors attributed to service ${local.producer_service_name} (Production)')
    detect(when(TransformerFailuresProdOver1h > threshold(0))).publish('Transformation failures attributed to transformer ${local.transformer_name} (Production)')
  EOF

  rule {
    detect_label       = "Ingestion parsing errors attributed to service ${local.producer_service_name} (Production)"
    description        = "The value of transformerservice.record.parsing.errors is above 0."
    runbook_url        = "https://developer.atlassian.com/platform/droid/external-ingestion-producers/investigate-alerts-triggered-by-detectors/#ingestion-error-alerts"
    severity           = "Warning"
    notifications      = var.alert_notifications
    parameterized_body = "{{#if anomalous}}\n\tRule \"{{{ruleName}}}\" in detector \"{{{detectorName}}}\" triggered at {{dateTimeFormat timestamp format=\"full\"}}.\n{{else}}\n\tRule \"{{{ruleName}}}\" in detector \"{{{detectorName}}}\" cleared at {{dateTimeFormat timestamp format=\"full\"}}.\n{{/if}}\n\n{{#if anomalous}}\nTriggering condition: {{{readableRule}}}\n{{/if}}\n\n{{#if anomalous}}Signal value for prod errors: {{inputs.IngestionParsingErrorsProdOver1h.value}}\n{{else}}Current signal value for prod errors: {{inputs.IngestionParsingErrorsProdOver1h.value}}\n{{/if}}\n\n{{#notEmpty dimensions}}\nSignal details:\n{{{dimensions}}}\n{{/notEmpty}}\n\n{{#if anomalous}}\n{{#if runbookUrl}}Runbook: {{{runbookUrl}}}{{/if}}\n{{#if tip}}Tip: {{{tip}}}{{/if}}\n{{/if}}environment_type:prod\nservice_name: transformerservice\nproducer_service_name:${local.producer_service_name}\n"
  }

  rule {
    detect_label       = "Transformation failures attributed to transformer ${local.transformer_name} (Production)"
    description        = "The value of logback.events.count result error is above 0."
    runbook_url        = "https://developer.atlassian.com/platform/droid/external-ingestion-producers/investigate-alerts-triggered-by-detectors/#transformation-error-alerts"
    severity           = "Warning"
    notifications      = var.alert_notifications
    parameterized_body = "{{#if anomalous}}\n\tRule \"{{{ruleName}}}\" in detector \"{{{detectorName}}}\" triggered at {{dateTimeFormat timestamp format=\"full\"}}.\n{{else}}\n\tRule \"{{{ruleName}}}\" in detector \"{{{detectorName}}}\" cleared at {{dateTimeFormat timestamp format=\"full\"}}.\n{{/if}}\n\n{{#if anomalous}}\nTriggering condition: {{{readableRule}}}\n{{/if}}\n\n{{#if anomalous}}Signal value for prod errors: {{inputs.TransformerFailuresProdOver1h.value}}\n{{else}}Current signal value for prod errors: {{inputs.TransformerFailuresProdOver1h.value}}\n{{/if}}\n\n{{#notEmpty dimensions}}\nSignal details:\n{{{dimensions}}}\n{{/notEmpty}}\n\n{{#if anomalous}}\n{{#if runbookUrl}}Runbook: {{{runbookUrl}}}{{/if}}\n{{#if tip}}Tip: {{{tip}}}{{/if}}\n{{/if}}environment_type:prod\nservice_name: transformerservice\nproducer_service_name:${local.producer_service_name}\n"
  }
}
```
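The zero-error threshold above can be noisy for high-volume integrations. As a hypothetical variation (not part of the reference configuration), you could raise the threshold and require the condition to persist before alerting, for example by replacing the first detect statement in program_text with:

```
# Hypothetical tuning: only alert once more than 5 parsing errors have accumulated in the
# 1-hour window, and only if the condition holds for 15 consecutive minutes.
detect(when(IngestionParsingErrorsProdOver1h > threshold(5), lasting='15m')).publish('Ingestion parsing errors attributed to service ${local.producer_service_name} (Production)')
```

If you change the published label, remember to update the matching rule's detect_label as well, since SignalFx ties each rule to its detect label.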
As the DROID team is still evaluating whether a trend-based approach is viable for this detector, we leave it to each producer's discretion whether to adopt one for their service. If you have feedback or suggestions on how to improve this detector, feel free to reach out to the DROID team on Slack in #help-tcs.
The second detector we will create tracks ingestion traffic anomalies in the production environment. This detector will trigger an alert whenever DROID detects growth or decline of more than 50% in the ingestion entities sent by your service via the LIVE pipeline (StreamHub) over a one-day period, compared to the same day one week ago. Please note that the 50% threshold is just an example and should be adjusted to suit your integration's expected traffic patterns.
Why is this important? It is crucial to be aware of any sudden changes in the number of entities ingested, as this can have cost implications for both your service and DROID.
The outlined example below is a starting point for the detector configuration. It compares the number of entities ingested over the past day with the number ingested over the same one-day window a week earlier, and triggers an alert if the variance stays above the defined threshold for a thirty-minute period. The thresholds can be adjusted to better suit your requirements.
```hcl
locals {
  # Thresholds for the percentage change in ingestion entities received
  entity_ingestion_traffic_variance_threshold = 50.0

  # Lower bound for the number of entities ingested (if the number of entities ingested was too low
  # to begin with, we don't want to trigger an alert). This can be set to the expected minimum
  # number of entities ingested in a day.
  entity_ingestion_traffic_minimum = 0
}

resource "signalfx_detector" "DROID_external_ingestion_traffic_anomaly_production" {
  name              = "DROID External Ingestion - Ingestion traffic anomaly - Production"
  teams             = [var.signalfx_team_id]
  time_range        = 3600
  show_data_markers = true

  program_text = <<-EOF
    # Entities ingested (current & previous week)
    EntitiesIngestedFilter = filter('entityType', ${local.entity_types_filter_values}) and filter('micros_group', 'WebServer') and filter('environment_type', 'prod')
    EntitiesIngestedCurrent = data('transformerservice.ingestion.delay.count', filter=EntitiesIngestedFilter, rollup='sum', extrapolation='zero').sum(over='1d').sum(by=['perimeter', 'entityType']).publish(label='EntitiesIngestedCurrent')
    EntitiesIngestedPreviousWeek = data('transformerservice.ingestion.delay.count', filter=EntitiesIngestedFilter, rollup='sum', extrapolation='zero').sum(over='1d').sum(by=['perimeter', 'entityType']).timeshift('1w').publish(label='EntitiesIngestedPreviousWeek')

    # Calculate the percentage variance
    PercentageVarianceFromPreviousWeek = ((((EntitiesIngestedCurrent - EntitiesIngestedPreviousWeek) / EntitiesIngestedPreviousWeek) * 100) if EntitiesIngestedPreviousWeek > ${local.entity_ingestion_traffic_minimum} else 0).publish(label='PercentageVarianceFromPreviousWeek')

    # Alert triggers - Prod
    detect(when(PercentageVarianceFromPreviousWeek > ${local.entity_ingestion_traffic_variance_threshold}, lasting='30m')).publish("Abnormal growth in ingestion traffic (>${local.entity_ingestion_traffic_variance_threshold}%) attributed to ${local.producer_service_name} (Production)")
    detect(when(PercentageVarianceFromPreviousWeek < -${local.entity_ingestion_traffic_variance_threshold}, lasting='30m')).publish("Abnormal decrease in ingestion traffic (<${local.entity_ingestion_traffic_variance_threshold}%) attributed to ${local.producer_service_name} (Production)")
  EOF

  rule {
    detect_label       = "Abnormal growth in ingestion traffic (>${local.entity_ingestion_traffic_variance_threshold}%) attributed to ${local.producer_service_name} (Production)"
    description        = "An abnormal growth in ingestion traffic compared to last week (>${local.entity_ingestion_traffic_variance_threshold}%) detected for the past 30 minutes."
    runbook_url        = "https://developer.atlassian.com/platform/droid/external-ingestion-producers/investigate-alerts-triggered-by-detectors/#ingestion-traffic-anomalies"
    severity           = "Warning"
    notifications      = var.alert_notifications
    parameterized_body = "{{#if anomalous}}\n\tRule \"{{{ruleName}}}\" in detector \"{{{detectorName}}}\" triggered at {{dateTimeFormat timestamp format=\"full\"}}.\n{{else}}\n\tRule \"{{{ruleName}}}\" in detector \"{{{detectorName}}}\" cleared at {{dateTimeFormat timestamp format=\"full\"}}.\n{{/if}}\n\n{{#if anomalous}}\nTriggering condition: {{{readableRule}}}\n{{/if}}\n\n{{#if anomalous}}Signal value for PercentageVarianceFromPreviousWeek: {{inputs.PercentageVarianceFromPreviousWeek.value}}\n{{else}}Current signal value for PercentageVarianceFromPreviousWeek: {{inputs.PercentageVarianceFromPreviousWeek.value}}\n{{/if}}\n\n{{#notEmpty dimensions}}\nSignal details:\n{{{dimensions}}}\n{{/notEmpty}}\n\n{{#if anomalous}}\n{{#if runbookUrl}}Runbook: {{{runbookUrl}}}{{/if}}\n{{#if tip}}Tip: {{{tip}}}{{/if}}\n{{/if}}environment_type:prod\nservice_name: transformerservice\nproducer_service_name:${local.producer_service_name}\n"
  }

  rule {
    detect_label       = "Abnormal decrease in ingestion traffic (<${local.entity_ingestion_traffic_variance_threshold}%) attributed to ${local.producer_service_name} (Production)"
    description        = "An abnormal decrease in ingestion traffic compared to last week (<${local.entity_ingestion_traffic_variance_threshold}%) detected for the past 30 minutes."
    runbook_url        = "https://developer.atlassian.com/platform/droid/external-ingestion-producers/investigate-alerts-triggered-by-detectors/#ingestion-traffic-anomalies"
    severity           = "Warning"
    notifications      = var.alert_notifications
    parameterized_body = "{{#if anomalous}}\n\tRule \"{{{ruleName}}}\" in detector \"{{{detectorName}}}\" triggered at {{dateTimeFormat timestamp format=\"full\"}}.\n{{else}}\n\tRule \"{{{ruleName}}}\" in detector \"{{{detectorName}}}\" cleared at {{dateTimeFormat timestamp format=\"full\"}}.\n{{/if}}\n\n{{#if anomalous}}\nTriggering condition: {{{readableRule}}}\n{{/if}}\n\n{{#if anomalous}}Signal value for PercentageVarianceFromPreviousWeek: {{inputs.PercentageVarianceFromPreviousWeek.value}}\n{{else}}Current signal value for PercentageVarianceFromPreviousWeek: {{inputs.PercentageVarianceFromPreviousWeek.value}}\n{{/if}}\n\n{{#notEmpty dimensions}}\nSignal details:\n{{{dimensions}}}\n{{/notEmpty}}\n\n{{#if anomalous}}\n{{#if runbookUrl}}Runbook: {{{runbookUrl}}}{{/if}}\n{{#if tip}}Tip: {{{tip}}}{{/if}}\n{{/if}}environment_type:prod\nservice_name: transformerservice\nproducer_service_name:${local.producer_service_name}\n"
  }
}
```
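To illustrate the calculation with hypothetical numbers: if your service ingested 8,000 entities over the past day but 20,000 in the same one-day window last week, the variance is ((8,000 - 20,000) / 20,000) * 100 = -60%, which breaches the -50% decrease threshold and would raise an alert once the condition persists for 30 minutes. Tuning usually comes down to adjusting the two locals defined above, for example:

```hcl
locals {
  # Hypothetical tuning values - adjust to your integration's expected traffic patterns.
  # Tolerate week-over-week swings of up to 75%.
  entity_ingestion_traffic_variance_threshold = 75.0
  # Treat the variance as 0 (no alert) when the baseline day had 1,000 or fewer entities.
  entity_ingestion_traffic_minimum = 1000
}
```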
Now that you have set up anomaly detection for your integration, you are ready to deploy the detectors to SignalFx via Sauron. Once deployed, you can further tune your thresholds utilising historical data in SignalFx's detector view.
For more information on how to investigate and resolve alerts triggered by these detectors, please refer to the Investigate alerts triggered by detectors guide.
You can view a full working example of the detectors covered in this tutorial in our source-code repository here.