As a DROID producer, it is important to be aware of potential issues with your integration; most importantly, such issues can impact consumers of your data. This tutorial will walk you through setting up and configuring anomaly detection for your Read-through integration utilising SignalFx detectors and the recommended Config-as-Code solution, Terraform.
At the end of this tutorial, you will have the following detectors set up:

- Backing store API anomalies (fetch errors and high latency)
- Invalidation event processing errors
- Invalidation event traffic anomalies
Before diving in, please ensure you have the following:
If you choose to opt out of cache invalidations, you can skip creating the detectors related to cache invalidations (steps 4 & 5).
In order to receive notifications for alerts triggered by the detectors via OpsGenie and Slack, you will need to define the following variables:
- opsgenie_credential_id - The OpsGenie credential ID for your team.
- opsgenie_team_id - The OpsGenie team ID for your team.
- opsgenie_responder_id - The OpsGenie responder ID for your team.
- slack_credential_id - The Slack credential ID for your team.
- slack_alerts_channel - The Slack channel for your team's alerts.

Utilising these variables, you can define the notification channels as a list in your Terraform configuration file, which will be used in the detectors later on. This can be passed in as a variable to your Terraform module, or defined directly in your Terraform detectors configuration file like below:
```hcl
locals {
  alert_notifications = [
    "Opsgenie,${var.opsgenie_credential_id},${var.opsgenie_team_name},${var.opsgenie_responder_id},Team",
    "Team,${var.signalfx_team_id}",
    "Slack,${var.slack_credential_id},${var.slack_alerts_channel}",
  ]
}
```
Now, let's define the variables that are specific to the detectors you will be setting up (a sketch of how these could be declared follows the list below):
- signalfx_team_id - The SignalFx team ID associated with your team.
- alert_notifications - A list of notification channels to be used for alerts; see the example in the previous step.
- output_record_types - A list of output record types for your integration, which will be served by TCS, e.g. ["APP_DEFINITIONS_IN_CONTEXT", "APP_DEFINITION"].
- producer_service_name - The name of your service sending data to DROID, e.g. "pipeline-tester-service".
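The detector examples later in this tutorial reference these values as var.signalfx_team_id, var.alert_notifications, local.output_record_types and local.producer_service_name. The following is a minimal sketch of how they could be declared; the record types and service name shown are just the placeholder examples from the list above, and you should adapt the structure (variables vs. locals) to your own module layout:

```hcl
# Sketch only - adjust names, types and values to match your module.
variable "signalfx_team_id" {
  type        = string
  description = "The SignalFx team ID associated with your team."
}

variable "alert_notifications" {
  type        = list(string)
  description = "Notification channels for alerts, e.g. the list built in the previous step."
}

locals {
  # Placeholder example values - replace with your integration's record types and service name.
  output_record_types   = ["APP_DEFINITIONS_IN_CONTEXT", "APP_DEFINITION"]
  producer_service_name = "pipeline-tester-service"
}
```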
Additionally, you will need to define a variable that can be used as part of the SignalFx detector programs (SignalFlow) to filter the metrics down to your integration's output record types.
```hcl
locals {
  # Output record types - SignalFlow-friendly filter values
  output_record_types_filter_values = join(", ", [for record_type in local.output_record_types: "'${record_type}'"])
}
```
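For example, with output_record_types set to ["APP_DEFINITIONS_IN_CONTEXT", "APP_DEFINITION"], output_record_types_filter_values renders as 'APP_DEFINITIONS_IN_CONTEXT', 'APP_DEFINITION', so an interpolated SignalFlow expression such as filter('recordType', ${local.output_record_types_filter_values}) resolves to filter('recordType', 'APP_DEFINITIONS_IN_CONTEXT', 'APP_DEFINITION').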
The next detector we will create is for tracking backing store API anomalies in the production environment. This detector will be triggered whenever DROID fails to fetch data from your backing store API, or when DROID experiences high latency while fetching data from your backing store API.
The outlined example below is a starting point for the detector configuration. Alerts will be triggered if:

- any backing store fetch errors are recorded for your output record types within a one-hour window, or
- the mean backing store fetch latency stays above 5000 ms for at least 67% of a five-minute window.
The thresholds can be adjusted to better suit your requirements.
1 2resource "signalfx_detector" "DROID_read_through_backing_store_api_anomalies_production" { name = "DROID Read Through - Backing Store API anomalies - ${local.producer_service_name} - Production" teams = [var.signalfx_tlt_services_team_id] time_range = 3600 show_data_markers = true program_text = <<EOF # Backing store fetch errors BackingStoreApiFetchErrorsOver1hFilter = filter('recordType', ${local.output_record_types_filter_values}) and filter('environment_type', 'prod') and filter('micros_service_id', 'tcs-web-*', 'tenant-context-service') BackingStoreApiFetchErrorsOver1h = data('*.readThrough.backing-store-fetch-error.count', filter=BackingStoreApiFetchErrorsOver1hFilter, rollup='sum', extrapolation='zero').sum(over='1h').sum(by=['perimeter', 'recordType']).publish('BackingStoreApiFetchErrorsOver1h') # Backing store high latency BackingStoreApiLatencyMeanOver1mFilter = filter('recordType', ${local.output_record_types_filter_values}) and filter('environment_type', 'prod') and filter('micros_service_id', 'tcs-web-*', 'tenant-context-service') BackingStoreApiLatencyMeanOver1m = data('*.readThrough.latency.mean', filter=BackingStoreApiLatencyMeanOver1mFilter).mean(over='1m').mean(by=['perimeter', 'recordType']).publish('BackingStoreApiLatencyMeanOver1m') # Alert triggers - Prod detect(when(BackingStoreApiFetchErrorsOver1h > threshold(0))).publish('Backing store API fetch errors attributed to service ${local.producer_service_name} (Production)') detect(when(BackingStoreApiLatencyMeanOver1m > threshold(5000), lasting='5m', at_least=0.67)).publish('Backing store API latency is too high attributed to service ${local.producer_service_name} (Production)') EOF rule { detect_label = "Backing store API fetch errors attributed to service ${local.producer_service_name} (Production)" description = "The value of readThrough.backing-store-fetch-error.count is above 0." runbook_url = "https://developer.atlassian.com/platform/droid/read-through-producers/investigate-alerts-triggered-by-detectors/#backing-store-api-fetch-errors" severity = "Warning" notifications = var.alert_notifications parameterized_body = "{{#if anomalous}}\n\tRule \"{{{ruleName}}}\" in detector \"{{{detectorName}}}\" triggered at {{dateTimeFormat timestamp format=\"full\"}}.\n{{else}}\n\tRule \"{{{ruleName}}}\" in detector \"{{{detectorName}}}\" cleared at {{dateTimeFormat timestamp format=\"full\"}}.\n{{/if}}\n\n{{#if anomalous}}\nTriggering condition: {{{readableRule}}}\n{{/if}}\n\n{{#if anomalous}}Signal value for prod errors: {{inputs.ReadThroughInvalidationProcessingErrorsOver1h.value}}\n{{else}}Current signal value for prod errors: {{inputs.ReadThroughInvalidationProcessingErrorsOver1h.value}}\n{{/if}}\n\n{{#notEmpty dimensions}}\nSignal details:\n{{{dimensions}}}\n{{/notEmpty}}\n\n{{#if anomalous}}\n{{#if runbookUrl}}Runbook: {{{runbookUrl}}}{{/if}}\n{{#if tip}}Tip: {{{tip}}}{{/if}}\n{{/if}}environment_type:prod\nproducer_service_name:${local.producer_service_name}\n" } rule { detect_label = "Backing store API latency is too high attributed to service ${local.producer_service_name} (Production)" description = "The mean value of readThrough.latency is above 5000ms for the last minute." 
runbook_url = "https://developer.atlassian.com/platform/droid/read-through-producers/investigate-alerts-triggered-by-detectors/#backing-store-api-latency-too-high" severity = "Warning" notifications = var.alert_notifications parameterized_body = "{{#if anomalous}}\n\tRule \"{{{ruleName}}}\" in detector \"{{{detectorName}}}\" triggered at {{dateTimeFormat timestamp format=\"full\"}}.\n{{else}}\n\tRule \"{{{ruleName}}}\" in detector \"{{{detectorName}}}\" cleared at {{dateTimeFormat timestamp format=\"full\"}}.\n{{/if}}\n\n{{#if anomalous}}\nTriggering condition: {{{readableRule}}}\n{{/if}}\n\n{{#if anomalous}}Signal value for prod errors: {{inputs.BackingStoreApiLatencyMeanOver1m.value}}\n{{else}}Current signal value for mean latency: {{inputs.BackingStoreApiLatencyMeanOver1m.value}}\n{{/if}}\n\n{{#notEmpty dimensions}}\nSignal details:\n{{{dimensions}}}\n{{/notEmpty}}\n\n{{#if anomalous}}\n{{#if runbookUrl}}Runbook: {{{runbookUrl}}}{{/if}}\n{{#if tip}}Tip: {{{tip}}}{{/if}}\n{{/if}}environment_type:prod\nproducer_service_name:${local.producer_service_name}\n" } }
Next, we will create the SignalFx detector resource for tracking invalidation event processing errors. This detector will trigger an alert whenever DROID encounters invalidation event processing errors attributed to your service and transformer.
The outlined example below is a starting point for the detector configuration, which will trigger an alert as soon as any error has been detected within a one-hour window. The thresholds can be adjusted to better suit your requirements.
1 2resource "signalfx_detector" "DROID_read_through_invalidation_errors_production" { name = "DROID Read Through - Invalidation Errors - ${local.producer_service_name} - Production" teams = [var.signalfx_tlt_services_team_id] time_range = 3600 show_data_markers = true program_text = <<EOF # Read-through invalidation processing errors ReadThroughInvalidationProcessingErrorsFilter = filter('ingestionSource', '*${local.producer_service_name}') and filter('environment_type', 'prod') ReadThroughInvalidationProcessingErrorsOver1h = data('tenant-context-service.read-through-cache.invalidation-processing-error.count', filter=ReadThroughInvalidationProcessingErrorsFilter, rollup='sum', extrapolation='zero').sum(over='1h').sum(by=['perimeter']).publish('ReadThroughInvalidationProcessingErrorsOver1h') # Alert triggers - Prod detect(when(ReadThroughInvalidationProcessingErrorsOver1h > threshold(0))).publish('Invalidation processing errors attributed to service ${local.producer_service_name} (Production)') EOF rule { detect_label = "Invalidation processing errors attributed to service ${local.producer_service_name} (Production)" description = "The value of tenant-context-service.read-through-cache.invalidation-processing-error is above 0." runbook_url = "https://developer.atlassian.com/platform/droid/read-through-producers/investigate-alerts-triggered-by-detectors/#invalidation-event-processing-errors" severity = "Warning" notifications = var.alert_notifications parameterized_body = "{{#if anomalous}}\n\tRule \"{{{ruleName}}}\" in detector \"{{{detectorName}}}\" triggered at {{dateTimeFormat timestamp format=\"full\"}}.\n{{else}}\n\tRule \"{{{ruleName}}}\" in detector \"{{{detectorName}}}\" cleared at {{dateTimeFormat timestamp format=\"full\"}}.\n{{/if}}\n\n{{#if anomalous}}\nTriggering condition: {{{readableRule}}}\n{{/if}}\n\n{{#if anomalous}}Signal value for prod errors: {{inputs.ReadThroughInvalidationProcessingErrorsOver1h.value}}\n{{else}}Current signal value for prod errors: {{inputs.ReadThroughInvalidationProcessingErrorsOver1h.value}}\n{{/if}}\n\n{{#notEmpty dimensions}}\nSignal details:\n{{{dimensions}}}\n{{/notEmpty}}\n\n{{#if anomalous}}\n{{#if runbookUrl}}Runbook: {{{runbookUrl}}}{{/if}}\n{{#if tip}}Tip: {{{tip}}}{{/if}}\n{{/if}}environment_type:prod\nservice_name: tenant-context-service\nproducer_service_name:${local.producer_service_name}\n" } }
As the DROID team is still evaluating whether a trend-based approach is viable for the detector described below, we will leave it to your discretion whether to integrate it for your service. If you have any feedback or suggestions on how to improve this detector, feel free to reach out to the DROID team on Slack in #help-tcs.
Next, we will create a detector for tracking invalidation event traffic anomalies in the production environment. This detector will trigger an alert whenever DROID detects a growth or decline of greater than 50% in invalidation events sent by your service via the LIVE pipeline (StreamHub), over a one-day period, compared to the same day one week ago. Please note that the 50% threshold is just an example and should be adjusted to suit your integration's expected traffic patterns.
Why is this important? It is crucial to be aware of any sudden changes in the number of entities ingested, as this can have cost implications for both your service and DROID.
The outlined example below is a starting point for the detector configuration, which will trigger an alert if the variance in ingestion traffic is above the defined threshold. More specifically, it will compare the number of invalidation events received over the past day with the number of invalidation events received over the same day one week ago, and alert if the variance is above the defined threshold over a thirty-minute period. The thresholds can be adjusted to better suit your requirements.
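As a worked example, suppose your service sent 1,000,000 invalidation events over the comparison day last week and 1,600,000 over the past day. The variance is ((1,600,000 - 1,000,000) / 1,000,000) * 100 = +60%, which exceeds a 50% threshold, so the growth alert would fire once the condition has held for 30 minutes. A drop to 400,000 events would give -60% and trigger the decrease alert instead.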
```hcl
locals {
  # Thresholds for the percentage change in invalidation events received
  invalidation_events_variance_threshold = 50.0

  # Lower bound for the number of invalidation events received (if the number of entities ingested
  # was too low to begin with, we don't want to trigger an alert).
  # This can be set to the expected minimum number of invalidation events received in a day.
  invalidation_event_traffic_minimum = 0
}

resource "signalfx_detector" "DROID_read_through_invalidation_event_rate_anomaly_production" {
  name              = "[SAMPLE] DROID Read Through - Invalidation Event Rate Anomaly - ${local.producer_service_name} - Production"
  teams             = [var.signalfx_team_id]
  time_range        = 3600
  show_data_markers = true

  program_text = <<EOF
# Invalidations received (current & previous week)
ReadThroughInvalidationEventsReceivedFilter = filter('recordType', ${local.output_record_types_filter_values}) and filter('environment_type', 'prod')
ReadThroughInvalidationEventsReceivedCurrent = data('tenant-context-service.read-through-cache.invalidation-received.count', filter=ReadThroughInvalidationEventsReceivedFilter, rollup='sum', extrapolation='zero').sum(over='1d').sum(by=['perimeter', 'recordType']).publish('ReadThroughInvalidationEventsReceivedCurrent')
ReadThroughInvalidationEventsReceivedPreviousWeek = data('tenant-context-service.read-through-cache.invalidation-received.count', filter=ReadThroughInvalidationEventsReceivedFilter, rollup='sum', extrapolation='zero').sum(over='1d').sum(by=['perimeter', 'recordType']).timeshift('1w').publish('ReadThroughInvalidationEventsReceivedPreviousWeek')

# Calculate the percentage variance
PercentageVarianceFromPreviousWeek = ((((ReadThroughInvalidationEventsReceivedCurrent - ReadThroughInvalidationEventsReceivedPreviousWeek) / ReadThroughInvalidationEventsReceivedPreviousWeek) * 100) if ReadThroughInvalidationEventsReceivedPreviousWeek > ${local.invalidation_event_traffic_minimum} else 0).publish(label='PercentageVarianceFromPreviousWeek')

# Alert triggers - Prod
detect(when(PercentageVarianceFromPreviousWeek > ${local.invalidation_events_variance_threshold}, lasting='30m')).publish("Abnormal growth in invalidation events received (>${local.invalidation_events_variance_threshold}%) attributed to ${local.producer_service_name} (Production)")
detect(when(PercentageVarianceFromPreviousWeek < -${local.invalidation_events_variance_threshold}, lasting='30m')).publish("Abnormal decrease in invalidation events received (<${local.invalidation_events_variance_threshold}%) attributed to ${local.producer_service_name} (Production)")
EOF

  rule {
    detect_label       = "Abnormal growth in invalidation events received (>${local.invalidation_events_variance_threshold}%) attributed to ${local.producer_service_name} (Production)"
    description        = "An abnormal growth in invalidation events received compared to last week (>${local.invalidation_events_variance_threshold}%) detected for the past 30 minutes."
    runbook_url        = "https://developer.atlassian.com/platform/droid/read-through-producers/investigate-alerts-triggered-by-detectors/#invalidation-event-traffic-anomalies"
    severity           = "Warning"
    notifications      = var.alert_notifications
    parameterized_body = "{{#if anomalous}}\n\tRule \"{{{ruleName}}}\" in detector \"{{{detectorName}}}\" triggered at {{dateTimeFormat timestamp format=\"full\"}}.\n{{else}}\n\tRule \"{{{ruleName}}}\" in detector \"{{{detectorName}}}\" cleared at {{dateTimeFormat timestamp format=\"full\"}}.\n{{/if}}\n\n{{#if anomalous}}\nTriggering condition: {{{readableRule}}}\n{{/if}}\n\n{{#if anomalous}}Signal value for PercentageVarianceFromPreviousWeek: {{inputs.PercentageVarianceFromPreviousWeek.value}}\n{{else}}Current signal value for PercentageVarianceFromPreviousWeek: {{inputs.PercentageVarianceFromPreviousWeek.value}}\n{{/if}}\n\n{{#notEmpty dimensions}}\nSignal details:\n{{{dimensions}}}\n{{/notEmpty}}\n\n{{#if anomalous}}\n{{#if runbookUrl}}Runbook: {{{runbookUrl}}}{{/if}}\n{{#if tip}}Tip: {{{tip}}}{{/if}}\n{{/if}}environment_type:prod\nservice_name: tenant-context-service\nproducer_service_name:${local.producer_service_name}\n"
  }

  rule {
    detect_label       = "Abnormal decrease in invalidation events received (<${local.invalidation_events_variance_threshold}%) attributed to ${local.producer_service_name} (Production)"
    description        = "An abnormal decrease in invalidation events received compared to last week (<${local.invalidation_events_variance_threshold}%) detected for the past 30 minutes."
    runbook_url        = "https://developer.atlassian.com/platform/droid/read-through-producers/investigate-alerts-triggered-by-detectors/#invalidation-event-traffic-anomalies"
    severity           = "Warning"
    notifications      = var.alert_notifications
    parameterized_body = "{{#if anomalous}}\n\tRule \"{{{ruleName}}}\" in detector \"{{{detectorName}}}\" triggered at {{dateTimeFormat timestamp format=\"full\"}}.\n{{else}}\n\tRule \"{{{ruleName}}}\" in detector \"{{{detectorName}}}\" cleared at {{dateTimeFormat timestamp format=\"full\"}}.\n{{/if}}\n\n{{#if anomalous}}\nTriggering condition: {{{readableRule}}}\n{{/if}}\n\n{{#if anomalous}}Signal value for PercentageVarianceFromPreviousWeek: {{inputs.PercentageVarianceFromPreviousWeek.value}}\n{{else}}Current signal value for PercentageVarianceFromPreviousWeek: {{inputs.PercentageVarianceFromPreviousWeek.value}}\n{{/if}}\n\n{{#notEmpty dimensions}}\nSignal details:\n{{{dimensions}}}\n{{/notEmpty}}\n\n{{#if anomalous}}\n{{#if runbookUrl}}Runbook: {{{runbookUrl}}}{{/if}}\n{{#if tip}}Tip: {{{tip}}}{{/if}}\n{{/if}}environment_type:prod\nservice_name: tenant-context-service\nproducer_service_name:${local.producer_service_name}\n"
  }
}
```
Now that you have set up anomaly detection for your integration, you are ready to deploy the detectors to SignalFx via Sauron. Once deployed, you can further tweak your thresholds utilising historical data in SignalFx's detector view.
For more information on how to investigate and resolve alerts triggered by these detectors, please refer to the Investigate alerts triggered by detectors guide.
You can view a full working example of the detectors covered in this tutorial in our source-code repository here.