Last updated Jul 24, 2025
Internal use only

Debugging Asset Readiness Run Failures

ar-evaluator runs a daily cronjob at 12:10:05 AM UTC to collect evidence about services and perform a compliance scan. This document outlines the key steps to take if you notice that the cronjob or a run has failed, including some important nuances and helpful tips.

Vital Resources

SignalFx Dashboards

Service Overview & Control Owner SignalFx Dashboards

The Service Overview and Control Owner SignalFx Dashboards display key metrics to help you detect run failures by step and control key. Use these dashboards to determine at which stage a run failed:

  1. Evidence collection via Socrates query
  2. Evaluation publishing
  3. Sending events to the Interventions service via StreamHub

These charts include control-key information, making it easier to identify whether failures affect all controls or only specific ones. Reviewing these metrics around the cronjob execution time can reveal patterns or underlying issues.

Service Health SignalFx Dashboard

The Service Health SignalFx Dashboard provides charts that offer insight into the health of the ar-evaluator service, which is useful when debugging run failures. Pay special attention to the SQS metrics, which can indicate high latency, heavy traffic, or messages in the DLQ at various run stages.

Splunk

Service Logs

Service logs are valuable for investigating run failures, especially when only a subset of controls is affected. Some key tips:

  • Timing: Look at the logs around the time of the cronjob
  • Errors: Filter by level to view errors (level: 50) and warnings (level: 40)
  • Messages: Filter on the message or msg field (for example, Failed to process message or Sending event to StreamHub)
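As a sketch only, the tips above can be combined into a single Splunk search. The index name (micros_ar-evaluator) and field names here are assumptions; adjust them to match your environment:

```
index=micros_ar-evaluator level>=40
| stats count by level, msg
| sort - count
```

Grouping by level and msg gives a quick picture of which errors and warnings dominated around the run window; add a time range around the cronjob execution to narrow the results.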

Manual Evidence Collection

If evidence collection has failed, or you would like to manually trigger evidence collection again, use the /api/v1/evidence/collect/{key} endpoint:

Local Testing

curl -X POST "http://localhost:8080/api/v1/evidence/collect/{key}?limit=10" \
  -H "Content-Type: application/json"

Testing Per Environment

atlas slauth curl -a ar-evaluator -- --request POST 'https://{your-staging-or-dev-url}/api/v1/evidence/collect/{key}?limit=10' \
--header 'X-Slauth-Egress: true' \
--data ''

Use atlas micros service show --service ar-evaluator to get the service URL for each environment. Once triggered, check the results in the engex_asset_readiness.evaluation_event_v2 table. If the limit parameter is not provided, it defaults to 10; set limit to -1 to collect all evidence.
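For repeated manual triggers, the request can be wrapped in a small shell helper. This is an illustrative sketch: the function name and defaults are assumptions, and the generated command mirrors the curl examples above (limit defaults to 10, and -1 collects all evidence):

```shell
#!/usr/bin/env sh
# Hypothetical helper: assemble the evidence-collection request for a given
# control key. base_url comes from `atlas micros service show`; limit defaults
# to 10, matching the service default, and -1 collects all evidence.
build_collect_cmd() {
  base_url="$1"
  key="$2"
  limit="${3:-10}"
  printf 'curl -X POST "%s/api/v1/evidence/collect/%s?limit=%s" -H "Content-Type: application/json"\n' \
    "$base_url" "$key" "$limit"
}

# Example: print the command that would collect all evidence for a control.
build_collect_cmd "http://localhost:8080" "MY-CONTROL-KEY" -1
# prints: curl -X POST "http://localhost:8080/api/v1/evidence/collect/MY-CONTROL-KEY?limit=-1" -H "Content-Type: application/json"
```

Reviewing the printed command before running it is a simple guard against triggering collection with the wrong key or limit.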

Common Gotchas

Federal Boundary

If the control runs in-boundary (its control key typically ends in -FR), access Splunk from your VDI. A key tip is to add the msg filter to surface key information in boundary.

The controls use different tables depending on whether they access federal or commercial data. Reference constants.ts for the EPM tables used for each environment and use case.

Need More Help?

Use the listed Slack channel if you think the issue may be related to:
