Last updated Jul 24, 2025
Internal use only

Getting Started

This document covers how to discover, diagnose, and resolve common issues with the egress of federal assessment data from the ar-evaluator service.

Data Egress Architecture

To understand common issues with the egress of assessment data, it's important to understand the architecture of ar-evaluator and where egress occurs.

When the daily cron job runs to produce evaluations data, the following occurs in the commercial environment:

  1. Per assessment evaluated in commercial ar-evaluator (i.e. per unique asset and per assessment run), the service produces a streamhub event.
  2. Each streamhub event is consumed by a Socrates stream which dumps data into commercial vNext (engex-asset-readiness) and commercial Classic (zone_ar) tables.

In federal, a similar process happens:

  1. Per assessment evaluated in federal ar-evaluator (i.e. per unique asset and per assessment run), the service produces a streamhub event.
  2. Each streamhub event is consumed by a Socrates stream which dumps data into federal vNext (engex-asset-readiness) tables.
  3. Each streamhub event also undergoes obfuscation per the defined transit policies and is sent across the boundary to the commercial environment through streamhub-relay.
  4. Cross-boundary events are consumed by a commercial Socrates stream which dumps data into the commercial vNext (engex-asset-readiness) and commercial Classic (zone_ar) tables.

Issues most commonly occur in steps 3 and 4 of the federal assessment evaluation process.
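
For orientation, a minimal sketch of what one of these per-assessment events might carry is shown below. Only asset_id, control_key, and asset_class are named elsewhere in this document; every other field name, and the overall shape, is an illustrative assumption rather than the real streamhub schema.

    # Illustrative only: the real event schema belongs to ar-evaluator and its
    # transit policies. Fields other than asset_id, asset_class, and control_key
    # are assumptions.
    evaluation_event = {
        "asset_id": "asset-1234",         # non-nullable in the downstream tables
        "asset_class": "SomeAssetClass",  # must be accepted by the transit policy
        "control_key": "AC-2-FR",         # federal controls typically end in -FR
        "evaluation_result": "PASS",      # hypothetical result field
        "evaluated_at": "2025-07-24T00:00:00Z",  # hypothetical timestamp field
    }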

Common Issues

Malformed data sent to commercial

Symptoms

  1. Federal assessment data (i.e. typically rows where the control_key ends in -FR) is not present in the commercial evaluation_v1 or evaluation_v2 tables, and/or commercial assessment evaluation data has suddenly stopped appearing in the commercial tables (a quick diagnostic sketch for this appears after this list).
  2. The federal evaluations tables contain the missing assessment results for the same -FR controls on that same day.
  3. The Socrates Job that corresponds to the table has the following error: com.databricks.sql.transaction.tahoe.schema.DeltaInvariantViolationException: [DELTA_NOT_NULL_CONSTRAINT_VIOLATED] NOT NULL constraint violated for column: <COL-NAME>.
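
To check symptoms 1 and 2 quickly, counts like the ones below can be run from a commercial Databricks notebook (where spark is predefined). The schema path and the evaluated_at column are placeholders; substitute the real engex-asset-readiness table locations and date column.

    # Hedged diagnostic sketch. SCHEMA and evaluated_at are placeholders, not
    # the real table path or column name.
    SCHEMA = "engex_asset_readiness"  # placeholder

    for table in ("evaluation_v1", "evaluation_v2"):
        print(table)
        spark.sql(f"""
            SELECT DATE(evaluated_at) AS day, COUNT(*) AS fr_rows
            FROM {SCHEMA}.{table}
            WHERE control_key LIKE '%-FR'
              AND evaluated_at >= current_date() - INTERVAL 3 DAYS
            GROUP BY DATE(evaluated_at)
            ORDER BY day
        """).show()

A day with zero fr_rows in commercial, while the federal tables show results for the same -FR controls, points at the cross-boundary path (steps 3 and 4 above).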

Cause

This error typically occurs when federal assessment data sent to commercial has null values for a non-nullable column on the table, such as asset_id. Most of the time this happens because the event does not properly match the transit policy for its event type. When an event that should be sent cross boundary does not match its transit policy, fields can be dropped as the event crosses the boundary, leaving federal data correctly present in federal Socrates but missing (and breaking flows) in commercial Socrates.
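
As a toy model of the failure mode (not the actual streamhub-relay implementation), a transit policy that allowlists fields per asset_class behaves roughly like this: an event whose asset_class the policy does not cover is not rejected, it just loses its fields, which later surfaces as a NOT NULL violation in commercial.

    # Toy model only; the real transit-policy semantics live in streamhub-relay.
    # POLICY contents and the per-asset_class structure are assumptions.
    POLICY = {  # hypothetical: allowed fields per asset_class
        "ClassA": {"asset_id", "asset_class", "control_key", "evaluation_result"},
    }

    def relay_cross_boundary(event: dict) -> dict:
        allowed = POLICY.get(event.get("asset_class"), set())
        # Fields not covered by the policy are silently dropped, not rejected.
        return {k: v for k, v in event.items() if k in allowed}

    stripped = relay_cross_boundary(
        {"asset_id": "a-1", "asset_class": "ClassB", "control_key": "AC-2-FR"}
    )
    assert "asset_id" not in stripped  # lands in commercial with asset_id null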

Resolution

If malformed data makes it to the commercial tables, resolution can be an arduous process, mainly because the Socrates Job will refuse to process well-formed data after encountering malformed data, even once the transit policy is fixed. Below are the steps required to first unblock Socrates and then remove the malformed data.

  1. Update the transit policies (for the evaluation_v1 and evaluation_v2 events) to ensure that the "malformed" events are correctly formed and will not drop required fields. Typically, this involves making sure that the asset_class values in the events you are sending are accepted by the transit policy.
  2. Update table definitions to allow null data to flow through for the problematic columns (see this sample PR).
    • Note that this may need to be done for both evaluation_v1 and evaluation_v2
  3. Monitor the Socrates Job. You must wait for it to process through all the malformed data before continuing.
  4. Update the sea query to find and delete all malformed data in the table (a hedged sketch of such a query appears after this list). After PR-ing the change, run the fluid/schema-evolution pipeline to run the query against the production tables.
    • Note that this may need to be done for both evaluation_v1 and evaluation_v2
  5. Run queries against the evaluations tables in databricks to confirm the malformed data is no longer present.
  6. Revert changes made in step 2; the non-nullable columns should be set back to nullable: false (see this sample PR).
  7. Monitor the tables on the next run of the assessments workflow to verify the issue is resolved.
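
For steps 4 and 5, the cleanup is roughly shaped like the sketch below; in practice step 4 runs through the sea query and the fluid/schema-evolution pipeline rather than interactively. SCHEMA is a placeholder, and the sketch assumes the malformed rows are identifiable by a null non-nullable column such as asset_id.

    # Hedged sketch of the step 4 cleanup and step 5 verification, for a
    # commercial Databricks notebook. SCHEMA is a placeholder.
    SCHEMA = "engex_asset_readiness"  # placeholder

    for table in ("evaluation_v1", "evaluation_v2"):
        # Step 4: remove the malformed rows that flowed in after step 2.
        spark.sql(f"DELETE FROM {SCHEMA}.{table} WHERE asset_id IS NULL")

        # Step 5: confirm no malformed rows remain.
        remaining = spark.sql(
            f"SELECT COUNT(*) AS n FROM {SCHEMA}.{table} WHERE asset_id IS NULL"
        ).first()["n"]
        assert remaining == 0, f"{table} still has {remaining} malformed rows"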

Useful References for Diagnostics

Monitoring a Socrates Job

  1. Ensure you are part of the developers access level on the Socrates Data Product Engex Asset Readiness SSAM container.
  2. Navigate to databricks. Select the production workspace.
  3. On the left bar, select Jobs & Pipelines.
  4. In the search bar, search for engex_asset_readiness. Click on the job you want to examine; its name typically ends in source_evaluation_v1 or source_evaluation_v2.
  5. Click on the run for the date you want to inspect. This will typically be the run at the top of the list.
  6. Examine the output for any obvious errors. You can also select the View Details option under Compute on the right-hand bar of the screen to see more detailed logs from the job. (A programmatic alternative to the UI appears after this list.)
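
If you would rather check run status programmatically than through the UI, the standard Databricks Jobs API (2.1) can list recent runs. The workspace host and job_id below are placeholders, and DATABRICKS_TOKEN is assumed to hold a personal access token; this is a generic Jobs API call, not an engex-specific tool.

    # Generic Databricks Jobs API 2.1 sketch; host and job_id are placeholders.
    import os
    import requests

    HOST = "https://<workspace-host>"  # placeholder
    JOB_ID = 123456789                 # placeholder: the engex job's ID

    resp = requests.get(
        f"{HOST}/api/2.1/jobs/runs/list",
        headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
        params={"job_id": JOB_ID, "limit": 5},
        timeout=30,
    )
    resp.raise_for_status()
    for run in resp.json().get("runs", []):
        state = run.get("state", {})
        print(run["start_time"], state.get("life_cycle_state"),
              state.get("result_state"))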

Examining Table Definitions

vNext tables for ar-evaluator are defined here.

Socrates Stream registrations are defined here.
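
When toggling the NOT NULL constraints in steps 2 and 6 of the resolution, it can help to confirm what the deployed table actually enforces. From a Databricks notebook, the Spark schema exposes per-column nullability directly (SCHEMA is the same placeholder as in the earlier sketches):

    # Inspect per-column nullability on the deployed table. SCHEMA is a
    # placeholder for the real engex-asset-readiness path.
    SCHEMA = "engex_asset_readiness"  # placeholder

    for field in spark.table(f"{SCHEMA}.evaluation_v1").schema:
        print(field.name, field.dataType.simpleString(),
              "nullable:", field.nullable)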
