The reliability requirements for Cloud Fortified apps are built around these principles:
Service Level Indicator (SLI) is the measurement you use to track your app's capabilities (such as uptime or response time).
Service Level Objective (SLO) is the target declared about a specific SLI (for example, 99.95% uptime).
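As a rough illustration of how the two terms relate, the sketch below computes a success-rate SLI from event counts and compares it with an SLO target. The function names and sample counts are hypothetical and are not part of any Atlassian API.

```python
def success_rate_percent(successes: int, total: int) -> float:
    """Return the SLI as the percentage of events that succeeded."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * successes / total


def meets_slo(sli_percent: float, slo_percent: float) -> bool:
    """An SLI meets its SLO when it is at or above the declared target."""
    return sli_percent >= slo_percent


# Hypothetical sample: 9,990 successful checks out of 10,000.
sli = success_rate_percent(9_990, 10_000)
print(f"SLI = {sli:.2f}%, meets 99.95% SLO: {meets_slo(sli, 99.95)}")
```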
The reliability requirements enable Cloud Fortified apps to:
When Marketplace Partners begin the onboarding process, we measure the following for your application:
Atlassian will detect when an SLI is in breach of its SLO and notify you.
When you sign up for the Cloud Fortified apps program, you provide Atlassian with an email address that will receive notifications on breach events as per the alert conditions described below. If your team already uses an alerting or on-call management platform such as Opsgenie, you can use these emails to trigger alerts in your system.
SLI | Current Target SLO (over 28 days) | Alert condition |
---|---|---|
App availability | 99.9% | Healthcheck success rate is less than 99.9% for the past 15m or |
Iframe load success rate | 95% | Iframe load success rate is less than 95% for the past 15m |
Installation callback success rate | 99% | Installation callback success rate is less than 99% for the past 24h¹ |
Webhook delivery success rate | 99% | Webhook delivery success rate is less than 99% for the past 15m |
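To make the targets more concrete, here is a small worked sketch that turns each SLO into an allowed failure fraction and, for availability, into an approximate downtime budget over the 28-day window. Treating availability as time-based is a simplifying assumption (the SLI is actually measured as a health check success rate), and the dictionary and function names are hypothetical.

```python
WINDOW_DAYS = 28

# SLO targets (percent) taken from the table above.
SLO_TARGETS = {
    "App availability": 99.9,
    "Iframe load success rate": 95.0,
    "Installation callback success rate": 99.0,
    "Webhook delivery success rate": 99.0,
}


def allowed_failure_fraction(slo_percent: float) -> float:
    """Fraction of events that may fail while still meeting the SLO."""
    return (100.0 - slo_percent) / 100.0


def downtime_budget_minutes(slo_percent: float, window_days: int = WINDOW_DAYS) -> float:
    """Approximate downtime budget, assuming failures map evenly onto time."""
    return window_days * 24 * 60 * allowed_failure_fraction(slo_percent)


def alert_fires(window_success_percent: float, threshold_percent: float) -> bool:
    """An alert condition holds when the windowed rate is below the threshold."""
    return window_success_percent < threshold_percent


for name, target in SLO_TARGETS.items():
    print(f"{name}: up to {allowed_failure_fraction(target):.2%} of events may fail")

budget = downtime_budget_minutes(SLO_TARGETS["App availability"])
print(f"Availability budget: about {budget:.2f} minutes per {WINDOW_DAYS} days")
```

Under these assumptions, a 99.9% availability target leaves roughly 40 minutes of downtime across 28 days.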
The notification email is sent from no-reply@am.atlassian.com. You can filter messages by this sender.
The email HTML body contains a script tag of type application/json and ID incident-data.
This HTML element contains a JSON object that holds details about the alert. Here is an example:
    <script type="application/json" id="incident-data">
    {
      "appKey": "com.example.appkey",
      "appDetails": {
        "name": "AppName",
        "partnerName": "Atlassian"
      },
      "incidentId": "FOotlQrY5ZQ",
      "incidentStatus": "ANOMALOUS",
      "time": "2022-03-24T15:40:00Z",
      "metricName": "Jira Installation Callback Success Rate",
      "metricUnit": "PERCENT",
      "measurementDurationSeconds": 86400,
      "sliValue": 60.0,
      "sloValue": 99.0
    }
    </script>
If you need it, here is a download of the JSON schema.
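If you want to feed these emails into your own alerting system, one possible approach is to extract the embedded JSON from the HTML body. The sketch below uses only the Python standard library. The class and function names are hypothetical, the sample HTML is a shortened version of the example above, and how you obtain the email body (for example, from a mailbox or an inbound email webhook) is left out.

```python
import json
from html.parser import HTMLParser
from typing import Optional


class IncidentDataParser(HTMLParser):
    """Collects the text of <script type="application/json" id="incident-data">."""

    def __init__(self) -> None:
        super().__init__()
        self._in_target = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if (tag == "script"
                and attributes.get("type") == "application/json"
                and attributes.get("id") == "incident-data"):
            self._in_target = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_target = False

    def handle_data(self, data):
        if self._in_target:
            self._chunks.append(data)

    def text(self) -> str:
        return "".join(self._chunks).strip()


def extract_incident(html_body: str) -> Optional[dict]:
    """Return the incident details as a dict, or None if the tag is absent."""
    parser = IncidentDataParser()
    parser.feed(html_body)
    raw = parser.text()
    return json.loads(raw) if raw else None


sample_html = """
<html><body><p>Alert</p>
<script type="application/json" id="incident-data">
{"appKey": "com.example.appkey", "incidentStatus": "ANOMALOUS",
 "metricName": "Jira Installation Callback Success Rate",
 "sliValue": 60.0, "sloValue": 99.0}
</script></body></html>
"""

incident = extract_incident(sample_html)
if incident is not None:
    print(incident["appKey"], incident["metricName"],
          incident["sliValue"], "<", incident["sloValue"])
```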
Partners will not be penalized if their app is in breach of SLOs due to:
Our aim is to help Marketplace Partners improve their apps' reliability. If an app breaches its SLOs continually, it's in our interest to get it back up to Cloud Fortified app standards. Therefore, there is an escalation of actions depending on the frequency and consistency of SLO breaches:
While Atlassian detects when an SLI is in breach and notifies you, you also need ongoing insight into your app's metrics (for example, to identify trends that suggest impending issues, or to review the metrics before an app is fully enrolled and identify any required remediation).
As part of your onboarding, you are granted access to the Cloud Fortified apps Statuspage, which includes an overview of the key metrics for your apps:
The dashboard provides access to 1 month of data at a 5-minute resolution, with a latency of approximately 10 minutes. Only your team and Atlassian can see metrics for your apps (different partners cannot see each other's metrics). Data collection begins when Atlassian starts processing your application to join the program.
In addition to the metrics tracked for SLIs, the following reliability-related metrics are tracked for Cloud Fortified apps:
SLI metrics are collected from production data, so SLOs will be assessed against Prod Groups, as shown in this diagram. At the same time, the synthetic checks are run against the Developer-First Rollout group.
An incident is an event that disrupts or reduces the quality of a service and requires an emergency response.
Incidents can be raised by the following methods, all of which will produce an EcoHOT ticket.
The EcoHOT ticket is the source of reference for partner and Atlassian efforts. We may also start a Slack channel for faster comms between Atlassian and partners.
The Developer Relations Incident Manager (IM) is an on-call role within Atlassian that is paged when the partner or Atlassian Support believes Atlassian is the cause of the incident. The Developer Relations IM then initiates Atlassian's internal incident management process to resolve the problem, providing feedback to the partner through the EcoHOT ticket. The Developer Relations IM is also responsible for signing off any actions a partner proposes after a Production Incident Review is conducted.
To monitor your app's availability, we need a URL we can regularly poll to ensure your app is "up." This URL:
The URL does not need to exercise end-to-end user experiences. A simple check to confirm your app is up, responding to requests, and able to access its datastores is sufficient.
Feel free to use an existing URL that meets the above criteria (you may have such a resource, for example, for load balancer health-checking or provided by your application framework).
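As an illustration only, a health check resource along these lines could look like the following sketch, built on Python's standard `http.server` module. The `/healthcheck` path, the `datastore_is_reachable` placeholder, the JSON body, and the choice of 200 and 503 status codes are all assumptions made for this example. They are not requirements stated on this page.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

HEALTHCHECK_PATH = "/healthcheck"  # hypothetical path, choose your own


def datastore_is_reachable() -> bool:
    """Placeholder: replace with a cheap query against your real datastore."""
    return True


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != HEALTHCHECK_PATH:
            self.send_error(404)
            return
        healthy = datastore_is_reachable()
        body = json.dumps({"status": "ok" if healthy else "unavailable"}).encode("utf-8")
        # Assumed convention: 200 when healthy, 503 when a dependency is down.
        self.send_response(200 if healthy else 503)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep frequent polling out of the logs in this sketch


def make_server(port: int = 8080) -> HTTPServer:
    """Create (but do not start) a server that exposes the health check."""
    return HTTPServer(("127.0.0.1", port), HealthHandler)
```

Calling `make_server().serve_forever()` would start the server. In a real app you would more likely add an equivalent route to your existing web framework.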
As part of the Cloud Fortified app approval process, we need information about your app and organization in these areas:
Area | Type of information | Why we need it |
---|---|---|
Your app's scalability characteristics | | We’re looking to ensure you have considered your app's scalability characteristics and have validated them to a reasonable degree. |
Testing against Developer First Rollout instances | If you haven't already, sign up for a Developer First Rollout instance to validate your app (more info and sign up for Jira and Confluence). This is a prerequisite for the Cloud Fortified apps program. Describe any pre- or post-deployment testing you do against the Developer First Rollout instance. This may include synthetic tests, as described below. | We expect Cloud Fortified apps to use Developer First Rollout instances to detect unexpected behavior as early as possible when product changes are rolled out. |
Your service restoration plan | Please describe the recovery plan for your app. Consider: | We’re looking to ensure you have a well-trodden path for rolling out fixes to production and for managing data restores where applicable. |
Your existing incident management process | Describe your incident management process. Consider: | We’re looking to understand your approach to incident management so we can collaborate on improving incident response. |
Marketplace Partners submit this information in the approval ticket.
One of the most important measures for your app is its success rate for correctly serving iframes. But by default, Atlassian can only measure whether the iframe establishes a bridge connection to the host within a reasonable time. We cannot determine whether the content of the iframe is as expected. Therefore, any iframe load that successfully establishes the bridge within 12s is counted as a success. This means that if the iframe loads but displays an error to the user because, say, a product API returned an incorrect result, this event is not recorded as a failure.
To improve the accuracy of the metric, modify your code to emit success or failure events when your iframe finishes rendering by sending a PUT request to the addons-metrics API with metricsType set to IFRAME.
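A minimal sketch of the request body for such an event might look like this. The field names (`metricsType`, `moduleType`, `durationInMillis`, `status`) follow the API reference on this page. The helper name `iframe_metric` and the sample values are hypothetical, and sending the request is not shown here.

```python
import json
from typing import Optional


def iframe_metric(module_type: str, succeeded: bool,
                  duration_ms: Optional[int] = None) -> dict:
    """Build one AddonMetrics entry describing an iframe render result."""
    metric = {
        "metricsType": "IFRAME",
        "moduleType": module_type,  # must be a module key from your app descriptor
        "status": "SUCCESS" if succeeded else "FAIL",
    }
    # Latency is only published for successful checks, so omit it on failure.
    if succeeded and duration_ms is not None:
        if duration_ms < 0:
            raise ValueError("duration_ms must not be negative")
        metric["durationInMillis"] = duration_ms
    return metric


# The API expects an array of metrics as the request body.
request_body = json.dumps([
    iframe_metric("adminPages", succeeded=True, duration_ms=840),
    iframe_metric("adminPages", succeeded=False),
])
print(request_body)
```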
We expect Cloud Fortified apps to make some use of Developer First Rollout instances so that you can detect unexpected behavior as early as possible when product changes are rolled out. One way to make the most of these instances is to run regular synthetic tests against them.
Synthetic tests are automated tests that simulate real user interactions to validate core app capabilities and experiences. They are usually implemented with emulated web browsers or recorded web requests. In this context, we suggest you implement automated tests that simulate users interacting with your app through Jira or Confluence, run them regularly against your Developer First Rollout instance, and publish the test result to Atlassian. This enables you to quickly spot cases where product changes degrade your app's core capabilities and gives us visibility into when a product change impacts apps.
To implement synthetic tests:
Note: The authentication on the metric API means that the publish request must emanate from the app, not from the test framework. This means that you need to add logic to your app to report synthetic test results, which is suboptimal. We're exploring ways to remove this complexity.
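One way to structure the reporting logic, sketched under assumptions, is to wrap each synthetic check in a helper that times it and converts the outcome into a SYNTHETIC metric entry. The helper name `run_synthetic_check` and the sample checks are hypothetical. The field names come from the API reference on this page, and actually publishing the result from your app is not shown.

```python
import time
from typing import Callable


def run_synthetic_check(check: Callable[[], None]) -> dict:
    """Run one check and describe the outcome as an AddonMetrics entry.

    The check signals failure by raising an exception.
    """
    start = time.monotonic()
    try:
        check()
    except Exception:
        # Failed checks do not carry latency data.
        return {"metricsType": "SYNTHETIC", "status": "FAIL"}
    elapsed_ms = int((time.monotonic() - start) * 1000)
    return {
        "metricsType": "SYNTHETIC",
        "status": "SUCCESS",
        "durationInMillis": elapsed_ms,
    }


def passing_check() -> None:
    """Stand-in for a real browser or HTTP-based test."""


def failing_check() -> None:
    raise RuntimeError("expected element not found")


results = [run_synthetic_check(passing_check), run_synthetic_check(failing_check)]
print(results)
```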
Once Atlassian has granted you access, validate access to and familiarize yourself with the Cloud Fortified apps Statuspage.
Confirm the metrics you're seeing match your expectations and raise any unexpected behavior (for example, suspicious or missing data) with Atlassian.
Confirm you're receiving email notifications when SLIs breach their SLOs.
If your team uses an alerting or on-call management platform, such as Opsgenie, you can use these emails to trigger alerts in your system.
Once Atlassian grants you access, familiarize yourself with the EcoHOT project.
These are estimates of the effort required to fulfil the onboarding requirements for the reliability program.
Task | Effort (days) | Required or recommended |
---|---|---|
Identify or implement a health check resource to enable availability metrics | 1 | Required |
Review and document your app's production-readiness characteristics | 1 - 3 | Required |
Refine your iframe success rate metrics | 2 | Recommended |
Implement synthetic checks | 2 - 5 | Recommended |
Miscellaneous learning and setup, for example: | 1 | Required |
Once Atlassian has granted you access, review and monitor your SLIs | 1 | Required |
Once Atlassian has granted you access, trial the EcoHOT process. | 1 | Required |

PUT - /rest/atlassian-connect/latest/addons-metrics/${addon_key}/publish
This API is used by Fortified apps to:
Header | Description |
---|---|
Authorization | JWT ${token} (see the Connect JWT documentation) |
Content-Type | application/json |
Name | Description |
---|---|
appKey *required (Path parameter) | App Key |
body *required (Body) | Array<AddonMetrics> List of metrics data to publish |

AddonMetrics field | Type | Description |
---|---|---|
metricsType *required | enum: IFRAME or SYNTHETIC | Metrics type, which is used to construct the metrics name (e.g. metrics.external.connect.synthetic.successful). |
moduleType | string | Module type of the metric. Only applicable to IFRAME metrics. This value must be a valid Connect module type for the product, that is, a valid key in the "modules" field in the app descriptor (e.g. "adminPages"). |
durationInMillis | long | Duration of the successful check, which is used for Capability metrics. Metrics with failed status will not publish this latency data. |
status *required | enum: SUCCESS or FAIL | Indicates whether the check was successful. This is used when calculating reliability metrics. |
Code | Description |
---|---|
200 Success | Successfully published metrics |
403 Forbidden | Request not allowed for appKey; wrong authentication signature; request is only allowed from the app server with a valid installation |
400 Bad Request | Latency should not be a negative value; metrics type not supported; unknown module type |
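Putting the path, headers, and body together, a request to this API could be assembled roughly as follows using Python's standard library. This is a sketch under assumptions: `base_url` is assumed to be the base URL of the product instance your app is installed in, `jwt_token` is a placeholder for a token generated as described in the Connect JWT documentation, and the helper name `build_publish_request` is hypothetical. The request is only constructed here, not sent.

```python
import json
import urllib.request
from urllib.parse import quote

PUBLISH_PATH = "/rest/atlassian-connect/latest/addons-metrics/{app_key}/publish"


def build_publish_request(base_url: str, app_key: str, jwt_token: str,
                          metrics: list) -> urllib.request.Request:
    """Assemble the PUT request that publishes a list of AddonMetrics."""
    url = base_url.rstrip("/") + PUBLISH_PATH.format(app_key=quote(app_key, safe=""))
    return urllib.request.Request(
        url,
        data=json.dumps(metrics).encode("utf-8"),
        method="PUT",
        headers={
            "Authorization": f"JWT {jwt_token}",
            "Content-Type": "application/json",
        },
    )


# Hypothetical values for illustration only.
request = build_publish_request(
    base_url="https://example.atlassian.net",
    app_key="com.example.appkey",
    jwt_token="<token>",
    metrics=[{"metricsType": "SYNTHETIC", "status": "SUCCESS", "durationInMillis": 1200}],
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it. A 403 response raises `urllib.error.HTTPError`, which your app would typically catch and log.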