Last updated Jul 6, 2021


Cloud Fortified apps program reliability requirements

Principles

The reliability requirements for Cloud Fortified apps are built around these principles:

  • A consistent reliability experience for customers: The focus Atlassian has given to reliability over the past 12 months has helped lift our services' quality and reduce the number and severity of incidents. To keep this experience consistent for customers, we want to ensure app reliability matches or betters product reliability.
  • Detect issues before customers: The program should drive patterns and behaviours that enable partners and Atlassian to detect issues before customers feel their impact.
  • A program that scales: While the initial program may involve manual processes as we iterate the boarding experience, the goal is to scale the reliability program to cope with new apps and partners.

Goals

A Service Level Indicator (SLI) is the measurement you use to track your app's capabilities (such as uptime or response time).

A Service Level Objective (SLO) is the target declared for a specific SLI (for example, 99.95% uptime).

The reliability requirements enable Cloud Fortified apps to:

  1. Benefit from a consistent measure of the reliability of their core capabilities by using SLIs and SLOs. These measures provide shared visibility and understanding in reliability-related discussions with Atlassian and ultimately with customers.
  2. Reduce their Mean Time To Detect (MTTD) through automated issue detection. More issues are detected by monitoring and not by customers.
  3. Reduce their Mean Time To Recover (MTTR) by using an incident process that interfaces with Atlassian's incident process.
  4. Continuously raise their quality bar by having partners conduct Post Incident Reviews with actionable improvements, signed off by Atlassian.

Key concepts

Reliability metrics

Generic SLIs

When you begin the onboarding process as a Marketplace Partner, we measure the following for your app:

  1. App availability: Measures whether your app is available. Determined by periodically calling a health check URL for your app that you supply, which is expected to return a 200 (any other result counts as a failure). Atlassian will call this URL about once per minute from an automated monitoring system, so we recommend keeping it simple and lightweight. It does not need to exercise end-to-end user experiences. A simple check to confirm your app is up and responding to requests is sufficient. Consistent response times above 3s are interpreted as failures.
  2. Iframe load success rate & latency: Measures whether your app is serving iframes correctly and responsively. By default, we can only measure whether the iframe establishes a bridge connection to the host within a reasonable time. We cannot determine whether the iframe's content is as expected. So any iframe load that successfully initiates the host bridge within 3s is counted as a success. You can refine this metric by emitting your own success and failure events when your iframe finishes rendering. See "Refine your iframe success metrics" below for more info.
  3. Lifecycle event success rate & latency: Measures whether your app is processing install and uninstall requests successfully and responsively.
  4. Webhook delivery success rate & latency: Measures whether your app is processing webhooks successfully and responsively.

Target SLOs and alerts

Atlassian will detect when an SLI is in breach of its SLO and notify you.

When you sign up for the Cloud Fortified apps program, you provide Atlassian with an email address that will receive notifications on breach events as per the alert conditions described below. If your team already uses an alerting or on-call management platform such as Opsgenie, you can use these emails to trigger alerts in your system.

| SLI | Current Target SLO (over 28 days) | Alert condition |
| --- | --- | --- |
| App availability | 99.9% | Healthcheck success rate is less than 99.9% for the past 15m, or healthcheck latency is over 3s for the past 1h for 10% or more of requests |
| Iframe load success rate | 95% | Iframe load success rate is less than 95% for the past 15m |
| Iframe load latency | p90 @ 3s | Iframe load latency is over 3s for the past 1h for 10% or more of requests |
| Lifecycle event success rate | 99% | Lifecycle success rate is less than 99% for the past 24h¹ |
| Lifecycle event latency | p90 @ 3s | Lifecycle latency is over 3s for the past 1h for 10% or more of requests |
| Webhook delivery success rate | 99% | Webhook delivery success rate is less than 99% for the past 15m |
| Webhook delivery latency | p90 @ 3s | Webhook latency is over 3s for the past 1h for 10% or more of requests |

¹ Lifecycle events are infrequent, so to avoid noisy alerts, we use a wide time window for the success rate calculations. Your metrics dashboard also includes success and failure counts so that you can check for fluctuations over shorter time periods.
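
For illustration, the alert conditions above amount to simple checks over a window of request samples. The sketch below is not Atlassian's implementation; the `RequestSample` shape and function names are assumptions made for this example:

```typescript
// Illustrative evaluation of the success-rate and latency alert conditions.
// Atlassian performs this evaluation on its own systems; this sketch only
// shows the arithmetic behind the thresholds in the table above.

interface RequestSample {
  success: boolean;
  latencyMillis: number;
}

// True when the success rate over the window falls below the target,
// e.g. successRateBreached(samples, 0.999) for the availability SLI.
function successRateBreached(samples: RequestSample[], target: number): boolean {
  if (samples.length === 0) return false;
  const ok = samples.filter((s) => s.success).length;
  return ok / samples.length < target;
}

// True when 10% or more of requests in the window exceeded the threshold,
// matching the "over 3s for 10% or more of requests" latency conditions.
function latencyBreached(samples: RequestSample[], thresholdMillis = 3000): boolean {
  if (samples.length === 0) return false;
  const slow = samples.filter((s) => s.latencyMillis > thresholdMillis).length;
  return slow / samples.length >= 0.1;
}
```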

Impact of SLO breaches for Cloud Fortified apps

Partners will not be penalized if their app is in breach of SLOs due to:

  • Atlassian failure. System-at-fault analysis can be done retrospectively as part of the PIR.
  • Planned downtime, if Atlassian is notified in advance. Please inform us in a comment on your Cloud Fortified app approval ticket if you have upcoming planned downtime.

Our aim is to help Marketplace Partners improve their apps' reliability. If an app breaches its SLOs continually, it's in our interest to get it back up to Cloud Fortified app standards. Therefore, there is an escalation of actions depending on the frequency and consistency of SLO breaches:

  • Notification: For each SLO breach, we notify you as described above. You can assess the breach's impact and initiate the EcoHOT process, as described below, if the app is degraded and needs an emergency response.
  • Remediation: For repeated breaches, a member of the Atlassian team contacts you to identify what is causing the breaches and assist in forming a strategy to bring the app back up to scratch.
  • Demotion: Apps that continuously breach SLOs and fail to fix problems are demoted to a non-Cloud Fortified designation in the marketplace. We want to avoid this scenario because it has ramifications for customer expectations.

Monitoring your SLIs

While Atlassian detects when an SLI is in breach and notifies you, you also need to consider ongoing insight into your app's metrics (for example, identifying trends that suggest impending issues or reviewing the metrics before an app is fully enrolled to identify any required remediation).

As part of your onboarding, you are granted access to the Cloud Fortified apps Statuspage, which includes an overview of the key metrics for your apps:

SLI metrics dashboard

The dashboard provides access to 1 month of data at a 5-minute resolution, with a latency of approximately 10 minutes. Only your team and Atlassian can see metrics for your apps (different partners cannot see each others' metrics). Data collection begins when Atlassian starts processing your application to join the program.

Non-SLI metrics

In addition to the metrics tracked for SLIs, the following reliability-related metrics are tracked for Cloud Fortified apps:

Data sources

SLI metrics are collected from production data, so SLOs will be assessed against Prod Groups, as shown in this diagram. At the same time, the synthetic checks are run against the Developer-First Rollout group.

SLOs data sources

Incident management

An incident is an event that disrupts or reduces the quality of a service and requires an emergency response.

Incidents can be raised by the following methods, all of which produce an EcoHOT ticket.

Sources of EcoHOT tickets

The EcoHOT ticket is the shared point of reference for partner and Atlassian efforts. We may also open a Slack channel for faster communication between Atlassian and partners.

EcoHOT workflow


Developer Relations Incident Manager

The Developer Relations Incident Manager is an on-call role within Atlassian that is paged if the partner or Atlassian Support believes Atlassian to be the cause of the incident. The Developer Relations IM then initiates Atlassian's internal incident management process to resolve the problem, providing feedback to the partner through the EcoHOT ticket. The Developer Relations IM is also responsible for signing off any actions a partner proposes after a Post Incident Review is conducted.

Steps to follow

Identify or implement a health check resource (required for Connect apps)

To monitor your app's availability, we need a URL we can regularly poll to ensure your app is "up." This URL:

  • Must return 200 if the app is healthy. Any other result counts as a failure.
  • Must not return 200 if the app is down.
  • Must be unauthenticated.
  • Must return in under 3s. Consistent response times above 3s are interpreted as failures.
  • Must be lightweight. Atlassian will poll this URL approximately every 60s.

The URL does not need to exercise end-to-end user experiences. A simple check to confirm your app is up, responding to requests, and able to access its datastores is sufficient.

Feel free to use an existing URL that meets the above criteria (you may have such a resource, for example, for load balancer health-checking or provided by your application framework).
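
A health check meeting these criteria can be very small. The sketch below (Node.js with TypeScript) illustrates one way to do it; the `/healthcheck` path and the datastore probe are assumptions for this example, not requirements:

```typescript
import { createServer } from "node:http";

// Hypothetical probe of the app's datastore (an assumption for illustration);
// in a real app this would be a cheap query such as SELECT 1.
async function datastoreReachable(): Promise<boolean> {
  return true;
}

// Map health to the status code Atlassian's poller expects:
// 200 only when the app is healthy; anything else counts as a failure.
function healthStatusCode(healthy: boolean): number {
  return healthy ? 200 : 503;
}

const server = createServer(async (req, res) => {
  if (req.url === "/healthcheck") {
    const healthy = await datastoreReachable();
    res.statusCode = healthStatusCode(healthy);
    res.end(healthy ? "OK" : "UNHEALTHY");
  } else {
    res.statusCode = 404;
    res.end();
  }
});
server.unref(); // call server.listen(...) in a real deployment
```

Keep the handler fast: Atlassian polls roughly every 60s and treats responses slower than 3s as failures.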

Production readiness documentation (required)

As part of the Cloud Fortified app approval process, we need information about your app and organization in the areas below. For each area, we describe the type of information required and why we need it.

Your app's scalability characteristics

  • What are your app's scaling factors (for example, database accesses, concurrent request processing, queues, bulk operations, non-linear operations, pagination, or N+1 API calls for additional data)?

  • What do you expect your app to do in the presence of:

    • thousands of concurrent users

    • large datasets

    • distributed users?

  • What do you do to respond to rate limits from Atlassian APIs?

  • What testing do you undertake to assess your app’s ability to work at scale?

We’re looking to ensure you have considered your app's scalability characteristics and have validated them to a reasonable degree.

Testing against Developer First Rollout instances

If you haven't already, sign up for a Developer First Rollout instance for validating your app (more info and sign up for Jira and Confluence). This is a prerequisite for the Cloud Fortified apps program.

Describe any pre- or post-deployment testing you do against the Developer First Rollout instance.

This may include synthetic tests, as described below.

We expect Cloud Fortified apps to use Developer First Rollout instances to detect unexpected behavior as early as possible when product changes are rolled out.

Your service restoration plan

Please describe the recovery plan for your app. Consider:

  • What process do you follow when you determine a new deployment of your app has an issue that is severely impacting customers (for example, rollback, roll forward, or a similar mechanism)?

  • What is the typical duration from the point where the fix is merged into the code to the point where it is fully deployed to production?

  • If your app stores data, how would you recover from lost or corrupt datastores? What is your backup frequency and retention? How fast can a data restore be performed?

  • Do you test your data restore process? How?

We’re looking to ensure you have a well-trodden path for rolling out fixes to production and for managing data restores where applicable.

Your existing incident management process

Describe your incident management process. Consider:

  • During what hours is your team available to respond to issues? If they respond out of business hours, how are they notified of issues?

  • How do you typically discover production issues with your app (for example, from support tickets, monitoring, or other means)?

  • Do you have automated alerting on metrics in production? If so, please summarize the types of metrics you monitor and the alert thresholds.

  • How do your team members communicate with each other during incidents?

  • How does your team communicate with customers during incidents?

We’re looking to understand your approach to incident management so we can collaborate on improving incident response.

Marketplace Partners submit this information in the approval ticket.

Refine your iframe success metrics (recommended)

One of the most important measures for your app is its success rate for correctly serving iframes. By default, however, Atlassian can only measure whether the iframe establishes a bridge connection to the host within a reasonable time; we cannot determine whether the content of the iframe is as expected. Therefore, any iframe load that successfully establishes the bridge within 3s is counted as a success. This means that if the iframe loads but displays an error to the user because, say, a product API returned an incorrect result, the event is not recorded as a failure.

To improve the accuracy of this metric, modify your code to emit success or failure events when your iframe finishes rendering by sending a PUT request to the addons-metrics API with metricsType set to IFRAME.
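
A sketch of emitting such an event from your app's server follows. The JWT creation and base URL handling are assumptions that depend on your Connect framework; the payload fields match the metrics publish API documented at the end of this page:

```typescript
// Illustrative sketch: report an iframe render result to the
// addons-metrics API. The JWT token is a placeholder for however your
// Connect framework signs requests (see the Connect JWT docs).

interface AddonMetric {
  metricsType: "IFRAME" | "SYNTHETIC";
  status: "SUCCESS" | "FAIL";
  durationInMillis?: number;
  moduleType?: string;
}

function iframeMetric(renderedOk: boolean, millis: number): AddonMetric {
  // Latency is only meaningful for successful checks; failed metrics
  // do not publish latency data.
  return renderedOk
    ? { metricsType: "IFRAME", status: "SUCCESS", durationInMillis: millis }
    : { metricsType: "IFRAME", status: "FAIL" };
}

async function publishMetrics(
  baseUrl: string, // host product base URL (assumption)
  addonKey: string,
  jwt: string, // token from your framework's JWT helper
  metrics: AddonMetric[],
): Promise<number> {
  const res = await fetch(
    `${baseUrl}/rest/atlassian-connect/latest/addons-metrics/${addonKey}/publish`,
    {
      method: "PUT",
      headers: { Authorization: `JWT ${jwt}`, "Content-Type": "application/json" },
      body: JSON.stringify(metrics),
    },
  );
  return res.status;
}
```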

Implement synthetic tests (recommended)

We expect Cloud Fortified apps to make some use of Developer First Rollout instances so that you can detect unexpected behavior as early as possible when product changes are rolled out. One way to make the most of these instances is to run regular synthetic tests against them.

Synthetic tests are automated tests that simulate real user interactions to validate core app capabilities and experiences. They are usually implemented with emulated web browsers or recorded web requests. In this context, we suggest you implement automated tests that simulate users interacting with your app through Jira or Confluence, run them regularly against your Developer First Rollout instance, and publish the test result to Atlassian. This enables you to quickly spot cases where product changes degrade your app's core capabilities and gives us visibility into when a product change impacts apps.

To implement synthetic tests, build automated tests that exercise your app's core capabilities, run them on a regular schedule against your Developer First Rollout instance, and publish each result to the addons-metrics API with metricsType set to SYNTHETIC.

Note: The authentication on the metric API means that the publish request must emanate from the app, not from the test framework. This means that you need to add logic to your app to report synthetic test results, which is suboptimal. We're exploring ways to remove this complexity.
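
For example, a scheduled job could time a core capability check and capture a SYNTHETIC result in the shape the metrics API expects. This is a sketch; the check being wrapped is hypothetical, and because of the authentication constraint above, the result would need to be forwarded to your app server for publishing:

```typescript
// Sketch: wrap a synthetic check so its outcome and duration are
// captured in the shape the addons-metrics API expects. The check
// function is a hypothetical stand-in for your real end-to-end test.

interface SyntheticResult {
  metricsType: "SYNTHETIC";
  status: "SUCCESS" | "FAIL";
  durationInMillis?: number;
}

async function timedSyntheticCheck(
  check: () => Promise<void>,
): Promise<SyntheticResult> {
  const start = Date.now();
  try {
    await check();
    return {
      metricsType: "SYNTHETIC",
      status: "SUCCESS",
      durationInMillis: Date.now() - start,
    };
  } catch {
    // Failed checks report no latency, per the API field notes.
    return { metricsType: "SYNTHETIC", status: "FAIL" };
  }
}
```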

Review and Monitor your SLIs (required)

Once Atlassian has granted you access, validate access to and familiarize yourself with the Cloud Fortified apps Statuspage.

Confirm the metrics you're seeing match your expectations and raise any unexpected behavior (for example, suspicious or missing data) with Atlassian.

Confirm you're receiving email notifications when SLIs breach their SLOs.

If your team uses an alerting or on-call management platform, such as Opsgenie, you can use these emails to trigger alerts in your system.

Trial the EcoHOT process (required)

Once Atlassian grants you access, familiarize yourself with the EcoHOT project.

  • Confirm you can raise an EcoHOT ticket and that your team can access it.
  • If you'd like to ensure other partners cannot see your incident ticket, set the "Restrict to" field to just your app and administrators.
  • Have your operations contacts watch the incident and PIR video training material (coming soon, not required at this time).

EcoHOT ticket example

Approximate effort

These are estimates of the effort required to fulfil the onboarding requirements for the reliability program.

| Task | Effort (days) | Required or recommended |
| --- | --- | --- |
| Identify or implement a health check resource to enable availability metrics | 1 | Required |
| Review and document your app's production-readiness characteristics | 1-3 | Required |
| Refine your iframe success rate metrics | 2 | Recommended |
| Implement synthetic checks | 2-5 | Recommended |
| Miscellaneous learning and setup | 1 | Required |
| Once Atlassian has granted you access, review and monitor your SLIs | 1 | Required |
| Once Atlassian has granted you access, trial the EcoHOT process | 1 | Required |

Metrics publish API doc

`PUT /rest/atlassian-connect/latest/addons-metrics/${addon_key}/publish`

This API is used by Cloud Fortified apps to:

  • Refine iframe success rate metrics by submitting custom success/failure events
  • Publish synthetic test results

Headers

| Header | Description |
| --- | --- |
| Authorization | `JWT ${token}` — [see Connect JWT documentation](/cloud/jira/platform/understanding-jwt-for-connect-apps/) |
| Content-Type | `application/json` |

Parameters

| Name | Description |
| --- | --- |
| `appKey` (path parameter, required) | App key |
| `body` (body, required) | `Array<AddonMetrics>` — list of metrics data to publish |


AddonMetrics:

| Field | Type | Description |
| --- | --- | --- |
| `metricsType` (required) | enum: `IFRAME` or `SYNTHETIC` | Metrics type, used to construct the metrics name (e.g. `metrics.external.connect.synthetic.successful`) |
| `moduleType` | string | Module type of the check, if available. The value is validated as a known moduleType for the product |
| `durationInMillis` | long | Duration of successful checks, used for capability metrics. Metrics with a failed status do not publish latency data |
| `status` | enum: `SUCCESS` or `FAIL` | Indicates whether the check was successful. Used when calculating reliability metrics |
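
For example, a request body publishing one successful synthetic check could look like the following (illustrative values, shown here as a TypeScript literal):

```typescript
// Example publish body: an array with a single successful synthetic
// check. All values are illustrative.
const body = [
  {
    metricsType: "SYNTHETIC",
    status: "SUCCESS",
    durationInMillis: 420,
  },
];
```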

Responses

| Code | Description |
| --- | --- |
| 200 Success | Successfully published metrics |
| 403 Forbidden | Request not allowed for appKey; wrong authentication signature; requests are only allowed from the app server with a valid installation |
| 408 Bad Request | Latency should not be a negative value; metrics type not supported; unknown module type |
