Performance and stability at scale are essential for large Data Center installations. Following these general guidelines will help ensure your app is performant and reliable in a cluster.
If you're starting with an app already used on Bitbucket/Confluence/Jira Server instances, we recommend developing with Data Center compatibility in mind for all future versions. You might be tempted to simply fork your code and develop separate versions. Instead, we recommend using a single set of source code for future app releases. If you develop your app following these guidelines, it should work well with Server instances as well as clustered Data Center instances.
You should avoid long-running tasks in shared thread pools as they may impact other core tasks. For example, tasks that run in response to product events, indexing tasks and scheduled jobs. If these long-running tasks are required, you should move them to a separate executor and consider what should happen if the executor cannot keep up with the tasks being scheduled. Can tasks be dropped safely? Can tasks be batched to run more efficiently?
When introducing a new thread pool in the system, consider the memory usage of both the thread pool and its queue. Each thread consumes about 1 MB of off-heap memory (for its stack space). Tasks in the queue also take up memory, so make them as small as possible. For example, a task could store the project ID instead of the full project object. The task queue should be bounded to prevent unbounded memory usage. Consider whether the tasks in the queue can be de-duplicated.
You should make sure your feature can recover if a task cannot be processed because the task queue is full.
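The thread pool guidance above can be sketched as follows. This is a minimal illustration, not a product API: `BoundedProjectTaskRunner`, its `Result` enum and the idea of keying tasks by a numeric project ID are all assumptions made for the example. It uses a dedicated executor with a bounded queue, queues only IDs, de-duplicates pending IDs, and reports a full queue to the caller instead of failing silently.

```java
import java.util.Set;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.function.LongConsumer;

// Hypothetical helper, not a product API: a dedicated pool with a bounded queue.
class BoundedProjectTaskRunner {
    enum Result { ACCEPTED, DUPLICATE, REJECTED }

    // IDs that are queued but not yet started; used to de-duplicate submissions.
    private final Set<Long> pending = ConcurrentHashMap.newKeySet();
    private final ThreadPoolExecutor executor;
    private final LongConsumer work;

    BoundedProjectTaskRunner(int threads, int queueCapacity, LongConsumer work) {
        this.work = work;
        // Fixed-size pool with a bounded queue. The default AbortPolicy throws
        // RejectedExecutionException when all threads are busy and the queue is full.
        this.executor = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity));
    }

    Result submit(long projectId) {
        if (!pending.add(projectId)) {
            return Result.DUPLICATE; // same ID is already waiting in the queue
        }
        try {
            executor.execute(() -> {
                pending.remove(projectId); // allow re-submission once work starts
                work.accept(projectId);    // load the full object here, not at submit time
            });
            return Result.ACCEPTED;
        } catch (RejectedExecutionException e) {
            pending.remove(projectId);
            return Result.REJECTED; // caller decides: drop, retry later, or batch
        }
    }

    void shutdownAndWait(long timeoutSeconds) {
        executor.shutdown();
        try {
            executor.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

A caller that receives `REJECTED` might, for instance, mark the project as needing work and let a later scheduled pass pick it up. Which recovery strategy is appropriate depends on the feature.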
Avoid cluster locks in high frequency code paths as they will introduce bottlenecks in the system. The larger the cluster, the worse the bottleneck can be. You should design your feature to be lock-free or use optimistic locking in the database (with recovery) to prevent cluster lock bottlenecks.
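As a rough illustration of optimistic locking with recovery, the sketch below uses an in-memory object in place of a database row. `VersionedRow` and `OptimisticUpdater` are hypothetical names. In a real app, the conditional update would be a single SQL statement that checks a version column.

```java
// Hypothetical in-memory stand-in for a database row with a version column.
class VersionedRow {
    static final class Snapshot {
        final long version;
        final int value;

        Snapshot(long version, int value) {
            this.version = version;
            this.value = value;
        }
    }

    private long version;
    private int value;

    synchronized Snapshot read() {
        return new Snapshot(version, value);
    }

    // Mimics: UPDATE ... SET value = ?, version = version + 1 WHERE version = ?
    // Returns false when another writer has changed the row since it was read.
    synchronized boolean updateIfVersion(long expectedVersion, int newValue) {
        if (version != expectedVersion) {
            return false;
        }
        value = newValue;
        version++;
        return true;
    }
}

class OptimisticUpdater {
    // Re-reads and retries a bounded number of times instead of taking a lock.
    static boolean increment(VersionedRow row, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            VersionedRow.Snapshot snapshot = row.read();
            if (row.updateIfVersion(snapshot.version, snapshot.value + 1)) {
                return true;
            }
        }
        return false; // recovery (report, reschedule, or give up) is left to the caller
    }
}
```

No node waits on another here. A conflicting writer simply causes a retry, which keeps the hot path free of cluster-wide coordination.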
When cluster locks can't be avoided, use (possibly with a timeout) and make sure there is a reasonable fallback behaviour if the lock can't be acquired.
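One possible shape for a timed acquire with a fallback is sketched below. It assumes the cluster lock can be used through the standard `java.util.concurrent.locks.Lock` interface; a plain `ReentrantLock` would stand in for it in a local test. `LockWithFallback` and `runWithLock` are hypothetical names.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Lock;
import java.util.function.Supplier;

class LockWithFallback {
    // Runs `action` under the lock if it can be acquired within the timeout;
    // otherwise runs `fallback` (for example, return a stale value or skip the work).
    static <T> T runWithLock(Lock lock, long timeoutMillis,
                             Supplier<T> action, Supplier<T> fallback) {
        boolean acquired = false;
        try {
            acquired = lock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS);
            return acquired ? action.get() : fallback.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for callers
            return fallback.get();
        } finally {
            if (acquired) {
                lock.unlock();
            }
        }
    }
}
```

The timeout bounds how long a request thread can be held up by contention on another node, and the fallback keeps the feature usable when the lock is busy.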
We strongly recommend avoiding deprecated and internal APIs.
Everything should be bounded. You should not retrieve unbounded sets of data. When returning data in services or REST endpoints, make sure that the returned values are paged. Where possible, use streaming APIs in which the caller provides a callback to process each retrieved object (for example, projects, issues, pages, comments, or commits).
Avoid holding large amounts of data in memory. Use streaming APIs where possible to process one item at a time without keeping it in memory. When streaming is not an option, use paged APIs to retrieve and process a page of data at a time.
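A page-at-a-time loop might look like the sketch below. The `fetchPage` function (offset and limit in, a list of at most `limit` items out) is a hypothetical stand-in for whatever paged API the product provides; real paged APIs usually have their own request and response types.

```java
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Consumer;

class PagedProcessor {
    // Processes every item while holding at most one page in memory at a time.
    // fetchPage is a hypothetical (offset, limit) -> items function; an empty or
    // short page signals that the data is exhausted.
    static <T> int forEachPaged(BiFunction<Integer, Integer, List<T>> fetchPage,
                                int pageSize, Consumer<T> handler) {
        int offset = 0;
        int processed = 0;
        while (true) {
            List<T> page = fetchPage.apply(offset, pageSize);
            if (page.isEmpty()) {
                break;
            }
            page.forEach(handler);
            processed += page.size();
            if (page.size() < pageSize) {
                break; // last, partially filled page
            }
            offset += page.size();
        }
        return processed;
    }
}
```

The handler should not accumulate the items it receives, otherwise the memory benefit of paging is lost.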
When returning large amounts of data over REST, you should use a streaming writer to prevent buffering the whole payload in memory.
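The idea of a streaming writer can be illustrated with plain `java.io` types. In the sketch below, the `Writer` is assumed to wrap the HTTP response stream; the guidelines above don't name a specific REST framework, so the wiring is left out, and `JsonArrayStreamer` is a hypothetical name.

```java
import java.io.IOException;
import java.io.Writer;
import java.util.Iterator;

class JsonArrayStreamer {
    // Writes a JSON array of numeric IDs one element at a time, so the full
    // payload is never built up as a single string or collection in memory.
    static void writeIds(Iterator<Long> ids, Writer out) throws IOException {
        out.write('[');
        boolean first = true;
        while (ids.hasNext()) {
            if (!first) {
                out.write(',');
            }
            out.write(Long.toString(ids.next()));
            first = false;
        }
        out.write(']');
        out.flush();
    }
}
```

The iterator would typically be backed by a paged or streaming data source, so that neither the read side nor the write side buffers the whole result.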
When performing background operations, consider the memory, CPU and I/O cost of those operations. Limit operations that consume a lot of resources by restricting the number of operations that are allowed to run concurrently (for example, only a specified number of background indexing threads).
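One simple way to cap concurrency is a semaphore, as in the sketch below. `ConcurrencyLimiter` is a hypothetical helper, and the choice to skip the operation rather than wait when no permit is free is an assumption. Blocking with a timeout would be an equally reasonable design.

```java
import java.util.concurrent.Semaphore;

class ConcurrencyLimiter {
    private final Semaphore permits;

    ConcurrencyLimiter(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    // Runs the operation only if a permit is free. Returns false without
    // running it when the limit has been reached, so the caller can defer it.
    boolean tryRun(Runnable operation) {
        if (!permits.tryAcquire()) {
            return false;
        }
        try {
            operation.run();
            return true;
        } finally {
            permits.release();
        }
    }
}
```

Note that this limits concurrency on a single node only. A cluster-wide limit would need coordination between nodes, which has its own cost.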
Be mindful of transaction boundaries. Loading a large amount of data in a Hibernate session means that all of those objects are cached in the session. You should break up the transaction into limited batches to avoid building up a massive Hibernate session. This specifically applies to Confluence and Bitbucket.
As a general rule, don't cache if you don't have to. Most caches need to be kept consistent across the whole cluster. If you do need to cache data, you should use node-local caches with asynchronous validation (known as 'hybrid caches' in ).
If you must have strongly consistent caches, use remote caches and ensure that both the keys and the data stored in the cache are .
Always consider the performance impact of a scheduled job. Don't run them more frequently than necessary.
Bitbucket provides a , which can be used to batch tasks and ensure that they are only executed on one node at a time.
You should avoid frequent polling from the browser to load or reload data.
If your app injects scripts into a page, all errors should be handled. For example, failure to load data from a REST endpoint should not break or impact other functionality on the page.
Avoid sending large numbers of messages between nodes. Keep the payloads as small as possible.
Avoid blocking calls to external systems or other cluster nodes. When possible, you should switch to asynchronous I/O (for example ) or asynchronous publish/subscribe.
There are also some product-specific considerations:
When using low-level APIs to make ref changes to a repository in Bitbucket, the plugin must raise a (subclass of) to ensure that caches are invalidated across the cluster.
Always consider the impact your app may have in instances with large data sets, for example many users, groups, projects, spaces, issues, pages, repos, commits, comments. You should ensure you have appropriate indexes set up on all tables introduced by your feature or app. Wherever possible, you should filter data in database queries, rather than relying on in-product filtering, which may require loading and discarding lots of data.
Finally, always test your feature or app on a cluster instance with a big data set.
For product-specific guidelines, see Testing your app with a large data set.
For more product-specific guidelines, see the following: