Last updated Sep 10, 2019

Create a search extractor

Applicable: This tutorial applies to Confluence 7.0 or higher.
Level of experience: Advanced. You should complete at least one intermediate tutorial before working through this tutorial.

The Extractor2 plugin module allows you to hook into the mechanism that Confluence uses to populate its search index. Each time content is created or updated in Confluence, it is passed through a chain of extractors that assemble the fields and data that will be added to the search index for that content. By writing your own extractor you can add information to the index.
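The idea of an extractor chain can be pictured with a small, self-contained model. This is only a sketch under stated assumptions: SimpleExtractor and buildDocument are hypothetical names invented for illustration, the index document is reduced to a map of single-valued strings, and none of this is Confluence's actual implementation.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ExtractorChainSketch {
    /** Hypothetical, simplified stand-in for an extractor; NOT the real Extractor2 interface. */
    interface SimpleExtractor {
        String extractText(Object searchable);
        Map<String, String> extractFields(Object searchable);
    }

    /** Passes one content object through every extractor and merges the results into one document. */
    static Map<String, String> buildDocument(Object searchable, List<SimpleExtractor> chain) {
        Map<String, String> document = new LinkedHashMap<>();
        StringBuilder content = new StringBuilder();
        for (SimpleExtractor extractor : chain) {
            // Text from each extractor is appended to the shared "content" text.
            String text = extractor.extractText(searchable);
            if (!text.isEmpty()) {
                if (content.length() > 0) {
                    content.append(' ');
                }
                content.append(text);
            }
            // Fields from each extractor are added alongside it.
            document.putAll(extractor.extractFields(searchable));
        }
        document.put("content", content.toString());
        return document;
    }
}
```

In this model, adding your own extractor to the list is what contributes extra text and extra fields to the resulting document.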

To help you become familiar with the extractor2 module, we've created a simple plugin demonstrating how to create an extractor. In the following sections we'll explain how different parts of the plugin are implemented and fit together.

Before you begin

To complete this tutorial, you'll need to be familiar with:

Source code

You can find the source code for this tutorial on Atlassian Bitbucket.

To clone the repository, run the following command:

git clone https://bitbucket.org/atlassian_tutorial/confluence-extractor2-tutorial.git

Alternatively, you can download the source as a ZIP archive.

This tutorial was last tested with Confluence 7.0 using Atlassian SDK 8.0.2.

Step 1: Implement Extractor2 interface

ExtraCommentDataExtractor implements two methods, extractText and extractFields. Each receives a searchable object: the Confluence content object (for example, a Page, Attachment, or Comment) that is being saved and passed through the extractor chain.

public class ExtraCommentDataExtractor implements Extractor2 {
    private final CommentManager commentManager;
 
    @Autowired
    public ExtraCommentDataExtractor(@ComponentImport CommentManager commentManager) {
        this.commentManager = requireNonNull(commentManager, "commentManager");
    }
 
    public StringBuilder extractText(Object searchable) {
    ...
    }
 
    public Collection<FieldDescriptor> extractFields(Object searchable) {
    ...
    }
}

The result of the extractText method is appended to the default text field "content", which is used in a regular Confluence site search. The result of the extractFields method is a collection of fields to be added to the search index document.

The @Autowired and @ComponentImport annotations on the constructor ask Confluence to inject CommentManager into the extractor when it is created.

Step 2: Add text to the default content field

ExtraCommentDataExtractor#extractText concatenates the bodies of the comments on the given page, separated by spaces. For any other content type it returns an empty StringBuilder.

...
    public StringBuilder extractText(Object searchable) {
        StringBuilder builder = new StringBuilder();
        if (searchable instanceof Page) {
            Page page = (Page) searchable;
            builder.append(commentManager.getPageComments(page.getId(), page.getCreationDate()).stream()
                    .map(ContentEntityObject::getBodyAsString)
                    .collect(joining(" ")));
        }
        return builder;
    }
...
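The joining step can be tried without a Confluence instance. The following is a minimal sketch, assuming plain strings stand in for comment bodies; CommentTextSketch and joinBodies are hypothetical names used only for illustration.

```java
import java.util.List;
import java.util.stream.Collectors;

final class CommentTextSketch {
    /** Joins comment bodies with single spaces, mirroring the stream pipeline in extractText. */
    static StringBuilder joinBodies(List<String> bodies) {
        StringBuilder builder = new StringBuilder();
        builder.append(bodies.stream().collect(Collectors.joining(" ")));
        return builder;
    }
}
```

An empty list produces an empty StringBuilder, so a page without comments contributes no extra text.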

Step 3: Create fields of different types

The ExtraCommentDataExtractor#extractFields demonstrates how to create fields of different types.

  • text field "comment-creator": contains one value per comment, the commenter's lowercase username, which the analyzer breaks into a series of 2-grams.
  • long field "comment-modified": contains a numerical value, the last modification time of the last comment in the list, in milliseconds.
  • int field "comment-count": contains the number of comments on the page.
  • double field "comment-score": contains a single real number derived from the average comment length (log1p of the average).
...  
    public Collection<FieldDescriptor> extractFields(Object searchable) {
        Page page = getPage(searchable);
        if (page == null) {
            return emptyList();
        }
        List<Comment> comments = commentManager.getPageComments(page.getId(), page.getCreationDate());
        if (comments.isEmpty()) {
            return emptyList();
        }
 
        ImmutableList.Builder<FieldDescriptor> builder = ImmutableList.builder();
        AnalyzerDescriptor analyzer = AnalyzerDescriptor.builder(new NGramTokenizerDescriptor(2, 2))
                .build();
        comments.stream()
                .map(ConfluenceEntityObject::getCreator)
                .filter(Objects::nonNull)
                .map(ConfluenceUser::getLowerName)
                .filter(Objects::nonNull)
                .forEach(username -> builder.add(new TextFieldDescriptor("comment-creator", username, FieldDescriptor.Store.YES, analyzer)));
 
        Comment lastComment = comments.get(comments.size() - 1);
        builder.add(new LongFieldDescriptor("comment-modified", lastComment.getLastModificationDate().getTime(), FieldDescriptor.Store.YES));
        builder.add(new IntFieldDescriptor("comment-count", comments.size(), FieldDescriptor.Store.YES));
 
        int commentTextLength = comments.stream()
                .mapToInt(x -> x.getBodyAsString().length())
                .sum();
        double commentScore = Math.log1p((double) commentTextLength / comments.size());
        builder.add(new DoubleFieldDescriptor("comment-score", commentScore, FieldDescriptor.Store.YES));
 
        return builder.build();
    }
...
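To get a feel for the values these fields hold, the sketch below reproduces two of the computations with plain Java. It assumes that a 2-gram tokenizer with minimum and maximum size 2 emits overlapping two-character substrings. FieldValuesSketch and its methods are hypothetical names, and the code is an illustration, not the analyzer Confluence runs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class FieldValuesSketch {
    /** Splits a string into overlapping character 2-grams (assumed tokenizer behaviour). */
    static List<String> bigrams(String value) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 2 <= value.length(); i++) {
            grams.add(value.substring(i, i + 2));
        }
        return grams;
    }

    /**
     * Same arithmetic as extractFields: log1p of the average body length.
     * The list must be non-empty; the tutorial code returns early when there are no comments.
     */
    static double commentScore(List<String> bodies) {
        int totalLength = bodies.stream().mapToInt(String::length).sum();
        return Math.log1p((double) totalLength / bodies.size());
    }

    public static void main(String[] args) {
        System.out.println(bigrams("admin"));
        System.out.println(commentScore(Arrays.asList("aaaa", "bb")));
    }
}
```

Because log1p grows slowly, a page with very long comments gets a score only modestly higher than a page with short ones.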

Step 4: Make it visible to Confluence

Here is an example atlassian-plugin.xml file containing a single search extractor:

...
<extractor2 name="extraCommentDataExtractor" key="extraCommentDataExtractor"
                class="com.atlassian.confluence.plugins.extractor.tutorial.ExtraCommentDataExtractor" priority="1100">
</extractor2>
...
  • The class attribute defines the class that will be added to the extractor chain. This class must implement Extractor2.
  • The priority attribute determines the order in which extractors are run. Extractors are run from the highest to lowest priority. Extractors with the same priority may be run in any order.
  • The key attribute is a unique string that identifies the extractor.
  • The name attribute is the name of the extractor.
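The ordering rule for the priority attribute can be sketched as a simple descending sort. This is an illustration only: Registered and runOrder are hypothetical names, and the sketch says nothing about how Confluence orders extractors internally beyond the rule stated above.

```java
import java.util.ArrayList;
import java.util.List;

final class PriorityOrderSketch {
    /** Hypothetical stand-in for an extractor2 declaration: only its key and priority. */
    static final class Registered {
        final String key;
        final int priority;

        Registered(String key, int priority) {
            this.key = key;
            this.priority = priority;
        }
    }

    /** Returns the keys in run order: highest priority first. */
    static List<String> runOrder(List<Registered> modules) {
        List<Registered> sorted = new ArrayList<>(modules);
        sorted.sort((a, b) -> Integer.compare(b.priority, a.priority));
        List<String> keys = new ArrayList<>();
        for (Registered module : sorted) {
            keys.add(module.key);
        }
        return keys;
    }
}
```

With priority 1100, the tutorial's extractor would run before any extractor declared with a lower number. The sketch keeps input order for equal priorities, but you should not rely on any particular order in that case.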

Step 5: See how it works

One way to see how an extractor works is to debug a running Confluence instance. Here are the key steps:

  1. Start your Confluence instance in debug mode.
  2. Install the plugin containing the extractor.
  3. Attach a debugger to the instance and set breakpoints in the extractText and extractFields methods.
  4. Add or modify a page.
  5. Observe that the debugger stops at the breakpoints.

Next steps

Learn more about extending Confluence's search capabilities with these tutorials:
