Last updated Oct 20, 2022

Create a search extractor

Applicable: This tutorial applies to Confluence 7.0 or higher.
Level of experience: Advanced. You should complete at least one intermediate tutorial before working through this tutorial.

If you are updating your app to be compatible with Confluence 8.0 and newer, see Upgrading for Confluence 8.0.

The Extractor2 plugin module allows you to hook into the mechanism that Confluence uses to populate its search indexes. Each time content is created or updated in Confluence, it is passed through a chain of extractors that assemble the fields and data that will be added to the search indexes for that content. By writing your own extractor you can add information to the content index.

To help you become familiar with the extractor2 module, we've created a simple plugin demonstrating how to create an extractor. In the following sections we'll explain how different parts of the plugin are implemented and fit together.

Before you begin

To complete this tutorial, you'll need to be familiar with:

Source code

You can find the source code for this tutorial on Atlassian Bitbucket.

To clone the repository, run the following command:

git clone https://bitbucket.org/atlassian_tutorial/confluence-extractor2-tutorial.git

Alternatively, you can download the source as a ZIP archive.

This tutorial was last tested with Confluence 7.0 using Atlassian SDK 8.0.2.

Step 1: Implement Extractor2 interface

The ExtraCommentDataExtractor implements two methods, extractText and extractFields. Both receive a searchable object: the Confluence content object (for example, a Page, Attachment, or Comment) that is being saved and passed through the extractor chain.

public class ExtraCommentDataExtractor implements Extractor2 {
    private final CommentManager commentManager;
 
    @Autowired
    public ExtraCommentDataExtractor(@ComponentImport CommentManager commentManager) {
        this.commentManager = requireNonNull(commentManager, "commentManager");
    }
 
    public StringBuilder extractText(Object searchable) {
    ...
    }
 
    public Collection<FieldDescriptor> extractFields(Object searchable) {
    ...
    }
}

The result of the extractText method is appended to the default text field "content", which is used in a regular Confluence site search. The result of the extractFields method is a collection of fields to be added to the search index document.
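The division of labour between the two methods can be illustrated with a toy model. The sketch below is not Confluence code: ToyExtractor, ToyExtractorChain, FixedExtractor and ToyExtractorChainDemo are hypothetical stand-ins, fields are simplified to single string values, and the space separator between text fragments is an assumption made for readability.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: hypothetical stand-in for an extractor, not a Confluence interface.
interface ToyExtractor {
    StringBuilder extractText(Object searchable);
    Map<String, String> extractFields(Object searchable);
}

// Hypothetical chain that assembles one simplified "index document" per object.
class ToyExtractorChain {
    private final List<ToyExtractor> extractors = new ArrayList<>();

    ToyExtractorChain add(ToyExtractor extractor) {
        extractors.add(extractor);
        return this;
    }

    Map<String, String> buildDocument(Object searchable) {
        StringBuilder content = new StringBuilder();
        Map<String, String> document = new LinkedHashMap<>();
        for (ToyExtractor extractor : extractors) {
            // Text from every extractor accumulates in the shared "content" field.
            CharSequence text = extractor.extractText(searchable);
            if (text.length() > 0) {
                if (content.length() > 0) {
                    content.append(' ');
                }
                content.append(text);
            }
            // Fields from every extractor are added to the same document.
            document.putAll(extractor.extractFields(searchable));
        }
        document.put("content", content.toString());
        return document;
    }
}

// Hypothetical extractor that always returns the same text and fields.
class FixedExtractor implements ToyExtractor {
    private final String text;
    private final Map<String, String> fields;

    FixedExtractor(String text, Map<String, String> fields) {
        this.text = text;
        this.fields = fields;
    }

    public StringBuilder extractText(Object searchable) {
        return new StringBuilder(text);
    }

    public Map<String, String> extractFields(Object searchable) {
        return fields;
    }
}

class ToyExtractorChainDemo {
    static Map<String, String> demo() {
        return new ToyExtractorChain()
                .add(new FixedExtractor("page body", Collections.<String, String>emptyMap()))
                .add(new FixedExtractor("first comment", Collections.singletonMap("comment-count", "1")))
                .buildDocument(new Object());
    }
}
```

In this toy model, the text returned by both extractors ends up in the single "content" entry, while the "comment-count" entry sits beside it as a separate field.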

The @Autowired and @ComponentImport annotations on the constructor ask Confluence to inject CommentManager into the extractor when it is created.

Step 2: Add text to the default content field

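The ExtraCommentDataExtractor#extractText method concatenates the bodies of all comments on the given page, separated by spaces, so that comment text becomes part of the page's default searchable content.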

...
    public StringBuilder extractText(Object searchable) {
        StringBuilder builder = new StringBuilder();
        if (searchable instanceof Page) {
            Page page = (Page) searchable;
            builder.append(commentManager.getPageComments(page.getId(), page.getCreationDate()).stream()
                    .map(ContentEntityObject::getBodyAsString)
                    .collect(joining(" ")));
        }
        return builder;
    }
...

Step 3: Create fields of different types

The ExtraCommentDataExtractor#extractFields method demonstrates how to create fields of different types.

  • text field "comment-creator": contains the lower-case user name of each comment's creator, broken into a series of 2-grams by the analyzer.
  • long field "comment-modified": contains a numerical value (milliseconds since the epoch) representing when the last comment in the list was last modified.
  • int field "comment-count": contains the number of comments on the page.
  • double field "comment-score": contains a single real number derived from the average comment length (log1p of the average number of characters per comment).
...  
    public Collection<FieldDescriptor> extractFields(Object searchable) {
        Page page = getPage(searchable);
        if (page == null) {
            return emptyList();
        }
        List<Comment> comments = commentManager.getPageComments(page.getId(), page.getCreationDate());
        if (comments.isEmpty()) {
            return emptyList();
        }
 
        ImmutableList.Builder<FieldDescriptor> builder = ImmutableList.builder();
        AnalyzerDescriptor analyzer = AnalyzerDescriptor.builder(new NGramTokenizerDescriptor(2, 2))
                .build();
        comments.stream()
                .map(ConfluenceEntityObject::getCreator)
                .filter(Objects::nonNull)
                .map(ConfluenceUser::getLowerName)
                .filter(Objects::nonNull)
                .forEach(username -> builder.add(new TextFieldDescriptor("comment-creator", username, FieldDescriptor.Store.YES, analyzer)));
 
        Comment lastComment = comments.get(comments.size() - 1);
        builder.add(new LongFieldDescriptor("comment-modified", lastComment.getLastModificationDate().getTime(), FieldDescriptor.Store.YES));
        builder.add(new IntFieldDescriptor("comment-count", comments.size(), FieldDescriptor.Store.YES));
 
        int commentTextLength = comments.stream()
                .mapToInt(x -> x.getBodyAsString().length())
                .sum();
        double commentScore = Math.log1p((double) commentTextLength / comments.size());
        builder.add(new DoubleFieldDescriptor("comment-score", commentScore, FieldDescriptor.Store.YES));
 
        return builder.build();
    }
...
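To make two of these fields more concrete, here is a small standalone sketch in plain Java. It does not use the Confluence or Lucene APIs: CommentFieldSketch and its methods are hypothetical names, and the bigrams method only approximates what a 2-gram tokenizer with minimum and maximum gram size 2 would emit for simple user names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class for illustration; not part of the tutorial plugin.
class CommentFieldSketch {

    // Splits a value into overlapping 2-character grams, e.g. for "comment-creator".
    static List<String> bigrams(String value) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 2 <= value.length(); i++) {
            grams.add(value.substring(i, i + 2));
        }
        return grams;
    }

    // Mirrors the "comment-score" arithmetic: log(1 + average comment length).
    // Assumes a non-empty list, as the tutorial code returns early when there are no comments.
    static double commentScore(List<String> commentBodies) {
        int totalLength = commentBodies.stream()
                .mapToInt(String::length)
                .sum();
        return Math.log1p((double) totalLength / commentBodies.size());
    }
}
```

For a user name such as "admin", the sketch yields the grams ad, dm, mi and in, which is why a search for a fragment of a user name can match this field. Two comments of 3 and 5 characters average 4 characters, giving a score of log1p(4), roughly 1.609.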

Step 4: Make it visible to Confluence

Here is an example atlassian-plugin.xml file containing a single search extractor:

...
<extractor2 name="extraCommentDataExtractor" key="extraCommentDataExtractor"
                class="com.atlassian.confluence.plugins.extractor.tutorial.ExtraCommentDataExtractor" priority="1100">
</extractor2>
...
  • The class attribute defines the class that will be added to the extractor chain. This class must implement Extractor2.
  • The priority attribute determines the order in which extractors are run. Extractors are run from the highest to lowest priority. Extractors with the same priority may be run in any order.
  • The key is a unique string that identifies the extractor.
  • The name is a human-readable name for the extractor.
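The effect of the priority attribute can be sketched with a plain Java sort. This is only an illustration of "highest priority first": PriorityOrderSketch and Registration are hypothetical names, as are the two extra extractor keys, and this is not how Confluence registers modules internally.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class PriorityOrderSketch {

    // Hypothetical stand-in for a registered extractor2 module.
    static final class Registration {
        final String key;
        final int priority;

        Registration(String key, int priority) {
            this.key = key;
            this.priority = priority;
        }
    }

    // Returns the module keys in the order they would run: highest priority first.
    static List<String> runOrder(List<Registration> registrations) {
        List<Registration> sorted = new ArrayList<>(registrations);
        sorted.sort(Comparator.comparingInt((Registration r) -> r.priority).reversed());
        List<String> keys = new ArrayList<>();
        for (Registration registration : sorted) {
            keys.add(registration.key);
        }
        return keys;
    }
}
```

With priorities 900, 1100 and 1000, the extractor registered with 1100 runs first. The sort in this sketch happens to keep equal priorities in input order, but as noted above, you should not rely on any particular order between extractors that share a priority.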

Step 5: See how it works

One way to see how an extractor works is to debug a running Confluence instance. Here are the key steps:

  1. Start your Confluence instance in debug mode.
  2. Install the plugin containing the extractor.
  3. Attach a debugger to the instance and set breakpoints on the extractText and extractFields methods.
  4. Try to add or modify a page.
  5. Observe that the debugger stops at breakpoints.

Upgrading for Confluence 8.0

Extractor module

The old Extractor module will be removed in Confluence 8.0. It is being replaced by the Extractor2 module. This is part of an initiative to decouple the Confluence search API from the underlying information retrieval implementation, Lucene. This will enable future upgrades to the library without breaking changes to the API.

There will be no loss of functionality when rewriting Extractor classes as Extractor2 classes.

The following example is inspired by how the internal CommentExtractor was rewritten.

Extractor example

public class CommentExtractor implements Extractor {
    @Override
    public void addFields(Document document, StringBuffer defaultSearchableText, Searchable searchable) {
        if (searchable instanceof Comment) {
            Comment comment = (Comment) searchable;
            ContentEntityObject owner = comment.getContainer();

            defaultSearchableText.append(comment.getTitle());

            // only add the URL if this comment belongs to a page, as others currently have no UI
            if (owner instanceof AbstractPage) {
                AbstractPage page = (AbstractPage) owner;
                document.add(new Field(PageContentEntityObjectExtractor.FieldNames.PAGE_URL_PATH, GeneralUtil.getIdBasedPageUrl(page), Field.Store.YES, Field.Index.NO)); // use id based url to avoid dependency on page title (and the link breaking if the page title is renamed)
            }

            if (owner != null) {
                // Add the type of owner this is attached to.
                document.add(new Field(PageContentEntityObjectExtractor.FieldNames.CONTAINER_CONTENT_TYPE, owner.getType(), Field.Store.NO, Field.Index.NOT_ANALYZED));
                document.add(new Field(PageContentEntityObjectExtractor.FieldNames.PAGE_DISPLAY_TITLE, owner.getDisplayTitle(), Field.Store.YES, Field.Index.NO));
            }
        }
    }
}

Extractor2 example

public class CommentExtractor implements Extractor2 {

    @Override
    public StringBuilder extractText(Object searchable) {
        StringBuilder resultBuilder = new StringBuilder();

        if (searchable instanceof Comment) {
            Comment comment = (Comment) searchable;
            // the comment title goes into the default searchable text
            resultBuilder.append(comment.getTitle());
        }
        return resultBuilder;
    }

    @Override
    public Collection<FieldDescriptor> extractFields(Object searchable) {
        final ImmutableList.Builder<FieldDescriptor> resultBuilder = ImmutableList.builder();

        if (searchable instanceof Comment) {
            Comment comment = (Comment) searchable;
            ContentEntityObject owner = comment.getContainer();

            // only add the URL if this comment belongs to a page, as others currently have no UI
            if (owner instanceof AbstractPage) {
                AbstractPage page = (AbstractPage) owner;
                resultBuilder.add(new StoredFieldDescriptor(SearchFieldNames.PAGE_URL_PATH, GeneralUtil.getIdBasedPageUrl(page))); // use id based url to avoid dependency on page title (and the link breaking if the page title is renamed)
            }

            if (owner != null) {
                // Add the type of owner this is attached to.
                resultBuilder.add(new StringFieldDescriptor(SearchFieldNames.CONTAINER_CONTENT_TYPE, owner.getType(), FieldDescriptor.Store.NO));
                resultBuilder.add(new StoredFieldDescriptor(SearchFieldNames.PAGE_DISPLAY_TITLE, owner.getDisplayTitle()));
            }
        }

        return resultBuilder.build();
    }
}

Extract default searchable text independently

Rather than adding to the default searchable text via a StringBuffer, implement the extractText method and return the text in a StringBuilder.

Use FieldDescriptor(s) instead of Document

A FieldDescriptor corresponds to a Field on a Lucene Document. Rather than adding fields to the Document directly, describe the document by returning a Collection of FieldDescriptor objects from extractFields.

XML Lucene configuration

XML configuration files used to define indexed fields for specified content types are being replaced by the Extractor2 module in Confluence 8.0. The new Extractor2 module provides a greater range of functionality.

To learn how, take the example of re-writing a Page.lucene.xml configuration file.

XML config example

<configuration>
    <field type="UnIndexed" fieldName="versionComment" attributeName="versionComment"/>
    <field type="Text" fieldName="content-name-unstemmed" attributeName="title"/>
    <field type="Keyword" fieldName="exact-title" attributeName="title"/>
</configuration>

Extractor2 example

public class PageExtractor implements Extractor2 {

    @Override
    public StringBuilder extractText(Object searchable) {
        return new StringBuilder();
    }

    @Override
    public Collection<FieldDescriptor> extractFields(Object searchable) {
        final ImmutableList.Builder<FieldDescriptor> resultBuilder = ImmutableList.builder();

        if (searchable instanceof Page) {
            Page page = (Page) searchable;
            
            if (page.isVersionCommentAvailable()) {
                resultBuilder.add(new StoredFieldDescriptor("versionComment", page.getVersionComment()));
            }

            String title = page.getTitle();

            // skip the title fields when the title is null or blank
            if (!isBlank(title)) {
                resultBuilder.add(new TextFieldDescriptor("content-name-unstemmed", title, Store.YES));
                resultBuilder.add(new StringFieldDescriptor("exact-title", title, Store.NO));
            }
        }

        return resultBuilder.build();
    }
}

Attribute names

Attribute names can be rewritten as calls to getter methods on the content object. For example, the title attribute becomes getTitle().

Fields

For each of the XML field types, there is an equivalent FieldDescriptor implementation.
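As a quick reference, the correspondence used in the example above can be written down as a lookup table. This sketch is inferred only from that example and is neither an exhaustive nor an official mapping. XmlFieldTypeMapping is a hypothetical helper, and the descriptor names are held as plain strings so that the snippet compiles without the Confluence libraries.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; the mapping is inferred from the PageExtractor example only.
class XmlFieldTypeMapping {

    private static final Map<String, String> DESCRIPTOR_FOR_XML_TYPE = new LinkedHashMap<>();

    static {
        // XML "type" attribute -> FieldDescriptor class used in the example
        DESCRIPTOR_FOR_XML_TYPE.put("UnIndexed", "StoredFieldDescriptor");
        DESCRIPTOR_FOR_XML_TYPE.put("Text", "TextFieldDescriptor");
        DESCRIPTOR_FOR_XML_TYPE.put("Keyword", "StringFieldDescriptor");
    }

    // Returns the descriptor class name for an XML field type shown in this tutorial.
    static String descriptorFor(String xmlType) {
        String descriptor = DESCRIPTOR_FOR_XML_TYPE.get(xmlType);
        if (descriptor == null) {
            throw new IllegalArgumentException("No mapping shown in this tutorial for: " + xmlType);
        }
        return descriptor;
    }
}
```

Any XML field type that does not appear in the example is deliberately rejected here rather than guessed.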

Additional functionality

Extractor2 allows field values to be defined with more complex logic. Rather than a single getter call, an extractor can call multiple services for data and apply multiple transformations. In the above example, blank and null checks are performed before creating the fields.

FieldDescriptor provides the following additional functionality:

  • TextFieldDescriptor can specify custom analysis, rather than relying on the Confluence default.
  • The FieldDescriptor.Store enum lets you choose whether the value of an indexed field is also stored.

Further reading

Learn more about extending Confluence's search capabilities with these tutorials:
