Last updated Jul 8, 2024

Create a search extractor

Applicable: This tutorial applies to Confluence 7.0 or higher.
Level of experience: Advanced. You should complete at least one intermediate tutorial before working through this tutorial.

If you are updating your app to be compatible with Confluence 8.0 or later, see Upgrading for Confluence 8.0.

The Extractor2 plugin module allows you to hook into the mechanism that Confluence uses to populate its search indexes. Each time content is created or updated in Confluence, it is passed through a chain of extractors that assemble the fields and data that will be added to the search indexes for that content. By writing your own extractor you can add information to the content index.

To help you become familiar with the extractor2 module, we've created a simple plugin demonstrating how to create an extractor. In the following sections we'll explain how different parts of the plugin are implemented and fit together.

Before you begin

To complete this tutorial, you'll need to be familiar with:

Source code

You can find the source code for this tutorial on Atlassian Bitbucket.

To clone the repository, run the following command:

git clone https://bitbucket.org/atlassian_tutorial/confluence-extractor2-tutorial.git

Alternatively, you can download the source as a ZIP archive.

This tutorial was last tested with Confluence 7.0 using Atlassian SDK 8.0.2.

Step 1: Implement Extractor2 interface

ExtraCommentDataExtractor implements two methods, extractText and extractFields. Each receives a searchable object: the Confluence content object (for example, a Page, Attachment, or Comment) that is being saved and passed through the extractor chain.

public class ExtraCommentDataExtractor implements Extractor2 {
    private final CommentManager commentManager;
 
    @Autowired
    public ExtraCommentDataExtractor(@ComponentImport CommentManager commentManager) {
        this.commentManager = requireNonNull(commentManager, "commentManager");
    }
 
    public StringBuilder extractText(Object searchable) {
        ...
    }
 
    public Collection<FieldDescriptor> extractFields(Object searchable) {
        ...
    }
}

The result of the extractText method is appended into the default Text field "content" that is used in a regular Confluence site search. The result of the extractFields method is a collection of fields to be added into the Search index document.
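To make this flow concrete, here is a toy model of an extractor chain. It is a sketch under stated assumptions, not Confluence's implementation: MiniExtractor, ChainDemo, the fixed helper, and the map-based document are hypothetical stand-ins for Extractor2, the indexing code, and the index document.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for Extractor2; this is not the Confluence interface.
interface MiniExtractor {
    StringBuilder extractText(Object searchable);
    Map<String, String> extractFields(Object searchable);
}

class ChainDemo {
    // Runs every extractor and assembles one "document":
    // all extracted text is appended to the "content" field,
    // and every other field is added under its own name.
    static Map<String, String> index(Object searchable, List<MiniExtractor> chain) {
        StringBuilder content = new StringBuilder();
        Map<String, String> document = new LinkedHashMap<>();
        for (MiniExtractor extractor : chain) {
            StringBuilder text = extractor.extractText(searchable);
            if (text.length() > 0) {
                if (content.length() > 0) {
                    content.append(' ');
                }
                content.append(text);
            }
            document.putAll(extractor.extractFields(searchable));
        }
        document.put("content", content.toString());
        return document;
    }

    // Hypothetical helper: an extractor that always returns the same text and one field.
    static MiniExtractor fixed(String text, String fieldName, String fieldValue) {
        return new MiniExtractor() {
            public StringBuilder extractText(Object searchable) {
                return new StringBuilder(text);
            }
            public Map<String, String> extractFields(Object searchable) {
                return Map.of(fieldName, fieldValue);
            }
        };
    }
}
```

In this model, two extractors that return "hello" and "world" produce a single content value of "hello world", alongside whatever named fields each one contributed.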

The @Autowired and @ComponentImport annotations on the constructor ask Confluence to inject a CommentManager into the extractor when it is created.

Step 2: Add text to the default content field

The ExtraCommentDataExtractor#extractText method concatenates the bodies of all comments on the given page, separated by spaces.

...
    public StringBuilder extractText(Object searchable) {
        StringBuilder builder = new StringBuilder();
        if (searchable instanceof Page) {
            Page page = (Page) searchable;
            builder.append(commentManager.getPageComments(page.getId(), page.getCreationDate()).stream()
                    .map(ContentEntityObject::getBodyAsString)
                    .collect(joining(" ")));
        }
        return builder;
    }
...
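The concatenation step can be tried outside Confluence on plain strings. This is a minimal sketch: CommentTextJoiner is a hypothetical class, the list of strings stands in for the comment bodies, and the null filter is an addition that the tutorial code does not have.

```java
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

class CommentTextJoiner {
    // Joins comment bodies with single spaces, skipping null bodies.
    static StringBuilder join(List<String> bodies) {
        return new StringBuilder(bodies.stream()
                .filter(Objects::nonNull)
                .collect(Collectors.joining(" ")));
    }
}
```

An empty list yields an empty StringBuilder, which matches what extractText returns for content that is not a Page.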

Step 3: Create fields of different types

The ExtraCommentDataExtractor#extractFields demonstrates how to create fields of different types.

First we'll define the mapping for each field:

  • text field "comment-creator": contains multiple values that result from breaking each comment creator's user name into a series of 2-grams.
  • date field "comment-modified": contains the time (in 'yyyyMMddHHmmssSSS' format) when the last comment on the page was last modified.
  • int field "comment-count": contains the number of comments on the page.
  • double field "comment-score": contains a single real number derived from the average length of a comment (the code below stores log1p of the average).
public class ExtraCommentFields implements FieldMappingsProvider {
    public static final TextFieldMapping CREATOR = TextFieldMapping.builder("comment-creator").store(true).analyzer(new TwoGramAnalyzerDescriptor()).build();
    public static final DateFieldMapping MODIFIED = DateFieldMapping.builder("comment-modified").store(true).build();
    public static final IntFieldMapping COUNT = IntFieldMapping.builder("comment-count").store(true).build();
    public static final DoubleFieldMapping SCORE = DoubleFieldMapping.builder("comment-score").store(true).build();

    @Override
    public Collection<FieldMapping> getFieldMappings() {
        return List.of(MODIFIED, COUNT, CREATOR, SCORE);
    }
}
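The tutorial does not show what TwoGramAnalyzerDescriptor does internally. As an illustration only, assuming character-level 2-grams, a user name could be broken up as follows. TwoGrams is a hypothetical class, and the real analyzer may treat case, short tokens, or token boundaries differently.

```java
import java.util.ArrayList;
import java.util.List;

class TwoGrams {
    // Returns every run of two adjacent characters,
    // for example "admin" -> ad, dm, mi, in.
    static List<String> of(String value) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 2 <= value.length(); i++) {
            grams.add(value.substring(i, i + 2));
        }
        return grams;
    }
}
```

Indexing 2-grams like these is what lets a search for a fragment of a user name match the comment-creator field.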

The ExtraCommentDataExtractor#extractFields extracts the fields for each page.

...  
    public Collection<FieldDescriptor> extractFields(Object searchable) {
        Page page = getPage(searchable);
        if (page == null) {
            return emptyList();
        }
        List<Comment> comments = commentManager.getPageComments(page.getId(), page.getCreationDate());
        if (comments.isEmpty()) {
            return emptyList();
        }
 
        ImmutableList.Builder<FieldDescriptor> builder = ImmutableList.builder();
        comments.stream()
                .map(ConfluenceEntityObject::getCreator)
                .filter(Objects::nonNull)
                .map(ConfluenceUser::getLowerName)
                .filter(Objects::nonNull)
                .forEach(username -> builder.add(ExtraCommentFields.CREATOR.createField(username)));

        Comment lastComment = comments.get(comments.size() - 1);
        builder.add(ExtraCommentFields.MODIFIED.createField(lastComment.getLastModificationDate()));
        builder.add(ExtraCommentFields.COUNT.createField(comments.size()));

        int commentTextLength = comments.stream()
                .mapToInt(x -> x.getBodyAsString().length())
                .sum();
        double commentScore = Math.log1p((double) commentTextLength / comments.size());
        builder.add(ExtraCommentFields.SCORE.createField(commentScore));

        return builder.build();
    }
...
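Two of the values above can be reproduced in isolation. The sketch below formats a timestamp with the 'yyyyMMddHHmmssSSS' pattern and computes the same log1p score on plain strings. CommentFieldValues is a hypothetical class, and how Confluence itself converts dates or which time zone it applies is not covered by this tutorial.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.List;

class CommentFieldValues {
    private static final DateTimeFormatter TIMESTAMP =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmssSSS");

    // Formats a timestamp as year, month, day, hour, minute, second, millisecond.
    static String formatTimestamp(LocalDateTime time) {
        return TIMESTAMP.format(time);
    }

    // log1p of the average body length, as in extractFields.
    // The list must not be empty; extractFields returns early in that case.
    static double score(List<String> bodies) {
        int total = bodies.stream().mapToInt(String::length).sum();
        return Math.log1p((double) total / bodies.size());
    }
}
```

Using log1p keeps the score at zero for empty comment bodies and dampens the effect of very long comments.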

Step 4: Make it visible to Confluence

Here is an example atlassian-plugin.xml file containing a single search extractor:

...
<field-mappings-provider key="extraCommentFields" index="CONTENT"
            class="com.atlassian.confluence.plugins.extractor.tutorial.ExtraCommentFields" />

<extractor2 name="extraCommentDataExtractor" key="extraCommentDataExtractor"
                class="com.atlassian.confluence.plugins.extractor.tutorial.ExtraCommentDataExtractor" priority="1100">
</extractor2>
...
  • The class attribute defines the class that will be added to the extractor chain. This class must implement Extractor2.
  • The priority attribute determines the order in which extractors are run. Extractors are run from the highest to lowest priority. Extractors with the same priority may be run in any order.
  • The key is a unique string that identifies the extractor.
  • The name is the extractor name.
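The priority rule can be pictured as a descending sort. The following is a sketch of that ordering only: ExtractorEntry and PriorityOrder are hypothetical classes and say nothing about how Confluence stores or schedules extractors.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical pairing of an extractor key with its priority attribute.
class ExtractorEntry {
    final String key;
    final int priority;

    ExtractorEntry(String key, int priority) {
        this.key = key;
        this.priority = priority;
    }
}

class PriorityOrder {
    // Returns extractor keys ordered from highest to lowest priority.
    static List<String> runOrder(List<ExtractorEntry> entries) {
        List<ExtractorEntry> sorted = new ArrayList<>(entries);
        sorted.sort(Comparator.comparingInt((ExtractorEntry e) -> e.priority).reversed());
        List<String> keys = new ArrayList<>();
        for (ExtractorEntry entry : sorted) {
            keys.add(entry.key);
        }
        return keys;
    }
}
```

With a priority of 1100, extraCommentDataExtractor would run before any extractor registered with a lower number. Because extractors with equal priority may run in any order, do not rely on the tie order this sketch happens to produce.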

Step 5: See how it works

One way to see how an extractor works is to debug a running Confluence instance. Here are the key steps:

  1. Start your Confluence instance in debug mode.
  2. Install the plugin containing the extractor.
  3. Attach a debugger to the instance and set breakpoints on the extractText and extractFields methods.
  4. Add or modify a page.
  5. Observe that the debugger stops at the breakpoints.

Upgrading for Confluence 8.0

Extractor module

The old Extractor module is removed in Confluence 8.0 and replaced by the Extractor2 module. This is part of an initiative to make the Confluence search API independent of the underlying information retrieval implementation, Lucene. This enables future upgrades to the library without breaking changes to the API.

There is no loss of functionality when rewriting Extractor classes as Extractor2 classes.

To see how, consider the following example, inspired by how the internal CommentExtractor was rewritten.

Extractor example

public class CommentExtractor implements Extractor {
    @Override
    public void addFields(Document document, StringBuffer defaultSearchableText, Searchable searchable) {
        if (searchable instanceof Comment) {
            Comment comment = (Comment) searchable;
            ContentEntityObject owner = comment.getContainer();
            
            defaultSearchableText.append(comment.getTitle());

            // only add the URL if this comment belongs to a page as others currently have no UI
            if (owner instanceof AbstractPage) {
                AbstractPage page = (AbstractPage) owner;
                document.add(new Field(PageContentEntityObjectExtractor.FieldNames.PAGE_URL_PATH, GeneralUtil.getIdBasedPageUrl(page), Field.Store.YES, Field.Index.NO)); // use id based url to avoid dependency on page title (and the link breaking if the page title is renamed)
            }

            if (owner != null) {
                // Add the type of owner this is attached to.
                document.add(new Field(PageContentEntityObjectExtractor.FieldNames.CONTAINER_CONTENT_TYPE, owner.getType(), Field.Store.NO, Field.Index.NOT_ANALYZED));
                document.add(new Field(PageContentEntityObjectExtractor.FieldNames.PAGE_DISPLAY_TITLE, owner.getDisplayTitle(), Field.Store.YES, Field.Index.NO));
            }
        }
    }
}

Extractor2 example

public class CommentExtractor implements Extractor2 {

    @Override
    public StringBuilder extractText(Object searchable) {
        StringBuilder resultBuilder = new StringBuilder();
        
        if (searchable instanceof Comment) {
            Comment comment = (Comment) searchable;
            resultBuilder.append(comment.getTitle());
        }
        return resultBuilder;
    }

    @Override
    public Collection<FieldDescriptor> extractFields(Object searchable) {
        final ImmutableList.Builder<FieldDescriptor> resultBuilder = ImmutableList.builder();

        if (searchable instanceof Comment) {
            Comment comment = (Comment) searchable;
            ContentEntityObject owner = comment.getContainer();

            //only add the URL if this comment belongs to a page as others currently have no UI
            if (owner instanceof AbstractPage) {
                AbstractPage page = (AbstractPage) owner;
                resultBuilder.add(SearchFieldMappings.PAGE_URL_PATH.createField(GeneralUtil.getIdBasedPageUrl(page))); // use id based url to avoid dependency on page title (and the link breaking if the page title is renamed)
            }

            if (owner != null) {
                resultBuilder.add(SearchFieldMappings.CONTAINER_CONTENT_TYPE.createField(owner.getType()));
                resultBuilder.add(SearchFieldMappings.PAGE_DISPLAY_TITLE.createField(owner.getDisplayTitle()));
            }
        }

        return resultBuilder.build();
    }
}

Extract default searchable text independently

Rather than appending to the default searchable text through a StringBuffer parameter, implement the extractText method and return a StringBuilder containing the text.

Use FieldMapping to define your field

A FieldMapping corresponds to a Mapping on OpenSearch. For each different field type, there is an equivalent FieldMapping implementation. You can use the createField(value) method on the FieldMapping to create a FieldDescriptor.

Use FieldDescriptor(s) instead of Document

A FieldDescriptor corresponds to a Field on an individual Document in Lucene or OpenSearch. Rather than creating the Document, describe the document with a Collection of FieldDescriptor.

XML Lucene configuration

XML configuration files used to define indexed fields for specified content types are being replaced by the Extractor2 module in Confluence 8.0. The new Extractor2 module provides a greater range of functionality.

The following example shows how a Page.lucene.xml configuration file can be rewritten as an Extractor2.

XML config example

<configuration>
    <field type="UnIndexed" fieldName="versionComment" attributeName="versionComment"/>
    <field type="Text" fieldName="content-name-unstemmed" attributeName="title"/>
    <field type="Keyword" fieldName="exact-title" attributeName="title"/>
</configuration>

Extractor2 example

public class PageExtractor implements Extractor2 {

    @Override
    public StringBuilder extractText(Object searchable) {
        return new StringBuilder();
    }

    @Override
    public Collection<FieldDescriptor> extractFields(Object searchable) {
        final ImmutableList.Builder<FieldDescriptor> resultBuilder = ImmutableList.builder();

        if (searchable instanceof Page) {
            Page page = (Page) searchable;
            
            if (page.isVersionCommentAvailable()) {
                resultBuilder.add(SearchFieldMappings.LAST_UPDATE_DESCRIPTION.createField(page.getVersionComment())); 
            }
            
            String title = page.getTitle();

            if (!isBlank(title)) {
                resultBuilder.add(SearchFieldMappings.UNSTEMMED_TITLE_FIELD_NAME.createField(title));
                resultBuilder.add(SearchFieldMappings.EXACT_TITLE.createField(title));
            }
        }

        return resultBuilder.build();
    }
}

Attribute names

Attribute names in the XML configuration map to getter methods on the content object. For example, the title attribute becomes a call to getTitle().
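The mapping follows the usual JavaBeans naming convention. As a small sketch, the getter name for an attribute can be derived as shown here. GetterNames is a hypothetical helper, and you should still confirm in the Javadoc that the getter exists on the content class, since boolean properties, for instance, may use an is prefix instead.

```java
class GetterNames {
    // "title" -> "getTitle", following the JavaBeans naming convention.
    // The attribute name must not be empty.
    static String getterFor(String attributeName) {
        return "get"
                + Character.toUpperCase(attributeName.charAt(0))
                + attributeName.substring(1);
    }
}
```

Applied to the XML example above, versionComment corresponds to getVersionComment() and title to getTitle(), which are the calls used in the PageExtractor example.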

Additional functionality

Extractor2 allows definition of field values with more complex logic. Rather than a getter function call, multiple services can be called for data and multiple transformations can be done. In the above example, blank and null checks are performed before creating the fields.

FieldMapping provides the following additional functionality:

  • TextFieldMapping can specify a custom analyzer rather than relying on the Confluence default.
  • TextFieldMapping.isStored() lets you choose whether or not an indexed field is stored.

Further reading

Learn more about extending Confluence's search capabilities with these tutorials:
