Create an attachment text extractor

Applicable:	This tutorial applies to Confluence 7.0 or higher.
Level of experience:	Advanced. You should complete at least one intermediate tutorial before working through this tutorial.

If you are writing an extractor that indexes the contents of a particular attachment type (for example, OpenOffice documents or Flash files), create an AttachmentTextExtractor module. Unlike a normal extractor (see Extractor2 module), the text extracted by this plugin module is stored as an artifact alongside the attachment content. The stored text can be reused, which avoids an unnecessary extraction each time the metadata of a page containing the attachment is modified.
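
The reuse described above can be pictured as a cache keyed by the attachment's content version. The following is a minimal sketch of that idea only, not Confluence's storage mechanism: the ExtractedTextCache class, its key format, and the extraction counter are all hypothetical names introduced for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical illustration only: Confluence's real artifact storage is not shown in this tutorial.
final class ExtractedTextCache {
    private final Map<String, String> textByContentKey = new HashMap<>();
    private int extractionCount = 0;

    // Returns the stored text for this attachment content, running the extractor only on a miss.
    String getOrExtract(long attachmentId, int version, Supplier<String> extractor) {
        String key = attachmentId + ":" + version; // assumed key: id plus content version
        return textByContentKey.computeIfAbsent(key, k -> {
            extractionCount++;
            return extractor.get();
        });
    }

    // Number of times an extractor actually ran.
    int extractionCount() {
        return extractionCount;
    }
}
```

In this sketch, a page metadata change leaves the attachment id and version untouched, so the second lookup returns the stored text without running the extractor again.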

Before you begin

To complete this tutorial, you'll need to be familiar with:

The Atlassian SDK
Basic Unix command-line utilities: rm, ls, curl
The basics of Java development - classes, interfaces, methods, how to use the compiler, and so on
The basics of Atlassian plugin development - Confluence plugin modules, Atlassian Spring Scanner

Source code

You can find the source code for this tutorial on Atlassian Bitbucket.

To clone the repository, run the following command:

git clone https://bitbucket.org/atlassian_tutorial/confluence-attachment-text-extractor-tutorial.git

Alternatively, you can download the source as a ZIP archive.

This tutorial was last tested with Confluence 7.0 using the Atlassian SDK 8.0.2.

Step 1: Implement the `AttachmentTextExtractor` interface

First we need to implement the AttachmentTextExtractor interface.

The SimpleAttachmentTextExtractor class implements AttachmentTextExtractor, which has three methods:

getFileExtensions
getMimeTypes
extract

public class SimpleAttachmentTextExtractor implements AttachmentTextExtractor {
    private final AttachmentManager attachmentManager;

    @Autowired
    public SimpleAttachmentTextExtractor(@ComponentImport AttachmentManager attachmentManager) {
        this.attachmentManager = requireNonNull(attachmentManager);
    }

    // File extensions this extractor can handle.
    @Override
    public List<String> getFileExtensions() {
        return Collections.singletonList("java");
    }

    // MIME types this extractor can handle.
    @Override
    public List<String> getMimeTypes() {
        return Collections.singletonList("text/java");
    }

    @Override
    public Optional<InputStreamSource> extract(Attachment attachment) {
        try (InputStream is = attachmentManager.getAttachmentData(attachment)) {
            if (is != null) {
                // Read the whole file while the stream is open, then hand back
                // a source that re-creates a stream over the text on demand.
                String text = IOUtils.toString(is, StandardCharsets.UTF_8);
                return Optional.of(() -> IOUtils.toInputStream(text, StandardCharsets.UTF_8));
            }
            // No data available: signal that extraction can be retried later.
            return Optional.empty();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}

AttachmentTextExtractor#getFileExtensions and AttachmentTextExtractor#getMimeTypes are used to indicate which type of attached files the extractor can work with.
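
To make the role of these two lists concrete, here is a rough sketch of how a file might be matched against them. The matching logic lives inside Confluence and is not described in this tutorial, so the ExtractorMatchSketch class, the canHandle method, the either-or rule, and the case-insensitive extension comparison are all assumptions made for illustration.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical matcher: the real selection logic lives inside Confluence and may differ.
final class ExtractorMatchSketch {
    // Assumption: a file matches when either its extension or its MIME type is listed.
    static boolean canHandle(String fileName, String mimeType,
                             List<String> extensions, List<String> mimeTypes) {
        int dot = fileName.lastIndexOf('.');
        // Files without a dot have no extension; compare extensions in lower case.
        String extension = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return extensions.contains(extension) || mimeTypes.contains(mimeType);
    }
}
```

With the lists returned by SimpleAttachmentTextExtractor, this sketch would accept a file named Foo.java regardless of its MIME type, and would accept any file reported as text/java regardless of its name.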

The AttachmentTextExtractor#extract method performs the extraction and returns the extracted text when it succeeds. Otherwise, it returns Optional.empty() to indicate that the extraction failed for a transient reason, so Confluence can reattempt the extraction at a later time.
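
The contract of extract can be tried out without a running Confluence instance. The sketch below mirrors the structure of SimpleAttachmentTextExtractor#extract using only the JDK. TextSource is a hypothetical stand-in for InputStreamSource (the real interface may differ), and ExtractContractSketch and its methods are illustrative names, not part of the Confluence API.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

final class ExtractContractSketch {
    // Hypothetical stand-in for Confluence's InputStreamSource.
    interface TextSource {
        InputStream open();
    }

    // Returns a re-readable source when data is available, or empty when there is nothing to read.
    static Optional<TextSource> extract(InputStream data) {
        if (data == null) {
            return Optional.empty(); // caller may retry later
        }
        try {
            byte[] bytes = data.readAllBytes(); // requires Java 9 or later
            return Optional.of(() -> new ByteArrayInputStream(bytes));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads a source back into a string, as a consumer of the extracted text might.
    static String read(TextSource source) {
        try (InputStream in = source.open()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the bytes are buffered before the source is returned, each call to open yields a fresh stream over the same text, so the result can be read more than once.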

Step 2: Make it visible to Confluence

Now we need to make the extractor visible to Confluence.

Here is an example atlassian-plugin.xml file containing an attachment text extractor:

...
<attachment-text-extractor name="simpleAttachmentTextExtractor" key="simpleAttachmentTextExtractor"
    class="com.atlassian.confluence.plugins.extractor.tutorial.SimpleAttachmentTextExtractor" priority="1100">
</attachment-text-extractor>
...

The class attribute defines the class that will be added to the extractor chain. This class must implement AttachmentTextExtractor.
The priority attribute determines the order in which extractors are run. Extractors are run from the highest to lowest priority. Extractors with the same priority may be run in any order.
The key attribute is a unique string that identifies the extractor.
The name attribute is the human-readable name of the extractor.
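
The effect of the priority attribute can be illustrated with a small sort. This is a sketch under stated assumptions rather than Confluence's implementation: the Descriptor class and the runOrder method are hypothetical, as are the keys and priorities other than simpleAttachmentTextExtractor at 1100.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class PriorityOrderSketch {
    // Hypothetical descriptor pairing a module key with its priority attribute.
    static final class Descriptor {
        final String key;
        final int priority;

        Descriptor(String key, int priority) {
            this.key = key;
            this.priority = priority;
        }
    }

    // Returns module keys ordered from highest to lowest priority.
    static List<String> runOrder(List<Descriptor> descriptors) {
        List<Descriptor> sorted = new ArrayList<>(descriptors);
        sorted.sort(Comparator.comparingInt((Descriptor d) -> d.priority).reversed());
        List<String> keys = new ArrayList<>();
        for (Descriptor d : sorted) {
            keys.add(d.key);
        }
        return keys;
    }
}
```

This sort happens to keep equal-priority entries in their input order, but that is a property of the sketch only. As stated above, extractors with the same priority may be run in any order, so do not rely on ties being resolved a particular way.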

Step 3: See how it works

One way to see how an extractor works is to debug a running Confluence instance. Here are the key steps:

Start your Confluence instance in debug mode.
Install the plugin containing the extractor.
Attach a debugger to the instance and set a breakpoint on the extract method of your extractor.
Attach a file of a type the extractor handles (for example, a .java file) to a page.
Observe that the debugger stops at the breakpoint.

Next steps

Learn more about extending Confluence's search capabilities with these tutorials: