Applicable: This tutorial applies to Confluence 7.0 or higher.
Level of experience: Advanced. You should complete at least one intermediate tutorial before working through this tutorial.
If you are writing an extractor that indexes the contents of a particular attachment type (for example, OpenOffice documents or Flash files), you should create an AttachmentTextExtractor module. Unlike with a normal extractor (see the Extractor2 module), the text extracted by this plugin module is stored as an artifact along with the attachment content. It can be reused to avoid an unnecessary extraction each time the metadata of a page containing the attachment is modified.
To complete this tutorial, you'll need to be familiar with command-line tools such as rm, ls, and curl.
You can find the source code for this tutorial on Atlassian Bitbucket.
To clone the repository, run the following command:
```shell
git clone https://bitbucket.org/atlassian_tutorial/confluence-attachment-text-extractor-tutorial.git
```
Alternatively, you can download the source as a ZIP archive.
This tutorial was last tested with Confluence 7.0 using the Atlassian SDK 8.0.2.
AttachmentTextExtractor interface
First we need to implement the AttachmentTextExtractor interface.
The SimpleAttachmentTextExtractor class implements AttachmentTextExtractor, which has three methods: getFileExtensions, getMimeTypes, and extract.
```java
public class SimpleAttachmentTextExtractor implements AttachmentTextExtractor {
    private final AttachmentManager attachmentManager;

    @Autowired
    public SimpleAttachmentTextExtractor(@ComponentImport AttachmentManager attachmentManager) {
        this.attachmentManager = requireNonNull(attachmentManager);
    }

    @Override
    public List<String> getFileExtensions() {
        return Collections.singletonList("java");
    }

    @Override
    public List<String> getMimeTypes() {
        return Collections.singletonList("text/java");
    }

    @Override
    public Optional<InputStreamSource> extract(Attachment attachment) {
        try (InputStream is = attachmentManager.getAttachmentData(attachment)) {
            if (is != null) {
                String text = IOUtils.toString(is, StandardCharsets.UTF_8);
                return Optional.of(() -> IOUtils.toInputStream(text, StandardCharsets.UTF_8));
            }
            return Optional.empty();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```
AttachmentTextExtractor#getFileExtensions and AttachmentTextExtractor#getMimeTypes indicate which types of attached files the extractor can work with.
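To make the role of these two methods concrete, here is a minimal, self-contained sketch of how a caller could decide whether an extractor applies to a file. The `TypeDeclaration` interface and the `supports` helper are hypothetical stand-ins, and the matching rule (extension or MIME type, compared case-insensitively against lower-case declarations) is an assumption for illustration, not Confluence's documented behavior.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class ExtractorMatchSketch {

    /** Hypothetical stand-in for the two type-declaring methods of AttachmentTextExtractor. */
    public interface TypeDeclaration {
        List<String> getFileExtensions();
        List<String> getMimeTypes();
    }

    /**
     * Assumed matching rule (not taken from Confluence): the extractor applies when either
     * the file extension or the MIME type is declared. Declarations are assumed lower case.
     */
    public static boolean supports(TypeDeclaration extractor, String fileName, String mimeType) {
        int dot = fileName.lastIndexOf('.');
        String extension = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        boolean extensionMatch = extractor.getFileExtensions().contains(extension);
        boolean mimeMatch = mimeType != null
                && extractor.getMimeTypes().contains(mimeType.toLowerCase(Locale.ROOT));
        return extensionMatch || mimeMatch;
    }

    /** Declares the same types as SimpleAttachmentTextExtractor in this tutorial. */
    public static TypeDeclaration javaSources() {
        return new TypeDeclaration() {
            @Override
            public List<String> getFileExtensions() {
                return Arrays.asList("java");
            }

            @Override
            public List<String> getMimeTypes() {
                return Arrays.asList("text/java");
            }
        };
    }

    public static void main(String[] args) {
        TypeDeclaration extractor = javaSources();
        // Matches on the file extension even though the MIME type is generic.
        System.out.println(supports(extractor, "Example.java", "application/octet-stream"));
        // Neither the extension nor the MIME type is declared.
        System.out.println(supports(extractor, "notes.txt", "text/plain"));
    }
}
```

Declaring both an extension and a MIME type, as the tutorial's extractor does, covers attachments whose MIME type was detected generically as well as those uploaded without a recognizable extension.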
The AttachmentTextExtractor#extract method performs the extraction and returns the extracted text if it can. Otherwise it returns Optional.empty() to indicate that the extraction failed for a transient reason, so that Confluence can reattempt the extraction at a later time.
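The shape of the return value can be explored without a Confluence instance. The sketch below follows the same pattern as the tutorial's extract method using only the JDK: the content is buffered once, and the returned source hands out a fresh stream on every call. `TextSource` is a hypothetical stand-in for the InputStreamSource type, and `readFully` is a helper introduced only for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

public class ExtractContractSketch {

    /** Hypothetical stand-in for the InputStreamSource type returned by extract. */
    @FunctionalInterface
    public interface TextSource {
        InputStream getInputStream();
    }

    /** Returns a re-readable source when data is available, or empty when it is not. */
    public static Optional<TextSource> extract(InputStream data) {
        if (data == null) {
            // Nothing to read right now; an empty result lets the caller try again later.
            return Optional.empty();
        }
        byte[] bytes = readFully(data);
        // The bytes are buffered once, so each getInputStream() call starts from the beginning.
        return Optional.of(() -> new ByteArrayInputStream(bytes));
    }

    /** Reads the whole stream into memory, wrapping I/O failures in an unchecked exception. */
    public static byte[] readFully(InputStream in) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        InputStream data = new ByteArrayInputStream("class A {}".getBytes(StandardCharsets.UTF_8));
        Optional<TextSource> result = extract(data);
        if (result.isPresent()) {
            byte[] text = readFully(result.get().getInputStream());
            System.out.println(new String(text, StandardCharsets.UTF_8));
        }
        System.out.println(extract(null).isPresent());
    }
}
```

Returning a source rather than an already-open stream means the consumer can read the extracted text more than once, which fits the idea of storing it as a reusable artifact.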
Now we need to make the extractor visible to Confluence.
Here is an example atlassian-plugin.xml file containing an attachment text extractor:
```xml
...
<attachment-text-extractor name="simpleAttachmentTextExtractor"
                           key="simpleAttachmentTextExtractor"
                           class="com.atlassian.confluence.plugins.extractor.tutorial.SimpleAttachmentTextExtractor"
                           priority="1100">
</attachment-text-extractor>
...
```
The class attribute defines the class that will be added to the extractor chain. This class must implement AttachmentTextExtractor.
The priority attribute determines the order in which extractors are run. Extractors are run from the highest to lowest priority. Extractors with the same priority may be run in any order.
The key is a unique string that identifies the extractor.
The name is the extractor name.
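The effect of the priority attribute can be illustrated with a short sketch that orders extractors from highest to lowest priority. The `Registered` class, the extra module keys, and their priority values are hypothetical; only `simpleAttachmentTextExtractor` with priority 1100 comes from the descriptor in this tutorial.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ExtractorPrioritySketch {

    /** Hypothetical descriptor pairing a module key with its priority attribute. */
    public static final class Registered {
        final String key;
        final int priority;

        public Registered(String key, int priority) {
            this.key = key;
            this.priority = priority;
        }
    }

    /** Returns the module keys ordered from highest to lowest priority. */
    public static List<String> runOrder(List<Registered> extractors) {
        List<Registered> sorted = new ArrayList<>(extractors);
        sorted.sort(Comparator.comparingInt((Registered r) -> r.priority).reversed());
        List<String> keys = new ArrayList<>();
        for (Registered r : sorted) {
            keys.add(r.key);
        }
        return keys;
    }

    public static void main(String[] args) {
        List<Registered> extractors = Arrays.asList(
                new Registered("hypotheticalLowPriority", 900),
                new Registered("simpleAttachmentTextExtractor", 1100),
                new Registered("hypotheticalMidPriority", 1000));
        System.out.println(runOrder(extractors));
    }
}
```

This sketch happens to keep registration order for equal priorities because the sort is stable, but extractors with the same priority may be run in any order, so a plugin should not rely on tie-breaking.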
One way to see how an extractor works is to attach a debugger to a running Confluence instance. Here are the key steps:
extractText and extractFields methods.
Learn more about extending Confluence's search capabilities with these tutorials: