Extractor Module

Available:

Confluence 1.4 and later

Extractor plugins allow you to hook into the mechanism by which Confluence populates its search index. Each time content is created or updated in Confluence, it is passed through a chain of extractors that assemble the fields and data that will be added to the search index for that content. By writing your own extractor you can add information to the index.

Extractor plugins can be used to extract the content from attachment types that Confluence does not support.

Confluence's internal search is built on top of the Lucene Java library. While familiarity with Lucene is not an absolute requirement for writing an extractor plugin, you'll need it to write anything more than the most basic of plugins.

Extractor Plugins

Here is an example atlassian-plugin.xml file containing a single search extractor:

<atlassian-plugin name="Sample Extractor" key="confluence.extra.extractor">
    ...
    <extractor name="Page Metadata Extractor" key="pageMetadataExtractor" 
               class="confluence.extra.extractor.PageMetadataExtractor" priority="1000">
        <description>Extracts certain keys from a page's metadata and adds them to the search index.</description>
    </extractor>
    ...
</atlassian-plugin>
  • The class attribute defines the class that will be added to the extractor chain. This class must implement bucket.search.lucene.Extractor.
  • The priority attribute determines the order in which extractors are run. Extractors are run from the highest to the lowest priority. Extractors with the same priority may be run in any order.

As a general rule, all extractors should have priorities below 1000, unless you are writing an extractor for a new attachment type, in which case it should be greater than 1000.

If you are not sure what priority to choose, just go with priority="900" for regular extractors, and priority="1200" for attachment content extractors.
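The ordering rule above can be modelled in a few lines of plain Java. This is only a sketch of the rule as described, not Confluence's actual scheduling code: the class PrioritySketch, its Entry type, and the extractor keys other than pageMetadataExtractor are hypothetical names made up for the illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical model of the priority rule; none of these types exist in Confluence.
class PrioritySketch
{
    // Stand-in for one <extractor> declaration: its key and its priority attribute.
    static final class Entry
    {
        final String key;
        final int priority;

        Entry(String key, int priority)
        {
            this.key = key;
            this.priority = priority;
        }
    }

    // Returns the extractor keys in run order: highest priority first.
    static List<String> runOrder(List<Entry> entries)
    {
        List<Entry> sorted = new ArrayList<>(entries);
        sorted.sort(Comparator.comparingInt((Entry e) -> e.priority).reversed());

        List<String> keys = new ArrayList<>();
        for (Entry e : sorted)
        {
            keys.add(e.key);
        }
        return keys;
    }

    public static void main(String[] args)
    {
        List<Entry> entries = new ArrayList<>();
        entries.add(new Entry("regularExtractor", 900));
        entries.add(new Entry("attachmentExtractor", 1200));
        entries.add(new Entry("pageMetadataExtractor", 1000));

        // prints [attachmentExtractor, pageMetadataExtractor, regularExtractor]
        System.out.println(runOrder(entries));
    }
}
```

In this model the attachment content extractor at 1200 runs before the regular extractor at 900, which matches the suggested defaults. The sort used here happens to keep equal priorities in insertion order, but that is a property of the sketch only; as noted above, extractors with the same priority may be run in any order.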

To see the priorities of the extractors that are built into Confluence, look in WEB-INF/classes/plugins/core-extractors.xml and WEB-INF/classes/plugins/attachment-extractors.xml. From Confluence 2.6.0, these files are packaged inside confluence-2.6.0.jar; we have instructions for Editing Files within JAR Archives if you're unfamiliar with the process.

The Extractor Interface

All extractors must implement the following interface:

package bucket.search.lucene;

import bucket.search.Searchable;
import org.apache.lucene.document.Document;

public interface Extractor
{
    public void addFields(Document document, StringBuffer defaultSearchableText, Searchable searchable);
}
  • The document parameter is the Lucene document that will be added to the search index for the object that is being saved. You can add fields to this document, and the fields will be associated with the object in the index.
  • The defaultSearchableText is the main body of text that is associated with this object in the search index. It is stored in the index as a Text field with the key "content". If you want to add text to the index such that the object can be found by a regular Confluence site search, append it to the defaultSearchableText. (Remember to also append a trailing space; otherwise your text will run into the next piece of text that is added and the two words will be indexed as one!)
  • The searchable is the object that is being saved, and passed through the extractor chain.
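The trailing-space advice for defaultSearchableText is easy to see with a plain StringBuffer, with no Confluence or Lucene classes involved. In this minimal sketch the class SearchableTextSketch and the helper appendWithSpace are hypothetical names used only for the illustration.

```java
// Hypothetical demonstration class; it only uses java.lang.StringBuffer.
class SearchableTextSketch
{
    // Appends a value followed by a separating space, as an extractor should.
    static void appendWithSpace(StringBuffer defaultSearchableText, String value)
    {
        defaultSearchableText.append(value).append(" ");
    }

    public static void main(String[] args)
    {
        // Two extractors that forget the trailing space: the words run together.
        StringBuffer joined = new StringBuffer();
        joined.append("draft");
        joined.append("Quarterly");
        System.out.println("[" + joined + "]"); // prints [draftQuarterly]

        // Two extractors that each append a trailing space: the words stay separate.
        StringBuffer separated = new StringBuffer();
        appendWithSpace(separated, "draft");
        appendWithSpace(separated, "Quarterly");
        System.out.println("[" + separated + "]"); // prints [draft Quarterly ]
    }
}
```

Because every extractor in the chain appends to the same buffer, a missing space affects whichever extractor happens to run next, not only your own text.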

Attachment Content Extractors

If you are writing an extractor that indexes the contents of a particular attachment type (for example, OpenOffice documents or Flash files), you should extend the abstract class bucket.search.lucene.extractor.BaseAttachmentContentExtractor. This class ensures that only one attachment content extractor successfully runs against any file (you can manipulate the priorities of attachment content extractors to make sure they run in the right order).
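The "only one extractor succeeds per file" behaviour can be pictured as a first-match walk over the chain in priority order. The sketch below models only that observable behaviour; it does not show how BaseAttachmentContentExtractor is implemented, and FirstMatchSketch, ContentExtractor, and extractOnce are hypothetical names that do not exist in Confluence.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of "first successful extractor wins"; not Confluence code.
class FirstMatchSketch
{
    // Stand-in for an attachment content extractor.
    interface ContentExtractor
    {
        // Returns the extracted text, or null if this extractor cannot handle the file.
        String extract(String fileName);
    }

    // Tries each extractor in the order given (assumed to be priority order)
    // and stops at the first one that produces text.
    static String extractOnce(List<ContentExtractor> chain, String fileName)
    {
        for (ContentExtractor extractor : chain)
        {
            String text = extractor.extract(fileName);
            if (text != null)
            {
                return text;
            }
        }
        return null;
    }

    public static void main(String[] args)
    {
        ContentExtractor flash = name -> name.endsWith(".swf") ? "flash text" : null;
        ContentExtractor fallback = name -> "generic text";
        List<ContentExtractor> chain = Arrays.asList(flash, fallback);

        System.out.println(extractOnce(chain, "intro.swf")); // prints flash text
        System.out.println(extractOnce(chain, "notes.odt")); // prints generic text
    }
}
```

The model shows why priority matters for attachment extractors: if the catch-all extractor were placed first in the chain, the more specific one would never get a chance to run.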

For more information, see: Attachment Content Extractor Plugins

An Example Extractor

The following example extractor is untested. It associates a set of page-level properties with the page in the index, both as part of the regular searchable text and as Lucene Text fields that can be searched individually, for example in a custom {abstract-search} macro.

package com.example.extras.extractor;

import bucket.search.lucene.Extractor;
import bucket.search.Searchable;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import com.atlassian.confluence.core.ContentEntityObject;
import com.atlassian.confluence.core.ContentPropertyManager;
import com.opensymphony.util.TextUtils;

public class ContentPropertyExtractor implements Extractor
{
    public static final String[] INDEXABLE_PROPERTIES = {"status", "abstract"};
    
    private ContentPropertyManager contentPropertyManager;
    
    public void addFields(Document document, StringBuffer defaultSearchableText, Searchable searchable)
    {
        if (searchable instanceof ContentEntityObject)
        {
            ContentEntityObject contentEntityObject = (ContentEntityObject) searchable;
            for (int i = 0; i < INDEXABLE_PROPERTIES.length; i++)
            {
                String key = INDEXABLE_PROPERTIES[i];
                String value = contentPropertyManager.getStringProperty(contentEntityObject, key);

                if (TextUtils.stringSet(value))
                {
                    defaultSearchableText.append(value).append(" ");
                    document.add(new Field(key, value, Field.Store.YES, Field.Index.TOKENIZED));
                }
            }
        }
    }

    public void setContentPropertyManager(ContentPropertyManager contentPropertyManager)
    {
        this.contentPropertyManager = contentPropertyManager;
    }
}

Debugging

There's a really primitive Lucene index browser hidden in Confluence which may help when debugging. You'll need to tell it the filesystem path to your $conf-home/index directory.

http://yourwiki.example.com/admin/indexbrowser.jsp
