Lucene Boosting Strategy Module

Available: Confluence 3.0 and later

Lucene Boosting Strategy plugins allow you to configure the scoring mechanism used by Lucene to order search results in Confluence. Each time a document is found via search, it is passed through the set of boosting strategies to determine its score for ranking in the search results. By writing your own boosting strategy you can customise the order of search results found by Confluence.

Confluence's internal search is built on top of the Lucene Java library. Familiarity with Lucene is a requirement for writing a boosting strategy plugin, and this documentation assumes you understand how Lucene works.

Lucene Boosting Strategy Plugins

Here is an example atlassian-plugin.xml file containing a single boosting strategy:

<atlassian-plugin name='Sample Boosting Strategies' key='example.boosting.strategies'>
...
    <lucene-boosting-strategy key="boostByModificationDate" class="com.example.boosting.strategies.BoostByModificationDateStrategy"/>
...
</atlassian-plugin>

The BoostingStrategy Interface

All strategies must implement the following interface, BoostingStrategy:

package com.atlassian.confluence.search.v2.lucene.boosting;

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;

import com.atlassian.confluence.search.service.SearchQueryParameters;

/**
 * An implementation of this interface may be passed to {@link BoostingQuery} to achieve an arbitrary per document score
 * boost.
 */
public interface BoostingStrategy
{
    /**
     * <p>Apply a relevant boost to the specified document with the specified score. Returning a score
     * of 0 will remove the document from the results.</p>
     * <p><em>Warning:</em> This method needs to return extremely fast, so any I/O like using the index reader 
     * to load the actual document is discouraged. If you need access to a document's field values you should rather
     * consider using a {@link FieldCache} instead.</p> 
     * 
     * @param reader a reader instance associated with the current scoring process
     * @param doc the doc id
     * @param score the original score for the document specified by doc
     * @return the boosted score, 0 to remove the document from the results, or <code>score</code> to make no change to the score
     * @throws IOException
     */
    float boost(IndexReader reader, int doc, float score) throws IOException;
    
    /**
     * <p>Apply a relevant boost to the specified document with the specified score. Returning a score
     * of 0 will remove the document from the results.</p>
     * <p><em>Warning:</em> This method needs to return extremely fast, so any I/O like using the index reader 
     * to load the actual document is discouraged. If you need access to a document's field values you should rather
     * consider using a {@link FieldCache} instead.</p> 
     * <p>If you are implementing this method but do not use the <code>searchQueryParameters</code>, it is safe to delegate
     * directly to the <code>boost(IndexReader, int, float)</code> method.</p>
     * 
     * @param reader a reader instance associated with the current scoring process
     * @param searchQueryParameters extra state information used by more complex boosting strategies 
     * @param doc the doc id
     * @param score the original score for the document specified by doc
     * @return the boosted score, 0 to remove the document from the results, or <code>score</code> to make no change to the score
     * @throws IOException
     */
    float boost(IndexReader reader, SearchQueryParameters searchQueryParameters, int doc, float score) throws IOException;
    
}

Do not use the reader to retrieve data directly: doing so makes retrieving search results in Confluence very slow. Use the reader only with the FieldCache object to retrieve a cache of values from the index. See the example and discussion below.
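
To make the score contract concrete, the following sketch models it without any Lucene types. It assumes that strategies are applied one after another, each receiving the score returned by the previous one, and that a score of 0 drops the document. This page only says that a document is "passed through the set" of strategies, so the chaining order is an assumption. ScoreBoost and BoostChainSketch are hypothetical names, not Confluence APIs.

```java
import java.util.List;

// Hypothetical, simplified stand-in for BoostingStrategy (no IndexReader).
interface ScoreBoost
{
    float boost(int doc, float score);
}

// Hypothetical helper for illustration; not part of Confluence.
class BoostChainSketch
{
    // Feeds the score through each strategy in turn (assumed order).
    // Stops early once a strategy returns 0, since the document is dropped.
    static float apply(List<ScoreBoost> strategies, int doc, float score)
    {
        float current = score;
        for (ScoreBoost strategy : strategies)
        {
            current = strategy.boost(doc, current);
            if (current == 0f)
            {
                return 0f;
            }
        }
        return current;
    }
}
```

A strategy that returns the score unchanged leaves the ranking as it was, which matches the "make no change" case in the Javadoc above.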

An Example Boosting Strategy

The following boosting strategy is used in Confluence to boost search results by last-modified date. Some of the date-handling logic has been removed to simplify the example.

package com.example.boosting.strategies;

import java.io.IOException;
import java.util.Calendar;
import java.util.Date;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;

import com.atlassian.bonnie.LuceneUtils;
import com.atlassian.confluence.search.service.SearchQueryParameters;
import com.atlassian.confluence.search.v2.lucene.boosting.BoostingStrategy;

/**
 * A {@link BoostingStrategy} that boosts scores based on the modification date of the scored document. Recently modified
 * documents get a higher boost.
 */
public class BoostByModificationDateStrategy implements BoostingStrategy
{
    static final String MODIFIED_FIELD = "modified";

    private static final float BOOST_TODAY = 1.5f;
    private static final float BOOST_YESTERDAY = 1.3f;
    private static final float BOOST_WEEK_AGO = 1.25f;
    private static final float BOOST_MONTH_AGO = 1.2f;
    private static final float BOOST_THREE_MONTH_AGO = 1.15f;
    private static final float BOOST_SIX_MONTH_AGO = 1.10f;
    private static final float BOOST_ONE_YEAR_AGO = 1.05f;

    public float boost(IndexReader reader, int doc, float score) throws IOException
    {
        String[] fieldcaches = FieldCache.DEFAULT.getStrings(reader, MODIFIED_FIELD);

        // more recent hits get a boost
        Date age = LuceneUtils.stringToDate(fieldcaches[doc]);
        score *= getAgeBoostFactor(age);

        return score;
    }

    public float boost(IndexReader reader, SearchQueryParameters searchQueryParameters, int doc, float score) throws IOException
    {
        return boost(reader, doc, score);
    }

    private float getAgeBoostFactor(Date date)
    {
        // ... Date/Calendar logic that computes startOfToday, startOfYesterday, etc. omitted ...

        float boostFactor;
        if (date.after(startOfToday))
            boostFactor = BOOST_TODAY;
        else if (date.after(startOfYesterday))
            boostFactor = BOOST_YESTERDAY;
        else if (date.after(startOfWeekAgo))
            boostFactor = BOOST_WEEK_AGO;
        else if (date.after(oneMonthAgo))
            boostFactor = BOOST_MONTH_AGO;
        else if (date.after(threeMonthsAgo))
            boostFactor = BOOST_THREE_MONTH_AGO;
        else if (date.after(sixMonthsAgo))
            boostFactor = BOOST_SIX_MONTH_AGO;
        else if (date.after(oneYearAgo))
            boostFactor = BOOST_ONE_YEAR_AGO;
        else
            boostFactor = 1;
        return boostFactor;
    }
}
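
The example leaves out the date arithmetic. As a rough sketch only (not the code Confluence ships), the cut-off dates could be derived with java.util.Calendar as follows. The class name AgeBoostSketch, the helper startOfDayShifted, the explicit now parameter, and the choice to align every cut-off to midnight are all assumptions made for illustration.

```java
import java.util.Calendar;
import java.util.Date;

// Hypothetical helper for illustration; not part of Confluence.
class AgeBoostSketch
{
    // Midnight at the start of the day containing 'now', then shifted by
    // 'amount' units of the given Calendar field (assumption: all cut-offs
    // are aligned to midnight in the JVM's default time zone).
    static Date startOfDayShifted(Date now, int field, int amount)
    {
        Calendar cal = Calendar.getInstance();
        cal.setTime(now);
        cal.set(Calendar.HOUR_OF_DAY, 0);
        cal.set(Calendar.MINUTE, 0);
        cal.set(Calendar.SECOND, 0);
        cal.set(Calendar.MILLISECOND, 0);
        cal.add(field, amount);
        return cal.getTime();
    }

    // Same cascade as getAgeBoostFactor() above, with 'now' passed in so the
    // result is reproducible in tests.
    static float getAgeBoostFactor(Date modified, Date now)
    {
        if (modified.after(startOfDayShifted(now, Calendar.DAY_OF_MONTH, 0)))  return 1.5f;
        if (modified.after(startOfDayShifted(now, Calendar.DAY_OF_MONTH, -1))) return 1.3f;
        if (modified.after(startOfDayShifted(now, Calendar.DAY_OF_MONTH, -7))) return 1.25f;
        if (modified.after(startOfDayShifted(now, Calendar.MONTH, -1)))        return 1.2f;
        if (modified.after(startOfDayShifted(now, Calendar.MONTH, -3)))        return 1.15f;
        if (modified.after(startOfDayShifted(now, Calendar.MONTH, -6)))        return 1.10f;
        if (modified.after(startOfDayShifted(now, Calendar.YEAR, -1)))         return 1.05f;
        return 1f;
    }
}
```

Because the cut-offs depend only on the current day, a real strategy would probably compute them once per search (or once per day) rather than once per document, given how fast boost() has to return.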

Using Field Caches

Note that this example uses a Lucene FieldCache, which keeps the modification date of every index entry in memory. If you implement a BoostingStrategy yourself, you should also use a FieldCache (rather than reading index entries from disk), and you should be aware of how field caches behave:

  • the first time you use a field cache, it requires iterating through every index entry to warm up the cache in a synchronised block
  • field caches are cleared every time the search index is updated (normally every minute in Confluence), which requires another warm-up
  • field caches keep a copy of each term in memory, usually requiring a large amount of memory.
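
The first two points can be pictured with a toy model. The sketch below is not Lucene's FieldCache. It is a hypothetical ToyFieldCache that keeps one array of values per reader object and counts how often the expensive load runs. It assumes that each index update hands the strategy a new reader object, so the array cached for the old reader no longer helps.

```java
import java.util.Map;
import java.util.WeakHashMap;
import java.util.function.Function;

// Toy model only; it mimics the behaviour described above and is NOT Lucene's FieldCache.
class ToyFieldCache<R>
{
    // One array of field values per reader; entries vanish once a reader is garbage collected.
    private final Map<R, String[]> cache = new WeakHashMap<R, String[]>();
    // Stands in for the full scan over every index entry.
    private final Function<R, String[]> loader;
    // Number of times the expensive load has run.
    int warmUps = 0;

    ToyFieldCache(Function<R, String[]> loader)
    {
        this.loader = loader;
    }

    // Synchronised, so concurrent searches wait while a warm-up is in progress.
    synchronized String[] getStrings(R reader)
    {
        String[] values = cache.get(reader);
        if (values == null)
        {
            values = loader.apply(reader); // slow path: first use for this reader
            cache.put(reader, values);
            warmUps++;
        }
        return values;
    }
}
```

Calling getStrings twice with the same reader triggers one warm-up, while calling it with a fresh reader triggers another. This is the recurring cost to watch for when the index changes frequently.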

Be sure to measure the increase in memory usage required after installing your plugin and how well your custom boosting strategy copes with a large amount of data in the index that is updated every minute.
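
For a first, rough reading of the memory cost, the heap in use can be compared before and after the warm-up. The sketch below uses only java.lang.Runtime. HeapDeltaSketch and its methods are hypothetical names, System.gc() is only a hint to the JVM, and the resulting number is approximate, so a profiler or heap dump remains the more reliable tool.

```java
// Hypothetical helper for illustration; not part of Confluence or Lucene.
class HeapDeltaSketch
{
    // Heap currently in use by the JVM, in bytes.
    static long usedHeapBytes()
    {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    // Runs warmUp and returns the approximate change in used heap, in bytes.
    // The caller must keep a reference to whatever warmUp loads, otherwise
    // the second GC hint may reclaim it before it is measured.
    static long measure(Runnable warmUp)
    {
        System.gc();
        long before = usedHeapBytes();
        warmUp.run();
        System.gc();
        return usedHeapBytes() - before;
    }
}
```

In a plugin, the warmUp action would be the FieldCache call for your own field, run against a test instance whose index is of realistic size.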

Confluence itself has only two active field caches: one for the "modified" field in the main index (as shown above), and one for "word" in the Did-You-Mean index. When a new Searcher is created after each write to the index, Confluence manually warms up the "modified" field cache with the following call:

FieldCache.DEFAULT.getStrings(searcher.getIndexReader(), "modified");

It might improve performance to warm up any field caches when your plugin is initialised. There's currently no way for a plugin to determine when IndexSearchers are refreshed, so there may be a relatively frequent performance hit if you are accessing a FieldCache which hasn't been warmed up.
