Test collections for enterprise search evaluation, validation and optimization

by Martin White | Jan 5, 2021 | Search

A core element of information retrieval evaluation is the availability of a test collection against which A/B testing (not as easy as you might think!) can be carried out as a means of evaluating changes to the ranking algorithm. The importance of defining a representative test collection goes back to the fundamental work on search evaluation undertaken by Cyril Cleverdon in the 1960s. The use of test collections has been a feature of the TREC, CLEF and NTCIR conferences for many years, and the development of test collections is the subject of a recent thesis by Dan Li at the University of Amsterdam.
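To make the idea concrete, a test collection pairs a set of queries with relevance judgments (qrels), so that two ranking configurations can be compared query by query. The sketch below is purely illustrative: it assumes the qrels and the two runs have already been loaded into plain Python dictionaries, and the function names are my own rather than those of any particular evaluation toolkit.

```python
# Minimal sketch: paired A/B comparison of two rankers over a test collection.
# The data structures are illustrative, not a standard run/qrels file format.
import math

def ndcg_at_k(ranked_docs, judgments, k=10):
    """nDCG@k for one query; judgments maps doc_id -> graded relevance."""
    dcg = sum(judgments.get(doc, 0) / math.log2(rank + 2)
              for rank, doc in enumerate(ranked_docs[:k]))
    ideal = sorted(judgments.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def compare_rankers(qrels, run_a, run_b, k=10):
    """Per-query paired comparison of two runs over the same query set."""
    wins = losses = ties = 0
    deltas = []
    for qid, judgments in qrels.items():
        score_a = ndcg_at_k(run_a.get(qid, []), judgments, k)
        score_b = ndcg_at_k(run_b.get(qid, []), judgments, k)
        deltas.append(score_b - score_a)
        wins += score_b > score_a
        losses += score_b < score_a
        ties += score_b == score_a
    mean_delta = sum(deltas) / len(deltas) if deltas else 0.0
    return {"wins": wins, "losses": losses, "ties": ties, "mean_delta": mean_delta}
```

The per-query win/loss counts matter as much as the mean difference: a change that helps the average can still hurt a substantial minority of queries.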

Although there are a number of quasi-enterprise text collections available, the extent to which these are representative of actual enterprise collections has not been studied. The Microsoft MS MARCO collection has around 10 million documents, but these originate from the WWW and may not (indeed in my opinion will not) be representative of enterprise documents. One of the factors is the scale of enterprise collections, which might well run into the hundreds of millions of documents (using 'document' as a generic term for a content item), and that creates significant file management issues for the research community. Although many novel information retrieval approaches are proposed every year, the extent to which they scale is rarely, if ever, considered. The implications of this are demonstrated in a recent research paper which showed that the performance of dense representations (for example, based on BERT) degrades more quickly than that of sparse representations (such as BM25) as index size increases. In extreme cases this can even lead to a tipping point at which, beyond a certain index size, sparse representations outperform dense representations. Determining the tipping point is not that easy! When you read research papers on information retrieval developments, always start by understanding how close the document set is to your own collection in volume, degree of curation and use cases.
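One way to get a feel for where that tipping point might lie on your own content is to index progressively larger samples and track a recall measure for both a sparse and a dense retriever. The sketch below is illustrative only: it leans on the rank_bm25 and sentence-transformers packages and an assumed encoder model, and the corpus, queries and judgments are placeholders you would supply yourself.

```python
# Illustrative probe for a sparse/dense "tipping point": grow the index and
# track recall@k for a BM25 retriever and a dense bi-encoder side by side.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

def recall_at_k(ranked_ids, relevant_ids, k=10):
    return len(set(ranked_ids[:k]) & relevant_ids) / max(len(relevant_ids), 1)

def tipping_point_probe(corpus, queries, qrels, sizes, k=10):
    """corpus: list of (doc_id, text); queries: {qid: text}; qrels: {qid: set(doc_id)}."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed dense encoder
    q_vecs = {qid: model.encode(text) for qid, text in queries.items()}
    for size in sizes:  # e.g. [10_000, 100_000, 1_000_000]
        sample = corpus[:size]
        doc_ids = [doc_id for doc_id, _ in sample]
        bm25 = BM25Okapi([text.lower().split() for _, text in sample])
        d_vecs = model.encode([text for _, text in sample], convert_to_numpy=True)
        d_vecs = d_vecs / np.linalg.norm(d_vecs, axis=1, keepdims=True)
        sparse_r, dense_r = [], []
        for qid, text in queries.items():
            rel = qrels.get(qid, set())
            s_scores = bm25.get_scores(text.lower().split())
            s_rank = [doc_ids[i] for i in np.argsort(s_scores)[::-1]]
            qv = q_vecs[qid] / np.linalg.norm(q_vecs[qid])
            d_rank = [doc_ids[i] for i in np.argsort(d_vecs @ qv)[::-1]]
            sparse_r.append(recall_at_k(s_rank, rel, k))
            dense_r.append(recall_at_k(d_rank, rel, k))
        print(f"index size {size}: BM25 recall@{k}={np.mean(sparse_r):.3f}, "
              f"dense recall@{k}={np.mean(dense_r):.3f}")
```

On genuinely large indexes you would of course use a proper inverted index and an approximate nearest-neighbour store rather than brute-force scoring, but the shape of the experiment is the same.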

When it comes to enterprise search there is a potential requirement for test collections in both selection and implementation. In principle, running the application against a test collection should be a crucial step in assessing its performance, but this requires a depth of knowledge about user intents and requirements that is rarely defined in the procurement process. The increasing prevalence of AI/ML routines and personalization makes the assessment even more challenging, and then there is the issue of which languages should be represented in the test collection. A core element of enterprise search is security management, and that needs to be assessed in any procurement process. My advice is always to write the specification in parallel with the assessment process so that you are aware from the beginning of which features you may need to take on trust or (preferably) check out with other customers. The danger of talking to customers is that their index and query architecture may not match yours in crucial areas.
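Security trimming is one reason a test collection for enterprise search cannot be judged on relevance alone: the same ranked list produces different visible results, and therefore different metric values, for users with different entitlements. The fragment below uses a deliberately simplified, hypothetical ACL model just to show the effect.

```python
# Why security trimming complicates evaluation: one ranking, two users,
# different visible result sets. The ACL model here is hypothetical.
def security_trim(ranked_docs, doc_acls, user_groups):
    """Keep only documents whose ACL intersects the user's group memberships."""
    return [doc for doc in ranked_docs
            if doc_acls.get(doc, set()) & user_groups]

doc_acls = {"d1": {"finance"}, "d2": {"all-staff"}, "d3": {"hr"}}
ranking = ["d1", "d2", "d3"]
print(security_trim(ranking, doc_acls, {"all-staff", "finance"}))  # ['d1', 'd2']
print(security_trim(ranking, doc_acls, {"all-staff", "hr"}))       # ['d2', 'd3']
```

In practice the entitlement model would come from your directory and repository ACLs, not from a dictionary in code, which is exactly the sort of detail a vendor demonstration cannot reproduce.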

A similar set of problems arises when it comes to user acceptance testing prior to sign-off and the final payment. There is a further complication at this stage because the implementation may well have been managed by a systems integrator, and deciding whether red flags are attributable to the software, the implementation or the nature of the content in the test collection is always challenging, especially since at the acceptance stage the entire index may not have been built and not all of the connectors will have been stress-tested.
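A simple coverage check at acceptance, comparing what each connector has actually indexed against what the source system holds, at least separates "the index is incomplete" from "the ranking is wrong". The sketch below is hypothetical: the counts would come from the search platform's administration interface and from the source repositories, and the threshold is arbitrary.

```python
# Hypothetical acceptance-stage sanity check: flag content sources whose
# indexed document count falls short of the count reported by the source.
def coverage_report(indexed_counts, source_counts, threshold=0.95):
    """Both arguments map source name -> document count."""
    report = {}
    for source, expected in source_counts.items():
        indexed = indexed_counts.get(source, 0)
        coverage = indexed / expected if expected else 1.0
        report[source] = (coverage, "OK" if coverage >= threshold else "INVESTIGATE")
    return report

print(coverage_report(
    {"sharepoint": 940_000, "confluence": 58_000},
    {"sharepoint": 1_000_000, "confluence": 60_000, "file-shares": 250_000},
))
```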

Then comes the never-ending process of optimization. Many metrics have of course been developed over the years for relevance ranking performance, but all depend, to a greater or lesser extent, on assessing performance over a manageable test collection and a standard set of queries, with no consideration of the extent to which the optimization scales across the entire index and the current range of queries, which (especially at the present moment) may be quite substantially different from the queries posted in 2019.
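A quick, admittedly crude, check is to measure how much of the current query traffic is still represented in the query set you are optimizing against. The sketch below uses a deliberately naive normalization and made-up example queries; in practice you would also want to group near-duplicates and look at the long tail, not just exact matches.

```python
# Sketch: what fraction of current query volume is covered by the (possibly
# ageing) test-collection query set? Normalisation is crude and illustrative.
from collections import Counter

def normalise(query):
    return " ".join(query.lower().split())

def coverage_of_live_traffic(test_queries, live_query_log):
    """test_queries: iterable of strings; live_query_log: iterable of strings."""
    test_set = {normalise(q) for q in test_queries}
    live_counts = Counter(normalise(q) for q in live_query_log)
    total = sum(live_counts.values())
    covered = sum(count for q, count in live_counts.items() if q in test_set)
    return covered / total if total else 0.0

# e.g. 2019-era test queries vs this month's query log
print(coverage_of_live_traffic(
    ["travel policy", "expenses form"],
    ["remote working policy", "expenses form", "expenses form", "furlough"],
))  # 0.5
```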

There are ways of working through these challenges, but all depend on there being a dedicated (in all senses of the word!) search team that is in place before the implementation of a search application and that stays with the project right through to the post-implementation phase. This continuity is essential in achieving the best possible return on the investment you have made.

Martin White