Enterprise search research in 2020
It is important to appreciate that enterprise content presents some significant challenges to search developers. There are likely to be tens of millions of files of uncurated content in a wide range of file formats and the subject scope will be quite focused compared with web search. The scope is set by the business interests of the organisation and if that is pharmaceuticals there are going to be a massive number of documents relating to perhaps just a dozen therapeutic areas. Then of course there are the trade-offs between precision and recall!
The papers listed below are just a selection of those published in 2020. It is very difficult to do justice to these papers in just a sentence or two so even if they look only slightly interesting I would recommend a click and read. Many are not explicitly ‘enterprise search’ but cover issues that are a feature of searching enterprise content. Only the final paper in this selection is not open-access.
Inverted index architecture
The standard index database structure is the inverted index and it has served its purpose well, but now the advent of the processing capabilities of cloud applications could potentially offer alternative high-performance index architectures. File collections in the hundreds of millions also give rise to substantial index sizes, and that requires careful attention to index compression.
IIU: Specialized Architecture for Inverted Index Search (taejunham.github.io)
Techniques for Inverted Index Compression (arxiv.org)
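To make the link between index size and compression a little more concrete, here is a minimal sketch of an inverted index whose postings lists are stored as gaps between document ids and then variable-byte encoded. It is only an illustration of one of the simpler schemes in this area, not the method of either paper above; the function names and the whitespace tokenisation are assumptions made for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted list of the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():  # naive tokenisation (assumption)
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def vbyte_encode(numbers):
    """Encode non-negative integers 7 bits at a time; the high bit marks the last byte."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)
            n >>= 7
        out.append(n | 0x80)
    return bytes(out)

def vbyte_decode(data):
    """Invert vbyte_encode."""
    numbers, n, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:  # last byte of this integer
            numbers.append(n | ((byte & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= byte << shift
            shift += 7
    return numbers

def compress_postings(doc_ids):
    """Store the first id and then the gaps, which are small for frequent terms."""
    if not doc_ids:
        return b""
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return vbyte_encode(gaps)

def decompress_postings(data):
    """Rebuild the sorted document ids from the encoded gaps."""
    doc_ids, total = [], 0
    for gap in vbyte_decode(data):
        total += gap
        doc_ids.append(total)
    return doc_ids
```

The gain comes from the gaps: a term that appears in most documents has a long postings list, but the differences between consecutive ids are small and fit in a single byte each.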
Interoperability
Federated search has come on a great deal over the last few years but interoperability is still a challenging issue.
BM25
The BM25 ranking model was developed in the early 1990s and has become the de facto ranking model for most text search applications. However, there are many variants of BM25 and there is ongoing research to understand the opportunities and challenges of these variants.
Improvements to BM25 and Language Models Examined (otago.ac.nz)
Which BM25 Do You Mean? A Large-Scale Reproducibility Study of Scoring Variants (ru.nl)
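For readers who have not seen the scoring function written out, the sketch below shows one common formulation of BM25. It should be read as a single variant under stated assumptions, not as the definition used in either paper: the IDF form shown is a smoothed, non-negative one, the defaults k1 = 1.2 and b = 0.75 are conventional choices, and the function name and whitespace tokenisation are illustrative only.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query with one BM25 variant.

    Assumes docs is a non-empty list of non-empty strings.
    """
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: the number of documents containing each term.
    df = Counter(term for d in tokenized for term in set(d))

    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Smoothed IDF that never goes negative (one of several options).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalisation.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

The variants discussed in the research differ mainly in how the IDF and the length normalisation are defined, which is why two systems that both claim to use BM25 can rank the same documents differently.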
Query understanding
Although much of the focus of enterprise search is on standalone applications, many (indeed most) enterprise applications have good search functionality of their own. The paper below is an account of the development by Salesforce of the search capabilities in its CRM application. There are many wider lessons to be learned from the team’s experience with search optimization.
[2012.06238] Query Understanding for Natural Language Enterprise Search (arxiv.org)
Over the last few years the concept of ‘professional search’ has come to the fore. Clinicians, lawyers and patent agents are just a few of the professions that need to create complex queries, often using Boolean strings.
City Research Online – Towards Explainability in Professional Search
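As a rough illustration of what evaluating a Boolean string involves, the sketch below resolves a nested AND / OR / NOT expression against sets of document ids. The tuple-based query representation, the function name and the sample terms are all hypothetical and are not taken from the paper above.

```python
def evaluate(node, postings, all_docs):
    """Return the set of document ids matching a nested Boolean expression.

    A node is either a term (str) or a tuple such as ("AND", left, right).
    """
    if isinstance(node, str):
        return set(postings.get(node, ()))
    op, *args = node
    results = [evaluate(arg, postings, all_docs) for arg in args]
    if op == "AND":
        return set.intersection(*results)
    if op == "OR":
        return set.union(*results)
    if op == "NOT":
        # Complement against the whole collection.
        return all_docs - results[0]
    raise ValueError(f"unknown operator: {op}")

# Hypothetical postings: term -> ids of the documents that contain it.
postings = {"statin": {1, 2, 4}, "trial": {2, 3, 4}, "animal": {4}}
all_docs = {1, 2, 3, 4}

# statin AND trial AND NOT animal
query = ("AND", "statin", ("AND", "trial", ("NOT", "animal")))
matches = evaluate(query, postings, all_docs)
```

Every document either matches or does not, with no ranking involved, which is part of why professional searchers value being able to see exactly how each clause contributes to the result set.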
Microsoft Research
The level of detail in the Salesforce paper is very unusual. Most vendors talk about AI, NLP and ML in very generic terms. However, this year Microsoft Research published two papers that went beneath the surface of SharePoint. The first of these is an analysis of millions of search sessions originating from within Microsoft Office applications, collected over one month of activity, in an effort to characterize search behavior in productivity software.
Characterizing Search Behavior in Productivity Software (microsoft.com)
In another large scale study Microsoft Research analysed a number of factors, including display position, file type, authorship, recency of last access, and most importantly, the recommendation explanations, that are associated with whether users will recognize or open the recommended documents.
Understanding User Behavior For Document Recommendation – Microsoft Research
Machine learning
The Salesforce paper referred to above touches on the issues around deep learning. Sinequa reports on the use of BERT with the longer document formats that are common in enterprise search, and in doing so makes the point that web search and enterprise search invariably require very different solutions. This paper will give you a good sense of what is involved in taking a powerful but basic framework and transforming it into a workable solution for a specific piece of search software.
The challenges of carrying NLP techniques that work well with short, curated web content over into the enterprise search space are discussed in detail in this paper, which outlines approaches ranging from categorical or bag-of-words representations to word-embedding representations in the latent space.
When Google itself raises questions about the value of machine learning then it is probably time to take a deep breath and read through this paper co-authored by a substantial team of Google developers.
Perceptual speed
So we have come to the point of scanning the list of results. This can be much more challenging than it might seem as the results may have different snippet formats and may present metadata in different ways. This is where the concept of perceptual speed comes into play.
Predicting perceptual speed from search behaviour — University of Strathclyde
Auto summarization
One of the features of enterprise content is that the documents can be long and complex, making it very difficult to make an immediate judgment on their value. This is where auto-summarisation can make a substantial difference.
The true measure of relevance
One of the most disturbing research papers of 2020 suggests that even finding relevant papers does not necessarily lead to better decisions.
The authors found that the ability to interpret documents correctly was a much more important factor impacting task success. Despite the aid of the search engine, half of the clinical questions that were used as test cases were answered incorrectly. The authors commented that if their findings are representative, information retrieval research may need to reorient its emphasis towards helping users to better understand information, rather than just finding it for them. [This paper is not open access]
The aim of this post is to surface three important considerations. The first is that there is a great deal of development being undertaken into optimizing search through complex documents and this will continue. As a result, the performance of enterprise search software (ignoring for the moment what ‘performance’ means!) is going to improve, but achieving that improvement will take more than the insertion of AI routines.
This brings me to the second consideration. Enterprise search applications are modular. This gives application vendors immense flexibility to build a technology stack but it has to work in a totally integrated fashion and be capable of post-implementation creativity to meet requirements which may not have been on the original specification a year earlier, especially when that year is 2020!
The third is that enterprise search, as with all search applications, is not a solved problem even though the origins of enterprise search date back to the 1970s. The question that needs to be asked when considering the replacement of a current search application is whether the technology offered by a vendor will meet future user requirements encompassing a very wide range of search intents.
Martin White