Although a lot of attention is paid to information retrieval research, the techniques of index construction and management remain much more of a black art. One reason is that search index performance is a competitive advantage for a cloud vendor. The basis of the inverted file index architecture was conceived by Samuel Kaufmann and his colleagues at the IBM Technical Information Service, Yorktown, USA, and then further developed by Alfonso Cardenas. The quality of the index has a direct bearing on the quality of retrieval, so it requires considerable attention to the detail of how content is tokenized, and much else. Index performance is also very important, both in the speed at which the index can be updated and the speed at which it can be queried. When enterprise search ran on on-premises indexes, adroit management of the index was the mark of excellence for a search engineer.
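To make the link between tokenizing and retrieval quality concrete, here is a minimal sketch of an inverted index in Python. It is an illustration only, not how any particular search product builds its index: the function names (tokenize, build_inverted_index, search_all), the deliberately naive tokenizer and the three sample documents are all assumptions made for this example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split on anything that is not a letter or digit.

    This is a deliberately naive tokenizer, assumed for illustration only.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of documents that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search_all(index, query):
    """Return the IDs of documents containing every query term (Boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical sample content, invented for this sketch.
docs = {
    1: "Index compression reduces storage.",
    2: "Re-indexing the content takes time.",
    3: "Compression of the index speeds up queries.",
}
index = build_inverted_index(docs)
matches = search_all(index, "index compression")
```

With this tokenizer "Re-indexing" is split into "re" and "indexing", so a query for "index" does not find document 2. A production system would probably add stemming or similar normalization, and any change of that kind alters the terms stored in the index.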
Latency management is especially complex in enterprise search because each query (and each refinement), together with the return of results, involves multiple calls to the index. Google has led users to regard sub-500ms response times as the default, but achieving that with enterprise search, especially when federated, is a substantial engineering challenge.
Another factor to take into account in index management is the extent to which the content needs to be partially or completely re-indexed to support a major change in the IR routines. Both are operations that can go catastrophically wrong because they will be carried out under a time constraint to get the system back up and running. This is why search vendors often seem slow to adopt the latest techniques in query management: they know that optimizing performance will require, at a minimum, a partial re-index. Re-indexing is also required when two businesses merge and wish to have a unified index. Sometimes the risks totally outweigh the rewards.
Another index management issue is compression. The index might be as much as 60% of the size of the content volume, and of course you need a back-up as well. With cloud services many of these issues become easier to manage, so long as the budget is available. Pushing availability from (say) 99.2% to 99.5% can result in a substantial increase in license fees.
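As a rough illustration of what index compression involves, the sketch below applies one classic combination, delta (gap) encoding followed by variable-byte encoding, to a single postings list. It is not a description of what any vendor actually does; the function names and the sample document IDs are assumptions, and the encoder assumes the IDs are non-negative integers sorted in ascending order.

```python
def encode_postings(doc_ids):
    """Delta-encode a sorted list of document IDs, then variable-byte encode each gap.

    Assumes doc_ids are non-negative integers in ascending order.
    """
    out = bytearray()
    previous = 0
    for doc_id in doc_ids:
        gap = doc_id - previous
        previous = doc_id
        # Emit 7 bits per byte, low-order bits first.
        # A set high bit means "more bytes follow for this gap".
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)


def decode_postings(data):
    """Reverse encode_postings: rebuild each gap, then sum the gaps into document IDs."""
    doc_ids = []
    current = 0  # running document ID
    gap = 0
    shift = 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7  # continuation byte: keep accumulating this gap
        else:
            current += gap
            doc_ids.append(current)
            gap = 0
            shift = 0
    return doc_ids


# Hypothetical postings list for one term, invented for this sketch.
postings = [3, 7, 150, 151, 100000]
encoded = encode_postings(postings)
decoded = decode_postings(encoded)
```

Here the five document IDs occupy 8 bytes, against 20 bytes if each were stored as a fixed 32-bit integer. The saving depends on how close together the IDs are, which is one reason the order in which documents are numbered matters for compression.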
The research papers below, published over the last couple of years, cover a range of search index design and management topics at an IT (rather than a linguistic) level and illustrate some of the directions of travel.
The Performance Envelope of Inverted Indexing on Modern Hardware (2019)
Compressing Inverted Indexes with Recursive Graph Bisection: A Reproducibility Study (2019)
IIU: Specialized Architecture for Inverted Index Search (2020)
A Neural Corpus Indexer for Document Retrieval (2022)
Techniques for Inverted Index Compression (2021)
AIRPHANT: Cloud-oriented Document Indexing (2021)
Transformer Memory as a Differentiable Search Index (2022)
The Curse of Dense Low-Dimensional Information Retrieval for Large Index Sizes (2022)
The Bottom Line
The opportunities to gain substantial search performance through novel indexing approaches are significant. However, because the requirements are specific to search, the managers of vendor cloud services may not be familiar with these approaches or with their potential impact on service levels and costs.
Martin White, Principal Analyst