“Information retrieval – recent advances and beyond”. A research commentary

A feature of the SearchResearch Online blog will be commentaries on recent research papers that have relevance to enterprise search managers, and also to academic research teams and vendor development teams. Each commentary will have a Bottom Line at the end that highlights the potential value of the research. The choice of the word ‘commentary’ is deliberate. I’m not aiming to ‘review’ the paper in a strict academic sense of peer review, nor will I attempt to summarise a research paper. I’m just aiming to bring it to your attention and suggest why you might benefit from reading the full paper.

“Information retrieval – recent advances and beyond” is the title of a very comprehensive review of the development of information retrieval models by Kailash Hambarde (University of Beira Interior, Portugal) and Professor Hugo Proença (Instituto de Telecomunicações, Portugal). It was published on 20 January 2023 as a pre-print in the arXiv repository, so it has not yet been peer-reviewed. It is also available on the DeepAI site.

After a brief introduction to the topics of retrieval and ranking, the authors discuss conventional term-based (Boolean) query management and then outline the approaches that have been used for semantic retrieval. They then review in some detail the research related to the first stage of information retrieval, covering sparse techniques for semantic retrieval, state-of-the-art deep learning techniques for semantic retrieval, and hybrid techniques such as those making use of word embeddings. This section is very well structured.

The next section of the paper examines a range of approaches to ranking, including the vector space model (VSM) and learning-to-rank. This is quite a substantial section but, unlike the earlier section on IR techniques, it has no sub-headings, which makes it harder to read and comprehend.
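
For readers who have not met the vector space model before, a minimal sketch may help. It is an illustration only and is not taken from the paper: it assumes whitespace tokenisation, a smoothed TF-IDF weighting and cosine similarity, and the function names (tokenize, build_idf, rank and so on) and the sample documents are made up for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lower-casing and whitespace splitting stand in for real analysis.
    return text.lower().split()


def build_idf(docs_tokens):
    # Smoothed inverse document frequency (one common variant, chosen for illustration).
    n = len(docs_tokens)
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf_vector(tokens, idf):
    # Sparse vector as a dict; terms unseen in the collection are ignored.
    tf = Counter(tokens)
    return {term: tf[term] * idf[term] for term in tf if term in idf}


def cosine(a, b):
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    # Returns (score, document index) pairs, best match first.
    docs_tokens = [tokenize(d) for d in docs]
    idf = build_idf(docs_tokens)
    doc_vectors = [tfidf_vector(tokens, idf) for tokens in docs_tokens]
    query_vector = tfidf_vector(tokenize(query), idf)
    scores = [(cosine(query_vector, vec), i) for i, vec in enumerate(doc_vectors)]
    return sorted(scores, key=lambda pair: (-pair[0], pair[1]))


# Hypothetical documents and query, for illustration only.
docs = [
    "enterprise search ranking models",
    "holiday travel booking policy",
    "search relevance and ranking evaluation",
]
results = rank("ranking models", docs)
for score, index in results:
    print(f"{score:.3f}  {docs[index]}")
```

The document that shares both query terms comes first, the one sharing a single term comes second, and the unrelated document scores zero. Learning-to-rank approaches replace this fixed formula with a model trained on relevance judgements.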

The paper includes a listing of the datasets that are currently being used to benchmark the performance of information retrieval models. Finally, the authors set out their views on current challenges, notably:

Handling long-tail queries: A recent study has highlighted that handling long-tail queries, which are queries that are infrequent or rare, is a major challenge for semantic retrieval systems.

Handling multilingual retrieval: With the increasing amount of multilingual information available on the web, handling multilingual retrieval has also become an important challenge for semantic retrieval.

The paper lists 237 references to research papers, most of them published in the last five years, and so provides a very good starting point for a student or a developer beginning work on information retrieval application development.

The Bottom Line

Enterprise search applications are a complex mix of modules, some of which may be third-party add-ins. This review gives a sense of the underlying complexity. It is important to understand the vendor’s technical stack to be certain not only that the application meets current requirements but also that the vendor has a clear forward development strategy.

Martin White, Principal Analyst