Coping (only just!) with the deluge of AI/search-related research

One of my objectives in setting up SearchResearch Online was to bring to the attention of the search management community (primarily enterprise/professional search) published research that would be of relevance to their work. I launched the web site in November last year ahead of the announcement of ChatGPT from OpenAI. Since that announcement the impact on search research has been very significant indeed, and I halted blogging while I took a deep breath and worked out how best to cope with the deluge. The reality is that there is no way that any single person can cope with the deluge!

In volume terms, the major platform for research publication in AI-related topics is ArXiv, a service operated by Cornell University. This is a pre-print server that dates back to the early 1990s and moved to Cornell in 2001. The primary benefit of pre-prints is the speed of publishing, because there is no peer-review process involved. The downside is that the quality of the papers released on ArXiv is variable despite some degree of moderation.

I scan a number of sections on a daily basis. Below is the number of papers published in each of these sections in a typical week in May:

  • Artificial Intelligence 718
  • Information Retrieval 198
  • Computers and Society 204
  • Digital Libraries 7
  • Human Computer Interaction 166
  • Computers and Language 957
  • Machine Learning 1004

In total that comes to just over 3200 titles to scan, but because papers can be published in more than one section there is a substantial overlap. I would guess that there are around 2000 unique papers released in a typical week. Since November 2022 I have added around 200 research papers to my collection, limiting myself to research that is specifically relevant to enterprise and professional search, and to AI governance in the enterprise. I have now stopped scanning the Machine Learning section because so many of its papers (usually somewhat nearer to ‘applied research’) also appear in other sections.
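For anyone who wants to check the arithmetic, here is a minimal Python sketch. The section counts are the ones listed above; the figure of 2000 unique papers is only my rough guess, so the overlap share it produces is an estimate and not a measurement. The variable names are purely illustrative.

```python
# Weekly paper counts per ArXiv section, as listed above (typical week in May).
weekly_counts = {
    "Artificial Intelligence": 718,
    "Information Retrieval": 198,
    "Computers and Society": 204,
    "Digital Libraries": 7,
    "Human Computer Interaction": 166,
    "Computers and Language": 957,
    "Machine Learning": 1004,
}

# Total number of titles across all sections, counting cross-listed papers more than once.
total_listings = sum(weekly_counts.values())

# Assumption: a rough guess of about 2000 unique papers per week (not a measured value).
estimated_unique = 2000

# Share of listings that would be repeats of a paper already seen in another section.
cross_listed_share = 1 - estimated_unique / total_listings

# Titles left to scan once the Machine Learning section is dropped.
without_ml = total_listings - weekly_counts["Machine Learning"]

print(f"Titles to scan per week: {total_listings}")
print(f"Estimated share of repeat listings: {cross_listed_share:.0%}")
print(f"Titles to scan without Machine Learning: {without_ml}")
```

On these numbers, dropping the Machine Learning section removes roughly a third of the titles to scan in a week.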

There are of course many research papers published in peer-reviewed publications, both open and closed (subscription) access, but these probably account for (at a total guess!) a further 30 or so papers a week.

Judging the value of an ArXiv paper is not easy, and it is very much the same process that enterprise search users are faced with on a regular basis. The evaluation starts with an assessment of the authors and the institutions they work for. It is especially difficult (at least for me) to assess papers from researchers in China, who account for a substantial share of the research being conducted at present. I also look at the list of references to get a sense of the rigour with which the authors have considered prior research. An important element of a research paper is a discussion of the limitations of the research, and this section is often disappointing in its scope and transparency. There are often papers from the supply side of the search business, notably Microsoft! Quite a number of the papers are early releases of conference papers, and that is a useful indicator of quality.

I count myself fortunate that I have the facility of scanning titles at a very high speed, almost looking peripherally at key terms in the title. It is very unhelpful when the authors of a paper think that it would be fun to create a novel acronym for the title of their paper. ‘Finding the SWEET Spot: Analysis and Improvement of Adaptive Inference in Low Resource Settings’ is a recent example.

As well as what is being published I’m also interested to see where the gaps are. For example, there are very few papers that take a ‘user experience’ perspective, or consider how a novel application can be incorporated into an existing (usually Microsoft) IT stack. With all academic research there is always the question about the level of on-going support for novel software once the project has come to an end and the author heads off with their PhD.

Looking back at the last few months I can start to see some degree of reality dawning in published research as authors move away from hype delivery towards a more balanced view of the benefits, risks and limitations of the work they have been undertaking.

Perhaps it’s time to start blogging again, but focusing on what I regard as research of exceptional potential value.

Martin White

6 June 2023