Tracking LLM and AIGC research – the role of pre-print servers

The opportunities for research into the performance and possibilities of LLMs (I’m using the term in a very generic way) are both colossal and essential if we are to get the best from this technology and avoid the worst it has to offer. It has struck me that the publication of this research is not keeping up with the speed of development. Even in journals that pride themselves on early publication, the papers have a historical perspective which is interesting but of questionable long-term value. There is also the challenge of finding peer reviewers who have an appropriate level of expertise in the topics.

A couple of weeks ago I spotted two papers on arXiv providing a review of the published literature on what is now being referred to as AI-Generated Content (AIGC). At least it is not a TLA! ‘AI-Generated Content (AIGC): A Survey’ has 116 citations and ‘One Small Step for Generative AI, One Giant Leap for AGI: A Complete Survey on ChatGPT in AIGC Era’ offers 226. (In passing I might comment that ‘clever titles’ are not helpful and the authors do not explain how they know it is ‘complete’.)

Looking through the lists of citations, what is immediately obvious is the number of references to pre-print papers, notably in arXiv. In a typical week I’m adding perhaps 15-20 pre-prints to my digital library but no more than around 5 papers from peer-reviewed journals. This of course raises the question of the validity of pre-prints without the sanity (hopefully!) of peer review. I recently came across a pre-print on the technology of IR and enterprise search that was so full of inaccuracies I was almost moved to tears. This paper was of course in a topic area I’ve been following since the late 1970s – how much trust should I place in other papers in arXiv on these current topics?

I do not want to be seen to dismiss all pre-prints and pre-print servers. Over the last few months there have been many exceptional papers, but my judgement has been based (arguably biased) on the institutions of the authors and on their biographical profiles. Even so I have added 70 papers to my collection since 1 January and my focus has been very much on research that has a potential impact on enterprise search.

I have no immediate solutions, but the situation does make tracking both the research outcomes and the commercial offerings from an increasingly large number of profit-chasing vendors very challenging. My reference to ‘profit’ is a reminder that in the end someone has to pay the bills for the computational power needed to make all this technology work. What is very noticeable from the vast amounts of vendor PR I am seeing is that there is no indication of the pricing models for the commercial versions of any of their current offerings, nor of the cost of the water involved. Perhaps all LLM variants should come with an environmental impact statement!

[This is a slightly edited version of a contribution I posted in the Spring issue of Informer]

Martin White

25 April 2023