The information retrieval community was one of the pioneers in the development of benchmarks and test collections. The Cranfield experiments, the result of the vision and patience of Cyril Cleverdon, can trace their history back to 1958. The baton was taken up by the US National Institute of Standards and Technology in 1992 with the establishment of TREC (the Text REtrieval Conference). One of the objectives of TREC is to increase the availability of appropriate evaluation techniques for use by industry and academia, including the development of new evaluation techniques more applicable to current systems.
The outcomes of the TREC conferences have been of very substantial benefit to the development of search technology, and they have changed with the times and with user requirements. You can find the agenda for the 2023 conference here. For example, a recent assessment of IR benchmarks jointly authored by Stanford and IBM raises a number of issues about current benchmarks. New benchmarks are also emerging, such as FAIR, which assesses whether the balance of a set of search results corresponds to the diversity of approaches to a given problem or query.
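To make the balance idea concrete, here is a minimal sketch, not FAIR's actual metric: it compares the share of each viewpoint group in a ranked result list against a target distribution. The group labels and target shares are hypothetical.

```python
from collections import Counter

def group_balance(results, target):
    """Compare the share of each group in a result list against a
    target distribution; returns the maximum absolute deviation
    (0.0 = perfectly balanced)."""
    counts = Counter(r["group"] for r in results)
    total = sum(counts.values())
    return max(abs(counts.get(g, 0) / total - share)
               for g, share in target.items())

# Hypothetical example: results labelled by stance on a debated query.
results = [{"doc": f"d{i}", "group": g}
           for i, g in enumerate(["pro", "pro", "con", "pro"])]
target = {"pro": 0.5, "con": 0.5}
print(group_balance(results, target))  # 0.25
```

A real fairness benchmark would use graded, position-weighted exposure rather than raw counts, but the shape of the check is the same.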
When it comes to LLMs there is already an enormous number of benchmarks. Every research paper is seemingly required to show how the authors have outperformed prior results on their chosen benchmark – you never come across a paper that reports “we tried this and it didn’t work”, even though such failures might be of significant interest and value. The question we need to ask is how relevant these benchmarks are to the optimization of AI in practice, especially in the enterprise environment. Wouldn’t it be nice to have a Productivity Benchmark! Equally, who owns these benchmarks, and is there a forum for considering their value and enhancement, and for developing new benchmarks?
A step in the right direction is the HELM project, which assesses each benchmark against 57 metrics. Recent developments include a novel two-word test of the semantic abilities of LLMs: can they differentiate between ‘goat sky’ and ‘knife army’ (which are nonsense) and ‘soap bubble’ and ‘computer programmer’ (which make perfect sense)? Context is also important, and AGIEval is a benchmark specifically designed to assess foundation models in the context of human-centric standardized exams, such as legal qualification tests.
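The two-word test probes an actual LLM; as a toy stand-in, the same intuition can be sketched with corpus co-occurrence statistics (a crude PMI over a tiny made-up corpus). The corpus and the `plausibility` function here are purely illustrative, not part of the published test.

```python
from collections import Counter

# Toy corpus standing in for the statistics an LLM learns at scale.
corpus = ("the soap bubble floated up . the computer programmer wrote code . "
          "a bubble of soap burst . the programmer used a computer .").split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = len(corpus)

def plausibility(phrase):
    """Score a two-word phrase by how often its words co-occur
    relative to chance; pairs with unseen words score 0."""
    w1, w2 = phrase.split()
    joint = bigrams[(w1, w2)] / max(total - 1, 1)
    indep = (unigrams[w1] / total) * (unigrams[w2] / total)
    return joint / indep if joint and indep else 0.0

print(plausibility("soap bubble") > plausibility("goat sky"))  # True
```

An LLM passes the real test by making this same separation without ever being given explicit co-occurrence counts.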
Benchmarks can be fallible. A recent research paper identified that there were two versions of the Microsoft MS MARCO benchmark: the official one, and a second in which passages were augmented with titles, largely due to the introduction of the Tevatron code base. The authors commented that if a paper does not properly report which version is used, reproducing its results fairly is basically impossible. Another recent paper concluded that the great majority of human evaluations in NLP are not repeatable and/or not reproducible and/or too flawed to justify reproduction. Although this paints a dire situation, it also presents an opportunity to rethink how human evaluations in NLP are designed and reported.
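One low-cost mitigation for the “which version did you use?” problem is to publish a cryptographic fingerprint of the exact collection file alongside the results. A minimal sketch (the file name in the comment is hypothetical):

```python
import hashlib

def corpus_fingerprint(path):
    """Hash a benchmark file so a paper can report exactly which
    version of the collection was used (e.g. which MS MARCO variant)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# A real run would point at e.g. collection.tsv; demo with a stand-in file.
import tempfile, os
fd, path = tempfile.mkstemp()
os.write(fd, b"passage text\n")
os.close(fd)
print(corpus_fingerprint(path)[:12])
os.remove(path)
```

Two papers reporting the same fingerprint are provably evaluating on byte-identical data, which sidesteps the ambiguity the MS MARCO study describes.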
The most obvious gap in benchmarking and test collections is in enterprise search, where for far too long a collection of Enron emails from around 2001 was often used as de facto enterprise content. Nothing could be farther from reality.
The underlying issue is whether AIGC/LLM applications meet the ‘fit to specification’ test but not the ‘fit for purpose’ test. When application development focuses on a functional specification and does not take account of non-functional requirements (e.g. usability and user experience), the usual result is the development of workarounds and/or the use of shadow IT, which can increase technical debt and corporate risk exposure.
8 June 2023