Foundation models – some issues that seem not yet to have been addressed.

In my post on the Alan Turing Institute Foundation Models event I noted that it had helped me develop a list of issues that were not discussed on the day. This is not a criticism: the event could easily have run to three days and there would still have been topics left uncovered!

So this is my list of ten, and to a significant extent it is a scope note to myself to track developments, many of which are from the business/practitioner perspective. There is no order of priority.

  1. Foundation models (FMs) and the applications that they will support all require a substantial investment in skills and computing power. It is unclear to me where this investment is going to come from, and much will depend on the funding/business model of the institution/organisation undertaking the development.
  2. That also leads me to wonder about the components of the delivery channel. Is there a role for ‘systems integrators’ in the delivery of FMs and applications?
  3. Then of course comes the issue of who ‘owns’ the use of FMs in the enterprise. No one wants to own search, so will that be the same for FMs? Or will FMs catalyse the recognition of the need to manage discovery as a corporate asset?
  4. Getting the best out of an FM will require training, and that includes the choice of the FM, the scope of the repository and the way in which prompts can best be used. In an enterprise context search is regarded (incorrectly!) as being intuitive. Who will own the use of FMs in the organisation?
  5. Nothing was said at the event about language support. It seemed to be assumed that the models would use English-language training sets and be used by people fluent in English. I would note that even people fluent in spoken English may not be fluent in constructing queries to put to an FM.
  6. Computing power is constraining the length of responses to around 200 words, and that is probably not adequate for many discovery use cases. How might this constraint be relaxed, even if only to 400 words?
  7. When OpenAI (and other FM developers) talk about indexing the web they are primarily talking about HTML content. A substantial amount of content (especially research) is published as PDF files, often behind subscription firewalls. Extracting content from PDFs is a significant challenge and I’d be interested to understand to what extent PDF and similar content is in current training and master collections.
  8. There are a number of technological initiatives towards watermarking FM content. Should there be an international regime for this work, along the lines of ISBN, DOI and ISSN, amongst many others?
  9. Without a watermark it is down to the individual to try to spot the use of FM content. That is getting increasingly difficult as the FMs develop, but could there be issues for people with dyslexia?
  10. No one seems to be talking about reproducibility. If I pose a query today will I get the same response in two weeks’ time? Probably not, but is that important?

Martin White, Principal Analyst, 23 February 2023