Assessing relevance in search results – the role of document surrogates
Over the years there has been an enormous amount of research into the optimisation of search interfaces, with a great deal of attention paid to facet design and layout and to the positioning of the search box. These provide dialogue support in the quest to arrive at a ranked list of results on what is usually termed the Search Engine Results Page (SERP). Once the list has been presented, the user's task is to review the results and decide which might be the most relevant for their purpose. This involves very close scrutiny of the summary text for each result: a document that may be tens of pages long is represented by a document surrogate of perhaps 50 words together with some metadata. When I look through RFPs for search applications, or the technical documentation provided by a search vendor, document summarisation is not considered, and yet how else is the user going to gauge the potential relevance of a result?
There is a much wider issue here than creating a summary for a search result. When we summarise a document ourselves we read, or at least scan, through it to get a measure of its scope and balance, and then take a view on what a useful summary might be. Technology cannot work this way – at least not at present. The challenge of creating document summaries has been on the agenda since the pioneering work of H. P. Luhn in 1958. As the volume of information increases, the requirement to create a summary of a document that is fit for purpose increases at a similar rate.
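Luhn's original idea was frequency-based: words that recur in a document (once stopwords are removed) are treated as significant, and each sentence is scored by how densely it packs those significant words, with the top-scoring sentences forming the summary. The sketch below is a minimal, illustrative implementation of that idea – the stopword list, thresholds, and naive sentence splitting are my own simplifications, not Luhn's exact procedure.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "that",
             "for", "on", "it", "as", "with", "are", "be", "was"}

def luhn_score(sentence_words, significant):
    """Score a sentence following Luhn's idea:
    (number of significant words)^2 / length of the span containing them."""
    positions = [i for i, w in enumerate(sentence_words) if w in significant]
    if not positions:
        return 0.0
    span = positions[-1] - positions[0] + 1
    return len(positions) ** 2 / span

def summarise(text, n_sentences=2):
    # Naive sentence split on terminal punctuation; real systems use a
    # proper sentence tokenizer.
    sentences = [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]
    words_per_sentence = [re.findall(r"\w+", s.lower()) for s in sentences]
    freq = Counter(w for ws in words_per_sentence
                   for w in ws if w not in STOPWORDS)
    # Treat repeated non-stopwords as 'significant' (threshold of 2 is
    # arbitrary for this sketch).
    significant = {w for w, c in freq.items() if c >= 2}
    ranked = sorted(range(len(sentences)),
                    key=lambda i: luhn_score(words_per_sentence[i], significant),
                    reverse=True)
    # Present the chosen sentences in their original document order.
    return ". ".join(sentences[i] for i in sorted(ranked[:n_sentences])) + "."
```

Even this crude version illustrates the core limitation the paragraph above describes: the scoring sees only word counts, not the scope or balance a human reader would weigh up.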
There are two core approaches to creating document summaries. Extractive summarisation locates and presents sections of sentences from the document. First an intermediate representation is created, the sentences in the document are then scored against the query terms, and finally a summary is presented that may include sections of more than one sentence, usually with the query terms highlighted. Creating the representation is quite a challenge and there are many different approaches; increasingly, Latent Semantic Analysis and Latent Dirichlet Allocation are being used to identify topics within a document. The fundamental challenge is whether the combination of the sentence extracts and the highlighting of the query terms gives the search user a valid indication of the value of the document. This is especially the case when query terms occur in tables and charts, which are very difficult to deconstruct through extractive summarisation.

The second approach is abstractive summarisation, in which natural language processing is used to create (in effect) a new document: a summary that in principle would be a best match for the query terms. This is technically very difficult, but over the last few years advances in NLP and machine learning have given fresh impetus to providing solutions. There are many extractive summarisation software applications available, as well as evaluation metrics such as ROUGE which can be used to assess the quality of a summarisation technique. A good starting point for learning more about this topic is Automatic Summarization by Nenkova and McKeown, published by Now Publishers in 2011, though the rate of development since the book was published has been very substantial.
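The extractive, query-biased pipeline described above – score each sentence against the query terms, pick the best, highlight the matches – can be sketched in a few lines. This is a deliberately minimal illustration, not any vendor's actual snippet generator: real systems use proper tokenization, term weighting, and sub-sentence extraction.

```python
import re

def snippet(document, query, max_sentences=2, highlight=True):
    """Query-biased extractive snippet: score each sentence by the number
    of distinct query terms it contains, return the best sentences in
    document order, with matched terms wrapped in <b>...</b>."""
    terms = {t.lower() for t in re.findall(r"\w+", query)}
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document)
                 if s.strip()]

    def score(sentence):
        words = set(re.findall(r"\w+", sentence.lower()))
        return len(terms & words)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])  # restore document order
    parts = [sentences[i] for i in chosen]
    if highlight and terms:
        pattern = re.compile(r"\b(" + "|".join(map(re.escape, terms)) + r")\b",
                             re.IGNORECASE)
        parts = [pattern.sub(r"<b>\1</b>", p) for p in parts]
    return " ... ".join(parts)
```

Note what this sketch cannot do: a query term that only appears inside a table or chart caption contributes nothing useful to the extracted sentences, which is exactly the weakness of extractive summarisation noted above.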
Of course the requirement for summarisation is not just one for the search community. News services also need to generate summaries of stories for news feeds, and there is an emerging requirement to summarise social networking exchanges so that these can be analysed for topic extraction and sentiment analysis. When deciding which summarisation approach to use, the requirements of the document content and format need to be taken into account: research reports and patents need different approaches than corporate policies and procedures. A particular challenge is documents written by people in their second language, because misused syntax can have a significant impact on the result summary. The question a search manager needs to consider is whether the summariser supplied by the vendor is appropriate to the documents and other content being indexed, and if not, whether it can be changed or modified. Great attention may be paid to optimising relevance for given queries, but the benefits may be clouded in testing by poor, or more often unpredictable, performance of the summary.
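One practical way for a search manager to test whether a vendor's summariser suits their content is to score its output against human-written reference summaries using ROUGE. The most basic variant, ROUGE-1 recall, is simply the fraction of the reference summary's unigrams that the generated summary recovers. A minimal sketch, assuming simple whitespace-and-punctuation tokenization (the full ROUGE toolkit supports n-grams, longest common subsequences, stemming, and multiple references):

```python
import re
from collections import Counter

def rouge1_recall(candidate, reference):
    """ROUGE-1 recall: overlapping unigram counts (clipped by the
    reference counts) divided by the total unigrams in the reference."""
    cand = Counter(re.findall(r"\w+", candidate.lower()))
    ref = Counter(re.findall(r"\w+", reference.lower()))
    if not ref:
        return 0.0
    overlap = sum(min(cand[w], ref[w]) for w in ref)
    return overlap / sum(ref.values())
```

Running a sample of indexed documents through the vendor's summariser and computing scores like this against a small set of hand-written references gives at least a quantitative baseline for the "is it appropriate" question, rather than relying on impressions from a few test queries.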
So when you next undertake a search and click on a result that turns out to be less relevant than you expected, the problem may lie in the document surrogate that has been generated, and not in the ranking model.