Unstructured text? It does not exist!
Along with ‘killer application’ I’d like to ban the description ‘unstructured text’. If a piece of text is to have semantic meaning then it has to be structured. The challenge for the search application is to take a sentence and parse it (i.e. reduce it) to a set of component words along with their syntactic relationships, such as subject, object and verb. How well the application responds to this challenge has a significant impact on the quality of the search. Understanding some of the consequences of parsing (derived from the Latin pars = part) is very important in understanding why certain queries result in poor relevance for users. Different languages present different problems, which is why cross-language search is very difficult to deliver. In English, homographs (words with the same spelling but different meanings) are a particular challenge. Take as an example
“Martin was sounding off about the fact that making a sound (perhaps by sounding a bell) is a sound thing to do entering a sound after sounding its depth.”
In that single sentence the word ‘sound’ is used in six different ways. A typical English noun has two forms (singular and plural), a typical German noun has eight forms (singular and plural in four different cases), and a typical Hungarian noun has several hundred forms. Using an English parser on German text is ineffective because the parser will not have the set of rules needed to parse German compound nouns. There are a number of parsing applications for each of the main languages that use an extended ASCII character set. Moving into Chinese, Japanese and Korean (collectively referred to as CJK) is a very significant leap in complexity. To get a sense of the complexity of parsing, the FAQ for the Stanford open source parser is a good place to start, as is a special issue of Computational Linguistics. The reason for highlighting parsers is that there may be words or phrases that are important to search users but cannot be resolved by the parser supplied with the search application.
Stemming and lemmatization
Another element within the text processing stage of a search application is recognising variant spellings which are semantically the same. The most common example is the plural of a word, so that a search for ‘cars’ will also find ‘car’. However, stemming is a piece of brute-force programming and usually only applies to the end of a word. In general stemming increases recall at the expense of precision. For example the standard Porter stemmer (dating from 1980) will reduce operate, operating, operates, operation, operative, operatives and operational to ‘oper’. Proper names are also a challenge: stemming will not recognise that Christian and Kristian are variants of the same name, so ideally users querying ‘Christian’ should be offered ‘Did you mean Kristian?’. Another case in which stemming might not work is with a word like ‘contractual’. Someone searching for information on how to write a specific contract might search for ‘contract’ but would not find any reference to ‘contractual’. This is where lemmatization comes in: it recognises the dictionary root (lemma) of a word, in this case ‘contract’. Probably the best concise account of stemming (and its converse, the equally important ‘word expansion’) and lemmatization has been published by Idea Engineering.
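The difference is easy to see in code. The sketch below is a toy illustration only: the suffix list and the lemma table are invented for this example and bear no relation to the rules of the real Porter stemmer, but they show why brute-force suffix stripping conflates the ‘operate’ family into ‘oper’ while a dictionary-based lemma lookup is needed to map ‘contractual’ onto ‘contract’.

```python
# Toy illustration of stemming vs lemmatization. The suffix list and
# lemma table are invented for this sketch, not the real Porter rules.

SUFFIXES = ["ational", "atives", "ative", "ation", "ating", "ates", "ate", "al", "es", "s"]

def crude_stem(word: str) -> str:
    """Brute-force suffix stripping: remove the longest matching suffix."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

# A lemmatizer relies on a dictionary of word forms; two entries suffice here.
LEMMAS = {"contractual": "contract", "cars": "car"}

def lemmatize(word: str) -> str:
    return LEMMAS.get(word, word)

words = ["operate", "operating", "operates", "operation",
         "operative", "operatives", "operational"]
print({w: crude_stem(w) for w in words})   # every form reduces to 'oper'
print(crude_stem("contractual"))           # 'contractu' - stemming fails here
print(lemmatize("contractual"))            # 'contract' - the lemma lookup succeeds
```

Note how the stemmer happily produces ‘contractu’, a string no user would ever type, whereas the lemma table returns a searchable word.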
Implications for search managers
The implications of stemming and lemmatization are considerable and need careful consideration by the search team. Indeed the complexities of language are why there needs to be a search team, as it will take someone with a background in computational linguistics, information retrieval or information science not only to understand the potential challenges but also to come up with solutions. A good starting place is to build a test collection of content that is representative of some of the linguistic challenges presented by the organisation. Some of these concern specialised terminology used in the organisation. Lawyers often refer to ‘matters’, not ‘cases’. No-one is going to search for ‘matter’ as a verb, so the ranking of these specialised business-related terms has to be accommodated. Reviews of search results should also take into account where a failed or poorly performing search is the result of a stemming problem.
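A test collection need not be elaborate to be useful. The sketch below (the document texts, query and lemma table are all invented for illustration) shows the kind of check a search team can run: the query ‘contract’ misses a document containing ‘contractual’ under exact matching, and a lemma lookup recovers it.

```python
# Minimal sketch of checking a test collection against a known query.
# The documents and the one-entry lemma table are invented for illustration.

docs = {
    1: "the contractual obligations of the supplier",
    2: "how to draft a contract for services",
    3: "the matter was referred to outside counsel",
}

def search(query, normalise=lambda w: w):
    """Return the ids of documents containing the (normalised) query term."""
    q = normalise(query)
    return sorted(doc_id for doc_id, text in docs.items()
                  if q in (normalise(w) for w in text.split()))

# Exact matching misses document 1 ('contractual') for the query 'contract'.
print(search("contract"))  # -> [2]

# A hand-built lemma table recovers it.
lemmas = {"contractual": "contract"}
print(search("contract", normalise=lambda w: lemmas.get(w, w)))  # -> [1, 2]
```

Running a list of such known-item queries after every configuration change is a cheap way of catching stemming regressions before users do.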
Once you get into searching across more than one language (and that includes British English and American English!) the complexities mount up very quickly. Good search applications will recognise the language of a piece of text and initiate an appropriate set of parsing and stemming tools, but can you be confident that your application is not only doing this but doing it well enough to meet the expectations of search users? Moreover, will the application cope with names where there are linguistic variants, because these will generally not be picked up by the linguistic recognition software. Basis Technologies publishes excellent guidance notes on all aspects of the linguistic issues around search. Of course open source search solutions will enable you to choose which parser to use, but how will you decide which of the many available will be best for your organisation? Install and test is the only sensible approach.
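To make the language recognition step less mysterious, here is a deliberately naive sketch of the character-trigram approach commonly used to decide which per-language pipeline to invoke. The tiny English and German profiles below are invented for this example; production systems build their profiles from large training corpora.

```python
# Naive character-trigram language guesser. The two tiny profiles are
# invented for illustration; real systems train on large corpora.
from collections import Counter

PROFILES = {
    "en": Counter({" th": 5, "the": 5, "ing": 3, "and": 3, " of": 2}),
    "de": Counter({"der": 4, "sch": 4, "ung": 3, "ein": 3, "ich": 2}),
}

def trigrams(text):
    """Count overlapping three-character sequences, with word-boundary padding."""
    text = " " + text.lower() + " "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def guess_language(text):
    grams = trigrams(text)
    # Score each language by the overlap between its profile and the text.
    return max(PROFILES,
               key=lambda lang: sum(min(grams[g], n)
                                    for g, n in PROFILES[lang].items()))

print(guess_language("the sounding of the bell"))            # en
print(guess_language("die Bedeutung der Sprache ist einmalig"))  # de
```

Even this toy version shows why short texts, proper names and mixed-language fields defeat language recognition: there are simply too few trigrams to score reliably.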
Martin White