The importance of integrating corporate language and enterprise search strategies
I recently listened to a very interesting presentation by the Enterprise Search Manager for a US-based Fortune 100 company with subsidiaries in just about every country in the world. The investment the company had made in enterprise search was substantial, as was the size of the support team and the hundreds of millions of indexed documents. However, during the Q&A someone asked how many languages were being searched. To my astonishment (though not surprise!) the presenter had no idea.
The scale of use of local languages is rarely on the agenda of the IT department when specifying and implementing search applications, especially (I’m sorry to say) in companies with their HQ in the USA. It also seems to escape the notice of the information research community that enterprise search has to accommodate multiple languages. The current level of investment in NLP solutions is admirable and I have seen some impressive demonstrations. Questions I then ask about the applicability and cross-language consistency of these solutions result in some very vague responses. At least Google is working on some solutions and the challenges and benefits are recognized. Many search vendors are now promoting the value of passage identification in relevant documents, but in which languages? Just English?
In any country in the world where English is not the dominant business and social language it is highly likely that any documents with a compliance relevance (employee contracts and policies, and both supplier and customer contracts, as just a few examples) will be in the local language, and in addition there will be local language versions of press releases, local news, marketing material and user manuals. Government documents (especially Covid-related!) will also be in the local language. Intranets are a particular problem because it is quite possible that there are local language intranets that are invisible to Corporate Communications. These intranets are being used by employees who may have a limited, if any, command of English and may well be constructed and managed by a local software development company. One of my clients was quite surprised to find that it had a Tagalog/Taglish intranet (Philippines) and that it had more users than any other intranet outside of Europe/USA.
Language support has so many elements. The first is the quality of the index, in particular a stop word in one language being an important, relevant term in another. The crawler has to be able to identify the language and the curve ball is when the document is in two or more languages. It happens more frequently than you might imagine. Then there are issues around query management. One of my German clients had 74% of its content in English, but only 26% of employees spoke English as their primary language. That raises all sorts of challenges for spell checking, auto-suggestion and facet ordering. It is not just a question of word-for-word translation. There are five words in German that have the semantic meaning of working together; English just has ‘collaboration’.
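To make the stop word and mixed-language points concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not how any particular search product works: the two tiny stop word lists, the paragraph-level language guess based on stop word counts, and all of the function names are hypothetical. The point is simply that 'die' is a stop word in German but a meaningful term in English, so the stop word list has to be chosen per language, and per paragraph when a document mixes languages.

```python
# Tiny illustrative stop word lists (assumption: real lists are far longer).
STOP_WORDS = {
    "en": {"the", "and", "of", "to", "is", "a"},
    "de": {"der", "die", "das", "und", "ist", "ein"},
}


def guess_language(tokens):
    """Naive guess: the language whose stop words occur most often."""
    counts = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOP_WORDS.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None


def index_terms(paragraph):
    """Return (language, terms), filtering with that language's list only."""
    tokens = paragraph.lower().split()
    lang = guess_language(tokens)
    stop = STOP_WORDS.get(lang, set())
    return lang, [token for token in tokens if token not in stop]


def index_document(text):
    """Handle each blank-line-separated paragraph on its own, so a
    document written in two languages is not forced into one."""
    return [index_terms(p) for p in text.split("\n\n") if p.strip()]


mixed = "the battery and the charger die\n\ndie batterie ist leer"
result = index_document(mixed)
```

In this sketch the English paragraph keeps 'die' as an index term while the German paragraph drops it. A production crawler would use a proper language identification component rather than stop word counts, which are unreliable on short or ambiguous text.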
Next come equivalent documents. It can be very important to be able to find the definitive (usually in English) version of a press release and the versions in multiple languages. This requires adroit metadata management, and is why any organisation working in multiple languages has to have a corporate language strategy that defines a) which languages are supported by search and b) which types of content need to be professionally translated rather than machine translated for compliance reasons. Machine translation is improving rapidly but still has issues with context. If you want to explore the nuances read this case study.
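As a rough illustration of the kind of metadata this implies, the sketch below links language versions of the same document through a shared identifier. Everything here is an assumption made for the example: the field names (group_id, is_definitive, translation) and the fallback rule are hypothetical, not a standard schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    group_id: str        # hypothetical field shared by all language versions
    language: str
    is_definitive: bool  # True for the authoritative version
    translation: str     # "original", "professional" or "machine"


def group_versions(docs):
    """Collect every language version of a document under its group_id."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc.group_id].append(doc)
    return groups


def pick_version(versions, preferred_language):
    """Prefer the searcher's language, else fall back to the definitive version."""
    for doc in versions:
        if doc.language == preferred_language:
            return doc
    for doc in versions:
        if doc.is_definitive:
            return doc
    return versions[0]


docs = [
    Doc("pr-1-en", "pr-1", "en", True, "original"),
    Doc("pr-1-de", "pr-1", "de", False, "professional"),
    Doc("pr-1-pt", "pr-1", "pt", False, "machine"),
]
versions = group_versions(docs)["pr-1"]
```

With metadata like this, the results page can show one entry per press release, offer the other language versions alongside it, and flag whether a version was machine translated.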
Another challenge is the presentation of results in different languages in the SERP. If you think this is anywhere near a solved problem read an excellent thesis by Chenjun Ling. It is essential reading. This thesis also raises the issue of language competency. Our language competency is defined in terms of how well we can speak a language, how well we can write it, how well we can listen with comprehension and how well we can read it. The differences in these competencies have a marked impact on our ability to scan lists of results, identify potentially relevant results in a fragmented sentence format and then be able to read the document when we download it.
When I look at the web sites of search vendors I am dismayed by the lack of attention to detail on searching on a multi-lingual and/or cross-lingual basis. Vendors have a lot to say about their ability to support federated search across multiple applications but far less (if indeed anything) on multiple languages. A search on the vendor web site for [language] invariably lists out references to NLP!
Finally comes the requirement to optimize searching in multiple languages from an analysis of search logs. Staff on the search team need to have fluency in at least one of the indexed languages and in English. Identifying relevance ranking issues requires not only a social fluency in a language but also a knowledge of the business language. Remember that the European vocabulary and usage of Portuguese and Spanish differ from Brazilian Portuguese and Latin American Spanish (which itself varies between countries).
This is a highly condensed outline of language management and enterprise search just to get you thinking. I could also write about social media search (the province of computational sociolinguistics), fuzzy name searching for personal names, supporting diacritic symbols, indexing Arabic and Hebrew and of course the CJK languages, but they will keep for another day.
For now the bottom line is this. Does your enterprise search strategy support the corporate language policies (see also this link) of your organisation? If it does not then there will be substantial business risks to your organisation through information that is vanishing into a linguistic black hole, to say nothing of the impact on employees who feel that due regard is not being paid to their national language.