Janus Boye asked me recently what the ‘silver bullet’ was for success in enterprise search. It is a question that no one has asked me before and it took a moment to come up with a response. My answer to his question was that in my view the way in which an enterprise search application provided users with the ability to specify a date or date range as a key search parameter was probably the silver bullet.
When searching a web site ‘date’ is rarely an important selection criterion, and where it is (for example in finding the annual report of a company) the user is offered a list in chronological order. In the enterprise there is a constant need to define a specific date (‘the 2011 corporate social responsibility statement’), a date range (‘projects undertaken in Germany between 2009 and 2011’) or a date representation (‘the Q3 and Q4 sales reports for alternators’). Achieving even a reasonable degree of user satisfaction is difficult for at least five reasons.
The first is a lack of consistent metadata on the content item. The content item may have the date on the front cover but there will also be the date it was first uploaded to the server, the dates on which various revisions were made and (far too often!) the date that IT reinstalled the server and gave the same upload date to every document. In the case of Q3 and Q4 reports the date may be implicit because readers of the reports know that they are the current year reports. Things get more interesting when the query is about reports published in Q3 and Q4 which requires the search engine to deduce the implied date range. Document management systems tend to be good at enforcing date metadata but web CMS applications can make it very difficult to determine the ‘accepted’ date of a content item.
Second is the way in which the search application normalises and tokenises dates. This is a particular problem for multinational companies where 6th May 2012 could be 6/5/2012, 5/6/2012 or 2012-6-5. In the case of Q3 and Q4 this may also require the search indexing process to know whether Q3 and Q4 refer to a calendar year or to a financial year.
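To make the ambiguity concrete, here is a minimal Python sketch that lists every calendar date a numeric string could represent. The function name `candidate_dates` is hypothetical, and the sketch assumes that a year-first string is written year-month-day; it is an illustration of the problem, not a description of how any particular search engine normalises dates.

```python
from datetime import date

def candidate_dates(text):
    """Return, as ISO strings, every date a numeric string could mean."""
    parts = [int(p) for p in text.replace("-", "/").split("/")]
    if len(parts) != 3:
        return []
    a, b, c = parts
    if a > 31:
        # Year-first string such as 2012-6-5; assumed to be year-month-day.
        orderings = [(a, b, c)]
    else:
        # Year-last string: try day/month/year and month/day/year.
        orderings = [(c, b, a), (c, a, b)]
    found = set()
    for year, month, day in orderings:
        try:
            found.add(date(year, month, day))
        except ValueError:
            pass  # not a real calendar date, so discard this reading
    return sorted(d.isoformat() for d in found)

print(candidate_dates("6/5/2012"))    # two possible readings
print(candidate_dates("25/12/2011"))  # only one reading is a valid date
```

When more than one reading survives, the indexing process needs some other signal, such as the originating office or the conventions of the source system, to choose between them.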
The third problem is that few search user interfaces allow results to be sorted in chronological date order. In some circumstances seeing the oldest results first is useful, so that the earliest reference to when work started on a patent application can be found. The fourth is the issue of how the query is normalised against the index, a component of search technology that is often overlooked. Finally the ranking algorithm, which is probably not ranking date very highly, has to recognise that the query is about a date-related topic and then display the relevant results on the first, or at least the first two, pages of results.
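As an illustration of query normalisation for dates, the sketch below turns a quarter reference in a query into an explicit date range that could be used as a filter. The names `quarter_range`, `date_filters` and `fy_start_month` are hypothetical, and the code assumes that a financial year is labelled by the calendar year in which it starts; real organisations may label it differently.

```python
import re
from datetime import date, timedelta

# Matches references such as "Q3 2011" (case-insensitive).
QUARTER = re.compile(r"\bQ([1-4])\s+(\d{4})\b", re.IGNORECASE)

def quarter_range(quarter, year, fy_start_month=1):
    """Return the first and last day of the given quarter.

    fy_start_month=1 means a calendar year; 4 would mean a financial
    year that starts in April (an assumed labelling convention).
    """
    offset = fy_start_month - 1 + 3 * (quarter - 1)  # months after January of `year`
    start = date(year + offset // 12, offset % 12 + 1, 1)
    nxt = offset + 3
    next_start = date(year + nxt // 12, nxt % 12 + 1, 1)
    return start, next_start - timedelta(days=1)

def date_filters(query, fy_start_month=1):
    """Find quarter references in a query and return their date ranges."""
    return [quarter_range(int(q), int(y), fy_start_month)
            for q, y in QUARTER.findall(query)]

for start, end in date_filters("Q3 2011 and Q4 2011 sales reports for alternators"):
    print(start.isoformat(), "to", end.isoformat())
```

The same query produces different ranges once `fy_start_month` changes, which is why the search team needs to know which convention each content source follows.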
The end result is that providing good date search requires the search support team to know how every content management application (and not just web) manages date metadata, what date formats are (or have been) used, how the search engine tokenises dates, how the query is normalised and how to achieve query-specific ranking.
Mark Bennett (New Idea Engineering) has been exploring tokenising issues in a couple of recent blog posts. It is an important topic and it’s worth asking a search vendor to walk you through its approach to tokenising. You might have to wait for a while but in the meantime Mark’s posts provide a very clear view of the implications. For an in-depth account of tokenising read this regrettably anonymous paper. Welcome to the complex world of enterprise search management!
Martin White