Inside the enterprise there is a substantial world of corporate language that is challenging both to index and query. An element of this language is alphanumeric strings such as ER-OS-36X as the internal designation of a product, but I’ll come back to these in a moment.
I have already covered personal name search in an earlier post, but it deserves to be included in this post as well. It is, however, fairly easy to specify and test, as you can just ask HR for an employee list, pick out a selection of names and check with the vendor that their software can manage them. My favourite test is to see how they cope with the differences between Spanish and Portuguese name formats. As with all entity extraction tasks you need to understand how the index manages entities and how you can add in any special cases you have, ideally as a bulk upload from another application and not on a one-by-one basis.
Another entity class is date format. Finding content within a date range requires the date to be indexed in a normalized format, so that the software can recognize that 4/3/2023, 3/4/2023 and 2023/4/3 can all be the same date written according to different regional conventions. Ask the vendor if its software can support ISO 8601 and if they hesitate for even a moment you may want to dig deeper.
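To make the point concrete, here is a minimal sketch in Python of what normalizing to ISO 8601 involves. The region labels and the `REGION_FORMATS` table are my own assumptions for illustration, not a description of how any vendor's software works; the real difficulty in an enterprise index is knowing which convention a given document was written in.

```python
from datetime import datetime

# Hypothetical mapping from a regional convention to a parsing pattern.
REGION_FORMATS = {
    "US": "%m/%d/%Y",        # month/day/year
    "UK": "%d/%m/%Y",        # day/month/year
    "YEAR_FIRST": "%Y/%m/%d",  # year/month/day
}


def to_iso8601(text: str, region: str) -> str:
    """Parse a date written in the given regional convention and
    return it as an ISO 8601 calendar date (YYYY-MM-DD)."""
    parsed = datetime.strptime(text, REGION_FORMATS[region]).date()
    return parsed.isoformat()


# The three strings from the text, each read under a different convention.
normalized = {
    to_iso8601("4/3/2023", "US"),
    to_iso8601("3/4/2023", "UK"),
    to_iso8601("2023/4/3", "YEAR_FIRST"),
}
```

All three strings collapse to the single value `2023-04-03`, which is what a date range query needs. Read the first string under the UK convention instead and you get 4 March, which is exactly the ambiguity to raise with the vendor.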
Then we come to product and service descriptions. There will almost certainly be a list of these in your ERP application that can be used as a test set. As with ER-OS-36X (or is it ER/OS/36-X?) the index software needs to be able to tokenise the string in such a way that whichever variant is used in a query, the same results are presented. That is much more difficult than it may seem. In the process of indexing you should also be presented with a list of entities that the software has not been able to cope with so that you can come up with some workarounds. The interface to manage these additions and corrections should be a graphical UI; you do not want to be working at code level!
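Purely to illustrate the matching problem, and not as something you should expect to write yourself, here is a rough Python sketch of one possible approach: reduce every product code to a canonical key at both index time and query time. The function names and the sample document identifiers are hypothetical.

```python
import re
from collections import defaultdict


def normalize_code(text: str) -> str:
    """Drop every character that is not a letter or digit, then upper-case,
    so that punctuation variants of a code share one key."""
    return re.sub(r"[^0-9A-Za-z]", "", text).upper()


def build_index(docs: dict) -> dict:
    """Map each normalized product code to the set of documents containing it.
    `docs` maps a document id to the product code found in that document."""
    index = defaultdict(set)
    for doc_id, code in docs.items():
        index[normalize_code(code)].add(doc_id)
    return index


def search(index: dict, query: str) -> set:
    """Normalize the query the same way as the indexed codes."""
    return index.get(normalize_code(query), set())


# Hypothetical documents using different spellings of the same product code.
index = build_index({
    "spec.pdf": "ER-OS-36X",
    "invoice.docx": "ER/OS/36-X",
    "memo.txt": "er os 36x",
})
results = search(index, "ER/OS/36-X")
```

With this key all three documents are returned for any of the variants. The trade-off is that stripping punctuation so aggressively can merge codes that are genuinely different, which is one reason the problem is harder than it looks and why the list of unhandled entities matters.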
Some other issues I have come across are project names (especially if they are common English words) and the names of countries and cities in different languages: Cologne vs Köln, for example. Chemical names are also a challenge, especially when your company uses a particular product which is known by the manufacturer's product name and not the chemical name. An employee may want to find the Safety Data Sheet very quickly indeed.
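One common way of handling this kind of variation is a synonym list that expands a query term into all of its known equivalents. The Python sketch below is only an illustration under my own assumptions: the group contents are examples, and "ExampleSolv 200" is an invented trade name, not a real product.

```python
# Each set holds terms that should be treated as equivalent at query time.
SYNONYM_GROUPS = [
    {"Cologne", "Köln"},
    {"ExampleSolv 200", "acetone", "propan-2-one"},  # trade name is invented
]


def build_lookup(groups):
    """Map every case-folded term to the full case-folded group it belongs to."""
    lookup = {}
    for group in groups:
        folded = frozenset(term.casefold() for term in group)
        for term in folded:
            lookup[term] = folded
    return lookup


LOOKUP = build_lookup(SYNONYM_GROUPS)


def expand_query(term: str) -> set:
    """Return the term plus all of its known synonyms, case-folded.
    Unknown terms are returned on their own."""
    key = term.casefold()
    return set(LOOKUP.get(key, {key}))


city_terms = expand_query("KÖLN")
product_terms = expand_query("ExampleSolv 200")
```

A query for either spelling of the city then searches for both, and the trade name also finds documents that use only the chemical name. The question for the vendor is who maintains such lists and whether they can be loaded in bulk.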
And so the list goes on!
You can never be sure that you have covered off all the instances, so the issue on the table with the vendor is how exceptions to their index process can be identified and addressed. In an ideal world you’d like to talk to one or two of their customers to get their opinion on entity extraction management.
See here for Part 1
Martin White Principal Analyst 17 March 2023