“Between the Spreadsheets – Classifying and Fixing Dirty Data” by Susan Walsh

What a superb title! It makes you smile before you even open the book. At last there is a book that focuses on data and content quality and does so in a very practical way. Susan Walsh (aka The Classification Guru) is an information entrepreneur who somewhat accidentally fell into the business of sorting out messy data. At the heart of Susan’s methodology is COAT, which focuses on data Consistency, Organisation, Accuracy and Trustworthiness. Having spent much of this year working on an e-commerce search project, I can confirm that even market-leading e-commerce companies are at the mercy of poor-quality data generated by suppliers. The company had to depend on suppliers paying attention to data quality, and yet in search after search rogue products were presented purely as a result of inconsistent and often incoherent codes being applied to products.

The chapter headings define the scope as The Dangers of Dirty Data, Supplier Normalisation, Taxonomies, Spend Data Classification, Basic Data Cleansing, a Dirty Data Maturity Model and Data Horror Stories. Much of the text revolves around the use and misuse of Excel, but that does not mean that the book is only suitable for SMEs. The principles are the same for any database application and you would be surprised to find out just how many larger companies use Excel despite all the problems over issues like record locking. (As a side issue it would be really useful if Microsoft could spend just a fraction of its research budget sorting out Excel rather than adding yet another feature to the feature mountain that is Teams!)

The style of the writing is classic storytelling, somewhat breathless at times but always focused on the core themes of the book and with many very useful examples of broken Excel worksheets. I especially liked the way that each chapter has a Conclusion that highlights the lessons presented and leads the reader on to the next chapter. Full marks to both the author and to Facet for this approach. The quality of production is excellent, but the index is not as good as it needs to be for a handbook-style publication.

If you are teaching data science then all your students should be made aware of this book. When it comes to organisations, I can’t see any reason for not making sure that anyone managing an Excel database has a copy to refer to. If it does no more than provide a guide to the effective use of Excel then it has an immediate benefit. But this is not just an ‘Excel manual’; this is about understanding how poor data quality can defeat any IT application and the simple (in principle) steps you and your colleagues can take to begin to sort out the mess you are in. This 150pp book will set you back £36.99 and is excellent value for the price. It also makes a very good business case for using Susan as a consultant – the author really sells herself between the lines with her passion for data quality.

Martin White