Lies, damned lies and statistics – the Big Data implications

No one is quite sure who coined the phrase about lies, damned lies and statistics, but I suspect that as more attention is paid to sophisticated approaches to data analysis there will be occasions on which decisions come back to bite. The problems lie not just in data cleansing or data validity but in the way the numbers can be used to justify a position. I remember, from many years ago, a monograph on statistics from ICI (Imperial Chemical Industries) which the author dedicated to his wife, as she was a better cook than any statistician!

In the course of a recent project I came across a not-for-profit company called Full Fact, based in London. Full Fact is an independent fact-checking organisation whose objective is to make it easier to see the facts and context behind the claims made by the key players in British political debate, and to press those who make misleading claims to correct the record. Today the subject is the scale of the oil reserves in the UK that could be extracted by fracking. The Prime Minister, David Cameron, has suggested that these would last 51 years but the latest estimate from an industry expert is 25 years. That’s quite a difference!

Apart from the question of which is the ‘correct’ estimate, there is a wider issue about information semantics: it is not just the number itself but exactly how it is presented in a report or a PowerPoint file that could influence a decision. This highlights the fact that ‘unstructured’ data (as information is often referred to by the Big Data community) actually has a very important element of structure to it, which provides the semantic context. Luciano Floridi sets out the issues very well in his contribution to the Stanford Encyclopedia of Philosophy.

As your organisation starts to get excited about being a data-driven business it’s worth looking at the Full Fact site and thinking about how data is going to be presented and trusted. The Full Fact operation also shows the problems of relying on a single source of data truth. Even within the same organisation there could be more than one definitive source of data and several different ways of presenting an analysis. Who is going to make the choice about which is the ‘correct’ version? Welcome to the world of Big Data.

Martin White