Putting the science into ‘data science’

There has been a lot of discussion about the importance of data scientists in the management of big data, but the focus has largely been on analytics skills. When it comes to data collections, the concern seems to be with identifying data that has no value rather than with the quality of the data itself. Throughout my career in the chemical and information sciences, and then in large-scale market research, being aware of the uncertainty in data quality has always been an important element of my work.

Carl Sagan (1934-1996) was a distinguished astronomer, writer and communicator who tended to be somewhat outspoken in his views and was not universally adored by the scientific or government communities. I recently came across his reflections on data quality in his book “The Demon-Haunted World: Science as a Candle in the Dark”, published by Random House in 1995, a book full of insights into science and the scientific method.

“Every time a scientific paper presents a bit of data it’s accompanied by an error bar – a quiet but insistent reminder that no knowledge is complete or perfect. It’s a calibration of how much we trust what we think we know. If the error bars are small the accuracy of our empirical knowledge is high: if the error bars are large then so is the uncertainty in our knowledge. Except in pure mathematics nothing is known for certain (although much is certainly false). … The error bar is a pervasive, visible, self-assessment of the reliability of our knowledge.”

How much of the data you are managing has error bars? Or are you and your organisation assuming that it is precise and perfect?

Martin White