“Data science is the sexiest job of the 21st century” is often quoted, and it has recently made me think about the increasing awareness not only of data scientists and mathematicians/statisticians but, more broadly, of the impact of their work on everyday life.  However, behind every great data scientist there is a data set.  Although critical to the success of a project, data rarely makes it into the limelight, and when it does it is most commonly for all the wrong reasons.

If data were to be given its time in front of a camera, what would it want to talk about?  What aspects of its role would it want to highlight and make sure the analysts are aware of?  Putting it another way, what do we, the data scientists, need to check for?  There are no simple answers to these questions, but I suggest the following points to help guide data checking before any exploration begins:

Is the data reliable and trustworthy?  Have you looked at the data set, and can you answer questions on its accuracy or the precision with which it was recorded?  Have you looked at the range of each variable and checked whether there are any missing data points?
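
As a rough illustration of the range and missing-value checks above, here is a minimal Python sketch using only the standard library.  The function names, the set of markers treated as "missing" and the example rows are all assumptions made up for this illustration; they are not taken from any particular data set or tool.

```python
import math

# Strings treated as "missing". This set is an assumption; adjust it to
# match how gaps are actually encoded in your own data.
MISSING_STRINGS = {"", "NA", "N/A", "NaN", "null"}


def is_missing(value):
    """Return True if a cell should be counted as a missing data point."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if isinstance(value, str) and value.strip() in MISSING_STRINGS:
        return True
    return False


def summarise_columns(rows):
    """For each column, count missing values and report the numeric range.

    `rows` is a list of dictionaries, one per record. Columns with no
    numeric values get a range of None.
    """
    # Collect column names in the order they first appear.
    columns = []
    for row in rows:
        for name in row:
            if name not in columns:
                columns.append(name)

    summary = {}
    for name in columns:
        values = [row.get(name) for row in rows]
        present = [v for v in values if not is_missing(v)]
        numeric = [
            v for v in present
            if isinstance(v, (int, float)) and not isinstance(v, bool)
        ]
        summary[name] = {
            "n_missing": len(values) - len(present),
            "min": min(numeric) if numeric else None,
            "max": max(numeric) if numeric else None,
        }
    return summary


# Hypothetical records, invented purely for this example.
example_rows = [
    {"age": 34, "height_cm": 172.5},
    {"age": None, "height_cm": 168.0},
    {"age": 290, "height_cm": float("nan")},
]

report = summarise_columns(example_rows)
for column, stats in report.items():
    print(column, stats)
```

A summary like this will not tell you whether the data is trustworthy, but it makes the obvious problems visible early: in the made-up rows above, a maximum age of 290 is exactly the sort of value that should prompt a question about how the data was recorded.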

When was the data collected and why?  If this data set is new to you, it’s important that you understand when and why it was originally collected, as this could affect its overall utility and applicability for the analyses you want to work on.  For example, are there key variables which haven’t been captured?

What are the biases and limitations of the data set?  Identifying bias within a data set is not always easy, but it is imperative, so that you understand how the data set can be used and the extent to which the results can be generalised.

What was the population and sample size?  It’s important to remember that not all data sets are big, and small really can be beautiful.  What is important is that the analysis approach taken is appropriate for the sample size you’re working with.

If you put your data in the spotlight and can answer all these questions then you’ll have created a robust foundation for your exploratory analysis, and have done your data proud.