Written by Jonathan Bowden
The hunt for truth: Why is exploratory data analysis important?
In this month's Machine Learning blog, Jonathan Bowden, hx Senior Model Developer, demonstrates how exploratory data analysis (EDA) can be used to optimise data processing and build better models.
The best machine learning models are built from clean, high-quality data that has been effectively and skilfully processed. Quite often, this task requires the heaviest lifting and has led to a running joke that most data scientists spend 80% of their time cleaning data and only 20% calibrating models.
I used to think that exploratory data analysis (EDA) was just a fancy term for producing some summary statistics and a graph or two. The reality is that while these are the core activities we should perform, there can be far more to it, and the boundary between EDA and data pre-processing is often blurred. Before we embark on our EDA, we should know the problem we're trying to solve. Understanding the data types is usually the first step: identifying which fields are numerical and which are categorical.
We will likely convert our categorical data into numbers later, but we should be careful to keep all the text data we have, because natural language processing (NLP) techniques might be able to give us valuable insights. A great example comes from the famous Titanic dataset: we might assume that a passenger's name has no bearing at all on their survival probability. But many of the children on board carried the honorific ‘Master’, so even where a passenger's age is missing, we can often infer from their name that they were a child and therefore more likely to survive the catastrophe.
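As a rough illustration, here is a minimal sketch of how that title could be pulled out of a Titanic-style Name column with pandas; the three-row DataFrame is made up and simply mimics the dataset's naming format.

```python
import pandas as pd

# Made-up rows in the Titanic naming style: "Surname, Title. Given names"
df = pd.DataFrame({
    "Name": ["Allison, Master. Hudson Trevor",
             "Bowen, Mr. David John",
             "Carter, Miss. Lucile Polk"],
    "Age": [0.92, None, 14.0],
})

# Capture the honorific between the comma and the full stop
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False)

# 'Master' was used for boys, so it flags likely children even when Age is missing
df["LikelyChild"] = df["Title"].eq("Master")
print(df[["Name", "Age", "Title", "LikelyChild"]])
```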
Summary Statistics
With the numerical fields, we can create our summary statistics. Here, we’re looking for null values, means, medians, standard deviations, skews, correlations and other valuable metrics. In Python, Pandas’ describe() method can be great, but I recently came across the pandas_profiling package, which offers a slightly richer, more interactive report.
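As a sketch of what that first pass might look like (the data.csv path is just a placeholder for whatever dataset you are exploring; note that pandas_profiling has since been republished as ydata-profiling):

```python
import pandas as pd
from pandas_profiling import ProfileReport  # now maintained as ydata-profiling

df = pd.read_csv("data.csv")  # placeholder path for your own dataset

# Quick numerical summary: count, mean, std, min, quartiles, max
print(df.describe())

# Richer interactive report: distributions, missing values, correlations, warnings
profile = ProfileReport(df, title="EDA report", minimal=True)
profile.to_file("eda_report.html")
```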
Once you have completed this step, it’s a good idea to write down your observations. It helps to briefly describe each column and the range it varies over, as you will likely refer to it later. In addition, discussing the dataset with colleagues, especially those far removed from the work, can yield some valuable insights.
Visualisation
With the statistics under our belt, we can freely create graphical representations of our data. The goal is to look for trends and anomalies and reinforce our understanding of the data. As a seasoned coder, you’ll probably have your favourite charting packages in your language of choice. For me, in Python, this is usually a mix of seaborn and matplotlib, with guest appearances from plotly.
We can skip making the graphs look polished since our work here is for EDA purposes only. The goal is to make as many graphs relevant to our data as possible, which gives us the best chance of identifying trends and anomalies, and a quick seaborn pairplot will do much of this in one go. If you'd rather not spend time writing code, tools like Tableau or Power BI are excellent for quickly manipulating data views and offer much of the same functionality as the Python packages. I recommend looking at the Python Graph Gallery for inspiration about what might help your investigation; there is a wealth of information there covering all the standard Python approaches.
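For illustration, a minimal pairplot sketch using seaborn's built-in penguins dataset as a stand-in for your own data:

```python
import matplotlib.pyplot as plt
import seaborn as sns

df = sns.load_dataset("penguins")  # stand-in dataset; swap in your own DataFrame

# One call gives every pairwise scatter plot plus per-feature distributions on the diagonal
sns.pairplot(df, hue="species", diag_kind="hist")
plt.show()
```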
Dinosaurs in the data
The best way to demonstrate a graph's importance is by quickly looking at some famous datasets. Anscombe's quartet has long been used as an example of why it is so important to plot your data. Here, I present a spin on Anscombe's quartet: the means, standard deviations and correlations of the datasets below are all nearly identical, so our EDA must go deeper in the hunt for truth.
The scatter plots show a different story:
This dataset is called the Datasaurus Dozen, and it highlights just how extreme the differences hiding behind near-identical summary statistics can be.
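Seaborn ships Anscombe's quartet as a built-in dataset, so a quick sketch of the "similar statistics, very different pictures" effect might look like this (the Datasaurus Dozen itself would need to be downloaded separately):

```python
import matplotlib.pyplot as plt
import seaborn as sns

df = sns.load_dataset("anscombe")  # columns: dataset, x, y

# Near-identical summary statistics for each of the four datasets...
print(df.groupby("dataset")[["x", "y"]].agg(["mean", "std"]).round(2))
print(df.groupby("dataset")[["x", "y"]].corr().round(2))

# ...but the scatter plots tell four very different stories
sns.lmplot(data=df, x="x", y="y", col="dataset", col_wrap=2, ci=None, height=3)
plt.show()
```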
Additional Exploration
By now, we should have a fairly solid understanding of the patterns in our dataset, and we may already be forming ideas about how to fill in any missing values and which outliers we want to drop.
Before we move on to the data preparation, it is worth thinking about the following more advanced methods as a means of exploration:
Clustering – There are instances where clustering can significantly help inform our understanding of the data. Clustering algorithms allocate data into homogeneous groups, taking in as many features as we like, which we can then revisit with summary statistics and graphs. With a simple two-dimensional (X, Y) dataset, humans can easily see groupings of points on a scatter plot, but as soon as the number of features goes beyond 3 or 4, we will find it challenging to make a visual representation that helps us.
Principal Component Analysis (PCA) – This method is often used in data processing to “flatten” multi-dimensional data. We can use the same methodology for EDA purposes to see how much of the variance in the data each principal component explains. If we find that just two or three components explain most of the variance, we may want to examine the features driving those components in more detail.
It is worth noting that both approaches typically require the data to be normalised first, so there is some overlap between data processing and our EDA here; a short sketch of both approaches follows below.
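As a rough sketch of both ideas, assuming scikit-learn is available and again using the seaborn penguins dataset as a stand-in for your own numeric data:

```python
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in data; replace with the numeric columns of your own dataset
df = sns.load_dataset("penguins").dropna()
X = df.select_dtypes("number")

# Normalise first, as noted above: both methods are sensitive to feature scale
X_scaled = StandardScaler().fit_transform(X)

# Clustering: assign each row to one of k groups, then re-profile each group
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
print(X.assign(cluster=labels).groupby("cluster").mean().round(1))

# PCA: how much of the variance do the leading components explain?
pca = PCA().fit(X_scaled)
print(pca.explained_variance_ratio_.round(3))
```

If the first two components dominate the explained variance, plotting the data in that reduced space is often a worthwhile next step.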
Conclusion
I could include several other methods in this guide, but by the time you've applied those listed above, you will likely have uncovered most, if not all, of the dataset's underlying features and patterns.
For financial datasets, we often stop at the basic summary statistics and a chart. Depending on the data, this may be all we need, but a detour into clustering, PCA or more advanced methods may yield valuable insights that no one has yet come across.
The flexibility to present and process insurance data in a manner that is easy to work with is one of the key features of our next-generation pricing tool, hx Renew, and I'm pleased to share that improved graphing capabilities are coming to Renew in 2023, so stay tuned!
If you'd like to learn more about how hx Renew is helping insurers better process and manage their data, request a demo here.