Question: have so many Britons emigrated to the US that the average cheddar consumption there is determined by the immigration rate? Or was it the other way around, that so much cheddar was produced in the USA that the British felt compelled to emigrate there?
Question: is it due to the disregard of the well-intentioned advice "don’t swim with a full stomach" that most people drown at times of the highest ice cream sales?
And last but not least: is it raining because the road is wet?
The three questions deal with very different topics, but they make one thing clear: when data is analysed and connections between different statements are discovered, the first question to ask is: why?
Correlation vs. Causality
Consider two statements, A and B. Correlation describes the strength and direction of the relationship between A and B. It is symmetric: the correlation between A and B is the same as the correlation between B and A. The correlation coefficient is a value between -1 and 1 that can be calculated from data, where 1 describes a perfect positive relationship, 0 describes no (linear) relationship and -1 describes a perfect opposite relationship. There are several ways to calculate it; the best known is Pearson's correlation coefficient. Correlation says nothing about how the two statements depend on each other: it gives no answer to the why.
Causality describes how A and B depend on each other. Does event A force event B? Or the other way around?
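As a rough illustration of the coefficient described above, here is a minimal Python sketch of Pearson's formula: the covariance of the two series divided by the product of their spreads. The function name `pearson` and the sample numbers are invented for this example and do not come from any real data set.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long number sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (covariance up to a constant factor)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (variances up to the same constant factor)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Invented sample data
a = [1, 2, 3, 4, 5]
b_same_direction = [2, 4, 6, 8, 10]
b_opposite = [10, 8, 6, 4, 2]
b_mixed = [3, 1, 4, 1, 5]

print(pearson(a, b_same_direction))  # close to 1: same direction
print(pearson(a, b_opposite))        # close to -1: opposite direction
# Swapping the arguments does not change the result
print(pearson(a, b_mixed), pearson(b_mixed, a))
```

Note that the function is symmetric in its arguments, so the number it returns cannot, by itself, say which of the two series influences the other.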
Connections in Practice
Back to the examples: all cases described above show correlations, but what about causality?
The example of wet roads is easy to analyse. The assertion "it rains because the road is wet" can be refuted as follows: there are times when the road is wet but it is not raining, for example directly after a rain shower or after a water pipe has burst. The statement "the road is wet because it is raining", on the other hand, describes an obvious causal connection.
In the example of emigration to the USA and cheddar consumption, it is difficult to identify a (serious) causality despite high correlation coefficients, so it is more reasonable to assume a coincidence here.
Of particular interest is the example of ice cream consumption and the number of drownings. It is difficult to describe a direct causality, but a connection does not seem as absurd as in the cheddar example: in times of particularly high temperatures, people eat a lot of ice cream and swim a lot. So there is a causal connection to a third statement ("it's hot!") that influences both the amount of ice cream sold and the number of people who drown. In such cases, one speaks of a spurious correlation, and it is exactly these that lead to new questions and thus to new insights.
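The ice cream example can be played through with simulated data. The following Python sketch is only an illustration under invented assumptions: temperature drives both ice cream sales and a made-up "drowning index", and the two never influence each other. All coefficients, noise levels and names (`pearson`, `partial_corr`, `ice_cream`, `drownings`) are hypothetical. The sketch compares the raw correlation with the partial correlation, which removes the linear effect of the third variable.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def partial_corr(xs, ys, zs):
    """Correlation of xs and ys after removing the linear effect of zs."""
    r_xy, r_xz, r_yz = pearson(xs, ys), pearson(xs, zs), pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


rng = random.Random(42)  # fixed seed so the run is repeatable
days = 2000

# Invented model: temperature is the only common cause.
temperature = [rng.uniform(5, 35) for _ in range(days)]
ice_cream = [20 * t + rng.gauss(0, 60) for t in temperature]
drownings = [0.3 * t + rng.gauss(0, 1.5) for t in temperature]

raw = pearson(ice_cream, drownings)
adjusted = partial_corr(ice_cream, drownings, temperature)

print(f"ice cream vs. drownings:            {raw:.2f}")
print(f"same, with temperature held fixed:  {adjusted:.2f}")
```

In this simulation the raw correlation should come out clearly positive, while the adjusted value should stay near zero, because the apparent link exists only through the temperature. With real data, the third variable is usually not known in advance, which is exactly why the correlation is a starting point for questions.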
Correlation as a Guide
Data can show correlations, but not causalities. As a rule, however, it is precisely the causal relationships between factors that are interesting, because they give a better overall picture and help, for example, to identify further requirements. That does not make the correlations found in data worthless. On the contrary: they should be understood as questions, not as answers.
If it is understood that the sale of ice cream depends on the heat (and not on the number of drowned people), it is possible to deduce when ice cream consumption will increase.
In this way, data for a specific product can be used to reach potential user groups:
- Discovering new applications for existing products
- Recognising meaningful new features and understanding which factors influence a product
- Understanding how machine malfunctions occur and how to derive effective measures
In conclusion, data does not provide perfect answers to questions. But if the vast amounts of data available worldwide are taken as starting points for questions, they open up new perspectives, ideas and points of contact for innovative products and business models that can make tomorrow's world a better place.