What are the questions in data science projects?

Anyone who analyses data probably knows the situation: you jump straight into the data of a new project, highly motivated, and put a lot of energy into the analysis. Then, in the meeting with the customer, you realise that their expectations about the information to be delivered are slightly or totally different. To avoid this, just as much energy should be spent before the actual data analysis on asking the right questions and refining them as the project progresses. Jeffrey T. Leek and Roger D. Peng have dealt with this topic in their paper, which serves as the basis for this article. The paper is built around types of questions that help to sharpen the analysis of the data and to evaluate the available data in different phases.

The different types of questions

For each situation there is an appropriate type of question. Choosing it is the foundation for answering the question and for interpreting the results correctly. The paper names and discusses the following six types of question, which should serve as the basis for any investigation.

  1. Descriptive
  2. Exploratory
  3. Inferential
  4. Predictive
  5. Causal
  6. Mechanistic/Prescriptive

In order to help you understand the question types, we will look at the questions in the context of the following example:

We have a dataset of the population in Munich, which includes the fitness level and the distance to the nearest recreational area with comprehensive sports facilities.

A descriptive question aims to summarise the characteristics of the data. In our example, this includes the ratio of men to women, the average distance to the recreational area and the range of fitness levels. The question focuses on the attributes of the data and involves no interpretation.
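As a minimal sketch of such a descriptive summary, the snippet below computes the three figures mentioned above. The records, the field layout and the `describe` helper are invented for illustration and are not taken from the paper or from any real Munich dataset.

```python
from statistics import mean

# Hypothetical records, invented for illustration:
# (sex, distance to recreational area in km, fitness level on a 1-10 scale)
people = [
    ("f", 0.5, 8),
    ("m", 1.2, 7),
    ("f", 2.0, 6),
    ("m", 3.5, 4),
    ("f", 4.0, 5),
]


def describe(records):
    """Summarise single attributes only: no relationships, no interpretation."""
    sexes = [r[0] for r in records]
    distances = [r[1] for r in records]
    fitness = [r[2] for r in records]
    return {
        "share_female": sexes.count("f") / len(sexes),
        "mean_distance_km": mean(distances),
        "fitness_range": (min(fitness), max(fitness)),
    }


summary = describe(people)
print(summary)
```

Each number describes one attribute on its own; nothing here relates distance to fitness yet.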

An exploratory question examines the data for relationships between attributes, trends or other patterns. This type of question is designed to generate hypotheses from the results. In our example, a correlation between distance and fitness level might be found. Consequently, you hypothesise that the closer a person lives to the recreational area, the higher that person's fitness level is.
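One simple way to look for such a relationship is a correlation coefficient. The sketch below computes a Pearson correlation by hand on invented numbers; the data and the `pearson` helper are assumptions for illustration, and the article does not prescribe this particular measure.

```python
from math import sqrt
from statistics import mean

# Hypothetical values, invented for illustration.
distance_km = [0.5, 1.2, 2.0, 3.5, 4.0]
fitness = [8, 7, 6, 4, 5]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


r = pearson(distance_km, fitness)
print(f"correlation between distance and fitness: {r:.2f}")
```

A clearly negative value, as in this toy data, would suggest the hypothesis "shorter distance, higher fitness". It is a starting point for further questions, not a conclusion.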

The inferential question takes the hypothesis from the exploratory analysis and tests it on another, representative data set. If the previously found relation shows up there as well, it may be assumed to hold for the wider population. If we investigate a second data set from Hamburg, find the same relation between a shorter distance to a recreational area and a higher fitness level, and are convinced that this sample is representative of Germany, we can conclude that the relation applies to all people in Germany.
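A rough sketch of such a check on a second sample is shown below. It uses an exact permutation test, which is our choice for illustration and not a method named in the article; the "Hamburg" numbers and both helper functions are invented. The test only asks how often a relation this strong would appear by chance. Whether the sample is representative still has to be argued separately.

```python
from itertools import permutations
from math import sqrt
from statistics import mean

# Hypothetical second sample ("Hamburg"), invented for illustration.
distance_km = [0.3, 0.8, 1.5, 2.2, 3.0, 3.8, 4.5]
fitness = [9, 7, 8, 6, 5, 4, 3]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def permutation_p_value(xs, ys):
    """Share of all reorderings of ys whose |r| is at least the observed |r|."""
    observed = abs(pearson(xs, ys))
    hits = total = 0
    for shuffled in permutations(ys):  # 7! = 5040 reorderings
        total += 1
        # Small tolerance so that floating-point ties count as hits.
        if abs(pearson(xs, shuffled)) >= observed - 1e-12:
            hits += 1
    return hits / total


r = pearson(distance_km, fitness)
p = permutation_p_value(distance_km, fitness)
print(f"r = {r:.2f}, permutation p-value = {p:.4f}")
```

Enumerating every reordering is only feasible for very small samples; for realistic sample sizes one would draw random permutations instead.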

A frequently misunderstood point is that although the inferential question can show a relationship between distance and fitness level, it does not say that a short distance automatically increases the fitness level. That would be a causal question, which is rarely asked directly because the design of most data analyses does not allow it to be answered. In our example, the people who live close to a recreational area may be younger on average, and their average fitness level would be higher for that reason alone.
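The age effect can be made concrete with a small, deliberately constructed example. In the invented data below, younger people live closer and are fitter, but within each age group distance and fitness are unrelated. The numbers, the group labels and the `pearson` helper are all assumptions for illustration.

```python
from math import sqrt
from statistics import mean

# Hypothetical data, constructed so that age drives both distance and fitness.
# Each entry: (distance in km, fitness level)
groups = {
    "young": [(0.5, 8), (1.0, 7), (1.5, 7), (2.0, 8)],
    "old": [(3.0, 4), (3.5, 3), (4.0, 3), (4.5, 4)],
}


def pearson(pairs):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


everyone = groups["young"] + groups["old"]
overall_r = pearson(everyone)
within_r = {name: pearson(pairs) for name, pairs in groups.items()}

print(f"all people: r = {overall_r:.2f}")
for name, r in within_r.items():
    print(f"{name} only: r = {r:.2f}")
```

The pooled data show a strong negative correlation, while the correlation inside each age group is zero. In this toy setting the apparent effect of distance is entirely explained by age.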

A predictive question aims to forecast future events. In our example, the question would be to what extent the fitness level of the population would change if existing recreational areas were enlarged or new ones were built.
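As a minimal sketch of a prediction, the snippet below fits a straight line to invented data and uses it to estimate the fitness level at a distance that was not observed. The data, the linear model and the helper names are assumptions for illustration. Such a forecast only holds if the pattern in the data stays the same, and it does not by itself show that changing the distance would change fitness.

```python
from statistics import mean

# Hypothetical values, invented for illustration.
distance_km = [1.0, 2.0, 3.0, 4.0, 5.0]
fitness = [9, 7, 6, 4, 3]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    intercept = my - slope * mx
    return slope, intercept


slope, intercept = fit_line(distance_km, fitness)


def predict(distance):
    """Predicted fitness level for a given distance, using the fitted line."""
    return intercept + slope * distance


predicted = predict(2.5)
print(f"predicted fitness at 2.5 km: {predicted:.2f}")
```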

So far, we can see that none of the questions asked gives an exact answer on whether a shorter distance to a recreational area has a positive influence on the fitness level and how this effect comes about. A question that aims to answer whether and how a shorter distance leads to a higher fitness level is called a mechanistic or, more commonly, a prescriptive question.

Finally, it is important to mention that the questions you can ask are determined by the available data, and that some questions build on the answers to others, so you will always have to ask several questions when analysing data. In the best case, a question such as a causal one can be answered directly, for example by examining the effects of people relocating. If this information is not available, you should use an inferential question to get as close to an answer as possible.

What is your opinion on this framework? Would you keep it in mind for your next project?
