Know thy data: Data size can matter, but domain knowledge matters more

If you are a game or product developer, you are probably familiar with the term “analytics”. In game analytics (GA), various player metrics are recorded and then described, aggregated, or otherwise manipulated to help better understand players’ current behaviors and predict their future actions. Quantitative data can be amassed in a variety of ways, including more passive channels (e.g., telemetry) and more active ones (e.g., in-lab playtests and surveys). This makes GA useful across a number of game development areas, including game design and business intelligence.

Seeing how useful data can be, you might be tempted to gather all the data you can and do anything imaginable to them. After all, data-driven decisions seem to be all the rage these days. However, the usefulness of your data, whatever their size, is limited by the creativity and expertise of the person handling them.

Understandably, this notion can be hard to grasp. After all, if I have more data, can I not learn more things? If I throw all the predictors into the model, will I not have the best chance of creating the perfect model? And if I make a graph and it shows differences, such as differences in churn between levels, does that not mean that some levels are harder to get through than others? While many may instinctively be tempted to answer these questions in the affirmative, the real answer is far more complicated and depends on several factors.

The reason we have this misconception that more data are better, and that common sense, or maybe instructional online videos, is enough to process them effectively, is that these beliefs are hard to falsify. Analytics is a field in which you need to understand the importance of your choices before performing your analyses, because meaningless results do not always appear as such, certainly not in the short term. This is in contrast to fields such as programming.

If I want to program a game beyond my skill set, in which shooting a spaceship results in my hero gaining 15 experience points (XP), my shortcomings will be painfully obvious early on. That is, even if I can program my hero and the spaceships moving around, and even if I can program a way to shoot at them, if nothing happens when my shots hit a spaceship, there will be no doubt that I do not know what I am doing.

Other domains, such as analytics, are not so obvious. For example, take the concept of a mean, also known as an average. The mean of the numbers 1, 1, 3, 8, 6, 9, and 4 is 4.57 (and change). These numbers clearly make sense to average if they are difficulty scores reported by playtesters, or counts of how many times playtesters died during the first level.
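That arithmetic is easy to check; a quick sketch in Python using the numbers above:

```python
# Difficulty scores from the example above.
scores = [1, 1, 3, 8, 6, 9, 4]

# The mean: the sum of the values divided by how many there are.
mean = sum(scores) / len(scores)
print(round(mean, 2))  # 4.57 ("and change")
```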

However, what if these numbers represent something else, such as the times at which players were tested? This would mean that, on average, players were tested at 4.57 (say, p.m.). But what does that mean, and how is it useful? Averaging the time of testing, in this case, is pointless. Even worse would be attempting to average data that cannot be averaged at all.

Categorical data are data that fall into discrete groups, such as educational level, shirt size, or game genre. These data cannot be averaged, even if you represent them as numbers. For example, imagine that you want to understand whether your players’ game genre preferences predict how much they like your first-person shooter (FPS) game, and that in a survey players report liking puzzle games (coded as 1), simulations (coded as 2), or sports games (coded as 3). You cannot average these codes and then try to determine how they predict how much someone will like your FPS game.

To analyze these data, you would need to fit an analysis of variance (ANOVA) model, which tests whether discrete categories predict differences in a continuous outcome. Conducting the test the wrong way would produce results that would likely drive you down the wrong path. The difference between categorical and continuous data is relatively simple, and one taught early on in undergraduate statistics courses.
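To make the distinction concrete, here is a small Python sketch with invented genre codes and enjoyment ratings (hypothetical data, purely for illustration): the “average genre” is a meaningless number, while the mean enjoyment within each genre group, which is what an ANOVA compares, is interpretable.

```python
# Hypothetical survey data: each player's preferred genre (coded as a
# number) and how much they enjoyed our FPS game on a 1-10 scale.
# The codes 1/2/3 are arbitrary labels, not quantities.
genre = [1, 1, 2, 2, 3, 3]          # 1=puzzle, 2=simulation, 3=sports
enjoyment = [4, 5, 6, 7, 9, 8]

# Meaningless: the "average genre" of 2.0 describes no player at all.
print(sum(genre) / len(genre))       # 2.0

# Meaningful: mean enjoyment within each genre group.
groups = {}
for g, e in zip(genre, enjoyment):
    groups.setdefault(g, []).append(e)
for g, vals in sorted(groups.items()):
    print(g, sum(vals) / len(vals))  # 1 -> 4.5, 2 -> 6.5, 3 -> 8.5
```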

Other topics, seemingly only slightly more complicated, can pose much larger problems. One such example is regression. A simple regression builds a predictive model of the linear relationship between two variables, such as the time it takes to complete onboarding and how likely a player is to churn (quit the game). This seems fairly simple and straightforward.

If you have a program that allows you to point and click to run a regression, you could easily conduct one. Even if you were using software such as R, you could fit the model with a single line of code and the “lm” command, and you could likely learn to do it fairly easily by searching the internet. However, if your outcome is categorical (in this case, whether a player churned or did not), you cannot run an ordinary regression; you need to run a logistic regression.
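The difference is that a logistic regression passes the linear predictor through a sigmoid, so its predictions are probabilities between 0 and 1. Below is a minimal pure-Python sketch, fit by plain gradient descent on invented onboarding-time and churn numbers (in practice you would use R’s glm, or an equivalent library function, rather than hand-rolling this):

```python
import math

# Invented toy data: minutes spent in onboarding, and whether the
# player churned (1) or stayed (0).
minutes = [2, 3, 4, 5, 8, 10, 12, 15]
churned = [0, 0, 0, 0, 1, 1, 1, 1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Fit intercept b0 and slope b1 by gradient descent on the log-loss.
b0 = b1 = 0.0
lr = 0.05
for _ in range(10_000):
    g0 = g1 = 0.0
    for x, y in zip(minutes, churned):
        err = sigmoid(b0 + b1 * x) - y  # prediction error
        g0 += err
        g1 += err * x
    b0 -= lr * g0 / len(minutes)
    b1 -= lr * g1 / len(minutes)

# Unlike an ordinary linear regression, these predictions can never
# fall below 0 or above 1.
print(round(sigmoid(b0 + b1 * 3), 2))   # low churn probability
print(round(sigmoid(b0 + b1 * 13), 2))  # high churn probability
```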

Another example is when you enter players’ scores as a predictor but have more than a single score per player. Such a model should likely be fit as a hierarchical linear model, not an ordinary regression. Fitting it as an ordinary regression will likely violate at least one of the regression’s assumptions, rendering the resulting p-value untrustworthy. Many people know that a low p-value is desirable, and that p < .05 is often treated as the gold standard. However, it is often unclear to them that if the assumptions are violated, the resulting p-value should not be trusted.

Imagine an example in which a developer wants to understand the importance of social play for performance. In this game, players can choose to play alone, with strangers, or with people they know. The developer collects information on how many known others each player plays with on the first day they play the game, 30 days later, 60 days later, and a year later, as well as their performance in each raid. Imagine the developer enters this information into an ordinary regression and receives a p-value of .01.

A p-value is the probability of obtaining results at least as extreme as ours if the null hypothesis (often the hypothesis opposite to the one we constructed) is true. That is, if the hypothesis is that playing with more friends increases performance, a p of .01 means that even if the number of friendly raiders does not predict performance, we would still get a result saying that it does one time out of 100. Put another way, if we ran this study 100 times in a world where the number of friends really does not predict performance, only one of those runs would produce a result saying that it does. Therefore, it seems pretty unlikely that the null hypothesis is true.

This means that while it is possible that the relationship that we found between friends and performance is not a real one, it is pretty unlikely that we would have found it if that were the case. However, note that since there is still a 1 in 100 chance that we would get a relationship by chance, it is possible that this is that single instance, and that there is no real relationship between the two. Statistics is about probability—not absolute certainty. Therefore, although it is pretty unlikely that the developer’s hypothesis is wrong, the only thing we can say with certainty is that it is unlikely that we would have gotten this result if there really was no relationship between the two; this would happen only 1% of the time.
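This “how often would chance alone produce this result” logic can be simulated directly. The sketch below is hypothetical and idealized: it generates many experiments in which the null hypothesis is true by construction (the test statistic is pure noise) and counts how often a test at alpha = .01 nonetheless declares significance.

```python
import random

random.seed(0)  # for reproducibility

# Simulate a world where the null hypothesis is true: each experiment's
# test statistic is pure noise, z ~ N(0, 1).
# A two-sided test at alpha = .01 rejects when |z| > 2.576.
experiments = 100_000
false_positives = sum(
    abs(random.gauss(0, 1)) > 2.576 for _ in range(experiments)
)
rate = false_positives / experiments
print(rate)  # roughly 0.01: about one "significant" result per 100 runs
```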

However, if you violate the regression assumptions, you cannot trust the p-value. There are different ways to correct the violations—depending on which of the assumptions is violated—and it is important to do so. My experience has been that sometimes an insignificant p-value becomes significant when the violations are corrected, and other times significant p-values become insignificant. However, we do not use analytics to prove our hypotheses; we use analytics to learn, so that we can create the best product possible.

Similar issues occur in data science as well. For example, if you create an algorithm to predict which ad increases conversion but you choose the wrong metric, you will get a skewed view of what is happening. When your classes are imbalanced (that is, when one group has far more people in it than the other), some metrics may give you a problematic understanding of the situation.

Take the accuracy metric, for example. If only 5% of your players convert, your model may report 90+% accuracy even if it identifies converting players almost 0% of the time, because it accurately predicts the non-converting ones almost 100% of the time; this, as you might imagine, is not very helpful. To overcome this, you could oversample the minority class, undersample the majority one, choose a better metric, or use a more appropriate algorithm (such as a penalized SVM). Knowing how and when to do each of these things requires domain knowledge and skill.
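Here is a minimal sketch of that accuracy trap, with an invented population in which 5% of players convert: a “model” that never predicts conversion scores 95% accuracy while catching zero converters.

```python
# Hypothetical labels: 50 of 1,000 players convert (1), the rest do not (0).
actual = [1] * 50 + [0] * 950

# A useless model that predicts "no conversion" for every player.
predicted = [0] * len(actual)

# Accuracy looks great because the majority class dominates.
accuracy = sum(p == a for p, a in zip(predicted, actual)) / len(actual)

# Recall on the converting (minority) class tells the real story.
true_positives = sum(p == 1 and a == 1 for p, a in zip(predicted, actual))
recall = true_positives / sum(actual)

print(accuracy)  # 0.95
print(recall)    # 0.0
```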

Investing in an analyst, either as a permanent team member or as a consultant, is important. Sometimes smaller developers or studios cannot afford a full-time analyst, but there are always other options, such as contractors who have experience with various forms of analytics (like me!). If you have poured blood, sweat, and money into your product, quantitative analysis can help ensure you are spending your resources down the right paths.

With data, bigger is not always better, and what you do with your data is more important than how much of it you have amassed. Analytics can be fun and exciting (and, if you ask me, a great topic for jokes) and is an important tool in development. However, it is important to know what questions to ask about your data, and it is even more important to know whether you are testing those questions correctly. Never drive recklessly with data.

If you liked my article, or found it useful, please “like it” or share it, so that others may find it as well.

