
Video Game Sales Analysis

6 September, 2020



  1. Introduction

Video games have become an irreplaceable part of global culture. According to Microsoft, there are more than 2.5 billion gamers around the world, all part of a rapidly growing global community. The video game market is large and was expected to be worth over 90 billion U.S. dollars by 2020 (BestTheNews, 2016). This relatively new industry is expected to reach 33.6 billion in revenue by the end of 2020 (Wepc, 2020).

Despite its huge business value, the game industry is still developing and not yet saturated, leaving room for further development and new entrants. The game market is currently monopolistically competitive: mainstream companies such as Nintendo, Sony and Microsoft lead the industry, while small studios and individual developers also contribute to it.

Game development can be a long process that requires considerable effort and often money. Whether a game achieves high rating scores and generates profit directly influences the developer's reputation and, in turn, the developer's later path.

The purpose of this analysis is to explore the video game market, identify its trends, build a predictive model to forecast critic scores, and give some recommendations. We use video game sales data up to 2016 across different markets, grouped by game name. Each game has descriptive details, sales figures and rating scores.


  2. Problem Description and Research Questions

Problem Description:

In this report, we analyze an existing video game dataset to get a view of the video game market structure and to predict its trends. A video game is characterized by many features, from its genre to its platform, and each feature has its own target group.

We assume that the features of a video game influence its outcome after release. The problem is how to gain insight into a game and its future performance in a large market. Based on this motivation and assumption, this report intends to build predictive models from the features of video games to predict the likely ratings and sales of a given game, as a basic guide for video game developers.

Research Questions: 

  • Forecast the score of a game based on its characteristics.

  • Forecast the overall trend of sales.

  • How can video games be divided by their characteristics?

  • What’s the trend of the video game market? 


  3. Data

Raw Data Description

There are 9,100 records in our raw dataset, but only 75% of them are valid, which leaves 6,825 usable records. The dataset has 16 columns. Three of them have too many categories, so we excluded them as model predictors. Of the remaining 13 columns, 3 are categorical and 10 are continuous variables (see Appendix A.1).

Our data can be divided into three parts. The first is external video game features, such as platform, developer and genre. The second is video game sales, reported by region, such as Japan, the US and the EU. The third is video game ratings: the critic and user scores, and the number of people who took part in the scoring. Unfortunately, we have no internal video game information, such as sound, graphics, background, or winning and losing mechanics.

Data Understanding

The report intended to predict both sales and critic scores. However, after reviewing the whole dataset, we found that the sales field is not appropriate as a direct prediction target.

Sales cannot be predicted properly without considering unit price and sales quantity. The sales field in the dataset is the monetary total for each game; because price may influence the sales amount more than sales quantity does, this field cannot directly reflect sales volume.

In this case, we ran a linear regression model on the sales field first and determined that the critic score had a strong positive correlation with the sales field. Given the considerations above, we decided to build models only to predict the critic score, as the critic score can also be used to indicate the overall trend of sales amount.

Data Cleaning

To clean the data, the report first ran a descriptive analysis on the raw dataset to get a general view. The dataset showed a few outliers in sales, which the report considered acceptable values. The raw dataset had missing values in several fields, such as game name and developer; records whose sales values appeared as TBD were also treated as missing. Given the adequate sample size, the report filtered out the records with missing values. As a result, 75% of the data is used in the subsequent analysis.
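The filtering step described above can be sketched in plain Python. This is only an illustration under stated assumptions: the sample records and the helper names `is_missing` and `drop_incomplete` are hypothetical, and the original analysis may have used a different tool.

```python
from typing import Optional

# Hypothetical sample records; field names follow the data dictionary in Appendix A.1.
raw_records = [
    {"Name": "Game A", "Developer": "Studio X", "Global_Sales": "1.25", "Critic_Score": "81"},
    {"Name": "", "Developer": "Studio Y", "Global_Sales": "0.40", "Critic_Score": "65"},
    {"Name": "Game C", "Developer": None, "Global_Sales": "0.10", "Critic_Score": "55"},
    {"Name": "Game D", "Developer": "Studio Z", "Global_Sales": "TBD", "Critic_Score": "70"},
]


def is_missing(value: Optional[str]) -> bool:
    """Treat None, blank strings and the placeholder 'TBD' as missing."""
    return value is None or value.strip() == "" or value.strip().upper() == "TBD"


def drop_incomplete(records):
    """Keep only records in which every field has a usable value."""
    return [r for r in records if not any(is_missing(v) for v in r.values())]


clean_records = drop_incomplete(raw_records)
valid_share = len(clean_records) / len(raw_records)
print(f"kept {len(clean_records)} of {len(raw_records)} records ({valid_share:.0%})")
```

On the real dataset, the same share would be computed from the 9,100 raw records and the 6,825 records that remain.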

Data Transform

The report transformed certain continuous values into categorical values for supervised analysis. The target variable of this report is the critic score, a continuous variable on a scale of 0 to 100 that represents the aggregate score compiled by Metacritic staff. The report transformed critic scores into three categories modeled on Steam review ratings (Score_cat): scores below 40 are negative, scores between 40 and 70 are mixed, and scores above 70 are positive.

To apply logistic regression, we also transformed the critic score into a binary variable (score_bin): 1 indicates that the score is greater than the mean value, and 0 indicates that it is less than the mean value.
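The two calculated fields can be sketched as follows. The function names and the sample scores are illustrative only, and the treatment of boundary values (exactly 40, exactly 70, or exactly the mean) is an assumption, since the report does not state it.

```python
from statistics import mean


def score_cat(critic_score: float) -> str:
    """Three-level category: below 40 negative, 40 to 70 mixed, above 70 positive.
    Assumption: 40 and 70 themselves fall in 'mixed'."""
    if critic_score < 40:
        return "negative"
    if critic_score <= 70:
        return "mixed"
    return "positive"


def score_bin(critic_score: float, mean_score: float) -> int:
    """Binary target: 1 if the score is above the mean, otherwise 0.
    Assumption: a score equal to the mean is coded 0."""
    return 1 if critic_score > mean_score else 0


# Illustrative scores, not taken from the dataset.
scores = [35, 40, 70, 71, 90]
mean_score = mean(scores)

categories = [score_cat(s) for s in scores]
binary = [score_bin(s, mean_score) for s in scores]
print(categories)
print(binary)
```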


  4. Analysis

  • Unsupervised Learning

  • Clustering (See Appendix B.1)

We built a cluster model to find out which kinds of video games group together. The model has three clusters with a cluster quality of 0.5. User_score, critic_count and critic_score are the three most important predictors.

Looking at each cluster separately: cluster 1, the smallest cluster with 23.8% of the data, contains the video games with the most comments and the highest mean critic score, 81.06. Cluster 2, the second largest cluster with 24.4% of the data, contains the video games with the fewest comments and the lowest mean critic score, 53.84. Cluster 3 is the largest cluster with 51.8% of the data; the video games in this cluster fall in the middle on both comments and critic score.

  • Supervised Learning for Categorical Target

For supervised learning with a categorical target, we tried two models: logistic regression and a classification tree.

  • Logistic Regression (See Appendix B.4)

Calculated field: Score_bin

To build the logistic regression, we transformed critic_score into a binary variable and set platform, genre, ratings, etc. as predictors. We used the “enter” method and set the base category to 0. The resulting model has a McFadden R-Square of 0.327 and a final model fitting criterion of 6384.5, which indicates a well-fitting model. In this model, user_score is the most important predictor.

Using this model to predict the binary critic score, we classify 78.19% of records correctly in the training data and 78.53% in the testing data.
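For readers unfamiliar with the McFadden R-Square quoted above, the statistic compares the log-likelihood of the fitted model with that of an intercept-only model. The sketch below uses made-up log-likelihood values chosen only to reproduce a value of 0.327; they are not the report's actual model output.

```python
def mcfadden_r2(ll_model: float, ll_null: float) -> float:
    """McFadden pseudo R-squared: 1 - LL(fitted model) / LL(intercept-only model)."""
    if ll_null == 0:
        raise ValueError("null log-likelihood must be non-zero")
    return 1.0 - ll_model / ll_null


# Hypothetical log-likelihoods, for illustration only.
ll_null = -100.0   # intercept-only model
ll_model = -67.3   # model with predictors

r2 = mcfadden_r2(ll_model, ll_null)
print(round(r2, 3))
```

A value of 0 means the predictors add nothing over the intercept-only model, and values closer to 1 mean a larger improvement in log-likelihood.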

  • Classification Tree (See Appendix B.5)

Calculated field: Score_cat

To build the classification tree, we transformed the critic score into a nominal variable with three categories. The first split is on user_score: video games with a user_score below 7.15 are classified as mixed, and those with a user_score above 7.15 are classified as positive. The second split is on user_count, followed by genre and platform.

Using this model to predict the critic score category, we classify 72.53% of records correctly in the training data and 70.39% in the testing data.
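As a rough illustration, the first split of the tree and the "percent predicted correctly" measure can be written as below. This sketch covers only the first split (the deeper splits on user_count, genre and platform are omitted), the sample data is invented, and sending a user_score of exactly 7.15 to "positive" is an assumption.

```python
def first_split(user_score: float) -> str:
    """First split only: below 7.15 is 'mixed', otherwise 'positive'.
    Assumption: a score of exactly 7.15 goes to 'positive'."""
    return "mixed" if user_score < 7.15 else "positive"


def accuracy(actual, predicted) -> float:
    """Share of records whose predicted class equals the actual class."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


# Invented examples, not taken from the dataset.
user_scores = [6.0, 8.2, 7.15, 5.5]
actual = ["mixed", "positive", "positive", "negative"]

predicted = [first_split(u) for u in user_scores]
acc = accuracy(actual, predicted)
print(predicted, acc)
```

A single split can never output "negative", which is one reason the full tree continues splitting on further variables.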

  • Supervised Learning for Continuous Target

Critic score is the target variable for supervised learning with a continuous target.

  • Regression Tree (See Appendix B.2)

For supervised learning with a continuous target, we tried a regression tree and a linear regression model. In the regression tree, the first split and one of the second splits are on user score, followed by user_count, global sales, genre, etc. Using this model to predict critic scores, we get an MAE of 7.862 in the training data and 7.986 in the testing data, which is not a large difference.
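The MAE (mean absolute error) used to compare training and testing performance is the average absolute gap between actual and predicted critic scores. A minimal sketch, with invented numbers rather than the report's data:

```python
def mean_absolute_error(actual, predicted) -> float:
    """Average of |actual - predicted| over all records."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Invented critic scores and predictions, for illustration only.
actual_scores = [80, 65, 90, 50]
predicted_scores = [75, 70, 82, 58]

mae = mean_absolute_error(actual_scores, predicted_scores)
print(mae)
```

An MAE near 8 therefore means the predicted critic score is off by about 8 points on the 0 to 100 scale on average.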

  • Linear Regression (See Appendix B.3)

We also ran a linear regression model on the same target variable. In the linear regression, we removed the regional sales fields since global sales already covers them. The three most important predictors are user score, global sales and platform. The model has an Adjusted R Square of 54.1%, which indicates a good overall fit.

Prediction Equation as below:

Critic Score = 28.024
    + 5.374 (user_score)
    + 4.577 (global_sales)
    + 10.63 (platform = DC)
    - 3.763 (platform = Wii, DS)
    - 2.553 (platform = X360, PS2, 32, PSP, PSV)
    + 5.390 (platform = PC, XOne)
    - 5.296 (genre = Adventure)
    - 4.26 (genre = Action, Misc)
    - 3.361 (genre = Fighting, Platform, Puzzle, Racing, Shooter, Simulation)
    + 0.118 (critic_count)
    + 0.011 (user_count)
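To show how the equation is applied, here is a small sketch that evaluates it for one game. The coefficients are copied from the equation above; the function name and the example inputs are hypothetical, the platform label "32" is reproduced exactly as printed, and treating any platform or genre not listed as the baseline (effect 0) is an assumption.

```python
INTERCEPT = 28.024

# Platform effects, keyed by the labels printed in the equation.
PLATFORM_EFFECT = {
    "DC": 10.63,
    "Wii": -3.763, "DS": -3.763,
    "X360": -2.553, "PS2": -2.553, "32": -2.553, "PSP": -2.553, "PSV": -2.553,
    "PC": 5.390, "XOne": 5.390,
}

# Genre effects, matched case-insensitively.
GENRE_EFFECT = {
    "adventure": -5.296,
    "action": -4.26, "misc": -4.26,
    "fighting": -3.361, "platform": -3.361, "puzzle": -3.361,
    "racing": -3.361, "shooter": -3.361, "simulation": -3.361,
}


def predict_critic_score(user_score, global_sales, platform, genre,
                         critic_count, user_count):
    """Evaluate the linear equation; unlisted platforms and genres contribute 0."""
    return (INTERCEPT
            + 5.374 * user_score
            + 4.577 * global_sales
            + PLATFORM_EFFECT.get(platform, 0.0)
            + GENRE_EFFECT.get(genre.lower(), 0.0)
            + 0.118 * critic_count
            + 0.011 * user_count)


# Hypothetical game: a PC adventure title with a user score of 8.0.
example = predict_critic_score(user_score=8.0, global_sales=1.0, platform="PC",
                               genre="Adventure", critic_count=20, user_count=100)
print(round(example, 3))
```

Because the model is linear, a prediction is not bounded to the 0 to 100 scale, so extreme inputs can produce values outside it.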

Then, we checked three assumptions of our linear regression model. The first is residual normality: based on the graph, the residuals are approximately normally distributed. Next, we checked the assumption of constant residual variance: the plot shows that most residuals lie around 0. Last, we checked the assumption of residual independence: over the first 5 lags the residuals are only weakly correlated, which means there is no trend in the residuals and they can be treated as independent.
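The independence check over the first few lags can be illustrated with a simple sample autocorrelation. This is a sketch of one common estimator, not necessarily the one used by the original software, and the residual series is invented to show what strong (rather than weak) correlation looks like.

```python
def autocorrelation(residuals, lag: int) -> float:
    """Sample autocorrelation of a series at the given lag."""
    n = len(residuals)
    if not 0 <= lag < n:
        raise ValueError("lag must be between 0 and len(residuals) - 1")
    m = sum(residuals) / n
    denom = sum((r - m) ** 2 for r in residuals)
    if denom == 0:
        raise ValueError("series has zero variance")
    num = sum((residuals[t] - m) * (residuals[t - lag] - m) for t in range(lag, n))
    return num / denom


# Invented, strongly alternating residuals: a pattern that would violate independence.
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
for lag in range(1, 4):
    print(lag, round(autocorrelation(alternating, lag), 3))
```

Values close to 0 at every lag, as the report describes for its first 5 lags, are what one would expect from independent residuals.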



  5. Conclusions

  • Of the models above, for the continuous target (critic_score), the linear model gives the better prediction, with an MAE of 7.321 and an R-square of 54.1%. For the categorical target (Score_cat), the classification tree predicted 72.53% correctly. The logistic regression model has a McFadden R-Square of 0.327, which partially indicates a good fit, and it performs better for the categorical target. The cluster model has 3 clusters and a quality of 0.4.

  • From our models, aside from user score, which is the rating given by ordinary users, the most important predictors of critic score are global sales, platform and genre. Although user scores are not always consistent with critic scores, the two ratings remain highly correlated overall for video games. For global sales, as noted in Data Understanding, there is a correlation between sales and ratings.

  6. Recommendation

  • We recommend that game developers choose the most mainstream platforms for their games. We found that the Dreamcast platform has the most positive impact on critic score, probably because of the high quality of the games on that platform. Although the Dreamcast did not last long as a platform, it was one of the most popular consoles in 1999 and accounted for 31% of the game market in North America. Other popular mainstream platforms also appeared as important predictors in the model equation.

  • We recommend that game developers broaden their target groups to include different game genres. According to our analysis, certain genres have different target groups, while some genres have the same level of influence on our target variable. According to our models, genres such as fighting, racing, shooter and platform tend to have the same level of influence, which may reflect overlapping market segments and target groups. By broadening the target market to include players from all genres, a game developer would be better able to promote its games.





References

Wepc, “Video Gaming Industry Overview”: https://www.wepc.com/news/video-game-statistics/#video-gaming-industry-overview

Business Insider, “The $120 billion gaming industry is going through more change than it ever has before, and everyone is trying to cash in”: https://www.businessinsider.com/video-game-industry-120-billion-future-innovation-2019-9

“Video Game Ratings Guide”: https://www.esrb.org/ratings-guide/

“McFadden R Square for Logistic Regression”: https://statisticalhorizons.com/r2logistic















Appendix

A Data Description

A.1 Data Dictionary


| Column Name | Data Type | Description |
|---|---|---|
| Name | String | Name of the game |
| Platform | String | Console on which the game is running |
| Genre | String | Game's category |
| Publisher | String | Publisher |
| NA_Sales | Decimal | Game sales in North America (in millions of units) |
| EU_Sales | Decimal | Game sales in the European Union (in millions of units) |
| JP_Sales | Decimal | Game sales in Japan (in millions of units) |
| Other_Sales | Decimal | Game sales in the rest of the world, i.e. Africa, Asia excluding Japan, Australia, Europe excluding the E.U. and South America (in millions of units) |
| Global_Sales | Decimal | Total sales in the world (in millions of units) |
| Critic_Score | Integer | Aggregate score compiled by Metacritic staff |
| Critic_Count | Integer | The number of critics used in coming up with the Critic_Score |
| User_Score | Decimal | Score by Metacritic's subscribers |
| User_Count | Integer | Number of users who gave the User_Score |
| Developer | String | Party responsible for creating the game |
| Rating | String | The ESRB rating (e.g. Everyone, Teen, Adults Only) |


A.2 Raw Data Summary

A.3 Raw Data Audit Output

A.4 Cleaning Data Audit Output

A.5 Calculated Field

Critic scores transformed by mean value

Critic scores transformed by rating levels



B Model Summary and Output

B.1 Clustering Model

B.2 Regression Tree

B.3 Linear Regression


B.4 Logistic Regression 

B.5 Classification Tree


A group project created for the INFO 4300 course at the University of Denver.

classification

video games

predictive analysis


Rui

Denver, CO, USA

MSBA Candidate | Data Analyst
