Web Behavior Analysis

6 September, 2020

Overview:

Nowadays, an increasing number of people prefer shopping online to shopping in physical stores, believing it makes life easier. Amazon is one of the largest e-commerce platforms where people shop in real time. However, we need to understand which kinds of customers purchase on which platforms, such as Amazon, eBay, or Walmart. For example, a college student may prefer to shop on Amazon because most colleges offer Amazon Prime and students can buy cheaper products there.


In our project, we analyze customer web behavior to determine which variables are related to the amount of money that customers spend when shopping online.


Data Content:

comScore Web Behavior Database 

(The comScore Web Behavior Database captures detailed browsing and buying behavior by 100,000 Internet users across the United States at the domain level. The panel is based on a random sample from a cross section of more than 2 million global Internet users who have given comScore explicit permission to confidentially capture their Web-wide activity.)

https://www.dropbox.com/sh/rs1mhg5z7lbxgv4/AABHD_wmykaJ_RkZ6dz6wX2sa?dl=0


There are three tables in the Web Behavior Database: Demographic, Transaction, and Sessions.


  • Demographic: individual information for each machine ID, such as age group, education, region, racial background, and other demographic attributes.


  • Transaction: product information, such as product category, price, quantity, and purchasing history.


  • Sessions: what each machine browsed, how long each session lasted, and how many pages were viewed.


Data Cleaning:

We applied a left join across these three tables, matching on two keys, “machine id” and “site session id”, because each browsing session has to match the transaction of each product and individual. This left us with about 3791 rows. We also removed every record whose product name is missing. Even when the product category is known, replacing a missing name with “Unknown” does not help our interpretation.
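The join and filter described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the authors' code is in R, and the column names (`prod_name`, `basket_tot`, `hoh_most_education`) and toy rows below are hypothetical, not taken from the comScore files.

```python
def left_join(left_rows, right_rows, keys):
    """Left-join two lists of dicts on the given key columns."""
    index = {}
    for row in right_rows:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    joined = []
    for row in left_rows:
        matches = index.get(tuple(row[k] for k in keys))
        if matches:
            for match in matches:
                joined.append({**row, **match})
        else:
            joined.append(dict(row))  # unmatched left rows are kept as-is
    return joined


# Hypothetical toy rows standing in for the three tables.
transactions = [
    {"machine_id": 1, "site_session_id": 10, "prod_name": "Vitamin C", "basket_tot": 25.0},
    {"machine_id": 2, "site_session_id": 20, "prod_name": None, "basket_tot": 12.0},
]
sessions = [
    {"machine_id": 1, "site_session_id": 10, "duration": 30, "pages_viewed": 60},
]
demographics = [
    {"machine_id": 1, "hoh_most_education": 3},
    {"machine_id": 2, "hoh_most_education": 4},
]

# Sessions match on both keys; demographics only carry the machine ID.
df = left_join(transactions, sessions, ["machine_id", "site_session_id"])
df = left_join(df, demographics, ["machine_id"])

# Drop rows whose product name is missing instead of filling in "Unknown".
df = [row for row in df if row.get("prod_name")]
```

With the toy data, only the first transaction survives, carrying its session and demographic columns along with it.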


Data Visualization:

We built the visualizations from the “df” dataset to study (i) how the level of education affects the basket total price, our dependent variable; (ii) which product category has the largest number of transactions and the largest total basket price; (iii) the distributions of duration, pages viewed, product total price, and basket total price; and (iv) how duration and pages viewed relate to each other and how they change by month. For level of education (Figure 1A), people with an associate degree make up the largest share of the dataset, which means that people who graduated from community college account for the largest number of transactions and show higher demand for online shopping. People who are currently studying in college also show high demand, which would affect the basket total price (Figure 1A).


For the number of transactions by product category, category 13 (food and beverage) ranks far above every other category, followed by health products (Figure 1B); this is unsurprising, since food and water are necessities for everyone. We also plotted the total basket price by product category, where category 16 (Health) has the highest total. That led us to plot the total basket price by domain name as well (Figure 1C). Amway ranks first in that bar plot, which is consistent with health products being in high demand for online purchase (Figure 1D). Overall, if we consider only the number of transactions (Figure 1E), Amazon has the highest count.


To examine the distributions of the four variables mentioned above, we created three histograms and one density plot. All four are highly right-skewed, so we created Q-Q plots to decide whether to apply a log transformation (Figure 2A).


Because we want to show that web behavior affects the basket total price, we also created a scatter plot of the two key web behavior variables, duration and pages viewed (Figure 1F). Points are colored by level of education and sized by duration. Most points belong to the associate degree group, with duration between 10 and 100 and pages viewed between 50 and 150. The plot shows a positive linear relationship between the two variables, and we hypothesize that both have a strong effect on the basket total.

In addition, we looked at total pages viewed and total duration by month (Figures 1G and 1H). People view fewer pages and spend less time online in April, while duration and pages viewed increase from October to December, when many people buy discounted products on Black Friday or prepare for Thanksgiving and Christmas.


PSM

Before fitting our models, we conducted a propensity score matching (PSM) analysis. PSM is a powerful method for reducing the effects of confounding in observational studies: “In the statistical analysis of observational data, propensity score matching (PSM) is a statistical matching technique that attempts to estimate the effect of a treatment, policy, or other intervention by accounting for the covariates that predict receiving the treatment.” In our study, PSM helps reduce the bias due to confounding variables. People typically believe that a customer who spends more time on a shopping website and browses more pages will purchase more products. To validate this belief and reduce bias around the variables “duration” and “page_viewed”, we used PSM to match data points into a treatment group and a control group. We want to understand whether “duration” and “page_viewed” affect our dependent variable “basket_total” when other variables are controlled.
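To make the matching step concrete, here is a minimal Python sketch of greedy 1:1 nearest-neighbor matching on propensity scores. It assumes the scores have already been estimated (for example, by a logistic regression of the treatment indicator on the covariates). The function name, the caliper value, and the toy scores are all hypothetical, and the authors' R implementation may use a different matching rule.

```python
def match_nearest(treated, control, caliper=0.05):
    """Greedy 1:1 nearest-neighbour matching on propensity scores, without replacement.

    treated, control: dicts mapping unit id -> estimated propensity score.
    Returns a list of (treated_id, control_id) pairs.
    """
    available = dict(control)
    pairs = []
    # Visit treated units in order of increasing score.
    for t_id, t_score in sorted(treated.items(), key=lambda kv: kv[1]):
        if not available:
            break
        # Closest remaining control unit by absolute score distance.
        c_id = min(available, key=lambda cid: abs(available[cid] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]  # each control unit is used at most once
    return pairs


# Hypothetical scores: "treated" = above-average duration, "control" = below.
treated_scores = {"t1": 0.62, "t2": 0.35, "t3": 0.90}
control_scores = {"c1": 0.60, "c2": 0.33, "c3": 0.10}
pairs = match_nearest(treated_scores, control_scores)
```

Here `t3` stays unmatched because no control unit lies within the caliper; dropping such units is what makes the matched groups comparable on their covariates.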


Before modelling, we wanted to find whether there is any difference in basket total between groups with lower duration and higher duration, and groups with lower pages viewed and higher pages viewed. 


For duration, we first split the data into two groups: group 0 (duration < average) and group 1 (duration > average). We applied a t-test to compare their mean basket totals. The average basket total is 163.3 for group 0 and 213.6 for group 1. The p-value is 4.53e-07, so the two groups differ significantly in average basket total before PSM. To limit the influence of other variables, we then performed PSM to balance the covariates of the two groups and ran the t-test again. After matching, the average basket total is 163.3 for group 0 and 225.5 for group 1. The p-value is 3.52e-09, which is also significant and smaller than before matching. The difference between the groups is also larger after controlling for other variables.


For pages viewed, we first split the data into two groups: group 0 (pages viewed < average) and group 1 (pages viewed > average). We applied a t-test to compare their mean basket totals. The average basket total is 161.1 for group 0 and 157.2 for group 1. The p-value is 2.39e-14, so the two groups differ significantly in average basket total before PSM. To limit the influence of other variables, we then performed PSM to balance the covariates of the two groups and ran the t-test again. After matching, the average basket total is 151.2 for group 0 and 257.2 for group 1. The p-value is 2.2e-12, which is also significant. The difference between the groups is also larger after controlling for other variables.
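The split-and-compare step can be sketched as follows. This Python illustration computes Welch's t statistic, which is the default variant of R's `t.test`; whether the authors kept that default is an assumption. The rows are made up, and the p-value is omitted because the standard library has no t-distribution CDF.

```python
import math
from statistics import mean, variance


def split_by_mean(rows, key):
    """Split rows into (below-average, above-average) groups on one column."""
    cutoff = mean(r[key] for r in rows)
    low = [r for r in rows if r[key] < cutoff]
    high = [r for r in rows if r[key] > cutoff]
    return low, high


def welch_t(group0, group1):
    """Return (t statistic, approximate degrees of freedom) for Welch's t-test."""
    n0, n1 = len(group0), len(group1)
    se0, se1 = variance(group0) / n0, variance(group1) / n1
    t = (mean(group1) - mean(group0)) / math.sqrt(se0 + se1)
    dof = (se0 + se1) ** 2 / (se0 ** 2 / (n0 - 1) + se1 ** 2 / (n1 - 1))
    return t, dof


# Hypothetical rows; real data would come from the cleaned "df" table.
rows = [
    {"duration": 5, "basket_total": 100.0},
    {"duration": 8, "basket_total": 120.0},
    {"duration": 10, "basket_total": 140.0},
    {"duration": 40, "basket_total": 200.0},
    {"duration": 50, "basket_total": 220.0},
    {"duration": 60, "basket_total": 240.0},
]
low, high = split_by_mean(rows, "duration")
t_stat, dof = welch_t(
    [r["basket_total"] for r in low],
    [r["basket_total"] for r in high],
)
```

A positive `t_stat` means group 1 (above-average duration) has the higher mean basket total; the same two functions apply unchanged to the matched sample after PSM.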


In conclusion, the PSM results show a significant and sizable difference in basket total between the groups. The group with higher duration has an average basket total about $62 higher than the group with lower duration, and the group with more pages viewed has an average basket total about $106 higher than the group with fewer pages viewed. This is consistent with our assumption. To find out whether these two variables have an impact and how large it is, we apply regression analysis next.



Modeling:

Before modeling, we need to decide whether to take logs of three variables (duration, pages viewed, and product total price). Basket total is our dependent variable, so we left it untransformed. We drew Q-Q plots of each variable before and after the log transformation. Before the transformation the points stray from the blue reference line; after it they mostly follow the line, so we decided to log-transform these variables (Figure 2A).
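A numeric counterpart to the Q-Q comparison is to check skewness before and after taking logs. The Python sketch below uses made-up values and a moment-based skewness formula. It is not the authors' procedure, and real data containing zeros would need separate handling (for example `log1p`) before the transformation.

```python
import math


def skewness(values):
    """Moment-based skewness: about 0 for symmetric data, > 0 for a long right tail."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed durations spanning several orders of magnitude.
durations = [1, 10, 100, 1000, 10000]
logged = [math.log(v) for v in durations]

raw_skew = skewness(durations)  # clearly positive: long right tail
log_skew = skewness(logged)     # close to zero: roughly symmetric after the log
```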


We also have to check whether duration and pages viewed are correlated; if they are, we need to enter the two variables in separate models. The correlation between duration and pages viewed turned out to be high, at 0.75 (Figure 2B).
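For reference, the correlation check amounts to computing a Pearson coefficient. The Python sketch below uses toy numbers that do not reproduce the reported 0.75; it only shows the calculation.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical session values.
duration = [10, 20, 30, 40, 50]
pages_viewed = [55, 70, 65, 90, 110]
r = pearson_r(duration, pages_viewed)
```

When two predictors are this strongly correlated, fitting them in the same regression makes their individual coefficients unstable, which is the usual reason for entering them in separate models.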



Regression Analysis

For modeling, we used “basket total” as the dependent variable and applied four regression models: (1) Linear Regression, (2) Logistic Regression, (3) Random Effect Model, and (4) Fixed Effect Model.


Linear Regression:

Because of the high correlation between duration and pages viewed noted above, we enter them in separate regressions here and in all the remaining models. Duration is highly significant with a positive coefficient on basket total price: a 1% increase in logged duration is associated with a $14 increase in basket total price. Pages viewed is also significant and positive: a 1% increase in logged pages viewed is associated with a $25.7 increase in basket price. In short, spending longer on a shopping website or visiting many pages about a product raises the probability of purchasing a higher-priced product, especially a health product, since people usually deliberate several times before buying a health care product.
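As a general reference for reading coefficients on logged predictors, the Python sketch below fits a one-variable regression of a dollar outcome on log(duration) using synthetic data. The slope of 20 and intercept of 50 are arbitrary choices for the illustration, and how the coefficients in the authors' R output were scaled is not stated here.

```python
import math


def ols_slope_intercept(xs, ys):
    """Closed-form simple linear regression of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx


# Synthetic data generated exactly from basket_total = 50 + 20 * ln(duration).
duration = [5, 10, 20, 40, 80]
basket_total = [50 + 20 * math.log(d) for d in duration]
log_duration = [math.log(d) for d in duration]

slope, intercept = ols_slope_intercept(log_duration, basket_total)

# In a level-log model, a 1% rise in duration shifts the outcome by
# slope * ln(1.01), which is roughly slope / 100.
change_for_one_percent = slope * math.log(1.01)
```

In this kind of specification the slope itself is the change in the outcome for a one-unit increase in the logged predictor, while a 1% increase in the raw predictor moves the outcome by about one hundredth of the slope.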


Logistic Regression:

We also ran a logistic regression, which required converting “basket total price” into a 0/1 dummy: 0 if the price is below the average basket total price, and 1 if it is above. We built this model to find out how duration and pages viewed affect whether a purchase is above or below the average price. In the output, the p-value of duration is 0.005, significant at the two-star level, and the p-value of “pages viewed” is 0.0001, significant at the three-star level. Both are significant and positive for the dummy “basket total price”, and pages viewed has the stronger effect of the two.

Statistically, a 1% increase in duration raises the odds that the dummy “basket total” equals 1 (a purchase above the average) rather than 0 by 0.02%. For pages viewed, the corresponding increase is 0.037%.
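The dummy coding and the odds interpretation can be sketched in Python as follows. The basket values and the coefficient passed to `odds_multiplier` are placeholders, not the authors' estimates; the sketch only shows how a coefficient on a logged predictor translates into a change in odds.

```python
import math
from statistics import mean


def above_mean_dummy(values):
    """Code each value as 1 if it is above the sample mean, else 0."""
    cutoff = mean(values)
    return [1 if v > cutoff else 0 for v in values]


def odds_multiplier(beta, pct_increase=1.0):
    """Factor applied to the odds when the raw predictor rises by pct_increase percent,
    given a logistic coefficient beta on the logged predictor."""
    return math.exp(beta * math.log(1 + pct_increase / 100))


# Hypothetical basket totals; the mean is 126.25, so only the last one is coded 1.
basket_total = [20.0, 35.0, 50.0, 400.0]
labels = above_mean_dummy(basket_total)

# Placeholder coefficient of 2.0: a 1% rise multiplies the odds by 1.01 ** 2.
factor = odds_multiplier(2.0)
```

A multiplier of 1.0201 corresponds to a 2.01% increase in the odds, so for small percentage changes the coefficient on a logged predictor reads approximately as an elasticity of the odds.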

 

Random & Fixed Effect Model:

Building on the linear and logistic models, we also wanted more insight into how web behavior affects purchases. To fit panel models, we first have to build panel data with an individual index and a time index; we used only “machine_id” as the individual index and ignored “time” because of duplicate issues. We again entered “duration” and “pages viewed” separately, and both have positive, highly significant effects on the dependent variable in both models. We then compared the two models with the “phtest” function, which gave p-values of 0.2414 for “duration” and 0.9998 for “pages viewed”. Because both p-values are well above 0.05, the test does not reject its null hypothesis, which by the usual reading of the Hausman test favors the random effect model rather than the fixed effect model for this customer analysis.
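To show what the fixed effect model does with “machine_id” as the individual index, here is a small Python sketch of the within estimator: each variable is demeaned inside its group before a pooled regression. The rows and the `log_duration` column name are hypothetical, and this is a one-predictor illustration, not the authors' R model.

```python
from collections import defaultdict


def within_slope(rows, group_key, x_key, y_key):
    """Fixed-effects (within) slope: demean x and y inside each group, then pooled OLS."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)
    num = den = 0.0
    for members in groups.values():
        mx = sum(r[x_key] for r in members) / len(members)
        my = sum(r[y_key] for r in members) / len(members)
        for r in members:
            num += (r[x_key] - mx) * (r[y_key] - my)
            den += (r[x_key] - mx) ** 2
    return num / den


# Two hypothetical machines with very different baselines (100 vs 10)
# but the same within-machine slope of 2.
rows = [
    {"machine_id": "A", "log_duration": 1.0, "basket_total": 102.0},
    {"machine_id": "A", "log_duration": 2.0, "basket_total": 104.0},
    {"machine_id": "A", "log_duration": 3.0, "basket_total": 106.0},
    {"machine_id": "B", "log_duration": 2.0, "basket_total": 14.0},
    {"machine_id": "B", "log_duration": 4.0, "basket_total": 18.0},
]
beta = within_slope(rows, "machine_id", "log_duration", "basket_total")
```

Demeaning removes each machine's own baseline, so the slope reflects only variation within a machine. A random effect model instead treats those baselines as random draws, and the Hausman test checks whether the two sets of estimates differ systematically.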


Conclusion:

Using the propensity score matching (PSM) method, the results indicate that customers who spend more time and visit more web pages during online shopping are likely to spend more money in their transaction. On average, customers whose time on site is above the average pay $62 more in that transaction than customers whose time is below the average. Similarly, customers who visit more web pages than average pay $106 more than those who visit fewer.


To study in more depth how web behavior affects the total amount spent in a transaction, we built linear regression, logistic regression, and random and fixed effect models. The linear regression shows that a 1% increase in logged duration and logged pages viewed increases the transaction total by $14 and $25.7 respectively. The logistic regression indicates that a 1% increase in logged duration and logged pages viewed increases the odds of spending above the average by 0.02% and 0.037% respectively. Finally, the random and fixed effect models provide similar insights: duration and pages viewed both positively affect the transaction total.


For online shopping websites, providing attractive information that keeps customers viewing longer gives a higher chance of convincing customers to buy more products. Therefore, good graphic design, an attractive logo, and a comfortable background and font are essential, since these factors may affect how long customers stay on the website. In addition, relevant recommendation links can increase potential sales by encouraging customers to click through to more web pages.

Appendix - Figures

Figure 1A: Pie Chart of Education Level

Figure 1B: The Total Number of Transactions based on Product Category

Figure 1C: The Total Transaction Amount based on Product Category


Figure 1D: The Total Number of Transactions based on Shopping Platform

Figure 1E: The Total Transaction Amount based on Shopping Platform

Figure 1F: The Linear Relationship Between Duration and Pages Viewed

Figure 1G: Monthly Pages Viewed on Line Plot

Figure 1H: Monthly Duration on Line Plot

Figure 2A: Q-Q plots for (a) duration, (b) pages viewed, (c) total price before and after logarithm

Figure 2B: Correlation Between Duration and Pages Viewed


Animated Scatter Plot: Change by Month

PSM

“Duration” t-test before PSM

“Duration” t-test after PSM

“Duration” difference in mean before treatment covariates

“Duration” difference in mean after matching






“Page_viewed” t-test before PSM

“Page_viewed” t-test after PSM

“Page_viewed” difference in mean before treatment covariates

“Page_viewed” difference in mean after matching







Model:

Linear Regression:



Logistic Regression:






Random Effect Model:


Fixed Effect Model:

Attached Code:

https://github.com/kli34/WebBehaviorAnalysis/blob/master/EDA.Rmd

https://github.com/kli34/WebBehaviorAnalysis/blob/master/PSM.Rmd

machine learning

econometric

behavioral analysis

Kuang Li

Irvine, CA, USA

MSIM student UIUC MSBA UC Irvine
