Regression (for the non data scientist)

Regression is one of the most widely used techniques in statistics and machine learning. A regression model can be a straight line or a curve, and it can use one variable or many. The point of regression is to derive an equation (to build a model) that helps us understand, forecast and, most of all, quantify relationships.
If a picture is worth a thousand words, then an animation is worth a million. If you don't like reading, you can watch my YouTube videos on regression (see links below). They are some of my most popular videos, with well over 500,000 views.
Setting Up The Problem
Recently there has been some interest in how video game playing has affected crime. Juvenile arrests have dropped to a 30-year low, and fewer teens are being locked up than at any time in nearly 20 years. Before we give credit to the police and politicians, it is worth noting that crime rates in the USA, United Kingdom, Australia, New Zealand and several other developed countries are all following the same downward trend. Video game sales and usage in these same countries are on an upward trend. It turns out that youth commit a lot of crime, especially property crime. Youth also play a lot of video games. Could it be that these same youth are playing video games instead of committing crimes?
Building The Model
A model could be built with crime data from all the developed countries and video game usage per country. That would be an exhaustive study. In this short article, I am just going to see if there is a correlation between auto theft and the video game Grand Theft Auto in just the US.
Auto Theft
In the US auto theft has decreased over 50 percent since 1997 and the sales of the video game Grand Theft Auto (GTA) have tripled since 1997. It appears there is an inverse relationship (okay, I am sorta kidding here).
In the following graph, years are plotted along the x axis. Auto Theft Rate is in red and Grand Theft Auto (GTA) sales in millions of dollars are in black. The goal with regression is to take these two graphs and turn them into a model to understand how GTA and Auto Theft Rate are related. The model quantifies the relationship.
The following model was derived using an ordinary least squares routine in Python Pandas.
Auto Theft Rate = 557 - 6.8 times (GTA Sales).
The model is read as, "For every additional million dollars of GTA Sales, the Auto Theft Rate decreases by 6.8. If GTA sales were zero, the Auto Theft Rate would be 557."
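To make the reading of the model concrete, here is a minimal sketch of how a straight-line model like this can be fit and then used. It is a hand-rolled least squares calculation in plain Python, not the pandas routine I used, and the numbers are made up for illustration: they are generated from the equation above plus a small invented wiggle, so they are not the real data. The names fit_line and predict_auto_theft_rate are my own.

```python
# Made-up illustrative data, NOT the real figures behind the model above.
# gta_sales: GTA sales in millions of dollars (hypothetical values).
gta_sales = [5.0, 10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0]
wiggle = [3.0, -2.0, 1.0, -3.0, 2.0, -1.0, 3.0, -3.0]  # invented scatter
auto_theft_rate = [557.0 - 6.8 * x + w for x, w in zip(gta_sales, wiggle)]


def fit_line(xs, ys):
    """Ordinary least squares with one variable: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


fitted_intercept, fitted_slope = fit_line(gta_sales, auto_theft_rate)


def predict_auto_theft_rate(sales_millions):
    """Plug a sales figure into the fitted equation."""
    return fitted_intercept + fitted_slope * sales_millions


print(f"Auto Theft Rate = {fitted_intercept:.1f} + ({fitted_slope:.2f}) * GTA Sales")
print(f"Estimate at 12 million in sales: {predict_auto_theft_rate(12.0):.1f}")
```

The slope is the number that quantifies the relationship: it says how much the estimate moves for each additional million dollars of sales. The intercept is simply what the equation gives when sales are zero.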
All Models Have Error
No matter how many variables are added, all models have some error. It is important to quantify both the error and the goodness of fit. While there are many ways to calculate these, two of the most popular are R squared and the standard error of the estimate.
R Squared
R squared measures how much of the variation in the dependent variable the model explains. The R squared of the auto theft model is 0.84. Roughly speaking, GTA sales can account for about 84% of the variation in the Auto Theft Rate. R squared is between 0 and 1, where 1 represents a perfect fit and 0 indicates no linear relationship.
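For readers who want to see the arithmetic, a common way to compute R squared is 1 minus the unexplained variation divided by the total variation. The sketch below does that in plain Python. The actual and predicted values are made up purely for illustration, and the function name r_squared is my own.

```python
def r_squared(actual, predicted):
    """Share of the variation in `actual` accounted for by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    # Unexplained variation: squared gaps between actual and estimated values.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total variation: squared gaps between actual values and their mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


actual = [10.0, 12.0, 15.0, 19.0]     # made-up observed values
predicted = [10.5, 12.5, 14.5, 18.5]  # made-up model estimates
print(round(r_squared(actual, predicted), 3))  # prints 0.978
```

If the estimates match the actual values exactly, the function returns 1. If the model does no better than always guessing the average, it returns 0.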
Correlation does not mean Causation
Just because GTA and Auto Theft Rate are highly correlated does not mean the sales of GTA are causing a decrease in Auto Theft Rate. There may be other reasons why the Auto Theft Rates are going down such as better anti-theft systems on cars.
Standard Error of the Estimate
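The standard error of the estimate compares actual values to estimated values. It is a measure of how much error the model has, expressed in the same units as the variable being predicted. When it comes to the standard error of the estimate, the smaller the better. R squared and the standard error of the estimate both indicate goodness of fit of the model. For models with the same number of variables fit to the same data, these two measurements will not contradict each other: a higher R squared goes with a smaller standard error.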
The calculation
The note section includes a link to a video on how to calculate the standard error of the estimate.
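As a rough sketch of the calculation, the textbook formula for a straight-line model is the square root of the sum of squared errors divided by (n - 2), where n is the number of data points. The code below assumes that formula (the video may present the steps differently), uses made-up numbers, and the function name and the n_predictors parameter are my own.

```python
import math


def standard_error_of_estimate(actual, predicted, n_predictors=1):
    """Typical size of the gap between actual and estimated values."""
    n = len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Degrees of freedom: data points minus the coefficients being estimated
    # (one per predictor, plus the intercept).
    return math.sqrt(ss_res / (n - n_predictors - 1))


actual = [10.0, 12.0, 15.0, 19.0]     # made-up observed values
predicted = [10.5, 12.5, 14.5, 18.5]  # made-up model estimates
print(round(standard_error_of_estimate(actual, predicted), 3))  # prints 0.707
```

The result is in the same units as the thing being predicted, which is why it is easy to explain: here the estimates are typically off by about 0.7.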
Another Method of Validation
With large data sets the data can be divided into training and test data. Training data is used to create models and the test data is used to validate the model.
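Here is a minimal sketch of that idea in plain Python: shuffle the data, hold some of it back, build the line on the training rows only, then measure the error on the rows the model has never seen. The data is made up, the 25% test share is just an assumption, and the function names are my own.

```python
import random


def split_train_test(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and hold back a share of them for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_line_to_rows(rows):
    """Least squares line through (x, y) pairs: returns (intercept, slope)."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mean_abs_error(rows, intercept, slope):
    """Average absolute gap between actual and estimated y values."""
    return sum(abs(y - (intercept + slope * x)) for x, y in rows) / len(rows)


# Made-up data: a line y = 1 + 2x with a small invented wiggle.
rows = [(float(x), 1.0 + 2.0 * x + (0.5 if x % 2 else -0.5)) for x in range(20)]

train_rows, test_rows = split_train_test(rows)
intercept, slope = fit_line_to_rows(train_rows)   # model built on training data only
test_error = mean_abs_error(test_rows, intercept, slope)  # checked on unseen data
print(len(train_rows), len(test_rows), round(test_error, 2))
```

If the error on the test rows is much worse than the error on the training rows, the model has probably memorized the training data rather than found a real relationship.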
Machine Learning and Regression
For lack of a better analogy, machine learning is a guess and check methodology. Machine learning tries to minimize error, and specifically it tries to minimize the standard error of the estimate. Values for the coefficients of the "dimensions" or variables are guessed and the standard error of the estimate is calculated. A new set of values is guessed and the standard error of the estimate is recalculated. This process continues until the standard error of the estimate cannot be reduced any further.
The reason for using guess and check instead of just solving for the equation is computation time. In other words, with many variables and a lot of data, it can take much longer to derive the exact regression equation than to guess and check.
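One common form of guess and check is called gradient descent: start with a guess, see which direction reduces the error, nudge the guess that way, and repeat. The sketch below is my own minimal illustration, not how any particular library does it. It minimizes the mean squared error, which leads to the same line as minimizing the standard error of the estimate. The data is made up (it sits exactly on the line y = 1 + 2x), and the learning rate and number of steps are assumptions that happen to work for this small example.

```python
def mean_squared_error(xs, ys, intercept, slope):
    """Average squared gap between actual and estimated values."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def guess_and_check(xs, ys, learning_rate=0.05, steps=2000):
    """Gradient descent for a straight line: returns (intercept, slope)."""
    intercept, slope = 0.0, 0.0  # the first guess
    n = len(xs)
    for _ in range(steps):
        # How far off is the current guess at each data point?
        errors = [(intercept + slope * x) - y for x, y in zip(xs, ys)]
        # Direction in which the mean squared error increases fastest.
        grad_intercept = 2.0 * sum(errors) / n
        grad_slope = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        # Nudge the guess the opposite way to shrink the error.
        intercept -= learning_rate * grad_intercept
        slope -= learning_rate * grad_slope
    return intercept, slope


xs = [0.0, 1.0, 2.0, 3.0, 4.0]  # made-up data on the line y = 1 + 2x
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

gd_intercept, gd_slope = guess_and_check(xs, ys)
print(round(gd_intercept, 3), round(gd_slope, 3))
print(mean_squared_error(xs, ys, gd_intercept, gd_slope) < 1e-9)
```

With two coefficients this is overkill, since the exact answer can be computed directly. The approach earns its keep when there are so many variables and rows that solving directly becomes expensive.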
The "power" of machine learning is you can add every dimension or variable under the sun one at a time, all together, subtract them, add them, increase degrees, decrease degrees until you have some model that predicts the future of everything including the meaning of the universe. BTW, the answer to the meaning of the universe and everything is 42.
Back To Crime and Video Games (proving causation)
We know that youth contribute a lot to crime and especially property crime. According to FBI reports, most juvenile crimes occur between 3 pm and 7 pm on school days. Nielsen Media reports that video game usage starts at about 3 pm and peaks at 9 pm. If youth are occupying their time playing video games instead of wandering the streets after school, then it should be expected that crime falls.
A General Model
The variable we believe is causing this downward trend in crime is video game usage. This is our independent variable. Video game sales can be used as a proxy for video game usage. Typically we would plot the independent variable along the x axis (horizontal axis). The crime rate is our dependent variable because it depends on video game usage. The data and the statistics will NEVER PROVE CAUSATION (cause and effect). Causation is established by understanding the domain, in this case youth crime.
Beyond a reasonable doubt
It is hard to prove causation beyond a reasonable doubt. The FBI reports that total arrests of those 18 and under dropped over 40% (from 1,346,165 to 788,982) between 2005 and 2013. Specifically, vandalism arrests dropped 50%, vagrancy arrests (hanging around and being a general nuisance) dropped nearly 80% and disorderly conduct arrests dropped 50% during this same time period.
No plausible alternative explanations
It is important in regression to examine other possible causes of trends and rule them out. The experts argue about the causes of the decrease in youth crime, and their arguments include everything from legalization of abortion to climate change. As mentioned earlier, several countries are seeing the same downward trend in youth crime. Policies and laws differ across these countries, and one constant among them is the increase in video game playing.
Experimentation
Regression can be used for experimentation. A model can be built before and after to measure the impact of video gaming on crime. Imagine a police department measuring youth crime in a specific neighborhood and then going door to door and giving away video game systems. The police department could measure youth crime after deployment of all the games to see if there were any changes in youth crime.
In the End
Regression is used to model some business problem. Regression is a method to tie variables together and quantify the relationship. Any regression analysis should include a significant amount of discussion about the domain (the problem). I am only using crime and video gaming as a backdrop to understanding the decline in youth crime. In this case crime and video gaming is the business domain.
Notes:
Regression Playlist
https://www.youtube.com/playlist?list=PLF596A4043DBEAE9C
R Squared Calculation
https://youtu.be/w2FKXOa0HGA?list=PLF596A4043DBEAE9C
Standard Error of the Estimate Calculation
https://youtu.be/r-txC-dpI-E?list=PLF596A4043DBEAE9C
FBI Crime Data
https://www.fbi.gov/about-us/cjis/ucr/crime-in-the-u.s/2010/crime-in-the-u.s.-2010/tables/10tbl01.xls#overview
https://www.fbi.gov/about-us/cjis/ucr/crime-in-the-u.s/2013/crime-in-the-u.s.-2013/tables/table-34/table_34_five_year_arrest_trends_totals_2013.xls
Nielsen The Total Audience Report Table 1
http://ir.nielsen.com/files/doc_presentations/2014/The-Total-Audience-Report.pdf
http://time.com/120476/nielsen-video-games/
http://www.nielsen.com/content/dam/nielsen/en_us/documents/pdf/White%20Papers%20and%20Reports%20II/The%20State%20of%20the%20Video%20Gamer%20-%204th%20Quarter%202008.pdf