## Regression (for the non-data scientist)

One of the most used techniques in statistics and machine learning is regression. A regression model can be a straight line or a curve, and it can use one variable or many. The point of regression is to derive an equation (to build a model) that helps us understand, forecast and, most of all, quantify relationships.

If a picture is worth a thousand words, then an animation is worth a million. If you don't like reading, then you can watch my YouTube videos on regression (see the links below). They are some of my most popular videos, with well over 500,000 views.

**Setting Up The Problem**

Recently there has been some interest in how video game playing has impacted crime. Juvenile arrests for crime have dropped to a 30-year low, and fewer teens are being locked up than at any time in nearly 20 years. Before we give credit to the police and politicians, it is worth noting that crime rates in the USA, United Kingdom, Australia, New Zealand and several other developed countries are all following the same downward trend. Video game sales and usage in these same countries are on an upward trend. It turns out that youth commit a lot of crime, especially property crime. Youth also play a lot of video games. Could it be that these same youth are playing video games instead of committing crimes?

**Building The Model**

A model could be built with crime data from all the developed countries and video game usage per country. That would be an exhaustive study. In this short article, I am just going to see if there is a correlation between auto theft and sales of the video game Grand Theft Auto in the US alone.

*Auto Theft*

In the US auto theft has decreased over 50 percent since 1997 and the sales of the video game Grand Theft Auto (GTA) have tripled since 1997. It appears there is an inverse relationship (okay, I am sorta kidding here).

In the following graph, years are plotted along the x axis. Auto Theft Rate is in red and Grand Theft Auto (GTA) sales in millions of dollars are in black. The goal with regression is to take these two graphs and turn them into a model to understand how GTA and Auto Theft Rate are related. The model quantifies the relationship.

The following model was derived using an ordinary least squares routine in Python Pandas.

**Auto Thefts = 557 - 6.8 times (GTA Sales).**

The model is read as, "For every million dollars of GTA sales, the Auto Theft Rate decreases by 6.8. If GTA sales were zero, the Auto Theft Rate would be 557."
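As a rough sketch of how an ordinary least squares routine arrives at an intercept and a slope like these, the closed-form calculation for one variable can be written in a few lines of plain Python. This is not the pandas code used for the article. The sales figures below are hypothetical, and the theft rates are generated from the stated model so that the fit can be checked against it.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, and how x varies on its own.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, x):
    """Estimated y for a given x using the fitted line."""
    return intercept + slope * x


# Hypothetical sales figures (millions of dollars); NOT the article's data.
gta_sales = [0.0, 10.0, 20.0, 30.0, 40.0]
# Rates generated from the article's stated model, so the fit should recover it.
auto_theft_rate = [557.0 - 6.8 * s for s in gta_sales]

intercept, slope = fit_line(gta_sales, auto_theft_rate)
print(round(intercept, 2), round(slope, 2))
print(round(predict(intercept, slope, 25.0), 2))
```

Because the made-up points lie exactly on the line, the fit recovers the same intercept and slope. Real data would scatter around the line, which is where the error measures come in.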

**All Models Have Error**

No matter how many variables are added, all models have some error. It is important to quantify the error and the goodness of fit. While there are many ways to calculate these, the two most popular are R squared and the standard error of the estimate.

**R Squared**

R squared measures how much of the variation in the dependent variable the model explains. With a single variable, it is the square of the correlation coefficient. The R squared of the auto theft model is 0.84. Roughly speaking, GTA sales can account for about 84% of the variation in the Auto Theft Rate. R squared is between 0 and 1, where 1 represents a perfect fit and 0 indicates no linear relationship.
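A minimal sketch of the R squared arithmetic, assuming the usual definition of one minus the unexplained variation divided by the total variation. The function name and the numbers are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def r_squared(actual, predicted):
    """Share of the variation in `actual` that the predictions account for."""
    mean_actual = sum(actual) / len(actual)
    # Total variation of the actual values around their mean.
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    # Variation the model leaves unexplained (the squared errors).
    ss_error = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return 1.0 - ss_error / ss_total


# Hypothetical values, for illustration only.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 6.5, 9.5]

print(r_squared(actual, predicted))
```

Here the total variation is 20 and the squared errors add up to 1, so the model explains 95% of the variation.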

*Correlation does not mean Causation*

Just because GTA and Auto Theft Rate are highly correlated does not mean the sales of GTA are causing a decrease in Auto Theft Rate. There may be other reasons why the Auto Theft Rates are going down, such as better anti-theft systems on cars.

**Standard Error of the Estimate**

The standard error of the estimate compares actual values to estimated values. It measures how far, on average, the actual values fall from the model's estimates, in the same units as the dependent variable. When it comes to the standard error of the estimate, the smaller the better. R squared and the standard error of the estimate both indicate the goodness of fit of the model. For models with the same number of variables fit to the same data, these two measurements will not contradict each other: a higher R squared goes with a smaller standard error.

*The calculation*

The Notes section includes a link to a video on how to calculate the standard error of the estimate.
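For readers who prefer code to video, here is a small sketch of the calculation. It assumes the common textbook form: the square root of the summed squared errors divided by the number of observations minus the number of fitted coefficients (two for a straight line). The function name and the data are hypothetical.

```python
import math


def standard_error_of_estimate(actual, predicted, n_coefficients=2):
    """Typical size of the model's errors, in the units of the actual values."""
    ss_error = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # A straight line uses up two degrees of freedom (intercept and slope).
    degrees_of_freedom = len(actual) - n_coefficients
    return math.sqrt(ss_error / degrees_of_freedom)


# Hypothetical values, for illustration only.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 6.5, 9.5]

print(round(standard_error_of_estimate(actual, predicted), 4))
```

The squared errors add up to 1 and there are 4 - 2 = 2 degrees of freedom, so the result is the square root of 0.5.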

**Another Method of Validation**

With large data sets, the data can be divided into training data and test data. The training data is used to create the model and the test data is used to validate it.
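The split can be sketched as follows. The data is synthetic (generated from the article's stated model plus random noise), and the 75/25 split, the seeds and the helper names are assumptions made for illustration.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def split_train_test(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and hold back a fraction for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Synthetic (x, y) pairs: the stated model plus noise. NOT real crime data.
rng = random.Random(42)
rows = [(float(x), 557.0 - 6.8 * x + rng.gauss(0, 5.0)) for x in range(40)]

train, test = split_train_test(rows)

# Build the model on the training rows only.
intercept, slope = fit_line([x for x, _ in train], [y for _, y in train])

# Check it against rows the model has never seen.
test_rmse = (sum((y - (intercept + slope * x)) ** 2 for x, y in test) / len(test)) ** 0.5
print(len(train), len(test), round(test_rmse, 2))
```

If the error on the test rows is much larger than the error on the training rows, the model has probably learned quirks of the training data rather than the underlying relationship.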

**Machine Learning and Regression**

For lack of a better analogy, machine learning is a guess-and-check methodology. Machine learning tries to minimize error, and specifically it tries to minimize the standard error of the estimate. Values for the coefficients of the "dimensions" or variables are guessed and the standard error of the estimate is calculated. A new set of values is guessed and the standard error of the estimate is recalculated. This process continues until the standard error of the estimate cannot be reduced any further.

The reason for guessing and checking instead of just solving for the equation is computation time. In other words, with many variables and a lot of data, it can take a long time to derive the exact regression equation compared with guessing and checking.
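To make the guess-and-check idea concrete, here is a deliberately simple sketch: nudge each coefficient up and down, keep any nudge that lowers the standard error of the estimate, and shrink the nudge when nothing helps. This is a plain coordinate search chosen for readability, not the gradient-based optimizers that real machine learning libraries typically use. The data, the function names and the step sizes are all hypothetical.

```python
import math


def standard_error(intercept, slope, xs, ys):
    """Standard error of the estimate for a candidate line."""
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / (len(xs) - 2))


def guess_and_check(xs, ys, step=1.0, tolerance=1e-6, max_rounds=100_000):
    """Search for the intercept and slope by repeated guessing."""
    intercept, slope = 0.0, 0.0  # first guess
    best = standard_error(intercept, slope, xs, ys)
    for _ in range(max_rounds):
        if step <= tolerance:
            break
        improved = False
        # Guess: move one coefficient up or down by the current step.
        for d_intercept, d_slope in ((step, 0.0), (-step, 0.0), (0.0, step), (0.0, -step)):
            error = standard_error(intercept + d_intercept, slope + d_slope, xs, ys)
            # Check: keep the guess only if the error went down.
            if error < best:
                intercept += d_intercept
                slope += d_slope
                best = error
                improved = True
        if not improved:
            step /= 2.0  # no guess helped, so try smaller guesses
    return intercept, slope, best


# Hypothetical points lying on the line y = 10 - 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [10.0 - 2.0 * x for x in xs]

intercept, slope, error = guess_and_check(xs, ys)
print(round(intercept, 3), round(slope, 3))
```

The search ends up very close to an intercept of 10 and a slope of -2 without ever solving the regression equations directly.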

The "power" of machine learning is you can add every dimension or variable under the sun one at a time, all together, subtract them, add them, increase degrees, decrease degrees until you have some model that predicts the future of everything including the meaning of the universe. BTW, the answer to the meaning of the universe and everything is 42.

**Back To Crime and Video Games (proving causation)**

We know that youth contribute a lot to crime and especially property crime. According to FBI reports, most juvenile crimes occur between 3 pm and 7 pm on school days. Nielsen Media reports that video game usage starts at about 3 pm and peaks at 9 pm. If youth are occupying their time playing video games instead of wandering the streets after school, then it should be expected that crime falls.

**A General Model**

The variable we suspect is causing this downward trend in crime is video game usage. This is our independent variable. Video game sales can be used as a proxy for video game usage. Typically we would plot this along the x axis (horizontal axis). The crime rate is our dependent variable because it depends on video game usage. The data and the statistics will NEVER PROVE CAUSATION (cause and effect). Causation is proved by understanding the domain, in this case youth crime.

*Beyond a reasonable doubt*

It is hard to prove causation beyond a reasonable doubt. The FBI reports that total arrests of those 18 and under dropped over 40% (1,346,165 to 788,982) from 2005 to 2013. Specifically, vandalism arrests dropped 50%, vagrancy (hanging around and being a general nuisance) dropped nearly 80% and disorderly conduct arrests dropped 50% during this same time period.

*No plausible alternative explanations*

It is important in regression to examine other possible causes of trends and rule them out. The experts argue about the causes of the decrease in youth crime, and their arguments include everything from legalization of abortion to climate change. As mentioned earlier, several countries are seeing the same downward trend in youth crime. Policies and laws differ in all these countries, and one constant between them is the increase in video game playing.

*Experimentation*

Regression can be used for experimentation. A model can be built before and after to measure the impact of video gaming on crime. Imagine a police department measuring youth crime in a specific neighborhood and then going door to door and giving away video game systems. The police department could measure youth crime after deployment of all the games to see if there were any changes in youth crime.
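One way such a before-and-after comparison could be framed as a regression is to use a 0/1 variable for "after the giveaway". The slope on that variable is then the change in average crime. The monthly counts below are entirely made up for illustration, and the helper name is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Made-up monthly youth crime counts for one neighborhood. NOT real data.
before = [12.0, 15.0, 11.0, 14.0]
after = [9.0, 10.0, 8.0, 9.0]

# 0 marks a month before the giveaway, 1 marks a month after it.
indicator = [0.0] * len(before) + [1.0] * len(after)
crime = before + after

intercept, slope = fit_line(indicator, crime)
print(intercept, slope)
```

The intercept is the average count before the giveaway and the slope is the change afterward. A change on its own still does not rule out other explanations, which is why the domain discussion matters.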

**In the End**

Regression is used to model some business problem. Regression is a method to tie variables together and quantify the relationship. Any regression analysis should include a significant amount of discussion about the domain (the problem). I am only using crime and video gaming as a backdrop to understanding the decline in youth crime. In this case, crime and video gaming are the business domain.

**Notes:**

*Regression Playlist*

https://www.youtube.com/playlist?list=PLF596A4043DBEAE9C

*R Squared Calculation*

https://youtu.be/w2FKXOa0HGA?list=PLF596A4043DBEAE9C

*Standard Error of the Estimate Calculation*

https://youtu.be/r-txC-dpI-E?list=PLF596A4043DBEAE9C

*FBI Crime Data*

https://www.fbi.gov/about-us/cjis/ucr/crime-in-the-u.s/2010/crime-in-the-u.s.-2010/tables/10tbl01.xls#overview

https://www.fbi.gov/about-us/cjis/ucr/crime-in-the-u.s/2013/crime-in-the-u.s.-2013/tables/table-34/table_34_five_year_arrest_trends_totals_2013.xls

*Nielsen The Total Audience Report Table 1*

http://ir.nielsen.com/files/doc_presentations/2014/The-Total-Audience-Report.pdf

http://time.com/120476/nielsen-video-games/

http://www.nielsen.com/content/dam/nielsen/en_us/documents/pdf/White%20Papers%20and%20Reports%20II/The%20State%20of%20the%20Video%20Gamer%20-%204th%20Quarter%202008.pdf