David Longstreet

Major League Baseball had a steroid problem

"You can observe a lot by just watching" Yogi Berra

Baseball
It is not clear when steroids started being used in MLB, but data can help us answer this question. It has been well known for decades that steroids improve athletic performance. Steroid usage was banned by the Olympics in the late 1960s. MLB didn't punish players for using steroids until the early 2000s (over 30 years after the Olympic ban).
The data suggests steroid usage in MLB started in the early 1980s.
The following graph shows that the number of players hitting over 20 home runs in a single season increased year after year, peaking in 1999 with 95 players. After enforcement (the grey shaded area), the number began a rapid downward trend, and in the last full season (2014) about 50 players achieved this mark. Using 50 players as a benchmark (dashed line), it is easy to see the abuse of steroids in the 1990s. It is also evident that some players were experimenting with steroids in the late 1980s.

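[Figure: players hitting over 20 home runs per season, with the enforcement period shaded grey and a dashed line at the 50-player benchmark]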
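For readers who want to build a chart like this themselves, here is a minimal sketch of how the per-season counts could be tallied. The record layout, the sample rows, and the function name `players_over_threshold` are my own assumptions for illustration; the rows are made up and are not MLB data.

```python
from collections import Counter

# Hypothetical player-season records: (year, player, home_runs).
# These rows are invented for illustration only; they are NOT real MLB data.
records = [
    (1998, "Player A", 34),
    (1998, "Player B", 20),
    (1998, "Player C", 21),
    (1999, "Player A", 45),
    (1999, "Player B", 12),
]

def players_over_threshold(rows, threshold=20):
    """Count, per season, the players with strictly more than `threshold` home runs."""
    counts = Counter()
    for year, _player, home_runs in rows:
        if home_runs > threshold:
            counts[year] += 1
    return dict(counts)

counts_by_year = players_over_threshold(records)
print(counts_by_year)  # {1998: 2, 1999: 1}
```

Note that "over 20" is treated as strictly greater than 20, so a player with exactly 20 home runs is not counted. Changing `threshold` is an easy way to experiment with where the cutoff should sit.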
The model below was created to estimate where the home run curve (dark blue line) should have been if steroids had not been used by MLB players. MLB data supports that players have gradually become bigger, stronger, and faster over the past 70 years. Training regimens have improved as well. This is why there is a steady climb in the modeled home run curve. There is an upper bound, or limit, and this is why the model curve starts to flatten out at about 50 players.

[Figure: players hitting over 20 home runs per season, with the modeled home run curve (dark blue line)]
All models have error.
Keep in mind that models are just models; they are not reality. Models are used to help understand and to predict, and there are errors in all of them.
Taking steroids improves timing, not just power
Data can help determine whether the benefit of taking steroids is power or timing. Striking out a lot and hitting a lot of home runs were nicely correlated for over 70 years, until MLB enforced its ban on steroids. This chart should make everyone say, "wow, is your data correct?" Yes, it is correct.
[Figure: the number of players hitting over 20 home runs in a single season and the number of players striking out over 50 times in a season]

Unfortunately, a lot of MLB players grew up munching on steroids and learned how to hit with the benefit of steroids. The number of players striking out a lot soared after MLB enforced its ban on steroids. Data that was correlated for nearly 70 years is no longer correlated. The graph clearly shows that not taking steroids is negatively impacting timing.
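One way to put a number on "correlated before, not correlated after" is to compute the correlation coefficient separately on each side of a split year. The sketch below uses `numpy.corrcoef`; the function name `correlation_before_after`, the `split_year` parameter, and the tiny series are assumptions of mine, and the values are invented purely to show the mechanics.

```python
import numpy as np

def correlation_before_after(years, series_a, series_b, split_year):
    """Return Pearson r for the seasons before split_year and for those from it onward."""
    years = np.asarray(years)
    a = np.asarray(series_a, dtype=float)
    b = np.asarray(series_b, dtype=float)
    before = years < split_year
    r_before = np.corrcoef(a[before], b[before])[0, 1]
    r_after = np.corrcoef(a[~before], b[~before])[0, 1]
    return r_before, r_after

# Made-up season indexes and counts, NOT real MLB data.
seasons = [1, 2, 3, 4, 5, 6, 7, 8]
home_run_hitters = [1, 2, 3, 4, 4, 3, 2, 1]
strikeout_hitters = [2, 4, 6, 8, 1, 2, 3, 4]

r_before, r_after = correlation_before_after(
    seasons, home_run_hitters, strikeout_hitters, split_year=5
)
print(round(r_before, 2), round(r_after, 2))
```

In this toy example the two series move together before the split and in opposite directions after it. With real data, the choice of split year is a judgment call, just like the thresholds.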

Establishing a baseline and organizing data into thresholds is difficult in data analysis, but it is extremely important. Domain knowledge and experimentation with the data will help with setting a threshold and establishing a baseline.  


Notes:
The war years (1942-1945) and the baseball strike years (1981 and 1994) have been removed from the data set. Many baseball greats (Ted Williams, Joe DiMaggio, and others) served in the military during the war years. The baseball strikes in 1981 and 1994 cut the baseball seasons short, and therefore these years were treated as anomalies.
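Dropping those seasons is a simple filter. Here is a minimal sketch, assuming the yearly counts live in a dictionary keyed by year; the names `EXCLUDED_YEARS` and `drop_excluded_years` and the sample counts are hypothetical.

```python
# Seasons excluded in the notes above: the war years and the two strike years.
EXCLUDED_YEARS = set(range(1942, 1946)) | {1981, 1994}

def drop_excluded_years(counts_by_year):
    """Return a copy of the yearly counts without the excluded seasons."""
    return {
        year: count
        for year, count in counts_by_year.items()
        if year not in EXCLUDED_YEARS
    }

# Made-up counts for illustration only; NOT real MLB data.
sample_counts = {1941: 10, 1942: 7, 1945: 5, 1946: 11, 1981: 9, 1993: 40, 1994: 30}
cleaned = drop_excluded_years(sample_counts)
print(sorted(cleaned))  # [1941, 1946, 1993]
```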

The model is a polynomial equation with an R squared of 0.82. The model was constructed using the assumption that the maximum number of players who should hit over 20 home runs in a single season is around 50.

Python and NumPy were used to derive the model. In the very near future, I plan on showing all my Python code. The actual model is below (where y is the number of players hitting over 20 home runs and x is the number of years).
[Figure: the polynomial model equation]
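Until the actual code is posted, here is a rough sketch of how a polynomial fit and its R squared could be computed with NumPy's `polyfit` and `polyval`. This is not the model from this article: the degree, the helper name `fit_polynomial`, and the data points are all assumptions, and the data is generated from a made-up curve that flattens near 50 just to mimic the shape described above.

```python
import numpy as np

def fit_polynomial(x, y, degree):
    """Least-squares polynomial fit; returns (coefficients, R squared)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    coeffs = np.polyfit(x, y, degree)      # highest power first
    predicted = np.polyval(coeffs, x)
    ss_res = np.sum((y - predicted) ** 2)  # residual sum of squares
    ss_tot = np.sum((y - np.mean(y)) ** 2) # total sum of squares
    r_squared = 1.0 - ss_res / ss_tot
    return coeffs, r_squared

# Made-up data, NOT real MLB data: x is years since the start of the series,
# y follows an invented quadratic that levels off near 50.
x = np.arange(10, dtype=float)
y = 10.0 + 8.0 * x - 0.4 * x**2

coeffs, r_squared = fit_polynomial(x, y, degree=2)
print(coeffs, r_squared)
```

Because the toy data lies exactly on a quadratic, the fit recovers it almost perfectly. Real season counts are noisy, which is how you end up with an R squared like 0.82 rather than something close to 1.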

Passions and Professionalisms 