Data Science

Data Science is the combination of psychology, economics and statistics. Computer science and programming are important, but programming is a tool that allows us to do data science; it is not data science itself. The ultimate goal of data science is to model human behavior, to understand past trends and to predict the future. Data Science relies on learning from the data itself and not just from domain experts.
What has changed over the past decade or so is the enormous amount of data that has been accumulated. What drove this accumulation is basic economics: it became inexpensive to gather and store data. Many organizations have now gathered literally billions of observations, or rows of information. Much of it was collected with no specific intent, simply because it could be collected. The real question is whether all this data has any economic value.
Data Mining

Once upon a time, I spent hours looking for data. I combed through government documents, survey results, or any data source I could find. I was happy to get a thimbleful of data. I actually thought having more data would solve my problems. It turns out that enormous data sets create an entirely new set of problems.
Big and Messy Data
An enormous amount of time is spent cleaning up and reorganizing data. With many data sets there is a tremendous amount of work that takes place before any data science, statistical analysis or modeling. I led a project to gather up voter registration lists from all 50 states. There are over 200 million registered voters in the US and each state has a different file format. A lot of time was spent just organizing the data. Python was the tool of choice.
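To give a feel for what "just organizing the data" can involve, here is a minimal Python sketch that maps two differently formatted files onto one common layout. The state codes, column names, delimiters and sample rows are all hypothetical and chosen only for illustration; real state voter files differ in many more ways than this.

```python
import csv
import io

# Hypothetical layouts: the delimiters and column names are assumptions
# made for this example, not the actual formats used by any state.
STATE_LAYOUTS = {
    "KS": {
        "delimiter": ",",
        "columns": {"first_name": "FirstName", "last_name": "LastName", "zip_code": "Zip"},
    },
    "MO": {
        "delimiter": "|",
        "columns": {"first_name": "FNAME", "last_name": "LNAME", "zip_code": "ZIP5"},
    },
}


def normalize_voter_file(state, text):
    """Read one state's delimited text and return rows in a common layout."""
    layout = STATE_LAYOUTS[state]
    reader = csv.DictReader(io.StringIO(text), delimiter=layout["delimiter"])
    rows = []
    for raw in reader:
        row = {"state": state}
        # Rename each source column to its common name and trim whitespace.
        for common_name, source_name in layout["columns"].items():
            row[common_name] = raw[source_name].strip()
        # Rough cleanup: consistent capitalization and a 5-digit ZIP code.
        row["first_name"] = row["first_name"].title()
        row["last_name"] = row["last_name"].title()
        row["zip_code"] = row["zip_code"][:5]
        rows.append(row)
    return rows


# Made-up sample data in the two hypothetical formats.
ks_text = "FirstName,LastName,Zip\nalice,SMITH,66044-1234\n"
mo_text = "FNAME|LNAME|ZIP5\n BOB |jones|64108\n"

voters = normalize_voter_file("KS", ks_text) + normalize_voter_file("MO", mo_text)
for voter in voters:
    print(voter)
```

Keeping the per-state differences in a lookup table, rather than in separate code paths, means adding another state is mostly a matter of describing its layout.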
At H&R Block I led a team of data scientists where we looked for trends and patterns within 20 million tax returns. Each federal tax return can have up to 8,000 fields (plus several thousand additional fields for state tax returns). A tremendous amount of time was spent cleaning and scrubbing data. The tools of choice were SQL, SAS and R.
Now I am at FanThreeSixty and we study Fan Behaviors. In short, our objective is to help teams and venues sell more hot dogs and tickets. We gather data primarily from ticketing, mobile app and point of sale. We work with nearly 100 teams across Professional, Minor League, University and High School sports. We have observations on nearly 10 million fans. One of the biggest challenges is linking all the disparate data together into a single fan view. At FanThreeSixty our tools of choice are Python and SQL.
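As a rough illustration of what a "single fan view" means, the Python sketch below rolls up records from three sources into one summary per fan. It assumes every record carries an email address that can serve as the linking key, which is a simplification for this example; the field names and sample records are hypothetical, and real linkage usually has to cope with missing or conflicting identifiers.

```python
from collections import defaultdict


def normalize_email(email):
    """Treat emails that differ only by case or stray spaces as the same fan."""
    return email.strip().lower()


def build_fan_view(ticketing, mobile_app, point_of_sale):
    """Combine three record lists into one summary dict per fan.

    Assumes (for illustration only) that each record has an 'email' field.
    """
    fans = defaultdict(
        lambda: {"tickets": 0, "app_sessions": 0, "concession_spend": 0.0}
    )
    for rec in ticketing:
        fans[normalize_email(rec["email"])]["tickets"] += rec["quantity"]
    for rec in mobile_app:
        fans[normalize_email(rec["email"])]["app_sessions"] += 1
    for rec in point_of_sale:
        fans[normalize_email(rec["email"])]["concession_spend"] += rec["amount"]
    return dict(fans)


# Made-up records: the same fan appears with slightly different spellings.
ticketing = [{"email": "Fan@Example.com ", "quantity": 2}]
mobile_app = [{"email": "fan@example.com"}]
point_of_sale = [{"email": "FAN@example.com", "amount": 7.5}]

fan_view = build_fan_view(ticketing, mobile_app, point_of_sale)
for email, summary in fan_view.items():
    print(email, summary)
```

The important step is the normalization of the key before matching; without it, the three records above would be counted as three different fans.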
New Skills Emerge
With these large data sets, a new skill set emerged: computer programming (of sorts). I say "of sorts" because you don't need to be a mission-critical computer programmer to be a solid data scientist. I maintain that your programming skills have to be just good enough.
It is important to note that programming becomes easier with each iteration of software languages. Today data scientists have several languages at their fingertips. Python and R are both rich with statistical analysis programs. While I used R for a period of time, I have switched to Python. Python is easier to use for data mining, data organization, and especially data cleanup. It is difficult to use R for these chores. In the past, I have used SPSS (and taught it in my statistics classes) and SAS. My team used SAS, R, SQL and Python at H&R Block. Now I am at FanThreeSixty and we use primarily Python and SQL. Like a skilled carpenter, a good data scientist uses the best tool for the job.