Discrete v. Continuous

A simple light switch is discrete. A light is on or it is off. You either vote or you don't vote. You click or you don't click. You either go to the movie or you don't go. There is no in-between. Or is there?
Enter Logistic Regression. Logistic regression is to statistics what a dimmer switch is to light. It is a technique to measure and predict likelihood. It helps you understand the in-between. It takes discrete data and turns it into something continuous (or tries to). Instead of the light being on or off, there are shades of light. Instead of voting or not voting, there are smooth curves. Continuous data are things like income, age, or temperature.
Understanding Likelihood

A person is more likely to vote in an election if they are older. They are more likely to vote if they are college educated. They are more likely to vote if they have a higher income. Each of these "variables" or "dimensions" increases the likelihood of a person voting. It does not guarantee that a person votes; it increases the odds of a person voting.
A person is more likely to go see the new Mission Impossible movie if they saw and liked a previous Mission Impossible movie. The odds increase with each Mission Impossible movie seen. Add "Whole Lotta Love" by Led Zeppelin to the movie trailer and it increases the likelihood of appealing to a whole generation. Again, being a fan of Mission Impossible and being from the 70's does not guarantee a person will go to see the new movie; it just increases the odds.
0's and 1's

Imagine I wanted to understand and model the odds of a person voting by age. The following table has two columns. The first is labeled Voted and the second is labeled Age. A person voted (Voted = 1) or they did not vote (Voted = 0). This is discrete data. Age is continuous data.
Voted  Age
1      55
0      18
0      25
1      36
Now imagine millions of these rows of zeros, ones, and ages! With a little help from Python, logistic regression takes this data and turns it into odds and a nice smooth curve (see graph below).
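As a rough illustration of what that Python help might look like, here is a minimal sketch that fits a one-variable logistic regression using only the standard library. The ages and votes are toy rows (the four from the table plus ten invented ones), the function names fit_logistic and odds_of_voting are hypothetical, and Newton's method is just one common way to fit the model. Nothing here is the actual script or data behind the chart.

```python
import math

# Toy data: the four rows from the table plus ten invented rows.
# These are illustrative only, not real voter records.
ages  = [55, 18, 25, 36, 22, 29, 33, 41, 47, 58, 63, 70, 76, 82]
voted = [ 1,  0,  0,  1,  0,  1,  0,  0,  1,  0,  1,  1,  1,  1]

def fit_logistic(xs, ys, iterations=25):
    """Fit log(odds) = b0 + b1 * x by Newton's method (maximum likelihood)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0            # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0    # entries of the 2x2 information matrix
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))  # predicted probability
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system for the coefficient update.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

def odds_of_voting(age, b0, b1):
    """Odds (for versus against) implied by the fitted curve."""
    return math.exp(b0 + b1 * age)

b0, b1 = fit_logistic(ages, voted)
for age in (20, 40, 60, 80):
    print(age, round(odds_of_voting(age, b0, b1), 2))
```

With millions of rows, a library such as scikit-learn or statsmodels would be the more usual choice, but the idea is the same: zeros and ones go in, a smooth curve of odds comes out.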
Understanding The Odds

Odds are expressed as for and against. The odds of a coin flip are 1 to 1, or even. They are even because you are as likely to get heads as tails when you flip a coin. If the odds of a voter are 1 to 1, they are as likely to vote as not to vote. If the odds of a voter are 2 to 1, then they are twice as likely to vote as not to vote. If the odds of a voter are 1 to 2, then they are twice as likely not to vote as to vote.
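For readers who think in probabilities, the relationship between odds and probability can be sketched in a few lines of Python. The function names are made up for this example; the formulas are the standard definitions (odds equal the probability for divided by the probability against).

```python
def odds_from_probability(p):
    """Convert a probability (0 <= p < 1) into odds in favor."""
    return p / (1.0 - p)

def probability_from_odds(odds):
    """Convert odds in favor back into a probability."""
    return odds / (1.0 + odds)

print(odds_from_probability(0.5))   # coin flip: 1.0, i.e. 1 to 1
print(probability_from_odds(2.0))   # 2 to 1 odds: probability of about 0.667
print(probability_from_odds(0.5))   # 1 to 2 odds: probability of about 0.333
```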
Odds of Voting

Voting odds are expressed as voted (1) and not voted (0). The chart (Figure 1) was created from voting data collected by individual states for the last general election (November 2014). It models the likelihood of a person voting by age (blue line). There is not much argument that the odds of a person voting increase with age. What logistic regression helps to do is quantify the likelihood or odds.
Remember, the odds of flipping a coin are 1 to 1, or even. As you can see, 18 to 39 year olds were more likely not to vote than to vote in the last general election (November 2014). Their odds are under the 1 to 1 line, or the even money line. After the age of 40, the odds improve with age.
The odds of a person in their 80's voting are 7 to 1. That means they are 7 times more likely to vote than not to vote. The odds of a person in their 20's voting are 0.50 (or 1 to 2). That means 20 year olds were twice as likely not to vote as to vote in the 2014 election. We can quantify the differences between any two age groups. The odds of an 80 year old voting are 14 times (7 divided by 0.50) the odds of someone in their 20's.
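The arithmetic behind that comparison can be checked in a few lines of Python. The two odds values come straight from the text; the conversion to probability (odds divided by one plus odds) is the standard definition and is included here only to show how a ratio of odds differs from a ratio of probabilities.

```python
odds_80s = 7.0   # 7 to 1, from the text
odds_20s = 0.5   # 1 to 2, from the text

# Ratio of the two odds
odds_ratio = odds_80s / odds_20s
print(odds_ratio)  # 14.0

def probability_from_odds(odds):
    """Convert odds in favor into a probability."""
    return odds / (1.0 + odds)

p_80s = probability_from_odds(odds_80s)   # 7 / 8 = 0.875
p_20s = probability_from_odds(odds_20s)   # 0.5 / 1.5, about 0.333
print(p_80s / p_20s)                      # ratio of probabilities, about 2.6
```

Both ratios move in the same direction, which is the point that matters here, but the factor of 14 is a comparison of odds rather than of probabilities.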
Odds Across Time

Logistic regression allows comparison over time. In the 2008 general election (when President Obama was first elected), the odds for someone in their twenties to vote were 1 to 1, or even money. They were as likely to vote as not to vote. But by the 2014 midterm election, their odds had dropped to 1 to 2. They were twice as likely not to vote as to vote.
Voting Models

Voting models can be complex. In this article I only discussed the odds of voting or not voting. In future articles, I plan on presenting models that help predict, if a person votes, the odds that they will vote for a specific candidate. These models include a lot of variables beyond age. They include income, education, geography, and gender, to name a few. They can also include more discrete variables, such as whether a person listens to NPR or Fox News.
Summary

In all honesty, the math for logistic regression can be messy and hard to understand. You don't really need to understand the underlying mathematics to understand logistic regression. You just need to understand the odds.
I also use the terms odds and likelihood interchangeably. The reason is that most people do not think in odds or probabilities; they think in likelihood. Likelihood is good enough because odds, probability, and likelihood all move in the same direction.
By the way, I have seen all the previous Mission Impossible movies, I am from the 70's, I grew up listening to Led Zeppelin, and I am going to the new Mission Impossible. In fact, I am going to go see the movie right after I post this article.
The Math

I built the equation (Figure 2) for the 2014 general election. I plan on creating a video that explains how the actual math is calculated. I used a Python script to derive the equation using nearly 2 million active registered voters. Again, if you are not familiar with logs or regression, then the math is going to be hard.
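For readers who want the general shape of such an equation, a one-variable logistic model typically takes the form below. This is the textbook form, not the fitted equation from Figure 2: $p$ is the probability of voting, and $\beta_0$ and $\beta_1$ are placeholder coefficients, not the values derived from the voter data.

```latex
\log\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 \cdot \text{age}
\qquad\Longleftrightarrow\qquad
\text{odds} = \frac{p}{1-p} = e^{\beta_0 + \beta_1 \cdot \text{age}}
```

In this form the odds are even (1 to 1) at the age where $\beta_0 + \beta_1 \cdot \text{age} = 0$, and each additional year of age multiplies the odds by $e^{\beta_1}$.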
More On Odds

In the movie Dumb and Dumber, the character Lloyd, played by Jim Carrey, asks the attractive redhead (Mary Swanson) what his chances are of getting together with her. Her eventual response is "one out of a million." Lloyd's response is "so you're telling me there's a chance." Mary Swanson was played by Lauren Holly. In a bit of movie irony, Lauren Holly and Jim Carrey would get married in 1996 (two years after the movie). So I guess he had a chance after all, but they divorced in 1997.