## Logistic Regression

for the non-data scientist

Enter logistic regression. Logistic regression is to statistics what a dimmer switch is to light. It is a technique to measure and predict likelihood. It helps you understand the in-between. It takes discrete data (voted or did not vote) and turns it into something continuous (or tries to). Instead of the light being on or off, there are shades of light. Instead of voting or not voting, there are smooth curves. Continuous data are things like income, age, or temperature.

A person is more likely to vote in an election if they are older. They are more likely to vote if they are college educated. They are more likely to vote if they have higher incomes. Each of these

A person is more likely to go see the new

| Voted | Age |
|-------|-----|
| 1     | 55  |
| 0     | 18  |
| 0     | 25  |
| 1     | 36  |

Now imagine millions of these rows of zeros, ones, and ages! With a little help from Python, logistic regression takes this data and turns it into odds and a nice smooth curve (see graph below).
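To make "a little help from Python" concrete, here is a minimal sketch of fitting a one-variable logistic regression. The ages and votes below are made up for illustration (they are not the state voting data), and the fitting routine is a plain Newton's method written with only the standard library, which is an assumption on my part and not necessarily how the script behind this article works.

```python
import math

# Hypothetical toy data: 1 = voted, 0 = did not vote (NOT the real voter file).
ages = [18, 22, 25, 29, 33, 36, 41, 47, 52, 55, 63, 70, 78, 84]
voted = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1]


def sigmoid(z):
    """Turn log-odds into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, iterations=25):
    """Fit log-odds = b0 + b1 * x using Newton's method."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0            # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0    # entries of the (negated) Hessian
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system by hand.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


b0, b1 = fit_logistic(ages, voted)


def odds_at(age):
    """Odds of voting (voted : not voted) at a given age under the fit."""
    return math.exp(b0 + b1 * age)


for age in (20, 40, 60, 80):
    print(f"age {age}: odds of voting = {odds_at(age):.2f} to 1")
```

With real data you would more likely reach for a statistics library, but the idea is the same: the zeros and ones go in, and a smooth curve of odds by age comes out.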

Voting is coded as voted (1) and not voted (0). The chart (Figure 1) was created from voting data collected by individual states for the last general election (November 2014). It models the likelihood of a person voting by age (blue line). There is not much argument that the odds of a person voting increase with age. What logistic regression helps to do is quantify that likelihood, or odds.

Remember, the odds of a coin flip landing heads are 1 to 1, or even. As you can see, 18 to 39 year olds were more likely not to vote than to vote in the last general election (November 2014). Their odds are under the 1 to 1 line, the even money line. After the age of 40, the odds improve with age.

The odds of a person in their 80s voting are 7 to 1. That means they are 7 times more likely to vote than not to vote. The odds of a person in their 20s voting are 0.50 (or 1 to 2). That means people in their 20s were twice as likely not to vote as to vote in the 2014 election. We can quantify the difference between any two age groups. The odds of an 80 year old voting are 14 times (7 divided by 0.50) the odds of someone in their 20s.
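The arithmetic above is easy to check in a few lines of Python. This is only a sketch: the two odds values come from the text, while the helper names `odds_to_probability` and `probability_to_odds` are my own and purely illustrative.

```python
def odds_to_probability(odds):
    """Convert odds (for : against, as a single number) to a probability."""
    return odds / (1.0 + odds)


def probability_to_odds(p):
    """Convert a probability back to odds."""
    return p / (1.0 - p)


odds_80s = 7.0   # 7 to 1, as stated in the article
odds_20s = 0.5   # 1 to 2, as stated in the article

# How many times larger the odds are for the older group.
odds_ratio = odds_80s / odds_20s

prob_80s = odds_to_probability(odds_80s)
prob_20s = odds_to_probability(odds_20s)

print(f"odds ratio (80s vs 20s): {odds_ratio:.0f}")
print(f"probability of voting, 80s: {prob_80s:.3f}")
print(f"probability of voting, 20s: {prob_20s:.3f}")
```

Note that the factor of 14 compares odds, not probabilities. Converted to probabilities, 7 to 1 is 7 out of 8 and 1 to 2 is 1 out of 3, so the two numbers move in the same direction but are not on the same scale.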

**Odds Across Time**

Logistic regression allows comparison over time. In the 2008 general election (when President Obama was first elected), the odds of someone in their twenties voting were 1 to 1, or even money. They were as likely to vote as not to vote. But by the 2014 midterm election their odds had dropped to 1 to 2. They were twice as likely not to vote as to vote.

**Voting Models**

Voting models can be complex. In this article I only discussed the odds of voting or not voting. In future articles, I plan on presenting models that help predict, if a person votes, the odds that they will vote for a specific candidate. These models include a lot of variables beyond age. They include income, education, geography, and gender, to name a few. They can also include more discrete variables, such as whether they listen to NPR or Fox News.

**Summary**

In all honesty, the math for logistic regression can be messy and hard to understand. You don't really need to understand the underlying mathematics to understand logistic regression. You just need to understand the odds.

I also use the terms odds and likelihood interchangeably. The reason is that most people do not think in odds or probabilities; they think in likelihood. Likelihood is good enough because odds, probability, and likelihood all move in the same direction.

By the way, I have seen all the previous Mission Impossible movies, I am from the 70s, and I grew up listening to Led Zeppelin, so I am going to the new Mission Impossible. In fact, I am going to go see the movie right after I post this article.

I built the equation (Figure 2) for the 2014 general election. I plan on creating a video that explains how the actual math is calculated. I used a Python script to derive the equation from nearly 2 million active registered voters. Again, if you are not familiar with logs or regression, then the math is going to be hard.
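For readers who do want a peek at the math, a logistic regression with age as its only variable generally takes the form below. The symbols $\beta_0$ and $\beta_1$ are placeholders for the fitted intercept and slope; the actual values fitted to the 2014 data are not reproduced here.

```latex
% p is the probability that a person votes.
\ln\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 \cdot \text{age}

% Undoing the log gives the odds directly:
\text{odds} = \frac{p}{1-p} = e^{\beta_0 + \beta_1 \cdot \text{age}}

% And solving for p gives the smooth S-shaped curve:
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \cdot \text{age})}}
```

The odds are even (1 to 1) when the right-hand side of the first line equals zero, that is, at age $= -\beta_0 / \beta_1$.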

A good resource to understand logistic regression
