Understanding logistic regression

Bibin Sebastian
5 min read · Jun 7, 2021


Logistic regression explained with examples


Before we try to understand logistic regression, it is worth understanding linear regression first. I will explain these concepts using examples instead of textbook definitions, so that you really understand them well.

Linear regression

Imagine that you want to buy a new house in the city you live in. How do you know what price to quote for a new house? In general, if you have some sample data about house prices in your locality, you can do some simple math to come up with an estimate for the new house, right? Guess what: most likely you have used linear regression unknowingly. So let's go into some details.

Imagine that you have the below sample data on housing prices. You can see that the price of a house varies with its area.

To visualise how prices are dependent on area, let’s plot this on a graph.

The X-axis represents the area, and the Y-axis represents the price of the house. If you observe the data points closely, you can see that a straight line fits through them, right? Let's plot that straight line as below.

From basic algebra, we know that the line can be represented using the function y=mx+b. In this case, y represents the price of the house and x represents the area of the house. Now, if you know the values of m and b, you can use the equation to predict the price for any given input area.

It turns out that there is a way to find these values of m and b using the sample data we have. This process is called model training in data science (we won't go into the details of model training in this post), and the function y=mx+b is called our machine learning model for linear regression. Sometimes this function is also written as h = θ1+θ2*x, where θ1=b and θ2=m are called parameters. Since we have only one input variable x in this function, it is called a linear regression function with one variable (you will see a linear regression function with two variables when you read about logistic regression below).
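As a sketch of what model training does here, m and b can be estimated from the sample data with an ordinary least-squares fit. The areas and prices below are made up for illustration, not the article's actual data:

```python
# Least-squares fit of price = m * area + b on a made-up sample
# (areas in square feet, prices in thousands; illustrative values only).
areas = [650, 800, 950, 1100, 1250]
prices = [70, 85, 100, 115, 130]

n = len(areas)
mean_x = sum(areas) / n
mean_y = sum(prices) / n

# Closed-form least-squares solution for slope and intercept.
m = sum((x - mean_x) * (y - mean_y) for x, y in zip(areas, prices)) \
    / sum((x - mean_x) ** 2 for x in areas)
b = mean_y - m * mean_x

def predict_price(area):
    """Predict a price using the trained model y = m*x + b."""
    return m * area + b

predict_price(1000)  # → 105.0 for this made-up sample
```

This is the same idea as model training: the parameters come out of the sample data, and the resulting function is then used for prediction.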

Now, if a property agent asks how you arrived at the price of the new house, you can proudly say that you used a machine learning model :).

In some cases we won’t be able to fit a straight line over our input data points. In such cases we will have to use a non-linear function for regression.

Logistic regression for classification

Imagine that you have the following graph with blue circles and green triangles.

Like we did for linear regression, can we come up with a function to classify these data points into two groups, blue circles and green triangles, so that we can use that function to classify any new input data point?

It turns out that if you first apply a linear regression function on these data points, followed by a logistic function, you can classify these data points as 0 or 1, with 0 representing blue circles and 1 representing green triangles.

First, let us define the linear regression function for these data points.

We know that a point in a two-dimensional space is denoted as (x1, x2), where x1 is the X coordinate value and x2 is the Y coordinate value. So each of these data points, blue circles and green triangles, has its own (x1, x2) values. Our linear regression function with two variables x1, x2 will be h=θ1+θ2*x1+θ3*x2, where θ1, θ2 and θ3 are the parameters.

Now let's apply the logistic function (also called the sigmoid function) on the value of h: g(h) = 1/(1 + e^(-h))
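The logistic function is one line of code. A minimal sketch in Python:

```python
import math

def g(h):
    """Logistic (sigmoid) function: squashes any real h into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-h))

g(0)    # → 0.5 exactly: the midpoint of the curve
g(10)   # close to 1 for large positive h
g(-10)  # close to 0 for large negative h
```

No matter how large or small h gets, the output stays strictly between 0 and 1, which is what makes it useful for classification.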

Logistic function g(h) outputs a value between 0 and 1 as shown below.

For classification, whenever the logistic function g(h) outputs a value ≥ 0.5, we consider the classification result to be 1, and 0 otherwise. It turns out that the values of the parameters θ1, θ2 and θ3 can be found in a similar way to how it was done for linear regression in the model training process (we won't go into the details of model training in this post), using the sample input points of blue circles and green triangles.
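To make the thresholding concrete, here is a minimal sketch with made-up parameter values for θ1, θ2 and θ3 (real values would come out of model training):

```python
import math

# Hypothetical trained parameters, chosen only for illustration.
theta1, theta2, theta3 = -4.0, 1.0, 1.0

def classify(x1, x2):
    h = theta1 + theta2 * x1 + theta3 * x2   # linear part
    g = 1.0 / (1.0 + math.exp(-h))           # logistic function
    return 1 if g >= 0.5 else 0              # threshold at 0.5

classify(3, 3)  # h = 2, g ≈ 0.88 → class 1 (green triangle)
classify(1, 1)  # h = -2, g ≈ 0.12 → class 0 (blue circle)
```

The only job of the logistic function here is to map h onto (0, 1) so the 0.5 threshold gives a clean yes/no answer.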

Another interesting fact to note is that our logistic function leads to the formation of a boundary line between our blue circles and green triangles. Let’s see how.

From the logistic function graph above, we can see that whenever the value of h (h=θ1+θ2*x1+θ3*x2) is ≥0, our classification outputs the value 1. We know from basic algebra that the equation θ1+θ2*x1+θ3*x2=0 represents a line, and the condition θ1+θ2*x1+θ3*x2>0 is true only for points (x1,x2) falling on one side of this line (which side depends on the signs of the parameters). Below is the plotted graph with a boundary line separating the data points. Now you should be able to clearly understand how the logistic function was able to do the classification.
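The boundary line itself can be recovered from the parameters: setting h = 0 and solving for x2 gives the line directly (assuming θ3 ≠ 0, and reusing the same hypothetical parameter values as before):

```python
# Hypothetical parameters, for illustration only.
theta1, theta2, theta3 = -4.0, 1.0, 1.0

# h = 0  means  theta1 + theta2*x1 + theta3*x2 = 0,
# so  x2 = -(theta1 + theta2*x1) / theta3  is the boundary line.
def boundary_x2(x1):
    return -(theta1 + theta2 * x1) / theta3

boundary_x2(0)  # → 4.0: the line crosses the x2-axis at 4
boundary_x2(4)  # → 0.0: and the x1-axis at 4
```

Plotting boundary_x2 over a range of x1 values is exactly how the separating line in the graph can be drawn.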

Note that we have only discussed logistic regression for a binary classification problem. The same concepts can be extended to multi-class classification using a "one vs rest" strategy.
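A rough sketch of the "one vs rest" idea: train one binary logistic model per class, score a new point with each, and pick the class with the highest score. The parameter values below are hypothetical, standing in for models that would each be trained separately:

```python
import math

def sigmoid(h):
    return 1.0 / (1.0 + math.exp(-h))

# One hypothetical (theta1, theta2, theta3) per class, as if each
# binary "this class vs. the rest" model had already been trained.
class_params = {
    "triangle": (-4.0,  1.0,  1.0),
    "circle":   ( 4.0, -1.0, -1.0),
}

def classify_multi(x1, x2):
    # Score the point with every class's own logistic model,
    # then pick the class with the highest probability.
    scores = {label: sigmoid(t1 + t2 * x1 + t3 * x2)
              for label, (t1, t2, t3) in class_params.items()}
    return max(scores, key=scores.get)
```

With more than two classes the dictionary simply grows by one entry per class; the argmax step stays the same.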

The boundary between classes won't always be a straight line. In such cases, we need a non-linear function to model the decision boundary.

Conclusion

I hope the details provided in this post helped you understand the concepts of linear regression and logistic regression clearly. If you want to see the code I used to produce the graphs in this post, see the Google Colab notebook provided here: https://github.com/bibinss/logistic_regression/blob/main/logistic_regression_on_google_colab.ipynb
