DataTechNotes: Binary Classification with Logistic Regression in Python

Logistic regression is a fundamental machine learning algorithm used for binary classification tasks. In this tutorial, we'll delve into performing binary classification using logistic regression with the Scikit-Learn LogisticRegression class. We'll cover the following topics:

Introduction to logistic regression
Preparing data
Training the model
Prediction and accuracy check
Conclusion
Source code listing

Let's get started.

Introduction to logistic regression

Logistic regression is a statistical method that models the probability that a given input belongs to a certain class, typically denoted as 1 or 0. Despite its name, it is used as a classification algorithm: the "regression" refers to the linear model fitted underneath, while the output is a class probability rather than a continuous target value.

In logistic regression, the input features are combined linearly using weights, and then the logistic function (also known as the sigmoid function) is applied to the result. The logistic function transforms the linear combination of inputs into a probability score between 0 and 1. This probability score represents the likelihood that the input belongs to the positive class.

Mathematically, the logistic regression model can be expressed as:

$P (y = 1 ∣ x) = \frac{1}{1 + e^{- (β_{0} + β_{1} x)}}$

In this formula:

$P (y = 1 ∣ x)$ is the probability of the dependent variable (y) being 1 given the value of the independent variable (x).
$e$ is the base of the natural logarithm.
$β_{0}$ and $β_{1}$ are coefficients (weights) that the model learns from the data.

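This formula gives the probability that the outcome $y$ is 1 (or true) for a single independent variable $x$. With several input features, the exponent becomes $β_{0} + β_{1} x_{1} + \dots + β_{n} x_{n}$, with one coefficient per feature.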
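To make the formula concrete, here is a minimal sketch in plain Python. The coefficient values `beta0` and `beta1` are hypothetical and chosen only for illustration (a trained model would learn them from data), and the 0.5 cut-off for turning a probability into a class label is assumed as the usual convention.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(x, beta0, beta1):
    """P(y = 1 | x) for a single feature x, following the formula above."""
    return sigmoid(beta0 + beta1 * x)

# Hypothetical coefficients, chosen only for illustration
beta0, beta1 = -1.0, 2.0

for x in (-2.0, 0.0, 3.0):
    p = predict_probability(x, beta0, beta1)
    label = 1 if p >= 0.5 else 0  # assumed 0.5 decision threshold
    print(f"x={x:5.1f}  P(y=1|x)={p:.4f}  predicted class={label}")
```

Larger values of the linear term push the probability toward 1, smaller values push it toward 0, and a linear term of exactly 0 gives a probability of 0.5.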

Preparing data

We'll start by loading the necessary libraries for this tutorial. Make sure you have the scikit-learn package installed (it is imported as sklearn).

 
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler

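Next, we load the Breast Cancer dataset available in Scikit-Learn and split it into training and testing sets using the train_test_split function, holding out 20% of the samples for testing. We then apply the StandardScaler to standardize the features. The scaler is fitted on the training set only and then applied to the test set, so no information from the test data leaks into preprocessing.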

 
# Load the Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the input data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
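As a quick sanity check on the preparation step, the sketch below repeats the split and scaling and then inspects the result. It is only an illustration: the variable names `train_means` and `train_stds` are introduced here and are not part of the tutorial code.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Same loading, splitting and scaling as in the tutorial
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Train shape:", X_train_scaled.shape)
print("Test shape:", X_test_scaled.shape)

# After fitting on the training set, each training feature should have
# mean close to 0 and standard deviation close to 1
train_means = X_train_scaled.mean(axis=0)
train_stds = X_train_scaled.std(axis=0)
print("Largest absolute training mean:", np.abs(train_means).max())
print("Training std range:", train_stds.min(), train_stds.max())
```

The test features will generally not have exactly zero mean and unit variance, because they are transformed with statistics learned from the training set.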
 

Training the model

We create an instance of the logistic regression model using the LogisticRegression() constructor. Here, we set the max_iter parameter to 200 (the default is 100), which is the maximum number of iterations the solver may take to converge.
After initializing the model, we train it by calling the fit() method on the model object, passing the scaled training features X_train_scaled and the corresponding labels y_train.

 
# Initialize the logistic regression model with increased max_iter
model = LogisticRegression(max_iter=200)

# Train the model
model.fit(X_train_scaled, y_train)
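To connect the fitted model back to the formula from the introduction, the following sketch reads the learned coefficients and recomputes the positive-class probabilities by hand. It assumes the default LogisticRegression settings used in this tutorial on a two-class problem; the names `z`, `manual_proba` and `sklearn_proba` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Same data preparation and training as in the tutorial
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression(max_iter=200)
model.fit(X_train_scaled, y_train)

# One weight per feature, plus a single intercept
print("Coefficient array shape:", model.coef_.shape)
print("Intercept:", model.intercept_)
print("Iterations used by the solver:", model.n_iter_)

# Linear combination of the inputs, then the logistic (sigmoid) function
z = X_test_scaled @ model.coef_.ravel() + model.intercept_[0]
manual_proba = 1.0 / (1.0 + np.exp(-z))

# Probability of class 1 as reported by scikit-learn
sklearn_proba = model.predict_proba(X_test_scaled)[:, 1]
print("Manual and predict_proba agree:", np.allclose(manual_proba, sklearn_proba))
```

Checking n_iter_ against max_iter is a simple way to see whether the solver stopped because it converged or because it ran out of iterations.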

Prediction and accuracy check

We use the trained logistic regression model to make predictions on the scaled test data X_test_scaled. The predict() method is applied to the model object with the test features as input, resulting in predicted class labels y_pred.
We compute the accuracy of the model predictions by comparing the predicted class labels with the actual class labels from the test set. The accuracy_score() function from scikit-learn calculates the accuracy as the fraction of correctly predicted labels over the total number of samples.
The classification report lists precision, recall, F1-score, and support for each class, showing how well the model classifies instances of each class.

# Predict on the test data
y_pred = model.predict(X_test_scaled)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Print classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))
 

The result looks as follows:

 
Accuracy: 0.9736842105263158
Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.95      0.96        43
           1       0.97      0.99      0.98        71

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114
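The per-class numbers in the report can be traced back to a confusion matrix. The sketch below is an illustrative addition rather than part of the original listing: it recomputes the accuracy by hand and prints the confusion matrix using confusion_matrix from sklearn.metrics. In this dataset, label 0 stands for malignant and label 1 for benign. Exact counts may vary slightly between scikit-learn versions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Same pipeline as in the tutorial
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression(max_iter=200)
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)

# Accuracy by hand: fraction of test samples whose prediction matches the label
manual_accuracy = (y_pred == y_test).mean()
print("Manual accuracy:", manual_accuracy)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print("Class names:", list(data.target_names))
print("Confusion matrix:")
print(cm)

# The diagonal holds the correctly classified samples of each class
print("Accuracy from the matrix diagonal:", np.trace(cm) / cm.sum())
```

Recall for a class is its diagonal entry divided by its row sum, and precision is the diagonal entry divided by its column sum, which is how the values in the report are obtained.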

Conclusion

In this tutorial, we learned how to perform binary classification using logistic regression on a two-class dataset. We split the dataset into training and testing sets, scaled the feature data, trained a logistic regression model, and evaluated its performance on the test set. Logistic regression is a simple yet powerful algorithm for binary classification tasks, and it can be easily implemented using Scikit-Learn. The full source code is listed below.

Source code listing

 
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler

# Load the Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the input data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialize the logistic regression model with increased max_iter
model = LogisticRegression(max_iter=200)

# Train the model
model.fit(X_train_scaled, y_train)

# Predict on the test data
y_pred = model.predict(X_test_scaled)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Print classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))
