Logistic regression is a fundamental machine learning algorithm used for binary classification tasks. In this tutorial, we'll explore how to classify binary data with logistic regression using the PyTorch deep learning framework. We'll cover the following topics:
- Introduction to logistic regression
- Preparing data
- Building the classifier model
- Training the model
- Prediction and accuracy check
- Conclusion
- Source code listing
Let's get started.
Please note that this tutorial provides a basic understanding of implementing logistic regression for data classification using PyTorch. It's important to remember that parameters and model definitions may require adjustments when dealing with larger datasets.
Introduction to logistic regression
Logistic regression is a linear classification algorithm that predicts the probability that an instance belongs to a particular class. It's commonly used for binary classification tasks where the target variable has two possible outcomes, such as spam detection, disease diagnosis, and sentiment analysis.
The logistic regression model calculates the probability that an input sample belongs to the positive class using the logistic function (also known as the sigmoid function). Mathematically, the logistic regression model can be represented as:
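$$P(y = 1 \mid x; w, b) = \sigma(w^T x + b) = \frac{1}{1 + e^{-(w^T x + b)}}$$

Where:
- $P(y = 1 \mid x; w, b)$ is the probability that $y$ equals 1 given input $x$ and model parameters $w$ and $b$.
- $x$ is the input features.
- $w$ is the weight vector.
- $b$ is the bias term.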
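To make the formula concrete, here is a minimal sketch that evaluates the sigmoid of a linear score for a single sample. The helper name predict_proba and the numeric values of x, w, and b are made up for illustration and are not part of the tutorial's model.

```python
import torch

def predict_proba(x, w, b):
    """Return P(y = 1 | x; w, b) = sigmoid(w^T x + b)."""
    return torch.sigmoid(x @ w + b)

# Illustrative values only (assumptions, not learned parameters)
x = torch.tensor([0.5, -1.2, 3.0])   # input features
w = torch.tensor([0.8, 0.1, -0.4])   # weight vector
b = torch.tensor(0.2)                # bias term

p = predict_proba(x, w, b)
print(f"P(y=1 | x) = {p.item():.4f}")
```

The linear score here is negative, so the predicted probability falls below 0.5 and the sample would be assigned to class 0 under the usual 0.5 threshold.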
Preparing data
We'll begin by loading the necessary libraries for this tutorial.
Before building the model, it's essential to preprocess the data. This may include tasks such as data cleaning, feature scaling, and splitting the dataset into training and test sets.
We first load the breast cancer dataset using scikit-learn's load_breast_cancer function. Then, we separate the features X and target labels y. Next, we standardize the features using StandardScaler to ensure that each feature has a mean of 0 and a standard deviation of 1. After standardization, we convert the data into PyTorch tensors X_tensor for features and y_tensor for labels using the torch.tensor function.
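The steps above might look like the following sketch. The original code is not shown here, so the way the split is done is an assumption: it uses train_test_split on sample indices with test_size=0.2 (which matches the 114 test samples reported later) and an arbitrary random_state of 42.

```python
import numpy as np
import torch
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the dataset and separate features and labels
data = load_breast_cancer()
X, y = data.data, data.target

# Standardize each feature to mean 0 and standard deviation 1
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Convert to PyTorch tensors (float features, integer class labels)
X_tensor = torch.tensor(X_scaled, dtype=torch.float32)
y_tensor = torch.tensor(y, dtype=torch.long)

# Assumed split: 80% train / 20% test, done on indices
train_idx, test_idx = train_test_split(
    np.arange(len(y)), test_size=0.2, random_state=42
)
train_idx = torch.as_tensor(train_idx, dtype=torch.long)
test_idx = torch.as_tensor(test_idx, dtype=torch.long)

X_train, X_test = X_tensor[train_idx], X_tensor[test_idx]
y_train, y_test = y_tensor[train_idx], y_tensor[test_idx]

print(X_train.shape, X_test.shape)
```

Note that fitting the scaler on the full dataset follows the order described in the text. Fitting it on the training portion only is a common alternative that avoids leaking test statistics into training.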
Building the classifier model
We define a new class named LogisticRegression, which inherits from the nn.Module class. In its constructor, we create an instance of the nn.Linear module, which represents a linear transformation of the input data. We pass nn.Linear two arguments: input_size, which is the number of features in the input data, and num_classes, which is the number of output classes for classification.
The forward method specifies how input data flows through the model. In this case, we apply the linear transformation defined by self.linear to the input x. The output out represents the logits for each class, which are then used to compute probabilities during inference.
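A minimal sketch of such a class is shown below. The class and attribute names follow the description above; the batch of random inputs and the softmax call at the end are only there to illustrate how logits turn into probabilities and are not part of the model itself.

```python
import torch
import torch.nn as nn

class LogisticRegression(nn.Module):
    def __init__(self, input_size, num_classes):
        super().__init__()
        # Single linear layer: one weight vector and bias per class
        self.linear = nn.Linear(input_size, num_classes)

    def forward(self, x):
        # Return raw logits; no sigmoid/softmax is applied here
        out = self.linear(x)
        return out

# Illustrative usage with random inputs (30 features, 2 classes)
model = LogisticRegression(input_size=30, num_classes=2)
logits = model(torch.randn(4, 30))
probs = torch.softmax(logits, dim=1)  # class probabilities per row
print(logits.shape, probs.sum(dim=1))
```

Returning logits rather than probabilities is a deliberate choice when the loss function applies the softmax internally, as nn.CrossEntropyLoss does.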
Training the model
During training, we optimize the model parameters (weights and bias) with gradient descent by minimizing the cross-entropy loss. With two output classes, this is equivalent to the binary cross-entropy loss of classic logistic regression.
We determine the input size based on the number of features in our dataset (X_train.shape[1]) and find the number of classes by computing the number of unique labels in the training set (len(torch.unique(y_train))). Then, we initialize our logistic regression model LogisticRegression with the determined input size and number of classes. For the loss function, we use nn.CrossEntropyLoss(), which is commonly used for multi-class classification and also covers the two-class case; it expects raw logits and applies the softmax internally. For optimization, we choose stochastic gradient descent (SGD) with a learning rate (lr) of 0.01.
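Since the text speaks of binary cross-entropy while the criterion is nn.CrossEntropyLoss, the following sketch checks numerically that the two agree in the two-class case. The logits and targets are random stand-in values, and the comparison with nn.BCEWithLogitsLoss is added here for illustration only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in two-class logits and integer labels (illustrative values)
logits = torch.randn(5, 2)
targets = torch.tensor([0, 1, 1, 0, 1])

# Softmax cross-entropy over the two logits
ce = nn.CrossEntropyLoss()(logits, targets)

# Binary cross-entropy on the logit difference z1 - z0,
# because softmax(z)[1] == sigmoid(z1 - z0)
diff = logits[:, 1] - logits[:, 0]
bce = nn.BCEWithLogitsLoss()(diff, targets.float())

print(ce.item(), bce.item())
```

Both losses use mean reduction by default, so the two printed values should match up to floating-point error.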
To train the model, we set the number of epochs to 200. Within the training loop for each epoch, we reset the gradients of the model parameters to prevent accumulation from prior iterations (optimizer.zero_grad()), then pass the training features X_train through the model to get the predicted outputs. Using the predicted outputs and the original labels y_train, we calculate the loss using the specified loss function (criterion). Next, we perform backpropagation to compute the gradients of the loss with respect to the model parameters loss.backward(). Finally, we update the model parameters using the optimizer optimizer.step(), taking a step towards minimizing the loss.
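Put together, the setup and training loop could be sketched as follows. The data preparation is repeated in compact form so the block runs on its own; the index-based split, the seeds, the losses list, and the logging interval are assumptions added for illustration.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

torch.manual_seed(0)  # assumed seed, for reproducible initialization

# Compact data preparation (assumed 80/20 split on indices)
X, y = load_breast_cancer(return_X_y=True)
X_tensor = torch.tensor(StandardScaler().fit_transform(X), dtype=torch.float32)
y_tensor = torch.tensor(y, dtype=torch.long)
train_idx, _ = train_test_split(np.arange(len(y)), test_size=0.2, random_state=42)
train_idx = torch.as_tensor(train_idx, dtype=torch.long)
X_train, y_train = X_tensor[train_idx], y_tensor[train_idx]

class LogisticRegression(nn.Module):
    def __init__(self, input_size, num_classes):
        super().__init__()
        self.linear = nn.Linear(input_size, num_classes)

    def forward(self, x):
        return self.linear(x)

# Model, loss, and optimizer as described in the text
input_size = X_train.shape[1]
num_classes = len(torch.unique(y_train))
model = LogisticRegression(input_size, num_classes)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

num_epochs = 200
losses = []  # kept only to inspect convergence
for epoch in range(num_epochs):
    optimizer.zero_grad()                # reset accumulated gradients
    outputs = model(X_train)             # forward pass on the full training set
    loss = criterion(outputs, y_train)   # cross-entropy on logits
    loss.backward()                      # backpropagation
    optimizer.step()                     # parameter update
    losses.append(loss.item())
    if (epoch + 1) % 50 == 0:
        print(f"Epoch [{epoch + 1}/{num_epochs}], Loss: {loss.item():.4f}")
```

Because the whole training set is passed in one batch, each step here is a full-batch gradient descent update, even though the optimizer is named SGD.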
Prediction and accuracy check
In this part, we first make predictions on the test data using the trained logistic regression model. Inside the with torch.no_grad() block, we ensure that no gradients are calculated during inference to save memory and computation. We obtain the predicted class labels y_pred by taking the index of the maximum value along the second dimension of the output tensor outputs, which corresponds to the predicted class.
Next, we convert the predicted labels and the true labels from PyTorch tensors to numpy arrays using the numpy() method. This conversion is necessary to use scikit-learn's accuracy_score function and classification_report.
Then, we calculate the accuracy of the predictions by comparing the predicted labels y_pred_np with the true labels y_test_np using the accuracy_score function from scikit-learn.
Finally, we print the accuracy score and the classification report, which provides a summary of various evaluation metrics such as precision, recall, and F1-score for each class in the classification task.
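The evaluation steps can be sketched as below. To keep the block short and runnable on its own, it uses a tiny stand-in model with hand-set weights and six made-up test samples instead of the trained model and the real X_test; only the evaluation calls mirror the description above.

```python
import torch
import torch.nn as nn
from sklearn.metrics import accuracy_score, classification_report

# Stand-in "trained" model: predicts class 1 when the first feature is positive
model = nn.Linear(2, 2)
with torch.no_grad():
    model.weight.copy_(torch.tensor([[-1.0, 0.0], [1.0, 0.0]]))
    model.bias.zero_()

# Made-up test data (illustrative values only)
X_test = torch.tensor([[-2.0, 0.3], [1.5, -0.7], [0.8, 1.1],
                       [-0.4, 0.2], [0.6, -1.3], [2.1, 0.5]])
y_test = torch.tensor([0, 1, 1, 0, 0, 1])

# Inference without gradient tracking
with torch.no_grad():
    outputs = model(X_test)
    # Index of the largest logit along dim 1 is the predicted class
    _, y_pred = torch.max(outputs, 1)

# Convert tensors to NumPy arrays for scikit-learn
y_pred_np = y_pred.numpy()
y_test_np = y_test.numpy()

accuracy = accuracy_score(y_test_np, y_pred_np)
print("Accuracy:", accuracy)
print("Classification Report:")
print(classification_report(y_test_np, y_pred_np))
```

With the real model, the same calls would be applied to the outputs of model(X_test) from the training step.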
The result is as follows.
Classification Report:
              precision    recall  f1-score   support

           0       0.95      0.98      0.97        43
           1       0.99      0.97      0.98        71

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114
Source code listing