DataTechNotes: Pyspark Linear SVC Classification Example

The PySpark MLlib API provides the LinearSVC class for classifying data with a linear support vector machine (SVM). An SVM builds one or more hyperplanes in a high-dimensional space to separate data into groups, and the method is widely used for classification, regression, and anomaly detection in machine learning. A linear SVM separates two classes with a linear boundary: a straight line in two dimensions and a hyperplane in higher dimensions.
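
To make the idea of a separating hyperplane concrete, here is a minimal plain-Python sketch of how a linear SVM assigns a class once it has been trained. The weights `w`, the bias `b`, and the sample points are made-up values for illustration, not output from Spark.

```python
# Minimal sketch of a linear SVM decision rule (illustrative values only).
# A trained linear SVM is described by a weight vector w and a bias b;
# the hyperplane is the set of points x where w . x + b = 0.

def margin(w, b, x):
    """Signed score of x relative to the hyperplane w . x + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    """Return class 1 if x lies on the positive side, otherwise class 0."""
    return 1 if margin(w, b, x) > 0 else 0

# Hypothetical 2-D model: the boundary is the line x1 + x2 = 3.
w = [1.0, 1.0]
b = -3.0

points = [[0.5, 1.0], [2.5, 2.0], [3.0, 0.5]]
labels = [classify(w, b, p) for p in points]
print(labels)
```

Training is the part that finds `w` and `b`; LinearSVC does that for us in the steps that follow.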

In this tutorial, you'll briefly learn how to train a PySpark LinearSVC model on binary classification data and use it to predict classes. The tutorial covers:

Preparing the data
Prediction and accuracy check
Source code listing

We'll start by loading the required libraries for this tutorial.

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.ml.classification import LinearSVC
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import VectorAssembler
from sklearn.metrics import confusion_matrix
from sklearn.datasets import load_breast_cancer
import pandas as pd

Preparing the data

We use the Breast Cancer dataset to perform binary classification; it can be loaded easily from the Scikit-learn datasets module. The code below loads the dataset and converts it into a pandas DataFrame.

bc = load_breast_cancer()

df_bc = pd.DataFrame(bc.data, columns=bc.feature_names)
df_bc['label'] = pd.Series(bc.target)

print(df_bc.head())

   mean radius  mean texture  ...  worst fractal dimension  label
0        17.99         10.38  ...                  0.11890      0
1        20.57         17.77  ...                  0.08902      0
2        19.69         21.25  ...                  0.08758      0
3        11.42         20.38  ...                  0.17300      0
4        20.29         14.34  ...                  0.07678      0

[5 rows x 31 columns]

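Next, we'll define an SQLContext and create a Spark data frame from the df_bc data. You can then check the data frame schema.

# Reuse the active SparkContext if one exists, otherwise create one
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

data = sqlContext.createDataFrame(df_bc)
# printSchema() prints the schema itself and returns None
data.printSchema()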

root
 |-- mean radius: double (nullable = true)
 |-- mean texture: double (nullable = true)
 |-- mean perimeter: double (nullable = true)
 |-- mean area: double (nullable = true)
 |-- mean smoothness: double (nullable = true)
 |-- mean compactness: double (nullable = true)
 |-- mean concavity: double (nullable = true)
 |-- mean concave points: double (nullable = true)
 |-- mean symmetry: double (nullable = true)
 |-- mean fractal dimension: double (nullable = true)
 |-- radius error: double (nullable = true)
 |-- texture error: double (nullable = true)
 |-- perimeter error: double (nullable = true)
 |-- area error: double (nullable = true)
 |-- smoothness error: double (nullable = true)
 |-- compactness error: double (nullable = true)
 |-- concavity error: double (nullable = true)
 |-- concave points error: double (nullable = true)
 |-- symmetry error: double (nullable = true)
 |-- fractal dimension error: double (nullable = true)
 |-- worst radius: double (nullable = true)
 |-- worst texture: double (nullable = true)
 |-- worst perimeter: double (nullable = true)
 |-- worst area: double (nullable = true)
 |-- worst smoothness: double (nullable = true)
 |-- worst compactness: double (nullable = true)
 |-- worst concavity: double (nullable = true)
 |-- worst concave points: double (nullable = true)
 |-- worst symmetry: double (nullable = true)
 |-- worst fractal dimension: double (nullable = true)
 |-- label: long (nullable = true)

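To combine all feature columns into a single vector column and keep the 'label' column separate, we use VectorAssembler.

# Convert the NumPy array of column names to a plain Python list
features = list(bc.feature_names)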

va = VectorAssembler(inputCols = features, outputCol='features')

va_df = va.transform(data)
va_df = va_df.select(['features', 'label'])
va_df.show(3)

+--------------------+-----+
|            features|label|
+--------------------+-----+
|[17.99,10.38,122....|    0|
|[20.57,17.77,132....|    0|
|[19.69,21.25,130....|    0|
+--------------------+-----+
only showing top 3 rows
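
As a rough picture of what the assembler does to each row, the plain-Python sketch below concatenates selected columns, in order, into one feature list and carries the label along. The helper `assemble` is hypothetical and is not part of Spark; the sample values are taken from the first row printed earlier, limited to two of the columns visible there.

```python
# Plain-Python sketch of assembling one row (not Spark code).

def assemble(row, input_cols, label_col="label"):
    """Concatenate the chosen columns, in order, into one feature list."""
    features = [float(row[c]) for c in input_cols]
    return {"features": features, "label": row[label_col]}

# Two feature columns and the label from the first row of df_bc.head().
row = {"mean radius": 17.99, "mean texture": 10.38, "label": 0}
assembled = assemble(row, ["mean radius", "mean texture"])
print(assembled)
```

The order of the input columns matters, because the model later refers to features only by their position in the vector.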

Next, we'll split the data into train and test parts.

(train, test) = va_df.randomSplit([0.9, 0.1])
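
Note that the weights are not exact proportions. To our understanding, randomSplit assigns each row to a part independently at random, so the resulting sizes only approximate 90/10 and change between runs unless a seed is passed, as in randomSplit([0.9, 0.1], seed=42). The plain-Python sketch below imitates that behaviour; the function `random_split` is a hypothetical stand-in, not Spark's implementation.

```python
import random

def random_split(rows, weights, seed=None):
    """Assign each row to a bucket with probability proportional to weights."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds, e.g. [0.9, 1.0] for weights [0.9, 0.1].
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    buckets = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, upper in enumerate(bounds):
            # The last bucket also catches any floating-point rounding gap.
            if u < upper or i == len(bounds) - 1:
                buckets[i].append(row)
                break
    return buckets

rows = list(range(569))  # the Breast Cancer dataset has 569 samples
train_a, test_a = random_split(rows, [0.9, 0.1], seed=42)
train_b, test_b = random_split(rows, [0.9, 0.1], seed=42)

print(len(train_a), len(test_a))  # close to, but not exactly, 90% / 10%
print(train_a == train_b)         # the same seed reproduces the split
```

Because the split in this tutorial is unseeded, the accuracy and confusion matrix shown later will differ slightly from run to run.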

Prediction and accuracy check

We'll define the linear SVC model by using the LinearSVC class and fit it on the train data. Here, we set the maximum number of iterations (maxIter) to 50. To predict the test data, we can use the transform() method of the fitted model.

lsvc = LinearSVC(labelCol="label", maxIter=50)
lsvc = lsvc.fit(train)

pred = lsvc.transform(test)

pred.show(3)

+--------------------+-----+--------------------+----------+
|            features|label|       rawPrediction|prediction|
+--------------------+-----+--------------------+----------+
|[16.13,17.88,107....|    0|[3.42019452073193...|       0.0|
|[11.31,19.04,71.8...|    1|[-2.5294373438518...|       1.0|
|[12.86,18.0,83.19...|    1|[-2.2054146906822...|       1.0|
+--------------------+-----+--------------------+----------+

only showing top 3 rows
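
The rawPrediction column helps explain each decision. In the rows above, a positive first entry goes with prediction 0.0 and a negative one with 1.0, which matches our understanding that Spark stores the pair [-margin, margin] and predicts 1.0 when the margin exceeds the model threshold (0.0 by default). The sketch below reproduces that rule in plain Python; the function `predict_from_raw` is hypothetical, and the numbers are rounded from the truncated values shown above, so treat them as illustration only.

```python
def predict_from_raw(raw_prediction, threshold=0.0):
    """Mimic the decision rule: class 1.0 if the margin is above threshold."""
    margin = raw_prediction[1]
    return 1.0 if margin > threshold else 0.0

# First entries rounded from the output above; the second entry is assumed
# to be the negated first entry.
raw_rows = [[3.42, -3.42], [-2.53, 2.53], [-2.21, 2.21]]
predictions = [predict_from_raw(r) for r in raw_rows]
print(predictions)
```

The size of the margin also hints at how far a sample lies from the decision boundary, which a plain 0/1 prediction does not show.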

After predicting the test data, we'll check the prediction accuracy with MulticlassClassificationEvaluator. A confusion matrix can be created with the confusion_matrix function of the sklearn.metrics module.

evaluator=MulticlassClassificationEvaluator(metricName="accuracy")
acc = evaluator.evaluate(pred)

print("Prediction Accuracy: ", acc)

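# collect() returns Row objects, so pull out the plain values
rows = pred.select("label", "prediction").collect()
y_orig = [int(r["label"]) for r in rows]
y_pred = [int(r["prediction"]) for r in rows]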

cm = confusion_matrix(y_orig, y_pred)
print("Confusion Matrix:")
print(cm)

Prediction Accuracy:  0.9365079365079365
Confusion Matrix:
[[24  2]
 [ 2 35]]
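
The accuracy can be checked by hand against the confusion matrix. In the sklearn layout, rows are true labels and columns are predicted labels, so the diagonal holds the correct predictions. The short sketch below recomputes accuracy from the matrix printed above and, as an addition not covered by the evaluator call in this tutorial, derives precision and recall for class 1.

```python
# Confusion matrix from the output above: rows = true, columns = predicted.
cm = [[24, 2],
      [2, 35]]

tn, fp = cm[0]  # true label 0: predicted 0, predicted 1
fn, tp = cm[1]  # true label 1: predicted 0, predicted 1
total = tn + fp + fn + tp

accuracy = (tn + tp) / total   # share of all predictions that are correct
precision = tp / (tp + fp)     # of samples predicted 1, share truly 1
recall = tp / (tp + fn)        # of samples truly 1, share predicted 1

print("Accuracy: ", accuracy)
print("Precision:", precision)
print("Recall:   ", recall)
```

Here 59 of the 63 test samples are classified correctly, which agrees with the accuracy reported by the evaluator.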

Finally, we'll stop the Spark context.

# Stop session 
sc.stop()

In this tutorial, we've briefly learned how to fit and classify data by using PySpark LinearSVC class. The full source code is listed below.

Source code listing

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.ml.classification import LinearSVC
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import VectorAssembler
from sklearn.metrics import confusion_matrix
from sklearn.datasets import load_breast_cancer
import pandas as pd

bc = load_breast_cancer()

df_bc = pd.DataFrame(bc.data, columns=bc.feature_names)
df_bc['label'] = pd.Series(bc.target)
print(df_bc.head())

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

data = sqlContext.createDataFrame(df_bc)
data.printSchema()

features = list(bc.feature_names)

va = VectorAssembler(inputCols = features, outputCol='features')

va_df = va.transform(data)
va_df = va_df.select(['features', 'label'])
va_df.show(3)

(train, test) = va_df.randomSplit([0.9, 0.1])

lsvc = LinearSVC(labelCol="label", maxIter=50)
lsvc = lsvc.fit(train)

pred = lsvc.transform(test)
pred.show(3)

evaluator=MulticlassClassificationEvaluator(metricName="accuracy")
acc = evaluator.evaluate(pred)
print("Prediction Accuracy: ", acc)


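# collect() returns Row objects, so pull out the plain values
rows = pred.select("label", "prediction").collect()
y_orig = [int(r["label"]) for r in rows]
y_pred = [int(r["prediction"]) for r in rows]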

cm = confusion_matrix(y_orig, y_pred)
print("Confusion Matrix:")
print(cm)

sc.stop()

