Clustering
We are going to look at a dataset of car acceptability. This dataset consists of 6 attributes that describe a car, and a class describing whether or not the car is acceptable. We will try to cluster this dataset into 4 clusters without looking at the class (acc). We will then compare the clustering to the pre-existing class values and see if they match up.
import pandas as pd
import numpy as np
data = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data", names=("price","maint","doors","ppl","lug","safe","acc"))
data.head(5)
First we will create another dataframe that contains everything from the dataset except the class column. We do this so we can encode the dataset with dummy variables. Categorical data is not directly usable by most machine learning algorithms, so we must convert it to numerical data.
data_noclass = data[["price","maint","doors","ppl","lug","safe"]] #get every attribute except for class (acc)
data_noclass.head(5)
We will use one-hot encoding to turn the different values for each attribute into dummy variables.
onehot_data = pd.get_dummies(data_noclass)
onehot_data.head(5)
Let's load scikit-learn's KMeans clustering and see how this dataset is clustered.
# Import the kmeans clustering model.
from sklearn.cluster import KMeans
# Initialize the model with 2 parameters -- number of clusters and random state.
kmeans_model = KMeans(n_clusters=4, random_state=1)
# Fit the model using the good columns.
kmeans_model.fit(onehot_data)
# Get the cluster assignments.
labels = kmeans_model.labels_
labels
We have our labels; now we want to compare them to the existing classes to see how close our clusters are.
data['cluster'] = labels
data.head(50)
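Eyeballing head(50) only goes so far. A cross-tabulation of cluster against class makes the overlap (or lack of it) easier to see; this is a minimal sketch, assuming the cluster column we just added above.
#compare cluster assignments with the existing classes in one table
pd.crosstab(data['cluster'], data['acc'])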
Our clustering is not very accurate; it is grouping the cars differently than the classes suggest. Let's look at the distribution of classes to see if we can figure out why this is happening.
classes = data['acc']
classes.value_counts()
The classes in the dataset aren't evenly represented. Let's look at the cars that are classed as 'vgood' and 'good' and see what clusters they're put into.
print("vgood clusters")
print(data.loc[data['acc'] == 'vgood']['cluster'].value_counts()) #get the frequency of each value in the 'cluster' attribute
print("\ngood clusters")
print(data.loc[data['acc'] == 'good']['cluster'].value_counts()) #get the frequency of each value in the 'cluster' attribute
All of the cars under the 'vgood' and 'good' classes are being put into clusters 1 and 2. This is somewhat encouraging: at least these two groups are confined to only two clusters, so maybe our clustering algorithm is doing something right. Let's look at the other two, larger groups to see what their clustering looks like.
print("acc clusters")
print(data.loc[data['acc'] == 'acc']['cluster'].value_counts()) #get the frequency of each value in the 'cluster' attribute
print("\nunacc clusters")
print(data.loc[data['acc'] == 'unacc']['cluster'].value_counts()) #get the frequency of each value in the 'cluster' attribute
cluster_1 = data.loc[data['cluster'] == 1] #get only the datapoints with cluster value of 1
cluster_1
My first observation is that every element of the ppl attribute in this cluster is "more", so let's look at the value counts for the ppl attribute.
print("Number of passengers, Cluster 1: ")
print(cluster_1['ppl'].value_counts()) #get the frequency of each value in the 'ppl' attribute
print("\nNumber of passengers, all Data: ")
print(data['ppl'].value_counts()) #get the frequency of each value in the 'ppl' attribute
The number of passengers is "more" for all of cluster 1, so this could be a big factor for the clustering. Let's look at some of the other clusters to see if they show a similar trend.
cluster_2 = data.loc[data['cluster'] == 2]
cluster_3 = data.loc[data['cluster'] == 3]
cluster_0 = data.loc[data['cluster'] == 0]
print("Number of passengers, Cluster 2: ")
print(cluster_2['ppl'].value_counts()) #get the frequency of each value in the 'ppl' attribute
print("\nNumber of passengers, Cluster 3: ")
print(cluster_3['ppl'].value_counts()) #get the frequency of each value in the 'ppl' attribute
print("\nNumber of passengers, Cluster 0: ")
print(cluster_0['ppl'].value_counts()) #get the frequency of each value in the 'ppl' attribute
It seems like our hypothesis was correct: the clustering algorithm grouped the cars primarily by number of passengers. The cars with '2' passengers likely end up split across clusters 3 and 0 only because we asked for 4 clusters while there are just 3 values for number of passengers. This could be because the distribution of the number of passengers is so even across the values. Let's look at the distribution of values in the other attributes to see if they are also evenly distributed.
print("\nBuying Price: ")
print(data['price'].value_counts()) #get the frequency of values
print("\nMaintanance Price: ")
print(data['maint'].value_counts()) #get the frequency of values
print("\nNumber of doors: ")
print(data['doors'].value_counts()) #get the frequency of values
print("\nLuggage room: ")
print(data['lug'].value_counts()) #get the frequency of values
print("\nSafety: ")
print(data['safe'].value_counts()) #get the frequency of values
All of the attributes are evenly represented, so it's interesting that the clustering algorithm decided to cluster primarily on the number of passengers. Let's try running the clustering algorithm again with a different random seed and see if it does the same thing.
# Import the kmeans clustering model.
from sklearn.cluster import KMeans
# Initialize the model with 2 parameters -- number of clusters and random state.
kmeans_model = KMeans(n_clusters=4, random_state=11)
# Fit the model using the good columns.
kmeans_model.fit(onehot_data)
# Get the cluster assignments.
labels = kmeans_model.labels_
#reset the cluster column of the data
data['cluster'] = labels
data.head(30)
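Before eyeballing rows, we can look at the fitted centroids to see which one-hot columns separate the clusters most strongly. A minimal sketch, assuming the kmeans_model and onehot_data variables from the cells above:
#put the centroids in a dataframe so each column lines up with its one-hot attribute
centers = pd.DataFrame(kmeans_model.cluster_centers_, columns=onehot_data.columns)
#columns with a large spread between clusters are the ones driving the split
print((centers.max() - centers.min()).sort_values(ascending=False).head(6))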
From looking at this run of the clustering algorithm, it seems that luggage space is now the deciding factor for how the set is clustered. Let's look at the luggage size value counts for each cluster.
cluster_0 = data.loc[data['cluster'] == 0]
cluster_1 = data.loc[data['cluster'] == 1]
cluster_2 = data.loc[data['cluster'] == 2]
cluster_3 = data.loc[data['cluster'] == 3]
print("\nLuggage Space, Cluster 0: ")
print(cluster_0['lug'].value_counts()) #get the frequency of each value in the 'ppl' attribute
print("\nLuggage Space, Cluster 1: ")
print(cluster_1['lug'].value_counts()) #get the frequency of each value in the 'ppl' attribute
print("\nLuggage Space, Cluster 2: ")
print(cluster_2['lug'].value_counts()) #get the frequency of each value in the 'ppl' attribute
print("\nLuggage Space, Cluster 3: ")
print(cluster_3['lug'].value_counts()) #get the frequency of each value in the 'ppl' attribute
We were right: the clusters are now being decided by luggage space. Since each attribute is so evenly represented, the clustering algorithm is essentially picking one attribute to focus on by random chance. Having every attribute take each of its values exactly the same number of times is very unusual for a real dataset. A quick way to check the "random chance" idea is to rerun KMeans with several seeds and see which attribute dominates each run, as in the sketch below.
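This is a minimal sketch, using illustrative seed values rather than any particular set of seeds:
#rerun KMeans with a few different seeds and report which attribute separates the centroids most
for seed in (1, 7, 11, 42):
    km = KMeans(n_clusters=4, random_state=seed).fit(onehot_data)
    centers = pd.DataFrame(km.cluster_centers_, columns=onehot_data.columns)
    spread = centers.max() - centers.min() #spread per one-hot column
    print("seed %d -> strongest split on '%s'" % (seed, spread.idxmax().split('_')[0]))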
Let's try running this dataset through some predictors to see if we can accurately predict the class.
onehot_data['acc'] = data['acc'].copy() #add class onto onehot encoded attributes
onehot_data['acc'] = onehot_data['acc'].astype('category') #change to category datatype
onehot_data['acc'] = onehot_data['acc'].cat.codes #change to int code
# Generate the training set. Set random_state to be able to replicate results.
train = onehot_data.sample(frac=0.8, random_state=1)
# Select anything not in the training set and put it in the testing set (copy so we can safely add a prediction column later).
test = onehot_data.loc[~onehot_data.index.isin(train.index)].copy()
# Get all the columns from the dataframe.
columns = onehot_data.columns.tolist()
# Filter the columns to remove ones we don't want.
columns = [c for c in columns if c not in ["acc"]]
# Store the variable we'll be predicting on.
target = "acc"
from sklearn.ensemble import RandomForestClassifier
# Initialize the model with some parameters.
model = RandomForestClassifier(n_estimators=10, min_samples_leaf=5, random_state=1)
# Fit the model to the data.
model.fit(train[columns], train[target])
# Make predictions.
predictions = model.predict(test[columns])
#add the prediction column onto the test data to compare with actual class
test['prediction'] = predictions
We changed the class to a categorical variable and encoded it as an integer in [0, 3], then split the data into training and test sets by random sampling. We gathered all the columns other than the class, imported scikit-learn's RandomForestClassifier, and fit a model on the training data. We then used that model to predict the class of the test data, and finally added a column to the test data holding the predictions so we can compare them with the actual class.
#calculate percentage of correct classifications
num_correct = 0
for (c, p) in zip(test['acc'], test['prediction']):
    if c == p:
        num_correct += 1
percent_correct = num_correct / len(test)
print("Percent Classified Correctly: %.2f%%" % (percent_correct*100))
We compared the predicted class with the actual class and got 100% accuracy. That is good news for this dataset: the attributes really are representative of the class. Our k-means clustering didn't show this, so maybe k-means isn't the right clustering algorithm to use. This is likely because k-means draws linear boundaries between clusters, and our classes are defined in a non-linear fashion. Let's use a different clustering algorithm and see if we get better results. Our classes are clearly well defined since we got 100% accuracy, so we should be able to find a clustering algorithm that can match this.
Let's try DBSCAN from the sklearn library, which should be good at picking up non-linear clusters.
#remake onehot_data for clustering
onehot_data = pd.get_dummies(data_noclass)
#import DBSCAN from sklearn
from sklearn.cluster import DBSCAN
#create the model with some arbitrary values to start
model = DBSCAN(eps=0.5, min_samples=5)
#fit the model and get clusters
clusters = model.fit_predict(onehot_data)
#set cluster column of data
data['cluster'] = clusters
data.head(5)
data['cluster'].value_counts()
It seems that our clustering algorithm didn't work. After trying many different combinations of parameters (lower min_samples and higher eps should lead to more clusters), I could only ever get one cluster, which is useless. This algorithm must be ill-suited to a dataset like this, where every attribute is categorical.
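For reference, that kind of parameter sweep can be scripted. The eps and min_samples values below are illustrative, not the exact combinations tried above:
#try a small grid of DBSCAN parameters and count how many clusters each combination finds
for eps in (0.5, 1.0, 1.5, 2.0):
    for min_samples in (2, 5, 10):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(onehot_data)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0) #-1 marks noise points
        print("eps=%.1f, min_samples=%d -> %d clusters" % (eps, min_samples, n_clusters))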
I'm starting to think that there isn't a way to cluster this dataset to match the classes. There is a good chance that the classes are spread out in separate areas of the space. If this is the case, then there would be no way for an unsupervised learning algorithm to know that these separate areas could be linked as one cluster.
We'll try one more clustering algorithm just to see if we can get some results: agglomerative clustering, which is also supposed to handle non-linear clusters.
from sklearn.cluster import AgglomerativeClustering
#create the model with 4 clusters
model = AgglomerativeClustering(n_clusters = 4)
#fit the model and get clusters
clusters = model.fit_predict(onehot_data)
#set cluster column of data
data['cluster'] = clusters
#get clusters and frequency of each
print(data['cluster'].value_counts())
print(data['acc'].value_counts())
This clustering method did not work as we hoped; the cluster sizes don't come close to the distribution of classes in the labelled data. Let's look at the clusters to see which classes are represented in each.
cluster_0 = data.loc[data['cluster'] == 0]
cluster_1 = data.loc[data['cluster'] == 1]
cluster_2 = data.loc[data['cluster'] == 2]
cluster_3 = data.loc[data['cluster'] == 3]
print("\nClasses, Cluster 0: ")
print(cluster_0['acc'].value_counts()) #get the frequency of each value in the 'ppl' attribute
print("\nClasses, Cluster 1: ")
print(cluster_1['acc'].value_counts()) #get the frequency of each value in the 'ppl' attribute
print("\nClasses, Cluster 2: ")
print(cluster_2['acc'].value_counts()) #get the frequency of each value in the 'ppl' attribute
print("\nClasses, Cluster 3: ")
print(cluster_3['acc'].value_counts()) #get the frequency of each value in the 'ppl' attribute
This is nowhere near a match for the classes. But the more I explore this dataset trying to cluster it, the more I think it isn't suited to clustering. Even though I haven't gained much information about the dataset, I still feel like I've learned something about it.
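One way to put a number on that mismatch is an agreement score between the cluster labels and the classes; a minimal sketch, assuming the cluster column set by the agglomerative model above:
from sklearn.metrics import adjusted_rand_score
#1.0 means the clusters match the classes exactly; values near 0 mean roughly random agreement
print("Adjusted Rand Index: %.3f" % adjusted_rand_score(data['acc'], data['cluster']))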
For my next clustering project I will definitely choose a dataset with continuous variables so that we can get better results.