Comparing Clustering on a Labelled Dataset

Clustering

We are going to look at a dataset of car acceptability. This dataset consists of 6 attributes that describe a car and a class (acc) describing how acceptable the car is (unacc, acc, good, or vgood). We will try to cluster this dataset into 4 clusters without looking at the class. We will then compare the clustering to the pre-existing class values and see if they match up.

In [59]:
import pandas as pd
import numpy as np

data = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data", names=("price","maint","doors","ppl","lug","safe","acc"))
data.head(5)
Out[59]:
price maint doors ppl lug safe acc
0 vhigh vhigh 2 2 small low unacc
1 vhigh vhigh 2 2 small med unacc
2 vhigh vhigh 2 2 small high unacc
3 vhigh vhigh 2 2 med low unacc
4 vhigh vhigh 2 2 med med unacc

First we will create another dataframe that contains everything from the dataset except the class column, so that we can encode the attributes with dummy variables. Most machine learning algorithms can't work directly with categorical strings, so we must convert them to numerical data.

In [60]:
data_noclass = data[["price","maint","doors","ppl","lug","safe"]]    #get every attribute except for class (acc)
data_noclass.head(5)
Out[60]:
price maint doors ppl lug safe
0 vhigh vhigh 2 2 small low
1 vhigh vhigh 2 2 small med
2 vhigh vhigh 2 2 small high
3 vhigh vhigh 2 2 med low
4 vhigh vhigh 2 2 med med

We will use one-hot encoding to turn the different values for each attribute into dummy variables.

In [61]:
onehot_data = pd.get_dummies(data_noclass)
onehot_data.head(5)
Out[61]:
price_high price_low price_med price_vhigh maint_high maint_low maint_med maint_vhigh doors_2 doors_3 ... doors_5more ppl_2 ppl_4 ppl_more lug_big lug_med lug_small safe_high safe_low safe_med
0 0 0 0 1 0 0 0 1 1 0 ... 0 1 0 0 0 0 1 0 1 0
1 0 0 0 1 0 0 0 1 1 0 ... 0 1 0 0 0 0 1 0 0 1
2 0 0 0 1 0 0 0 1 1 0 ... 0 1 0 0 0 0 1 1 0 0
3 0 0 0 1 0 0 0 1 1 0 ... 0 1 0 0 0 1 0 0 1 0
4 0 0 0 1 0 0 0 1 1 0 ... 0 1 0 0 0 1 0 0 0 1

5 rows × 21 columns
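pd.get_dummies is the quickest way to do this. The same encoding can also be produced with scikit-learn's OneHotEncoder, which is convenient if the fitted encoder needs to be reused on new data later. A minimal sketch (assuming scikit-learn 0.20 or newer, which accepts string-valued columns; this cell is not part of the original run):

from sklearn.preprocessing import OneHotEncoder

# Fit an encoder on the categorical attributes and convert the sparse result to a dense array.
encoder = OneHotEncoder()
encoded = encoder.fit_transform(data_noclass).toarray()
print(encoded.shape)   # should be (1728, 21), matching the 21 dummy columns above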

Let's load scikit-learn's KMeans model and see how this dataset is clustered.

In [62]:
# Import the kmeans clustering model.
from sklearn.cluster import KMeans

# Initialize the model with 2 parameters -- number of clusters and random state.
kmeans_model = KMeans(n_clusters=4, random_state=1)

# Fit the model on the one-hot encoded data.
kmeans_model.fit(onehot_data)
# Get the cluster assignments.
labels = kmeans_model.labels_

labels
Out[62]:
array([0, 0, 0, ..., 1, 1, 1])
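
We picked k = 4 here simply to match the number of class values. If we didn't know how many classes to expect, a common rough check is to fit KMeans for a range of k and look for an elbow in the inertia (the within-cluster sum of squares). A short sketch, not part of the original run:

# Record the inertia for several cluster counts; a sharp bend in these numbers suggests a natural k.
for k in range(2, 10):
    km = KMeans(n_clusters=k, random_state=1).fit(onehot_data)
    print(k, round(km.inertia_, 1))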

We have our labels; now we want to compare them to the existing classes to see how closely our clusters match.
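
Since cluster IDs are arbitrary, one way to quantify the agreement without matching IDs to class names is the adjusted Rand index (roughly 0 for random assignments, 1 for a perfect match). A small sketch, not part of the original run:

from sklearn.metrics import adjusted_rand_score

# Compare the cluster assignments against the labelled class, ignoring how the clusters are numbered.
print(adjusted_rand_score(data['acc'], labels))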

In [63]:
data['cluster'] = labels
data.head(50)
Out[63]:
price maint doors ppl lug safe acc cluster
0 vhigh vhigh 2 2 small low unacc 0
1 vhigh vhigh 2 2 small med unacc 0
2 vhigh vhigh 2 2 small high unacc 0
3 vhigh vhigh 2 2 med low unacc 3
4 vhigh vhigh 2 2 med med unacc 3
5 vhigh vhigh 2 2 med high unacc 3
6 vhigh vhigh 2 2 big low unacc 0
7 vhigh vhigh 2 2 big med unacc 0
8 vhigh vhigh 2 2 big high unacc 0
9 vhigh vhigh 2 4 small low unacc 2
10 vhigh vhigh 2 4 small med unacc 2
11 vhigh vhigh 2 4 small high unacc 2
12 vhigh vhigh 2 4 med low unacc 2
13 vhigh vhigh 2 4 med med unacc 2
14 vhigh vhigh 2 4 med high unacc 2
15 vhigh vhigh 2 4 big low unacc 2
16 vhigh vhigh 2 4 big med unacc 2
17 vhigh vhigh 2 4 big high unacc 2
18 vhigh vhigh 2 more small low unacc 1
19 vhigh vhigh 2 more small med unacc 1
20 vhigh vhigh 2 more small high unacc 1
21 vhigh vhigh 2 more med low unacc 1
22 vhigh vhigh 2 more med med unacc 1
23 vhigh vhigh 2 more med high unacc 1
24 vhigh vhigh 2 more big low unacc 1
25 vhigh vhigh 2 more big med unacc 1
26 vhigh vhigh 2 more big high unacc 1
27 vhigh vhigh 3 2 small low unacc 0
28 vhigh vhigh 3 2 small med unacc 0
29 vhigh vhigh 3 2 small high unacc 0
30 vhigh vhigh 3 2 med low unacc 3
31 vhigh vhigh 3 2 med med unacc 3
32 vhigh vhigh 3 2 med high unacc 3
33 vhigh vhigh 3 2 big low unacc 0
34 vhigh vhigh 3 2 big med unacc 0
35 vhigh vhigh 3 2 big high unacc 0
36 vhigh vhigh 3 4 small low unacc 2
37 vhigh vhigh 3 4 small med unacc 2
38 vhigh vhigh 3 4 small high unacc 2
39 vhigh vhigh 3 4 med low unacc 2
40 vhigh vhigh 3 4 med med unacc 2
41 vhigh vhigh 3 4 med high unacc 2
42 vhigh vhigh 3 4 big low unacc 2
43 vhigh vhigh 3 4 big med unacc 2
44 vhigh vhigh 3 4 big high unacc 2
45 vhigh vhigh 3 more small low unacc 1
46 vhigh vhigh 3 more small med unacc 1
47 vhigh vhigh 3 more small high unacc 1
48 vhigh vhigh 3 more med low unacc 1
49 vhigh vhigh 3 more med med unacc 1

Our clustering does not line up well with the labels; it is grouping the cars differently than the classes suggest. Let's look at the distribution of classes to see if we can figure out why this is happening.

In [64]:
classes = data['acc']
classes.value_counts()
Out[64]:
unacc    1210
acc       384
good       69
vgood      65
Name: acc, dtype: int64

The classes in the dataset aren't evenly represented. Let's look at the cars classed as 'vgood' and 'good' and see which clusters they're put into.

In [74]:
print("vgood clusters")
print(data.loc[data['acc'] == 'vgood']['cluster'].value_counts())  #get the frequency of each value in the 'cluster' attribute
print("\ngood clusters")
print(data.loc[data['acc'] == 'good']['cluster'].value_counts())  #get the frequency of each value in the 'cluster' attribute
vgood clusters
1    35
2    30
Name: cluster, dtype: int64

good clusters
2    36
1    33
Name: cluster, dtype: int64

All of the cars in the 'vgood' and 'good' classes are being put into clusters 1 and 2. That is mildly good news: at least these two classes are confined to only two clusters, so the algorithm may be doing something right. Let's look at the other two classes, which have far more elements, to see what their clustering looks like.

In [75]:
print("acc clusters")
print(data.loc[data['acc'] == 'acc']['cluster'].value_counts())  #get the frequency of each value in the 'cluster' attribute
print("\nunacc clusters")
print(data.loc[data['acc'] == 'unacc']['cluster'].value_counts())  #get the frequency of each value in the 'cluster' attribute
acc clusters
2    198
1    186
Name: cluster, dtype: int64

unacc clusters
0    384
1    322
2    312
3    192
Name: cluster, dtype: int64
This isn't good news; we were hoping 'acc' and 'unacc' would fall into different clusters than 'vgood' and 'good' did. It's likely that the clustering algorithm is picking up structure in the data that isn't in line with the class. Let's look at all the rows in cluster 1, since that cluster appears in every class; maybe we can find something that explains why the clusters differ from the classes.

In [79]:
cluster_1 = data.loc[data['cluster'] == 1]   #get only the datapoints with cluster value of 1
cluster_1
Out[79]:
price maint doors ppl lug safe acc cluster
18 vhigh vhigh 2 more small low unacc 1
19 vhigh vhigh 2 more small med unacc 1
20 vhigh vhigh 2 more small high unacc 1
21 vhigh vhigh 2 more med low unacc 1
22 vhigh vhigh 2 more med med unacc 1
23 vhigh vhigh 2 more med high unacc 1
24 vhigh vhigh 2 more big low unacc 1
25 vhigh vhigh 2 more big med unacc 1
26 vhigh vhigh 2 more big high unacc 1
45 vhigh vhigh 3 more small low unacc 1
46 vhigh vhigh 3 more small med unacc 1
47 vhigh vhigh 3 more small high unacc 1
48 vhigh vhigh 3 more med low unacc 1
49 vhigh vhigh 3 more med med unacc 1
50 vhigh vhigh 3 more med high unacc 1
51 vhigh vhigh 3 more big low unacc 1
52 vhigh vhigh 3 more big med unacc 1
53 vhigh vhigh 3 more big high unacc 1
72 vhigh vhigh 4 more small low unacc 1
73 vhigh vhigh 4 more small med unacc 1
74 vhigh vhigh 4 more small high unacc 1
75 vhigh vhigh 4 more med low unacc 1
76 vhigh vhigh 4 more med med unacc 1
77 vhigh vhigh 4 more med high unacc 1
78 vhigh vhigh 4 more big low unacc 1
79 vhigh vhigh 4 more big med unacc 1
80 vhigh vhigh 4 more big high unacc 1
99 vhigh vhigh 5more more small low unacc 1
100 vhigh vhigh 5more more small med unacc 1
101 vhigh vhigh 5more more small high unacc 1
... ... ... ... ... ... ... ... ...
1644 low low 2 more big low unacc 1
1645 low low 2 more big med good 1
1646 low low 2 more big high vgood 1
1665 low low 3 more small low unacc 1
1666 low low 3 more small med acc 1
1667 low low 3 more small high good 1
1668 low low 3 more med low unacc 1
1669 low low 3 more med med good 1
1670 low low 3 more med high vgood 1
1671 low low 3 more big low unacc 1
1672 low low 3 more big med good 1
1673 low low 3 more big high vgood 1
1692 low low 4 more small low unacc 1
1693 low low 4 more small med acc 1
1694 low low 4 more small high good 1
1695 low low 4 more med low unacc 1
1696 low low 4 more med med good 1
1697 low low 4 more med high vgood 1
1698 low low 4 more big low unacc 1
1699 low low 4 more big med good 1
1700 low low 4 more big high vgood 1
1719 low low 5more more small low unacc 1
1720 low low 5more more small med acc 1
1721 low low 5more more small high good 1
1722 low low 5more more med low unacc 1
1723 low low 5more more med med good 1
1724 low low 5more more med high vgood 1
1725 low low 5more more big low unacc 1
1726 low low 5more more big med good 1
1727 low low 5more more big high vgood 1

576 rows × 8 columns

My first observation is that the ppl attribute is "more" for every row in this cluster, so let's compare the value counts of ppl inside cluster 1 with the whole dataset.

In [84]:
print("Number of passengers, Cluster 1: ")
print(cluster_1['ppl'].value_counts())   #get the frequency of each value in the 'ppl' attribute
print("\nNumber of passengers, all Data: ")
print(data['ppl'].value_counts())   #get the frequency of each value in the 'ppl' attribute
Number of passengers, Cluster 1: 
more    576
Name: ppl, dtype: int64

Number of passengers, all Data: 
more    576
4       576
2       576
Name: ppl, dtype: int64

The number of passengers is "more" for all of cluster 1, so this could be a big factor in the clustering. Let's look at some of the other clusters to see if they show a similar trend.

In [86]:
cluster_2 = data.loc[data['cluster'] == 2]
cluster_3 = data.loc[data['cluster'] == 3]
cluster_0 = data.loc[data['cluster'] == 0]
print("Number of passengers, Cluster 2: ")
print(cluster_2['ppl'].value_counts())   #get the frequency of each value in the 'ppl' attribute
print("\nNumber of passengers, Cluster 3: ")
print(cluster_3['ppl'].value_counts())   #get the frequency of each value in the 'ppl' attribute
print("\nNumber of passengers, Cluster 0: ")
print(cluster_0['ppl'].value_counts())   #get the frequency of each value in the 'ppl' attribute
Number of passengers, Cluster 2: 
4    576
Name: ppl, dtype: int64

Number of passengers, Cluster 3: 
2    192
Name: ppl, dtype: int64

Number of passengers, Cluster 0: 
2    384
Name: ppl, dtype: int64

It seems our hypothesis was correct: the clustering algorithm grouped the cars primarily by the number of passengers. The '2'-passenger cars are probably split across clusters 0 and 3 only because we asked for 4 clusters while ppl has just 3 values, so one passenger group had to be divided. This likely happened because the number of passengers is distributed so evenly across its values. Let's look at the distribution of values in the other attributes to see if they are also evenly distributed.

In [87]:
print("\nBuying Price: ")
print(data['price'].value_counts())   #get the frequency of values
print("\nMaintanance Price: ")
print(data['maint'].value_counts())   #get the frequency of values
print("\nNumber of doors: ")
print(data['doors'].value_counts())   #get the frequency of values
print("\nLuggage room: ")
print(data['lug'].value_counts())   #get the frequency of values
print("\nSafety: ")
print(data['safe'].value_counts())   #get the frequency of values
Buying Price: 
vhigh    432
low      432
high     432
med      432
Name: price, dtype: int64

Maintenance Price: 
vhigh    432
low      432
high     432
med      432
Name: maint, dtype: int64

Number of doors: 
4        432
2        432
5more    432
3        432
Name: doors, dtype: int64

Luggage room: 
small    576
med      576
big      576
Name: lug, dtype: int64

Safety: 
low     576
high    576
med     576
Name: safe, dtype: int64

All of the attributes are evenly represented, so it's interesting that the clustering algorithm decided to cluster primarily on the number of passengers. Let's run the clustering again with a different random seed and see if it does the same thing.

In [89]:
# Import the kmeans clustering model.
from sklearn.cluster import KMeans

# Initialize the model with 2 parameters -- number of clusters and random state.
kmeans_model = KMeans(n_clusters=4, random_state=11)

# Fit the model on the one-hot encoded data.
kmeans_model.fit(onehot_data)

# Get the cluster assignments.
labels = kmeans_model.labels_

#reset the cluster column of the data
data['cluster'] = labels
data.head(30)
Out[89]:
price maint doors ppl lug safe acc cluster
0 vhigh vhigh 2 2 small low unacc 1
1 vhigh vhigh 2 2 small med unacc 1
2 vhigh vhigh 2 2 small high unacc 1
3 vhigh vhigh 2 2 med low unacc 2
4 vhigh vhigh 2 2 med med unacc 2
5 vhigh vhigh 2 2 med high unacc 2
6 vhigh vhigh 2 2 big low unacc 0
7 vhigh vhigh 2 2 big med unacc 0
8 vhigh vhigh 2 2 big high unacc 0
9 vhigh vhigh 2 4 small low unacc 3
10 vhigh vhigh 2 4 small med unacc 3
11 vhigh vhigh 2 4 small high unacc 3
12 vhigh vhigh 2 4 med low unacc 2
13 vhigh vhigh 2 4 med med unacc 2
14 vhigh vhigh 2 4 med high unacc 2
15 vhigh vhigh 2 4 big low unacc 0
16 vhigh vhigh 2 4 big med unacc 0
17 vhigh vhigh 2 4 big high unacc 0
18 vhigh vhigh 2 more small low unacc 1
19 vhigh vhigh 2 more small med unacc 1
20 vhigh vhigh 2 more small high unacc 1
21 vhigh vhigh 2 more med low unacc 2
22 vhigh vhigh 2 more med med unacc 2
23 vhigh vhigh 2 more med high unacc 2
24 vhigh vhigh 2 more big low unacc 0
25 vhigh vhigh 2 more big med unacc 0
26 vhigh vhigh 2 more big high unacc 0
27 vhigh vhigh 3 2 small low unacc 1
28 vhigh vhigh 3 2 small med unacc 1
29 vhigh vhigh 3 2 small high unacc 1

From this run of the clustering algorithm it seems that luggage space is the deciding factor for how the set is clustered. Let's look at the luggage size value counts for each cluster.

In [90]:
cluster_0 = data.loc[data['cluster'] == 0]
cluster_1 = data.loc[data['cluster'] == 1]
cluster_2 = data.loc[data['cluster'] == 2]
cluster_3 = data.loc[data['cluster'] == 3]
print("\nLuggage Space, Cluster 0: ")
print(cluster_0['lug'].value_counts())   #get the frequency of each value in the 'lug' attribute
print("\nLuggage Space, Cluster 1: ")
print(cluster_1['lug'].value_counts())   #get the frequency of each value in the 'lug' attribute
print("\nLuggage Space, Cluster 2: ")
print(cluster_2['lug'].value_counts())   #get the frequency of each value in the 'lug' attribute
print("\nLuggage Space, Cluster 3: ")
print(cluster_3['lug'].value_counts())   #get the frequency of each value in the 'lug' attribute
Luggage Space, Cluster 0: 
big    576
Name: lug, dtype: int64

Luggage Space, Cluster 1: 
small    384
Name: lug, dtype: int64

Luggage Space, Cluster 2: 
med    576
Name: lug, dtype: int64

Luggage Space, Cluster 3: 
small    192
Name: lug, dtype: int64

We were right: the clusters are now being decided by luggage space. Since every attribute is so evenly represented, the algorithm essentially picks one attribute to split on depending on its random initialization. Having each attribute contain exactly the same number of every value is very unnatural for a real dataset.
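
One way to put a number on this run-to-run instability is to re-fit the model with both seeds and compare the two sets of assignments with the adjusted Rand index; a short sketch, not part of the original run:

from sklearn.metrics import adjusted_rand_score

# Fit k-means twice with the two seeds used above and compare the assignments.
labels_a = KMeans(n_clusters=4, random_state=1).fit_predict(onehot_data)
labels_b = KMeans(n_clusters=4, random_state=11).fit_predict(onehot_data)

# A score near 1 means the runs found the same grouping; near 0 means essentially unrelated groupings.
print(adjusted_rand_score(labels_a, labels_b))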

Let's try running this dataset through some predictors to see if we can accurately predict the class.

In [132]:
onehot_data['acc'] = data['acc'].copy() #add class onto onehot encoded attributes

onehot_data['acc'] = onehot_data['acc'].astype('category')  #change to category datatype
onehot_data['acc'] = onehot_data['acc'].cat.codes  #change to int code

# Generate the training set.  Set random_state to be able to replicate results.
train = onehot_data.sample(frac=0.8, random_state=1)
# Select anything not in the training set and put it in the testing set.
test = onehot_data.loc[~onehot_data.index.isin(train.index)]

# Get all the columns from the dataframe.
columns = onehot_data.columns.tolist()
# Filter the columns to remove ones we don't want.
columns = [c for c in columns if c not in ["acc"]]

# Store the variable we'll be predicting on.
target = "acc"


from sklearn.ensemble import RandomForestClassifier

# Initialize the model with some parameters.
model = RandomForestClassifier(n_estimators=10, min_samples_leaf=5, random_state=1)
# Fit the model to the data.
model.fit(train[columns], train[target])
# Make predictions.
predictions = model.predict(test[columns])

#add the prediction column onto the test data to compare with actual class
test['prediction'] = predictions
C:\Users\Asus\Anaconda3\lib\site-packages\ipykernel\__main__.py:30: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
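
The warning is raised because test is a slice of onehot_data, so assigning a new column writes into a view. Taking an explicit copy when building the test set avoids it; a minimal sketch:

# Build the test set as an independent copy so later column assignments don't touch onehot_data.
test = onehot_data.loc[~onehot_data.index.isin(train.index)].copy()
test['prediction'] = model.predict(test[columns])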

We converted the class to a categorical variable and encoded it as an integer code from 0 to 3, split the data into training and test sets by random sampling, and gathered all the columns that aren't the class. We then fit scikit-learn's RandomForestClassifier on the training data, used the fitted model to predict the class of the test rows, and added the predictions as a column on the test data so we can compare them with the actual class.

In [134]:
#calculate percentage of correct classifications
num_correct = 0
for (c, p) in zip(test['acc'],test['prediction']):
    if(c == p):
        num_correct += 1
        
percent_correct = num_correct/len(test)
print("Percent Classified Correctly: %.2f%%" % (percent_correct*100))
Percent Classified Correctly: 100.00%
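
The manual loop above is equivalent to scikit-learn's built-in accuracy metric; a one-line sketch:

from sklearn.metrics import accuracy_score

print(accuracy_score(test['acc'], test['prediction']))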

We compared the predicted class with the actual class and got 100% accuracy. That is good news for this dataset: the attributes really do determine the class. Our k-means clustering didn't show this, so maybe k-means isn't the right clustering algorithm to use here. K-means can only produce convex, roughly spherical clusters, and our classes are probably not shaped that way in the one-hot space. Let's try a different clustering algorithm and see if we get better results; since a classifier reached 100% accuracy, the classes are clearly well defined, so we may yet find a clustering algorithm that can match them.
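
As a side check on the claim that the attributes really do determine the class, the random forest we just trained exposes per-feature importances; a small sketch using the fitted model and columns list from above:

# Rank the one-hot columns by how much the random forest relied on them.
importances = sorted(zip(columns, model.feature_importances_), key=lambda t: t[1], reverse=True)
for name, score in importances[:10]:
    print("%-15s %.3f" % (name, score))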

Let's try DBSCAN from scikit-learn, which should be good at picking up irregularly shaped clusters.

In [138]:
#remake onehot_data for clustering
onehot_data = pd.get_dummies(data_noclass)

#import DBSCAN from sklearn
from sklearn.cluster import DBSCAN

#create the model with some arbitrary values to start
model = DBSCAN(eps=0.5, min_samples=5)

#fit the model and get clusters
clusters = model.fit_predict(onehot_data)

#set cluster column of data
data['cluster'] = clusters

data.head(5)
Out[138]:
price maint doors ppl lug safe acc cluster
0 vhigh vhigh 2 2 small low unacc -1
1 vhigh vhigh 2 2 small med unacc -1
2 vhigh vhigh 2 2 small high unacc -1
3 vhigh vhigh 2 2 med low unacc -1
4 vhigh vhigh 2 2 med med unacc -1
In [139]:
data['cluster'].value_counts()
Out[139]:
-1    1728
Name: cluster, dtype: int64

It seems that our clustering didn't work at all here: a cluster label of -1 means DBSCAN marked the point as noise, so with these parameters every single point is noise. After trying many different parameter combinations (lower min_samples and higher eps make it easier for points to join clusters), I could only ever get everything as noise or everything in one cluster, which is useless. This algorithm seems ill-suited to a dataset like this where every attribute is categorical.
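
If we wanted to push on DBSCAN a little further, one option would be a distance metric meant for binary data, such as Hamming distance, with a correspondingly small eps. A sketch under those assumptions (the eps and min_samples values are untuned guesses, not part of the original run):

# Hamming distance is the fraction of differing columns; two cars that differ in one
# attribute differ in 2 of the 21 one-hot columns, i.e. a distance of 2/21.
model = DBSCAN(eps=2.5/21, min_samples=5, metric='hamming')
clusters = model.fit_predict(onehot_data)
print(pd.Series(clusters).value_counts())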

I'm starting to think that there isn't a way to cluster this dataset to match the classes. There is a good chance that the classes are spread out in separate areas of the space. If this is the case, then there would be no way for an unsupervised learning algorithm to know that these separate areas could be linked as one cluster.

We'll try one more clustering algorithm just to see if we can get better results: agglomerative clustering, which can also handle irregularly shaped clusters.

In [166]:
from sklearn.cluster import AgglomerativeClustering

#create the model with 4 clusters
model = AgglomerativeClustering(n_clusters = 4)

#fit the model and get clusters
clusters = model.fit_predict(onehot_data)

#set cluster column of data
data['cluster'] = clusters

#get clusters and frequency of each
print(data['cluster'].value_counts())
print(data['acc'].value_counts())
0    675
3    351
2    351
1    351
Name: cluster, dtype: int64
unacc    1210
acc       384
good       69
vgood      65
Name: acc, dtype: int64

This clustering method did not work as we hoped; it didn't come close to the distribution of classes in the labelled data. Let's look at the clusters to see which classes are represented in each.

In [164]:
cluster_0 = data.loc[data['cluster'] == 0]
cluster_1 = data.loc[data['cluster'] == 1]
cluster_2 = data.loc[data['cluster'] == 2]
cluster_3 = data.loc[data['cluster'] == 3]
print("\nClasses, Cluster 0: ")
print(cluster_0['acc'].value_counts())   #get the frequency of each class in this cluster
print("\nClasses, Cluster 1: ")
print(cluster_1['acc'].value_counts())   #get the frequency of each class in this cluster
print("\nClasses, Cluster 2: ")
print(cluster_2['acc'].value_counts())   #get the frequency of each class in this cluster
print("\nClasses, Cluster 3: ")
print(cluster_3['acc'].value_counts())   #get the frequency of each class in this cluster
Classes, Cluster 0: 
unacc    545
acc      130
Name: acc, dtype: int64

Classes, Cluster 1: 
unacc    216
acc       86
vgood     26
good      23
Name: acc, dtype: int64

Classes, Cluster 2: 
unacc    243
acc      108
Name: acc, dtype: int64

Classes, Cluster 3: 
unacc    206
acc       60
good      46
vgood     39
Name: acc, dtype: int64

This is nowhere near a match for the classes. The more I explore this dataset trying to cluster it, the more I think it simply isn't suited to clustering: the attribute values are perfectly balanced, so distances in the one-hot space reflect arbitrary attribute differences rather than the class. Even though the clustering itself didn't reveal much, the process has still taught us something about the structure of the dataset.

For my next clustering project I will choose a dataset with continuous variables, which should give clustering algorithms more to work with.
