Clustering
We are going to look at a dataset of car acceptability. This dataset consists of 6 attributes that describe a car, and a class describing whether or not the car is acceptable. We will try to cluster this dataset into 4 clusters without looking at the class (acc). We will then compare the clustering to the pre-existing class values and see if they match up.
import pandas as pd
import numpy as np
data = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data", names=("price","maint","doors","ppl","lug","safe","acc"))
data.head(5)
First we will create another dataframe that contains everything from the dataset except the class column. We do this so we can encode the dataset with dummy variables. Categorical data is not directly usable by most machine learning algorithms, so we must convert it to numerical data.
data_noclass = data[["price","maint","doors","ppl","lug","safe"]] #get every attribute except for class (acc)
data_noclass.head(5)
We will use one-hot encoding to turn the different values for each attribute into dummy variables.
onehot_data = pd.get_dummies(data_noclass)
onehot_data.head(5)
Let's load scikit-learn's KMeans clustering and see how this dataset is clustered.
# Import the kmeans clustering model.
from sklearn.cluster import KMeans
# Initialize the model with 2 parameters -- number of clusters and random state.
kmeans_model = KMeans(n_clusters=4, random_state=1)
# Fit the model using the good columns.
kmeans_model.fit(onehot_data)
# Get the cluster assignments.
labels = kmeans_model.labels_
labels
We have our labels; now we want to compare them to the existing classes to see how close our clusters are.
data['cluster'] = labels
data.head(50)
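Eyeballing head(50) only goes so far. A cross-tabulation of cluster against class makes the overlap (or lack of it) easier to see; this is a minimal sketch, assuming the cluster column we just added above.
#compare cluster assignments with the existing classes in one table
pd.crosstab(data['cluster'], data['acc'])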
Our clustering is not very accurate; it is grouping the cars differently than the classes suggest. Let's look at the distribution of classes to see if we can figure out why this is happening.
classes = data['acc']
classes.value_counts()
The classes in the dataset aren't evenly represented. Let's look at the cars that are classed as 'vgood' and 'good' and see what clusters they're put into.
print("vgood clusters")
print(data.loc[data['acc'] == 'vgood']['cluster'].value_counts()) #get the frequency of each value in the 'cluster' attribute
print("\ngood clusters")
print(data.loc[data['acc'] == 'good']['cluster'].value_counts()) #get the frequency of each value in the 'cluster' attribute
All of the cars under the 'vgood' and 'good' classes are being put into clusters 1 and 2. This is somewhat encouraging: at least these two groups are confined to only two clusters, so maybe our clustering algorithm is doing something right. Let's look at the other two, larger groups to see what their clustering looks like.
print("acc clusters")
print(data.loc[data['acc'] == 'acc']['cluster'].value_counts()) #get the frequency of each value in the 'cluster' attribute
print("\nunacc clusters")
print(data.loc[data['acc'] == 'unacc']['cluster'].value_counts()) #get the frequency of each value in the 'cluster' attribute
cluster_1 = data.loc[data['cluster'] == 1] #get only the datapoints with cluster value of 1
cluster_1
My first observation is that every element of the ppl attribute in this cluster is "more", so let's look at the value counts for the ppl attribute.
print("Number of passengers, Cluster 1: ")
print(cluster_1['ppl'].value_counts()) #get the frequency of each value in the 'ppl' attribute
print("\nNumber of passengers, all Data: ")
print(data['ppl'].value_counts()) #get the frequency of each value in the 'ppl' attribute
The number of passengers is "more" for all of cluster 1, so this could be a big factor for the clustering. Let's look at some of the other clusters to see if they show a similar trend.
cluster_2 = data.loc[data['cluster'] == 2]
cluster_3 = data.loc[data['cluster'] == 3]
cluster_0 = data.loc[data['cluster'] == 0]
print("Number of passengers, Cluster 2: ")
print(cluster_2['ppl'].value_counts()) #get the frequency of each value in the 'ppl' attribute
print("\nNumber of passengers, Cluster 3: ")
print(cluster_3['ppl'].value_counts()) #get the frequency of each value in the 'ppl' attribute
print("\nNumber of passengers, Cluster 0: ")
print(cluster_0['ppl'].value_counts()) #get the frequency of each value in the 'ppl' attribute
It seems like our hypothesis was correct: the clustering algorithm grouped the cars primarily by number of passengers. The cars with '2' passengers likely end up split across clusters 3 and 0 only because we asked for 4 clusters while there are just 3 values for number of passengers. This could be because the distribution of the number of passengers is so even across the values. Let's look at the distribution of values in the other attributes to see if they are also evenly distributed.
print("\nBuying Price: ")
print(data['price'].value_counts()) #get the frequency of values
print("\nMaintanance Price: ")
print(data['maint'].value_counts()) #get the frequency of values
print("\nNumber of doors: ")
print(data['doors'].value_counts()) #get the frequency of values
print("\nLuggage room: ")
print(data['lug'].value_counts()) #get the frequency of values
print("\nSafety: ")
print(data['safe'].value_counts()) #get the frequency of values
All of the attributes are evenly represented, so it's interesting that the clustering algorithm decided to cluster primarily on the number of passengers. Let's try running the clustering algorithm again with a different random seed and see if it does the same thing.
# Import the kmeans clustering model.
from sklearn.cluster import KMeans
# Initialize the model with 2 parameters -- number of clusters and random state.
kmeans_model = KMeans(n_clusters=4, random_state=11)
# Fit the model using the good columns.
kmeans_model.fit(onehot_data)
# Get the cluster assignments.
labels = kmeans_model.labels_
#reset the cluster column of the data
data['cluster'] = labels
data.head(30)
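Before eyeballing rows, we can look at the fitted centroids to see which one-hot columns separate the clusters most strongly. A minimal sketch, assuming the kmeans_model and onehot_data variables from the cells above:
#put the centroids in a dataframe so each column lines up with its one-hot attribute
centers = pd.DataFrame(kmeans_model.cluster_centers_, columns=onehot_data.columns)
#columns with a large spread between clusters are the ones driving the split
print((centers.max() - centers.min()).sort_values(ascending=False).head(6))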
From looking at this run of the clustering algorithm, it seems that luggage space is now the deciding factor for how the set is clustered. Let's look at the luggage size value counts for each cluster.
cluster_0 = data.loc[data['cluster'] == 0]
cluster_1 = data.loc[data['cluster'] == 1]
cluster_2 = data.loc[data['cluster'] == 2]
cluster_3 = data.loc[data['cluster'] == 3]
print("\nLuggage Space, Cluster 0: ")
print(cluster_0['lug'].value_counts()) #get the frequency of each value in the 'ppl' attribute
print("\nLuggage Space, Cluster 1: ")
print(cluster_1['lug'].value_counts()) #get the frequency of each value in the 'ppl' attribute
print("\nLuggage Space, Cluster 2: ")
print(cluster_2['lug'].value_counts()) #get the frequency of each value in the 'ppl' attribute
print("\nLuggage Space, Cluster 3: ")
print(cluster_3['lug'].value_counts()) #get the frequency of each value in the 'ppl' attribute
We were right: the clusters are now being decided by luggage space. Since each attribute is so evenly represented, the clustering algorithm is essentially picking one attribute to focus on by random chance. Having every attribute take each of its values exactly the same number of times is very unusual for a real dataset. A quick way to check the "random chance" idea is to rerun KMeans with several seeds and see which attribute dominates each run, as in the sketch below.
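This is a minimal sketch, using illustrative seed values rather than any particular set of seeds:
#rerun KMeans with a few different seeds and report which attribute separates the centroids most
for seed in (1, 7, 11, 42):
    km = KMeans(n_clusters=4, random_state=seed).fit(onehot_data)
    centers = pd.DataFrame(km.cluster_centers_, columns=onehot_data.columns)
    spread = centers.max() - centers.min() #spread per one-hot column
    print("seed %d -> strongest split on '%s'" % (seed, spread.idxmax().split('_')[0]))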
Let's try running this dataset through some predictors to see if we can accurately predict the class.
onehot_data['acc'] = data['acc'].copy() #add class onto onehot encoded attributes
onehot_data['acc'] = onehot_data['acc'].astype('category') #change to category datatype
onehot_data['acc'] = onehot_data['acc'].cat.codes #change to int code
# Generate the training set. Set random_state to be able to replicate results.
train = onehot_data.sample(frac=0.8, random_state=1)
# Select anything not in the training set and put it in the testing set (copy so we can safely add a prediction column later).
test = onehot_data.loc[~onehot_data.index.isin(train.index)].copy()
# Get all the columns from the dataframe.
columns = onehot_data.columns.tolist()
# Filter the columns to remove ones we don't want.
columns = [c for c in columns if c not in ["acc"]]
# Store the variable we'll be predicting on.
target = "acc"
from sklearn.ensemble import RandomForestClassifier
# Initialize the model with some parameters.
model = RandomForestClassifier(n_estimators=10, min_samples_leaf=5, random_state=1)
# Fit the model to the data.
model.fit(train[columns], train[target])
# Make predictions.
predictions = model.predict(test[columns])
#add the prediction column onto the test data to compare with actual class
test['prediction'] = predictions
We changed the class to a categorical variable and encoded it as an integer in [0, 3], then split the data into training and test sets by random sampling. We gathered all the columns other than the class, imported scikit-learn's RandomForestClassifier, and fit a model on the training data. We then used that model to predict the class of the test data, and finally added a column to the test data holding the predictions so we can compare them with the actual class.
#calculate percentage of correct classifications
num_correct = 0
for (c, p) in zip(test['acc'], test['prediction']):
    if c == p:
        num_correct += 1
percent_correct = num_correct / len(test)
print("Percent Classified Correctly: %.2f%%" % (percent_correct*100))
We compared the predicted class with the actual class and got 100% accuracy. That is good news for this dataset: the attributes really are representative of the class. Our k-means clustering didn't show this, so maybe k-means isn't the right clustering algorithm to use. This is likely because k-means draws linear boundaries between clusters, and our classes are defined in a non-linear fashion. Let's use a different clustering algorithm and see if we get better results. Our classes are clearly well defined since we got 100% accuracy, so we should be able to find a clustering algorithm that can match this.
Let's try DBSCAN from the sklearn library, which should be good at picking up non-linear clusters.
#remake onehot_data for clustering
onehot_data = pd.get_dummies(data_noclass)
#import DBSCAN from sklearn
from sklearn.cluster import DBSCAN
#create the model with some arbitrary values to start
model = DBSCAN(eps=0.5, min_samples=5)
#fit the model and get clusters
clusters = model.fit_predict(onehot_data)
#set cluster column of data
data['cluster'] = clusters
data.head(5)
data['cluster'].value_counts()
It seems that our clustering algorithm didn't work. After trying many different combinations of parameters (lower min_samples and higher eps should lead to more clusters), I could only ever get one cluster, which is useless. This algorithm must be ill-suited to a dataset like this, where every attribute is categorical.
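For reference, that kind of parameter sweep can be scripted. The eps and min_samples values below are illustrative, not the exact combinations tried above:
#try a small grid of DBSCAN parameters and count how many clusters each combination finds
for eps in (0.5, 1.0, 1.5, 2.0):
    for min_samples in (2, 5, 10):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(onehot_data)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0) #-1 marks noise points
        print("eps=%.1f, min_samples=%d -> %d clusters" % (eps, min_samples, n_clusters))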
I'm starting to think that there isn't a way to cluster this dataset to match the classes. There is a good chance that the classes are spread out in separate areas of the space. If this is the case, then there would be no way for an unsupervised learning algorithm to know that these separate areas could be linked as one cluster.
We'll try one more clustering algorithm just to see if we can get some results: agglomerative clustering, which is also supposed to handle non-linear clusters.
from sklearn.cluster import AgglomerativeClustering
#create the model with 4 clusters
model = AgglomerativeClustering(n_clusters = 4)
#fit the model and get clusters
clusters = model.fit_predict(onehot_data)
#set cluster column of data
data['cluster'] = clusters
#get clusters and frequency of each
print(data['cluster'].value_counts())
print(data['acc'].value_counts())
This clustering method did not work as we hoped; the cluster sizes don't come close to the distribution of classes in the labelled data. Let's look at the clusters to see which classes are represented in each.
cluster_0 = data.loc[data['cluster'] == 0]
cluster_1 = data.loc[data['cluster'] == 1]
cluster_2 = data.loc[data['cluster'] == 2]
cluster_3 = data.loc[data['cluster'] == 3]
print("\nClasses, Cluster 0: ")
print(cluster_0['acc'].value_counts()) #get the frequency of each value in the 'ppl' attribute
print("\nClasses, Cluster 1: ")
print(cluster_1['acc'].value_counts()) #get the frequency of each value in the 'ppl' attribute
print("\nClasses, Cluster 2: ")
print(cluster_2['acc'].value_counts()) #get the frequency of each value in the 'ppl' attribute
print("\nClasses, Cluster 3: ")
print(cluster_3['acc'].value_counts()) #get the frequency of each value in the 'ppl' attribute
This is nowhere near a match for the classes. But the more I explore this dataset trying to cluster it, the more I think it isn't suited to clustering. Even though I haven't gained much information about the dataset, I still feel like I've learned something about it.
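One way to put a number on that mismatch is an agreement score between the cluster labels and the classes; a minimal sketch, assuming the cluster column set by the agglomerative model above:
from sklearn.metrics import adjusted_rand_score
#1.0 means the clusters match the classes exactly; values near 0 mean roughly random agreement
print("Adjusted Rand Index: %.3f" % adjusted_rand_score(data['acc'], data['cluster']))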
For my next clustering project I will definitely choose a dataset with continuous variables so that we can get better results.