Is there a way of using one's preferred colors (8 to 10 or more) for different clusters plotted by the following code:

import numpy as np

existing_df_2d.plot(
    kind='scatter',
    x='PC2',y='PC1',
    c=existing_df_2d.cluster.astype(float),  # np.float was removed in NumPy 1.24
    figsize=(16,8))

The code is from here: https://www.codementor.io/python/tutorial/data-science-python-pandas-r-dimensionality-reduction

Thanks

I have tried the following without success:

LABEL_COLOR_MAP = {0 : 'red',
               1 : 'blue',
               2 : 'green',
               3 : 'purple'}

label_color = [LABEL_COLOR_MAP[l] for l in range(len(np.unique(existing_df_2d.cluster)))]

existing_df_2d.plot(
    kind='scatter',
    x='PC2',y='PC1',
    c=label_color, 
    figsize=(16,8))
user27976

1 Answer

You need to add a fifth color (for cluster label 4) and map each row's cluster label to its color through the LABEL_COLOR_MAP dictionary:

LABEL_COLOR_MAP = {0 : 'red',
                   1 : 'blue',
                   2 : 'green',
                   3 : 'purple',
                   4 : 'yellow'}

existing_df_2d.plot(
        kind='scatter',
        x='PC2',y='PC1',
        c=existing_df_2d.cluster.map(LABEL_COLOR_MAP), 
        figsize=(16,8))

This is needed because the data contains five cluster labels:

print(np.unique(existing_df_2d.cluster))
[0 1 2 3 4]
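
The attempt in the question builds its color list from range(len(np.unique(...))), which yields one entry per distinct label rather than one per row, and with only four dictionary keys it would also raise a KeyError for label 4. As a minimal sketch of how this could be extended to 8 to 10 preferred colors, the helper below pairs each distinct label with a color from a palette. The names PREFERRED_COLORS and build_color_map, the palette itself, and the toy cluster Series are assumptions for illustration, not part of the original code:

```python
import pandas as pd

# Hypothetical palette: replace with your own preferred colors.
PREFERRED_COLORS = ['red', 'blue', 'green', 'purple', 'yellow',
                    'orange', 'cyan', 'magenta', 'brown', 'gray']


def build_color_map(labels, palette=PREFERRED_COLORS):
    """Pair each distinct cluster label with one color from the palette."""
    unique_labels = sorted(pd.unique(labels))
    if len(unique_labels) > len(palette):
        raise ValueError('palette has fewer colors than there are clusters')
    return dict(zip(unique_labels, palette))


# Toy stand-in for existing_df_2d.cluster (one label per row).
cluster = pd.Series([2, 3, 3, 0, 4, 1], name='cluster')

color_map = build_color_map(cluster)
point_colors = cluster.map(color_map)  # one color per row, same index as cluster

print(point_colors.tolist())
```

The resulting point_colors Series has the same length and index as the cluster column, so it can be passed as c= in the plot call above.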

All code:

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

tb_existing_url_csv = 'https://docs.google.com/spreadsheets/d/1X5Jp7Q8pTs3KLJ5JBWKhncVACGsg5v4xu6badNs4C7I/pub?gid=0&output=csv'

existing_df = pd.read_csv(
    tb_existing_url_csv, 
    index_col = 0, 
    thousands  = ',')
existing_df.index.names = ['country']
existing_df.columns.names = ['year']

pca = PCA(n_components=2)
pca.fit(existing_df)
existing_2d = pca.transform(existing_df)

existing_df_2d = pd.DataFrame(existing_2d)
existing_df_2d.index = existing_df.index
existing_df_2d.columns = ['PC1','PC2']
existing_df_2d.head()

kmeans = KMeans(n_clusters=5)
clusters = kmeans.fit(existing_df)
existing_df_2d['cluster'] = pd.Series(clusters.labels_, index=existing_df_2d.index)
print(existing_df_2d.head())

                       PC1         PC2  cluster
country                                        
Afghanistan    -732.215864  203.381494        2
Albania         613.296510    4.715978        3
Algeria         569.303713  -36.837051        3
American Samoa  717.082766    5.464696        3
Andorra         661.802241   11.037736        3    

LABEL_COLOR_MAP = {0 : 'red',
                   1 : 'blue',
                   2 : 'green',
                   3 : 'purple',
                   4 : 'yellow'}

existing_df_2d.plot(
        kind='scatter',
        x='PC2',y='PC1',
        c=existing_df_2d.cluster.map(LABEL_COLOR_MAP), 
        figsize=(16,8))

[graph: the resulting scatter plot, colored by cluster]
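
If the link between colors and clusters should be visible in the figure itself, one option is to draw each cluster separately with matplotlib so that a legend can be attached. This is only a sketch under assumptions: the toy frame reuses the five head() rows printed above as a stand-in for existing_df_2d, and the legend text 'cluster N' is an illustrative choice.

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

LABEL_COLOR_MAP = {0: 'red', 1: 'blue', 2: 'green', 3: 'purple', 4: 'yellow'}

# Toy stand-in for existing_df_2d, using the head() rows shown above.
toy = pd.DataFrame(
    {'PC1': [-732.215864, 613.296510, 569.303713, 717.082766, 661.802241],
     'PC2': [203.381494, 4.715978, -36.837051, 5.464696, 11.037736],
     'cluster': [2, 3, 3, 3, 3]},
    index=['Afghanistan', 'Albania', 'Algeria', 'American Samoa', 'Andorra'])

fig, ax = plt.subplots(figsize=(16, 8))
# One scatter call per cluster, so each cluster gets its own legend entry.
for label, group in toy.groupby('cluster'):
    ax.scatter(group['PC2'], group['PC1'],
               color=LABEL_COLOR_MAP[label],
               label='cluster %d' % label)
ax.set_xlabel('PC2')
ax.set_ylabel('PC1')
ax.legend()
```

With the real existing_df_2d in place of toy, the same loop would produce one legend entry per cluster label.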

Testing:

Top 10 rows by column PC2:

print(existing_df_2d.loc[existing_df_2d['PC2'].nlargest(10).index, :])
                          PC1         PC2  cluster
country                                           
Kiribati         -2234.809790  864.494075        2
Djibouti         -3798.447446  578.975277        4
Bhutan           -1742.709249  569.448954        2
Solomon Islands   -809.277671  530.292939        1
Nepal             -986.570652  525.624757        1
Korea, Dem. Rep. -2146.623299  438.945977        2
Timor-Leste      -1618.364795  428.244340        2
Tuvalu           -1075.316806  366.666171        1
Mongolia          -686.839037  363.722971        1
India            -1146.809345  363.270389        1
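
The same check can be made programmatically. The sketch below copies the cluster labels of the ten rows above into a small frame (the name sample and the color column are assumptions for illustration) and confirms that every cluster ends up with exactly one color:

```python
import pandas as pd

LABEL_COLOR_MAP = {0: 'red', 1: 'blue', 2: 'green', 3: 'purple', 4: 'yellow'}

# Cluster labels copied from the top-10 table above.
sample = pd.DataFrame(
    {'cluster': [2, 4, 2, 1, 1, 2, 2, 1, 1, 1]},
    index=['Kiribati', 'Djibouti', 'Bhutan', 'Solomon Islands', 'Nepal',
           'Korea, Dem. Rep.', 'Timor-Leste', 'Tuvalu', 'Mongolia', 'India'])

# Same mapping as in the plot call: one color per row.
sample['color'] = sample['cluster'].map(LABEL_COLOR_MAP)

# Every label must be covered by the dictionary (no NaN colors) ...
missing = sample['color'].isnull().sum()
# ... and each cluster must correspond to exactly one color.
colors_per_cluster = sample.groupby('cluster')['color'].nunique()

print(sample)
print(colors_per_cluster)
```

A NaN in the color column would indicate a cluster label that is missing from LABEL_COLOR_MAP.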
jezrael
  • Thanks a lot @jezrael. But my worry remains that the different colors seem to be all over the place and do not show distinct clusters as on that website. I think a connection between the colors and the clusters is still missing. – user27976 Mar 24 '16 at 12:48
  • Yes, you are right. I edited the answer and added a testing part. – jezrael Mar 24 '16 at 13:14