6 Cluster Refinement

Problem with our initial clustering: many medium-to-large clusters get broken up even though they look cleanly separated on the UMAP projection.

Two ideas for potential refinement:
1. Run the clustering again on the centroids of the clusters.
2. Treat each existing cluster as a single document (i.e. paste all of a cluster's documents together into one document per cluster) and repeat the whole pipeline. I think this is a nice idea, but I don't believe I'll have time to play with it; a rough sketch follows below.
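For the record, here is a minimal sketch of Idea 2, not run and not tuned: it assumes the tm, irlba, uwot and dbscan functions used in earlier sections, cluster.docs as built in Section 6.1 below, and placeholder values for nv and minPts.

# meta_corpus = Corpus(VectorSource(cluster.docs))          # one meta-document per cluster
# meta_tdm = weightTfIdf(TermDocumentMatrix(meta_corpus), normalize = TRUE)
# meta_svd = irlba::irlba(as.matrix(meta_tdm), nv = 50)     # reduce; nv is a placeholder
# meta_ump = uwot::umap(meta_svd$v)                         # documents are columns, so embed V
# meta_clus = hdbscan(meta_ump, minPts = 3)                 # recluster; minPts is a placeholder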

The first idea is easier, so we'll start there:

6.1 Refinement Idea 1: Clustering the centroids

Topic Keywords for Clusters

Get the top words in each cluster for later visualization, and pile all of a cluster's documents into one giant document for the purposes of summarization.

top.words = list()
cluster.docs = vector()
centroids = matrix(NA, k, 2)           # 2-D UMAP centroid of each cluster
mem = matrix(NA, nrow = n, ncol = k)   # cluster membership indicators

for(i in 1:k){
  mem[,i] = clus$cluster == i
  # Top 10 terms by total TF-IDF weight within the cluster
  tdmi = tfidf_tdm[, mem[,i]]
  rs = row_sums(tdmi)
  top.words[[i]] = names(rs[order(rs, decreasing = TRUE)])[1:10]
  # Paste the cluster's documents into one giant document for summarization
  cluster.docs[i] = paste(raw_text[clus$cluster == i], sep = '', collapse = ' ')
  # Centroid of the cluster in the 2-D UMAP embedding
  centroids[i,] = colMeans(svd_ump[clus$cluster == i, ])
}

# First seven key words per cluster, joined with <br> for the plotly hover text
displayWords = vector()
for(i in 1:k){
  displayWords[i] = paste(top.words[[i]][1:7], sep = ' ', collapse = '<br>')
}

Cluster Summarization

To qualitatively evaluate our meta-clustering of the centroids on the visualization, we'll create some naive cluster summaries to display in the plot: within each cluster, sentences are ranked by their eigenvector centrality in the graph induced by cosine similarity between sentences, and the top five are kept.

# SummarySentences = vector()
# for(i in 1:k){
#   t = cluster.docs[i]
#   t = gsub("<br>", " ", t, fixed = TRUE)
#   # Change alternative sentence-ending punctuation to '.'
#   t = gsub("\\?", ".", t)
#   t = gsub("\\!", ".", t)
#   # Split into sentences
#   t2 = strsplit(t, ".", fixed = TRUE)
#   corpus <- Corpus(VectorSource(as.data.frame(t2)[,1]))
#   # Remove stop words and numbers, then stem
#   corpus = tm_map(corpus, removeWords, c(stopwords("en"), 'reuter', 'dlrs', 'mln', 'said', 'will', 'year', 'compani', 'pct', 'corp'))
#   corpus = tm_map(corpus, removeNumbers)
#   corpus = tm_map(corpus, stemDocument)
#   # Remove sentences left empty after removal of stop words and numbers
#   td1 = TermDocumentMatrix(corpus)
#   empty.cols = col_sums(td1) == 0
#   td1 = td1[, !empty.cols]
#   raw = unlist(t2)[!empty.cols]
#   if(any(empty.cols)){corpus = corpus[!empty.cols]}
#   td = weightTfIdf(td1, normalize = TRUE)
#   # Unit-normalize the columns so the crossproduct gives cosine similarity
#   tdm_norm = apply(td, 2, function(x){x / c(sqrt(t(x) %*% x))})
#   tdm_norm = as(tdm_norm, "sparseMatrix")
#   # Cosine similarity between sentences
#   C = t(tdm_norm) %*% tdm_norm
#   # Remove self loops
#   C[C > 0.9999999] = 0
#   g = graph_from_adjacency_matrix(C, weighted = TRUE)
#   # Keep the five most central sentences as the summary
#   j = order(eigen_centrality(g)$vector, decreasing = TRUE)
#   SummarySentences[i] = paste(raw[j[1:5]], sep = ' ', collapse = '<br>')
# }
# save(SummarySentences, file='docs/final_data_plots/SummarySentences.RData')
load('docs/final_data_plots/SummarySentences.RData')
# cen_clus = hdbscan(centroids, 3) # Down to 81 Clusters...Looks Pretty Good. 
# save(cen_clus,file='docs/final_data_plots/cen_clus.RData')
load('docs/final_data_plots/cen_clus.RData')
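A quick sanity check on the loaded result (assuming dbscan's labeling convention, where 0 marks noise):

max(cen_clus$cluster)        # number of centroid-level clusters (81 here)
sum(cen_clus$cluster == 0)   # how many centroids the second pass labels as noise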
fig <- plot_ly(type = 'scatter', mode = 'markers') %>%
  add_trace(x = centroids[,1],
            y = centroids[,2],
            text = ~paste('Key Words:', displayWords, "<br>Cluster Number: ", cen_clus$cluster,
                          "<br>Summary Sentences: ", SummarySentences),
            hoverinfo = 'text',
            color = factor(cen_clus$cluster),
            marker = list(opacity = 0.6),
            showlegend = FALSE)
saveWidget(fig, "docs/Centroid_Clus.html")

Full Page Visualization

Now we just need a function that maps the new centroid clustering back to the original points. Thanks to R's subsetting, this is essentially one line of code (the lookup new = cc[c] in remapClusters below), with one wrinkle: noise points from the original clustering have no centroid. We handle them by recoding original noise as cluster number k+1 and giving that slot the label 0 in the lookup vector, so original noise stays noise. The real problem is that any centroid the second clustering labels as noise also maps its points to 0, making them indistinguishable from the original noise.

An additional thought (not implemented): leave the noise points in and cluster them together with the centroids. This would allow points previously labeled as noise to join a cluster of nearby centroids; a sketch follows below.
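A minimal sketch of that variant, not run: it assumes svd_ump, clus, centroids and k from earlier, hdbscan from the dbscan package, and a placeholder minPts.

# noise = clus$cluster == 0
# aug = rbind(centroids, svd_ump[noise, ])     # the k centroids followed by the noise points
# aug_clus = hdbscan(aug, minPts = 3)          # minPts is a placeholder
# cen_labels = aug_clus$cluster[1:k]           # re-clustered centroid labels, as before
# noise_labels = aug_clus$cluster[-(1:k)]      # former noise points may now join a cluster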

remapClusters = function(cen_clus, clus){
  k = length(clus$cluster_scores)        # number of original clusters
  c = as.vector(clus$cluster)
  c[c == 0] = k + 1                      # recode original noise as cluster k+1
  cc = as.vector(cen_clus$cluster)       # new label for each centroid
  cc[k + 1] = 0                          # slot k+1 maps original noise back to 0
  new = cc[c]                            # the one-line lookup
  return(new)
}
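To see the lookup trick in isolation, here is a toy example with hypothetical labels: k = 3 original clusters, where the centroid clustering merges clusters 1 and 3 and marks cluster 2's centroid as noise.

new_lab = c(2, 0, 2, 0)        # labels for the 3 centroids, plus slot k+1 = 4 for noise
old_lab = c(1, 3, 2, 4, 1)     # per-point labels, original noise recoded to 4
new_lab[old_lab]               # 2 2 0 0 2 -- cluster 2's points now look like noise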

Visualization of Refined Clusters

newclusters = remapClusters(cen_clus, clus) 

# Color the original points by their refined (centroid-level) cluster
fig <- plot_ly(type = 'scatter', mode = 'markers')
fig <- fig %>%
  add_trace(
    x = svd_ump[,1],
    y = svd_ump[,2],
    text = ~paste('Heading:', head, "<br>Text: ", raw_text, "<br>Original Cluster Number: ", clusters, "<br>Centroid Cluster Number:", newclusters),
    hoverinfo = 'text',
    color = factor(newclusters),
    marker = list(opacity = 0.6),
    showlegend = FALSE
  )

# saveWidget(fig, "docs/All_centroid_refined_clusters.html")

Full Page Visualization