4 Clustering: HDBSCAN
Key Takeaways:
- Clustering 20,000 documents is never going to be clean. HDBSCAN is a nice algorithm for the task because it finds dense clusters of points amidst a sea of noise - that is, it identifies points that don’t belong to any area of higher density and labels them as noise. Are they necessarily noise? No. Should we ignore these documents altogether? No. But we ought to treat them differently from the denser regions that are more clearly clusters. Maybe noise points need a fuzzy treatment where they are compared to nearby clusters and given scores that measure the extent to which they belong in each one (a rough sketch of this idea follows the list).
- Getting a small, manageable number of clusters is unlikely to be helpful. A cluster like “data reports” might contain earnings releases, foreign debt reports, government employment reports, etc. - broad topics provide no specificity.
- Getting a thousand clusters with no idea of how those clusters relate to each other is also unhelpful. In that situation, topics are just overly specific islands.
- Ideally we want many specific clusters that themselves group into larger topics. Take the “data reports” example from above: if we could have all of the subtopics (earnings reports, foreign debt reports, government employment reports, etc.) and know that they fall under the broader category of “data reports”, then we’ve found a nice organization of the corpus. HDBSCAN is typically a good tool for this task.
- The result of clustering the UMAP projection was nicer than clustering in the SVD space or the GloVe space.
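To make that fuzzy treatment concrete, here is a minimal sketch in R. The function soft_scores_for_noise is a hypothetical helper (not part of any package), and scoring noise points by normalized inverse distance to cluster centroids is just one simple stand-in for a proper soft-membership measure. It assumes an n x 2 embedding matrix and dbscan-style labels where 0 marks noise.

# Hypothetical helper: soft scores for noise points against each cluster.
# emb: n x 2 embedding matrix; cl: cluster labels with 0 = noise.
soft_scores_for_noise = function(emb, cl) {
  # centroids: one column per cluster (clusters are labeled 1..k)
  centroids = sapply(sort(unique(cl[cl > 0])), function(k)
    colMeans(emb[cl == k, , drop = FALSE]))
  noise_idx = which(cl == 0)
  # for each noise point, normalized inverse distance to every centroid
  t(sapply(noise_idx, function(i) {
    d = sqrt(colSums((centroids - emb[i, ])^2))
    w = 1 / (d + 1e-12)
    w / sum(w)
  }))
}

Each row sums to 1, so a noise point sitting between two clusters gets split scores rather than a single hard label.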
To get clusters, we consider just two options here - we can cluster this UMAP projection, or we can cluster some higher-dimensional representation (like the singular vectors themselves) and see how that looks in the UMAP space. UMAP clustering seems to perform really well, even better than singular-vector input, so we stick with it (a rough version of the comparison is sketched below). We define $n$ as the number of documents and $k$ as the number of clusters.
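The comparison can be run directly. This is a sketch under some assumptions: that hdbscan here is dbscan::hdbscan, that a matrix svd_vecs (a hypothetical name for the document-by-singular-vector matrix) exists alongside the 2-D projection svd_ump, and that the fraction of documents labeled noise is a reasonable first proxy for how well each space clusters.

library(dbscan)
clus_map = hdbscan(svd_ump, minPts = 6)   # option 1: cluster the 2-D UMAP projection
clus_svd = hdbscan(svd_vecs, minPts = 6)  # option 2: cluster the singular vectors
mean(clus_map$cluster == 0)  # fraction of documents labeled noise (0), UMAP input
mean(clus_svd$cluster == 0)  # same fraction, singular-vector input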
Hierarchical DBSCAN is a fast algorithm that adapts the ideas of single-linkage clustering (minimum spanning trees) to DBSCAN (density-based spatial clustering of applications with noise) to create a hierarchical map of density-based clusters.
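The density adaptation comes from the mutual reachability distance. Writing $\mathrm{core}_m(a)$ for the distance from point $a$ to its $m$-th nearest neighbor,

$$d_{\text{mreach-}m}(a, b) = \max\left\{\mathrm{core}_m(a),\ \mathrm{core}_m(b),\ d(a, b)\right\}$$

Building a minimum spanning tree under this metric and running single linkage on it pushes sparse points far away from everything, and condensing the resulting hierarchy keeps only clusters that persist across a range of density thresholds.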
# Fit HDBSCAN (minPts = 6) on the UMAP embedding once, cache the result, then reload it:
# clus = hdbscan(svd_ump,6)
# save(clus,file='docs/final_data_plots/alldocs_hdbscan_of_map6.RData')
load('docs/final_data_plots/alldocs_hdbscan_of_map6.RData')
n = length(clus$cluster)
(k = length(clus$cluster_scores))
## [1] 590
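A quick way to see how those 590 clusters break down - a sketch assuming dbscan-style labels where 0 marks noise, with outputs omitted since they depend on the fitted object:

sum(clus$cluster == 0)  # how many documents were labeled noise
summary(as.vector(table(clus$cluster[clus$cluster > 0])))  # distribution of cluster sizes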
We get a LOT of clusters from hdbscan - this makes sense; there is a lot going on in this corpus! But it might be nice to refine those clusters so that we can see which ones are related. We’ll get to that after we explore this great visualization.
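A bare-bones stand-in for that visualization might look like the following sketch, assuming svd_ump is the n x 2 UMAP embedding and clus is the fitted object from above.

pal = rainbow(max(clus$cluster))            # one color per cluster
cols = rep('grey80', length(clus$cluster))  # noise points (label 0) stay grey
cols[clus$cluster > 0] = pal[clus$cluster[clus$cluster > 0]]
plot(svd_ump, col = cols, pch = 16, cex = 0.3,
     xlab = 'UMAP 1', ylab = 'UMAP 2',
     main = 'HDBSCAN clusters of the UMAP projection')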