Wikipedia Dendrograms

Clustering is a form of unsupervised learning that groups a set of objects so that members of the same group (called a cluster) are more similar to each other than to members of other groups. This kind of grouping has many uses in business, such as targeting advertisements, picking the best salesperson for a given lead, or selecting a subset of a larger dataset for more intensive analysis.

This example is an agglomerative hierarchical clustering with strict partitioning: each node appears exactly once in a tree that was built by merging leaf nodes into branches, then merging those branches until a small number of root nodes remains.
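As a rough illustration of the bottom-up merging described above, here is a minimal Python sketch. It is not the algorithm used to produce these dendrograms; the names `agglomerate`, `leaves`, and `num_roots` are made up for the example, and centroid linkage with Euclidean distance is an assumed merge rule chosen for simplicity. Leaf labels are assumed to be strings.

```python
import math
from itertools import combinations

def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def leaves(node):
    """Leaf labels under a node; a node is a label or a (left, right) tuple."""
    if isinstance(node, tuple):
        return leaves(node[0]) + leaves(node[1])
    return [node]

def agglomerate(vectors, num_roots=1):
    """Merge the two closest trees until only num_roots trees remain.

    vectors: dict mapping a leaf label (string) to its feature vector.
    Returns a list of trees, each a label or a nested (left, right) tuple.
    """
    # Every leaf starts as its own single-node tree, paired with its points.
    forest = [(label, [vec]) for label, vec in vectors.items()]

    def gap(pair):
        # Assumed merge rule: distance between the two trees' centroids.
        a, b = pair
        return math.dist(centroid(forest[a][1]), centroid(forest[b][1]))

    while len(forest) > num_roots:
        i, j = min(combinations(range(len(forest)), 2), key=gap)
        (tree_i, pts_i), (tree_j, pts_j) = forest[i], forest[j]
        merged = ((tree_i, tree_j), pts_i + pts_j)
        # Replace the two merged trees with their new parent branch.
        forest = [t for k, t in enumerate(forest) if k not in (i, j)] + [merged]
    return [tree for tree, _ in forest]

# Hypothetical 2-D points standing in for page vectors.
pages = {
    "a": [0.0, 0.0], "b": [0.0, 1.0],
    "c": [10.0, 10.0], "d": [10.0, 11.0],
    "e": [50.0, 50.0],
}
roots = agglomerate(pages, num_roots=2)
print(roots)
```

Because each merge removes two trees and adds one parent, every leaf ends up in exactly one tree, which is the strict partitioning property. The far-away point "e" stays a root of its own here. This brute-force search rescans all pairs on every merge, so it only suits tiny inputs.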

These dendrograms are samples from a clustering of 300,000 Wikipedia pages. The clustering was performed using a proprietary algorithm with its roots in K-Means, TF-IDF, and Euclidean distance. The computation ran on Hadoop using a nine-computer Amazon EMR cluster and took a little under three hours.
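For readers unfamiliar with those building blocks, the following Python sketch shows one common way to weight terms with TF-IDF and compare documents by Euclidean distance. It is not the proprietary algorithm and leaves out K-Means entirely; the function names and the particular weighting (term frequency times the natural log of inverse document frequency) are assumptions for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Turn tokenized documents into sparse TF-IDF vectors (term -> weight)."""
    n = len(docs)
    # Document frequency: how many documents contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            # Term frequency within the document times inverse document frequency.
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in counts.items()
        })
    return vectors

def euclidean(u, v):
    """Euclidean distance between two sparse vectors stored as dicts."""
    terms = set(u) | set(v)
    return math.sqrt(sum((u.get(t, 0.0) - v.get(t, 0.0)) ** 2 for t in terms))

# Three tiny made-up "pages", already tokenized.
docs = [
    ["symphony", "orchestra", "conductor"],
    ["symphony", "composer", "orchestra"],
    ["highway", "bridge", "tunnel"],
]
vecs = tf_idf(docs)
print(euclidean(vecs[0], vecs[1]))  # the two music pages
print(euclidean(vecs[0], vecs[2]))  # a music page vs. a transportation page
```

In this toy input the two music pages share terms and so sit closer together than the music and transportation pages, which is the signal a distance-based clustering relies on.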

All metadata, links, and classification info available in the Wikipedia pages were removed before processing to ensure that the clustering was based entirely on term analysis. The only built-in knowledge of the English language was provided by the Lucene Snowball term stemmer.
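A preprocessing step of this kind might look roughly like the Python sketch below. It assumes the input is raw wikitext, which the description above does not state, and it handles only simple, non-nested markup. The `crude_stem` function is a toy suffix stripper standing in for a real stemmer; it is not the Snowball algorithm and not the Lucene API. All function names are hypothetical.

```python
import re

def strip_markup(wikitext):
    """Drop templates and category tags; reduce links to their visible text."""
    # Non-nested templates such as infoboxes: {{...}}
    text = re.sub(r"\{\{[^{}]*\}\}", " ", wikitext)
    # Classification tags: [[Category:...]]
    text = re.sub(r"\[\[Category:[^\]]*\]\]", " ", text)
    # [[target|label]] -> label, [[target]] -> target
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
    return text

def crude_stem(word):
    """Toy stand-in for a stemmer: strip one common suffix. NOT Snowball."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def terms(wikitext):
    """Lowercase, tokenize on letters only, and stem each token."""
    words = re.findall(r"[a-z]+", strip_markup(wikitext).lower())
    return [crude_stem(w) for w in words]

sample = (
    "{{Infobox person}} [[Ludwig van Beethoven|Beethoven]] "
    "conducted orchestras. [[Category:Composers]]"
)
print(terms(sample))
```

The point of the sketch is only that, after stripping, the term list carries no hint of the page's categories or link targets, so any grouping has to come from the words themselves.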

I am a server-side and algorithms guy, so please forgive the stark user interface.

All charts and data are copyright 2015 Robert Bushman. All Rights Reserved.

Symphony Music:
thumbnail of a few nodes of the symphony dendrogram
In this cluster, the system detected the relationship between composers, orchestras, musicians, and conductors. Click to see the full graph.

Transportation:
thumbnail of a few nodes of the transportation dendrogram
This cluster shows a detected relationship between highways, bridges, tunnels, and railways. Click to see the full graph.

Auto Racing:
thumbnail of a few nodes of the racing dendrogram
This large cluster captures a variety of auto racing events, courses, cars, and drivers. It also has a number of fine-scale clusters that capture close associations that are often beyond the grasp of unsupervised learning. Click to see the full graph.

Professional Golf:
thumbnail of a few nodes of the golf dendrogram
This cluster shows the system's ability to collect golfers, PGA events, and notable country clubs. Click to see the full graph.

Architecture:
thumbnail of a few nodes of the architecture dendrogram
This shows the system's detection of the relationship between civic areas known for their architecture, architecturally significant buildings, and large or famous constructions. Click to see the full graph.

Cars & Bikes:
thumbnail of a few nodes of the cars and bikes dendrogram
In this example, the system has captured a wide array of passenger vehicles. Once again, a number of fine-grained subclusters show the ability to capture often-elusive detailed relationships. Click to see the full graph.

Socioeconomics:
thumbnail of a few nodes of the socioeconomics dendrogram
This cluster contains many of the topics associated with the study of human social interaction and organization. Click to see the full graph.