Wikipedia Dendrograms
Clustering is a form of unsupervised learning used to group a set of objects in such a way that members in the same group (called a cluster) are more similar to each other than to members of other groups. There are many uses for this kind of classification in business, such as targeting advertisements, picking the best sales person for a given lead, or selecting a subset of a larger dataset for more intensive analysis.
This example is an agglomerative hierarchical clustering using strict partitioning, meaning each node appears exactly once in a tree that was built by merging leaf nodes into branches, then merging those branches until a small number of root nodes was reached.
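To make the merge process concrete, here is a minimal sketch of agglomerative clustering in Python. It is not the algorithm used for these dendrograms; the function names, the centroid-based merge rule, and the toy two-dimensional vectors are all assumptions chosen for illustration.

```python
from itertools import combinations
from math import dist


def centroid(points):
    """Mean of a list of equal-length numeric vectors."""
    return [sum(axis) / len(points) for axis in zip(*points)]


def leaves(node):
    """Collect the leaf labels under a tree node."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]


def agglomerate(vectors, n_roots=1):
    """Merge the two closest clusters until n_roots trees remain.

    vectors: dict mapping a leaf label to its numeric vector.
    Returns a list of trees; a tree is a label or a (left, right) tuple.
    """
    # Each cluster is (tree, member_vectors); every leaf starts alone.
    clusters = [(label, [vec]) for label, vec in vectors.items()]
    while len(clusters) > n_roots:
        # Pick the pair whose centroids are closest (an assumed merge rule).
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: dist(centroid(clusters[p[0]][1]),
                               centroid(clusters[p[1]][1])),
        )
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        # Replace the two merged clusters with their new parent branch.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return [tree for tree, _ in clusters]


# Hypothetical toy data: two "music" pages and two "infrastructure" pages.
pages = {
    "Mozart": [0.9, 0.1],
    "Beethoven": [0.8, 0.2],
    "Highway": [0.1, 0.9],
    "Bridge": [0.2, 0.8],
}
roots = agglomerate(pages, n_roots=2)
```

Because a merge always removes two clusters and adds their parent, every leaf ends up under exactly one root, which is the strict partitioning property described above. This naive version rescans every pair on each pass, so it is only suitable for tiny inputs.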
These dendrograms are samples from a clustering of 300,000 Wikipedia pages. The clustering was performed using a proprietary algorithm with its roots in K-Means, TF-IDF, and Euclidean distance. The computation was run in Hadoop on a nine-computer Amazon EMR cluster and took a little under three hours.
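The proprietary algorithm itself is not shown here, but two of the building blocks it draws on are easy to sketch. The Python below computes TF-IDF vectors for a few tokenized documents and compares them by Euclidean distance. The weighting formula (term frequency times the log of inverse document frequency), the function names, and the three sample documents are assumptions for illustration only, and each document is assumed to be non-empty.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Turn tokenized documents into TF-IDF vectors over a shared vocabulary.

    docs: list of non-empty token lists. Returns (vocabulary, vectors).
    """
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([
            (counts[term] / len(doc)) * math.log(n_docs / df[term])
            for term in vocab
        ])
    return vocab, vectors


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical miniature "pages", already reduced to bare terms.
docs = [
    "symphony orchestra conductor".split(),
    "orchestra conductor violin".split(),
    "highway bridge tunnel".split(),
]
vocab, vecs = tf_idf(docs)

music_gap = euclidean(vecs[0], vecs[1])    # two music pages
mixed_gap = euclidean(vecs[0], vecs[2])    # music page vs. road page
```

With these inputs the two music documents sit closer together than the music and road documents, which is the kind of signal a distance-based clusterer works from when it has nothing but term counts.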
All metadata, links, and classification info available in the Wikipedia pages was removed before processing to ensure that the clustering was entirely based on term analysis. The only built-in knowledge of the English language was provided by the Lucene Snowball term stemmer.
I am a server-side and algorithms guy, so please forgive the stark user interface.
All charts and data are copyright 2015 Robert Bushman. All Rights Reserved.
In this cluster, the system detected the relationship between composers, orchestras, musicians, and conductors. Click to see the full graph.
This cluster shows a detected relationship between highways, bridges, tunnels, and railways. Click to see the full graph.
This large cluster captures a variety of auto racing events, courses, cars, and drivers. It also has a number of fine scale clusters that capture close associations that are often beyond the grasp of unsupervised learning. Click to see the full graph.
This cluster shows the system's ability to collect golfers, PGA events, and notable country clubs. Click to see the full graph.
This shows the system's detection of the relationship between civic areas known for their architecture, architecturally significant buildings, and large or famous constructions. Click to see the full graph.
Cars & Bikes:
In this example, the system has captured a wide array of passenger vehicles. Once again, a number of fine-grained subclusters show the ability to capture often-elusive detailed relationships. Click to see the full graph.
This cluster contains many of the topics associated with the study of human social interaction and organization. Click to see the full graph.