Scala Machine Learning Projects

Unsupervised machine learning

Unsupervised learning is a class of machine learning algorithms used to group related data objects and find hidden patterns by drawing inferences from unlabeled datasets, that is, training sets consisting of input data without labels.

Let's see a real-life example. Suppose you have a large collection of non-pirated and totally legal MP3 files in a crowded and massive folder on your hard drive. Now, what if you could build a predictive model that helps you automatically group together similar songs and organize them into your favorite categories, such as country, rap, and rock?

This is the act of assigning each item to a group, so that every MP3 is added to its respective playlist, in an unsupervised way. Classification, by contrast, assumes that you are given a training dataset of correctly labeled data. Unfortunately, we do not always have that luxury when we collect data in the real world.

For example, suppose we would like to divide a huge collection of music into interesting playlists. How can we possibly group songs together if we do not have direct access to their metadata? One possible approach is to combine various ML techniques, but clustering is often at the heart of the solution:

Figure 7: Clustering data samples at a glance

In other words, the main objective of unsupervised learning algorithms is to explore unknown or hidden patterns in unlabeled input data. Unsupervised learning also comprises other techniques that explain the key features of the data in an exploratory way. When no labels are available, as in the playlist example, clustering techniques are widely used to group unlabeled data points based on a chosen similarity measure.
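To make the idea of grouping by similarity concrete, here is a minimal sketch of k-means clustering in plain Scala, with no external libraries. It is only an illustration, not the method used elsewhere in this book. The names `KMeansSketch`, `squaredDistance`, `nearest`, `mean`, and `kMeans` are hypothetical, and the sketch assumes that each song has already been turned into a numeric feature vector and that squared Euclidean distance is an acceptable similarity measure.

```scala
object KMeansSketch {
  // A data point is a vector of numeric features (hypothetical representation).
  type Point = Vector[Double]

  // Squared Euclidean distance: the similarity measure assumed in this sketch.
  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the centroid closest to the given point.
  def nearest(p: Point, centroids: Vector[Point]): Int =
    centroids.indices.minBy(i => squaredDistance(p, centroids(i)))

  // Component-wise mean of a non-empty group of points.
  def mean(points: Vector[Point]): Point =
    points.transpose.map(col => col.sum / col.size)

  // Alternate between assigning points to their nearest centroid and
  // recomputing each centroid, until nothing changes or the iteration
  // limit is reached.
  def kMeans(data: Vector[Point],
             initial: Vector[Point],
             maxIterations: Int = 20): Vector[Point] = {
    var centroids = initial
    var changed = true
    var iteration = 0
    while (changed && iteration < maxIterations) {
      val groups = data.groupBy(p => nearest(p, centroids))
      val updated = centroids.indices.toVector.map { i =>
        // A centroid that attracted no points is left where it is.
        groups.get(i).map(mean).getOrElse(centroids(i))
      }
      changed = updated != centroids
      centroids = updated
      iteration += 1
    }
    centroids
  }
}
```

Given a handful of made-up two-feature songs, calling `KMeansSketch.kMeans(songs, Vector(songs(0), songs(3)))` returns two centroids, and `songs.map(KMeansSketch.nearest(_, centroids))` yields a cluster index per song. No labels are involved at any point: the groups emerge purely from the distances between the feature vectors. The result depends on the initial centroids, which are picked by hand here for simplicity.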