Unsupervised vs. Supervised Classification Methods

April 07, 2017

Mwalimu Phiri

Unsupervised and supervised classification are both machine learning methods, but they differ in how they’re applied, which types of data they can use, and the parameters or conditions involved. Unsupervised learning can take unlabeled data, without any training samples, and create groupings or clusters. Supervised learning requires training samples with a defined classification. Though supervised learning allows us to define the classes, it leaves room for human error in how the training samples are chosen and labeled. Unsupervised learning, on the other hand, requires domain knowledge to ensure that valuable information is obtained from the groupings.

Different Challenges:

The challenge with unsupervised classification is that the machine chooses the categories of the data automatically. The model can behave like a “black box” when defining clusters, so the resulting categories may not map onto the objective as expected. This is where domain and industry knowledge is critical in validating the unsupervised classifications. In some models, such as the popular k-means algorithm, the number of clusters is set by the user up front.
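To make the point about setting the cluster count concrete, here is a minimal k-means sketch in plain Python. It is an illustration only, not the method used in any particular tool: the pixel values, the `kmeans` helper, and its `iterations` and `seed` parameters are all made up for this example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20, seed=0):
    """Group points into k clusters; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the data points
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return centroids, labels

# Hypothetical two-band pixel values; no labels are supplied.
pixels = [(0.10, 0.20), (0.15, 0.22), (0.12, 0.18),
          (0.80, 0.90), (0.85, 0.88), (0.82, 0.95)]

# The user only chooses k; the algorithm decides what the groups are.
centroids, labels = kmeans(pixels, k=2)
print(labels)
```

The cluster ids that come back are just numbers. Deciding whether cluster 0 means water, grassland, or nothing useful at all is the part that still needs domain knowledge.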

The challenge with supervised learning is its dependence on sample data to train the models to an acceptable level of classification. If our sample data is not aligned, in its definitions and parameters, with the data we plan on analyzing, then the model could classify targets erroneously. Whether the sample data needs to include the target variable depends on the objective, for example classifying outliers versus group associations.
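A small sketch of that dependence, assuming a simple nearest-centroid classifier (my choice for illustration, not a method from the referenced example). The training values and the `train_centroids` and `classify` helpers are hypothetical; the class names are borrowed from the land-use example discussed later.

```python
def train_centroids(samples):
    """samples: list of (features, label). Returns {label: mean features}."""
    grouped = {}
    for features, label in samples:
        grouped.setdefault(label, []).append(features)
    return {label: tuple(sum(col) / len(rows) for col in zip(*rows))
            for label, rows in grouped.items()}

def classify(model, features):
    """Return the label whose centroid is closest to the features."""
    return min(model, key=lambda label: sum(
        (a - b) ** 2 for a, b in zip(features, model[label])))

# Hypothetical training samples, with reflectance on a 0-1 scale.
training = [
    ((0.05, 0.10), "water"),
    ((0.08, 0.12), "water"),
    ((0.30, 0.60), "grassland"),
    ((0.35, 0.65), "grassland"),
]
model = train_centroids(training)

aligned = (0.07, 0.11)  # a water pixel on the same 0-1 scale
misaligned = (18, 28)   # roughly the same pixel stored on a 0-255 scale

print(classify(model, aligned))     # water
print(classify(model, misaligned))  # grassland
```

The second pixel is misclassified only because its scale differs from the training samples, which is the kind of misalignment described above.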

Examples:

Chris Banman gives an example of supervised and unsupervised classification in image processing (Banman, 2002). For the supervised learning approach, Banman uses a raster image with defined training sites (water, grassland, etc.) to outline spectral signatures. The user trains the model on the image variables. The model then uses statistical measures to support its classifications based on the user’s parameters.

Another example is shown with unsupervised classification on the same image, but this time no training is involved (no predefined variables or groupings). Through the ISOCLUST classification method, the user is able to obtain similar or even better digital mapping results. The unsupervised classification was able to separate small pockets of different land areas that a human could not differentiate.

Thoughts:

Unsupervised learning came out on top in this example, but that’s not always the case. Validating the classification models seems to be the easier part of the whole process, while pre-processing is critical for both kinds of classification. If the data is not defined or scaled properly during pre-processing, it can negatively affect model training, and validation can be skewed if the models are trained on skewed input data. How well an unsupervised classification model can be validated depends on the developer’s expertise in the field being classified. Though we call it unsupervised, there’s always some sort of iterative supervision during the grouping or validation process.

Though these forms of classification are different, they can be used in parallel or one after the other. Unsupervised learning can be used to group large data sets into clusters, and supervised learning can be implemented afterwards with a defined response variable and parameters. The response variable could be based on the different cluster relationships, from which we could calculate the likelihood of obtaining a target land class from the examples.
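Here is one rough way the second, supervised stage could look once cluster ids already exist. Everything in it is an assumption made for illustration: the pixel values, the `cluster_ids` (taken as the output of an earlier unsupervised run such as k-means), the handful of ground-truth labels in `known_classes`, and the helper names.

```python
from collections import Counter, defaultdict

# Assumed output of an earlier unsupervised run (e.g. k-means with k=2).
pixels = [(0.10, 0.20), (0.15, 0.22), (0.12, 0.18),
          (0.80, 0.90), (0.85, 0.88), (0.82, 0.95)]
cluster_ids = [0, 0, 0, 1, 1, 1]

# Hypothetical ground truth for a few pixels (index -> land class).
known_classes = {0: "water", 1: "water", 3: "grassland", 4: "grassland"}

# Supervised stage: the cluster id is the response variable, and the
# "model" is simply the mean feature vector of each cluster.
by_cluster = defaultdict(list)
for pixel, cid in zip(pixels, cluster_ids):
    by_cluster[cid].append(pixel)
model = {cid: tuple(sum(col) / len(rows) for col in zip(*rows))
         for cid, rows in by_cluster.items()}

def predict_cluster(pixel):
    """Assign a new pixel to the cluster with the nearest mean."""
    return min(model, key=lambda cid: sum(
        (a - b) ** 2 for a, b in zip(pixel, model[cid])))

# Count how often each land class appears in each cluster.
class_counts = defaultdict(Counter)
for idx, land_class in known_classes.items():
    class_counts[cluster_ids[idx]][land_class] += 1

def class_likelihood(pixel):
    """Share of each known land class in the pixel's predicted cluster."""
    counts = class_counts.get(predict_cluster(pixel), Counter())
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()} if total else {}

print(class_likelihood((0.11, 0.19)))
```

With only a few labeled pixels per cluster, these shares are crude estimates, so they are better read as a sanity check on the clusters than as calibrated probabilities.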

References:

Banman, C. (2002). Supervised and Unsupervised Land Use Classification. Retrieved from Emporia State University: http://academic.emporia.edu/aberjame/student/banman5/perry3.html#intro
