What I Find Interesting Yet Contradictory About SAS SEMMA

INTRODUCTION:
Although I am familiar with other data mining processes, SAS SEMMA (Sample > Explore > Modify > Model > Assess) is new to me. I have not yet completed a full implementation with SEMMA, but as I learn more about it, I find its process simpler to follow than the other standards I have worked with. The two SEMMA steps I find most interesting are Sample and Modify.

MOST IMPORTANT: 
Although the Modify step is the most interesting to me, I think the Sample step is the most important. If the sample data is clear and complete for the objective, the remaining steps become simpler and the resulting model is more likely to return useful information. Essentially, 'garbage in, garbage out'.
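As a rough illustration of what a "clear and complete" sample can mean, here is a minimal sketch of stratified sampling, which keeps each group's share of the sample close to its share of the population. The data, the `status` field and the `stratified_sample` helper are all hypothetical, and this is only one of several reasonable ways to draw a sample.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(rows, key, fraction, seed=0):
    """Sample `fraction` of the rows from each group defined by `key`."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    sample = []
    for members in groups.values():
        # Take at least one row per group so rare groups are not lost.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 10 "failed" records and 90 "ok" records.
population = [{"id": i, "status": "failed" if i < 10 else "ok"} for i in range(100)]

sample = stratified_sample(population, key="status", fraction=0.2)
counts = Counter(row["status"] for row in sample)
print(counts)
```

A plain random sample of 20 rows could easily contain no "failed" records at all, which is the kind of gap that makes the later steps harder.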

MOST INTERESTING: 
Other SEMMA steps, such as sampling and exploring data sets, are critical to data mining, but I personally like the Modify step because it is where the model inputs are defined and rationalized. I view the Modify step as the place where I put the data into my technical sandbox, or playground, and shape it toward the objective of the model.
Another reason I like the Modify step is that it is where we can cleanse the data. Simple transformations such as binning, or dropping or imputing missing values, can greatly affect a model, positively or negatively. Data comes in all shapes and sizes, and if our starting point is completely off target, we can modify the data to make it more comprehensible to the model and, ultimately, to the decision makers. Essentially, 'garbage in, could be intelligence out'.
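To make those transformations concrete, here is a minimal sketch in plain Python. The records, the field names and the age bands are hypothetical, and filling gaps with the median is just one assumed way of handling missing values. In practice a data frame library or the SAS tooling would do this work.

```python
from statistics import median

# Hypothetical raw records; None marks a missing value.
records = [
    {"age": 23, "income": 41000},
    {"age": 37, "income": None},
    {"age": None, "income": 58000},
    {"age": 61, "income": 72000},
    {"age": 45, "income": None},
]

def drop_missing(rows, field):
    """Keep only the rows where `field` is present."""
    return [r for r in rows if r[field] is not None]

def impute_median(rows, field):
    """Replace missing `field` values with the median of the observed ones."""
    fill = median(r[field] for r in rows if r[field] is not None)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]

def bin_age(age):
    """Map an age to a coarse band (the cut points are arbitrary)."""
    if age < 30:
        return "under_30"
    if age < 50:
        return "30_to_49"
    return "50_plus"

# Drop rows with no age, fill missing incomes, then add the age band.
cleaned = impute_median(drop_missing(records, "age"), "income")
modified = [{**r, "age_band": bin_age(r["age"])} for r in cleaned]

for row in modified:
    print(row)
```

Each choice here (which rows to drop, which value to fill in, where to cut the bands) changes what the model sees, which is why this step can help or hurt.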

SUMMARY AND CONTRADICTION: 
On projects, I have found modifying data to be a critical stage because it can determine whether users end up with a model that is overfit or simply incorrect. As much as I like this step, I also find it contradictory: I would like my models to be objective and free of bias (impossible), with little to no influence from external factors, and yet modifying the data can be necessary. Supervised learning models can require users to control the inputs and modifications. This is where machine learning and deep learning excite me further, because some models could be configured to automate inputs and modifications depending on the target variable(s). A quick example is computer application systems. Applications, features and functions fail every so often, and it can be difficult for a human to predict those failures by tracking event logs as new updates and upgrades are deployed. In summary, as data inputs, algorithms and technologies change over time, the Modify step proves to be both critical and interesting.

Mwalimu Phiri
