SAS Enterprise Miner: Feature Transformation Options to Increase Predictive Strength

Transformations are sometimes necessary when a user wants to improve the performance of their models, directly or indirectly. A data set may contain variables with too many missing values or extreme variance, or simply too many input variables, any of which could lead to incorrect relationships and poor model performance. Though it may not always be necessary, we could improve our models during data preparation by transforming our features to increase their predictive value.

One way to measure a variable's strength in terms of predictive value is the Gini statistic, a measure of dispersion within a distribution. Some variables could start off with a low Gini statistic but, once transformed, could bring value to the model.
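
To make the idea concrete, here is a minimal sketch of how a transformation can change a variable's Gini statistic. It assumes a binary target and the convention Gini = 2 × AUC − 1, which is common in scoring work but may not be exactly how Enterprise Miner computes its statistic. The function names and the toy data are made up for illustration.

```python
def auc(scores, labels):
    """Area under the ROC curve by pairwise comparison (ties count as half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def gini(scores, labels):
    """Gini statistic under the assumed convention Gini = 2 * AUC - 1."""
    return 2.0 * auc(scores, labels) - 1.0


# Toy data: the target is 1 when the input is far from zero in either direction.
x_raw = [-3, -2, -1, 0, 1, 2, 3]
y = [1, 1, 0, 0, 0, 1, 1]

# Square transformation of the same variable.
x_squared = [v * v for v in x_raw]

print(gini(x_raw, y))      # 0.0: the raw variable does not rank the target
print(gini(x_squared, y))  # 1.0: the squared variable ranks it perfectly
```

Note that a monotone transformation such as the log leaves this rank-based statistic unchanged, so the gain here comes from the square changing the order of the values.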

Transforming variables falls under the Modify group of nodes on the SAS Enterprise Miner diagram toolbar.
There are different ways in which we could approach variable transformations depending on the model objectives. Interval and class variables are transformed with different methods.
Interval Variables:
  • Simple transformations
    • Log – transform the variable by taking the natural log
    • Square – replace the variable with its square
    • Standardization – subtract the mean and divide by the standard deviation
  • Binning – break an interval variable down into an ordinal grouped variable. Adjust the range parameters to find the groups that best fit the model
    • Bucket – divide the values into equally spaced intervals between the minimum and the maximum
    • Quantile – divide the values into groups that have the same frequency
    • Optimal Binning for Relationship to Target – used on binary targets; the data is binned in order to optimize the relationship to the target
  • Best Power transformations
    • Max Normality – maximize the chance of obtaining a normal distribution
    • Equalize Spread with Target Levels – with a class target, variables can be transformed to give the minimum variance between target levels
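
The simple transformations and the two unsupervised binning methods listed above can be sketched in a few lines of Python. This is an illustration of the arithmetic only, not Enterprise Miner's implementation: the function names are hypothetical, the standardization assumes the sample standard deviation, and the quantile binning assigns bins by rank.

```python
import math
import statistics


def log_transform(values):
    """Natural log of each value; assumes every value is strictly positive."""
    return [math.log(v) for v in values]


def standardize(values):
    """Subtract the mean and divide by the (sample) standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def bucket_bins(values, n_bins):
    """Equal-width bins between the minimum and maximum; returns bin indexes."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:  # all values identical: everything goes in one bin
        return [0] * len(values)
    # The maximum falls exactly on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def quantile_bins(values, n_bins):
    """Equal-frequency bins assigned by rank; returns bin indexes."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


# Toy interval variable with one extreme value.
values = [1, 2, 3, 4, 5, 6, 7, 100]

print(bucket_bins(values, 4))    # [0, 0, 0, 0, 0, 0, 0, 3]
print(quantile_bins(values, 4))  # [0, 0, 1, 1, 2, 2, 3, 3]
```

With an extreme value in the data, bucket binning puts almost every observation in the first bin, while quantile binning keeps the groups the same size. That difference is one reason to adjust the range parameters rather than accept the first set of groups.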
Class Variables:
  • Group levels that are rare
    • Create an outlier feature or group
  • Develop dummy indicator variables
  • Binning – generate intervals on continuous or class variables, thereby improving predictive power
    • Create new variables
  • Variable clustering – in cases of correlated variables, they can be clustered to decrease collinearity
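
The first two class-variable options above, grouping rare levels and building dummy indicators, could look roughly like this in Python. The function names, the "OTHER" label and the minimum-count threshold are assumptions made for this sketch and are not taken from Enterprise Miner.

```python
from collections import Counter


def group_rare_levels(values, min_count, other_label="OTHER"):
    """Replace levels seen fewer than min_count times with a single label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


def dummy_indicators(values):
    """Build one 0/1 indicator column per level, keyed by level name."""
    levels = sorted(set(values))
    return {level: [1 if v == level else 0 for v in values] for level in levels}


# Toy class variable: "E" and "W" each appear only once.
regions = ["N", "N", "S", "S", "S", "E", "W"]

grouped = group_rare_levels(regions, min_count=2)
print(grouped)  # ['N', 'N', 'S', 'S', 'S', 'OTHER', 'OTHER']

dummies = dummy_indicators(grouped)
print(dummies["OTHER"])  # [0, 0, 0, 0, 0, 1, 1]
```

Grouping first and then creating indicators keeps the number of new columns small, which matters when a class variable has many sparse levels.
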
Ultimately, we could use all these options and more to ensure we meet our objective of generating a model with the predictive strength we are after. Though it is a usual practice, transforming variables could create a challenge for users and for model interpretability.



Mwalimu Phiri 


