What is your stance on Big Data? What are the most critical issues from a data scientist's perspective?

  • Over the past two decades, data has been produced at exponential rates. Big data is assembled from many sources, such as machine logs, sensors, and social and web networks, and it arrives at nonstandard scales of volume, velocity, and variety. My stance on big data depends on the context, but on balance I am for it: it is a means for society to progress in ways we may not have considered before. Big data enables us to detect fraud and system failures, predict the spread of illness, develop targeted marketing campaigns, and even create optimal plans for community development. Ultimately, I believe big data is a net positive because it offers a proactive rather than reactive outlook.
  • Before data can be processed and analyzed, data cleansing is a critical issue. Roughly 80-90% of organizational data is unstructured, and cleansing it can account for most of the expense of a project.
  • Big data also has its vulnerabilities, so it is important to ensure that accumulated data is well secured. For example, one of the major credit reporting companies, Equifax, was recently hacked, putting about 143 million people at risk of fraud. The breach also led to the company's stock being downgraded, and shareholders incurred a major loss.
  • Clean data is at the heart of an accurate model, and it is a major issue in the big data world. The amount of time actually spent cleaning data varies with the sources and objectives, but multiple authors identify cleansing and massaging data prior to knowledge discovery as a critical stage of the data processing cycle. A Baseline Magazine survey of more than 150 data scientists found that time spent cleaning data is their biggest challenge: "Roughly one-third of all chief data scientists spend up to 90 percent of their time 'cleaning' raw data" (King, 2016). Growing big data services stand to benefit greatly from improving the quality of their data. In dollar terms, Gartner estimated "that poor quality of data costs an average organization $13.5 million per year."
  • Pipino and Kopeso also noted in their research that the effort spent on data mining preprocessing tasks has been estimated at "as much as 80% of the time" (Pipino & Kopeso, 2004). Most of the cost of data mining goes toward cleansing and integrating data sets so that they fit the model's specifications. Cleansing can further involve normalizing or standardizing data, modifying data types, stemming documents for text analysis, and reducing variability and dimensionality; a brief code sketch of these steps appears after the figure below. The snippet below, from Baseline Magazine, shows the different areas that challenge data scientists and consume the time available to complete analysis, and it supports the need to automate cleansing processes.
    Figure 1 - Are we done cleaning yet? (snippet from Baseline Magazine; McCafferty, 2015)
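    Because these cleansing steps recur across projects, a minimal sketch of how they might be automated is shown below. It assumes Python with pandas; the sample DataFrame, the column names (age, income, review_text), and the crude suffix-based "stem" are purely illustrative assumptions, not taken from any of the sources cited here, and a real pipeline would use a proper stemmer and domain-specific rules.

    # Minimal data-cleansing sketch (assumes pandas; data and column names are hypothetical).
    import pandas as pd

    raw = pd.DataFrame({
        "age": ["34", "41", None, "29", "29"],
        "income": [52000.0, 61000.5, 48000.0, None, None],
        "review_text": ["Loved the PRODUCTS", "Shipping delayed by two days", None,
                        "great products", "great products"],
    })

    def clean(df: pd.DataFrame) -> pd.DataFrame:
        df = df.copy()

        # Standardize data types: coerce numeric strings; invalid entries become NaN.
        df["age"] = pd.to_numeric(df["age"], errors="coerce")

        # Handle missing values and drop exact duplicate records.
        df["income"] = df["income"].fillna(df["income"].median())
        df = df.drop_duplicates()

        # Reduce variability in a numeric feature via z-score standardization.
        df["income_z"] = (df["income"] - df["income"].mean()) / df["income"].std()

        # Normalize text: lowercase, trim, and apply a crude suffix "stem"
        # (a production pipeline would use a real stemmer, e.g. NLTK's PorterStemmer).
        def crude_stem(token: str) -> str:
            return token[:-1] if token.endswith("s") else token

        df["review_text"] = (
            df["review_text"]
            .fillna("")
            .str.lower()
            .str.strip()
            .apply(lambda text: " ".join(crude_stem(t) for t in text.split()))
        )
        return df

    print(clean(raw))

    Even a small, reusable script like this targets the preprocessing effort that the figure above identifies as consuming most of a data scientist's time.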

    References

    King, T. (2016, January 05). Big Data as a Service: The Time is Now. Retrieved from Solutions Review: https://solutionsreview.com/data-integration/big-data-as-a-service-the-time-is-now/
    McCafferty, D. (2015, March 16). Why Data Scientists Don't Have Time to Do Analysis. Retrieved from Baseline Magazine: http://www.baselinemag.com/careers/slideshows/why-data-scientists-dont-have-time-to-do-analysis.html
    Pipino, L., & Kopeso, D. (2004). Data Mining, Dirty Data, and Costs. International Conference on Information Quality (pp. 164-169). Massachusetts: Massachusetts Institute of Technology.
    Rombaut, V. (2016, July 16). Top 5 Problems with Big Data (and how to solve them). Retrieved from Business2Community: http://www.business2community.com/big-data/top-5-problems-big-data-solve-01597918
  • Mwalimu Phiri
