Apache Spark SQL

This is a quick summary of eight things I found notable in the referenced article from Databricks. Please see the article itself, listed under References, for more detail.
  1. Ironically, the one constant trend in data analytics is change. The better we classify, categorize, organize and structure our data, the better able we are to create disruptive change. The Hadoop ecosystem shows how quickly these trends move. It is worth noting that Apache Spark is still fairly new as a top open source project, with an inception of 2010. Spark simplifies MapReduce challenges, Spark's MLlib simplifies Mahout challenges, and Impala was enhanced after Shark's design. Shark, Impala and DryadLINQ in turn inspired the design of Spark SQL.
  2. The article describes Spark SQL's domain-specific language (DSL), the DataFrame API. DataFrames, which can also be built from RDDs, are treated as objects and can therefore be manipulated with relational operators such as projection (select), filter (where), join, and aggregation (groupBy).
    1. Even though a DataFrame is computed lazily, Spark SQL reports an error as soon as the user types an invalid line of code instead of waiting until execution. This is similar to the Query Analyzer feature in Microsoft SQL Server Management Studio, except that queries are analyzed automatically as the user types.
  3. In-memory caching uses columnar storage. Because columnar data compresses well, this is a great way to keep massive amounts of data in memory and process them with ease.
  4. User-defined functions (UDFs) can be registered and called from queries, which could be a good fit for exposing ML algorithms implemented in different domains.
  5. Spark SQL has an optimizer library, Catalyst, which optimizes queries across data sources. Its most important building block is the tree of node classes. Rules, typically written with pattern matching, transform one tree into another. Catalyst trees go through four phases: analysis, logical optimization, physical planning and code generation.
  6. Spark SQL has a relational query optimizer, which optimizes DataFrame computations on single machines or clusters. With MLlib on Spark, it is easier to expose all of MLlib's algorithms to SQL.
  7. Advanced analytics: a schema inference algorithm parses semi-structured data such as JSON, big data processing extends past aggregation into ML, and data pipelines combine data from disparate storage systems. Currently, the inference algorithm is limited in use. Future plans are to add similar inference for CSV and XML files.
  8. Ultimately, Spark SQL empowers users to write better data pipelines with greater integration and less complex code. Its APIs can combine relational and procedural queries, which run faster because the Spark SQL engine executes them as parallel jobs. This makes Spark much more competitive than Shark or Impala.
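
To make the tree-and-rule idea from item 5 more concrete, here is a minimal toy sketch in plain Python. It is not Catalyst's API (Catalyst is written in Scala). The class names `Literal`, `Attribute` and `Add`, the helper `transform_up`, and the bottom-up traversal order are all assumptions chosen for illustration. It shows a single constant-folding rule rewriting the expression x + (1 + 2).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    """A constant value in the expression tree (hypothetical node type)."""
    value: int


@dataclass(frozen=True)
class Attribute:
    """A reference to a column such as x (hypothetical node type)."""
    name: str


@dataclass(frozen=True)
class Add:
    """The sum of two sub-expressions (hypothetical node type)."""
    left: object
    right: object


def transform_up(node, rule):
    """Apply `rule` to every node, children first, and return the new tree."""
    if isinstance(node, Add):
        node = Add(transform_up(node.left, rule), transform_up(node.right, rule))
    return rule(node)


def constant_fold(node):
    """Rule: replace Add(Literal, Literal) with a single Literal."""
    if (isinstance(node, Add)
            and isinstance(node.left, Literal)
            and isinstance(node.right, Literal)):
        return Literal(node.left.value + node.right.value)
    return node  # the pattern did not match, so the node is left unchanged


# The expression x + (1 + 2) as a tree.
tree = Add(Attribute("x"), Add(Literal(1), Literal(2)))
optimized = transform_up(tree, constant_fold)
print(optimized)  # Add(left=Attribute(name='x'), right=Literal(value=3))
```

The rule only describes the pattern it cares about and returns every other node untouched, which is what lets many small rules be combined. A real optimizer would also apply its rules repeatedly until the tree stops changing; this sketch makes a single pass.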
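
The schema inference mentioned in item 7 can be sketched in a similar spirit. The following is a simplified illustration in plain Python, loosely modeled on the idea of finding the most specific type that fits every observed value of a field and merging per-record schemas with a reduce. The type names (`INT`, `LONG`, `FLOAT`, `STRING`, and so on), the functions `infer_type`, `merge_types` and `infer_schema`, and the sample records are all assumptions made for this example, not Spark's implementation. Arrays are deliberately left out.

```python
import json
from functools import reduce

# Assumed numeric types, ordered from most specific to most general.
NUMERIC_ORDER = ["INT", "LONG", "FLOAT"]


def infer_type(value):
    """Return the most specific type name for a single parsed JSON value."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):  # must be checked before int: bool is an int subclass
        return "BOOLEAN"
    if isinstance(value, int):
        return "INT" if -2**31 <= value < 2**31 else "LONG"
    if isinstance(value, float):
        return "FLOAT"
    if isinstance(value, dict):
        return {key: infer_type(val) for key, val in value.items()}
    # Strings land here; arrays are out of scope and also fall back to STRING.
    return "STRING"


def merge_types(a, b):
    """Return the most specific type that can hold both a and b."""
    if a == b:
        return a
    if a == "NULL":
        return b
    if b == "NULL":
        return a
    if isinstance(a, dict) and isinstance(b, dict):
        # A field missing from one record is treated as NULL in that record.
        return {key: merge_types(a.get(key, "NULL"), b.get(key, "NULL"))
                for key in sorted(set(a) | set(b))}
    if a in NUMERIC_ORDER and b in NUMERIC_ORDER:
        return max(a, b, key=NUMERIC_ORDER.index)
    return "STRING"  # conflicting types fall back to the most general type


def infer_schema(json_lines):
    """Infer one schema for a collection of JSON records, one record per line."""
    schemas = (infer_type(json.loads(line)) for line in json_lines)
    return reduce(merge_types, schemas, {})


records = [
    '{"text": "first record", "loc": {"lat": 45.1, "long": 90}}',
    '{"text": "second record", "loc": {"lat": 39, "long": 88.5}}',
    '{"text": "third record", "retweets": 3000000000}',
]
schema = infer_schema(records)
print(schema)
```

Because `merge_types` combines two schemas into one and does not depend on the order of the records, the merge can be expressed as a reduce, which is the kind of operation that parallelizes naturally across a cluster.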

References:

Armbrust, M., Xin, R., Lian, C., Huai, Y., Liu, D., Bradley, J., . . . Zaharia, M. (2015). Spark SQL: Relational Data Processing in Spark. Databricks, AMPLab, UC Berkeley. Berkeley: Databricks.
