Open Source Programs Spark Innovation in z Systems Analytics
Integrating Spark, DB2 and BigInsights allows users to leverage them effectively
9/21/2016
By Eberhard Hechler
The availability of the IBM z/OS platform for Apache Spark and its positioning within z Systems analytics with DB2 for z/OS and the IBM DB2 Analytics Accelerator is a topic of interest to IBM’s clients. The IBM Open Platform (IOP) with Apache Spark, Apache Hadoop and IBM BigInsights on the distributed platform are further enriching the value proposition for z Systems analytics. However, distributed platform offerings are rarely viewed in the context of IBM z analytics. Rather, they are assumed to necessitate the movement of z/OS data off the z Systems platform. Clients often ask about integration options and clear positioning statements, which will help them to leverage open source and Spark effectively in the context of IBM z analytics related use case scenarios.
This article looks at the relevant technical integration points and the enrichment they bring to some use case scenarios. It should help readers better understand the added value that open source can provide for the z Systems analytics platform with DB2 and the Accelerator. The focus is on Spark on z/OS, as well as on the relationship and integration with IOP (including Apache Spark) and BigInsights on the distributed platform.
The following are a few technical integration scenarios.
Spark on z/OS With DB2 for z/OS and IBM DB2 Analytics Accelerator
Apache Spark on z/OS enables data scientists to develop analytical models and segmentation models on z Systems using z/OS data (e.g., to develop fraud discovery and prevention models, customer segmentation for up-sell and cross-sell, or sentiment analytics with IBM z customer profile and transactional data). As can be seen in Figure 1, Spark's SQL connectivity allows data in DB2 to be accessed using Java database connectivity (JDBC). The Spark DataFrames API exposes z/OS data as DataFrames, which were introduced with Spark V1.3, and transforms them into Resilient Distributed Datasets (RDDs) that can be processed with any of the Spark libraries.
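The JDBC access path described above can be sketched in a few lines of PySpark. This is a minimal illustration under stated assumptions, not a reference implementation: the helper names (db2_jdbc_options, load_db2_table), the host, port, location and table values are all hypothetical, and the sketch assumes a Spark V2.0 or later session with the IBM JDBC driver on the classpath.

```python
from typing import Dict

# Class name of the IBM Data Server Driver for JDBC (JCC).
DB2_JCC_DRIVER = "com.ibm.db2.jcc.DB2Driver"


def db2_jdbc_options(host: str, port: int, location: str, table: str,
                     user: str, password: str) -> Dict[str, str]:
    """Build the option map for Spark's JDBC data source.

    All argument values are supplied by the caller; nothing here is
    specific to a particular DB2 for z/OS installation.
    """
    return {
        "url": f"jdbc:db2://{host}:{port}/{location}",
        "driver": DB2_JCC_DRIVER,
        "dbtable": table,
        "user": user,
        "password": password,
    }


def load_db2_table(spark, options: Dict[str, str]):
    """Read a DB2 table as a DataFrame.

    `spark` is an existing SparkSession (assumption: Spark V2.0 or later).
    """
    return spark.read.format("jdbc").options(**options).load()
```

Given a DataFrame df returned by load_db2_table, df.rdd yields the underlying RDD of Row objects, which is the form the RDD-based Spark libraries consume. On Spark V1.x, the same read chain is available from an SQLContext instead of a SparkSession.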
Other z/OS data sources, such as IMS, VSAM and SMF log records, can be accessed via Spark on z/OS directly. Once the data is abstracted as RDDs, Spark analytical tasks can be performed using the Spark Machine Learning Library (MLlib), which enables new user roles, tasks and use cases on z Systems. It allows line of business (LoB) users to leverage z Systems in innovative ways, using programming languages that are new to z Systems, such as Scala, Python and R, with z/OS data. The DB2 Analytics Accelerator can be used to enable fast SQL execution in the context of these additional use cases.
To implement federation across platforms, Spark on z/OS implementations will initially need external orchestration. In the future, the value of Spark on z/OS will be further enhanced with improved federated analytics, including true federated SQL with Spark SQL, providing potential for application transparency, global optimization and pushdown capabilities to further reduce required data movement, especially related to integration of non-z/OS data.
Figure 1: Leveraging Spark SQL Connectivity
Federated SQL With DB2 for z/OS as Data Source
A number of organizations prefer the deployment of an open-source ecosystem on a distributed platform. IBM Big SQL, a part of IBM BigInsights on the distributed platform, provides federation capability and works with DB2 for z/OS as a federated data source underpinned by the Accelerator for the execution of complex analytical queries. Big SQL on the distributed platform provides a single point of entry for SQL queries with application transparency, global optimization, pushdown capabilities and split query generation (see Figure 2). It efficiently integrates z Systems, specifically the DB2 Analytics Accelerator, with IOP (including Apache Spark) and IBM BigInsights on the distributed platform. The integration simplifies the information supply chain by reducing data movement, enables LoB users to access and process only the relevant z/OS data and introduces SQL federation to application programmers within a Spark context. In addition, it enables z Systems server-centric data lake architectures.
Federated SQL with DB2 for z/OS as a data source enables z/OS data to be integrated and queried in the context of non-structured data on a distributed Hadoop platform for uses such as sentiment analytics (e.g., on e-mail and call center transcripts or Twitter data). It also allows IBM z customer data to be integrated with The Weather Company weather data, and complex analytical queries to be executed on z Systems data via the Accelerator.
For most of the clients that I have been engaged with, this Big SQL function that uses DB2 as a federated data source represents a significant value point for many use case scenarios and data lake deployments. It helps to integrate z/OS data by performing SQL on DB2 and moving only the relevant result sets off z Systems.
Figure 2: Federated Data Access Across DB2 for z/OS and Hadoop Platform
Leveraging Big SQL and Spark
The integration between Apache Spark and Big SQL in IBM BigInsights V4.2 allows Spark on the distributed platform (as part of IOP) to be called by Big SQL, using a user-defined function (UDF) (see Figure 3). An alternative is to use Spark with its SQL connectivity via JDBC to access DB2 for z/OS (with the DB2 Analytics Accelerator) directly, if federation to other data sources isn’t required.
The benefit is that only relevant DB2 for z/OS data needs to be moved to a Hadoop-based data lake (e.g., a Hive cluster). Federation can be performed to process DB2 for z/OS data in the context of data on a Hadoop cluster. Access to DB2 can be underpinned by the DB2 Analytics Accelerator to enable highly complex analytical queries to be executed. This approach also avoids moving potentially large Hadoop data volumes (e.g., e-mails, call center transcripts, web logs and Twitter data) to the z/OS platform.
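For the direct Spark-to-DB2 alternative, one way to move only the relevant data is to hand Spark's JDBC data source a query rather than a table name, so that filtering and aggregation run inside DB2 and only the result set leaves z Systems. The following is a hedged sketch: the helper names (pushdown_query, load_relevant_rows) and the alias are hypothetical, and whether a given query is routed to the DB2 Analytics Accelerator depends on the DB2 configuration, not on this code.

```python
def pushdown_query(sql: str, alias: str = "pushed") -> str:
    """Wrap a SELECT statement as a derived table.

    Spark's JDBC source accepts anything valid in a FROM clause for its
    "dbtable" option, so a parenthesized subquery is executed by the
    database and only its result rows are fetched.
    """
    statement = sql.strip().rstrip(";").rstrip()
    return f"({statement}) AS {alias}"


def load_relevant_rows(spark, url: str, sql: str, user: str, password: str):
    """Run `sql` in DB2 and return its result set as a DataFrame.

    `spark` is an existing SparkSession; `url` is a JDBC URL of the form
    jdbc:db2://host:port/location (values are installation-specific).
    """
    return (
        spark.read.format("jdbc")
        .option("url", url)
        .option("driver", "com.ibm.db2.jcc.DB2Driver")
        .option("dbtable", pushdown_query(sql))
        .option("user", user)
        .option("password", password)
        .load()
    )
```

The resulting DataFrame can then be joined in Spark with data that already resides on the Hadoop cluster, which keeps the large Hadoop-side volumes where they are.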
Figure 3: Integration of Big SQL and Apache Spark
DB2 for z/OS with the IBM DB2 Analytics Accelerator, Spark on z/OS, IOP with Apache Spark, Apache Hadoop and IBM BigInsights on the distributed platform are complementary offerings. They work in concert in the context of use case scenarios. These offerings are integrated and shouldn’t be deployed in isolation. Spark on z/OS augments use case scenarios above and beyond what has been possible with DB2 for z/OS and the IBM DB2 Analytics Accelerator.
DB2 for z/OS with the IBM DB2 Analytics Accelerator works well as a federated data source for Big SQL (on the distributed platform), which has long been a key requirement of quite a few z clients.
The Spark and Big SQL integration (on the distributed platform) as part of IBM BigInsights V4.2 broadens the use cases and allows z/OS data to be integrated with limited data movement.
Eberhard Hechler is an Executive Architect from the IBM Germany R&D Lab. He is a member of DB2 Analytics Accelerator development.