An Absolute Beginner’s Guide to Big Data
You’re a mainframe user and you’ve been hearing a lot about big data—people keep saying Hadoop—and you want a very simple beginner’s guide telling you what it all means. Well, look no further. Here’s the absolute beginner’s guide to big data for mainframers.
The first question you’re probably asking is why it’s called big data. You’ve been managing petabytes of hierarchically stored data for years in your IMS system, and you’ve been accessing it quickly, so what’s different about so-called big data? The answer has to do with the structure of the data and where it comes from. An IMS database has been structured to work efficiently, and the data in it has more than likely been keyed in by a human. Data held in Hadoop is quite different. It can be completely unstructured, it could have come from any source on the Internet of Things (video, remote sensor readings or barcode scans, for example), and it could be exabytes in size.
So, if it runs on a mainframe, why is Microsoft interested? Why are the likes of Amazon, Google and Facebook players? The answer is that big data typically sits on top of Linux servers, although Windows servers will do. That means, for mainframers, you’ll have to use Linux on System z. And Amazon, Google and Facebook are interested because they deal with large amounts of fairly random data. Amazon wants to make sure that you’re being informed about other products similar to ones you’ve already purchased. Google wants to serve up appropriate adverts and has huge amounts of data about data sitting on its servers. And Facebook uses it to effectively stay in business.
Once you’ve decided that there’s a business case for making use of big data, the next thing is to find out what you actually need. The good news is that most of the software is open source and comes from the Apache Software Foundation, to which IBM belongs.
You need a file system and that’s called Hadoop Distributed File System (HDFS). Data in a Hadoop cluster is broken down into smaller pieces called blocks, and these are distributed throughout the cluster. Any work on the data can then be performed on manageable pieces, rather than on the whole mass of data.
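To make the idea of blocks concrete, here is a minimal Python sketch of cutting a file into fixed-size pieces and spreading them across a cluster. It is not HDFS code: the function names, the node names, the tiny 32-byte block size and the round-robin placement are all assumptions made for illustration. Real HDFS blocks are far larger, and each block is also replicated on several nodes, which this sketch does not model.

```python
# Illustrative only: not the HDFS API. Block size and placement policy are
# assumptions chosen to keep the example small.


def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut data into consecutive blocks of at most block_size bytes."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def assign_blocks(blocks: list[bytes], nodes: list[str]) -> dict[str, list[int]]:
    """Assign block indexes to nodes round-robin (a hypothetical policy)."""
    placement: dict[str, list[int]] = {node: [] for node in nodes}
    for index in range(len(blocks)):
        placement[nodes[index % len(nodes)]].append(index)
    return placement


if __name__ == "__main__":
    data = b"0123456789" * 10  # 100 bytes standing in for a very large file
    blocks = split_into_blocks(data, block_size=32)
    placement = assign_blocks(blocks, ["node-a", "node-b", "node-c"])
    for node, indexes in placement.items():
        print(node, indexes)
```

Each node now holds only a few blocks, so a job can work on its local pieces without ever reading the whole file in one place.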
Next, you need a data store, and that’s HBase. HBase is a non-relational, distributed database, written in Java and modelled after Google’s Bigtable. It’s a column-oriented database management system (DBMS) that runs on top of HDFS, and HBase applications are usually written in Java as well.
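For readers used to IMS segments or relational tables, a toy model may help show what a column-oriented store looks like. The Python sketch below is not the HBase API; the class name, method names and sample data are hypothetical. It only illustrates the general shape of the data model: each row is found by a row key, each value lives in a named column (written here as family:qualifier), and rows do not have to share the same columns.

```python
from collections import defaultdict


class TinyColumnStore:
    """A toy in-memory table: row key -> "family:qualifier" -> value."""

    def __init__(self) -> None:
        self._rows: dict[str, dict[str, str]] = defaultdict(dict)

    def put(self, row_key: str, column: str, value: str) -> None:
        """Store one value in one column of one row."""
        self._rows[row_key][column] = value

    def get(self, row_key: str) -> dict[str, str]:
        """Return a copy of every column stored for a row (empty if unknown)."""
        return dict(self._rows.get(row_key, {}))

    def scan_column(self, column: str) -> dict[str, str]:
        """Return the column's value for each row that has it, in row-key order."""
        return {
            key: columns[column]
            for key, columns in sorted(self._rows.items())
            if column in columns
        }


if __name__ == "__main__":
    table = TinyColumnStore()
    table.put("sensor-2", "reading:temp", "19.5")
    table.put("sensor-1", "reading:temp", "21.0")
    table.put("sensor-1", "meta:site", "Leeds")  # only this row has the column
    print(table.get("sensor-1"))
    print(table.scan_column("reading:temp"))
```

The point of the sketch is the sparseness: sensor-2 simply has no meta:site entry, rather than an empty field reserved for it.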
As a runtime, there’s MapReduce, a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. This is the part that makes big data work. In very basic terms, a map step works through the pieces of data in parallel and a reduce step combines the intermediate results, hence the name.
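The classic way to see the model in action is a word count. The Python sketch below runs on a single machine and is not Hadoop code; the function names are assumptions chosen for illustration. It shows the three stages in miniature: a map step that emits a (word, 1) pair for each word, a grouping step that collects the pairs by word (on a real cluster the framework does this between the two phases), and a reduce step that adds up the values for each word.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(line: str) -> Iterator[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one line of input."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Group the emitted values by key, ready for the reduce step."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: combine all the values for one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> dict[str, int]:
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data on big iron", "data about data"]))
```

Run as a script, this prints {'big': 2, 'data': 3, 'on': 1, 'iron': 1, 'about': 1}. On a cluster, the map calls would run in parallel on the nodes holding each block of input, and only the much smaller intermediate pairs would travel across the network to the reducers.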
IBM has added its own technology to the big data concept, with things like InfoSphere BigInsights, which offers visualization and exploration, development tools, advanced engines, connectors, workload optimization, and administration and security. In terms of administration and security, IBM offers a Web console that can:
• Start and stop services
• Run and monitor jobs (applications)
• Explore and modify the file system
• Simplify common tasks through built-in apps
There are connectors to link to databases like DB2, Netezza, Oracle and Teradata. And there’s integration with InfoSphere DataStage (data collection and integration), InfoSphere Streams (real-time streams processing), InfoSphere Guardium (security and monitoring), Cognos Business Intelligence (BI capabilities), IBM Platform Computing (cluster/grid infrastructure and management) and more. Big SQL is coming with BigInsights V2.1. This will provide SQL access to data stored in BigInsights through JDBC/ODBC, using rich standard SQL either to leverage MapReduce parallelism or to achieve low latency.
IBM also offers advanced engines, including an advanced text analytics engine that can automatically identify and understand key information in text. Text analytics is really useful because most of the world’s data is in unstructured or semi-structured text. For example, social media is full of discussions about products and services, while internal information in organizations is locked in blobs and description fields, and is sometimes even discarded. It’s been suggested that more than 80 percent of stored information is unstructured, including e-medical records, hospital reports, case files, police records, emergency calls, tech notes, call logs, online media, insurance claims, Twitter, Facebook, blogs and forums.
In terms of development tools, there’s an Eclipse-based development environment for building and deploying applications. There are developer tools and a set of analytic extractors for fast adoption that, according to IBM, reduce coding and debugging time by up to 30 percent. There are also plug-ins for text analytics, MapReduce programming, Jaql development and Hive queries.
For visualization and exploration, IBM offers BigSheets, which provides Web-based analysis and visualization through a familiar spreadsheet-like interface and can define and manage long-running data collection jobs. For workload management, the open source options include ZooKeeper, Oozie, Jaql, Lucene, HCatalog, Pig and Hive.
You might be interested to know that IBM claims that Big SQL provides robust SQL support for the Hadoop ecosystem by providing:
• A scalable architecture
• Support for the SQL and data types available in SQL-92, plus some additional capabilities
• Support for JDBC and ODBC client drivers
• Efficient handling of “point queries”
• Support for a wide variety of data sources and file formats for HDFS and HBase
• The ability to interoperate well with the open source ecosystem within Hadoop
It’s well worth the time to take a look at what big data can offer your organization now.
Trevor Eddolls is CEO at iTech-Ed Ltd., an IT consultancy. For many years, he was the editorial director for Xephon’s Update publications and is now contributing editor to the Arcati Mainframe Yearbook. Eddolls has written three specialist IT books, and has had numerous technical articles published. He currently chairs the Virtual IMS and Virtual CICS user groups.
Posted: 10/22/2013 1:01:01 AM