As activities in our world become more integrated, the rate of data growth is increasing exponentially. This data explosion (Figure 1) is referred to as big data, which renders current data management methods inadequate. IBM® is preparing the next generation of technology to meet these data management challenges.
Figure 1. Big data explosion
To provide the capability of using big data sources and analytics of these sources, IBM has developed the IBM InfoSphere® BigInsights™ product. This offering is based on the open source computing framework known as Apache Hadoop. This framework provides unique capabilities to the data management ecosystem and further enhances the value of investment in the data warehouse. This IBM Redbooks® Solution Guide describes the value of big data in organizations and how the InfoSphere BigInsights solution helps organizations to handle this data.
Did you know?
With the advent of email, smartphones, social media, sensors, and machine-generated data, significantly more data is generated today than in the past. But big data is not just about the sheer volume of data being created. Because a myriad of unstructured sources create this data, a greater variety of data is now available. Each source produces data at a different rate, which we call velocity. In addition, you still need to assess the veracity of this new information, as you would with structured data.
Business value
IBM InfoSphere BigInsights makes it simpler to use Apache Hadoop and to build big data applications. It enhances this open source technology to withstand enterprise demands, by adding administrative, workflow, provisioning, and security features, in addition to best-in-class analytical capabilities from IBM Research. The result is a more developer-friendly and user-friendly solution for complex, large-scale analytics.
By using InfoSphere BigInsights, enterprises of all sizes can cost effectively manage and analyze the massive volume, variety, and velocity of data that consumers and businesses create every day. InfoSphere BigInsights can help increase operational efficiency by augmenting the data warehouse environment. You can use it as an archive that can be queried so that you can store and analyze large volumes of multistructured data without straining the data warehouse. You can also use it as a preprocessing hub so that you can explore your data, determine what is the most valuable, and extract that data cost effectively. In addition, you can use it for ad hoc analysis so that you can perform analysis on all of your data.
The InfoSphere BigInsights offering provides a packaged Apache Hadoop distribution, a simplified installation of Hadoop, and corresponding open source tools for application development, data movement, and cluster management. InfoSphere BigInsights also provides more options for data security, which is frequently a concern for anyone who is contemplating incorporating new technology into their data management ecosystem. InfoSphere BigInsights is a component of the IBM Big Data Platform and, therefore, provides potential integration points with the other components of the platform including the data warehouse, data integration, and governance engines, and third-party data analytics tools. The stack includes tools for built-in analytics of text, natural language processing, and spreadsheet-like data discovery and exploration.
Solution overview
These days, high-velocity data sources, such as streaming video or sensor data, send data continuously, 24x7. In current data warehouse and analytics-intensive environments, data volume is a key factor. Because we will be working with hundreds of terabytes, and in many cases petabytes (PB), now and in the future, this data has to reside somewhere.
Some might say big data can be addressed by the data warehouse. They might suggest that their data warehouse works fine for collection and analysis of structured data and that their solution works well for their unstructured data needs. Although traditional data warehouses do have a role in the big data solution space, they are now a foundational piece of a larger solution. A consideration in data warehouse environments is the I/O that is required for reading massive amounts of data from storage for processing within the data warehouse database server. The ability of servers to process this data is not usually a factor because they typically have significant amounts of RAM and processor power, parallelizing tasks across the computing resources of the server. Many vendors have developed data warehouse appliances and appliance-like platforms (called data warehouse platforms), specifically for the analytics-intensive workload of large data warehouses. IBM Netezza® and IBM Smart Analytics Systems are examples of these types of platforms.
Although these data warehouse platforms are optimized for analytics-intensive workloads, they are highly specialized systems and are not cheap. At the rate that data continues to grow, it is reasonable to expect that many organizations will need petabyte-scale data warehouse systems in the next 2 - 5 years. For example, HD video generates about 1 GB of data per minute of video, which translates to about 1.44 TB (roughly 1.5 TB) of data that is generated daily per camera. If five cameras are in use, about 7.2 TB per day are being generated, which extrapolates to roughly 2.6 PB per year.
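The video estimate can be reproduced with a few lines of arithmetic. The following is a minimal sketch, assuming continuous recording, decimal units (1 TB = 1,000 GB, 1 PB = 1,000 TB), and a 365-day year. The function name `video_volume` and the constants are illustrative and are not part of any IBM product. The exact figures depend on rounding and unit conventions, so the output can differ slightly from rounded numbers quoted in the text.

```python
# Back-of-the-envelope estimate of how much data continuous HD video produces.
# Assumptions (not from any product documentation): decimal units and
# round-the-clock recording.

MINUTES_PER_DAY = 24 * 60  # 1,440 minutes of video per camera per day


def video_volume(gb_per_minute, cameras, days):
    """Return (TB per camera per day, TB per day for all cameras, PB over `days`)."""
    gb_per_camera_day = gb_per_minute * MINUTES_PER_DAY
    gb_per_day = gb_per_camera_day * cameras
    gb_total = gb_per_day * days
    return (
        gb_per_camera_day / 1_000,   # GB -> TB
        gb_per_day / 1_000,          # GB -> TB
        gb_total / 1_000_000,        # GB -> PB
    )


if __name__ == "__main__":
    per_camera, per_day, per_year = video_volume(gb_per_minute=1, cameras=5, days=365)
    print(f"Per camera per day: {per_camera:.2f} TB")
    print(f"Five cameras per day: {per_day:.2f} TB")
    print(f"Five cameras per year: {per_year:.2f} PB")
```

Changing the `cameras` or `gb_per_minute` arguments shows how quickly the annual total grows as more sources or higher resolutions are added.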
You could be adding over 2 PB of data annually to your warehouse that is separate from typical day-to-day, business-centric data systems on which you might already be capturing and performing analytics. The costs of capturing and analyzing big data swell quickly. What if you could use commodity hardware as a foundation for storing data? What if you could use the resources of this hardware to filter data, and then use your existing data warehouse to process the remaining data for its business value? That approach could be more cost effective than expanding your data warehouse platform to a size large enough to perform analytics on all of the data.
Solution architecture
The IBM InfoSphere BigInsights solution is based on the widely used Hadoop framework. Fundamentally, Hadoop consists of two components: the Hadoop Distributed File System (HDFS), which provides a way to store data, and MapReduce, which is a way of processing data in a distributed manner. These components were developed by the open source community based on papers that Google published about its approach to the problems of dealing with an overwhelming volume of data. Yahoo! then started work on an open source equivalent, which was named Hadoop after a child’s toy elephant.
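The MapReduce processing model can be illustrated with a small word count. The following is a minimal, single-process sketch in plain Python. It does not use the Hadoop API, and the function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) are hypothetical. Each inner list in `splits` stands in for a block of a file that, in a real cluster, would be stored in HDFS and processed on a separate machine.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(splits):
    """Run the map step over every split, shuffle, then reduce per key."""
    pairs = [
        pair
        for split in splits      # each split mimics one block of input data
        for line in split
        for pair in map_words(line)
    ]
    grouped = shuffle(pairs)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    splits = [["big data big insights"], ["data warehouse data"]]
    print(word_count(splits))
```

Because the map step handles each line independently and the reduce step handles each key independently, a framework can spread both steps across many machines, which is the property that distributed processing relies on.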
Hadoop consists of many connected computers, called DataNodes, that store data on their local file system and process the data as directed by a central management node. The management nodes consist of the following processes: