It’s time to bring the little elephant (Hadoop) into your data ecosystem – Part 2

Posted by Anoop Abraham on 11 November 2014


In an earlier blog I outlined several reasons for adopting Hadoop into the data ecosystem. In this post I'll demonstrate an approach for bringing Hadoop into the data environment, and outline eight elements to consider before embarking on this journey.

1. Understand and embrace Hadoop's ecosystem and capabilities

The prime reasons for adopting Hadoop are its extreme scalability, exploratory analytics capability, low cost, and support for multi-structured data.

Hadoop is a family of open source products and technologies brought together by the Apache Software Foundation. The Apache Hadoop library includes (but is not limited to) the Hadoop Distributed File System (HDFS).

These technologies can be used in any combination, but HDFS, MapReduce, Pig, Hive and HBase form the core stack for Business Intelligence/Data Warehousing applications. Impala can also be leveraged as a SQL engine for low-latency access to HDFS and Hive data.
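To illustrate the MapReduce paradigm at the heart of this stack, here is a minimal word-count sketch in the style of a Hadoop Streaming job, with the mapper and reducer written as plain Python functions. The function names and sample lines are purely illustrative; a real job would read from and write to HDFS via the Streaming API.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Emit a (word, 1) pair for every word, as a Streaming
    # mapper would write key/value lines to stdout.
    for word in line.lower().split():
        yield word, 1

def reducer(pairs):
    # Hadoop sorts mapper output by key before the reduce phase;
    # here we sort explicitly, then sum the counts per word.
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield word, sum(count for _, count in group)

lines = ["big data needs big clusters", "hadoop stores big data"]
pairs = [kv for line in lines for kv in mapper(line)]
counts = dict(reducer(pairs))
print(counts["big"])  # "big" appears three times across the input
```

The same split-map-shuffle-reduce shape is what HDFS and MapReduce distribute across a cluster, which is why the paradigm scales to very large inputs.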

The figure below depicts (only a subset of) the products in the Hadoop ecosystem.

[Figure: a subset of the products in the Hadoop ecosystem]

2. Understand the common myths about Hadoop and correct them repeatedly

Hadoop adoption is unlikely to happen overnight; expect to spend time helping peers and management understand what Hadoop can and cannot do, and why the organisation needs it. Set expectations by stressing that Hadoop is a complement to your existing systems, not a replacement.

3. Hadoop is not free, even though it is open source

Despite being open source, Hadoop still requires specialised programmers to support it once it has become part of your infrastructure. The programme should be budgeted carefully: maintaining large Hadoop clusters demands considerable administration time, even though the servers themselves are cheap.

4. Get trained in Hadoop and stay current

It is easier to train a BI professional in Hadoop than to train an applications developer in BI, so favour the former approach: the learning curve is much smaller. It is also important to stay current, as the product set is constantly evolving, and assessing the latest features will help you get the best return on your investment.

5. Take the integrated approach

The ultimate goal should be to integrate Hadoop into the wider Business Intelligence and Data Warehouse environment. A Hadoop implementation strategy is crucial: even if the implementation starts in a silo, at some phase it will need to be integrated with Business Intelligence, Data Warehouse and analytics systems. Identifying a use case is critical to starting Hadoop adoption. This can be any pain point in the current BI/DW system, such as ETL loads overrunning their window, or the inability to bring in the whole data set because of storage limitations in the current environment.

The figure below illustrates a typical use case in which Hadoop is used to extract and load raw data streams, which then complement the information already held in the Data Warehouse. Reporting tools can be leveraged to combine data from both Hadoop and the Data Warehouse.

[Figure: Hadoop loading raw data streams to complement the Data Warehouse]
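A sketch of the reporting pattern just described: conformed facts from the warehouse are joined at report time with a raw stream landed in Hadoop. The record layouts, field names and figures below are hypothetical stand-ins for the two sources; a reporting tool would perform the equivalent join across both stores.

```python
# Hypothetical extract from the Data Warehouse (conformed, aggregated).
warehouse = [
    {"customer_id": 1, "lifetime_value": 1200.0},
    {"customer_id": 2, "lifetime_value": 450.0},
]

# Hypothetical raw clickstream landed in Hadoop, summarised per customer.
raw_clicks = [
    {"customer_id": 1, "page_views": 37},
    {"customer_id": 2, "page_views": 5},
    {"customer_id": 3, "page_views": 12},  # not yet in the warehouse
]

clicks_by_customer = {r["customer_id"]: r["page_views"] for r in raw_clicks}

# Enrich each warehouse row with behaviour captured only in Hadoop.
report = [
    {**row, "page_views": clicks_by_customer.get(row["customer_id"], 0)}
    for row in warehouse
]
```

The point of the pattern is that the raw stream never has to pass through the warehouse's ETL pipeline before it adds value to reporting.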

6. To enable easy adoption, look for capabilities that make Hadoop data appear relational

BI teams prefer the relational paradigm. Hadoop’s upcoming releases focus on products that support SQL, real-time query and analysis, easy integration with reporting tools, self-service, and tools familiar to users of relational databases. Once teams are comfortable with these, NoSQL, schema-on-read and similar approaches can be brought in to support unstructured data analysis.
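The schema-on-read idea mentioned above can be shown in a few lines: records are stored exactly as they arrive, and a schema is only projected onto them when a query runs. The records and field names below are illustrative; in Hadoop the raw files would sit in HDFS and the projection would be done by a tool such as Hive.

```python
import json

# Raw records stored untouched, as they would be in HDFS; note the
# fields differ between records -- no schema was enforced on write.
raw_records = [
    '{"user": "ann", "device": "mobile", "ms": 120}',
    '{"user": "bob", "ms": 300}',
    '{"user": "ann", "device": "desktop", "ms": 90, "ref": "ad"}',
]

def query(records, fields):
    """Project a schema onto raw JSON at read time (schema on read)."""
    for rec in records:
        row = json.loads(rec)
        # Missing fields become None instead of failing the load.
        yield {f: row.get(f) for f in fields}

rows = list(query(raw_records, ["user", "device"]))
```

Contrast this with schema-on-write in a relational warehouse, where the second record would have been rejected or defaulted during the ETL load.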

7. Find a place for Hadoop by adjusting the Data Warehouse architecture

There are many areas within standard Data Warehouse architectures where HDFS and other Hadoop products can make a contribution. HDFS and Hadoop capabilities can be leveraged for:

  • Data Staging
  • Archiving detailed source data
  • Managing non-structured data
  • Schema flexibility
  • Managing file-based data
  • Data sandboxes
  • Adding more processing power for an ETL hub or ELT push down
  • Wherever a non-dimensional operational data store is used

It is this richness that makes Hadoop a perfect companion or add-on to a Data Warehouse system.
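For the staging and archiving roles above, raw files are commonly laid out in HDFS under source- and date-partitioned directories, so that old extracts stay queryable and engines such as Hive can prune partitions. A sketch of that naming convention follows; the `/data/staging` root and the helper function are a common convention chosen for illustration, not anything Hadoop requires.

```python
from datetime import date

def staging_path(source, extract_date, filename):
    # Hive-style partitioned layout: one directory per source
    # system and load date, so queries can skip whole partitions.
    return (f"/data/staging/source={source}"
            f"/load_date={extract_date.isoformat()}/{filename}")

path = staging_path("crm", date(2014, 11, 11), "customers.csv")
print(path)
# /data/staging/source=crm/load_date=2014-11-11/customers.csv
```

Because storage is cheap, files staged this way can simply be kept, giving the warehouse an online archive of detailed source data at the same time.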

8. Set up a proof-of-concept

As part of the integration exercise it is always worth setting up a Proof of Concept in order to:

  • Deliver some “Quick Wins” that demonstrate the value of Hadoop technologies to both business and IT
  • Develop use cases such as bringing in a new raw data source, or offloading the Landing/Staging layer of the existing Data Warehouse to HDFS; both are ideal proof-of-concept candidates
  • Get used to the Hadoop Ecosystem tools and identify the most appropriate tools for the job
  • Compare the current state with the future state after adopting Hadoop

Hadoop distributions are available as open source from the Apache.org website, and introductory or developer distributions are available from vendors such as Cloudera and Hortonworks. Trial versions of these distributions can be a great way to kick-start the onboarding of Hadoop.

Enterprises like Hadoop for its ability to handle massive volumes of structured and unstructured data efficiently. Though the technology remains challenging to integrate and hard to use, interest in Hadoop stays strong. Our goal is to help enterprises harness Hadoop and integrate it with their existing data management environments to unlock the value of their data.