Big Data: the buzzword for the massive amounts of data generated by our increasingly digitized lives. Companies around the world are busy developing more efficient methods for compiling electronic data on a massive scale, saving it on enormous storage architectures, and processing it systematically. Data sets measured in petabytes or exabytes are no longer a rarity these days. But no single system is able to effectively process volumes of data of this magnitude. Big Data analyses therefore require software platforms that make it possible to distribute complex computing tasks across a large number of computer nodes. One prominent solution is Apache Hadoop, a framework that serves as the basis for several distributions and Big Data suites.

What is Hadoop?

Apache Hadoop is a Java-based framework comprising a diverse range of software components that make it possible to split computing tasks (jobs) into subprocesses. These are distributed across the nodes of a computer cluster, where they run simultaneously. Large Hadoop architectures can span thousands of individual computers. The advantage of this concept is that each computer in the cluster only has to provide a fraction of the required hardware resources. Working with large quantities of data therefore does not necessarily require high-end machines; it can be carried out on a large number of cost-effective servers.
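The basic idea of splitting a job into subtasks that run in parallel on separate workers can be sketched in a few lines of Python. This is a conceptual illustration only: Hadoop distributes subtasks across machines in a cluster, while the sketch uses threads in a single process.

```python
from concurrent.futures import ThreadPoolExecutor

def subtask(chunk):
    # Each worker processes only its own fraction of the data.
    return sum(chunk)

data = list(range(1000))
# Split the job into four subtasks, one per (simulated) worker node.
chunks = [data[i::4] for i in range(4)]

with ThreadPoolExecutor(max_workers=4) as pool:
    partial_results = list(pool.map(subtask, chunks))

total = sum(partial_results)  # combine the partial results
```

No single worker ever sees the whole data set; only the small partial results are combined at the end, which is exactly why commodity hardware suffices.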

The open source project Hadoop was initiated in 2006 by the developer Doug Cutting and can be traced back to Google's MapReduce algorithm. In 2004, the search engine provider published information on a new technology that made it possible to parallelize complex computing processes over large quantities of data with the help of a computer cluster. Cutting, who had spent time at Excite, Apple, and Xerox PARC and had already made a name for himself with Apache Lucene, soon recognized the potential of MapReduce for his own project and received support from his employer at the time, Yahoo. In 2008, Hadoop became a top-level project of the Apache Software Foundation, and in 2011 the framework finally achieved release status 1.0.0.

In addition to Apache's official release, there are also various forks of the software framework available as business-ready distributions offered by different software providers. Support for Hadoop is offered, for example, by Doug Cutting's current employer Cloudera, which provides an 'enterprise-ready' open source distribution with CDH. Hortonworks and Teradata feature similar products, and Microsoft and IBM have both integrated Hadoop into their respective products, the cloud service Azure and InfoSphere BigInsights.

Hadoop Architecture: set-up and basic components

Generally, when one refers to Hadoop, this means an extensive software package, sometimes also called the Hadoop ecosystem. Alongside the system's core components (Core Hadoop), it contains various extensions (many carrying colorful names like Pig, Chukwa, Oozie, or ZooKeeper) that add further functions for processing large amounts of data to the framework. These closely related projects also hail from the Apache Software Foundation.

Core Hadoop constitutes the basis of the Hadoop ecosystem. In version 1, the integral components of the software core are the base module Hadoop Common, the Hadoop Distributed File System (HDFS), and a MapReduce engine; in version 2, the engine was replaced by the cluster management system YARN (also referred to as MapReduce 2.0). This set-up removes the MapReduce algorithm from the actual management system, giving it the status of a YARN-based plugin.

Hadoop Common

The module Hadoop Common provides all of the framework's other components with a set of basic functions. Among these are:

  • Java archive files (JAR) that are needed to start Hadoop,
  • Libraries for serializing data,
  • Interfaces for accessing the file system of the Hadoop architecture and for the remote-procedure-call (RPC) communication within the computer cluster.

Hadoop Distributed File System (HDFS)

HDFS is a highly available file system used to save large quantities of data in a computer cluster, and it is therefore responsible for storage within the framework. To this end, files are separated into data blocks and redundantly distributed across different nodes without any predefined organizational scheme. According to the developers, HDFS is able to manage files numbering in the millions.

The Hadoop cluster generally functions according to the master/slave model. The architecture of the framework is composed of a master node to which numerous subordinate 'slave' nodes are assigned. This principle recurs in the structure of HDFS, which is based on a NameNode and various subordinate DataNodes. The NameNode manages all metadata of the file system, the directory structures, and the files. The actual data is stored on the subordinate DataNodes. To minimize data loss, files are separated into individual blocks and saved multiple times on different nodes. In the standard configuration, each data block is stored in triplicate.
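The splitting and triple replication described above can be sketched as a simplified model. The block size, node names, and round-robin placement below are purely illustrative; real HDFS uses a much larger block size (128 MB by default) and rack-aware replica placement.

```python
BLOCK_SIZE = 4          # tiny for illustration; HDFS defaults to 128 MB
REPLICATION = 3         # standard configuration: each block in triplicate
DATANODES = ["node1", "node2", "node3", "node4"]

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Separate a file into fixed-size data blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, nodes=DATANODES, replication=REPLICATION):
    """Assign each block to `replication` distinct DataNodes
    (round-robin here; real HDFS placement is rack-aware)."""
    return {i: [nodes[(i + r) % len(nodes)] for r in range(replication)]
            for i in range(len(blocks))}

blocks = split_into_blocks(b"hello hadoop world!")
placement = place_replicas(blocks)
```

The mapping from block index to DataNodes is exactly the kind of metadata the NameNode keeps, while the DataNodes hold only the raw blocks.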

Every DataNode sends the NameNode a sign of life, known as a 'heartbeat', at regular intervals. Should this signal fail to materialize, the NameNode declares the corresponding slave 'dead' and, with the help of the copies on other nodes, ensures that enough replicas of the affected data blocks remain available in the cluster. The NameNode thus occupies a central role within the framework. To keep it from becoming a single point of failure, it is common practice to provide this master node with a SecondaryNameNode, which records changes made to the metadata and so makes it possible to restore the central control instance of HDFS.
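The heartbeat mechanism can be sketched as follows. This is a simplified model with an invented timeout constant; the real NameNode uses much longer timeouts and additionally triggers re-replication of the blocks that lived on the dead node.

```python
import time

HEARTBEAT_TIMEOUT = 10.0  # illustrative; HDFS waits far longer in practice

class NameNode:
    """Tracks the last heartbeat of each DataNode and declares a
    slave 'dead' when its heartbeat stops arriving."""

    def __init__(self):
        self.last_heartbeat = {}

    def receive_heartbeat(self, datanode, now=None):
        # Record the time of the DataNode's latest sign of life.
        self.last_heartbeat[datanode] = now if now is not None else time.time()

    def dead_nodes(self, now=None):
        # Any DataNode silent for longer than the timeout is declared dead.
        now = now if now is not None else time.time()
        return [dn for dn, t in self.last_heartbeat.items()
                if now - t > HEARTBEAT_TIMEOUT]

nn = NameNode()
nn.receive_heartbeat("datanode1", now=0.0)
nn.receive_heartbeat("datanode2", now=0.0)
nn.receive_heartbeat("datanode1", now=30.0)   # datanode2 falls silent
```

At time 30.0, `dead_nodes` reports only `datanode2`, since `datanode1` kept sending heartbeats.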

During the transition from Hadoop 1 to Hadoop 2, HDFS gained further safeguards: NameNode HA (high availability) adds a failover mechanism that automatically starts a backup component whenever the NameNode crashes. What's more, a snapshot function enables the system to be reset to a prior state. Additionally, the Federation extension makes it possible to operate multiple NameNodes within one cluster.

MapReduce Engine

Originally developed by Google, the MapReduce algorithm, implemented in Hadoop version 1 as an autonomous engine within the framework, is a further main component of Core Hadoop. This engine is primarily responsible for managing resources as well as controlling and monitoring computing processes (job scheduling/monitoring). Data processing relies largely on the phases 'map' and 'reduce', which make it possible to process data directly where it is stored (data locality), decreasing computing time and network throughput. As part of the map phase, complex computing processes (jobs) are separated into individual parts and distributed by a so-called JobTracker (located on the master node) to numerous slave systems in the cluster. There, TaskTrackers ensure that the subprocesses are executed in parallel. In the subsequent reduce phase, the interim results are collected by the MapReduce engine and compiled into a single overall result.

While master nodes generally contain the components NameNode and JobTracker, a DataNode and a TaskTracker work on each subordinate slave. The following graphic displays the basic structure of a Hadoop architecture (according to version 1), split into a MapReduce layer and an HDFS layer.

With the release of Hadoop version 2, the MapReduce engine was fundamentally overhauled. The result is the cluster management system YARN/MapReduce 2.0, which decouples resource and task management (job scheduling/monitoring) from MapReduce and thereby opens the framework to new processing models and a wide range of Big Data applications.

YARN/MapReduce 2.0

With the introduction of the YARN module ('Yet Another Resource Negotiator') in version 2, Hadoop's architecture went through a fundamental change that marks the transition from Hadoop 1 to Hadoop 2. While Hadoop 1 only offers MapReduce as an application, YARN enables resource and task management to be decoupled from the data processing model, which allows a wide variety of Big Data applications to be integrated into the framework. Consequently, MapReduce under Hadoop 2 is only one of many possible applications for accessing data that can be executed in the framework. The framework is thus more than a simple MapReduce runtime environment; YARN assumes the role of a distributed operating system for the resource management of Big Data applications.

The fundamental changes to the Hadoop architecture primarily affect the two trackers of the MapReduce engine, which no longer exist as autonomous components in Hadoop 2. Instead, the YARN module relies on three new entities: the ResourceManager, the NodeManager, and the ApplicationMaster.

  • ResourceManager: the global ResourceManager acts as the highest authority in the Hadoop architecture (master) and is assigned various NodeManagers as 'slaves'. Its responsibilities include monitoring the computer cluster, orchestrating the distribution of resources among the subordinate NodeManagers, and distributing applications. The ResourceManager knows where the individual slave systems within the cluster are located and which resources they can provide. A particularly important component of the ResourceManager is the ResourceScheduler, which decides how the available resources in the cluster are distributed.
  • NodeManager: a NodeManager runs on each node of the computer cluster. It takes the slave position in the Hadoop 2 infrastructure and thereby acts as a command recipient of the ResourceManager. When a NodeManager is started on a node in the cluster, it registers with the ResourceManager and sends a 'sign of life', the heartbeat, at regular intervals. Each NodeManager is responsible for the resources of its own node and provides the cluster with a portion of them. How these resources are used is decided by the ResourceManager's ResourceScheduler.
  • ApplicationMaster: each application executed within the YARN system has its own ApplicationMaster, which requests resources from the ResourceManager and is allocated them in the form of containers. The ApplicationMaster executes and monitors the Big Data application in these containers.

Here’s a schematic depiction of the Hadoop 2 ar­chi­tec­ture:

Should a Big Data application need to be executed on Hadoop 2, there are generally three actors involved:

  • A client,
  • A ResourceManager, and
  • One or more NodeManagers.

First, the client gives the ResourceManager an order (job) that is to be started as a Big Data application in the cluster. The ResourceManager then allocates a container; in other words, it reserves the cluster's resources for the application and contacts a NodeManager. The contacted NodeManager starts the container and executes an ApplicationMaster within it. This latter component is responsible for executing and monitoring the application.
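The sequence above can be sketched as a toy interaction between the three actors. All class and method names are illustrative, not YARN's real API, and the 'scheduling' is deliberately trivial.

```python
class NodeManager:
    """Slave daemon: starts containers on its node when asked."""
    def __init__(self, name):
        self.name = name
        self.containers = []

    def start_container(self, app):
        container = f"container-for-{app}@{self.name}"
        self.containers.append(container)
        return container

class ResourceManager:
    """Master daemon: reserves resources and picks a NodeManager."""
    def __init__(self, node_managers):
        self.node_managers = node_managers

    def submit_job(self, app):
        # 1. Allocate a container, i.e. reserve cluster resources ...
        node = self.node_managers[0]  # trivial scheduling for the sketch
        # 2. ... and ask a NodeManager to launch the ApplicationMaster in it.
        container = node.start_container(app)
        return f"ApplicationMaster({app}) running in {container}"

nm = NodeManager("node1")
rm = ResourceManager([nm])
status = rm.submit_job("wordcount")
```

Note that the ResourceManager never runs application code itself; it only brokers resources, which is what allows arbitrary processing models to coexist on one cluster.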

The Hadoop ecosystem: optional expansion components

In addition to the system's core components, the Hadoop ecosystem encompasses a wide range of extensions, maintained as separate software projects, that make substantial contributions to the functionality and flexibility of the software framework. Thanks to the open source code and numerous interfaces, these optional add-on components can be freely combined with the system's core functions. The following is a list of the most popular projects in the Hadoop ecosystem:

  • Ambari: the Apache project Ambari was initiated by the Hadoop distributor Hortonworks and adds installation and management tools to the ecosystem that facilitate provisioning IT resources and managing and monitoring Hadoop components. To this end, Apache Ambari offers a step-by-step wizard for installing Hadoop services on any number of computers in the cluster, as well as a management function with which services can be centrally started, stopped, or configured. A graphical user interface informs users about the status of the system. What's more, the Ambari Metrics System and the Ambari Alert Framework enable metrics to be recorded and alert levels to be configured.
  • Avro: Apache Avro is a system for serializing data. Avro uses JSON to define data types and protocols, while the actual data is serialized in a compact binary format. This serves as a data transfer format for communication between the different Hadoop nodes as well as between Hadoop services and client programs.
  • Cassandra: written in Java, Apache Cassandra is a distributed database management system for large amounts of structured data that follows a non-relational approach. This kind of software is also referred to as a NoSQL database. Originally developed by Facebook, the open source system aims to achieve high scalability and reliability in large, distributed architectures. Data is stored on the basis of key-value relations.
  • HBase: HBase is also an open source NoSQL database that enables real-time write and read access to large amounts of data within a computer cluster. HBase is based on Google's high-performance database system BigTable. Compared to other NoSQL databases, HBase is characterized by high data consistency.
  • Chukwa: Chukwa is a data acquisition and analysis system that builds on HDFS and MapReduce and allows real-time monitoring as well as data analyses in large, distributed systems. To achieve this, Chukwa uses agents that run on every monitored node and collect the log files of the applications running there. These files are sent to so-called collectors and then saved in HDFS.
  • Flume: Apache Flume is another service designed for collecting, aggregating, and moving log data. To stream data from different sources to HDFS for storage and analysis, Flume uses transport formats like Apache Thrift or Avro.
  • Pig: Apache Pig is a platform for analyzing large amounts of data that provides Hadoop users with the high-level programming language Pig Latin. Pig makes it possible to describe the flow of MapReduce jobs on an abstract level: MapReduce requests are no longer written in Java but programmed in the far more compact Pig Latin, which makes managing MapReduce jobs much simpler. For example, the language lets users express the parallel execution of complex analyses. Pig Latin was originally developed by Yahoo. The name reflects the approach of the software: like an 'omnivore', Pig is designed to process all types of data (structured, unstructured, or relational).
  • Hive: with Apache Hive, a data warehouse is added to Hadoop. This is a centralized database employed for different types of analyses. The software was developed by Facebook and is based on the MapReduce framework. With HiveQL, Hive provides a SQL-like syntax that makes it possible to query, aggregate, or analyze data saved in HDFS. To this end, Hive automatically translates SQL-like requests into MapReduce jobs.
  • HCatalog: a core component of Apache Hive is HCatalog, a metadata and table management system that makes it possible to store and process data independently of its format or structure. To do this, HCatalog describes the structure of the data and thereby makes it easier to use via Hive or Pig.
  • Mahout: Apache Mahout adds easily extensible Java libraries for data mining and mathematical machine learning applications to the Hadoop ecosystem. Algorithms implemented with Mahout on Hadoop enable operations like classification, clustering, and collaborative filtering. Mahout can be used, for instance, to develop recommendation services ('customers who bought this item also bought…').
  • Oozie: the optional workflow component Oozie makes it possible to create process chains, automate them, and have them executed in a time-controlled manner. This allows Oozie to compensate for deficits in Hadoop 1's MapReduce engine.
  • Sqoop: Apache Sqoop is a software component that facilitates the import and export of large data quantities between the Hadoop Big Data framework and structured data stores. Today, company data is generally held in relational databases; Sqoop makes it possible to exchange data efficiently between these storage systems and the computer cluster.
  • ZooKeeper: Apache ZooKeeper offers services for coordinating processes in the cluster by providing functions for storing, distributing, and updating configuration information.
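To make the serialization idea behind Avro concrete (schema defined in JSON, data packed in a compact binary format), here is a hand-rolled sketch using Python's struct module. Real Avro uses its own binary encoding and schema resolution rules; the record type and field names below are invented for the example.

```python
import json
import struct

# Schema described in JSON, as in Avro; the field order defines the layout.
schema = json.loads('''
{"type": "record", "name": "Measurement",
 "fields": [{"name": "sensor_id", "type": "int"},
            {"name": "value", "type": "double"}]}
''')

def serialize(record, schema):
    """Pack a record into compact binary form: no field names travel
    on the wire, because the schema already describes the layout."""
    out = b""
    for field in schema["fields"]:
        v = record[field["name"]]
        out += struct.pack(">i", v) if field["type"] == "int" else struct.pack(">d", v)
    return out

def deserialize(data, schema):
    """Reverse the packing using the same schema."""
    record, offset = {}, 0
    for field in schema["fields"]:
        fmt, size = (">i", 4) if field["type"] == "int" else (">d", 8)
        record[field["name"]] = struct.unpack_from(fmt, data, offset)[0]
        offset += size
    return record

blob = serialize({"sensor_id": 7, "value": 21.5}, schema)
```

Both sides only need the shared schema, which is why this style of serialization suits communication between heterogeneous Hadoop nodes and clients.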

Hadoop for businesses

Given that Hadoop clusters can be set up for processing large data volumes with the help of standard PCs, the Big Data framework has become a popular solution for numerous companies, including the likes of Adobe, AOL, eBay, Facebook, Google, IBM, LinkedIn, Twitter, and Yahoo. In addition to the ability to easily save and simultaneously process data on distributed architectures, Hadoop's stability, expandability, and extensive range of functions are further advantages of this open source software.
