Although many software experts see the data warehouse system Apache Hive starting to disappear, it’s still regularly used to manage large data sets. Many Apache Hive features can also be found in its successors. For this reason, it’s worth taking a closer look at Hive and its most important uses.

What is Apache Hive?

Apache Hive is a scalable data warehouse extension built on top of the Apache Hadoop framework and its storage architecture. In Hadoop architectures, complex computing tasks are broken down into small processes and distributed across clusters of nodes. This allows large amounts of data to be processed on standard hardware such as commodity servers and computers. On top of this, Apache Hive provides an integrated, open-source query and analysis system for your data warehouse. Thanks to Hive, you can analyze, query and summarize data using HiveQL, a database language similar to SQL, which makes Hadoop data accessible even to large user groups.

With Hive you use syntax similar to SQL:1999 to structure programs, applications and databases, or to integrate scripts. Before Hive came along, you needed knowledge of Java programming to query data in the Hadoop framework. Hive makes it easy to translate such queries into jobs for the underlying execution engine, for example MapReduce jobs. It’s also possible to integrate other SQL-based applications into the Hadoop framework using Hive. Because SQL is so widespread, Hive as a Hadoop extension makes it easier for non-experts to work with databases and large amounts of data.

How does Hive work?

Before Apache Hive extended the Hadoop framework, the Hadoop ecosystem relied on the MapReduce framework, based on the programming model originally introduced by Google. In Hadoop 1, MapReduce was still implemented as a standalone engine that managed, monitored and controlled resources and computing processes directly in the framework. This in turn required knowledge of Java to successfully query Hadoop files.

The primary Hadoop functions for using and managing large data sets can be summarized as follows:

  • Data summaries
  • Queries
  • Analysis

The way Hive works is based on a simple principle: it uses an SQL-like interface, HiveQL, to translate queries and analyses of Hadoop files into MapReduce, Spark or Tez jobs. To do this, Hive organizes data from the Hadoop framework into table formats compatible with HDFS (short for Hadoop Distributed File System). Targeted data queries can then be carried out on specific clusters and nodes in the Hadoop system. Hive also offers standard features such as filters, aggregations and joins.
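A query combining these standard features might look as follows — a minimal sketch in HiveQL, assuming two hypothetical tables `orders` and `customers` already exist in the warehouse:

```sql
-- Hive translates this single statement into MapReduce, Tez or Spark jobs.
SELECT c.country,
       COUNT(*)      AS order_count,      -- aggregation
       SUM(o.amount) AS total_amount
FROM   orders o
JOIN   customers c ON o.customer_id = c.id   -- join
WHERE  o.order_date >= '2024-01-01'          -- filter
GROUP  BY c.country;
```

The user writes only the declarative query; how it is split into distributed jobs across the cluster is decided by Hive’s execution engine.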

Hive is based on Schema-on-Read

Unlike relational databases, which follow the Schema-on-Write (SoW) principle, Hive is based on the Schema-on-Read (SoR) principle. This means that data in the Hadoop framework is first stored unedited rather than being saved into a predefined schema. Only when a Hive query is run is the data mapped to a schema. The main advantage of this lies in cloud computing: it offers more scalability, flexibility and quicker load times for databases distributed across clusters.
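Schema-on-Read can be illustrated with an external table — a sketch with hypothetical file and column names, assuming raw CSV log files already sit in an HDFS directory:

```sql
-- The files in /data/raw/web_logs stay exactly as they are;
-- the schema below is only applied when the table is queried.
CREATE EXTERNAL TABLE web_logs (
  ip  STRING,
  ts  STRING,
  url STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/raw/web_logs';
```

No data is copied or converted at load time; dropping the external table later removes only the schema, not the underlying files.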

How to work with data in Hive

To query and analyze data with Hive, you use Apache Hive tables in accordance with the Schema-on-Read principle. Hive lets you organize and sort the data in these tables into small, detailed units or large, general ones. Hive tables can be divided into partitions, and these in turn into “buckets”, i.e. subsets of the data. To access the data, you use HiveQL, a database language similar to SQL. Among other things, Hive tables can be overwritten or appended to, and their data can be serialized in databases. Each Hive table has its own HDFS directory.
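Partitions (the large, general units) and buckets (the small, detailed ones) are both declared when a table is created — a sketch with hypothetical table and column names:

```sql
CREATE TABLE page_views (
  user_id BIGINT,
  url     STRING
)
PARTITIONED BY (view_date STRING)       -- one HDFS subdirectory per date
CLUSTERED BY (user_id) INTO 32 BUCKETS  -- 32 bucket files within each partition
STORED AS ORC;
```

A query that filters on `view_date` then only has to read the matching partition directories instead of scanning the whole table.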


Hive’s most important features

Hive’s key features include querying and analyzing large amounts of data and data sets that are saved as Hadoop files in the Hadoop framework. Hive’s second primary task is translating HiveQL queries into MapReduce, Spark and Tez jobs.

Here’s a summary of other important Hive functions:

  • Saving metadata in relational database management systems
  • Using compressed data in Hadoop systems
  • UDFs (user-defined functions) for data processing and data mining
  • Support for storage formats such as RCFile, Text or HBase
  • Using MapReduce and ETL support
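Two of these features — storage formats and user-defined functions — are used directly from HiveQL. A sketch with hypothetical names (`my_udfs.jar` and the Java class behind the UDF are assumptions for illustration):

```sql
-- Declaring the storage format per table:
CREATE TABLE events (
  id      BIGINT,
  payload STRING
)
STORED AS RCFILE;

-- Registering and using a user-defined function from a Java JAR:
ADD JAR /tmp/my_udfs.jar;
CREATE TEMPORARY FUNCTION normalize_url AS 'com.example.hive.NormalizeUrl';
SELECT normalize_url(payload) FROM events;
```

The UDF itself is written in Java; once registered, it can be called in queries like any built-in function.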

What is HiveQL?

When talking about Hive, you’ll often hear the term “SQL-like”. This refers to the Hive database language HiveQL, which is based on SQL but doesn’t fully conform to standards such as SQL-92. HiveQL can therefore be considered a kind of SQL or MySQL dialect. Despite the similarities, the languages differ in some essential respects: HiveQL supports some SQL features for transactions and subqueries only partially or not at all. On the other hand, it has its own extensions, such as multi-table inserts, offering better scalability and performance in the Hadoop framework. The Apache Hive compiler translates HiveQL queries into MapReduce, Tez or Spark jobs.
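The multi-table insert mentioned above lets one scan of a source table feed several target tables — a sketch assuming hypothetical tables `page_views`, `daily_counts` and `distinct_users` with matching schemas:

```sql
-- The source table is read once; each INSERT clause fills its own target.
FROM page_views pv
INSERT OVERWRITE TABLE daily_counts
  SELECT pv.view_date, COUNT(*)
  GROUP BY pv.view_date
INSERT OVERWRITE TABLE distinct_users
  SELECT DISTINCT pv.user_id;
```

In standard SQL this would require two separate statements and two full scans, which is exactly the overhead this HiveQL extension avoids on large Hadoop data sets.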


Data security and Apache Hive

By integrating Apache Hive into Hadoop systems, you can also benefit from the authentication service Kerberos, which provides reliable, mutual authentication and verification between servers and users. Since HDFS specifies the permissions for new Hive files, authorizing individual users and groups is up to you. Another important security aspect is that Hive supports the recovery of critical workflows if you need it.

What are the benefits of Apache Hive?

If you’re working with large amounts of data in cloud computing or with Big Data as a Service, Hive offers many useful features, such as:

  • Ad-hoc queries
  • Data analysis
  • The creation of tables and partitions
  • Support for logical, relational and arithmetic operators
  • The monitoring and checking of transactions
  • Day-end reports
  • The loading of query results into HDFS directories
  • The transfer of table data to local directories
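The last two points — exporting query results to HDFS or to a local directory — are both single HiveQL statements. A sketch, again using the hypothetical `page_views` table:

```sql
-- Writing query results to an HDFS directory:
INSERT OVERWRITE DIRECTORY '/results/top_urls'
SELECT url, COUNT(*) FROM page_views GROUP BY url;

-- Writing the same results to a local directory on the client instead:
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/top_urls'
SELECT url, COUNT(*) FROM page_views GROUP BY url;
```

Note that `OVERWRITE` replaces any existing contents of the target directory.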

The main benefits of this include:

  • Qualitative findings from large amounts of data, e.g. for data mining and machine learning
  • Optimized scalability, cost efficiency and expandability for large Hadoop frameworks
  • Segmentation of user groups through clickstream analysis
  • No deep knowledge of Java programming required thanks to HiveQL
  • Competitive advantages thanks to faster, scalable response times and performance
  • The ability to store hundreds of petabytes of data and handle up to 100,000 data queries per hour without high-end infrastructure
  • Better resource utilization and quicker computing and loading times, depending on the workload, thanks to virtualization capabilities
  • Good, fault-tolerant data security thanks to improved disaster recovery options and the Kerberos authentication service
  • Faster data ingestion, since there is no need to adapt data to internal database formats (Hive reads and analyzes data without manual format changes)
  • Development under the open source principle

What are the disadvantages of Apache Hive?

One of the disadvantages of Apache Hive is the fact that there are already many successors that offer better performance, which is why experts consider Hive less relevant for managing and using databases.

Other disadvantages include:

  • No real-time data access
  • Complex processing and updating of data sets in the Hadoop framework with MapReduce
  • High latency, making it much slower than competing systems

An overview of the Hive architecture

The most important components of the Hive architecture include:

  • Metastore: The central Hive storage location containing all metadata, such as table definitions, schemas and directory locations, as well as information on partitions, in RDBMS format
  • Driver: Accepts HiveQL commands and processes them using the Compiler (collecting information), the Optimizer (choosing the optimal processing method) and the Executor (executing the task)
  • Command Line + User Interface: The interface for external users
  • Thrift Server: Allows external clients to communicate with Hive over the network via JDBC- and ODBC-like protocols
Image: Diagram of the Hive architecture
Hive offers your company many benefits when processing and querying large amounts of data.

How did Apache Hive come about?

Apache Hive aims to make it easier for users without deep SQL knowledge to work with petabytes of data. It was developed in 2007 by Joydeep Sen Sarma and Ashish Thusoo while they were building Facebook’s Hadoop framework, whose data warehouse, with hundreds of petabytes of data, is one of the biggest in the world. In 2008, Facebook released Hive to the open source community, and in February 2015 version 1.0 was published.
