More and more companies have large amounts of data that are valuable resources for customer seg­men­ta­tion, sales man­age­ment, and target marketing. However, if these data sets cannot be suf­fi­cient­ly analyzed and evaluated, they are prac­ti­cal­ly worthless to companies. There is a wealth of in­for­ma­tion here, but only those who know how to use it can benefit from it. This is also pointed out by trend re­searcher and fu­tur­ol­o­gist John Naisbitt with his well-known quote:

Quotation

“We are drowning in in­for­ma­tion, but starving for knowledge.”

– Trend re­searcher and fu­tur­ol­o­gist, John Naisbitt, on growing volumes of digital data

Data mining tools help to manage the amount of data and identify po­ten­tial­ly decisive trends and patterns. Data mining software is becoming in­creas­ing­ly complex and the selection of tools is growing. To help you keep track of the most important data mining programs, we have compiled a com­par­i­son of the various data mining programs available.

Tech­niques, tasks, and com­po­nents of data mining

Data mining is the term used for algorithmic methods of data evaluation that are applied to particularly large and complex data sets. It is designed to extract hidden information from large volumes of data (especially mass data, known as Big Data) and thereby identify the correlations, trends, and patterns contained in them more reliably. This is where data mining tools come in. The term 'data mining' does not mean generating data, nor does it refer to the data sets themselves; it refers to the practice of data analysis. Many of the methods used come from statistics; however, data mining is not purely statistical, but rather an interdisciplinary method that combines computer science and mathematical findings with machine-learning technologies (especially unsupervised learning) and artificial intelligence. These powerful methods are integrated into data mining software to enable large data sets to be evaluated.

Fact

Text mining is a special form of data mining, which gains special relevance due to the pop­u­lar­i­ty of language software and language tech­nol­o­gy. In­for­ma­tion retrieval here does not refer to data sets, but to text documents. The main points are extracted from large amounts of text (spe­cial­ist articles or company documents). This makes text mining useful for companies when re­search­ing new projects, for example.

Nev­er­the­less, users must also have a good un­der­stand­ing of data sets in order for data mining to be suc­cess­ful. Only then can they use the data mining tools in a mean­ing­ful way – pro­gram­ming skills are not required.

In­di­vid­ual data mining tasks:

  • Clas­si­fi­ca­tion: Assigns in­di­vid­ual data objects to certain pre­de­fined classes (such as cats or bicycles) that were not pre­vi­ous­ly assigned to these classes; the decision tree analysis is par­tic­u­lar­ly helpful for clas­si­fi­ca­tion.
  • Deviation analysis (outlier detection): Identifies objects that do not comply with the rules of dependency for related objects; this enables you to find the causes of the discrepancies.
  • Cluster analysis: Iden­ti­fies clusters of sim­i­lar­i­ties and then forms groups of objects that are more similar in terms of certain aspects than other groups; unlike clas­si­fi­ca­tion, the groups (or clusters) are not pre­de­fined and can take different forms depending on the data analyzed.
  • Association analysis: Reveals correlations between two or more independent items that are not directly related but frequently occur together.
  • Re­gres­sion analysis: Reveals re­la­tion­ships between a dependent variable (e.g. product sales) and one or more in­de­pen­dent variables (e.g. product price or customer income), and is used, among other things, to make forecasts about the dependent variable (e.g. a sales forecast).
  • Predictive analytics: This is actually a higher-level task that aims to make predictions about future trends. It uses data mining, among other things, and works with a variable (predictor) that is measured for individual people or larger entities.

Fact

With the help of as­so­ci­a­tion analysis, in­for­ma­tive cor­re­la­tions could be es­tab­lished during pur­chas­ing decisions for different products, which sig­nif­i­cant­ly improved the shopping basket analysis. This method is used to determine rec­om­mend­ed purchases from online mail order companies.
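The idea behind shopping basket analysis can be illustrated with a minimal Python sketch. It is not taken from any of the tools compared here; the function name `pair_rules`, the threshold `min_support`, and the sample baskets are assumptions made for illustration. The sketch counts how often two items appear in the same basket (support) and how often the second item appears given the first (confidence).

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets, min_support=0.5):
    """Return (antecedent, consequent, support, confidence) for item pairs.

    Hypothetical helper for illustration only.
    """
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # ignore duplicates within one basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets containing both items
        if support < min_support:
            continue
        # Confidence: share of baskets with the antecedent that also
        # contain the consequent.
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    # Highest confidence first, then alphabetical for a stable order.
    return sorted(rules, key=lambda r: (-r[3], r[0], r[1]))


# Made-up sample data.
baskets = [
    ["bread", "butter", "jam"],
    ["bread", "butter"],
    ["bread", "milk"],
    ["butter", "milk"],
]
rules = pair_rules(baskets, min_support=0.5)
for antecedent, consequent, support, confidence in rules:
    print(f"{antecedent} -> {consequent}: support={support:.2f}, confidence={confidence:.2f}")
```

A rule such as "bread -> butter" with high confidence is the kind of correlation a shop could use for purchase recommendations. Real tools use more efficient algorithms and also consider item sets larger than pairs.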

The different methods can be roughly divided into so-called ob­ser­va­tion problems (deviation analysis, cluster analysis) and fore­cast­ing problems (re­gres­sion analysis, clas­si­fi­ca­tion). A detailed ex­pla­na­tion of different data mining methods can be found on Zentut.
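As an example of a forecasting problem, the following Python sketch fits a straight line to a few observations using ordinary least squares and uses it for a simple sales forecast. The function name `fit_line` and the price and sales figures are invented for illustration; real data mining tools offer far more elaborate regression models.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit for y = intercept + slope * x.

    Hypothetical helper for illustration only.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up observations: independent variable (price) and
# dependent variable (units sold).
prices = [10.0, 12.0, 14.0, 16.0]
units_sold = [200.0, 180.0, 160.0, 140.0]

intercept, slope = fit_line(prices, units_sold)

# Forecast the dependent variable for a price that was not observed.
forecast = intercept + slope * 18.0
print(f"units_sold ~ {intercept:.1f} + {slope:.1f} * price")
print(f"forecast at price 18: {forecast:.1f}")
```

In this sketch, the slope expresses how strongly sales react to a price change, which is the relationship between dependent and independent variables that regression analysis is meant to reveal.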

A com­par­i­son of data mining tools

In this comparison of the best data mining tools, we introduce RapidMiner, WEKA, Orange, KNIME, and SAS. Experience shows that many users work with several programs, because data mining tools have different strengths that can be combined. The tools are often compatible with each other. But even with just one good all-rounder, you can achieve a lot as a beginner.

Rapid­Min­er

RapidMiner (formerly known as YALE, 'Yet Another Learning Environment') is one of the most popular data mining tools. According to a survey conducted by KDnuggets in 2014, it was the most widely used data mining tool, ahead of R. It is available for free and is easy to use even if you don’t have special programming skills. Nevertheless, it offers a large selection of operators. Startups, in particular, make the most of this tool.

Rapid­Min­er was written in Java and contains more than 500 operators with different ap­proach­es to point out con­nec­tions in data – there are options for data mining, text mining, web mining, and also for mood analysis (sentiment analysis, opinion mining), among other things. The program also imports Excel tables, SPSS files, and data sets from many databases, and in­te­grates the WEKA and R data mining tools. This makes it a com­pre­hen­sive all-rounder.

Rapid­Min­er supports all steps of the data mining process, including the pre­sen­ta­tion of results. The tool consists of three major modules: Rapid­Min­er Studio, Rapid­Min­er Server, and Rapid­Min­er Radoop, each of which executes different data mining tech­niques. In addition, Rapid­Min­er prepares the data prior to analysis and optimizes it for faster sub­se­quent pro­cess­ing. For each of these three modules, there’s a free and a fee-based version available.

A par­tic­u­lar strength of Rapid­Min­er is pre­dic­tive analytics, which is the name given to pre­dict­ing future de­vel­op­ments based on collected data. When comparing data mining software, Rapid­Min­er is one of the strongest tools out of the ones mentioned.

WEKA

WEKA (Waikato Environment for Knowledge Analysis) is open source software developed by the University of Waikato. The data mining tool is based on Java and runs on Windows, macOS, and Linux. Known for its extensive machine learning capabilities, it supports all major data mining tasks such as clustering, association, regression, and classification. The graphical user interface makes the software easy to access. In addition, WEKA can connect to SQL databases and further process the data returned by a query.

WEKA’s strength lies in classification: the tool is known for its many classification methods, including artificial neural networks and decision tree algorithms such as ID3 and C4.5. However, WEKA is less powerful when it comes to other techniques such as cluster analysis, for which it offers only the most important procedures.

Another disadvantage: WEKA can run into processing problems when the data set becomes too large, because the tool tries to load all of it into memory. To work around this, WEKA offers a simple command line interface (CLI) that makes it easier to handle large amounts of data.
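To give an impression of how decision tree algorithms such as ID3 decide which attribute to split on, here is a minimal Python sketch of the information gain criterion (C4.5 refines this idea with a normalized variant, the gain ratio). This is not WEKA code; the function names, the attribute names, and the customer records are assumptions chosen for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Reduction in entropy of `target` after splitting `rows` on `attribute`."""
    base = entropy([row[target] for row in rows])
    # Group the target labels by the value of the candidate attribute.
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    # Weighted average entropy of the groups after the split.
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


# Made-up customer records with the class label "bought".
customers = [
    {"newsletter": "yes", "region": "north", "bought": "yes"},
    {"newsletter": "yes", "region": "south", "bought": "yes"},
    {"newsletter": "no", "region": "north", "bought": "no"},
    {"newsletter": "no", "region": "south", "bought": "no"},
]

gain_newsletter = information_gain(customers, "newsletter", "bought")
gain_region = information_gain(customers, "region", "bought")
print(f"gain(newsletter) = {gain_newsletter:.2f}")
print(f"gain(region)     = {gain_region:.2f}")
```

In this toy data set, the newsletter attribute separates buyers from non-buyers perfectly, while the region carries no information, so a tree builder based on information gain would split on the newsletter attribute first.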

Fact

WEKA was awarded the 'SIGKDD Service Award' from the Association for Computing Machinery for its outstanding contribution to research. In comparison to other data mining tools, WEKA has proven particularly useful for teaching and research purposes.

Orange

The data mining tool Orange has existed for more than 20 years and is a project from the University of Ljubljana. The software’s core was written in C++, but early on the program was extended with the Python programming language, which is now used as the query language. The more complicated operations are still carried out in C++. Orange is a comprehensive data mining software that demonstrates how much you can do with Python: it offers useful applications for data and text analysis as well as features for machine learning. When it comes to data mining, it works with operators for classification, regression, clustering, and much more. This data mining tool also integrates visual programming.

What is striking about the tool is that users re­peat­ed­ly emphasize how fun this data mining software is compared to others. Both beginners and ex­pe­ri­enced users have admitted to being fas­ci­nat­ed by Orange. Its pop­u­lar­i­ty comes down to two things: firstly, the appealing data vi­su­al­iza­tion that makes it more in­ter­est­ing to work with; secondly, the speed and ease with which the vi­su­al­iza­tion takes place. The program prepares input data visually and instantly. Un­der­stand­ing these graphics and pro­cess­ing the data analysis further is rel­a­tive­ly easy, and quick business decisions can be made. This makes Orange an ideal tool for data mining.

A further advantage for beginners is that there are numerous online tutorials available for the tool. Another special feature of Orange is that it learns the pref­er­ences of its users over time and reacts ac­cord­ing­ly. This is another plus for the data mining tool.

KNIME

KNIME was developed by the University of Constance and is now popular with a large international community of developers. Although KNIME was originally intended for commercial use, it is still available as open source software. It was written in Java and developed with Eclipse. If you compare this data mining software with others, its range of functions is especially impressive: with more than 1,000 modules and ready-made application packages, this tool helps to reveal hidden data structures. The modules can be expanded with additional commercial features.

Among its functions, integrative data analysis is particularly appealing: KNIME is one of the most powerful tools in its field and enables numerous methods of machine learning and data mining to be integrated. It is also particularly effective at preprocessing data, i.e., extracting, transforming, and loading data. Its modular pipelining makes it a data flow-oriented data mining tool.

KNIME has been used in pharmaceutical research since 2006 and is also a powerful data mining tool for the financial data sector. In addition, KNIME is frequently used in the business intelligence (BI) sector. Here, KNIME is regarded as the tool that made predictive analytics available to inexperienced users as well. The tool is also interesting for beginners because, despite its many strong features, you don’t need much time to familiarize yourself with it. KNIME is available as a free program as well as a paid program.
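The idea of data flow-oriented, modular pipelining can be pictured with a small Python sketch in which each step (extract, transform, load) is a separate module and the output of one step becomes the input of the next. This is only a conceptual illustration and does not use KNIME or its API; all function names, the CSV content, and the in-memory `warehouse` are assumptions.

```python
import csv
import io


def extract(csv_text):
    """Extract: parse raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows):
    """Transform: drop rows without revenue and convert revenue to float."""
    cleaned = []
    for row in rows:
        if row["revenue"].strip():
            cleaned.append(
                {"customer": row["customer"].strip(), "revenue": float(row["revenue"])}
            )
    return cleaned


def load(rows, store):
    """Load: write the cleaned rows into a store keyed by customer."""
    for row in rows:
        store[row["customer"]] = row["revenue"]
    return store


def run_pipeline(source, steps):
    """Pass the data through each step in order, like nodes in a data flow."""
    data = source
    for step in steps:
        data = step(data)
    return data


# Made-up input; the row for "bob" has no revenue and is filtered out.
raw = "customer,revenue\nalice,120.50\nbob,\ncarol,80\n"
warehouse = {}  # stands in for a real target database
result = run_pipeline(raw, [extract, transform, lambda rows: load(rows, warehouse)])
print(result)
```

Because every step only depends on the data handed to it, individual modules can be swapped or reordered without touching the rest of the flow, which is the main appeal of the modular approach.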

SAS

SAS (Statistical Analysis System) is a product of the SAS Institute, one of the world’s largest privately owned software companies. SAS is the leading data mining tool for business analysis and is also the most expensive of the programs listed here. However, it is the one that is best suited for use in large companies. SAS is particularly strong in forecasting and in interactive data visualization, which is ideal for large presentations.

In principle, this data mining software provides a comprehensive all-round solution for successful data mining. The tool is characterized by very high scalability, so it’s possible to increase performance proportionally by adding additional hardware or other resources. This also makes it a powerful tool for high-quality business solutions. For technically less experienced users, it has a graphical user interface.

SAS is usually subject to a fee; it can only be used free of charge if you get a corresponding license from a public institution. The costs are quoted on request and depend on special conditions, i.e., it’s cheaper for authorities or educational institutions. However, it is possible to customize the range of functions and therefore influence the price.

SAS is mainly used in pharmaceutical companies, where it has established itself as the standard. It is also frequently used in the banking sector and offers optimal solutions for BI and web mining. Among other things, it has its own business intelligence software for this purpose. This makes it one of the most powerful data mining tools on the market.

Data mining tools at a glance

Following the detailed comparison above, here’s an overview of the most important features of each data mining tool:

| Tool | Characteristics | Programming language | Operating system | Price/license |
|---|---|---|---|---|
| RapidMiner | Strong all-rounder with a special strength in predictive analytics | Java | Windows, macOS, Linux | Freeware; various fee-based versions |
| WEKA | Many methods of classification | Java | Windows, macOS, Linux | Free software (GPL) |
| Orange | Creates particularly appealing and interesting data visualizations without the need for extensive prior knowledge | Software core: C++; extensions and query language: Python | Windows, macOS, Linux | Free software (GPL) |
| KNIME | The leading open data mining tool that has made predictive analytics available to the general public | Java | Windows, macOS, Linux | Free software (GPL) (from version 2.1 onwards) |
| SAS | Expensive, but powerful data mining software for large enterprises | SAS language | Windows, macOS, Linux | Limited freeware available through educational institutions; price only available on request; various extensive models available |