Data reduction methods lessen the amount of data that is physically stored, saving storage space and costs.

What does data reduction mean?

The term data reduction covers various methods used to optimize capacity. Such methods aim to reduce the amount of data being stored. With data volumes increasing worldwide, data reduction is necessary to ensure resource- and cost-efficiency when storing data.

Data reduction can be carried out through data compression and deduplication. While lossless compression uses redundancies within a file to compress data, deduplication algorithms match data across files to avoid repetition.

What is deduplication?

Deduplication is a data reduction process that is essentially based on preventing data redundancies in the storage system. It can be implemented either at the storage target or at the data source. A deduplication engine uses special algorithms to identify and eliminate redundant files or data blocks. The main area of application for deduplication is data backup.

The aim of data reduction using deduplication is to write only as much information to non-volatile storage media as is necessary to reconstruct a file without loss. The more duplicates are eliminated, the smaller the data volume that needs to be stored or transferred.

Duplicates can be identified at file level, as in Git or Dropbox, for example. A more efficient method, however, is to use deduplication algorithms that work at the sub-file level. To do this, files are first broken down into data blocks (chunks) and assigned unique checksums, or hash values. A tracking database containing every checksum acts as the central supervisory entity.
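As a minimal sketch of this idea (not any particular product's implementation), the following snippet splits data into fixed 4 KB chunks, checksums each chunk with SHA-256, and uses a plain dictionary as a stand-in for the tracking database:

```python
import hashlib

CHUNK_SIZE = 4096  # fixed block length, matching a typical 4 KB cluster size

def deduplicate(data: bytes, store: dict[str, bytes]) -> list[str]:
    """Split data into fixed-size chunks and store each unique chunk once.

    `store` plays the role of the tracking database: it maps each chunk's
    checksum to the chunk itself. Returns the ordered list of checksums
    needed to reconstruct the original data.
    """
    recipe = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)  # write the chunk only if unseen
        recipe.append(digest)
    return recipe

def reconstruct(recipe: list[str], store: dict[str, bytes]) -> bytes:
    """Rebuild the original file losslessly from its chunk checksums."""
    return b"".join(store[d] for d in recipe)

store: dict[str, bytes] = {}
data = b"A" * 8192 + b"B" * 4096   # two identical 4 KB blocks of "A", one of "B"
recipe = deduplicate(data, store)
print(len(recipe), "chunk references,", len(store), "unique chunks stored")
# → 3 chunk references, 2 unique chunks stored
```

The file is described by three chunk references, but only two chunks are physically stored, which is exactly the saving deduplication aims at.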

Block-based deduplication methods come in two variations:

  • Fixed block length: Files are divided into sections of exactly the same length, based on the cluster size of the file system or RAID system (typically 4 KB).
  • Variable block length: The algorithm divides the data into blocks whose length varies depending on the type of data being processed.

The way blocks are divided has a massive influence on the efficiency of deduplication. This is especially noticeable when deduplicated files are subsequently modified. With fixed block sizes, if a file is changed, the shift in block boundaries causes all subsequent segments to be classified as new by the deduplication algorithm. This increases the computing effort and bandwidth usage.

If, on the other hand, an algorithm uses variable block boundaries, modifying an individual data block has no effect on the following segments: the modified block is simply extended and stored with the new bytes. This relieves the burden on the network. However, this flexibility is more computing-intensive, as the algorithm must first determine how the chunks are split up.
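The boundary-shift problem with fixed blocks can be made concrete with a small experiment (the block and data sizes here are arbitrary illustrations): inserting a single byte near the start of a file shifts every subsequent block boundary, so none of the previously recorded block hashes match anymore and the deduplicator sees the whole file as new.

```python
import hashlib

BLOCK = 64  # small fixed block length, chosen only to keep the demo readable

def block_hashes(data: bytes) -> list[str]:
    """Hash fixed-length blocks, as a fixed-block deduplicator would."""
    return [hashlib.sha256(data[i:i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)]

original = bytes(range(256)) * 4                  # 1024 bytes -> 16 blocks
modified = original[:10] + b"!" + original[10:]   # insert 1 byte near the start

before = block_hashes(original)
after = block_hashes(modified)
shared = sum(1 for h in after if h in set(before))
print(f"{shared} of {len(after)} blocks still match after a 1-byte insertion")
# → 0 of 17 blocks still match after a 1-byte insertion
```

A variable-block (content-defined) chunker sets its boundaries based on the local byte content rather than fixed offsets, so blocks after the insertion point would realign and keep matching.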


What is data compression?

In data compression, files are converted into an alternative format that is more compact than the original. The aim of this type of data reduction is to reduce the required storage space as well as the transfer time. Such a coding gain can be achieved with two different approaches:

  • Redundancy compression: With lossless data compression, data can be decompressed precisely after compression; input and output data are identical. This kind of compression is only possible when a file contains redundant information.
  • Irrelevance compression: With lossy compression, irrelevant information is deleted to compress a file. This is always accompanied by a loss of data, so the original can only be recovered approximately after irrelevance compression. The criteria for classifying data as irrelevant are discretionary. In MP3 audio compression, for example, the frequency patterns removed are those assumed to be barely audible, or inaudible, to humans.
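As a minimal illustration of redundancy compression, using Python's standard zlib module (an implementation of the lossless DEFLATE algorithm): redundant input shrinks dramatically, while input without redundancy cannot be compressed and even grows slightly from format overhead.

```python
import os
import zlib

redundant = b"ABCD" * 1024        # 4 KB of highly redundant data
random_data = os.urandom(4096)    # 4 KB with essentially no redundancy

for label, payload in (("redundant", redundant), ("random", random_data)):
    packed = zlib.compress(payload, 9)
    assert zlib.decompress(packed) == payload   # lossless: output == input
    print(f"{label}: {len(payload)} -> {len(packed)} bytes")
```

The round-trip assertion holds in both cases, which is precisely what distinguishes redundancy compression from irrelevance compression: nothing is discarded, so the gain depends entirely on how much redundancy the input contains.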

While compression at the storage system level is essentially lossless, data losses in other areas, such as image, video and audio transfers, are deliberately accepted to reduce file size.

Both encoding and decoding a file require computational effort, which primarily depends on the compression method used. While some techniques aim for the most compact representation of the original data, others focus on reducing the required computation time. The choice of compression method therefore always depends on the requirements of the project or task at hand.
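This trade-off is visible even within a single algorithm: zlib's `level` parameter ranges from 1 (fastest) to 9 (most compact). The snippet below, a rough illustration on a repetitive sample rather than a benchmark, compares the resulting sizes:

```python
import zlib

payload = b"2024-01-01 INFO user login ok\n" * 2000   # repetitive sample data

for level in (1, 6, 9):
    packed = zlib.compress(payload, level)
    assert zlib.decompress(packed) == payload   # all levels stay lossless
    print(f"level {level}: {len(packed):5d} bytes "
          f"({len(packed) / len(payload):.2%} of original)")
```

Higher levels spend more CPU time searching for matches and typically produce smaller output; which level is appropriate depends on whether the bottleneck is storage, bandwidth, or compute.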

Which data reduction method is better?

To implement backup procedures or optimize storage in standard file systems, companies generally rely on deduplication. This is mainly because deduplication systems are extremely efficient when identical files need to be stored.

Data compression methods, on the other hand, generally involve higher computing costs and therefore require more complex platforms. Storage systems that combine both data reduction methods can be used most effectively: first, redundancies are removed from the files to be stored using deduplication, and then the remaining data is compressed.
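The combined approach can be sketched in a few lines (the in-memory dictionary again stands in for a real chunk store): deduplicate first, then compress only the unique chunks that actually get written.

```python
import hashlib
import zlib

CHUNK = 4096  # fixed 4 KB chunks, as in the deduplication example above

def store_file(data: bytes, store: dict[str, bytes]) -> list[str]:
    """Deduplicate into fixed-size chunks, then compress each unique chunk."""
    recipe = []
    for i in range(0, len(data), CHUNK):
        chunk = data[i:i + CHUNK]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:
            store[digest] = zlib.compress(chunk)   # compress new chunks only
        recipe.append(digest)
    return recipe

def load_file(recipe: list[str], store: dict[str, bytes]) -> bytes:
    """Reverse both steps: decompress each referenced chunk and reassemble."""
    return b"".join(zlib.decompress(store[d]) for d in recipe)

store: dict[str, bytes] = {}
pattern = (b"backup record\n" * 300)[:4096]   # one 4 KB chunk of text
data = pattern * 3                            # three identical chunks
recipe = store_file(data, store)
physical = sum(len(v) for v in store.values())
print(f"logical {len(data)} bytes -> {physical} bytes physically stored")
```

Deduplication collapses the three identical chunks to one, and compression then shrinks that single chunk further, which is why the combined pipeline outperforms either method alone.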
