
Deduplication: the pros and cons end users are not always told

Deduplication is one of the hottest technologies in the current market because of its ability to reduce costs. But it comes in many flavours and organisations need to understand each one of them if they are to choose the one that is best for them. By Adrian Moir, technical director EMEA, BakBone Software.


Date: 1 Oct 2010

Deduplication is the process of examining a data set or byte stream and storing and/or sending only unique data; duplicate data is replaced with a pointer to the first occurrence of that data. Some IT professionals think that deduplication and Single Instance Store (SIS) are the same thing, but they are not. The key difference is that SIS evaluates the data stream at the file level, so if a user renames a file, SIS will treat it as new and store it again, whereas deduplication examines the contents below the file level and will recognise the renamed file's contents as duplicate. As a result, SIS delivers smaller space savings.
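The store-unique-blocks-and-point-back mechanism described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's engine: the block size, the use of SHA-256 digests as pointers, and the in-memory dictionary standing in for the block store are all assumptions made for the example.

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative fixed block length


def deduplicate(data: bytes, store: dict) -> list:
    """Split a byte stream into blocks; store each unique block once,
    keyed by its hash, and return a list of pointers (hashes).
    Duplicate blocks contribute only a pointer, not stored data."""
    pointers = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in store:      # first occurrence: keep the block
            store[digest] = block
        pointers.append(digest)      # every occurrence: keep a pointer
    return pointers


def rehydrate(pointers: list, store: dict) -> bytes:
    """Reassemble the original stream by following the pointers."""
    return b"".join(store[p] for p in pointers)
```

A stream containing the same 4 KB block twice ends up occupying one block in the store plus two pointers, which is where the space saving comes from.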

All deduplication journeys end with a significantly reduced amount of data on disk, but the ways they get there can differ greatly. The two prevailing methods are fixed block length and variable block length; with the latter, the deduplication engine can change the block size and recognise more duplicate patterns, thereby decreasing the amount of data stored and increasing the space savings. Inline and post-process deduplication also offer different advantages and trade-offs. With inline deduplication, data is deduplicated before being stored on disk; this approach does not require any additional disk space to hold the data prior to deduplication, but it has the following trade-offs:

  • It lengthens the time to complete the backup, leading to longer backup windows and degraded performance during business hours as well as the inability to start the next backup because the previous backup job is still running;
  • It does not allow the flexibility to leave data that does not deduplicate well, non-deduplicated;
  • It often forces users to ‘rehydrate’ the whole backup to recover a single file, making restores slower.
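The advantage of variable block length comes from cutting block boundaries based on the content itself, so an insertion early in a stream shifts boundaries only locally instead of misaligning every subsequent fixed-size block. A toy content-defined chunker is sketched below; the rolling-hash formula, window behaviour, and size parameters are illustrative assumptions, deliberately tuned small for the demo rather than matching any shipping product.

```python
MASK = 0x3F                 # cut when the low 6 hash bits are zero
MIN_CHUNK, MAX_CHUNK = 32, 256   # demo-scale bounds on chunk size


def chunk_boundaries(data: bytes):
    """Content-defined chunking: accumulate a simple shifting hash and
    cut a chunk boundary wherever the hash matches a bit pattern.
    Because older bytes shift out of the 32-bit hash, boundaries depend
    only on recent content and re-synchronise after an insertion."""
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        rolling = ((rolling << 1) + byte) & 0xFFFFFFFF
        length = i - start + 1
        if (length >= MIN_CHUNK and (rolling & MASK) == 0) or length >= MAX_CHUNK:
            yield data[start:i + 1]
            start = i + 1
            rolling = 0
    if start < len(data):               # trailing partial chunk
        yield data[start:]
```

A fixed-block engine chunking the same stream would see every block after the insertion point as new; the content-defined cuts are what let a variable-block engine keep recognising the unchanged data.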

With post-process deduplication, the backup is briefly placed on disk-based staging storage prior to being deduplicated; some technologies allow deduplication to start after a set amount of the data stream has been staged, reducing the sizing requirements for the staging storage while allowing the backups to complete as fast as possible.

So although post-process deduplication requires additional disk space for the staging area, it enables faster backups (shrinking the backup window), it allows data that does not deduplicate well to be left non-deduplicated, and it offers faster restores.

Where the deduplication occurs is just as important as when: on the source/client or on the target/storage. Source-side deduplication typically uses a client-located deduplication engine that checks for duplicates against a centrally located deduplication index, typically held on the backup server or media server; only unique blocks are transmitted to the disk.

The advantage of source-side deduplication is that it reduces network contention because less data is sent over it.

However, by running source-side deduplication users are adding hashing, a processor-intensive operation, to the client. Clients that are already overloaded will become even more stretched, possibly slowing down the backups and lengthening the backup window.
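The client-queries-the-central-index flow can be sketched like this. The `DedupIndex` class and the query/put protocol are invented for illustration (a real product would expose a network protocol, not a Python object), but the division of labour matches the description: the client does the hashing, the index answers "seen before?", and only unique blocks cross the wire.

```python
import hashlib


class DedupIndex:
    """Stand-in for the central index on the backup/media server,
    mapping block hashes to stored blocks."""
    def __init__(self):
        self.blocks = {}

    def has(self, digest: str) -> bool:
        return digest in self.blocks

    def put(self, digest: str, block: bytes):
        self.blocks[digest] = block


def source_side_backup(data: bytes, index: DedupIndex, block_size=4096):
    """Client-side loop: hash each block locally (the CPU cost the
    article warns about), ask the index whether it exists, and send
    only unique blocks. Returns (recipe, bytes actually transmitted)."""
    sent = 0
    recipe = []
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        digest = hashlib.sha256(block).hexdigest()
        if not index.has(digest):
            index.put(digest, block)   # only unique data crosses the wire
            sent += len(block)
        recipe.append(digest)
    return recipe, sent
```

A repeat backup of unchanged data transmits zero bytes, which is exactly the network saving source-side deduplication is chosen for.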

Target-side deduplication is generally better suited for data-intensive environments and runs the deduplication at the storage level, removing the need to have clients with enough horsepower because the hashing occurs at the target. The trade-off is that more data is going to be sent over the network.

Different vendors offer different solutions that mix and match the when and where: for example, one solution could do inline deduplication starting at the source, while others may do post-process starting at the target.

A final criterion to review when evaluating deduplication technologies is how long to retain data; the more data that is examined, the greater the likelihood that duplicates are found and hence the greater the space savings. For example, an initial full backup will only be deduplicated against itself, but when the full backup for week 2 is performed, only the unique data that has been updated or added since week 1 will be stored. Each additional week of backup can therefore be retained using a decreasing amount of additional disk space, allowing organisations to store more backups on the existing storage for a longer period and virtually eliminating the need to restore from offsite storage unless there is a complete site failure.
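The retention arithmetic can be made concrete with a toy model. The function and its uniform weekly-change-rate assumption are hypothetical, invented purely to illustrate the shape of the saving; real change rates vary week to week.

```python
def dedup_storage(full_size_gb: float, weekly_change: float, weeks: int):
    """Illustrative retention model: week 1 stores the full data set;
    each later weekly full adds only the changed fraction. Returns
    (disk needed with dedup, disk needed without) in GB, assuming a
    uniform weekly change rate."""
    without_dedup = full_size_gb * weeks
    with_dedup = full_size_gb + full_size_gb * weekly_change * (weeks - 1)
    return with_dedup, without_dedup
```

Under these assumptions, retaining eight weekly fulls of a 1 TB data set that changes 5% per week needs 1,350 GB deduplicated versus 8,000 GB without, which is why each extra week of retention costs progressively little.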

So, in summary, what should users consider when planning a deduplication strategy? Their goal(s) will influence the deduplication technologies they should evaluate.

Following are some typical deduplication goals and considerations:

  • Maximum disk space savings:
      • Deduplication offers more disk space savings than SIS;
      • Variable-block deduplication saves more space than fixed-block;
      • Inline deduplication reduces disk space requirements;
      • Source-side deduplication can increase the disk space savings;
      • Retaining deduplicated data longer allows users to store even more backups on the same amount of disk storage for a longer period.
  • Maximum flexibility:
      • Post-process deduplication offers the ability to leave data that does not deduplicate well in a non-deduplicated state, ensuring that valuable time and processing power are not wasted on data that will not benefit from deduplication;
      • With post-process deduplication, restores are faster;
      • Post-process deduplication allows users to provision data on existing storage, which can be up to 1/10 the cost of appliance storage.
  • Shorter backup windows:
      • Post-process deduplication can be scheduled to occur outside the backup window;
      • Target-side deduplication does not unnecessarily elongate backup windows.

Deduplication can lead to significant savings in terms of time, human resources and of course budget.

Although the technology continues to develop, there are several proven solutions on the market today, and organisations that choose the right products for their requirements will find that few storage technologies have made as great a difference to their datacentres.

