Data Deduplication Approaches: Concepts, Strategies, and Challenges
Ebook · 849 pages · 8 hours


About this ebook

In the age of data science, the rapidly increasing amount of data is a major concern in numerous applications of computing operations and data storage. Duplicated data or redundant data is a main challenge in the field of data science research. Data Deduplication Approaches: Concepts, Strategies, and Challenges shows readers the various methods that can be used to eliminate multiple copies of the same files as well as duplicated segments or chunks of data within the associated files. Due to ever-increasing data duplication, its deduplication has become an especially useful field of research for storage environments, in particular persistent data storage. Data Deduplication Approaches provides readers with an overview of the concepts and background of data deduplication approaches, then proceeds to demonstrate in technical detail the strategies and challenges of real-time implementations of handling big data, data science, data backup, and recovery. The book also includes future research directions, case studies, and real-world applications of data deduplication, focusing on reduced storage, backup, recovery, and reliability.
  • Includes data deduplication methods for a wide variety of applications
  • Includes concepts and implementation strategies that will help the reader to use the suggested methods
  • Provides a robust set of methods that will help readers to appropriately and judiciously use the suitable methods for their applications
  • Focuses on reduced storage, backup, recovery, and reliability, which are the most important aspects of implementing data deduplication approaches
  • Includes case studies
Language: English
Release date: Nov 25, 2020
ISBN: 9780128236338


    Book preview

    Data Deduplication Approaches - Tin Thein Thwel


    1

    Introduction to data deduplication approaches

    G.R. Sinha¹, Tin Thein Thwel¹, Samrudhi Mohdiwale² and Divya Prakash Shrivastava³, ¹Myanmar Institute of Information Technology (MIIT), Mandalay, Myanmar, ²National Institute of Technology Raipur, Raipur, India, ³Higher Colleges of Technology, Abu Dhabi, United Arab Emirates

    Abstract

    In the field of computing, data deduplication methods are used to reduce duplicate copies and repeated data. Deduplication improves data storage and its utilization, in addition to enabling better network transfer: because the storage requirement is reduced, data transfer also becomes fast and efficient. Unique data chunks are identified and, using a suitable approach, duplicate data is eliminated by replacing the redundant chunks. This chapter presents an overview of data deduplication methods, their types, file chunking, metadata, and the applications of the deduplication process.

    Keywords

    Data deduplication; file chunking; metadata; repeated data; redundancy

    1.1 Introduction

    With the advent of information and communication technologies (ICT), the widespread use of the internet, and search engines, data storage has become an important issue in the area of computing. Over the internet, the files in almost all communications are shared or transmitted in the form of digital data that requires storage. The problem begins when repetitive data is kept in the storage space. Two data files that are identical in nature but differ by a few bits or bytes, while conveying the same message or information, create duplicate data. This is referred to as data redundancy or duplicity, and the methods used to handle data duplication issues are called data deduplication (Al-Kiswany, Subhraveti, Sarkar, & Ripeanu, 2011; Chun, Kim, Lee, & Kim, 2019; Deduplication & Cs, 2017; Dong & Wang, 2014; Lin, Chang, Kuo, Chang, & Li, 2015; Meister et al., 2012; Meyer & Bolosky, 2012; Microsoft, 2016; Ng, Wen, & Zhu, 2012; Romański, Heldt, Kilian, Lichota, & Dubnicki, 2011; Storer, Greenan, Long, & Miller, 2008), also referred to as intelligent compression techniques or single instance (Microsoft, 2016; Rabotka & Mannan, 2016) storage systems. The method of reducing data duplication is referred to as single instance storage because only one instance among similar instances of repeated data is kept in the storage system. The method also acts as a smart system that compresses unnecessary, repeating data by keeping only one copy, and is thus called an intelligent system that uses artificial intelligence (AI) and machine learning methods for handling duplicate data. Data deduplication can be applied in numerous fields of computing research and other applications, such as computer vision and image processing based applications (Ahmad, Sinha, & Kashyap, 2014; Barde, Zadgaonkar, & Sinha, 2014; Bhonsle, Chandra, & Sinha, 2012; Misal, 2012; Siddhartha, SG, Charan, Abha, & Kavita, 2011; Sinha, 2015; Sinha & Sharma, 2020; Tiwari, Kumar, Kumar, & Sinha, 2015), natural language processing, face recognition, pattern matching, database systems, cloud computing, big data applications, parallel computing (Al-Kiswany et al., 2011; Bhattacherjee, Narang, & Garg, 2011; Fu, Jiang, Xiao, Tian, & Liu, 2011; Meister & Brinkmann, 2009; Sun, Shen, & Yong, 2013), etc. Fig. 1–1 shows the basic steps of a data deduplication approach: the data values are initially repeated and, using a suitable algorithm such as a hash function, each value is assigned a particular function value. The next step compares all values against this standard value or function, and identical values are ignored. The outcome of the last step shows that the values that were initially repeated have been ignored and replaced by a single data value. Fig. 1–1 depicts a typical way in which the steps of data deduplication work in any application of data science (Budiarti, Sukaridhoto, Hariadi, & Purnomo, 2019; Gracia, González, Robles, & Menasalvas, 2014; Kaur et al., 2020; Sorzano, Vargas, & Montano, 2014) or database management systems (DBMS).

    Figure 1–1 Data deduplication basic approach.
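    As a concrete illustration of these steps, the following minimal Python sketch (the function and variable names are illustrative, not taken from the book) assigns each incoming value a hash, compares it against the values already seen, and keeps only the first occurrence:

```python
import hashlib

def deduplicate(values):
    """Keep only the first occurrence of each value, as in the steps of Fig. 1-1."""
    seen_hashes = set()   # function values already assigned
    unique_values = []
    for value in values:
        # Step 1: assign a function value (here, a SHA-1 digest) to the value.
        digest = hashlib.sha1(str(value).encode("utf-8")).hexdigest()
        # Step 2: compare against previously assigned values.
        if digest not in seen_hashes:
            # Step 3: only unseen values are kept; repeats are ignored.
            seen_hashes.add(digest)
            unique_values.append(value)
    return unique_values

print(deduplicate([3, 7, 3, 9, 7, 7, 1]))   # -> [3, 7, 9, 1]
```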

    1.2 Methods of data deduplication

    For a variety of data science applications, different types of methods and algorithms are used in data deduplication approaches and strategies, but two main types are most popularly used in a large number of applications, namely the hash-based and content-aware methods. Based on when deduplication is performed relative to data storage and backup, the methods are also referred to as inline and postprocessing methods.

    1.2.1 Hash-based and content-aware methods

    In the hash-based method, a hashing (El-Shimi et al., 2019; Feldman, Bhat, Laborde, Yi, & Dechev, 2013; Ma et al., 2017; Oh et al., 2018; Puzio, Molva, Önen, & Loureiro, 2016) function or algorithm is popularly used, which is further employed in the identification of chunks (Lkhagvasuren, So, Lee, Yoo, & Ko, 2013; Meister & Brinkmann, 2009; Meister, Kaiser, & Brinkmann, 2013; Microsoft, 2016; Xu, 2016) of data. The chunks are fragments of small groups of data that describe small similar groups and assist in ignoring the repeated data. Fig. 1–2 shows a typical diagram of the hashing method of data deduplication. The different slices or chunks of data are assigned different values of the hash function, and after comparing a hash value with those of all other slices, the updated hash values result. This process is repeated until the assignment of values converges to a no-change state. A hash value generally requires a certain number of bits, and whenever subsequent chunks of data are found to have the same hash value, those chunks are treated as duplicate data and are not stored in the process of data deduplication. If the hash value is new and not found among the already existing values, then the hash value is stored and the corresponding data chunk is considered and stored in the database. In Fig. 1–2, we can see that the new slice K is updated in the next level with a new hash value. The new hash value is used as a new index value and acts as an important indexing factor in the further comparison of subsequent data slices or chunks.

    Figure 1–2 Assignment of hash values.
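    The same idea applies at the level of chunks. The sketch below (an illustrative example whose names are chosen here rather than taken from the text) keeps an index from hash values to stored chunks; a chunk whose hash value already appears in the index is treated as a duplicate and is never stored again:

```python
import hashlib

class HashIndexedStore:
    """Minimal hash-based deduplicating chunk store (illustrative sketch)."""

    def __init__(self):
        self.index = {}   # hash value -> stored chunk (acts as the chunk index)

    def put(self, chunk: bytes) -> str:
        digest = hashlib.sha1(chunk).hexdigest()
        if digest in self.index:
            # Same hash value already present: treat the chunk as a duplicate.
            return digest
        # New hash value: store it together with the corresponding data chunk.
        self.index[digest] = chunk
        return digest

store = HashIndexedStore()
ids = [store.put(c) for c in [b"alpha", b"beta", b"alpha", b"gamma"]]
print(len(store.index))   # 3 unique chunks stored; the repeated "alpha" is ignored
```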

    In the content-aware method, data patterns are investigated, and when the same patterns are reported, those patterns are treated as duplicate data. In this approach a file-name mechanism is followed, which means that a suitable file name is assigned to different data patterns, and upcoming patterns that differ are assigned new file names. In this method, whenever a match of file names is detected, a bitwise comparison is performed and similarity is assessed on that basis; in other words, a matching file name alone is not the criterion, the bitwise matching result also matters in the process of finding duplicate data. Knowledge of databases becomes essential in this approach because the number of files may run into millions, and therefore some form of memory indexing is also needed to manage the files efficiently and keep them updated so that duplicate data can be ignored. This scheme of deduplication can be seen in Fig. 1–3, where the steps are self-explanatory.

    Figure 1–3 File name concept of deduplication.
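    A minimal sketch of the content-aware scheme described above might look as follows; the class and method names are hypothetical, and the byte-for-byte equality test stands in for the bitwise comparison step:

```python
import hashlib

class NameIndexedStore:
    """Content-aware sketch: match by file name, confirm with a bitwise comparison."""

    def __init__(self):
        self.files = {}   # file name -> stored content

    def put(self, name: str, content: bytes) -> bool:
        """Return True if the file was stored, False if it was a confirmed duplicate."""
        existing = self.files.get(name)
        if existing is not None and existing == content:
            # Name matched and the byte-for-byte comparison confirms a duplicate.
            return False
        # New name, or same name with different content: keep it under a new key.
        key = name if existing is None else f"{name}#{hashlib.sha1(content).hexdigest()[:8]}"
        self.files[key] = content
        return True

store = NameIndexedStore()
print(store.put("report.pdf", b"..."))   # True: stored
print(store.put("report.pdf", b"..."))   # False: duplicate detected by the bitwise check
```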

    These methods are distinguished by the time at which deduplication is performed. If the operation is performed while the data is being stored, the approach is called the inline method; the postprocessing method is performed just after the data being processed has been stored. The inline method does not depend on whether the underlying technique is hash-based or content-aware, whereas the postprocessing method and its performance do depend on the type of method being used for data deduplication.

    1.3 Classic research and classification of methods

    When a large amount of data is stored for a long period of time and some of the data are duplicates, the system has difficulty handling the large volume of data, and thus data deduplication mechanisms or search algorithms are employed. There are some classic research approaches attempted by a number of authors in the current literature. We will discuss a summary of their research impact and then the classification of methods.

    The most popular research on this approach is based on using chunks as variable-length data bundles. In this method, the data is segmented into small data streams called chunks; only unique chunks are retained in the system, and repeated data is avoided with the help of some indexing mechanism. Generally, this approach is referred to as the file chunking method of data deduplication (El-Shimi et al., 2019; Meister & Brinkmann, 2009), where the inputs are files and the outputs are small chunks or blocks of data. The size of a chunk may be fixed or variable, and a small chunk size makes deduplication more efficient due to the better compression achieved with small chunks. In this process the cost and complexity of deduplication increase due to increased indexing and metadata (El-Shimi et al., 2019; Meyer & Bolosky, 2012; Oh et al., 2018; Sun, Zhang, Lou, & Hou, 2018; Tarasov et al., 2014). The overhead also increases for smaller chunk sizes; this can be overcome by choosing larger chunks, but larger chunks do not reduce the duplicate data as efficiently. According to the length of the chunk, deduplication methods are divided into fixed-length (Lin et al., 2015) and variable-length chunking (Noorafiza, n.d.). Variable-length chunking offers more flexibility, and the overhead issues can be managed whenever needed in the data streams. In the variable-length category, the MUltithreaded content-based file CHunking (MUCH) method (Ma et al., 2017) utilizes the multicore architecture of the processor and thus operates in the same manner a multicore processor does; in other words, the operation is performed in a parallel manner, with more efficient data deduplication than the single-threaded chunking approach. Other deduplication methods used in the literature include local maximum chunking, semantic chunking, content-defined chunking, the B+ tree method (Chen et al., 2018), etc.
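    For reference, fixed-length chunking (the simpler of the two categories) can be sketched in a few lines; the 4096-byte chunk size below is an arbitrary choice for illustration:

```python
def fixed_length_chunks(data: bytes, chunk_size: int = 4096):
    """Split data into fixed-length chunks (the last one may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

chunks = fixed_length_chunks(b"A" * 10000, chunk_size=4096)
print([len(c) for c in chunks])   # [4096, 4096, 1808]
```

    A smaller chunk size exposes more duplicate blocks but produces more index entries and metadata, which is the overhead trade-off described above.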

    1.3.1 Classification of methods

    Data deduplication is sometimes referred to as data compression (Feldman et al., 2013; Ontap, 2015; Tian, Khan, Jiménez, & Loh, 2014; Xia et al., 2016; Xu, 2016), since it reduces the data storage requirement by eliminating duplicate or redundant data. Because deduplication operates using a suitable indexing function, it is called an intelligent or smart compression method. Compression itself is implemented as lossy or lossless for some types of data, especially images. The main benefits of data deduplication methods include potential resource savings and a reduction in the potential cost of backup and energy. Data deduplication methods are broadly classified based on the following three important factors:

    • Point of application

    • Time of application

    • Granularity

    Point of application refers to the place where the processing is performed. On this basis, the methods are divided into source-based and target-based data deduplication methods. The source-based method (Fu et al., 2011; Oh et al., 2018; Zhu et al., 2018) is also known as the client-based approach to data deduplication, in which a backup agent is introduced that ensures the removal of redundant data and the forwarding of only unique data for subsequent processing and operation. Utilization of bandwidth as well as storage is improved; however, complexity increases due to the use of the backup agent in the process. Fig. 1–4 shows a typical flow of the source-based data deduplication approach. Here, the duplicate-aware agent is actually the backup agent that decides whether data is redundant or unique. Only the unique data is submitted to the server, and the deduplicated data remains available at the source.

    Figure 1–4 Source-based method.
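    One plausible way to realize the source-based flow is for the backup agent to exchange only fingerprints with the server and transmit just the chunks the server does not yet hold. The sketch below makes that assumption explicit; the `Server` and `BackupAgent` classes are illustrative names, not an API from the book:

```python
import hashlib

class Server:
    """Target storage: keeps unique chunks keyed by fingerprint."""
    def __init__(self):
        self.chunks = {}

    def missing(self, fingerprints):
        return [fp for fp in fingerprints if fp not in self.chunks]

    def store(self, fp, chunk):
        self.chunks[fp] = chunk

class BackupAgent:
    """Duplicate-aware agent on the client: forwards only unique chunks."""
    def __init__(self, server):
        self.server = server

    def backup(self, chunks):
        fps = [hashlib.sha1(c).hexdigest() for c in chunks]
        needed = set(self.server.missing(fps))   # ask which chunks are unknown
        for fp, chunk in zip(fps, chunks):
            if fp in needed:
                self.server.store(fp, chunk)      # only unique data crosses the network
                needed.discard(fp)
        return fps                                # recipe kept for a later restore

server = Server()
agent = BackupAgent(server)
agent.backup([b"a-block", b"b-block", b"a-block"])
print(len(server.chunks))   # 2: the duplicate block was never sent
```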

    In the target-based method (Al-Kiswany et al., 2011; Hossain, Atrey, & El Saddik, 2011; Pang, Jia, & Feng, 2014), also known as target-side data deduplication, the client is not aware of deduplication, and a deduplication engine is embedded within the hardware array, such as network-attached storage (NAS) or a storage area network (SAN). Fig. 1–5 shows this mechanism of data deduplication.

    Figure 1–5 Target-based method.

    Based on the time of application, data deduplication methods are divided into inline and postprocessing methods. Fig. 1–6 shows the inline (Aronovich et al., 2009) data deduplication method. This method is applied on the arrival of data, before it hits the disk storage. It is very simple and fast compared to the postprocessing (Oh et al., 2018; Wang, Xiao, & Lee, 2015) method because there is no need for backup and recovery. The appropriate use of a suitable function, such as a hash function, is required in this method to indicate the data that is actually to be stored on disk.

    Figure 1–6 Inline deduplication method.

    The main advantages of the inline method are that fewer I/O operations need to be performed, since deduplication is applied before storage and the redundancy is removed, and that it is simple and fast. Fig. 1–7 depicts the postprocessing data deduplication method. It can be seen that the deduplication engine or algorithm operates after storage, which means the deduplication is of the postprocessing type. While the data is being stored, the uniqueness of the data is identified, and immediately after storage the duplicates are removed.

    Figure 1–7 Postprocessing deduplication method.
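    The difference between the two timings can be made concrete with a small sketch (illustrative names and logic, assuming a simple fingerprint index): the inline path checks each chunk before it is written, while the postprocessing path scans data that has already been written and removes the duplicates afterwards:

```python
import hashlib

def fingerprint(chunk: bytes) -> str:
    return hashlib.sha1(chunk).hexdigest()

def inline_write(disk: dict, chunk: bytes):
    """Inline: check before the chunk ever hits the disk."""
    fp = fingerprint(chunk)
    if fp not in disk:
        disk[fp] = chunk          # only unseen chunks are written

def postprocess(raw_disk: list) -> dict:
    """Postprocessing: everything was written first; duplicates are removed afterwards."""
    deduped = {}
    for chunk in raw_disk:
        deduped.setdefault(fingerprint(chunk), chunk)
    return deduped

# Inline path: the duplicate never reaches storage.
disk = {}
for c in [b"x", b"y", b"x"]:
    inline_write(disk, c)
print(len(disk))                              # 2

# Postprocessing path: three writes hit storage first, then duplicates are dropped.
print(len(postprocess([b"x", b"y", b"x"])))   # 2
```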

    Granularity determines whether deduplication is applied to the whole data set, to files, or to subfiles. The file-level method is also known as single instance storage (SIS) (Rabotka & Mannan, 2016). In this method, file-level data is compared with the help of an index and the attributes of the file. When files are found to be different, they are updated with the help of the indexing mechanism; similar files are ignored. In the subfile-level method, all files are divided into smaller chunks or blocks, and these chunks are compared against a subject chunk using a hash function. Similar chunks are removed, and only unique chunks are kept in the storage and referenced from the subfile.
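    The effect of granularity on storage savings can be illustrated with a short example (hypothetical helper names; 4-byte blocks are used only to keep the numbers small): two files that differ only in their last block are stored twice at file level but share most of their blocks at subfile level:

```python
import hashlib

def sha1(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()

def file_level_unique_bytes(files):
    """Single instance storage: one copy per distinct whole file."""
    return sum(len(f) for f in {sha1(f): f for f in files}.values())

def subfile_level_unique_bytes(files, chunk_size=4):
    """Subfile level: one copy per distinct fixed-length chunk across all files."""
    unique = {}
    for f in files:
        for i in range(0, len(f), chunk_size):
            chunk = f[i:i + chunk_size]
            unique[sha1(chunk)] = chunk
    return sum(len(c) for c in unique.values())

files = [b"AAAABBBBCCCC", b"AAAABBBBDDDD"]   # the files share two 4-byte blocks
print(file_level_unique_bytes(files))        # 24: the whole files differ, both are kept
print(subfile_level_unique_bytes(files))     # 16: shared blocks are stored only once
```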

    1.4 File chunking and metadata

    In the file chunking approach to data deduplication (El-Shimi et al., 2019; Meister & Brinkmann, 2009; Ni, Lin, & Jiang, 2019), the data is divided into small chunks and each chunk is compared with all the chunks available in the data. The chunks are described by summary information, also called a fingerprint, rather than being compared at the finest granularity (bit-level comparison). The fingerprints are therefore used to match a chunk against all the remaining chunks present in the data and determine whether it is redundant. The unique file or data is stored in the storage using chunks, as can be seen in Fig. 1–8, where a file is divided into a number of chunks c1, c2, c3, …, each of which is assigned summary information called a fingerprint. The comparison of fingerprints helps determine whether the file is unique or redundant. The size of the chunks may be fixed or variable, and accordingly the file chunking methods are of two types, namely fixed-length and variable-length file chunking. In fixed-length chunking, the file is divided into fixed-length chunks of a certain number of bits. This method is simple, fast, and computationally efficient.

    Figure 1–8 File chunking.

    In the variable-size chunking method, the length of the chunks differs from chunk to chunk, which makes the method adaptive and flexible while executing the comparison and indexing processes. A number of algorithms for file chunking are available in the literature and used in various applications of data deduplication (Kaczmarczyk, Barczynski, Kilian, & Dubnicki, 2012; Lkhagvasuren et al., 2013; Ni et al., 2019; Romański et al., 2011). Chunk indexing is an important term in the file chunking method; the index is used to indicate new or updated files, and its size and processing time decide the performance of chunking-based deduplication. The larger the index, the longer the processing time, and the performance of the storage system is affected accordingly.
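    As one example of the variable-length category, a content-defined chunking sketch is shown below. It is not any specific algorithm from the literature cited above; it simply cuts wherever a toy rolling sum of recent bytes satisfies a boundary condition, subject to minimum and maximum chunk lengths, so cut points follow the content rather than fixed offsets:

```python
def content_defined_chunks(data: bytes, window=16, mask=0x3F, min_len=32, max_len=512):
    """Variable-length chunking sketch: cut where a sliding-window sum of the last
    `window` bytes satisfies a boundary condition, within min/max chunk lengths."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i - window >= start:
            rolling -= data[i - window]          # maintain the sliding-window sum
        length = i - start + 1
        at_boundary = (rolling & mask) == 0      # content-defined cut condition
        if (length >= min_len and at_boundary) or length >= max_len:
            chunks.append(data[start:i + 1])
            start, rolling = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])              # remainder becomes the last chunk
    return chunks

data = bytes(range(256)) * 8
print([len(c) for c in content_defined_chunks(data)][:5])   # varying chunk lengths
```

    Because the cut points depend on the bytes themselves, inserting data near the start of a stream shifts only nearby boundaries, which is the usual motivation for variable-length chunking.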

    1.4.1 Metadata

    In the data deduplication process, two different types of data are involved, namely the actual data and the metadata (El-Shimi et al., 2019; Meyer & Bolosky, 2012; Oh et al., 2018; Sun et al., 2018; Tarasov et al., 2014). The files or documents that are actually to be stored in the system constitute the first category of data, whereas the data that corresponds to the redundancies constitutes the metadata. Metadata is used by the algorithm or software to identify redundant data present in the files and also to rehydrate data when restoration is needed, that is, when useful data must be recovered from among the deduplicated data. Metadata is also known as data about data, or chunk metadata, and it is critical in determining the performance of the storage system. Metadata uses a data structure called inodes, which indicate the blocks for updating the data and store the file-specific information about the blocks associated with particular files. The inodes are chosen in such a manner that there is no loss of data; otherwise, system performance with regard to potentially useful data will be adversely affected. Therefore, the most appropriate hash function or an indexing structure such as a B+ tree is chosen so that the data structure does not cause any data loss in the system.
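    The separation between actual data and metadata can be sketched as a chunk store plus a per-file recipe that plays the role of the inode-like record described above (all names here are illustrative); the recipe is what allows a deduplicated file to be rehydrated:

```python
import hashlib

class DedupStore:
    """Sketch of the two kinds of data: unique chunks (actual data) and
    per-file chunk recipes (metadata used to rehydrate files)."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = {}      # actual data: fingerprint -> chunk bytes
        self.manifests = {}   # metadata: file name -> ordered list of fingerprints

    def store(self, name: str, data: bytes):
        recipe = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            fp = hashlib.sha1(chunk).hexdigest()
            self.chunks.setdefault(fp, chunk)    # duplicates map to the same entry
            recipe.append(fp)
        self.manifests[name] = recipe            # inode-like record of the file's blocks

    def rehydrate(self, name: str) -> bytes:
        """Rebuild the original file from its metadata recipe."""
        return b"".join(self.chunks[fp] for fp in self.manifests[name])

store = DedupStore(chunk_size=4)
store.store("a.txt", b"AAAABBBBAAAA")
assert store.rehydrate("a.txt") == b"AAAABBBBAAAA"
print(len(store.chunks))   # 2 unique chunks backing 3 blocks
```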

    1.5 Implementation strategies

    In a typical data deduplication architecture, we generally have two important stages or phases: the deduplication phase and the reconstruction phase. The deduplication phase includes the following components:

    • File chunker

    • Chunk ID generator

    • Duplicate finder

    • Metadata

    • Storage

    The reconstruction phase has the following components:

    • Reconstructor

    • Metadata

    • Storage

    As we can see in Fig. 1–9, the input files are chunked by the file chunker, and each chunk is then assigned a unique identity (ID) with the help of the ID generator. The generation of IDs employs a suitable algorithm so that IDs can be allotted in a systematic manner for an efficient deduplication process. The indexing mechanism and ID generation help in finding duplicates; when duplicates are reported, the redundant data is associated with the metadata, otherwise the data is stored in the storage. In the reconstruction phase, the data is thoroughly checked and its ID is compared using a suitable indexing algorithm; when potential data is detected, it is associated with the storage, else the redundant data is resolved and output by the system.
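    A compact sketch of how these components might be wired together is given below; the class names simply mirror the component list above and do not correspond to any concrete implementation in the book:

```python
import hashlib

class FileChunker:
    def __init__(self, size=4096):
        self.size = size
    def chunk(self, data: bytes):
        return [data[i:i + self.size] for i in range(0, len(data), self.size)]

class ChunkIDGenerator:
    def identify(self, chunk: bytes) -> str:
        return hashlib.sha1(chunk).hexdigest()   # systematic, content-derived ID

class DuplicateFinder:
    def __init__(self):
        self.seen = set()
    def is_duplicate(self, chunk_id: str) -> bool:
        if chunk_id in self.seen:
            return True
        self.seen.add(chunk_id)
        return False

class DedupPipeline:
    """Wires together the components of the deduplication and reconstruction phases."""
    def __init__(self):
        self.chunker, self.ids, self.finder = FileChunker(), ChunkIDGenerator(), DuplicateFinder()
        self.storage = {}     # chunk ID -> chunk
        self.metadata = {}    # file name -> ordered chunk IDs

    def deduplicate(self, name: str, data: bytes):
        recipe = []
        for chunk in self.chunker.chunk(data):
            cid = self.ids.identify(chunk)
            if not self.finder.is_duplicate(cid):
                self.storage[cid] = chunk         # unique chunks go to storage
            recipe.append(cid)                    # duplicates only touch the metadata
        self.metadata[name] = recipe

    def reconstruct(self, name: str) -> bytes:
        return b"".join(self.storage[cid] for cid in self.metadata[name])

pipeline = DedupPipeline()
pipeline.deduplicate("a.bin", b"AAAABBBB" * 1024)
assert pipeline.reconstruct("a.bin") == b"AAAABBBB" * 1024
```

    Calling `deduplicate()` on several files populates the storage with unique chunks only, and `reconstruct()` rebuilds any file from the chunk IDs recorded in its metadata.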

    The file formats that are subjected to the deduplication system may be the portable document format (PDF), Microsoft document formats such as Word and Excel documents, Hypertext Markup Language (HTML) files, etc. These files are converted into chunks using the file chunker and subjected to further operations, as shown in Fig. 1–9. Generally, the following steps are used:

    • Input the file

    • Convert it into chunks using a suitable file chunking algorithm

    • Check whether a duplicate exists with the help of indexing and the chunk ID

    • If the chunk already exists, associate it with the metadata

    • Else, associate it with the actual data

    Figure 1–9 A typical deduplication architecture.

    For file chunking, the Two Thresholds Two Divisors (TTTD) chunking method is commonly used; it produces variable-length chunks in a flexible manner so that deduplication is implemented adaptively. The chunking process is shown in Fig. 1–10, in which two thresholds are selected along with a main divisor and a backup divisor. Based on the threshold values, the chunks are divided by employing appropriate breakpoints, and the process continues until all the files have been compared and chunked. A hashing function is used to identify unique chunks in the system, and a typical chunk generation process is shown in Fig. 1–11, in which the secure hash algorithm (SHA-1) (Ng et al., 2012; Puzio et al., 2016; Xia et al., 2016) of the National Institute of Standards and Technology (NIST) is used. This algorithm generates the ID for a new chunk, and the resulting chunk is updated in the final list of files.

    Figure 1–10 Chunking process.

    Figure 1–11 Chunk generation.
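    The following is a deliberately simplified sketch of the TTTD idea only, not the published algorithm: the two thresholds bound the chunk size, the main divisor defines the preferred breakpoints, and the backup divisor supplies a fallback breakpoint when the maximum threshold is reached. The rolling fingerprint and all parameter values are illustrative choices:

```python
def tttd_chunks(data: bytes, t_min=64, t_max=512, main_div=97, backup_div=47):
    """Simplified Two Thresholds Two Divisors (TTTD) chunking sketch.
    t_min/t_max are the two thresholds; main_div/backup_div are the two divisors."""
    chunks, start, rolling, backup_cut = [], 0, 0, -1
    for i, byte in enumerate(data):
        rolling = (rolling * 31 + byte) & 0xFFFFFFFF   # toy rolling fingerprint
        length = i - start + 1
        if length < t_min:
            continue                                   # below the minimum threshold: no cut yet
        if rolling % backup_div == backup_div - 1:
            backup_cut = i                             # remember a fallback breakpoint
        if rolling % main_div == main_div - 1:
            chunks.append(data[start:i + 1])           # main divisor found a breakpoint
            start, rolling, backup_cut = i + 1, 0, -1
        elif length >= t_max:
            cut = backup_cut if backup_cut >= start else i
            chunks.append(data[start:cut + 1])         # fall back to the backup or hard cut
            start, rolling, backup_cut = cut + 1, 0, -1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

lengths = [len(c) for c in tttd_chunks(bytes(range(256)) * 16)]
print(min(lengths), max(lengths))   # sizes bounded by the thresholds (final remainder may be shorter)
```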

    1.6 Performance evaluation and concluding remarks

    The performance of data deduplication approaches is evaluated using a few important metrics, such as the deduplication ratio and the space reduction ratio. The deduplication ratio is defined as the ratio of the number of bytes given as input to the number of bytes produced after deduplication. This can be written as follows.
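    In symbols, with $B_{\text{in}}$ denoting the number of bytes presented to the deduplication system and $B_{\text{out}}$ the number of bytes remaining after deduplication (notation introduced here for clarity):

$$\text{Deduplication ratio} = \frac{B_{\text{in}}}{B_{\text{out}}}$$

    A ratio of 10:1, for example, means that ten bytes of input data are represented by a single stored byte.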
