Data Deduplication Approaches: Concepts, Strategies, and Challenges
Ebook · 849 pages · 8 hours


About this ebook

In the age of data science, the rapidly increasing amount of data is a major concern in numerous applications of computing operations and data storage. Duplicated data or redundant data is a main challenge in the field of data science research. Data Deduplication Approaches: Concepts, Strategies, and Challenges shows readers the various methods that can be used to eliminate multiple copies of the same files as well as duplicated segments or chunks of data within the associated files. Due to ever-increasing data duplication, its deduplication has become an especially useful field of research for storage environments, in particular persistent data storage. Data Deduplication Approaches provides readers with an overview of the concepts and background of data deduplication approaches, then proceeds to demonstrate in technical detail the strategies and challenges of real-time implementations of handling big data, data science, data backup, and recovery. The book also includes future research directions, case studies, and real-world applications of data deduplication, focusing on reduced storage, backup, recovery, and reliability.
  • Includes data deduplication methods for a wide variety of applications
  • Includes concepts and implementation strategies that will help the reader to use the suggested methods
  • Provides a robust set of methods that will help readers to appropriately and judiciously use the suitable methods for their applications
  • Focuses on reduced storage, backup, recovery, and reliability, which are the most important aspects of implementing data deduplication approaches
  • Includes case studies
Language: English
Release date: Nov 25, 2020
ISBN: 9780128236338


    Book preview

    Data Deduplication Approaches - Tin Thein Thwel


    1

    Introduction to data deduplication approaches

    G.R. Sinha¹, Tin Thein Thwel¹, Samrudhi Mohdiwale² and Divya Prakash Shrivastava³, ¹Myanmar Institute of Information Technology (MIIT), Mandalay, Myanmar, ²National Institute of Technology Raipur, Raipur, India, ³Higher Colleges of Technology, Abu Dhabi, United Arab Emirates

    Abstract

    In the field of computing, data deduplication methods are used to reduce duplicate copies and repeated data. Deduplication improves data storage and its utilization, in addition to enabling better network transfer: because the storage requirement is reduced, data transfer also becomes fast and efficient. Unique data chunks are identified and, using a suitable approach, duplicate data is eliminated by replacing the redundant chunks. This chapter presents an overview of data deduplication methods, their types, file chunking, metadata, and the applications of the deduplication process.

    Keywords

    Data deduplication; file chunking; metadata; repeated data; redundancy

    1.1 Introduction

    With the advent of information and communication technologies (ICT), the widespread use of the internet, and search engines, data storage has become an important issue in the area of computing. Over the internet, the files in almost all communications are shared or transmitted in the form of digital data that requires storage. The problem begins when repetitive data is kept in the storage space. Two data files that are identical in nature but differ by a few bits or bytes, while conveying the same message or information, create duplicate data. This is referred to as data redundancy or duplicity, and the methods used to handle data duplication issues are called data deduplication (Al-Kiswany, Subhraveti, Sarkar, & Ripeanu, 2011; Chun, Kim, Lee, & Kim, 2019; Deduplication & Cs, 2017; Dong & Wang, 2014; Lin, Chang, Kuo, Chang, & Li, 2015; Meister et al., 2012; Meyer & Bolosky, 2012; Microsoft, 2016; Ng, Wen, & Zhu, 2012; Romański, Heldt, Kilian, Lichota, & Dubnicki, 2011; Storer, Greenan, Long, & Miller, 2008), also referred to as intelligent compression techniques or single instance (Microsoft, 2016; Rabotka & Mannan, 2016) storage systems. The method of reducing data duplication is referred to as single instance storage because only one instance among similar instances of repeated data is kept in the storage system. The method also acts as a smart system that compresses unnecessary, repeating data by keeping only one copy, and is thus called an intelligent system that uses artificial intelligence (AI) and machine learning methods for handling duplicate data. Data deduplication can be applied in numerous fields of computing research and other applications, such as computer vision and image processing based applications (Ahmad, Sinha, & Kashyap, 2014; Barde, Zadgaonkar, & Sinha, 2014; Bhonsle, Chandra, & Sinha, 2012; Misal, 2012; Siddhartha, SG, Charan, Abha, & Kavita, 2011; Sinha, 2015; Sinha & Sharma, 2020; Tiwari, Kumar, Kumar, & Sinha, 2015), natural language processing, face recognition, pattern matching, database systems, cloud computing, big data applications, parallel computing (Al-Kiswany et al., 2011; Bhattacherjee, Narang, & Garg, 2011; Fu, Jiang, Xiao, Tian, & Liu, 2011; Meister & Brinkmann, 2009; Sun, Shen, & Yong, 2013), etc. Fig. 1–1 shows the basic steps of a data deduplication approach: the data values are initially repeated and, using a suitable algorithm such as a hash function, each value is assigned a particular function value. The next step compares all values against this standard value or function, and identical values are ignored. The outcome of the last step shows that the values that were initially repeated have been ignored and replaced by a single data value. Fig. 1–1 depicts a typical way in which the steps of data deduplication work in any application of data science (Budiarti, Sukaridhoto, Hariadi, & Purnomo, 2019; Gracia, González, Robles, & Menasalvas, 2014; Kaur et al., 2020; Sorzano, Vargas, & Montano, 2014) or database management systems (DBMS).

    Figure 1–1 Data deduplication basic approach.
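    As a concrete illustration of these steps, the following minimal Python sketch (the function and variable names are illustrative, not taken from the book) assigns each incoming value a hash, compares it against the values already seen, and keeps only the first occurrence:

```python
import hashlib

def deduplicate(values):
    """Keep only the first occurrence of each value, as in the steps of Fig. 1-1."""
    seen_hashes = set()   # function values already assigned
    unique_values = []
    for value in values:
        # Step 1: assign a function value (here, a SHA-1 digest) to the value.
        digest = hashlib.sha1(str(value).encode("utf-8")).hexdigest()
        # Step 2: compare against previously assigned values.
        if digest not in seen_hashes:
            # Step 3: only unseen values are kept; repeats are ignored.
            seen_hashes.add(digest)
            unique_values.append(value)
    return unique_values

print(deduplicate([3, 7, 3, 9, 7, 7, 1]))   # -> [3, 7, 9, 1]
```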

    1.2 Methods of data deduplication

    For a variety of data science applications, different types of methods and algorithms are used in data deduplication approaches and strategies, but two main types are most popularly used in a large number of applications, namely the hash-based and content-aware methods. Based on when deduplication is performed relative to data storage and backup, the methods are also referred to as inline and postprocessing methods.

    1.2.1 Hash-based and content-aware methods

    In the hash-based method, a hashing (El-Shimi et al., 2019; Feldman, Bhat, Laborde, Yi, & Dechev, 2013; Ma et al., 2017; Oh et al., 2018; Puzio, Molva, Önen, & Loureiro, 2016) function or algorithm is popularly used, which is further employed in the identification of chunks (Lkhagvasuren, So, Lee, Yoo, & Ko, 2013; Meister & Brinkmann, 2009; Meister, Kaiser, & Brinkmann, 2013; Microsoft, 2016; Xu, 2016) of data. The chunks are fragments of small groups of data that describe small similar groups and assist in ignoring the repeated data. Fig. 1–2 shows a typical diagram of the hashing method of data deduplication. The different slices or chunks of data are assigned different values of the hash function, and after comparing a hash value with those of all other slices, the updated hash values result. This process is repeated until the assignment of values converges to a no-change state. A hash value generally requires a certain number of bits, and whenever subsequent chunks of data are found to have the same hash value, those chunks are treated as duplicate data and are not stored in the process of data deduplication. If the hash value is new and not found among the already existing values, then the hash value is stored and the corresponding data chunk is considered and stored in the database. In Fig. 1–2, we can see that the new slice K is updated in the next level with a new hash value. The new hash value is used as a new index value and acts as an important indexing factor in the further comparison of subsequent data slices or chunks.

    Figure 1–2 Assignment of hash values.
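    The same idea applies at the level of chunks. The sketch below (an illustrative example whose names are chosen here rather than taken from the text) keeps an index from hash values to stored chunks; a chunk whose hash value already appears in the index is treated as a duplicate and is never stored again:

```python
import hashlib

class HashIndexedStore:
    """Minimal hash-based deduplicating chunk store (illustrative sketch)."""

    def __init__(self):
        self.index = {}   # hash value -> stored chunk (acts as the chunk index)

    def put(self, chunk: bytes) -> str:
        digest = hashlib.sha1(chunk).hexdigest()
        if digest in self.index:
            # Same hash value already present: treat the chunk as a duplicate.
            return digest
        # New hash value: store it together with the corresponding data chunk.
        self.index[digest] = chunk
        return digest

store = HashIndexedStore()
ids = [store.put(c) for c in [b"alpha", b"beta", b"alpha", b"gamma"]]
print(len(store.index))   # 3 unique chunks stored; the repeated "alpha" is ignored
```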

    In the content-aware method, data patterns are investigated, and when the same patterns are reported, those patterns are treated as duplicate data. In this approach a file-name mechanism is followed, which means that a suitable file name is assigned to different data patterns, and upcoming patterns that differ are assigned new file names. In this method, whenever a match of file names is detected, a bitwise comparison is performed and similarity is assessed on that basis; in other words, a matching file name alone is not the criterion, the bitwise matching result also matters in the process of finding duplicate data. Knowledge of databases becomes essential in this approach because the number of files may run into millions, and therefore some form of memory indexing is also needed to manage the files efficiently and keep them updated so that duplicate data can be ignored. This scheme of deduplication can be seen in Fig. 1–3, where the steps are self-explanatory.

    Figure 1–3 File name concept of deduplication.
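    A minimal sketch of the content-aware scheme described above might look as follows; the class and method names are hypothetical, and the byte-for-byte equality test stands in for the bitwise comparison step:

```python
import hashlib

class NameIndexedStore:
    """Content-aware sketch: match by file name, confirm with a bitwise comparison."""

    def __init__(self):
        self.files = {}   # file name -> stored content

    def put(self, name: str, content: bytes) -> bool:
        """Return True if the file was stored, False if it was a confirmed duplicate."""
        existing = self.files.get(name)
        if existing is not None and existing == content:
            # Name matched and the byte-for-byte comparison confirms a duplicate.
            return False
        # New name, or same name with different content: keep it under a new key.
        key = name if existing is None else f"{name}#{hashlib.sha1(content).hexdigest()[:8]}"
        self.files[key] = content
        return True

store = NameIndexedStore()
print(store.put("report.pdf", b"..."))   # True: stored
print(store.put("report.pdf", b"..."))   # False: duplicate detected by the bitwise check
```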

    These methods are distinguished by the time at which deduplication is performed. If the operation is performed while the data is being stored, the approach is called the inline method; the postprocessing method is performed just after the data being processed has been stored. The inline method does not depend on whether the underlying technique is hash-based or content-aware, whereas the postprocessing method and its performance do depend on the type of method being used for data deduplication.

    1.3 Classic research and classification of methods

    When a large amount of data is stored for a long period of time and some of the data are duplicates, the system has difficulty handling the large volume of data, and thus data deduplication mechanisms or search algorithms are employed. There are some classic research approaches attempted by a number of authors in the current literature. We will discuss a summary of their research impact and then the classification of methods.

    The most popular research on this approach is based on using chunks as variable-length data bundles. In this method, the data is segmented into small data streams called chunks; only unique chunks are retained in the system, and repeated data is avoided with the help of some indexing mechanism. Generally, this approach is referred to as the file chunking method of data deduplication (El-Shimi et al., 2019; Meister & Brinkmann, 2009), where the inputs are files and the outputs are small chunks or blocks of data. The size of a chunk may be fixed or variable, and a small chunk size makes deduplication more efficient due to the better compression achieved with small chunks. In this process the cost and complexity of deduplication increase due to increased indexing and metadata (El-Shimi et al., 2019; Meyer & Bolosky, 2012; Oh et al., 2018; Sun, Zhang, Lou, & Hou, 2018; Tarasov et al., 2014). The overhead also increases for smaller chunk sizes; this can be overcome by choosing larger chunks, but larger chunks do not reduce the duplicate data as efficiently. According to the length of the chunk, deduplication methods are divided into fixed-length (Lin et al., 2015) and variable-length chunking (Noorafiza, n.d.). Variable-length chunking offers more flexibility, and the overhead issues can be managed whenever needed in the data streams. In the variable-length category, the MUltithreaded content-based file CHunking (MUCH) method (Ma et al., 2017) utilizes the multicore architecture of the processor and thus operates in the same manner a multicore processor does; in other words, the operation is performed in a parallel manner, with more efficient data deduplication than the single-threaded chunking approach. Other deduplication methods used in the literature include local maximum chunking, semantic chunking, content-defined chunking, the B+ tree method (Chen et al., 2018), etc.
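    For reference, fixed-length chunking (the simpler of the two categories) can be sketched in a few lines; the 4096-byte chunk size below is an arbitrary choice for illustration:

```python
def fixed_length_chunks(data: bytes, chunk_size: int = 4096):
    """Split data into fixed-length chunks (the last one may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

chunks = fixed_length_chunks(b"A" * 10000, chunk_size=4096)
print([len(c) for c in chunks])   # [4096, 4096, 1808]
```

    A smaller chunk size exposes more duplicate blocks but produces more index entries and metadata, which is the overhead trade-off described above.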

    1.3.1 Classification of methods

    Data deduplication is sometimes referred to as data compression (Feldman et al., 2013; Ontap, 2015; Tian, Khan, Jiménez, & Loh, 2014; Xia et al., 2016; Xu, 2016), since it reduces the data storage requirement by eliminating duplicate or redundant data. Because deduplication operates using a suitable indexing function, it is called an intelligent or smart compression method. Compression itself is implemented as lossy or lossless for some types of data, especially images. The main benefits of data deduplication methods include potential resource savings and a reduction in the potential cost of backup and energy. Data deduplication methods are broadly classified based on the following three important factors:

    • Point of application

    • Time of application

    • Granularity

    Point of application refers to the place where the processing is performed. On this basis, the methods are divided into source-based and target-based data deduplication methods. The source-based method (Fu et al., 2011; Oh et al., 2018; Zhu et al., 2018) is also known as the client-based approach to data deduplication, in which a backup agent is introduced that ensures the removal of redundant data and the forwarding of only unique data for subsequent processing and operation. Utilization of bandwidth as well as storage is improved; however, complexity increases due to the use of the backup agent in the process. Fig. 1–4 shows a typical flow of the source-based data deduplication approach. Here, the duplicate-aware agent is actually the backup agent that decides whether data is redundant or unique. Only the unique data is submitted to the server, and the deduplicated data remains available at the source.

    Figure 1–4 Source-based method.
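    One plausible way to realize the source-based flow is for the backup agent to exchange only fingerprints with the server and transmit just the chunks the server does not yet hold. The sketch below makes that assumption explicit; the `Server` and `BackupAgent` classes are illustrative names, not an API from the book:

```python
import hashlib

class Server:
    """Target storage: keeps unique chunks keyed by fingerprint."""
    def __init__(self):
        self.chunks = {}

    def missing(self, fingerprints):
        return [fp for fp in fingerprints if fp not in self.chunks]

    def store(self, fp, chunk):
        self.chunks[fp] = chunk

class BackupAgent:
    """Duplicate-aware agent on the client: forwards only unique chunks."""
    def __init__(self, server):
        self.server = server

    def backup(self, chunks):
        fps = [hashlib.sha1(c).hexdigest() for c in chunks]
        needed = set(self.server.missing(fps))   # ask which chunks are unknown
        for fp, chunk in zip(fps, chunks):
            if fp in needed:
                self.server.store(fp, chunk)      # only unique data crosses the network
                needed.discard(fp)
        return fps                                # recipe kept for a later restore

server = Server()
agent = BackupAgent(server)
agent.backup([b"a-block", b"b-block", b"a-block"])
print(len(server.chunks))   # 2: the duplicate block was never sent
```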

    In the target-based method (Al-Kiswany et al., 2011; Hossain, Atrey, & El Saddik, 2011; Pang, Jia, & Feng, 2014), also known as target-side data deduplication, the client is not aware of deduplication, and a deduplication engine is embedded within the hardware array, such as network-attached storage (NAS) or a storage area network (SAN). Fig. 1–5 shows this mechanism of data deduplication.

    Figure 1–5 Target-based method.

    Based on the time of application, data deduplication methods are divided into inline and postprocessing methods. Fig. 1–6 shows the inline (Aronovich et al., 2009) data deduplication method. This method is applied on the arrival of data, before it hits the disk storage. It is very simple and fast compared to the postprocessing (Oh et al., 2018; Wang, Xiao, & Lee, 2015) method because there is no need for backup and recovery. The appropriate use of a suitable function, such as a hash function, is required in this method to indicate the data that is actually to be stored on disk.

    Figure 1–6 Inline deduplication method.

    The main advantages of the inline method are that fewer I/O operations need to be performed, since deduplication is applied before storage and the redundancy is removed, and that it is simple and fast. Fig. 1–7 depicts the postprocessing data deduplication method. It can be seen that the deduplication engine or algorithm operates after storage, which means the deduplication is of the postprocessing type. While the data is being stored, the uniqueness of the data is identified, and immediately after storage the duplicates are removed.

    Figure 1–7 Postprocessing deduplication method.
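    The difference between the two timings can be made concrete with a small sketch (illustrative names and logic, assuming a simple fingerprint index): the inline path checks each chunk before it is written, while the postprocessing path scans data that has already been written and removes the duplicates afterwards:

```python
import hashlib

def fingerprint(chunk: bytes) -> str:
    return hashlib.sha1(chunk).hexdigest()

def inline_write(disk: dict, chunk: bytes):
    """Inline: check before the chunk ever hits the disk."""
    fp = fingerprint(chunk)
    if fp not in disk:
        disk[fp] = chunk          # only unseen chunks are written

def postprocess(raw_disk: list) -> dict:
    """Postprocessing: everything was written first; duplicates are removed afterwards."""
    deduped = {}
    for chunk in raw_disk:
        deduped.setdefault(fingerprint(chunk), chunk)
    return deduped

# Inline path: the duplicate never reaches storage.
disk = {}
for c in [b"x", b"y", b"x"]:
    inline_write(disk, c)
print(len(disk))                              # 2

# Postprocessing path: three writes hit storage first, then duplicates are dropped.
print(len(postprocess([b"x", b"y", b"x"])))   # 2
```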

    Granularity determines whether deduplication is applied to the whole data set, to files, or to subfiles. The file-level method is also known as single instance storage (SIS) (Rabotka & Mannan, 2016). In this method, file-level data is compared with the help of an index and the attributes of the file. When files are found to be different, they are updated with the help of the indexing mechanism; similar files are ignored. In the subfile-level method, all files are divided into smaller chunks or blocks, and these chunks are compared against a subject chunk using a hash function. Similar chunks are removed, and only unique chunks are kept in the storage and referenced from the subfile.
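    The effect of granularity on storage savings can be illustrated with a short example (hypothetical helper names; 4-byte blocks are used only to keep the numbers small): two files that differ only in their last block are stored twice at file level but share most of their blocks at subfile level:

```python
import hashlib

def sha1(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()

def file_level_unique_bytes(files):
    """Single instance storage: one copy per distinct whole file."""
    return sum(len(f) for f in {sha1(f): f for f in files}.values())

def subfile_level_unique_bytes(files, chunk_size=4):
    """Subfile level: one copy per distinct fixed-length chunk across all files."""
    unique = {}
    for f in files:
        for i in range(0, len(f), chunk_size):
            chunk = f[i:i + chunk_size]
            unique[sha1(chunk)] = chunk
    return sum(len(c) for c in unique.values())

files = [b"AAAABBBBCCCC", b"AAAABBBBDDDD"]   # the files share two 4-byte blocks
print(file_level_unique_bytes(files))        # 24: the whole files differ, both are kept
print(subfile_level_unique_bytes(files))     # 16: shared blocks are stored only once
```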

    1.4 File chunking and metadata

    In the file chunking approach to data deduplication (El-Shimi et al., 2019; Meister & Brinkmann, 2009; Ni, Lin, & Jiang, 2019), the data is divided into small chunks and each chunk is compared with all the chunks available in the data. The chunks are described by summary information, also called a fingerprint, rather than being compared at the finest granularity (bit-level comparison). The fingerprints are therefore used to match a chunk against all the remaining chunks present in the data and determine whether it is redundant. The unique file or data is stored in the storage using chunks, as can be seen in Fig. 1–8, where a file is divided into a number of chunks c1, c2, c3, …, each of which is assigned summary information called a fingerprint. The comparison of fingerprints helps determine whether the file is unique or redundant. The size of the chunks may be fixed or variable, and accordingly the file chunking methods are of two types, namely fixed-length and variable-length file chunking. In fixed-length chunking, the file is divided into fixed-length chunks of a certain number of bits. This method is simple, fast, and computationally efficient.

    Figure 1–8 File chunking.

    In the variable-size chunking method, the length of the chunks differs from chunk to chunk, which makes the method adaptive and flexible while executing the comparison and indexing processes. A number of algorithms for file chunking are available in the literature and used in various applications of data deduplication (Kaczmarczyk, Barczynski, Kilian, & Dubnicki, 2012; Lkhagvasuren et al., 2013; Ni et al., 2019; Romański et al., 2011). Chunk indexing is an important term in the file chunking method; the index is used to indicate new or updated files, and its size and processing time decide the performance of chunking-based deduplication. The larger the index, the longer the processing time, and the performance of the storage system is affected accordingly.
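    As one example of the variable-length category, a content-defined chunking sketch is shown below. It is not any specific algorithm from the literature cited above; it simply cuts wherever a toy rolling sum of recent bytes satisfies a boundary condition, subject to minimum and maximum chunk lengths, so cut points follow the content rather than fixed offsets:

```python
def content_defined_chunks(data: bytes, window=16, mask=0x3F, min_len=32, max_len=512):
    """Variable-length chunking sketch: cut where a sliding-window sum of the last
    `window` bytes satisfies a boundary condition, within min/max chunk lengths."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i - window >= start:
            rolling -= data[i - window]          # maintain the sliding-window sum
        length = i - start + 1
        at_boundary = (rolling & mask) == 0      # content-defined cut condition
        if (length >= min_len and at_boundary) or length >= max_len:
            chunks.append(data[start:i + 1])
            start, rolling = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])              # remainder becomes the last chunk
    return chunks

data = bytes(range(256)) * 8
print([len(c) for c in content_defined_chunks(data)][:5])   # varying chunk lengths
```

    Because the cut points depend on the bytes themselves, inserting data near the start of a stream shifts only nearby boundaries, which is the usual motivation for variable-length chunking.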

    1.4.1 Metadata

    In the data deduplication process, two different types of data are involved, namely the actual data and the metadata (El-Shimi et al., 2019; Meyer & Bolosky, 2012; Oh et al., 2018; Sun et al., 2018; Tarasov et al., 2014). The files or documents that are actually to be stored in the system constitute the first category of data, whereas the data that corresponds to the redundancies constitutes the metadata. Metadata is used by the algorithm or software to identify redundant data present in the files and also to rehydrate data when restoration is needed, that is, when useful data must be recovered from among the deduplicated data. Metadata is also known as data about data, or chunk metadata, and it is critical in determining the performance of the storage system. Metadata uses a data structure called inodes, which indicate the blocks for updating the data and store the file-specific information about the blocks associated with particular files. The inodes are chosen in such a manner that there is no loss of data; otherwise, system performance with regard to potentially useful data will be adversely affected. Therefore, the most appropriate hash function or an indexing structure such as a B+ tree is chosen so that the data structure does not cause any data loss in the system.
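    The separation between actual data and metadata can be sketched as a chunk store plus a per-file recipe that plays the role of the inode-like record described above (all names here are illustrative); the recipe is what allows a deduplicated file to be rehydrated:

```python
import hashlib

class DedupStore:
    """Sketch of the two kinds of data: unique chunks (actual data) and
    per-file chunk recipes (metadata used to rehydrate files)."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = {}      # actual data: fingerprint -> chunk bytes
        self.manifests = {}   # metadata: file name -> ordered list of fingerprints

    def store(self, name: str, data: bytes):
        recipe = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            fp = hashlib.sha1(chunk).hexdigest()
            self.chunks.setdefault(fp, chunk)    # duplicates map to the same entry
            recipe.append(fp)
        self.manifests[name] = recipe            # inode-like record of the file's blocks

    def rehydrate(self, name: str) -> bytes:
        """Rebuild the original file from its metadata recipe."""
        return b"".join(self.chunks[fp] for fp in self.manifests[name])

store = DedupStore(chunk_size=4)
store.store("a.txt", b"AAAABBBBAAAA")
assert store.rehydrate("a.txt") == b"AAAABBBBAAAA"
print(len(store.chunks))   # 2 unique chunks backing 3 blocks
```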

    1.5 Implementation strategies

    In a typical data deduplication architecture, we generally have two important stages or phases: the deduplication phase and the reconstruction phase. The deduplication phase includes the following components:

    • File chunker

    • Chunk ID generator

    • Duplicate finder

    • Metadata

    • Storage

    The reconstruction phase has the following components:

    • Reconstructor

    • Metadata

    • Storage

    As we can see in Fig. 1–9, the input files are chunked by the file chunker, and each chunk is then assigned a unique identity (ID) with the help of the ID generator. The generation of IDs employs a suitable algorithm so that IDs can be allotted in a systematic manner for an efficient deduplication process. The indexing mechanism and ID generation help in finding duplicates; when duplicates are reported, the redundant data is associated with the metadata, otherwise the data is stored in the storage. In the reconstruction phase, the data is thoroughly checked and its ID is compared using a suitable indexing algorithm; when potential data is detected, it is associated with the storage, else the redundant data is resolved and output by the system.
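    A compact sketch of how these components might be wired together is given below; the class names simply mirror the component list above and do not correspond to any concrete implementation in the book:

```python
import hashlib

class FileChunker:
    def __init__(self, size=4096):
        self.size = size
    def chunk(self, data: bytes):
        return [data[i:i + self.size] for i in range(0, len(data), self.size)]

class ChunkIDGenerator:
    def identify(self, chunk: bytes) -> str:
        return hashlib.sha1(chunk).hexdigest()   # systematic, content-derived ID

class DuplicateFinder:
    def __init__(self):
        self.seen = set()
    def is_duplicate(self, chunk_id: str) -> bool:
        if chunk_id in self.seen:
            return True
        self.seen.add(chunk_id)
        return False

class DedupPipeline:
    """Wires together the components of the deduplication and reconstruction phases."""
    def __init__(self):
        self.chunker, self.ids, self.finder = FileChunker(), ChunkIDGenerator(), DuplicateFinder()
        self.storage = {}     # chunk ID -> chunk
        self.metadata = {}    # file name -> ordered chunk IDs

    def deduplicate(self, name: str, data: bytes):
        recipe = []
        for chunk in self.chunker.chunk(data):
            cid = self.ids.identify(chunk)
            if not self.finder.is_duplicate(cid):
                self.storage[cid] = chunk         # unique chunks go to storage
            recipe.append(cid)                    # duplicates only touch the metadata
        self.metadata[name] = recipe

    def reconstruct(self, name: str) -> bytes:
        return b"".join(self.storage[cid] for cid in self.metadata[name])

pipeline = DedupPipeline()
pipeline.deduplicate("a.bin", b"AAAABBBB" * 1024)
assert pipeline.reconstruct("a.bin") == b"AAAABBBB" * 1024
```

    Calling `deduplicate()` on several files populates the storage with unique chunks only, and `reconstruct()` rebuilds any file from the chunk IDs recorded in its metadata.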

    The file formats that are subjected to the deduplication system may be the portable document format (PDF), Microsoft document formats such as Word and Excel documents, Hypertext Markup Language (HTML) files, etc. These files are converted into chunks using the file chunker and subjected to further operations, as shown in Fig. 1–9. Generally, the following steps are used:

    • Input the file

    • Convert it into chunks using a suitable file chunking algorithm

    • Check whether a duplicate exists with the help of indexing and the chunk ID

    • If the chunk already exists, associate it with the metadata

    • Else, associate it with the actual data

    Figure 1–9 A typical deduplication architecture.

    For file chunking, the Two Thresholds Two Divisors (TTTD) chunking method is commonly used; it produces variable-length chunks in a flexible manner so that deduplication is implemented adaptively. The chunking process is shown in Fig. 1–10, in which two thresholds are selected along with a main divisor and a backup divisor. Based on the threshold values, the chunks are divided by employing appropriate breakpoints, and the process continues until all the files have been compared and chunked. A hashing function is used to identify unique chunks in the system, and a typical chunk generation process is shown in Fig. 1–11, in which the secure hash algorithm (SHA-1) (Ng et al., 2012; Puzio et al., 2016; Xia et al., 2016) of the National Institute of Standards and Technology (NIST) is used. This algorithm generates the ID for a new chunk, and the resulting chunk is updated in the final list of files.

    Figure 1–10 Chunking process.

    Figure 1–11 Chunk generation.
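    The following is a deliberately simplified sketch of the TTTD idea only, not the published algorithm: the two thresholds bound the chunk size, the main divisor defines the preferred breakpoints, and the backup divisor supplies a fallback breakpoint when the maximum threshold is reached. The rolling fingerprint and all parameter values are illustrative choices:

```python
def tttd_chunks(data: bytes, t_min=64, t_max=512, main_div=97, backup_div=47):
    """Simplified Two Thresholds Two Divisors (TTTD) chunking sketch.
    t_min/t_max are the two thresholds; main_div/backup_div are the two divisors."""
    chunks, start, rolling, backup_cut = [], 0, 0, -1
    for i, byte in enumerate(data):
        rolling = (rolling * 31 + byte) & 0xFFFFFFFF   # toy rolling fingerprint
        length = i - start + 1
        if length < t_min:
            continue                                   # below the minimum threshold: no cut yet
        if rolling % backup_div == backup_div - 1:
            backup_cut = i                             # remember a fallback breakpoint
        if rolling % main_div == main_div - 1:
            chunks.append(data[start:i + 1])           # main divisor found a breakpoint
            start, rolling, backup_cut = i + 1, 0, -1
        elif length >= t_max:
            cut = backup_cut if backup_cut >= start else i
            chunks.append(data[start:cut + 1])         # fall back to the backup or hard cut
            start, rolling, backup_cut = cut + 1, 0, -1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

lengths = [len(c) for c in tttd_chunks(bytes(range(256)) * 16)]
print(min(lengths), max(lengths))   # sizes bounded by the thresholds (final remainder may be shorter)
```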

    1.6 Performance evaluation and concluding remarks

    The performance of data deduplication approaches is evaluated using a few important metrics, such as the deduplication ratio and the space reduction ratio. The deduplication ratio is defined as the ratio of the number of bytes given as input to the number of bytes produced after deduplication. This can be written as follows.
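    In symbols, with $B_{\text{in}}$ denoting the number of bytes presented to the deduplication system and $B_{\text{out}}$ the number of bytes remaining after deduplication (notation introduced here for clarity):

$$\text{Deduplication ratio} = \frac{B_{\text{in}}}{B_{\text{out}}}$$

    A ratio of 10:1, for example, means that ten bytes of input data are represented by a single stored byte.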
