How does deduplication work?
Warning: Unless instructed by authorized Microsoft Support personnel, do not attempt to manually modify the chunk store.
Typical general purpose file server workloads, such as team shares, Work Folders, folder redirection, and software development shares, are good candidates for deduplication. The Optimization job deduplicates by chunking data on a volume per the volume policy settings, optionally compressing those chunks, and storing chunks uniquely in the chunk store.
The optimization process that Data Deduplication uses is described in more detail later in this section. The Garbage Collection job reclaims disk space by removing unnecessary chunks that are no longer referenced by files that have recently been modified or deleted.
The Integrity Scrubbing job identifies corruption in the chunk store due to disk failures or bad sectors. When possible, Data Deduplication can automatically use volume features such as mirror or parity on a Storage Spaces volume to reconstruct the corrupted data. Additionally, Data Deduplication keeps backup copies of popular chunks when they are referenced more than 100 times in an area called the hotspot. The Unoptimization job, which is a special job that should only be run manually, undoes the optimization done by deduplication and disables Data Deduplication for that volume.
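The interaction between the Optimization and Garbage Collection jobs can be sketched in a few lines of Python. The model below stores unique chunks, records per-file chunk references, drops only the references when a file is deleted, and reclaims unreferenced chunks afterward. It uses fixed-size chunks and an in-memory dictionary purely for illustration; it is not the on-disk format or algorithm that Windows Server Data Deduplication actually uses.

```python
import hashlib

# Toy in-memory model of a chunk store, an Optimization job, and a
# Garbage Collection job. Fixed-size chunks and tiny sizes are used only
# to keep the example visible.

CHUNK_SIZE = 4

class ChunkStore:
    def __init__(self):
        self.chunks = {}   # chunk hash -> chunk bytes
        self.files = {}    # file name -> list of chunk hashes (file's references)

    def optimize(self, name, data):
        """Optimization job: chunk the file, store unique chunks,
        and replace the file stream with a list of chunk references."""
        refs = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)   # store each chunk only once
            refs.append(digest)
        self.files[name] = refs

    def delete(self, name):
        """Deleting a file removes its references, not the chunks themselves."""
        self.files.pop(name, None)

    def garbage_collect(self):
        """Garbage Collection job: drop chunks no file references any more."""
        live = {h for refs in self.files.values() for h in refs}
        for digest in list(self.chunks):
            if digest not in live:
                del self.chunks[digest]

    def read(self, name):
        """Rehydrate a file from its chunk references."""
        return b"".join(self.chunks[h] for h in self.files[name])

store = ChunkStore()
store.optimize("a.txt", b"AAAABBBBCCCC")
store.optimize("b.txt", b"AAAABBBBDDDD")   # shares two chunks with a.txt
print(len(store.chunks))                    # 4 unique chunks stored, not 6
store.delete("a.txt")
store.garbage_collect()                     # the "CCCC" chunk is reclaimed
print(store.read("b.txt"))                  # b"AAAABBBBDDDD"
```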
A chunk is a section of a file that has been selected by the Data Deduplication chunking algorithm as likely to occur in other, similar files. The chunk store is an organized series of container files in the System Volume Information folder that Data Deduplication uses to uniquely store chunks. Every file contains metadata that describes interesting properties about the file that are not related to the main content of the file. The file stream is the main content of the file. This is the part of the file that Data Deduplication optimizes.
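The chunking step is what makes it likely that the same data in two different files produces the same chunks: boundaries are chosen from the content itself rather than at fixed offsets. The sketch below shows one generic way to do that with a rolling hash. The window size, boundary mask, size limits, and hash function are arbitrary illustrative values, not the parameters of the chunking algorithm Data Deduplication actually ships.

```python
import hashlib

# Generic content-defined chunking sketch using a polynomial rolling hash.
# All parameters here are illustrative assumptions.

WINDOW = 16           # bytes the rolling hash covers
MASK = (1 << 6) - 1   # cut roughly every 64 bytes on average (toy value)
MIN_CHUNK, MAX_CHUNK = 32, 256
BASE, MOD = 257, (1 << 31) - 1
BASE_POW = pow(BASE, WINDOW - 1, MOD)

def chunk_boundaries(data: bytes):
    """Yield (start, end) offsets of content-defined chunks."""
    start, h = 0, 0
    for i, byte in enumerate(data):
        # drop the oldest byte once the window is full, then add the new byte
        if i - start >= WINDOW:
            h = (h - data[i - WINDOW] * BASE_POW) % MOD
        h = (h * BASE + byte) % MOD
        length = i - start + 1
        # cut when the hash hits the mask pattern (subject to size limits)
        if (length >= MIN_CHUNK and (h & MASK) == MASK) or length >= MAX_CHUNK:
            yield start, i + 1
            start, h = i + 1, 0
    if start < len(data):
        yield start, len(data)

data = b"example data " * 100
chunks = [data[s:e] for s, e in chunk_boundaries(data)]
# identical chunk contents hash to the same value, so repeats collapse in a set
print(len(chunks), "chunks,", len(set(chunks)), "unique")
print(hashlib.sha256(chunks[0]).hexdigest()[:12])
```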
The file system is the software and on-disk data structure that the operating system uses to store files on storage media.
A file system filter is a plugin that modifies the default behavior of the file system. To preserve access semantics, Data Deduplication uses a file system filter (Dedup.sys). A file is considered optimized (or deduplicated) by Data Deduplication if it has been chunked and its unique chunks have been stored in the chunk store. As additional data is written to the deduplicated volume, fingerprints are created for each new block and appended to a change log file. For subsequent deduplication operations, the change log is sorted and merged with the fingerprint file, and the deduplication operation continues with fingerprint comparisons, as described below.
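A hedged sketch of that sort-and-merge step follows. The (fingerprint, block address) tuples and the in-memory lists standing in for the change log and the fingerprint file are assumptions for illustration; real implementations store fingerprints far more compactly.

```python
import hashlib
from heapq import merge

def fingerprint(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()

# Existing fingerprint file: (fingerprint, block address), sorted by fingerprint.
fingerprint_file = sorted(
    (fingerprint(b), addr) for addr, b in [(0, b"alpha"), (1, b"beta"), (2, b"gamma")]
)

# Change log: fingerprints appended in write order as new blocks arrive.
change_log = [(fingerprint(b), addr) for addr, b in [(3, b"beta"), (4, b"delta")]]

# Deduplication pass: sort the change log, merge it with the sorted
# fingerprint file, and flag any fingerprint that now appears twice.
change_log.sort()
merged = list(merge(fingerprint_file, change_log))

duplicates = [cur for prev, cur in zip(merged, merged[1:]) if prev[0] == cur[0]]
print("duplicate blocks:", duplicates)   # block 3 duplicates block 1 ("beta")
fingerprint_file = merged                # the merged list becomes the new fingerprint file
```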
Deduplication removes data redundancies. When deduplication runs for the first time on a volume with existing data, it scans all the blocks in the volume and creates a digital fingerprint for each block. Each fingerprint is then compared to all the other fingerprints within the volume.
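The sketch below illustrates that first-run scan under simple assumptions: fixed-size blocks, SHA-256 as the fingerprint, and a hash-map index standing in for the fingerprint comparisons.

```python
import hashlib

# Minimal sketch of a first-run scan: fingerprint every block on the volume
# and find blocks whose fingerprints match an earlier block. Block size and
# the in-memory index are illustrative assumptions.

BLOCK_SIZE = 8

def first_run_scan(volume: bytes):
    seen = {}          # fingerprint -> address of the first block with that content
    duplicates = {}    # duplicate address -> address of the block it matches
    for addr in range(0, len(volume), BLOCK_SIZE):
        block = volume[addr:addr + BLOCK_SIZE]
        fp = hashlib.sha256(block).hexdigest()
        if fp in seen:
            duplicates[addr] = seen[fp]   # same fingerprint seen earlier in the scan
        else:
            seen[fp] = addr
    return duplicates

volume = b"AAAAAAAA" + b"BBBBBBBB" + b"AAAAAAAA" + b"CCCCCCCC"
print(first_run_scan(volume))   # {16: 0} -- the block at 16 matches the block at 0
```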
Consider, for example, hundreds of identical or nearly identical virtual desktop images: deduplication has the potential to significantly reduce the capacity needed to store all of those virtual machines. Deduplication works by creating a data fingerprint for each object that is written to the storage array.
As new data is written to the array, any write whose fingerprint matches one already in the system is stored as a small pointer to the existing data rather than as another full copy. If a completely new data item is written, one that the array has not seen before, the full copy of the data is stored.
As you might expect, different vendors handle deduplication in different ways. There are two primary deduplication techniques that deserve discussion: inline deduplication and post-process deduplication. Inline deduplication takes place at the moment that data is written to the storage device.
While the data is in transit, the deduplication engine fingerprints the data on the fly. As you might expect, this deduplication process does create some overhead. First, the system has to constantly fingerprint incoming data and then quickly identify whether or not that new fingerprint already matches something in the system.
If it does, only a pointer to the existing data is written. If it does not, the block is saved as-is. Post-process deduplication, by contrast, writes every block to disk in full first and removes the duplicates in a later background pass; this is the approach that the Optimization job described earlier takes.
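The difference between the two techniques can be made concrete with a small sketch. The class names and block-level model below are illustrative assumptions; the point is only where the fingerprinting happens: on the write path for inline deduplication, and in a later background job for post-process deduplication.

```python
import hashlib

def fp(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

class InlineDedupStore:
    """Fingerprints data while it is being written; duplicates never land on disk twice."""
    def __init__(self):
        self.blocks = {}    # fingerprint -> stored data
        self.volume = []    # what each logical write resolves to (a pointer)

    def write(self, data: bytes):
        digest = fp(data)
        if digest not in self.blocks:   # new data: store the full copy
            self.blocks[digest] = data
        self.volume.append(digest)      # duplicate: only a pointer is kept

class PostProcessStore:
    """Writes full copies first, then deduplicates in a later background job."""
    def __init__(self):
        self.blocks = {}
        self.volume = []    # full copies land here at write time

    def write(self, data: bytes):
        self.volume.append(data)

    def dedup_job(self):
        blocks, pointers = {}, []
        for data in self.volume:
            digest = fp(data)
            blocks.setdefault(digest, data)
            pointers.append(digest)
        self.blocks, self.volume = blocks, pointers

writes = [b"alpha", b"beta", b"alpha", b"alpha"]

inline = InlineDedupStore()
for w in writes:
    inline.write(w)
print(len(inline.blocks))   # 2 unique blocks stored, deduplicated at write time

post = PostProcessStore()
for w in writes:
    post.write(w)
print(len(post.volume))     # 4 full copies on disk until the background job runs
post.dedup_job()
print(len(post.blocks))     # 2 unique blocks after the background pass
```

The trade-off follows directly from the sketch: inline deduplication spends CPU and adds latency on every write but never stores redundant copies, while post-process deduplication keeps the write path fast but temporarily needs enough capacity to hold the full, unoptimized data until the background job runs.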