First, some background. I have been managing hard drive storage for many years. While my experience comes from years of maintaining data, I am not an IT professional (though I am an engineer). Start by reading this article on the Storage Spaces War Stories (SSWS) blog, a blog about Microsoft Storage Spaces among other things. The linked article is not specific to Microsoft Storage Spaces; it applies to Network Attached Storage (NAS) in general, including NAS appliances, Unraid servers, and any sort of Redundant Array of Inexpensive Disks (RAID) environment.
The idea in question is parity, or RAID 5 to use the generic term. Microsoft Storage Spaces is not really RAID; it’s an implementation similar to RAID that offers comparable features such as striping, parity and mirroring. Other implementations, such as Unraid, do the same. When I refer to RAID here, it is only to be generic and to refer to those basic features. Specifically, this article discusses parity.
Why am I questioning parity? Parity is commonly misperceived as a way to store data without needing a backup. That is the point of the article linked above: while RAID (and other implementations) offers parity, it is commonly not used properly. To put it plainly, parity is not a replacement for a data backup, especially if you are unfamiliar with data storage concepts and the environments offering them. Always, always have a backup of your data. Better yet, have more than one backup and keep these backups in different physical locations (such as a safety deposit box). Keep backups fresh, on hardware that is not likely to fail while you are restoring from it. Part of the point of the linked article is that your backup is no longer a backup while you are restoring data from it. Until that data has redundancy again on physical hardware, you do not have a backup! What if your backup media fails while you are retrieving the data?
So what’s the purpose of parity? Start with the concept of striping. When operating on large files where IO speeds might need to be high, striping stores the data across multiple drives in smaller chunks. When this data is accessed, it is read from multiple drives at once, so each drive only has to deliver a fraction of the data. The net effect is that large file copies, for example, can speed up by as much as the number of drives in the stripe. The disadvantage is that your data is now spread across multiple physical drives, so if any one drive in the array fails, your data is gone. Parity addresses this by storing a parity value (a kind of checksum, typically calculated with XOR) in addition to splitting the data across multiple drives, so that a lost chunk can be recreated from the chunks that remain. For example, in a 3 column parity array (let’s keep this simple and assume we have 3 drives), each stripe of data is split across two drives while the third stores the parity. This means that with 3 drives we give up one drive’s worth of capacity (one third of the total), but if one drive fails we can recreate its data later. This gives your data resiliency, since your array (of 3 drives) could see a single drive failure yet recover its data. There are still many reasons why Storage Spaces (and RAID) can fail at this point, but I won’t get into them here. Those reasons are explained elsewhere, such as on the SSWS blog.
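To make the 3 drive example concrete, here is a minimal sketch of how a lost chunk can be rebuilt from parity. It assumes plain XOR parity, which is how RAID 5 style parity is usually described; I am not claiming this is how Storage Spaces or Unraid lay data out internally. The names `xor_blocks` and `usable_fraction` are made up for illustration.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length byte blocks together, byte by byte."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("all blocks must be the same length")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def usable_fraction(columns: int) -> float:
    """Fraction of raw capacity left for data when one column holds parity."""
    return (columns - 1) / columns


# One stripe on a hypothetical 3 drive array: two data chunks plus one parity chunk.
data = b"BACKUPS!"
chunk_a, chunk_b = data[:4], data[4:]   # stored on drive 1 and drive 2
parity = xor_blocks(chunk_a, chunk_b)   # stored on drive 3

# Simulate drive 2 failing: rebuild its chunk from the two surviving drives.
rebuilt_b = xor_blocks(chunk_a, parity)
recovered = chunk_a + rebuilt_b

print(recovered == data)        # the stripe is whole again
print(usable_fraction(3))       # share of raw capacity available for data
```

Note what the sketch does not cover: if two of the three chunks are lost, there is nothing left to XOR against, and nothing here protects against the data being deleted or overwritten while the array is online.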
So what’s the point of parity? Notably, it offers a LEVEL of data protection (i.e., resiliency), while also offering advantages in performance. In that sense it sits between striping and mirroring: the data is striped for speed, but with enough redundancy to rebuild one drive, rather than the full second copy a mirror keeps. Without going into detail, mirroring (or RAID 1) is also not a replacement for data backups. One reason is that RAID setups (and especially MS Storage Spaces) typically depend on an operating system running them at the software level. This makes malware a threat to your data, since it could simply delete the data on the drives. It’s an “online” redundancy. RAID, in general, is about performance improvements. With that understood, parity is a way to survive a drive failure, but there are many, many other things that can interfere with your data. While RAID can offer storage performance improvements, its complexity and implementation are generally much more likely to cause data loss than, say, storing data on a single drive. Backup drives generally don’t need great performance, at least for typical use, especially if they are offline backup drives.
I hope you see where I am going. All things considered, I’m beginning to lose faith in the apparent “helpfulness” of the resiliency offered by parity because of all of its drawbacks. It is a tool, like any other, that when used properly can help in specific situations under specific conditions. It is not a replacement for data redundancy across multiple offline drives at (ideally) separate physical locations.
What I’d like to know is what those specific situations and circumstances ARE, but I don’t want to find out the hard way. I have multiple offline backups of my data, and I think I will do away with parity in Microsoft Storage Spaces on my system, or perhaps switch to mirroring. The apparent gains of parity just aren’t worth it to me anymore.