LVM is a very cool thing that I feel is a bit underappreciated.
I've always built my systems like this:
two hard disks => RAID1 => LVM VG1. nvme => LVM VG2
Because at some point I decided LVM is just like flexible partitions and didn't pay much attention to the huge amounts of other stuff in the manpages.
Turns out that's a pretty crappy way of doing things.
LVM can do RAID, can deal with moving things across devices, and has many other fancy features. So instead of the above, stick it all into the same VG without RAID on top.
You can RAID individual LVs: some RAID1, some RAID5, some RAID0, some nothing, and you can pick which disks each one goes on. You can add integrity checking, which is not something MD RAID does. You can do caching. You can do writable snapshots. You can do thin provisioning. It's extremely flexible tech.
The only thing I'd recommend staying away from is thin provisioning unless it's necessary. The performance impact is significant, virt-manager for some reason doesn't see thin LVs as legitimate disks (that can be worked around easily, though), and thin provisioning seems to be able to develop faults of its own. I've had one old setup where one block was somehow not readable, and it seems to be a thin provisioning thing -- it doesn't say the underlying disk is bad; it's something more along the lines of the metadata being broken.
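To make that concrete, here's a rough sketch of what a setup like that can look like (the VG name, LV names, sizes, and device paths are invented for illustration):

    # one VG spanning all the disks, no MD layer underneath
    vgcreate vg0 /dev/sda /dev/sdb /dev/nvme0n1

    # mirrored LV with dm-integrity checksumming, pinned to the two HDDs
    lvcreate --type raid1 -m 1 --raidintegrity y -L 100G -n important vg0 /dev/sda /dev/sdb

    # parity RAID LV (raid5 with 2 data stripes needs 3 PVs)
    lvcreate --type raid5 -i 2 -L 200G -n bulk vg0

    # plain linear LV on the NVMe only
    lvcreate -L 50G -n scratch vg0 /dev/nvme0n1

    # carve out an NVMe-backed LV and use it as a cache for the bulk LV
    lvcreate -L 20G -n fastcache vg0 /dev/nvme0n1
    lvconvert --type cache --cachevol fastcache vg0/bulk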
To upgrade you can do things like moving LVs from one disk to another to free it up and remove it. Disks of different sizes are much less of a problem than with standard RAID, too.
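As a sketch (again with made-up names), swapping a disk out from under a live VG is basically:

    # move all extents off the old disk, online
    pvmove /dev/sdb
    # remove it from the VG and wipe the PV label
    vgreduce vg0 /dev/sdb
    pvremove /dev/sdb
    # later, bring in the replacement (it doesn't need to be the same size)
    pvcreate /dev/sdc
    vgextend vg0 /dev/sdc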
~20 years ago I was helping someone with the storage server behind their mail server; it had 8x 10K disks in a RAID-1. It was always struggling and often couldn't keep up. Intel had just come out with their SSDs, and I had a friend with access to one that was larger than the array. I shipped this little laptop drive off to get installed in the storage server.
Once the SSD was in the server, I did a "pvmove" to move the data from the spinning array to the SSD, while the system was up and running. Over the next several hours the load on the server dropped. It was comical that these 8 hard drives could be replaced by something smaller than my wallet.
But pvmove was the other star of this show.
If those drives were short-stroked, they probably could have kept up with that SSD, though at reduced capacity. The SSD would probably have a lower power bill over its life, though. I did some calculations for an array of short-stroked 15k SAS disks to replace a consumer 4GB SSD for a write-intensive app that chews through SSDs, and its performance would be within spitting distance of the SSD. Ended up not doing it because parts availability for 15k SAS drives likely won't last much longer.
Except that we would have needed a lot more than the 8 drives then, to keep the same capacity. I think it was 1TB of storage around 1999. For mail storage it unfortunately needed both capacity and seek latency.
More drives would be acceptable if it meant not having to replace all the storage every few months.
AFAIK LVM RAID is mdraid under the covers, I guess the advantage of using LVM to configure and manage RAID is doing it all with one set of tools, assuming that the LVM utilities offer everything you need. I've always used mdraid for that, and (sometimes) LVM on top of the mdraid devices.
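You can see that shared plumbing if you poke at a running system (vg0 here is just an example, and output details vary by version):

    # LVM reports the RAID segment type per LV...
    lvs -o lv_name,segtype,devices vg0
    # ...and device-mapper shows a "raid" target, which is the MD RAID code driven via dm-raid
    dmsetup table | grep ' raid '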
Oh so it doesn't avoid bit rot the way ZFS does (by checksumming)? Mdadm, not being a filesystem, doesn't concern itself with file integrity. I don't suppose LVM would either :(
Unlike mdadm, LVM does have checksumming.
https://docs.redhat.com/fr/documentation/red_hat_enterprise_...
Wow, didn't know about that. Though it seems to require additional metadata, not just the parity data for the RAID level. From the Ubuntu 22 lvmraid(7) man page:
When creating a RAID LV with integrity, or adding integrity, space is required for integrity metadata. Every 500MB of LV data requires an additional 4MB to be allocated for integrity metadata, for each RAID image.
Also:
The following are not yet permitted on RAID LVs with integrity: lvreduce, pvmove, snapshots, splitmirror, raid syncaction commands, raid rebuild.
The typical workaround for these seems to be to remove the integrity, make the change, then re-add/reinitialize the integrity metadata.
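In practice that dance looks something like this (hypothetical VG/LV names):

    # drop the integrity layer
    lvconvert --raidintegrity n vg0/important
    # ...do the otherwise-blocked operation (pvmove, lvreduce, snapshot, ...)...
    # then add it back; the integrity metadata gets reinitialized
    lvconvert --raidintegrity y vg0/important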
> Wow, didn't know about that. Though it seems to require additional metadata, not just the parity data for the RAID level.
You don't have parity data on RAID1, unless you've got a fancy enterprise setup with irregularly sized blocks. Most hobbyists don't, and it's probably not even possible on most NVMes.
I think this is most helpful on RAID1, where you can have cases where there's a discrepancy but both disks can read the sector, so you have no idea which one is right.
So, this kind of integrity is not available on normal LVs?
It doesn't work on non-RAID LVs, no.
I think that's right. Even with RAID 4/5 I think the parity is used to reconstruct missing data if a device fails, not to verify data integrity or detect bitrot.
Checksumming doesn't prevent 'bit rot', it can only detect it, and if you're detecting it with modern hardware, it's likely because you're not using ECC RAM somewhere.
Every modern hard drive, and most if not all NVMe/SATA SSDs, has built-in error correction as part of encoding/decoding your data to the media/flash. Combined with link-layer data integrity protection, etc., the most likely place for data corruption is low-end Intel machines without ECC RAM, or really old ARM designs that don't have cache/interconnect protections and don't have ECC RAM either.
So the drive usually has far better error correction and detection than you're getting with these software algorithms, and running mdadm scrubbing is more than sufficient to detect basically 100% of 'bitrot'.
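For reference, an MD scrub is just this (assuming the array is md0):

    # read all copies/parity and count discrepancies
    echo check > /sys/block/md0/md/sync_action
    cat /sys/block/md0/md/mismatch_cnt
    # 'repair' resolves mismatches; on RAID1 it just picks one copy as authoritative
    echo repair > /sys/block/md0/md/sync_action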
There is a ton of false information all over the internet about RAID1 vs RAID5/6, and this article is in the same boat WRT why one would prefer RAID1 over one of those. (Clue: it has absolutely nothing to do with data integrity.)
Pretty much everyone running a home NAS is going to be better off with just RAID5 + regular scrubbing, vs all this other nonsense. For people who want something a bit better: RAID6 + scrubbing, and then a DIX/DIF-enabled path. I think you're more likely to hit a critical ZFS bug than to have a problem with a well-maintained RAID5 setup running on reliable hardware. Think ECC + working AER/MCE/etc. RAS reporting. Never mind that pretty much none of these applications close the loop on their own data manipulation, and that fancy new database you're running overwriting the wrong record won't be saved by anything other than a good snapshot/backup mechanism.
LVM isn't underrated, it's obsolete. ZFS in particular has none of the many, maaany problems you run into trying to make LVM work reliably in practice (no self-corrupting snapshots, for example).
ZFS is great. Ensuring the next kernel update is compatible with the module is not.
Honest question: what’s wrong with DKMS?
> Honest question: what’s wrong with DKMS?
Nothing is wrong with DKMS, but that's not enough. The problem is that it will often fail to build ZFS for updated kernels because the kernel changed something it uses. Sometimes that can happen within a release.
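If you're stuck with DKMS, one mitigation is to check that the module actually built for the new kernel before rebooting into it (the kernel version string below is just an example):

    # did DKMS build a zfs module for every installed kernel?
    dkms status zfs
    # confirm the module exists for the kernel you're about to boot
    modinfo -k 6.8.0-40-generic zfs | head -n 3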
I don't mind it for a lot of things, but I'm not sure how great an experience it is for something that needs to be installed as part of the initial setup of the system rather than something you can add afterwards. It works fine for stuff like a wifi or graphics driver if you're doing the initial installation without either of those, but for stuff like the filesystem that kind of has to be done up-front, it can lead to a weird bootstrapping problem: you need a way of getting the package before you've set up the system, but it's generally not available through the normal means that the installation media provides. You can go out of your way to obtain it elsewhere, but that's a pretty noticeable downgrade in the experience of setting up a Linux system from what most people are used to, so I think it's understandable that the completely artificial barrier to having it available through conventional means would be frustrating for people.
Some distributions release ZFS kernel modules pinned for their kernel builds, so there's no bootstrapping problem in the first place.
And the biggest issue really is that none of that would be necessary in the first place, if only btrfs or bcachefs were actually reliable.
Nothing, other than FUD.
LTS kernels solve that issue, and many, many others.
That just delays the problem until the next version is LTS, and isn't a guarantee. I've had module build failures within releases before, and couldn't access my array. It's a very bad situation to be in, especially if your repo/cache has already removed the previous kernel and module.
When I care, I run ZFS.
When I run ZFS, I run BSD (or illumos).
ZFS needs to be merged into the Linux kernel already. I can't believe the licensing nonsense is still preventing that.
I never want to be on the wrong side of Oracle's lawyers. Larry Ellison is the closest thing we have to a real Bond villain.
The copyright holders have no interest in changing the licenses.
But there could be, with enough funding, a ZFS team which tracks Linux development more closely.
My personal experience is the opposite, but to be fair I gave up on ZFS years ago, it may be (probably is) better now.
I appreciate these points of view, but it is of course not optimal when you see people simply making polar opposite statements with neither making an attempt at explanation.
LVM was a decent idea at the time, which was the early 1980s. But that locked in many architectural decisions that make it incredibly painful to use in practice, and its devs follow the "it was hard to write, so it should be hard to use" mentality, which makes it even more painful.
Self-corrupting snapshots, for example, are considered a skill issue on the user's part, while ZFS simply doesn't allow users to shoot themselves in the foot like that. (And you rarely even need block-level snapshots in the first place!)
Encryption, data integrity, redundancy, are all out of scope for LVM, so now you need a RAID layer and a verification layer and an encryption layer, and if you stack them wrong, everything breaks. Skill issue! And not LVM's problem in the first place.
ZFS doesn't make you jump through hoops to manually calculate sector offsets to figure out how you can make an SSD cache that doesn't blow up in your face either.
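For example, attaching an SSD read cache to an existing pool is a one-liner (pool and device names invented):

    # add an L2ARC cache device to the pool
    zpool add tank cache /dev/nvme0n1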
So, no, LVM isn't underrated. It's a painful relic of a bygone age and, if anything, overrated.
I see you were downvoted. Interestingly, on HN downvotes are considered as appropriate as explanations, from what I've read. Sure, you should be allowed to downvote, and can't be required to explain, but in a thread about lack of explanation, downvoting without explanation is hard to understand.
> Because at some point I decided LVM is just like flexible partitions and didn't pay much attention to the huge amounts of other stuff in the manpages.
I remember when I was first learning to use Linux, I was mystified by the convention of LVM on top of md. I was reading the LVM man pages and it was clear I could just do all that stuff with LVM without having to involve another abstraction layer.
But I had a hardware raid controller and just used that instead.
It depends on when you started, LVM changed a fair amount over time and the earlier versions had less functionality.
So a lot of people would do like I did -- make an initial decision and then stick with it for two decades, way past it making any sense. That kind of thing also happens with documentation.
It's a bit of a pain in the neck for me to rebuild my NAS using LVM or ZFS (to handle RAID in place of mdadm), but if I was starting over, I'd use either of them to avoid bit rot.
As it is though, I'm happy with RAID 10 on SATA HDDs and RAID 1 on NVMe SSDs, using bcache to create a single volume from both, and then ext4 on top. With sufficient SMART monitoring and backups, it'll do.
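For anyone curious, that stack is roughly the following (device names are examples, not my exact layout):

    # two MD arrays: big/slow HDDs and small/fast NVMe
    mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/sd[a-d]
    mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1
    # md0 becomes the bcache backing device, md1 the cache; -B and -C together attach them
    make-bcache -B /dev/md0 -C /dev/md1
    # ext4 goes on the resulting combined device
    mkfs.ext4 /dev/bcache0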