LVM is a very cool thing that I feel is a bit underappreciated.
I've always built my systems like this:
two hard disks => RAID1 => LVM VG1. nvme => LVM VG2
Because at some point I decided LVM is just like flexible partitions and didn't pay much attention to the huge amounts of other stuff in the manpages.
Turns out that's a pretty crappy way of doing things.
LVM can do RAID, can deal with moving things across devices, and has many other fancy features. So instead of the above, stick it all into the same VG without RAID on top.
You can RAID individual LVs: some RAID1, some RAID5, some RAID0, some nothing, and you can pick which disks each one goes on. You can add integrity checking, which is not something MD RAID does. You can do caching. You can do writable snapshots. You can do thin provisioning. It's extremely flexible tech.
The only thing I'd recommend staying away from is thin provisioning unless it's necessary. The performance impact is significant, virt-manager for some reason doesn't see thin LVs as legitimate disks (that can be worked around easily, though), and thin provisioning seems to be able to develop faults of its own. I've had one old setup where one block was somehow not readable, and it seems to be a thin provisioning thing -- it doesn't say the underlying disk is bad; it's something more along the lines of the metadata being broken.
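To make that concrete, here's a rough sketch of what a setup like that can look like (the VG name, LV names, sizes, and device paths are invented for illustration):

    # one VG spanning all the disks, no MD layer underneath
    vgcreate vg0 /dev/sda /dev/sdb /dev/nvme0n1

    # mirrored LV with dm-integrity checksumming, pinned to the two HDDs
    lvcreate --type raid1 -m 1 --raidintegrity y -L 100G -n important vg0 /dev/sda /dev/sdb

    # parity RAID LV (raid5 with 2 data stripes needs 3 PVs)
    lvcreate --type raid5 -i 2 -L 200G -n bulk vg0

    # plain linear LV on the NVMe only
    lvcreate -L 50G -n scratch vg0 /dev/nvme0n1

    # carve out an NVMe-backed LV and use it as a cache for the bulk LV
    lvcreate -L 20G -n fastcache vg0 /dev/nvme0n1
    lvconvert --type cache --cachevol fastcache vg0/bulk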
To upgrade you can do things like moving LVs from one disk to another to free it up and remove it. Disks of different sizes are much less of a problem than with standard RAID, too.
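As a sketch (again with made-up names), swapping a disk out from under a live VG is basically:

    # move all extents off the old disk, online
    pvmove /dev/sdb
    # remove it from the VG and wipe the PV label
    vgreduce vg0 /dev/sdb
    pvremove /dev/sdb
    # later, bring in the replacement (it doesn't need to be the same size)
    pvcreate /dev/sdc
    vgextend vg0 /dev/sdc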
~20 years ago I was helping someone with the storage server behind their mail server; it had 8x 10K disks in a RAID-1. It was always struggling and often couldn't keep up. Intel had just come out with their SSDs, and I had a friend with access to one that was larger than the array. I shipped this little laptop drive off to get installed in the storage server.
Once the SSD was in the server, I did a "pvmove" to move the data from the spinning array to the SSD, while the system was up and running. Over the next several hours the load on the server dropped. It was comical that these 8 hard drives could be replaced by something smaller than my wallet.
But pvmove was the other star of this show.
If those drives were short-stroked, they probably could have kept up with that SSD, though at reduced capacity. The SSD would probably have a lower power bill over its life, though. I did some calculations for an array of short-stroked 15k SAS disks to replace a consumer 4GB SSD for a write-intensive app that chews through SSDs, and its performance would be within spitting distance of the SSD. Ended up not doing it because parts availability for 15k SAS drives likely won't last much longer.
Except that we would have needed a lot more than the 8 drives then, to keep the same capacity. I think it was 1TB of storage around 1999. For mail storage it unfortunately needed both capacity and seek latency.
More drives would be acceptable if it meant not having to replace all the storage every few months.
AFAIK LVM RAID is mdraid under the covers, I guess the advantage of using LVM to configure and manage RAID is doing it all with one set of tools, assuming that the LVM utilities offer everything you need. I've always used mdraid for that, and (sometimes) LVM on top of the mdraid devices.
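You can see that shared plumbing if you poke at a running system (vg0 here is just an example, and output details vary by version):

    # LVM reports the RAID segment type per LV...
    lvs -o lv_name,segtype,devices vg0
    # ...and device-mapper shows a "raid" target, which is the MD RAID code driven via dm-raid
    dmsetup table | grep ' raid '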
Oh so it doesn't avoid bit rot the way ZFS does (by checksumming)? Mdadm, not being a filesystem, doesn't concern itself with file integrity. I don't suppose LVM would either :(
Unlike mdadm, LVM does have checksumming.
https://docs.redhat.com/fr/documentation/red_hat_enterprise_...
Wow, didn't know about that. Though it seems to require additional metadata, not just the parity data for the RAID level. From the Ubuntu 22 lvmraid(7) man page:
When creating a RAID LV with integrity, or adding integrity, space is required for integrity metadata. Every 500MB of LV data requires an additional 4MB to be allocated for integrity metadata, for each RAID image.
Also:
The following are not yet permitted on RAID LVs with integrity: lvreduce, pvmove, snapshots, splitmirror, raid syncaction commands, raid rebuild.
The typical workaround for these seems to be to remove the integrity, make the change, then re-add/reinitialize the integrity metadata.
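In practice that dance looks something like this (hypothetical VG/LV names):

    # drop the integrity layer
    lvconvert --raidintegrity n vg0/important
    # ...do the otherwise-blocked operation (pvmove, lvreduce, snapshot, ...)...
    # then add it back; the integrity metadata gets reinitialized
    lvconvert --raidintegrity y vg0/important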
> Wow, didn't know about that. Though it seems to require additional metadata, not just the parity data for the RAID level.
You don't have parity data on RAID1, unless you've got a fancy enterprise setup with irregularly sized blocks. Most hobbyists don't, and it's probably not even possible on most NVMes.
I think this is most helpful on RAID1, where you can have cases where there's a discrepancy but both disks can read the sector, so you have no idea which one is right.
So, this kind of integrity is not available on normal LVs?
It doesn't work on non-RAID LVs, no.
I think that's right. Even with RAID 4/5 I think the parity is used to reconstruct missing data if a device fails, not to verify data integrity or detect bitrot.
Checksumming doesn't prevent 'bit rot', it can only detect it, and if you're detecting it with modern hardware, it's likely because you're not using ECC RAM somewhere.
Every modern hard drive, and most if not all NVMe/SATA SSDs, has built-in error correction as part of encoding/decoding your data to the media/flash. Combined with link-layer data integrity protection, etc., the most likely place for data corruption is low-end Intel machines without ECC RAM, or really old ARM designs that don't have cache/interconnect protections and don't have ECC RAM either.
So the drive usually has far better error correction and detection than you're getting with these software algorithms, and running mdadm scrubbing is more than sufficient to detect basically 100% of 'bitrot'.
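For reference, an MD scrub is just this (assuming the array is md0):

    # read all copies/parity and count discrepancies
    echo check > /sys/block/md0/md/sync_action
    cat /sys/block/md0/md/mismatch_cnt
    # 'repair' resolves mismatches; on RAID1 it just picks one copy as authoritative
    echo repair > /sys/block/md0/md/sync_action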
There is a ton of false information all over the internet about RAID1 vs RAID5/6, and this article is in the same boat WRT why one would prefer RAID1 over one of those. (Clue: it has absolutely nothing to do with data integrity.)
Pretty much everyone running a home NAS is going to be better off with just RAID5 + regular scrubbing, vs all this other nonsense. For people who want something a bit better: RAID6 + scrubbing, and then a DIX/DIF-enabled path. I think you're more likely to hit a critical ZFS bug than to have a problem with a well-maintained RAID5 setup running on reliable hardware. Think ECC + working AER/MCE/etc. RAS reporting. Never mind that pretty much none of these applications close the loop on their own data manipulation, and that fancy new database you're running overwriting the wrong record won't be saved by anything other than a good snapshot/backup mechanism.
LVM isn't underrated, it's obsolete. ZFS in particular has none of the many, maaany problems you run into trying to make LVM work reliably in practice (no self-corrupting snapshots, for example).
ZFS is great. Ensuring the next kernel update is compatible with the module is not.
Honest question: what’s wrong with DKMS?
> Honest question: what’s wrong with DKMS?
Nothing is wrong with DKMS, but that's not enough. The problem is that it will often fail to build ZFS for updated kernels because the kernel changed something it uses. Sometimes that can happen within a release.
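If you're stuck with DKMS, one mitigation is to check that the module actually built for the new kernel before rebooting into it (the kernel version string below is just an example):

    # did DKMS build a zfs module for every installed kernel?
    dkms status zfs
    # confirm the module exists for the kernel you're about to boot
    modinfo -k 6.8.0-40-generic zfs | head -n 3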
I don't mind it for a lot of things, but I'm not sure how great an experience it is for something that needs to be installed as part of the initial setup of the system rather than something you can add afterwards. It works fine for stuff like a wifi or graphics driver if you're doing the initial installation without either of those, but for stuff like the filesystem that kind of has to be done up-front, it can lead to a weird bootstrapping problem: you need a way of getting the package before you've set up the system, but it's generally not available through the normal means that the installation media provides. You can go out of your way to obtain it elsewhere, but that's a pretty noticeable downgrade in the experience of setting up a Linux system from what most people are used to, so I think it's understandable that the completely artificial barrier to having it available through conventional means would be frustrating for people.
Some distributions release ZFS kernel modules pinned for their kernel builds, so there's no bootstrapping problem in the first place.
And the biggest issue really is that none of that would be necessary in the first place, if only btrfs or bcachefs were actually reliable.
Nothing, other than FUD.
LTS kernels solve that issue, and many, many others.
That just delays the problem until the next version is LTS, and isn't a guarantee. I've had module build failures within releases before, and couldn't access my array. It's a very bad situation to be in, especially if your repo/cache has already removed the previous kernel and module.
When I care, I run ZFS.
When I run ZFS, I run BSD (or illumos).
ZFS needs to be merged into the Linux kernel already. I can't believe the licensing nonsense is still preventing that.
I never want to be on the wrong side of Oracle's lawyers. Larry Ellison is the closest thing we have to a real Bond villain.
The copyright holders have no interest in changing the licenses.
But there could be, with enough funding, a ZFS team which tracks Linux development more closely.
My personal experience is the opposite, but to be fair I gave up on ZFS years ago, it may be (probably is) better now.
I appreciate these points of view, but it is of course not optimal when you see people simply making polar opposite statements with neither making an attempt at explanation.
LVM was a decent idea at the time, which was the early 1980s. But that locked in many architectural decisions that make it incredibly painful to use in practice, and its devs follow the "it was hard to write, so it should be hard to use" mentality, which makes it even more painful.
Self-corrupting snapshots, for example, are considered a skill issue on the user's part, while ZFS simply doesn't allow users to shoot themselves in the foot like that. (And you rarely even need block-level snapshots in the first place!)
Encryption, data integrity, redundancy, are all out of scope for LVM, so now you need a RAID layer and a verification layer and an encryption layer, and if you stack them wrong, everything breaks. Skill issue! And not LVM's problem in the first place.
ZFS doesn't make you jump through hoops to manually calculate sector offsets to figure out how you can make an SSD cache that doesn't blow up in your face either.
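For example, attaching an SSD read cache to an existing pool is a one-liner (pool and device names invented):

    # add an L2ARC cache device to the pool
    zpool add tank cache /dev/nvme0n1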
So, no, LVM isn't underrated. It's a painful relic of a bygone age and, if anything, overrated.
I see you were downvoted. Interestingly, on HN downvotes are considered as appropriate as explanations, from what I've read. Sure, you should be allowed to downvote, and can't be required to explain, but in a thread about lack of explanation, downvoting without explanation is hard to understand.
> Because at some point I decided LVM is just like flexible partitions and didn't pay much attention to the huge amounts of other stuff in the manpages.
I remember when I was first learning to use Linux, I was mystified by the convention of LVM on top of md. I was reading the LVM man pages and it was clear I could just do all that stuff with LVM without having to involve another abstraction layer.
But I had a hardware raid controller and just used that instead.
It depends on when you started, LVM changed a fair amount over time and the earlier versions had less functionality.
So a lot of people would do like I did -- make an initial decision and then stick with it for two decades, way past it making any sense. That kind of thing also happens with documentation.
It's a bit of a pain in the neck for me to rebuild my NAS using LVM or ZFS (to handle RAID in place of mdadm), but if I was starting over, I'd use either of them to avoid bit rot.
As it is though, I'm happy with RAID 10 on SATA HDDs and RAID 1 on NVMe SSDs, using bcache to create a single volume from both, and then ext4 on top. With sufficient SMART monitoring and backups, it'll do.
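For anyone curious, that stack is roughly the following (device names are examples, not my exact layout):

    # two MD arrays: big/slow HDDs and small/fast NVMe
    mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/sd[a-d]
    mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1
    # md0 becomes the bcache backing device, md1 the cache; -B and -C together attach them
    make-bcache -B /dev/md0 -C /dev/md1
    # ext4 goes on the resulting combined device
    mkfs.ext4 /dev/bcache0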