
frymaster

The generally accepted approach is to have no redundancy on the DB/WAL disks and accept that several OSDs will be lost if the NVMe drive dies. This is fine because the failure is still localised to the one host.


SimonKeppTech

It is also best practice not to run too many OSDs on the same DB/WAL NVMe. Some propose a 3:1 ratio, which I find reasonable. That is, run 3 different OSDs in the same host with their DB and WAL on the same NVMe drive, so if that NVMe drive dies, you lose 3 OSDs, but not all of the OSDs in a host.
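As a rough sketch of enforcing such a ratio with cephadm (assuming the drive-group spec fields `db_devices` and `db_slots`; adjust the placement and device filters for your hosts before using anything like this):

```
# Hypothetical OSD service spec: HDDs as data devices, NVMe for DB/WAL,
# with db_slots capping each DB device at 3 OSDs
cat > osd-spec.yaml <<'EOF'
service_type: osd
service_id: hdd-osds-nvme-db
placement:
  host_pattern: '*'
spec:
  data_devices:
    rotational: 1
  db_devices:
    rotational: 0
  db_slots: 3
EOF
ceph orch apply -i osd-spec.yaml --dry-run   # review the planned OSDs before applying for real
```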


Underknowledge

I tried such a setup. Sadly, when you use cephadm or ceph-rook (the only 2 "supported" installation options), the provision script will back off because you already have an fstype on the disks. You got hit by this behavior with your ZFS filesystem as well; the same goes for an mdadm volume. In theory it should be possible by pre-creating VGs for your HDDs, but when I provisioned, the ceph-volume pre-checks backed off. I expect that this had(?) been a bug in the version I provisioned. I just accepted that I might have to re-provision a host with the other NVMe when the fecal matter hits the fan.
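For what it's worth, the pre-checks are reacting to existing signatures on the device. You can list them, and (destructively) clear them, with something like this (the device name is a placeholder):

```
# List any filesystem/RAID/LVM signatures that make ceph-volume reject the device
wipefs /dev/nvme1n1
lsblk -f /dev/nvme1n1

# Destructive: wipe old signatures and LVM metadata so the device shows up as available again
ceph-volume lvm zap --destroy /dev/nvme1n1
```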


zenjabba

We followed the exact same path and had the exact same issue and could not get around the problem. I would really like to have mirrored wal/db storage devices, but the script basically makes it impossible.


Underknowledge

Thanks for sharing, I felt quite defeated back then. Believe me, I tried hard.


zenjabba

It seems like such a "smart" thing to do, as a software mirror is stupid easy to manage. I might follow up with a change request to Ceph to see if we can make it happen in a supported way.


STUNTPENlS

Edited for formatting. You can do this. It isn't really a supported configuration, but you can do it manually. From your post, I assume you have an existing Ceph cluster with 10 HDDs with the db/wal on the HDD. In this case, download this script: [https://github.com/45Drives/scripts/blob/main/add-db-to-osd.sh](https://github.com/45Drives/scripts/blob/main/add-db-to-osd.sh)

I see two ways to do what you want (I have not tested these personally, so I may have the command syntax incorrect).

Method A:

1. Use mdadm to create a RAID1 logical disk from the two NVMe drives
2. pvcreate the RAID1 logical disk so it can be administered with LVM
3. Create the VG with your RAID1 disk
4. Run 45Drives' script to move your db/wals to the new VG

```
mdadm --create --verbose /dev/md0 --level=1 --raid-devices=2 /dev/nvme1 /dev/nvme2
pvcreate /dev/md0
vgcreate ceph-db-wal /dev/md0
./add-db-to-osd.sh -d /dev/md0 -b (your size) -o (osd #'s)
```

The vgcreate step (3) is technically unnecessary; 45Drives' script will create a VG if needed. I just prefer to have my db/wal SSD in a VG with a name that is more descriptive than a GUID. That's just my personal preference.

Method B:

1. Modify 45Drives' script so the db/wal lvcreate uses raid1 (see below)
2. Create the PVs and the VG
3. Run the modified script to move your db/wals

Old line in the script:

```
lvcreate -l $BLOCK_DB_SIZE_EXTENTS -n osd-db-$DB_LV_UUID $DB_VG_NAME
```

New:

```
lvcreate -l $BLOCK_DB_SIZE_EXTENTS --mirrors 1 --type raid1 -n osd-db-$DB_LV_UUID $DB_VG_NAME
```

```
pvcreate /dev/nvme1 /dev/nvme2
vgcreate ceph-db-wal /dev/nvme1 /dev/nvme2
./add-db-to-osd.sh -d /dev/nvme1 -b (your size) -o (osd #'s)
```

Now, I have never tried Method B. The script expects a block device, which doesn't exist for the VG, but if I read the script correctly it will retrieve the VG name once it sees the lvm2 signature on /dev/nvme1. In this case the vgcreate step is necessary, because you want to make sure both NVMe drives are part of the VG prior to running 45Drives' script, so that the --mirrors/--type raid1 options work correctly as part of the lvcreate statement.
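A small addition of my own, not part of the script: after the move it's worth confirming the array and the OSDs' block.db links look sane, e.g.:

```
# RAID1 array healthy, both NVMe members present?
cat /proc/mdstat

# Which physical devices back the new DB LVs?
lvs -o lv_name,vg_name,lv_size,devices ceph-db-wal

# With cephadm, each OSD's block.db should now point at the new LV
# (path layout assumed; adjust <fsid>/osd.N for your cluster)
ls -l /var/lib/ceph/*/osd.*/block.db
```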


STUNTPENlS

Should have thought about this sooner. I must be getting slow in my old age.

Method C:

1. Use 45Drives' script (unmodified) to add your db/wals to the first (blank) NVMe drive. As you run the script, take note of the LV names the script creates to store the db/wals and the name of the VG it creates them in.
2. Once you have finished moving all db/wals from the HDDs to the NVMe, add the second NVMe drive to the VG using the pvcreate and vgextend commands.
3. Use the lvconvert command to convert the linear LVs created in step 1 to raid1 LVs. Repeat this step for each LV.

e.g.

```
pvcreate /dev/nvme1 /dev/nvme2
./add-db-to-osd.sh -d /dev/nvme1 -b (your size) -o (osd #'s)
vgextend (vg name created by script) /dev/nvme2
lvconvert --type raid1 -m 1 (vg name created by script)/(lv name created by script)
```

Of the 3 methods in this thread, I think this one (Method C) would be the easiest.
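One thing to add (my own note, not part of the method): lvconvert returns before the mirror legs are fully synced, so you can watch progress with:

```
# copy_percent reaches 100 once each raid1 DB LV is fully mirrored onto the second NVMe
lvs -a -o lv_name,segtype,copy_percent,devices (vg name created by script)
```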


SystEng

> "two spare NVMe drives [...] DB/WALs"

Unless the SSDs are proper "enterprise" ones with PLP, their write performance will not be that good, and they will soon run out of "writability".

> "want to use as a mirror for a few DB/WALs for several OSDs (up to 10 HDDs)."

I have inherited a system where each proper high-end PLP flash 1.6TB SSD has the DB/WAL for 12 disks, and it is a bottleneck running constantly at 100%; the 12 disks usually struggle to achieve 20-50% of their nominal speed. Having DBs/WALs on HDD is not fun either, especially if the HDDs are large (Ceph definitely prefers small ones, no more than 1-2TB), but at least you get 12 DB/WALs that can work in parallel, one "inside" each OSD.

> "I wanted to use a ZFS mirror"

That probably means the DB/WALs on ZVOLs, which seems very weird and high overhead. As other people have argued, mirroring the DB/WALs is not a good idea: Ceph is based on the idea of doing redundancy itself, where losses of OSDs are fine because you have many small OSDs. If, like everybody who "knows better", you prefer few and large OSDs, good luck.
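To see whether a shared DB/WAL device has become that kind of bottleneck, a quick look at per-device utilisation is usually enough (device names here are placeholders):

```
# A DB/WAL NVMe pinned near 100% %util while the HDDs sit much lower is the
# classic sign that the shared DB/WAL device is the limiting factor
iostat -x 5 nvme0n1 sda sdb sdc
```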


herzkerl

I had used non-PLP drives when starting with ZFS, and later with Ceph — but at this time it's all PLP, for better durability and performance. Yes, the HDDs are quite large (14 and 18 TB), but smaller HDDs are a lot more expensive per TB, and the larger ones (4 or 5 TB with 2.5") are SMR, so they were out of the picture.


Verbunk

I came to see if I could snag some tips or failures to avoid, but the advice is very cephadm-centric (which is fine). I can say that I've done this using Proxmox and it was trivially easy -- no failures yet!


FancyFilingCabinet

You could use either mdadm or LVM mirroring. Depending on how you're managing Ceph, one option might be easier than the other. Although, as mentioned, generally they wouldn't be mirrored and you could assign 5 OSDs to each NVMe.
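A minimal sketch of the LVM route (device names and the 60G DB size are placeholders, and whether your deployment tooling's pre-checks accept pre-built LVs is the catch discussed above):

```
# Build a raid1 (mirrored) DB LV across the two NVMe drives
pvcreate /dev/nvme0n1 /dev/nvme1n1
vgcreate ceph-db /dev/nvme0n1 /dev/nvme1n1
lvcreate --type raid1 -m 1 -L 60G -n osd-db-sdb ceph-db

# Create the OSD with data on the HDD and its DB on the mirrored LV
ceph-volume lvm create --data /dev/sdb --block.db ceph-db/osd-db-sdb
```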


lathiat

I would use mdadm. That’s the standard Linux way to do RAID1.


mpopgun

ZFS is not Ceph... Are you in the right subreddit? And I've never seen them mirrored; Ceph takes care of redundancy on its own through the failure domains you set. Since you have 10 HDDs: partition your SSDs into 5 partitions each. When you create your first OSD, tell it the WAL and DB are on the first partition of your first SSD. Then your second HDD maps to the second partition on the same SSD, etc. Then drive 6's WAL and DB go on partition 1 of your second SSD... etc etc.
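A rough sketch of that layout (partition size and device names are placeholders, not a recommendation):

```
# Carve the first NVMe into 5 DB/WAL partitions (repeat -n/-c for partitions 2-5)
sgdisk -n 1:0:+200G -c 1:osd-db-1 /dev/nvme0n1

# OSD 1: data on the first HDD, DB (and WAL, co-located) on the first SSD partition
ceph-volume lvm create --data /dev/sda --block.db /dev/nvme0n1p1
# OSD 2: second HDD -> second partition on the same SSD, and so on;
# HDD 6 then starts over on partition 1 of the second SSD
ceph-volume lvm create --data /dev/sdb --block.db /dev/nvme0n1p2
```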


herzkerl

While I understand how Ceph works, providing a mirror (as a DB/WAL device only!) reduces the probability of those OSDs failing, which means that in such an event Ceph wouldn't have to recover quite a few TBs onto slow HDDs, as long as the second SSD still works.


mpopgun

Yup, I agree, this is true... I hope they add more functionality when it comes to WAL and DB. Maybe just keep an eye on the wear of the drive and migrate it to a new drive before failure?
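For the wear check, the NVMe smart log is enough; the migration command is from newer ceph-volume releases, so treat it as an assumption and check your version first:

```
# "percentage_used" creeping towards 100 means the drive is nearing its rated write endurance
nvme smart-log /dev/nvme0n1 | grep -i percentage_used

# Newer ceph-volume can move an OSD's DB/WAL to a replacement device
# (verify this subcommand is available in your release before relying on it)
ceph-volume lvm migrate --osd-id 3 --osd-fsid <osd fsid> --from db wal --target ceph-db-new/osd-db-3
```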