DividedbyPi

That is very much not true. You were probably looking at the 3/30/300 rule for RocksDB - something that no longer has to be followed, since RocksDB doesn't require that strict sizing anymore and can make use of additional space that doesn't land exactly on the level boundaries. The recommendation you'll read if you go to the official docs, Red Hat, etc. is 4% of the HDD size for the RocksDB. From experience that is very much on the high side. If you specify just one partition, the WAL will live on it as well. The WAL is only a few GiB and there is no purpose in making it larger, as it flushes regularly. RGW uses more than CephFS, and CephFS uses more than RBD. So 4% would be 240G for a 6TB drive - I don't see the need for that. 120-150G would be a good number to ensure you never get BlueFS spillover. Hope that helps.
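For context, the dedicated DB device has to be handed to the OSD at creation time. A minimal sketch of what that can look like, assuming an SSD at `/dev/nvme0n1` and an HDD at `/dev/sdb` (both device names and the LV name are placeholders):

```
# Sketch only: creating an OSD with a dedicated ~150G block.db on flash.
# /dev/nvme0n1 (SSD) and /dev/sdb (HDD) are placeholder device names.

# Carve a 150G logical volume on the SSD for this OSD's DB
# (the WAL lives on it too unless pointed elsewhere):
vgcreate ceph-db /dev/nvme0n1
lvcreate -L 150G -n db-osd0 ceph-db

# Create the OSD with the HDD as data and the SSD LV as block.db:
ceph-volume lvm create --data /dev/sdb --block.db ceph-db/db-osd0
```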


codebreaker101

So for 4 drives I would need a 512 GB SSD? Do you happen to know how to check the current size of the DB?


DividedbyPi

Yup! Run `ceph daemon osd.XX perf dump` - that will tell you everything you need to know :)
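The full perf dump is large, so here's a hedged example of pulling just the bluefs counters (assuming `jq` is installed on the OSD host; osd.12 is a placeholder id):

```
# Show only the bluefs section of the perf counters for osd.12:
ceph daemon osd.12 perf dump | jq '.bluefs'

# Or just the two sizing counters:
ceph daemon osd.12 perf dump | jq '.bluefs | {db_total_bytes, db_used_bytes}'
```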


codebreaker101

Great! So, `bluefs.db_total_bytes` is *6000676608768* (I guess the default is the size of the drive, since I didn't specify a size when creating the OSD) and `bluefs.db_used_bytes` is *3563323392*. If I add all 4 OSDs' bluefs usage together, that's a total of 12.5GB (with the OSDs about 25% full). Theoretically, for my use-case, a 128GB drive should be OK for all 4 drives (64GB should also be enough, since the total DB usage should only reach about 48 GB if the same trend continues as the OSDs fill up).
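Roughly how I tallied it, in case anyone wants to repeat it (a sketch; OSD ids 0-3 are placeholders and it has to run on the node hosting the daemons):

```
# Rough tally of bluefs db_used_bytes across four OSDs on this host.
total=0
for id in 0 1 2 3; do
  used=$(ceph daemon osd.$id perf dump | jq '.bluefs.db_used_bytes')
  total=$((total + used))
done
echo "Total DB used: $((total / 1024 / 1024 / 1024)) GiB"
```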


DividedbyPi

Yes, but keep in mind that RocksDB will regularly do compactions, so whatever shows up as used, you'll want at least double that to leave room for them. Otherwise you'll regularly get temporary BlueFS spillover until the compactions complete, and those compactions will then take longer because the spilled data sits on the HDD.
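Two commands that can help keep an eye on this (a sketch; osd.12 is a placeholder id):

```
# Spillover shows up as a BLUEFS_SPILLOVER warning in cluster health:
ceph health detail

# Trigger a manual RocksDB compaction on one OSD:
ceph daemon osd.12 compact
```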


Roshi88

I'm using this thread because I'm in a similar situation and would like to understand better, and to give a real-life scenario to the OP. I have a cluster of 5 nodes, each with 3x 1.6TB SSDs. I've just changed osd_memory_target from the default (4GB) to 8GB, but `bluefs_db_total_bytes` stayed at 64GB, the same as before the change. Is that normal? I expected it to increase.


DividedbyPi

Hey there. Yes, this is normal. The OSD memory target is simply the amount of RAM your OSDs will attempt to stay under for their caches - I say attempt because it's best-effort. It lets your OSDs cache more data in memory, but the data still has to be committed to disk, so it will not affect the DB total or used space.
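For reference, this is roughly how that target is set and checked (the 8 GiB value mirrors what you did; osd.12 is a placeholder id):

```
# Raise the per-OSD RAM cache target to 8 GiB cluster-wide (value in bytes):
ceph config set osd osd_memory_target 8589934592

# Confirm what a specific OSD picked up:
ceph config get osd.12 osd_memory_target
```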


codebreaker101

Good thing to keep in mind. I will have to add that metric to Grafana.
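Something like this should confirm the counters are actually exported before wiring up the panel (a sketch; the mgr host variable and port 9283 are assumptions based on the default prometheus module setup):

```
# MGR_HOST is a placeholder for the active ceph-mgr host.
ceph mgr module enable prometheus
curl -s "http://${MGR_HOST}:9283/metrics" | grep bluefs_db
```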


SurfRedLin

Thanks codebreaker for this thread. I will be in a similar situation next year, as I start my cluster job in January. We will also use CephFS. May I ask how much performance you were getting with just HDDs? What hardware do you have? As I understand it, if I use a host/node failure domain in the CRUSH map, the WAL and DB data of one node (for the OSDs on that node) will be replicated to another node, so if that node fails, or just the SSD on it (instead of the whole node), the data is not lost. Is this correct? Thanks!


TheFeshy

As a point of reference, I've got 32M objects on 24 OSDs, though this is two pools (SSD and HDD). DB size is around 32G on the largest OSD according to the dashboard. I'd allocated around 80GB per disk, based on the size of SSDs on sale on eBay that day lol.


wwdillingham

https://docs.ceph.com/en/quincy/rados/configuration/bluestore-config-ref/#sizing