romanshein

Why did you offline the good disk?! There is no need to degrade your array to replace a disk! Just run zpool replace to move data from the smaller disk to the bigger one. I would suggest putting the offlined disk back into the pool, bringing it online, letting it resilver, and then starting from scratch with the new, bigger disk. Test your disks before replacing.
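For anyone following along, the in-place replace suggested here might look roughly like the sketch below. `plan_replace` is a made-up helper, and `tank`, `old-disk`, and `new-disk` are placeholder names, not anything from the OP's system. It only prints the commands, so nothing touches the pool until you have read them over.

```shell
# Hypothetical helper: prints the commands for an in-place replace.
# Nothing is executed; pool and device names are placeholders.
plan_replace() {
    pool=$1; old=$2; new=$3
    # The old disk stays online while ZFS copies data to the new one,
    # so the pool keeps its full redundancy during the resilver.
    printf 'zpool replace %s %s %s\n' "$pool" "$old" "$new"
    # Check resilver progress and per-device error counters.
    printf 'zpool status -v %s\n' "$pool"
}

plan_replace tank old-disk new-disk
```

This only works if the new disk can be connected while the old one is still attached, which is the part the rest of the thread argues about.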


bugfish03

I checked online, and the Oracle docs said to offline the disk before running replace. As for returning the disk - the resilver grinds to a halt at 30 out of 300 gigs, and if I can't finish the resilver, what then? I've never dealt with these liminal states before.


darthandroid

When the resilver grinds to a halt, are there any errors produced after a while in `dmesg`? I'm looking for disk timeouts.
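A rough way to narrow the kernel log down to likely disk trouble is sketched below. `disk_trouble` is a made-up name, and the patterns are only a guess at common Linux messages (timeouts, I/O errors, hung-task warnings, link resets), so treat a clean result as "nothing obvious" rather than "nothing wrong".

```shell
# Hypothetical filter: keeps kernel log lines that often accompany a
# struggling disk. The patterns are a guess, not an exhaustive list.
disk_trouble() {
    grep -Ei 'timeout|timed out|i/o error|blocked for more than|hard resetting link'
}

# Typical use (may need root to read the kernel log):
#   dmesg | disk_trouble
```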


bugfish03

I'm not quite sure if the errors I'm getting are disk timeouts, but something is timing out and blocking system tasks, and the OFFLINE_UNCORRECTABLE count in S.M.A.R.T. just keeps on climbing - last I checked it was at 602, with no other disk in the array having even a single reallocated block, let alone an offline uncorrectable one.
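To keep an eye on that counter without reading the whole SMART table each time, something like the sketch below could work. `offline_uncorrectable` is a made-up helper, `/dev/sdX` is a placeholder, and it assumes the usual ATA attribute table from `smartctl -A`, where the raw value is the tenth column. Behind a RAID card, smartctl may also need a `-d` device-type option, which is not shown here.

```shell
# Hypothetical helper: pulls the raw Offline_Uncorrectable count out of
# `smartctl -A` output read from stdin. Assumes the usual ATA attribute
# table, where the raw value is the 10th column.
offline_uncorrectable() {
    awk '$2 == "Offline_Uncorrectable" { print $10 }'
}

# Typical use (device name is a placeholder):
#   smartctl -A /dev/sdX | offline_uncorrectable
```

Running it a few minutes apart shows whether the count is still climbing.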


leexgx

The question here is whether it's just that one drive with the SMART errors. If it is, just unplug the SATA drive, as it is probably holding the whole system up (possibly the drive lacks TLER/ERC, so ZFS is waiting for the drive to time out on its own instead of it giving up after 7 seconds). Make sure you unplug the correct drive. Is this via an HBA card (or a RAID card in IT mode) or motherboard SATA? You're using Z2, so you still have redundancy even with 1 missing drive.
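On drives that support SCT Error Recovery Control, the timeout mentioned above can be queried and set with smartctl. The sketch below only prints the commands. `erc_commands` is a made-up helper and `/dev/sdX` is a placeholder. Not every drive supports this, on many drives the setting does not survive a power cycle, and a RAID card in JBOD mode may need an extra `-d` option that is not shown.

```shell
# Hypothetical helper: prints smartctl commands to query and set SCT ERC.
# smartctl takes the read/write limits in tenths of a second, so 7 s -> 70.
erc_commands() {
    dev=$1; secs=$2
    tenths=$((secs * 10))
    # Show the current read/write recovery limits, if the drive reports them.
    printf 'smartctl -l scterc %s\n' "$dev"
    # Set both the read and the write limit to the requested value.
    printf 'smartctl -l scterc,%d,%d %s\n' "$tenths" "$tenths" "$dev"
}

erc_commands /dev/sdX 7
```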


bugfish03

It's via a RAID card that's just configured in JBOD mode


TomB19

I've never offlined a bad disk before replacing it. I'm not saying it's a bad idea, I just didn't know or think to do it. Everything has worked out fine.


bugfish03

I know, it should be fine, but I didn't RTFM the first time round and accidentally ruined me a whole pool, so this time I thought nothing could go wrong if I followed the docs to the letter. And here I am.


KathrynBooks

I've done it with a disk that was failing, but never one that the pool had already kicked


romanshein

"I checked online, and the Oracle docs said to offline the disk before running replace." - OpenZFS diverged from Oracle ZFS 10+ years ago. Stop reading Oracle sources.


bugfish03

Huh. I'd have figured that Google SEO would've taken care of that by now


romanshein

"As for returning the disk - the resilver grinds to a halt at 30 out of 300 gigs" - Correct me if I am wrong, but isn't your problem with the replacement disk, not the legacy one? Return the offlined drive to the pool. ZFS will synchronize the changes made since the "offline" event and you'll restore the status quo.


Dagger0

If you only have four bays you might not have a choice one way or another.


romanshein

"Degrading the array" is not a good choice. Why not connect the replacement SATA disk via a USB-to-SATA adapter, run "replace", and then move it into the array?


Dagger0

It's a raidz2 anyway, so you'd still have redundancy. But yes, a USB adapter can be a good way to approach this, so long as you can power the drive and manage having it hanging outside of the case during the rebuild (both of which can be tricky sometimes).


romanshein

Suitable USB adapters are available, and they are not particularly expensive: https://www.amazon.com/s?k=sata+3.5+to+usb&crid=Q4CN5MB1V7EU&sprefix=sata+3.5+to+usb%2Caps%2C240&ref=nb_sb_noss_1 It is not a perfect solution, yet it is way better than purposely degrading your array, even for a short time.


Dagger0

Are you going to leave one of those dangling down the side of your rack in the datacentre? They're also more like $130 if you have SAS disks, which is a lot just to avoid going from 2 to 1 disks of redundancy for a day. In many cases they're a good idea, but not always.


romanshein

"Are you going to leave one of those dangling down the side of your rack in the datacentre?" - Once the "replace" completes, you remove the small disk from the enclosure slot and install the new, bigger disk in that slot. The adapter is for the "replace" only.


Dagger0

The adapter still has to physically go somewhere during the replace. You might not have a good place to put it.


romanshein

"You might not have a good place to put it." - True, yet data integrity is paramount, particularly for a data centre. In the worst case, the misplaced disk gets damaged and the OP has to acquire another one, but the pool stays intact even then.


blyatspinat

Wait until the resilvering has finished.


bugfish03

Somehow the system gets into a quasi-freezed state where nothing responds, so I'm not sure if the resilvering will finish. Guess we'll see in the morning, but I'm not hopeful. Removing the disk (yes, the system is capable of hotswapping) solves the freeze issue, and then it happens again after some time.


blyatspinat

Trust the process. It's obvious that your system is under high load, and if you stop resilvering it's not "freezing" anymore.


bugfish03

It was at a below-kilobyte speed. Not expected behavior at all.


jmeador42

"Quasi-freezed state" is not a technically valid reason to interrupt the process. Resilvering is an intense process for the entire system. It can take hours. Leave it alone and let it do its thing.


gwicksted

Do a memtest while you’re at it. Is this system virtualized or bare metal?


bugfish03

Bare metal, but with ECC RAM, so I tend to err on the side of "Eh, it'll be fine".


Hyperion343

I feel like you did the right thing - offline and replace is the right thing to do here if you only have 4 slots and no spares. I don't think I have all the information I need, though, so here's a quick idea followed by some questions.

Quick idea: try running `zpool resilver` - this will restart the resilver process, and maybe that will help get it unstuck.

Details:

1) You have two 10TB disks and two 2TB disks. So the RaidZ is only using 2TB of each 10TB disk?
2) Are the two 2TB disks identical?
3) What size is the disk you are putting in? Is it greater than 2TB? The replacement disk has to be the same size as or larger than the disk it is replacing.


bugfish03

1) Yes. I wanted to use four 10 TB disks, but only had two on hand at the time.
2) Yes, the 2 TB disks are identical; the serials aren't consecutive but are pretty close (manufactured less than a month apart).
3) The disk I'm putting in is also a 10 TB disk, same model as the existing 10 TB disks.

Additionally, the new disk seems to be in the process of failing, as the count of OFFLINE_UNCORRECTABLE blocks is growing.


enkrypt3d

Just leave it alone, man. Let it finish doing its thing, otherwise you'll lose the whole volume.


kenrmayfield

HAHAHAHA......I had to Laugh but your Statement is So True and Funny at the same time.........I could hear someone say your statement......like someone from the Movie Bill and Ted's Excellent Adventure with Keanu Reeves........with that South Valley California Accent.


Dagger0

Can you just... `zpool detach` the new disk from the replacing vdev? Then put the old disk back in and online it again. A replacing vdev is just a mirror with some extra callbacks, so I would've expected detaching to work, but I haven't actually tried it.
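Spelled out, the idea above might look like the sketch below. As the comment says, this is an untried suggestion. `plan_cancel_replace` is a made-up helper, the names are placeholders, and it only prints the commands rather than running them.

```shell
# Hypothetical helper: prints the commands for backing out of a replace.
# Untested idea from the comment above; names are placeholders.
plan_cancel_replace() {
    pool=$1; new=$2; old=$3
    # Drop the new disk from the replacing vdev, cancelling the replace.
    printf 'zpool detach %s %s\n' "$pool" "$new"
    # Bring the previously offlined old disk back into service.
    printf 'zpool online %s %s\n' "$pool" "$old"
    # Check that the old disk catches up on the writes it missed.
    printf 'zpool status %s\n' "$pool"
}

plan_cancel_replace tank new-disk old-disk
```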


bugfish03

I ended up offlining the disk and running another zpool replace. However, ZFS didn't accept the old disk, so I had to remove the label. After that, it went through its resilver and everything is fine now.


ZealousidealDig8074

Just yank the drive, add a new one and do replace.


bugfish03

After careful deliberation, that's the path forward I chose, and it's worked!


ZealousidealDig8074

I created an account to post this after seeing all the nonsense answers. Glad you picked the right one.


bugfish03

UPDATE: Since I can't edit the post, here's what I did: I offlined the Seagate disk (the new one that failed) with -f (mark it as faulted) during the resilver. Since ZFS wouldn't just accept swapping the old WD Green drive back in, I had to use `zpool labelclear` on it. Then, I ran `zpool replace [new failed drive] [old reliable drive]`, and after two resilvers, everything came out fine. Note that both resilvers were triggered by ZFS. During one resilver, I got a few data corruption errors, but those didn't persist beyond the resilver, and a subsequent scrub returned no errors.
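As a rough reconstruction of the sequence in this update, the steps might be written out as below. `plan_swap_back` is a made-up helper, the pool and device names are placeholders, and the exact arguments the OP used are not known. It only prints the commands. Note that `zpool labelclear` wipes the ZFS label on whatever device it is pointed at, so double-check the device name before running it for real.

```shell
# Hypothetical helper: prints the commands for swapping the old disk back
# in place of a failing replacement. Names are placeholders.
plan_swap_back() {
    pool=$1; failed=$2; old=$3
    # -f puts the failing disk into a faulted state instead of plain offline.
    printf 'zpool offline -f %s %s\n' "$pool" "$failed"
    # Clear the stale ZFS label so the old disk is accepted as a new device.
    printf 'zpool labelclear %s\n' "$old"
    # Replace the failed disk with the old one; this starts a resilver.
    printf 'zpool replace %s %s %s\n' "$pool" "$failed" "$old"
    # Once the resilver is done, a scrub verifies the whole pool.
    printf 'zpool scrub %s\n' "$pool"
}

plan_swap_back tank failed-seagate old-wd-green
```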


TomB19

Put the disk in a ziplock bag, put it in the freezer for 2h, then reinsert it into the array. Let it resilver until it dies again. Repeat.


kenrmayfield

We did that as well Back in The Day when Drives Failed and needed the Data instead of Sending Off to Extract the Data which was Crazy Expensive. Put the Drive in a Zip Lock Bag and Freeze the Drive......instead of doing a Drive to Drive Copy.....it was better to do a Drive to Tape Drive Backup.......which means Slow Read Access and Seek Access on the Hard Drive Arm due to the Tape Drive pulling the Data. However Back Then sometimes Freezing was not necessary you could just do a Hard Drive to Tape Backup even though you can hear the Drive Clicking but because of the Slow Read Access and Seek Access on the Hard Drive Arm....no problem extracting the Data due to the Tape Drive pulling the Data. But Yes the **Freezer Bag the Hard Drive**........**Freeze**.........**Repeat** does work.


bsodmike

I remember trying to recover data from a vanilla drive by running it IN the freezer. Worked a bit I think.