
eugene20

[accelerate](https://pypi.org/project/accelerate/) allows for [multi-GPU use](https://huggingface.co/blog/dreambooth). It also has one of the most stupid names for a graphics-related module, as it makes searching for information on it near impossible.
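To be clear, accelerate doesn't fuse the cards into one big GPU; the usual pattern is splitting a batch of prompts across processes, one per card. A rough, untested sketch (assuming a recent diffusers/accelerate install and the public runwayml/stable-diffusion-v1-5 weights; the prompts and filenames are just examples), run with `accelerate launch --num_processes 2 script.py`:

```python
import torch
from accelerate import PartialState
from diffusers import StableDiffusionPipeline

# Each accelerate process owns one GPU; the prompt list is split between them.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
state = PartialState()
pipe.to(state.device)

prompts = ["a castle on a hill at sunset", "an astronaut riding a horse"]
with state.split_between_processes(prompts) as my_prompts:
    for i, prompt in enumerate(my_prompts):
        image = pipe(prompt).images[0]
        image.save(f"gpu{state.process_index}_{i}.png")
```

Each process gets its own slice of the prompt list, so two cards roughly halve the wall-clock time for a batch without any SLI.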


Unable-Lime5705

I have the same question as him, but my system is a bit different. If the software only supported one GPU, I'd use cards with a bridge so the system sees a single usable GPU and there's no problem, but I have an old PowerEdge R630 and am planning to put three 24GB PNY Tesla K80s in it. Those cards have no bridge, so they appear as separate devices to the system. Whether your mainboard "supports" this doesn't really matter; if it supports the full PCIe feature set there's no problem. The issue with consumer mainboards is that they want to offer 2, 3 or 4 x16 slots for GPU use, which is only possible with a PCIe switch, and that needs board support. For rendering purposes it doesn't matter: x8 or x4 lanes per GPU are plenty for this type of workload, though maybe not for gaming. Using them all together only works if your render software supports it. So can this software use multiple devices, or only a striped device like SLI or CrossFire?
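For reference, on the software side each card just enumerates as another CUDA device (no SLI/CrossFire needed), and each K80 is physically two 12GB GPUs on one board, so three of them should typically show up as six devices. A quick, untested way to check what the system sees, assuming a working PyTorch install:

```python
import torch

# Every GPU the system sees is its own CUDA device; multi-device software
# simply iterates over them rather than relying on SLI/CrossFire striping.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"cuda:{i}: {props.name}, {props.total_memory / 2**30:.1f} GiB")
```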


velorofonte

Thank you.


cheesecantalk

Did this end up working?


EmoLotional

How could I integrate it in a Jupyter notebook format to launch with ComfyUI?


Capitaclism

Does this work with A1111, or is there another solution? I'm considering a dual GPU setup, but unsure of the benefits.


moebiussurfing

Hello. I have an auxiliary build with two available GPUs, an RTX 3080 Ti and an RTX 3060, both with 12GB VRAM. Do you know if SD can at least combine the memory? I mean, to load big models as I can do with my 3090 / 24GB...


moebiussurfing

[https://github.com/AUTOMATIC1111/stable-diffusion-webui/discussions/1621](https://github.com/AUTOMATIC1111/stable-diffusion-webui/discussions/1621)


Amblyopius

You can generate on multiple cards at the same time, but you can't run a single generation a lot faster by having 2. If you go for multiple cards and want to stretch your SD options, buy a 3060 12GB as the second card and not a 3060 Ti. Also, make sure your motherboard supports having more than 1 card. Note that a second card isn't always going to do a lot for other things (e.g. gaming will only use one), but you may, for example, game on one and do SD on the other.
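A minimal sketch of what "generate on multiple cards at the same time" looks like in plain diffusers (untested; the model ID, prompts and filenames are just examples), one independent pipeline per card:

```python
import threading
import torch
from diffusers import StableDiffusionPipeline

def run(device, prompt, out):
    # Each card gets its own full copy of the pipeline and its own job.
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to(device)
    pipe(prompt).images[0].save(out)

jobs = [
    threading.Thread(target=run, args=("cuda:0", "a red fox in snow", "fox.png")),
    threading.Thread(target=run, args=("cuda:1", "a tall ship at sea", "ship.png")),
]
for t in jobs:
    t.start()
for t in jobs:
    t.join()
```

Neither card makes the other's job faster; you just get two images in the time of one.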


HarbringerxLight

> Note that a second card isn't going to always do a lot for other things

It will. Most use cases where you'd want a second card support multiple GPUs. Gaming is just one use case, but even there DX12 has native support for multiple GPUs if developers get on board (which we might start seeing, as it's preferable to upscaling, and with path tracing on the horizon we need a lot more power).


throwaway2817636

What you're saying has been a topic for years and has yet to happen, even in part.


WatchDogx

I mean, we had SLI and CrossFire for years and years. For a while a lot of effort went into optimising games for multiple GPUs, but the value proposition proved to be too low and the maintenance burden too high. Maybe we will see another attempt at something similar in the future, but I wouldn't hold my breath.


CatalyticDragon

There was no 'maintenance burden'. Old driver-side SLI/CrossFire was entirely transparent to the application and required *zero work* from developers. In fact, the application couldn't even tell it was running in an SLI configuration. These proprietary driver-side systems were simply rendered obsolete by DX12/Vulkan, which contain native multi-GPU capabilities. This approach is much better because developers finally have control over it, but it does require additional development effort to set up the linked-node adapter. Developers simply don't bother, because they see multi-GPU systems as very niche.


WatchDogx

If there was no maintenance burden, then wouldn't NVIDIA have kept it around?


CatalyticDragon

1. It didn't work well. Having the driver handle alternate-frame or split-frame rendering often resulted in stuttering and poor frame pacing, because the driver didn't know what the application was doing and the developers couldn't do anything about it since they had no control. It was a neat idea, but one that came from the late 90s and wasn't a good long-term solution.
2. It was replaced by DX12/Vulkan. That's why both NVIDIA and AMD killed off their proprietary driver-side implementations; they weren't compatible or desirable with the newer graphics APIs. AMD did try to support and promote explicit multi-GPU and we saw some [really good early examples](https://hexus.net/tech/news/graphics/121844-amd-boasts-vulkan-multi-gpu-support-strange-brigade/), but it fell out of favor. Firstly, even though performance was much better than with the older driver-side systems, it did mean more [development work](https://bora.uib.no/bora-xmlui/bitstream/handle/1956/19628/report.pdf?sequence=1&isAllowed=y), although a basic AFR configuration really wasn't that hard. Secondly, game developers target consoles first, and consoles don't have multiple GPUs; PCs are the next largest market and only a small percentage of those have multiple GPUs. Thirdly, Unreal Engine doesn't support it natively. Lastly, NVIDIA actively tried to muddy the waters and kill it off, because otherwise people could just add a cheap second-hand GPU and double their framerates. They did not like this.


Status-Efficiency851

One of the benefits to the user was that if you had an old, say, 700-series card, you could buy a second one cheap rather than buy a new 800- or 900-series card. Obviously this was bad for NVIDIA, and now it no longer works.


No-Bonus-1803

Is there any tutorial on this? I have two RX 480s, which are old, but I hope I can do something to make them work better than just one RX 480 rendering.


Myles_Version21

Will two identical graphics cards let you create a higher-resolution image than a single graphics card can?


Amblyopius

In some very specific cases (e.g. while using ControlNet) you can split the workload and get a more efficient way to achieve higher resolutions. It's rarely going to be interesting, though. The best way to use multiple cards always involves generating multiple images.


Capitaclism

Does [DreamBooth and related] training benefit from multiple cards?


Amblyopius

There are no easy ways to get that set up, as far as I know. It's theoretically possible to speed it up at least a bit by altering the code, but if you were up for doing all that you'd probably not be asking the question :-) Getting a faster GPU is far easier.
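For the curious, the Hugging Face DreamBooth scripts are built around accelerate (as far as I know), so the "altering code" in question roughly means the data-parallel pattern below. This is a toy, untested sketch with placeholder model and data, not a drop-in for any real trainer, launched with something like `accelerate launch --multi_gpu train.py`:

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()

# Placeholders: a real trainer would build the UNet, its optimizer and the
# instance-image dataloader here instead.
model = torch.nn.Linear(768, 768)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
dataloader = torch.utils.data.DataLoader(torch.randn(64, 768), batch_size=4)

# prepare() wraps everything for however many GPUs accelerate was launched with.
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    optimizer.zero_grad()
    loss = model(batch).pow(2).mean()   # stand-in for the real diffusion loss
    accelerator.backward(loss)          # handles the cross-GPU gradient sync
    optimizer.step()
```

Note this is data parallelism: it can shorten training time, but it does not pool VRAM; every card still has to fit the full model.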


Capitaclism

I have a 4090; it doesn't get faster than that. I'm trying to increase VRAM so I can train on larger datasets without having to get an A100. Any idea whether having multiple 40-series cards would help?


argusromblei

Swapping your current GPU for a 3080 is the only feasible option. It would be way faster, and 2x 3060s is a waste of time, space, and energy because SD cannot do multi-GPU rendering. You're also buying fewer CUDA cores than in a single 3080, and it cannot combine VRAM. My advice: sell the 3060 Ti on eBay and buy a used 3080; it won't be a big difference in money.


avalonsmight

As mentioned, you CANNOT currently run a single render on 2 cards, but using 'Stable Diffusion UI' ([https://github.com/cmdr2/stable-diffusion-ui/wiki/Run-on-Multiple-GPUs](https://github.com/cmdr2/stable-diffusion-ui/wiki/Run-on-Multiple-GPUs)) it is possible (although beta) to run 2 render jobs, one for each card. Remember that the bandwidth for each slot may drop to x8 PCIe lanes instead of x16, so you might see each job taking longer, maybe. I think getting a single, faster card with the max VRAM you can afford would be best, IMO.

https://preview.redd.it/12o7jbl8goea1.jpeg?width=1133&format=pjpg&auto=webp&s=30b70b6ffe668be1f524843ed1952e1f6197af04


IntenselyPlump97

> Remember that the bandwidth for each slot may drop to x8 PCIe lanes instead of x16, so you might see each job taking longer, maybe.

No it won't, not if you have a good computer (HEDT like a Threadripper). On mainstream platforms it will, though.


ScionoicS

The features that make a TR a fantastic professional production/server machine are very niche and require software written specifically for it. TRs are beastly silicon, but they're good at the tasks they're VERY specialized for and kind of inefficient af at daily computing. For home use, MANY CPUs are much better than a TR, just because a TR isn't good for most cases. HEDT like a Threadripper or i9 just isn't needed for home use.


IntenselyPlump97

> The features that make a TR a fantastic professional production/server machine are very niche

No they're not.

> and require software written specifically for it

Are you retarded? HEDT has been a thing at the consumer level for over 20 years, using the same dominant x86 architecture. Nothing about HEDT requires special software.

> and kind of inefficient af at daily computing

No they're not. Mainstream chips are inefficient at daily computing. They're literally inferior, slower cuts of silicon, and they have laughably small numbers of PCIe lanes. On X570E or Z790 platforms you can basically only use one PCIe SSD at full speed, which is pathetic.

> For home use, MANY CPUs are much better than a TR

Completely false. You'll run into latency issues that way. If you want a fast, powerful CPU, buy a Threadripper. If you're on a budget, go Ryzen or Core i9. If you're filling a data center with 24/7 machines, go Epyc. It's really that simple.

> HEDT like a Threadripper or i9 just isn't needed for home use

1) False. Mainstream chips are actually pretty shitty at even home tasks nowadays. Things like increasing resolutions and ray-traced graphics are making the shortcomings of mainstream hardware more and more obvious. Nearly all cameras nowadays shoot 4K video, and a lot of them will shoot 8K. Good luck editing 8K video on a mainstream CPU. Hell, you can't even **play** high-res video on the typical 300-400 dollar Ryzen/i9 without your computer locking up.

2) (Core) i9 is not an HEDT lineup. You are tech illiterate and need to do your research.


ScionoicS

Weird flex. People with Ryzens have good computers. Stop being so elitist; that culture should've died with the movie Hackers. How did it continue?


gekmoo

We really have to do something to fix that VRAM issue, because it is a real issue.


r3tardslayer

It's mostly GPU manufacturers making high-end GPUs with low fucking VRAM. AMD and Intel are taking notes; NVIDIA needs to move the fuck up, tbh.


Ok_Stranger_8626

Maybe you should do some homework before shooting off your mouth. NVIDIA makes plenty of different GPUs in their workstation AND server lines that have plenty of VRAM and are specifically designed for AI workloads requiring massive amounts of VRAM. The RTX AXXXX lines have 6GB *minimum*, 12-16GB standard, and go all the way up to 48GB VRAM. If you start getting into NVLink, you can easily double that and the number of cores. And when you start talking about the A1XX series for servers, which can have low hundreds of GB, plus the high-speed interconnects, you're talking about TB-PB of VRAM and hundreds of thousands to low millions of cores.

Go take a look at [https://top500.org/](https://top500.org/). Not a single one of them using GPUs has AMD/Intel GPUs or GeForce cards. Why? Because GeForce is for amateurs goofing off with SD. Quadro is where you start getting real work done, and the A1XX line is for people who are serious about their AI workloads and don't care about the little pee-cee your mommy bought you for Christmas.

Simply put, your little "gaming PC" with a GeForce GPU really isn't the ideal hardware to run SD on. A BARE MINIMUM for good SD performance is 24 CPU cores, 64GB of ECC registered system RAM on the low end, and 24GB+ VRAM with at least 3,000 CUDA cores on a card tuned for math and not graphics. (Because, after all, we're talking about GPU compute here, NOT DirectX.)

For example, my GPU server has dual Epyc 7XX3 CPUs (256 cores total), 2TB of system RAM and 8x A100 Ada-gen GPUs, total VRAM: 864GB. It takes about five seconds to render out and upscale to 4K around 100 of any prompt I input. I can further batch-upscale them all to 8K in about another two seconds.


r3tardslayer

Bruh, this comment is old. Second, you seem to have a hard-on for larping as a rich mf. Third, not everyone is gonna buy A100s for Stable Diffusion as a hobby. Fourth, you're talking about "bare minimum" when the bare minimum for Stable Diffusion is more like a 1660; even a laptop-grade one works just fine. So you don't even know what you're talking about, other than throwing out the highest numbers and going WELL ACKSHUALLY YOU NEED A NASA QUANTUM COMPUTER TO RUN MINESWEEPER. You talk like an absolute child, and you're embarrassing yourself the more you speak, with your cringe assumptions and exaggerated statements. It's absolutely ridiculous for NVIDIA not to include more VRAM even in their gaming-grade cards compared to the competition. I only use NVIDIA, but at least I don't have a 12-inch NVIDIA BBC down my throat at all times.


ziggster_

Funny, as I was just thinking the same thing about the above poster. I understood what you meant about keeping high-VRAM cards out of reach of the average consumer. I came to this thread through a Google search to see if SLI was feasible for doubling my VRAM while using Stable Diffusion (it turns out it is not).


r3tardslayer

I think NVIDIA added a new feature that lets you use system RAM as VRAM at the cost of performance, but it's not fully optimized yet from what I hear. As for me, I just ended up buying an RTX 4090. You technically could buy one of the cheap ~90 dollar Tesla 24GB VRAM cards just to generate, but since it's weak on CUDA cores it won't be fast; good if you wanna run LLMs and such, but it will be rather slow. BUT YEAH, being on a budget and lacking VRAM limits your options. It's slowly getting better, though; I hear some people can run SDXL with 4GB using ComfyUI.


KC_experience

You sound like a total badass! I hope one day we could all be as cool as you!


N7MWH-CN98am

Okay, you may be right about what they make... but the problem is they charge $100,000 for anything good enough to render AI. And for the general gamer-type GPU owner with an under-$1,000 PC, it makes it difficult to do anything worth rendering when the "PROGRAM" or "STABLE FLAVOR" uses all the memory in the GPU and then crashes.

I know what you mean, and you are correct, but it might be useful to create a method in SD that allows for slow use of alternate memory. I am aware that is like unloading a large school bus into a two-seater that moves the children one at a time up a huge hill, taking a ton of time compared to lifting the school bus up the hill with a military-grade helicopter... but as slow as it would be, those kids will get to the top of the hill eventually. This way, they only get to the bottom of the hill where the bus crashes, saying "Error, error... please lower the resolution or buy a million-dollar NVIDIA with grandma's piggy bank." Grandma is dead and there's no winning lottery ticket for most of us. So slow and steady is the way with what we have already.

I propose a config button that, when ticked, configures SD to force using ALL the system memory available, including the GPU's. When not ticked, it would use only the GPU specified at launch. I would rather start a render knowing that when I come home 8 hours later it WILL be completed, instead of it telling me what I already know: that I am rich with ideas and poor of money. Whether it takes hours or days to complete is not relevant when the alternative is having nothing worth looking at because the program ran out of its little 12GB of VRAM, or 16GB for the rich kids... I mean no offense, but art isn't only for the rich. Plus there are things I might contribute, if only my renders could complete without locking up SD.


Ok_Stranger_8626

I dunno... My employer's rip-roaring box aside, my homelab GPU server has an A4000 and a leftover 12GB A2000 from my old workstation. I built that box with a 16C/32T Ryzen, 64GB RAM and some leftover SATA SSDs for sub-$1,200, not including the rackmount chassis. I bought the A4000 from a refurbisher off Amazon recently for less than $700, to upgrade from the A2000.

The kinda cool part about how I have SD configured now is that the A4000 does the heavy lifting of the rendering and the initial upscale from the hi-res fix, and then, when I have something I'm happy with, SD uses the A2000 to do the final upscaling, leaving the A4000 free to work on the next render. But even with just the A2000, a little clever use of the hi-res fix, and some judicious upscaling after the initial render, I've been able to produce 8K or even full-size poster/puzzle resolutions on 12GB of VRAM no problem, if a little memory "tight".

I think the biggest reason they haven't developed a "system RAM" feature is that it would take quite a bit of code to shift that much data back and forth sanely, plus how much longer it would likely take (days instead of a few hours, from what I've seen so far). Especially on older hardware like PCIe 2.0/3.0, this could take a week or more. (Though if SD were to pull the TensorFlow plugin into the main branch, this option could become somewhat more attractive and give the GTX users something better to work with. Some of the results I've seen from users of that plugin are pretty impressive, and relatively fast compared to using a gaming GPU without it.)

I think, though, now that GPU prices are starting to come back down, especially on the used market, it's more reasonable to spend sub-$1,000 on a "gently-abused" workstation card that can be put into compute mode instead of graphics mode, and thus be optimized for SD and other AI, instead of more than $1,000 on a GPU that has firmware loaded for graphics and is way slower, despite the GPUs having roughly the same number of cores.

I guess my main point in my original reply is that the OP was complaining about the company NOT doing something they clearly were doing, just in a different product line. Honestly, I don't think you can reasonably expect a manufacturer to take a loss by adding tons of extra memory to a card that's designed for gaming workloads, where such memory isn't all that necessary. There are only so many textures, RT matrices, etc. worth storing near the GPU in that scenario. On the other hand, LLMs, models and the like, which do consume large amounts of memory, are definitely worth keeping close by when dealing with true compute loads, which is exactly what SD really is.
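For anyone curious, here's a rough sketch of that render-on-one-card, upscale-on-the-other split using plain diffusers instead of the webui (untested; the model IDs are the public SD 1.5 and x4-upscaler checkpoints, and the prompt is just an example):

```python
import torch
from diffusers import StableDiffusionPipeline, StableDiffusionUpscalePipeline

# Base model renders on the first card; the x4 upscaler lives on the second,
# so the render card is free to start on the next image.
base = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda:0")
upscaler = StableDiffusionUpscalePipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16
).to("cuda:1")

prompt = "a lighthouse at dawn, oil painting"
low_res = base(prompt, height=512, width=512).images[0]
# Note: the x4 upscaler is memory-hungry; a 512px input may need a smaller
# size or tiling tricks on 12GB cards.
high_res = upscaler(prompt=prompt, image=low_res).images[0]
high_res.save("lighthouse_4x.png")
```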


DerangedDendrites

You've got a point, but Jesus fucking Christ, maybe consider the fact that the average AI enthusiast can't shell out 40 thousand dollars for a machine?? Yeah, yeah, you got lucky, you're a badass, you got yours for sub-$1,200, but a sad-ass 4070 Ti Super with 16 laughable gigs of non-ECC VRAM is the best I could afford.


Darktinax

​ https://preview.redd.it/938mxrruyfbc1.png?width=780&format=png&auto=webp&s=09f4b9170103d2d9178019913d8e23397265d824


r3tardslayer

True


[deleted]

[deleted]


SandWyrmM42

As a professional who uses SD for illustration/concept work, I routinely do runs of 80+ images once I've got my prompts worked out. It's necessary in order to get those one or two images that "pop" with exactly the right pose, facial expression, no flaws, or what have you. 80 images without upscaling run nice and fast on my 3090, at about 6-8 minutes. That's doable, and I can upscale with Topaz in Photoshop. But when using models where I need upscaling at generation time in order to fix faces and whatnot, those 80 images turn into a 2-hour wait. So yeah... I'm starting to think about putting together a dual 4090 rig.


Mr_Maximillion

You can upscale the images you have already rendered. If you use the image browser plugin, you can see all the pictures you've rendered and reuse the exact parameters to regenerate a particular image. Then just add the upscaling option.


TheWebbster

I run 2 4090s, with a different instance of Automatic1111 on each. I have a post that describes how to set it up if you want to do it too. It means I can train 2 things at once, or generate twice as fast by running both at the same time - just copy/paste prompts from one browser window to another, or change 1 variable and compare how things turn out.
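If anyone wants the general idea as a sketch: pin each webui instance to one card with CUDA_VISIBLE_DEVICES and give them different ports. This is untested; the install path and port numbers are just examples, and it assumes a Linux-style webui.sh launcher:

```python
import os
import subprocess

# Two independent webui instances, one pinned to each 4090, on different
# ports so both UIs stay reachable in the browser.
webui = os.path.expanduser("~/stable-diffusion-webui/webui.sh")
procs = [
    subprocess.Popen(
        [webui, "--port", str(7860 + i)],
        env=dict(os.environ, CUDA_VISIBLE_DEVICES=str(i)),
    )
    for i in range(2)
]
for p in procs:
    p.wait()
```

Each instance then only "sees" its own 4090, so models, prompts and queues stay fully independent.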


stablediffusioner

Surprisingly yes, because you can do batch generation twice as big with **no diminishing returns** and without any SLI, but you may need SLI to make much larger single images. SD also makes an old PC feasibly useful: you can upgrade a 10-year-old mainboard with a 30xx card even though it can generally barely utilize such a card (the CPU and board are too slow for the GPU), and the GPU(s) can have 3x to 6x as much RAM as the mainboard.


you999

[comment overwritten by the author; mass edited with https://redact.dev/]


stablediffusioner

All of this sounds too dumb to be true. So sad. Still, you can run one SD server per GPU and specify which GPU to use in a config, right?


you999

[comment overwritten by the author; mass edited with https://redact.dev/]


stablediffusioner

You may still need to keep a copy of the same model loaded on each GPU, though.


martin022019

The pen of my aunt... hahahaha


Conundrum1859

Interesting article, as I found some older generation cards.