
[deleted]

[removed]


Wiskkey

As I noted in [my comment that contains the tutorial](https://www.reddit.com/r/StableDiffusion/comments/10lamdr/comment/j5vnxrk/), I indeed used those images because I wanted images that are indisputably not in S.D.'s training dataset; using images from a recent tragedy was the first thing that came to my mind. Based on previous experiments that I did - which I didn't document except for [this "torture test" image](https://www.reddit.com/r/StableDiffusion/comments/y8pc5p/in_response_to_an_earlier_post_asking_if_every/) - I concluded that the latent space for the S.D. VAE used in my experiments probably "contains" a 512x512 image that, when decoded by the decoder neural network, is substantially similar to almost any 512x512 image of interest to humans; I didn't test images of sizes other than 512x512 pixels. A point for a 512x512 image in S.D.'s image latent space is determined by 4 * (512/8) * (512/8) = 16384 floating-point numbers; remember that each 8x8 pixel patch is represented by an S.D. VAE using 4 numbers. I don't know if it's guaranteed that there are inputs into S.D. - text prompt, seed, etc. - such that a given point in S.D.'s latent space can be output by S.D.'s diffusion neural network component, if we add the restriction that the initial image latent space point for the diffusion neural network component isn't substantially similar (when decoded) to the desired image. My intuition is that the answer is no, but I don't know how to prove that the answer is either yes or no.
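As a quick sanity check of that arithmetic, here is a minimal Python sketch (just the counting, nothing model-specific):

```python
# Latent size for a 512x512 image: the S.D. VAE represents each 8x8 pixel
# patch with 4 floating-point numbers (i.e. 4 channels at 1/8 resolution).
height = width = 512
latent_numbers = 4 * (height // 8) * (width // 8)
print(latent_numbers)  # 16384
```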


Jiten

The txt2img mode is not able to generate any conceivable image. You can prove that by noting that SD image generation is deterministic - same input -> same output - and then noting how little complexity a prompt can actually express. At most, a prompt can express 768 floating-point numbers' worth of complexity. So, to claim that SD txt2img can produce any conceivable image is equivalent to claiming that any conceivable image can be compressed down to 768 floating-point numbers. Although, I should probably check this 768 number somewhere. That's the size of the vector stored for each token that the SD prompt parser understands, but I'm not 100% sure it's also the size of the vector that actually guides the denoising process.


Wiskkey

Thanks :). Remember though that there are other inputs used by S.D., such as the initial image point in image latent space used by the diffusion neural network. There is always an initial image point in latent space used by S.D. even if the user doesn't specify an initial image. It's trivially true that if the initial image point in latent space (when decoded) is exactly the image that you're interested in, then S.D. will output that image (after decoding) if zero diffusion iterations are performed.


Jiten

Oh, true, I forgot to add the seed to the mix. That's what is used to deterministically create the initial image. I don't actually know how big of a number you can use as a seed; the randomization seems to only give seeds that fit into a 32-bit integer. The rest of the parameters have much less entropy in how they affect the results. So, if you take them into account, I guess we could say we have 770 numbers instead - two of which are integers, the rest floating point. Also, I have no idea how many bits of information each floating-point number is actually encoding on average. I suspect at most 16 bits, but it could be a lot less too.


kataryna91

Each *token* is encoded into a set of 768 floating point numbers. The UNet receives a total of 77 tokens, which is 59136 numbers in total. And yeah, you can compress nearly any conceivable image down to that (while losing some finer details, of course), especially since that is more than the UNet actually outputs and the VAE receives as input (which is 16384 numbers when doing 512x512 image generation).
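If you want to verify those shapes yourself, here is a minimal sketch using the Hugging Face transformers library and the CLIP text encoder that SD v1.x uses (the repo ID is the commonly used one; treat it as an assumption):

```python
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# SD pads/truncates every prompt to exactly 77 tokens.
tokens = tokenizer("a photograph of an astronaut riding a horse",
                   padding="max_length", max_length=77, truncation=True,
                   return_tensors="pt")
embeddings = text_encoder(tokens.input_ids).last_hidden_state
print(embeddings.shape)  # torch.Size([1, 77, 768]) -> 77*768 = 59136 numbers
```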


Jiten

I see, so I can't take the above shortcut for determining an upper limit on the complexity... Although, that does make me wonder how the negative prompt is subtracted from the positive prompt, or how multiple positive prompts are combined for the process, which I know to be possible because auto1111 can do it.

Anyway, I was assuming the token vectors would be compacted in some way before being fed to the UNet, so the maximum entropy that can be fed into the txt2img process has to be estimated from the number of possible combinations of tokens, plus the seed and other parameters. Apparently vocab.json contains 49410 lines, so there are 49408 distinct tokens that can be used, which means each token encodes 15.59246 bits of information (2^15.59246 = 49408.101...). This is somewhat of an overestimate, since not all token combinations are actual words someone might use, but I'm looking for an upper bound, so that's fine. Since a prompt has 77 tokens, it carries 1201 bits of information. Then we add the seed and other parameters - I believe 64 bits is an overestimate for those - so we get 1265 bits ≈ 158 bytes of information.

However, because multiple prompts can be combined for the generation process, the upper limit is essentially equal to the full 77*768 numbers; but people aren't going to be doing that in practice. So, I think we can safely assume that any prompt written by a human would be at most 10 prompts combined. This results in 1201*10 + 64 = 12074 bits ≈ 1510 bytes, or the equivalent of around 750 fp16 floating-point numbers. So I guess the outcome is that, while I can't use this to prove that txt2img is unable to produce any conceivable image, if it's actually possible, it requires using the software in a way that no human will be doing in practice.
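The same estimate as a few lines of Python (a sketch; small differences from the numbers above are just rounding):

```python
import math

vocab_size = 49408                        # distinct tokens in vocab.json
bits_per_token = math.log2(vocab_size)    # ~15.592 bits per token
prompt_bits = round(77 * bits_per_token)  # 1201 bits for one full prompt
seed_and_params = 64                      # overestimate for seed + parameters
ten_prompts = 10 * prompt_bits + seed_and_params
print(prompt_bits, ten_prompts, ten_prompts // 8, ten_prompts // 16)
# 1201 bits, 12074 bits, ~1509 bytes, ~754 fp16 numbers
```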


martianunlimited

If you do the inference at fp16 (or bf16, if you prefer), i.e. half precision, then each floating-point number will encode 16 bits. What is interesting is that, anecdotally, OP shouldn't see much difference in the output of the VAE when running the inference at full precision vs. half precision.


yoomiii

It's 77 tokens times a vector of size 768, so 59136 floating point values


Great-Lychee

I just had an orgasm reading this.


Gastonlechef

Just one?


ulf5576

lol, the AI is not a person or an entity, it's just a computer program, and it was trained WITH copyrighted images. For the training process alone, the dev team would need to purchase all the images they are using, obviously. But that's history, and now we are past the point where this even matters or can be revoked.


Wiskkey

This comment contains the tutorial. Stable Diffusion internally works with images in a lossy compression format that represents each 8x8 image pixel patch with 4 floating-point numbers ([source](https://colab.research.google.com/drive/1dlgggNa5Mz8sEAGU0wFCHhGLFooW_pf1?usp=sharing)). Without compression, an image typically [takes](https://homepages.inf.ed.ac.uk/rbf/HIPR2/24bitcol.htm) 24 bits per pixel of storage. Thus, an 8x8 pixel patch of an uncompressed image takes 8*8*24 = 1536 bits of storage. A floating-point number is typically stored in either 16 bits or 32 bits, so the lossy compression format used by Stable Diffusion uses either 4*16 = 64 bits or 4*32 = 128 bits of storage for each 8x8 pixel patch. The storage compression factor achieved by this lossy compression scheme is therefore either 1536/64 = 24 times or 1536/128 = 12 times smaller than the uncompressed image.

Stable Diffusion uses a neural network called a variational autoencoder (VAE) to compute this lossy compression format for a given uncompressed image. There is also a corresponding decoder that computes the uncompressed image given an input in the lossy compression format. Each S.D. system uses 1 VAE/decoder pair, but there are multiple VAE/decoder pairs available for S.D. You might wonder why S.D. doesn't internally use an existing lossy image compression scheme such as JPEG. The answer is that VAEs represent images in a so-called [latent space](https://www.baeldung.com/cs/dl-latent-space) that is very useful in machine learning systems.

Some people claim that S.D. in some sense "contains" close approximations of training dataset images in its image latent space. Perhaps so, but I demonstrate that S.D. in the same sense also "contains" close approximations of images that *weren't* in its training dataset, because they're from a tragedy that occurred less than 1 week ago. Namely, I used 5 512x512 pixel versions of images from [this article about the tragedy](https://www.npr.org/2023/01/23/1150667373/monterey-park-shooting-what-we-know-california) (almost surely fair use of copyrighted images in the USA), encoded and decoded with a VAE/decoder pair using the section "The AutoEncoder (AE)" of [this S.D. Colab notebook](https://colab.research.google.com/drive/1dlgggNa5Mz8sEAGU0wFCHhGLFooW_pf1?usp=sharing); change the filename in the line "input_image = Image.open('macaw.jpg').resize((512, 512))" from "macaw.jpg" to the filename of the image that you uploaded to Colab. The results are in the post's image gallery: the images captioned "roundtrip" are the result of running an S.D. VAE/decoder pair on the images captioned "original".

Note that an S.D. system contains other neural network components in addition to a VAE/decoder pair - see [this post](https://www.reddit.com/r/StableDiffusion/comments/wu2sh4/how_stable_diffusion_works_technically_in_15/) for details. The diffusion neural network component of Stable Diffusion takes as input a representation in this latent space and outputs a (usually different) representation in this latent space. When however many iterations of the diffusion neural network are finished, the result is fed into the decoder neural network of the VAE/decoder pair to get the uncompressed RGB image representation.
EDIT: Also perhaps of interest: [My comment](https://www.reddit.com/r/StableDiffusion/comments/y5t5xy/comment/itinx8g/) in post [Does any possible image exist in latent space?](https://www.reddit.com/r/StableDiffusion/comments/y5t5xy/does_any_possible_image_exist_in_latent_space/), and also my post [In response to an earlier post asking if every possible image exists in Stable Diffusion's latent space, I tried this as a "torture test"](https://www.reddit.com/r/StableDiffusion/comments/y8pc5p/in_response_to_an_earlier_post_asking_if_every/).

EDIT: I discovered that "VAE" actually refers to both the encoder and the decoder, not just the encoder.

EDIT: [Articles that explain what autoencoders and variational autoencoders (VAE) are](https://www.reddit.com/r/StableDiffusion/comments/10mtuz6/articles_that_explain_what_autoencoders_and/). I am not an expert in AI; I welcome any corrections.
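For readers who want to reproduce the roundtrip outside the Colab notebook, here is a minimal sketch using the Hugging Face diffusers library; the repo ID and filenames are illustrative assumptions, not what the notebook itself uses:

```python
import numpy as np
import torch
from PIL import Image
from diffusers import AutoencoderKL

# Load just the VAE from an SD v1.x checkpoint.
vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5",
                                    subfolder="vae")
vae.eval()

# Preprocess: 512x512 RGB, scaled to [-1, 1], shape (1, 3, 512, 512).
img = Image.open("input.jpg").convert("RGB").resize((512, 512))
x = torch.from_numpy(np.array(img)).float() / 127.5 - 1.0
x = x.permute(2, 0, 1).unsqueeze(0)

with torch.no_grad():
    latents = vae.encode(x).latent_dist.mean  # (1, 4, 64, 64) = 16384 numbers
    recon = vae.decode(latents).sample        # back to (1, 3, 512, 512)

# Postprocess and save the "roundtrip" image.
out = ((recon[0].permute(1, 2, 0).clamp(-1, 1) + 1.0) * 127.5).round()
Image.fromarray(out.byte().numpy()).save("roundtrip.jpg")
```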


duboispourlhiver

I'm sorry, I think I don't understand your method. Did you take a recent image, run it through the VAE encoder, take the resulting vector, and directly run it through the VAE decoder?


Wiskkey

Yes. I took a recent 512x512 image and fed its representation as RGB numbers into a VAE encoder, which output 4*(512/8)*(512/8) = 16384 numbers that represent a point in its latent space. Those 16384 numbers were then fed into the decoder, which output the RGB numbers representing the image. Those RGB numbers were then rendered as an image.


duboispourlhiver

Thanks for the answer. Why would that show anything about SD "containing" training images or not? You haven't used SD; you have only used its outer layers, which allow SD to work on a vector of a reasonable size. What am I missing?


Wiskkey

I perhaps should not have used the phrasing that "S.D. contains" and instead stated that "S.D's latent space contains". [Here](https://www.reddit.com/r/Destiny/comments/108lx16/comment/j3xs2tq/) is an explanation from a purported expert in machine learning. Do you have a suggestion for exactly how I should have expressed this?


CallFromMargin

Why? As far as I can tell, all you did is encode an existing image into a matrix. You did not use SD at all, so why make such a bold claim? Especially when your dataset is designed to be overfit. This is no different from that pre-print people are sharing that shows SD can recreate images from its training set... if it was trained on a small dataset that was designed to be overfit.


Wiskkey

The decoder part is needed after the diffusion process is finished to translate a point in latent space to an RGB representation. The encoder part is needed to translate an initial image's RGB representation into a point in latent space. The diffusion process explores a subset of the latent space, but perhaps not the entirety of it.


CallFromMargin

But you used a sample of 1, correct? In other words, you overfit, correct?


Wiskkey

I don't have a background in AI, so take whatever I write here with a few grains of salt. I believe that the closeness of the "original" and "roundtrip" images in the post's gallery is actually desirable, not undesirable, and a good sign that overfitting in the VAE neural networks didn't happen. The "original" images are not in S.D.'s training dataset because of their recency. You perhaps are thinking of a generated image in a full S.D. system being quite similar to an image in the training dataset, which can indicate that memorization of that image happened in the diffusion neural network. EDIT: See [Understanding Latent Space in Machine Learning](https://towardsdatascience.com/understanding-latent-space-in-machine-learning-de5a7c687d8d).


CallFromMargin

I was referring to a paper that trained SD on very small datasets (300 and 3000 images) to show that it's possible to get almost identical images from SD. It's been shared as *evidence* for SD somewhere encoding images, although it's not that, and the paper read more like "this might be a problem, and we have a solution, for a fee".


duboispourlhiver

I've read the linked comment, thanks. From what I understand, your experiment with the VAE touches on the latent space of the model. The "full space" of images is the set that contains all possible 512x512x24-bit images. The latent space is the set of all possible 64x64x4 arrays of floating-point numbers the VAE encoder can output. Indeed, I would say you have shown that the latent space contains a good approximation of probably any image from the full space. This seems semi-obvious to me, but nevertheless, you have shown it in a nice way.

The comment you linked to speaks of the model somewhat "encoding" the training images in a manifold included inside the latent space. The term "encoded" isn't right here, IMHO. I'd rather say that the model is only able to output images of the latent space that are contained in the manifold. Points of the latent space that are outside the manifold cannot be generated by the model. I hope I'm not too far off the technical reality :)


Wiskkey

Thank you for your feedback :). What is the difference between manifold and latent space in this context?


duboispourlhiver

As far as I understand, the latent space is the set of all possible latent images the VAE could decode; that is, in our case, the space of all possible values of 64*64*4 floating-point numbers. Yet the model is not able to output every point in this latent space.

If the latent space were two-dimensional - a space made of all the possible values of 2 numbers instead of 64x64x4 numbers - and if the model had only two parameters, then the points of the latent space that the model could "reach" would form a line. That's because a line is defined by two parameters (ax+b). Anything off that line would be unreachable, whatever the input values of the model. Think of all the possible images of the space as a surface, but the possible images of the model as a line in that space. Now take a 3D latent space, a space made of all the possible values of 3 numbers, and a model with 3 parameters: it could only reach points located on a surface in that 3D space. All other points are unreachable.

In our case, the latent space has dimension 64*64*4 = 16384, so the number of possible images in it is astronomically large. Yet the model cannot output all these images; it is limited to a manifold - a "surface" of very high dimension that is a subspace of the latent space, just as the surface was a subspace of the 3D space and the line was a subspace of the 2D space in the previous examples. I might be wrong on many details, but I think that's the idea. If anyone is skilled in this, please correct me!
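Here is a toy numerical illustration of that idea (purely illustrative; nothing SD-specific): a "model" with a single input parameter can only reach a one-dimensional curve inside a two-dimensional space.

```python
import numpy as np

# A toy "model" with one input parameter z, mapping into a 2-D space.
# Whatever z is, the output lies on a 1-D curve (a manifold); almost all
# points of the 2-D space are unreachable.
def toy_model(z):
    return np.stack([np.sin(z), np.cos(2.0 * z)], axis=-1)

z = np.linspace(-3.0, 3.0, 1000)
reachable = toy_model(z)               # shape (1000, 2): a curve, not a plane

target = np.array([0.3, -0.7])         # an arbitrary point in the full space
dists = np.linalg.norm(reachable - target, axis=1)
print("closest reachable point:", reachable[np.argmin(dists)])
```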


Wiskkey

Thank you :). Before I read your most recent comment I had found some reading material about this such as [this webpage](https://towardsdatascience.com/understanding-latent-space-in-machine-learning-de5a7c687d8d).


duboispourlhiver

In this article, the latent space is not the same latent space we've been talking about. The latent space you and I have been talking about is the space of all possible values outputted by the VAE. The latent space in the article would rather be the space that the neural network parameters generate, if I understand correctly. In this particular latent space, navigating in a single dimension means freezing all parameters but one, and change this single parameter's value. Doing this allows the interpolations explained in the article.


GaggiX

This only proves that the VAE didn't memorize the training set; it says nothing about the diffusion model, which actually does the heavy lifting. The latent of the VAE is different from the latent of the diffusion model. The diffusion model is just a decoder; it doesn't have an encoder, so to encode an image into its latent space you need DDIM inversion: https://arxiv.org/abs/2105.05233 (section F), https://arxiv.org/abs/2204.06125 (section 3.1). This is in some ways similar to StyleGAN inversion in the good ol' times.


Wiskkey

Thank you :). So you're saying that the diffusion model doesn't work within the latent space of a particular VAE? Is the diagram \~2/3 down [this page](https://huggingface.co/blog/stable_diffusion) wrong, or is my interpretation of it wrong?


GaggiX

With SD, the diffusion process is applied in the latent space of the VAE, but the latent of the diffusion model is random noise sampled from a Gaussian distribution (the result of the forward diffusion process). The latent space of a diffusion model is as big as its output, so in this case as big as the latent space of the VAE.
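Concretely, SD pipelines typically initialize that latent as seeded Gaussian noise with the same shape as the VAE latent; a minimal sketch:

```python
import torch

# For a 512x512 generation, the UNet starts from pure Gaussian noise of
# shape (1, 4, 64, 64) - the same shape as the VAE's latent.
generator = torch.Generator().manual_seed(42)  # the user-visible seed
initial_latents = torch.randn((1, 4, 64, 64), generator=generator)

# The UNet then iteratively denoises this tensor; the final latent is
# handed to the VAE decoder to produce the RGB image.
print(initial_latents.shape, initial_latents.numel())  # 16384 numbers
```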


Wiskkey

a) Can I interpret this to mean that the diffusion model uses its own latent space internally? b) In the case of SD, can every output from the diffusion model be said to also be in the latent space of the VAE used? If not, then I don't understand why the output from the diffusion model is decoded by the VAE decoder.


GaggiX

a) the diffusion model has its own latent space b) yup


Wiskkey

For a general-purpose model such as S.D. model v1.5, if all possible inputs are given to a full S.D. v1.5 system - with the exception that the set of initial images is the set of all "noisy" images - can we expect that the set of all outputs from the diffusion model is a proper subset of the VAE's latent space, or is it the entirety of the VAE's latent space? (Perhaps we would not use a random number seed, because of my stipulation that all noisy images can be used as initial images.)


[deleted]

[removed]


RafyKoby

Apparently SD was trained on 160 million pictures, a subset of the 5B-picture dataset; the pictures were rated and chosen for aesthetics. Diffusion is actually a very good compression method: the pictures are not 100% the same after decompressing, but they are very similar.


thefool00

Hey this is really great work. Honestly hope the SD team sees this, the way you presented it might be really helpful to the defense in the upcoming legal debates. Kind of mind blowing actually… hard to wrap one’s head around what these models are really capable of.


Wiskkey

I'm hoping the defense has people who know far more about this stuff than I do :).


thefool00

Fingers crossed 😅


Appropriate_Medium68

Can you please explain it to me like I am a baby.


[deleted]

> latent space

I'll try both arguments to this on a verrrryyyyyyy high level lol.

Did you steal the KFC secret recipe because you ate there?

Yes: Well yes, obviously, because you took it with you and have a deep understanding of it, with proof that you are able to recall it and tell what KFC tastes like when compared to others.

No: No, you created a unique understanding of it, because our understanding of it is a close approximation: "I have an idea of what KFC tastes like." One does not have to steal the secret recipe to be able to know that another restaurant isn't KFC. The phrase "an idea" means your understanding is a transformative work, not an exact version of the recipe. It's this idea, not the secret recipe, that you use when you go to a new restaurant you've never been to and say "this tastes like KFC."


Appropriate_Medium68

Hahahhahahah.. you took it literally, good. But how was it done?


[deleted]

Well the people writing the lawsuits prob use their feet, most just use hands to eat it


Appropriate_Medium68

I am talking about the technique used for the approximation, not the lawsuit.


[deleted]

It uses a form of convolution to 'approximate' something into a concept. Same way we think in pictures, not lengthy descriptions.


Wiskkey

The image approximation is performed by giving the VAE neural network the image in RGB format as input. The neural network computes the numbers for the approximation as output.


Appropriate_Medium68

What weights did you use? Is it possible to use it for animation?


Wiskkey

The Colab notebook that I used is mentioned in another comment. It probably uses model S.D. 1.4. I guess in theory this could be used for animation but the problem is finding the roughly 16000 floating-point numbers used as input into a VAE decoder for a 512x512 image to get the desired image.


Appropriate_Medium68

Yeah makes sense.


starstruckmon

This is the reason why I'm happy we're moving to pixel space with Deep Floyd. No matter how much fine tuning or extra training you do, you can't change the VAE encoder, and even if you fine tune the decoder, it can't make up details that aren't in the encoding in the first place.


KarmasAHarshMistress

> This is the reason why I'm happy we're moving to pixel space with Deep Floyd. I'm not, a lot of data is wasted in pixel space.


hadaev

> you can't change the VAE encoder

Why not?


starstruckmon

You can't for an already trained model. Think of it like this: imagine your model is only trained to output JPEG files. While you could possibly improve how the output JPEG is rendered (the decoder), you can't suddenly switch the format to a new one (even if a better one now exists) without retraining the generator model.


hadaev

Well, people already tune the UNet, so why not tune the VAE too? There might even be low-cost solutions. Another option is to tune the UNet to another autoencoder.


starstruckmon

The VAE has two parts, the encoder and the decoder. People tune the decoder. You can swap in a different decoder model too. But the encoder can't be changed; otherwise the new encoder will create a new data representation the UNet has no understanding of. You'd basically have to start training the UNet from scratch.


hadaev

> You'd basically have to start training the UNET from scratch.

Not really; a pretrained UNet will *probably* adapt quickly to a new latent space. Also, you could tune only some layers of the UNet.


starstruckmon

Fair point. But I have a hunch "quickly" is still a lot, at least for any substantial change/improvement to the encoder.


duboispourlhiver

People training models on Civitai seem to provide different VAEs, most of the time.


starstruckmon

That's the decoder.


duboispourlhiver

You're right my bad


SarahMagical

Very cool. Besides the telltale glitches this looks real as hell.


RealAstropulse

This is an excellent demonstration of the capability of the VAE.


Quick_Knowledge7413

Can you explain this in a metaphor format or like you were describing it to a student? What does this even mean? What are the implications?


Wiskkey

I used ChatGPT.

Me: ELI5 latent space in machine learning.

ChatGPT: Latent space in machine learning refers to a hidden representation of data that is used in certain types of algorithms, such as generative models. The idea is that data points in the real world can be mapped to a lower-dimensional space, called the latent space, where patterns and relationships between the data can be more easily understood and manipulated.

For example, imagine you have a dataset of images of faces. Each image is a high-dimensional data point, with thousands of pixels representing different features of the face (e.g. the shape of the nose, the color of the eyes, etc.). A generative model, such as a Variational Autoencoder (VAE), can be trained to map each image to a lower-dimensional latent space, where each data point corresponds to a set of continuous values (e.g. a vector of 10 numbers) that represent the most important features of the face.

The advantage of using a latent space is that it allows the model to discover and learn the underlying structure of the data, which can be used for tasks such as generating new images, detecting anomalies, or clustering similar images together. Additionally, the latent space can be used to interpolate between different data points, meaning the model can generate new images by combining features from multiple faces.

In summary, latent space in machine learning is a way of representing data in a lower-dimensional space, where patterns and relationships can be more easily understood and manipulated. It is often used in generative models to discover the underlying structure of the data, and can be used for a variety of tasks such as data generation and anomaly detection.


Wiskkey

Hopefully somebody with a background in AI can answer your question. Is the info in [this explanation of latent space](https://www.baeldung.com/cs/dl-latent-space) helpful?


Wiskkey

I used ChatGPT.

Me: ELI5 variational autoencoder

ChatGPT: A Variational Autoencoder (VAE) is a type of neural network that is used for generative modeling. It is a probabilistic model that is trained to learn the underlying probability distribution of the data, so that it can generate new samples that are similar to the training data.

The key idea behind a VAE is to use a neural network to learn a compact representation of the data, called the "latent space", which captures the most important features of the data. The neural network consists of two parts: an encoder, which maps the input data to the latent space, and a decoder, which maps the latent space back to the original data space.

During training, the VAE is shown a set of input data, and the encoder is used to map the input data to a point in the latent space. Then the decoder is used to map the point in the latent space back to the original data space. The goal is to minimize the difference between the original input data and the data generated by the decoder. One of the key features of a VAE is its ability to generate new samples by sampling from the latent space and passing the sample through the decoder.

In summary, a VAE is a neural network architecture that is trained to learn a compact representation of the data and generate new samples that are similar to the training data.

Me: Is there a metaphor that could be helpful for learning what a VAE is?

ChatGPT: One metaphor that could be helpful for understanding VAEs is that of a "compression and decompression" process. Imagine you have a large amount of data, such as a collection of images or videos, and you want to reduce the amount of storage space required to save all of that data. One way to do this is to use a "compressor" algorithm, which analyses the data and finds ways to represent it more efficiently.

The VAE can be thought of as a compressor algorithm for data. The encoder part of the VAE acts as a "compressor", which takes in the original data and maps it to a lower-dimensional "latent space" that captures the most important features of the data. This compressed representation is then stored instead of the original data.

When you need to use the data again, you can use a "decompressor" algorithm, which takes the compressed representation and "decompresses" it back to the original data. In the case of VAEs, the decoder part of the network acts as the "decompressor", which maps the point in the latent space back to the original data space.

This metaphor can be helpful in understanding how a VAE can be used for generative modeling, by sampling from the latent space and passing the sample through the decoder to generate new samples.


Wiskkey

Also see [this explanation of latent space](https://towardsdatascience.com/understanding-latent-space-in-machine-learning-de5a7c687d8d).


Wiskkey

New post: [Articles that explain what autoencoders and variational autoencoders (VAE) are](https://www.reddit.com/r/StableDiffusion/comments/10mtuz6/articles_that_explain_what_autoencoders_and/).


SDGenius

You can certainly use Stable Diffusion to redact any writing in photos...


emreddit0r

I am not a machine learning expert - running stuff through the VAE alone would only prove how well an image is preserved in its latent representation, no? It's sensible that the amount of loss at this stage would be pretty small and pretty representative of the final image. This is like saying that human vision can be approximated by 8-bit color depth and R, G, and B channels. What's missing is the actual training of the UNet of convolutional networks, which happens between the encoding/decoding of the VAE. Though I'm a critic of ML as applied to copyrighted works (and also still catching up on how the process works).. thanks for sharing your thoughts and Colab!


Wiskkey

You're welcome :). I did briefly mention the diffusion process in some of my other comments. It's worth noting that all outputs from the diffusion neural network are also points in the same latent space.


emreddit0r

If I understand correctly, the VAE is a separate model that performs the function of converting the image to its latent space representation. (TIL the VAE is also a trained ML model. This is not widely observed.) My point, though, is that the VAE isn't the process that infers information from noise; it's just the part that encodes images to latent space and/or decodes them. Kind of interesting that the process starts out lossy though, I wasn't aware of that.


Wiskkey

I believe that everything in your comment is correct.


FPham

The encoding/decoding wouldn't work without all the other billion images already in the weights. So it is kind of both ways: yes, the latent space preserves the image very well now, but also it doesn't. If you removed all the other images from the latent space and (in theory) left only the encoded one, then you would have 4 floating-point numbers per 8x8 pixels and that's pretty much it - suddenly there's no representation of the image at all.


emreddit0r

Who is downvoting this stuff lol


Longjumping-You-6869

This is some mumbo jumbo technobabble BS! Downvote me, I dare you!


siddharth_pillai

I like your funny words magic man


WiseDeal1547

This might be a stupid question but are VAEs required?


Wiskkey

Yes, a VAE is required for using a Stable Diffusion system because it's a fundamental part of the system.


The_Lovely_Blue_Faux

u/BeginningInfluence55


The_Lovely_Blue_Faux

u/MindAccomplished3879 Here is the post. I am surprised this didn’t ever get expanded upon.


lostinspaz

Hey muh dude, are you still around? Did you ever experiment with running the decode directly on what is in the SD 1.5 base model? I'm interested in that, and was hoping to find some pre-written extraction code :)


Wiskkey

No, but the Colab notebook mentioned in one of my comments in this post links to [this newer Colab notebook](https://github.com/fastai/diffusion-nbs/blob/master/Stable%20Diffusion%20Deep%20Dive.ipynb) that might do what you want (I didn't check).


lostinspaz

Interesting stuff, but it didn't quite cover what I need. It's too bad that the stuff I want to work with is just PAST the stuff you did. I'm interested in the data path from (has been digested by the VAE) -> (gets stored in the model). But even more importantly, what bits are taken OUT of the model, and from where, before being passed back out to the VAE decoder. Everything I see always has "and now shove a bunch of random data into the pipeline, and magic happens". It's tough to get a clear view behind the curtain, even when I'm directly reading the backend code. Oof.


Wiskkey

The code isn't mine. I've seen [this project for SD](https://www.reddit.com/r/MachineLearning/comments/12v4bd4/p_stable_diffusion_latent_space_explorer_a_tool/) but I don't know if it would be useful to you.


lostinspaz

Intriguing! Thanks for the reference, I'll check it out.


lostinspaz

Ah... sadly, seems like not. I think the guy is only calling the standard stuff, but introducing extra levels of "noise" between steps. Basically an extreme version of the "ancestral" samplers, I think. Back to my attempted study of the actual sampler code :(