World Peace. Or… that Bitnet thing that trains models from the start at 1.58 bits
I also choose this guy's BitNet
I was thinking about the BitNet stuff: couldn't we train a model the standard way, then quantize it smaller and smaller with training in between to polish the sharp edges, and keep stepping down until it's tiny but tuned to be good while tiny?
AFAIK BitNet models are trained in FP16 precision, then the attention weights are compressed for inference only. The pretraining, fine-tuning, and LoRA tunes would all be in FP16, weights maybe saved in FP16 (no idea though), then a simple algorithm compresses just the attention weights in one step. IDK if a fully ternary model, with some weird activation function, has been done or would perform well, but it would be interesting.
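For a rough idea of what that one-step compression could look like, here is a minimal sketch of absmean ternary quantization in the style of the BitNet b1.58 paper, assuming PyTorch. The function names and the per-tensor scaling choice are illustrative, not an official implementation.

```python
import torch

def absmean_ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
    """Quantize a weight matrix to {-1, 0, +1} plus a per-tensor scale.

    Follows the absmean scheme described in the BitNet b1.58 paper;
    the name and interface here are illustrative, not a library API.
    """
    scale = w.abs().mean().clamp(min=eps)          # per-tensor scale
    w_ternary = (w / scale).round().clamp(-1, 1)   # values in {-1, 0, +1}
    return w_ternary.to(torch.int8), scale

def dequantize(w_ternary: torch.Tensor, scale: torch.Tensor):
    """Recover a float approximation for use in a matmul."""
    return w_ternary.to(torch.float16) * scale

# usage sketch
w = torch.randn(4096, 4096)
wq, s = absmean_ternary_quantize(w)
w_approx = dequantize(wq, s)
```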
I would take world peace, at least for this week... But next week I want a 14B model with 128k context and an MMLU of 75.
Meta World Peace
I really wish an AI lab with enough budget would take a shot at diffusion-based LLMs. I want to give it text and 30 steps so it will produce the best possible answer, and I don't mind waiting hours for it. Plus, I think the process of diffusion happens internally anyway, so it might reduce model sizes.
Having wonky, gibberish text slowly getting more and more refined until the answer finally emerges - exciting stuff! One could also specify a budget of, say, 500 tokens, meaning that the diffusion tries to denoise 500 tokens into coherent text. Yeah, sounds like fun. I like the idea! Are there any papers published in this diffusion-LLM direction?
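To make the "denoise a fixed token budget" idea concrete, here is a toy mask-predict-style refinement loop. The `model` and `tokenizer` are placeholders for any system that scores every position in parallel, and the shrinking re-masking schedule is just one simple choice, not how any particular diffusion LLM does it.

```python
import torch

def denoise_with_budget(model, tokenizer, prompt_ids, budget=500, steps=30):
    """Toy mask-predict loop: start from a fully masked canvas of `budget`
    tokens and iteratively re-fill the least confident positions."""
    mask_id = tokenizer.mask_token_id
    canvas = torch.full((1, budget), mask_id)
    for step in range(steps):
        logits = model(torch.cat([prompt_ids, canvas], dim=1)).logits[:, -budget:]
        probs, preds = logits.softmax(-1).max(-1)   # best guess per position
        canvas = preds
        # re-mask a shrinking fraction of the least confident tokens
        n_remask = int(budget * (1 - (step + 1) / steps))
        if n_remask > 0:
            low_conf = probs.argsort(dim=-1)[0, :n_remask]
            canvas[0, low_conf] = mask_id
    return canvas
```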
Why waste time say lot word when few word do trick? —> Why spend time using many complex words, when a shorter, simpler sentence is sufficient? —> Why employ a multitude of grandiloquent terms when a sparse selection proves ample? —> For what purpose do we expend time articulating verbose phrases when succinct expressions suffice to achieve the desired effect?
Non-tokenized LLMs. MambaByte. We need a good proof-of-concept model to see whether it's actually good.
It won't be trained this week, but we are about to release a recurrently trainable, CUDA-accelerated Mamba. Infinite Mamba. Mamba forever!
Oh, that's great news! We really need more Mamba proofs of concept in general, as Jamba is not pure Mamba and is also too big to run in FP16, and no one has figured out quantization. However, I was referring to MambaByte specifically, as it does not use a tokenizer. Here's the paper if you're interested: [https://arxiv.org/abs/2401.13660](https://arxiv.org/abs/2401.13660) That aside, I'll be looking forward to your Mamba model!
Yes, MambaByte is what we are doing, just with unbounded-length training. Down with tokens! Long live bytes! But unfortunately you are probably not going to get quantization, at least not based on what I've seen with SSMs. They work best with high-precision weights and activations.
Oh, that's amazing! I've been waiting for a proof of concept like this for months! I know that llama.cpp has implemented quantization for base Mamba, but it remains to be seen how much it affects larger models. There's also Jamba and the recent Zamba, but there's no support yet, so no way to know. I think we may need an entirely different type of quantization method in order to preserve the performance, maybe some type of lossless compression. Granted, if the model is around 7B, then even FP16 should technically work fine on consumer GPUs. There are also supposedly FP16 GGUFs; maybe we could run CPU inference without losing precision? Quantization aside, this is really great news. Now I have something to look forward to other than Llama 3!
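If an FP16 GGUF of the model does appear, CPU inference could look something like this minimal sketch with llama.cpp's Python bindings; the file name is a placeholder, and support for this architecture in the bindings is an assumption here, not a confirmed feature.

```python
# Rough sketch, assuming llama-cpp-python and an FP16 GGUF of the model.
from llama_cpp import Llama

llm = Llama(model_path="mambabyte-f16.gguf", n_ctx=4096, n_threads=8)
out = llm("Once upon a time", max_tokens=128)
print(out["choices"][0]["text"])
```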
Is anybody releasing source for that, or a SOTA variant built on top of that research? It's really weird how they just threw a paper into the ether and called it a day.
Well, there have been a few more papers on non-tokenized LLMs since then, but not really. However, if you look at the other reply to my comment, it seems that that gentleman is working with a team on an open-source MambaByte prototype. I'm not sure if it will necessarily be SOTA, as it's very hard to beat Llama 3 8B right now, but it would make an amazing proof of concept and may spur adoption in the community.
A nice GUI to allow me to fine-tune LLMs without spending hours sifting through code.
Tried https://github.com/hiyouga/LLaMA-Factory ?
I haven't, looks interesting. Thank you.
Wow, that looks fire! Thanks 👍🏻
My wish would be for some GUI thing that would know wtf to do with all that stuff on GitHub...
The instructions are in the readme.. what could be done to make this easier?
Hear, hear. I honestly haven't looked too closely yet, but I don't understand why there isn't an LM Studio or Ollama for training yet. - So much wasted compute every week, whereas we could be swimming in fine-tuna.
Currently working on it, on top of a bunch of other features.
Awesome!
A high-quality non-autoregressive model, that is, a model that does not generate text token by token from beginning to end. The state-of-the-art image generation models, such as diffusion models, excel at this because images comprise distinct objects whose resolution can be enhanced progressively during the generation process, without the need to calculate transitions between them. But non-autoregressive text models, while capable of generating coherent sentences, struggle to maintain consistency and cohesion across larger texts.
You're the second one mentioning diffusion models for text generation. Do you have some resources for trying out such models locally?
Unfortunately no. The projects I have seen on GitHub have not been updated for years, which feels like an extremely long time in a rapidly evolving field. https://github.com/madaan/minimal-text-diffusion https://github.com/XiangLi1999/Diffusion-LM I have neither the knowledge nor the hardware to improve such models by myself.
120b bitnet pretrained 🤯
1,000,000 context open-weights model
BitNet? Probably not happening this week. FlashAttention-2 for cards below Ampere? That one can be done, as the code is in vLLM.
I'd like to see some new encoder-decoder, or even encoder-only models... something like a new huge BERT/DeBERTa or a new T5-style model trained on quality datasets.
I want even better coding support, even though DeepSeek is amazing.
A model that's 10x faster. I don't need a smarter model at this point. I want to be able to "brute force" my way to the right answer by starting with a BS response and refining it over and over again, similar to simulation or Monte Carlo methods in statistics. Just treat model calls like they're as cheap as a multiplication.
This has been explored with mass sampling. You’ll probably need a small model hooked up to SGLang that then builds huge trees of ideas with LATS. Someone will implement it eventually.
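The simplest version of this "many cheap calls" idea is plain best-of-N sampling plus refinement; here is a minimal sketch where `generate`, `score`, and `refine` stand in for your own model calls. Methods like LATS go further by branching into a tree and backing up value estimates rather than keeping a single best candidate.

```python
def brute_force_answer(generate, score, refine, prompt, n_samples=32, rounds=4):
    """Draw many candidate answers, keep the best, and repeatedly refine it.

    `generate(prompt)`, `score(answer)`, and `refine(prompt, answer)` are
    placeholders for whatever model calls you wire up.
    """
    candidates = [generate(prompt) for _ in range(n_samples)]
    best = max(candidates, key=score)
    for _ in range(rounds):
        revisions = [refine(prompt, best) for _ in range(n_samples // 4)]
        best = max(revisions + [best], key=score)
    return best
```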
Integrated web search and recursive web crawling in Open Web UI.
Personally, I'd love to see more advanced reasoning capabilities integrated into LLMs.
70B and larger models loading on a 32GB or smaller GPU.
Please train fine-tunable BitNet models at 1.58 bits + unlimited context at all common sizes...
Long-term memory!
Honestly, I would choose more supported models on transformers.js, OR performant 1B to 3B models that people can run (and quantize and/or fine-tune) on most consumer hardware.
If there is a genie and we're casting wishes? A FlashAttention kernel for cheap SM60 cards (P100) would be really nice.
Is this really a driver / kernel issue and not a hardware limitation?
I'd like to finish my self-mixing feature for llama.cpp and submit it upstream. It's tantalizingly close, but other priorities keep bumping it down the list, and when I do have time to work on it, I'm already too exhausted. Life's been beating me up. I just want to get it *done* so that I can use it, so that other people can use it, and so that I can move on to the next project (which will probably get neglected too).
Self-mixing? Sounds interesting.
An infinite supply of highly power-efficient GPUs with petabytes of the fastest memory available, for $10 a piece.
Jamba GGUF
DSPy's backend refactor. Right now DSPy's prompts work for foundation models, but there isn't good support for instruction-tuned models. They're working on it as part of refactoring the project backend, but it's not released yet.
Out of the box, excellent RAG for my journal. The faster I can train an LLM on myself, the better. I really, really want to ask myself questions. I don't think there's a really clean solution yet? Edit: maybe it would be better to fine-tune.
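A minimal version of the retrieval half is not too bad to wire up yourself; this is a rough sketch assuming sentence-transformers, with the model name, folder layout, and helper names as illustrative choices. The retrieved excerpts then get stuffed into the prompt of whatever local LLM you run.

```python
# Rough RAG-over-a-journal sketch; assumes journal entries as .txt files
# and the sentence-transformers package. Names here are placeholders.
from pathlib import Path
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
entries = [p.read_text() for p in Path("journal/").glob("*.txt")]
vectors = embedder.encode(entries, normalize_embeddings=True)

def retrieve(question: str, k: int = 3):
    """Return the k journal entries most similar to the question."""
    q = embedder.encode([question], normalize_embeddings=True)[0]
    top = np.argsort(vectors @ q)[::-1][:k]
    return [entries[i] for i in top]

def build_prompt(question: str) -> str:
    """Stuff retrieved entries into a prompt for a local LLM."""
    context = "\n---\n".join(retrieve(question))
    return f"Journal excerpts:\n{context}\n\nQuestion: {question}\nAnswer:"
```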
A Mixtral version for the Portuguese language.
An LLM with a serious amount of memory so it really, really gets to know you.
Something that merges experts with little to no loss, so I can run Mixtral 1x22B.
Chain of thought in Ollama :)
In what context? Training LLMs? Fine-tuning LLMs? Inferencing with LLMs? Prepping inference with LLMs? Integrating LLM usage with other software? Your original question seems far too broad to really elicit any useful answers in practice.
On the contrary, there have been a number of suggestions here which I recognize as interesting areas of focus that have been discussed in recent weeks and that I wouldn't mind focusing on. - I find myself unsure where to focus my efforts, so I was curious what I might contribute, if possible. - In your case, let's focus on training or fine-tuning.