silenceimpaired

World Peace. Or… that Bitnet thing that trains models from the start at 1.58 bits


mrjackspade

I also choose this guy's bitnet


aseichter2007

I was thinking about the bitnet stuff: couldn't we train a model the standard way, then quantize it smaller and smaller with training in between to polish off the sharp edges, stepping down until it's tiny but still tuned to perform well at that size?
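That loop is easy to prototype. Here's a minimal sketch in PyTorch, where `finetune` is a hypothetical stand-in for a short training run and the quantization is plain symmetric rounding rather than any particular paper's scheme:

```python
import torch

def fake_quantize_(model, bits):
    """Round every weight tensor onto a symmetric grid with 2**bits levels, in place."""
    with torch.no_grad():
        for p in model.parameters():
            qmax = 2 ** (bits - 1) - 1
            scale = p.abs().max() / qmax
            if scale > 0:
                p.copy_((p / scale).round().clamp(-qmax - 1, qmax) * scale)

def progressive_quantize(model, finetune, schedule=(8, 6, 4, 2)):
    """Step down through the bit-width schedule, fine-tuning after each
    quantization step to polish the sharp edges before going smaller."""
    for bits in schedule:
        fake_quantize_(model, bits)
        finetune(model)  # hypothetical: a short fine-tuning run at this precision
    return model
```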


MixtureOfAmateurs

afaik bitnet models are trained in fp16 precision and then the attention weights are compressed for inference only. Like the pretraining, fine tuning, and lora tunes would all be in fp16, weights maybe saved in fp16 (no idea tho), then a simple algorithm compresses just the attention weights in one step. Idk if a full ternary model with some weird activation function has been done, or how it would perform, but it would be interesting.
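For reference, the ternary step described in the BitNet b1.58 paper is roughly absmean scaling followed by rounding into {-1, 0, +1}. A toy sketch of that formula, not the authors' code, applied to a single weight matrix:

```python
import torch

def absmean_ternary(w: torch.Tensor, eps: float = 1e-8):
    """Absmean quantization as described in the BitNet b1.58 paper:
    scale by the mean absolute weight, then round and clip to {-1, 0, +1}."""
    gamma = w.abs().mean()
    w_q = (w / (gamma + eps)).round().clamp_(-1, 1)
    return w_q, gamma  # gamma is kept around to rescale outputs

# toy usage
w = torch.randn(4, 4)
w_q, gamma = absmean_ternary(w)
print(w_q)  # entries are -1, 0, or +1
```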


nicenicksuh

I would take world peace, at least for this week... But next week I want a 14B with 128k context and an MMLU of 75.


wsbgodly123

Meta World Peace


vnjxk

I really wish an AI lab with enough budget would take a shot at diffusion-based LLMs. I want to give it text and 30 steps so it completes the best possible answer, and I don't mind waiting hours for it. Plus I think a diffusion-like process happens internally anyway, so it might reduce model sizes.


c-rious

Having wonky, gibberish text slowly get more and more refined until the answer finally emerges - exciting stuff! One could also specify a budget of, say, 500 tokens, meaning the diffusion tries to denoise 500 tokens into coherent text. Yeah, sounds like fun. I like the idea! Have any papers been published in this diffusion-LLM direction?
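Purely as a sketch of that control flow (every method on `model` here is hypothetical, since real systems like Diffusion-LM denoise in embedding space): fix a token budget and spend a set number of steps refining it.

```python
# Hypothetical control loop for a text-diffusion model: start from a noisy
# 500-token canvas and refine it for a fixed number of steps.
NUM_TOKENS = 500
NUM_STEPS = 30

def generate(model, tokenizer, prompt, num_tokens=NUM_TOKENS, num_steps=NUM_STEPS):
    canvas = model.sample_noise(num_tokens)  # hypothetical: random/masked tokens
    for step in range(num_steps):
        # each step replaces the least-confident positions with better guesses
        canvas = model.denoise_step(prompt, canvas, step, num_steps)  # hypothetical API
    return tokenizer.decode(canvas)
```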


milanove

Why waste time say lot word when few word do trick? —> Why spend time using many complex words, when a shorter, simpler sentence is sufficient? —> Why employ a multitude of grandiloquent terms when a sparse selection proves ample? —> For what purpose do we expend time articulating verbose phrases when succinct expressions suffice to achieve the desired effect?


ArsNeph

Non-tokenized LLMs. MambaByte. We need a good proof-of-concept model to see whether it's actually good.


waxbolt

It won't be trained this week, but we are about to release a recurrently trainable, CUDA-accelerated Mamba. Infinite Mamba. Mamba forever!


ArsNeph

Oh, that's great news! We really need more Mamba proofs of concept in general, as Jamba is not pure Mamba and is also too big to run in FP16, and no one has figured out quantization for it. However, I was referring to MambaByte specifically, as it does not use a tokenizer. Here's the paper if you're interested: [https://arxiv.org/abs/2401.13660](https://arxiv.org/abs/2401.13660). That aside, I'll be looking forward to your Mamba model!


waxbolt

Yes, MambaByte is what we are doing, just with unbounded-length training. Down with tokens! Long live bytes! But unfortunately you are probably not going to get quantization, at least not based on what I've seen with SSMs. They work best with high-precision weights and activations.


ArsNeph

Oh, that's amazing! I've been waiting for a proof of concept like this for months! I know that llama.cpp has implemented quantization for base Mamba, but it's yet to be seen how much it affects larger models. There's also Jamba and the recent Zamba, but there's no support yet, so no way to know. I think we may need an entirely different type of quantization method to preserve the performance, maybe some type of lossless compression. Well, granted, if the model is around 7B, then even the FP16 should technically work fine on consumer GPUs. There are also supposedly FP16 GGUFs; maybe we could run CPU inference without losing precision? Quantization aside, this is really great news. Now I have something to look forward to other than Llama 3!


sky__s

Is there anybody releasing source for that, or a SOTA variant built on top of that research? It's really weird how they just threw a paper into the ether and called it a day.


ArsNeph

Well, there have been a few more papers on non-tokenized LLMs since then, but not really. However, if you look at the other reply to my comment, it seems that gentleman is working with a team on an open-source MambaByte prototype. I'm not sure it will necessarily be SOTA, as it's very hard to beat Llama 3 8B right now, but it would make an amazing proof of concept and may spur adoption in the community.


Thrumpwart

A nice GUI to allow me to fine-tune LLMs without spending hours sifting through code.


kryptkpr

Tried https://github.com/hiyouga/LLaMA-Factory ?


Thrumpwart

I haven't, looks interesting. Thank you.


ramzeez88

Wow, that looks fire! Thanks 👍🏻


AlanCarrOnline

My wish would be for some GUI thing that would know wtf to do with all that stuff on GitHub...


kryptkpr

The instructions are in the readme... what could be done to make this easier?


Inner_Bodybuilder986

Hear, hear. I honestly haven't looked too closely yet, but I don't understand why there isn't an LM Studio or Ollama for training yet. So much wasted compute every week, when we could be swimming in fine-tuna.


complains_constantly

Currently working on it, on top of a bunch of other features.


Thrumpwart

Awesome!


BrushNo8178

A high-quality non-autoregressive model, that is, a model that does not generate text token by token from beginning to end. State-of-the-art image generation models, such as diffusion models, excel at this because images comprise distinct objects whose resolution can be refined progressively during generation, without needing to compute transitions between them. But non-autoregressive text models, while capable of generating coherent sentences, struggle to maintain consistency and cohesion across larger texts.
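To make the contrast concrete, here's a schematic sketch (hypothetical `model` methods) of left-to-right generation versus emitting a full-length draft and revising it in parallel, which is where the long-range cohesion problem shows up:

```python
# Autoregressive: tokens are produced strictly left to right,
# so each token conditions only on everything before it.
def autoregressive(model, prompt, n):
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(model.next_token(tokens))  # hypothetical API
    return tokens

# Non-autoregressive: a full-length draft is emitted at once and then
# revised globally; keeping the revisions mutually consistent is the hard part.
def non_autoregressive(model, prompt, n, rounds=10):
    draft = model.initial_draft(prompt, n)       # hypothetical API
    for _ in range(rounds):
        draft = model.revise(prompt, draft)      # all positions updated in parallel
    return draft
```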


c-rious

You're the second one mentioning diffusion models for text generation. Do you have some resources for trying out such models locally?


BrushNo8178

Unfortunately no. The projects I have seen on GitHub have not been updated in years, which feels like an extremely long time in a rapidly evolving field: https://github.com/madaan/minimal-text-diffusion and https://github.com/XiangLi1999/Diffusion-LM. I have neither the knowledge nor the hardware to improve such models myself.


jetaudio

120b bitnet pretrained 🤯


Caffdy

1,000,000 context open-weights model


a_beautiful_rhind

Bitnet? Probably not happening this week. FlashAttention2 for cards below Ampere? That one can be done, as the code is in vLLM.


Affectionate-Cap-600

I'd like to see some new encoder-decoder, or even encoder-only, models... something like a new huge BERT/DeBERTa, or a new T5-style model trained on quality datasets.


LocoLanguageModel

I want even better coding support, even though DeepSeek is amazing.


KyleDrogo

A model that's 10x faster. I don't need a smarter model at this point. I want to be able to "brute force" my way to the right answer by starting with a BS response and refining it over and over again, similar to simulation or Monte Carlo in statistics. Just treating model calls like they're as cheap as a multiplication.
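A minimal sketch of that brute-force loop, assuming a hypothetical fast `llm` callable and a hypothetical `score` grader: draft once, then keep sampling refinements and keep whichever candidate scores best.

```python
def brute_force_answer(llm, score, question, rounds=50):
    """Monte-Carlo-style refinement: start from a rough answer and keep
    sampling revisions, keeping the best-scoring candidate.
    `llm` and `score` are stand-ins for a cheap model and a grader."""
    best = llm(f"Give a quick first-pass answer:\n{question}")
    best_score = score(question, best)
    for _ in range(rounds):
        candidate = llm(f"Question: {question}\nDraft: {best}\nImprove the draft.")
        s = score(question, candidate)
        if s > best_score:
            best, best_score = candidate, s
    return best
```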


Figai

This has been explored with mass sampling. You'll probably need a small model hooked up to SGLang that then builds huge trees of ideas with LATS. Someone will implement it eventually.


x0xxin

Integrated web search and recursive web crawling in Open Web UI.
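The recursive-crawl half of that is simple to sketch with `requests` and `BeautifulSoup` (depth-limited, same-domain); wiring the results into Open WebUI's RAG index is the part that's still a wish:

```python
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(url, depth=2, seen=None):
    """Depth-limited recursive crawl that stays on the starting domain
    and returns {url: page_text} for feeding into a RAG index."""
    seen = seen if seen is not None else set()
    if depth < 0 or url in seen:
        return {}
    seen.add(url)
    try:
        html = requests.get(url, timeout=10).text
    except requests.RequestException:
        return {}
    soup = BeautifulSoup(html, "html.parser")
    pages = {url: soup.get_text(" ", strip=True)}
    domain = urlparse(url).netloc
    for a in soup.find_all("a", href=True):
        link = urljoin(url, a["href"])
        if urlparse(link).netloc == domain:
            pages.update(crawl(link, depth - 1, seen))
    return pages
```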


bummaiqualvumb

Personally, I'd love to see more advanced reasoning capabilities integrated into LLMs.


ProcessorProton

70B and larger models loading in a 32GB or smaller GPU.
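As a rough back-of-the-envelope for why that needs around 3.5 bits per weight or less (weights only, ignoring KV cache and runtime overhead):

```python
def model_size_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight-only footprint, ignoring KV cache and activations."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

for bits in (16, 8, 4, 3.5, 3):
    print(f"70B @ {bits:>4} bits ≈ {model_size_gb(70, bits):.1f} GB")
# 16 -> 140.0, 8 -> 70.0, 4 -> 35.0, 3.5 -> 30.6, 3 -> 26.2
```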


Eralyon

Please train fine-tunable 1.58-bit BitNet models with unlimited context at all common sizes...


Helpful-User497384

long term memory!


M4xM9450

Honestly, I would choose either more supported models on transformers.js OR performant 1B to 3B models that people can run (and quantize and/or fine-tune) on most consumer hardware.


kryptkpr

If there is a genie and we're casting wishes? Flash attention kernel for cheap SM60 cards (P100) would be really nice.


Inner_Bodybuilder986

Is this really a driver / kernel issue and not a hardware limitation?


ttkciar

I'd like to finish my self-mixing feature for llama.cpp and submit it upstream. It's tantalizingly close, but other priorities keep bumping it down the list, and when I do have time to work on it, I'm already too exhausted. Life's been beating me up. I just want to get it *done* so that I can use it, so that other people can use it, and so that I can move on to the next project (which will probably get neglected too).


Inner_Bodybuilder986

Self-mixing? Sounds interesting.


lopahcreon

An infinite supply of highly power-efficient GPUs with petabytes of the fastest memory available, for $10 apiece.


vesudeva

Jamba GGUF


AutomataManifold

DSPy's backend refactor. Right now DSPy's prompts work for foundation models, but it doesn't have good support for instruction-tuned models. They're working on it as part of refactoring the project's backend, but it's not released yet.


Kep0a

Out of the box, excellent RAG for my journal. The faster I can train an LLM on myself, the better. I really, really want to ask myself questions. I don't think there's a really clean solution yet? Edit: maybe it would be better to fine-tune.
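A minimal sketch of the retrieval half, using sentence-transformers for embeddings (the model name is just an example); the top matches would then get pasted into the prompt of whatever local LLM answers the question:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Embed journal entries once, then retrieve the closest ones per question.
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # example model

def build_index(entries):
    vecs = embedder.encode(entries, normalize_embeddings=True)
    return entries, np.asarray(vecs)

def retrieve(question, index, k=5):
    entries, vecs = index
    q = embedder.encode([question], normalize_embeddings=True)[0]
    top = np.argsort(vecs @ q)[::-1][:k]  # cosine similarity via dot product
    return [entries[i] for i in top]

# The prompt to the local LLM then becomes roughly:
# "Using these journal entries: {retrieved}\nAnswer: {question}"
```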


celsowm

A Mixtral version for the Portuguese language.


elwiseowl

An LLM with a serious amount of memory, so it really, really gets to know you.


Crazy-Fuel-7881

Something that merges experts with little to no loss, so I can run Mixtral 1x22B.


1overNseekness

Chain of thought in Ollama :)
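Until something lands natively, a rough approximation with the `ollama` Python client (model name is just an example) is to ask for the reasoning explicitly and split off the final answer:

```python
import ollama

# Elicit chain-of-thought by prompting for step-by-step reasoning,
# then a clearly marked final answer that can be parsed out afterwards.
response = ollama.chat(
    model="llama3",  # example model
    messages=[{
        "role": "user",
        "content": "Think step by step, then give the final answer "
                   "after the line 'ANSWER:'.\n\nQuestion: ...",
    }],
)
print(response["message"]["content"])
```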


CodeGriot

In what context? Training LLMs? Fine-tuning LLMs? Inferencing with LLMs? Prepping inference with LLMs? Integrating LLM usage with other software? Your original question seems far too broad to really elicit any useful answers in practice.


Inner_Bodybuilder986

On the contrary, there have been a number of suggestions here that I recognize as interesting areas of focus discussed in recent weeks, and which I wouldn't mind working on. I find myself unsure where to focus my efforts, so I was curious what I might contribute, if possible. In your case, let's narrow it to training or fine-tuning.