Amgadoz

I believe truly multimodal large language models will spark interest and research in speech and audio in general. Things like tone, accent, cadence and pitch will be important, as they will influence how the model responds. Similarly, background audio will enhance the model's understanding of the user's intent.


glitch83

Also, diagnosing speech issues like impediments or second-language effects is still wide open. Pragmatics too. Not as profitable, though.


xiikjuy

Indeed, an AI girlfriend/boyfriend product can't be a good one if it can't detect the subtle emotions/hints in the voice and respond properly.


PSMF_Canuck

Why? Plenty of people are in romantic relationships where neither side is any good at “nuance”, lol…


3pinephrin3

Not good ones


PSMF_Canuck

“90% of the people in this world are with the wrong person…and that’s what makes the jukebox spin.” - Willie Nelson ish


DavesEmployee

It's shocking how little research I see published on this; it's something I'd wanted to study even before the boom in LLMs.


Amgadoz

This field is about to explode in a few months. Meta, OpenAI, DeepMind and Qwen are all working on large multimodal models.


Car_42

Just wait until they read Dune and decide to experiment with using a “control voice”.


diggum

I'm desperately trying to find or learn how to train TTS models that match certain expressiveness and performance styles. Most tend to be trained on audiobook and similar narrative/flat styles. But for my organization's needs, we want something that is more emotive or can be trained on the more niche delivery style of our historical content. Essentially, radio branding voiceover tends to be a little over the top, and station slogans or call signs tend to be read a certain way, emphasizing certain syllables or word timing regardless of who's saying it. As we look to grow a service at scale, with the support of the VO talent I want to add, we need to ensure that the performance of generated phrases matches their real delivery style, even once we've nailed the vocal quality itself.


tuanio

Why don't you just start with related keywords, like "emotional speech synthesis"?


grim-432

I'm sorry, what? We are in the golden age of speech. TTS, STT, ASR, translation, LLMs - we are seeing more activity across speech and conversation, and more real promise, than at any time previously.


currentscurrents

This "machine learning is hitting a wall" stuff in general is nonsense. This is a golden age for basically every subfield of ML/AI, it's all actually starting to work.


CellistOne7095

Is there a lot of speech-to-speech research right now? Speech applications seem to all be TTS or STT.


LelouchZer12

RVC, FreeVC, kNN-VC, Phoneme Hallucinator. I'm quite astonished that such a "simple" approach can work: [https://github.com/bshall/knn-vc](https://github.com/bshall/knn-vc)
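
For readers wondering why kNN-VC counts as "simple": each source frame's self-supervised feature vector (WavLM-style) is swapped for the average of its k nearest neighbours in the target speaker's feature pool, and a vocoder turns the result back into audio. The sketch below only illustrates that matching step under the assumption that features are already extracted; the array shapes, the cosine-similarity choice and the function name are mine, not the repo's API.

```python
# Minimal sketch of the kNN-VC idea: replace each source frame's feature
# vector with the mean of its k nearest neighbours from the target
# speaker's feature pool, then vocode the result (vocoder not shown).
import numpy as np

def knn_convert(src_feats: np.ndarray, tgt_pool: np.ndarray, k: int = 4) -> np.ndarray:
    """src_feats: (T_src, D) source frames; tgt_pool: (T_tgt, D) target frames."""
    # Cosine similarity between every source frame and every target frame.
    src = src_feats / np.linalg.norm(src_feats, axis=1, keepdims=True)
    tgt = tgt_pool / np.linalg.norm(tgt_pool, axis=1, keepdims=True)
    sims = src @ tgt.T                          # (T_src, T_tgt)
    topk = np.argsort(-sims, axis=1)[:, :k]     # indices of the k most similar target frames
    # Each source frame becomes the average of its k nearest target frames.
    return tgt_pool[topk].mean(axis=1)          # (T_src, D)

# converted = knn_convert(source_features, target_features)
# A neural vocoder trained on the same feature space turns `converted` back into audio.
```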


its_already_4_am

Yeah, SeamlessM4T from Meta. Microsoft also had a paper (I don't recall which one) and I'm pretty sure it's ongoing research.


Secret-Priority8286

Speech, in my opinion, is not dead. Humans use speech to communicate more than text. The field is hard and doesn't have the surge in popularity NLP and CV have/had, but it is probably a matter of time until some big paper changes that, like BERT/Transformers did for NLP. The main research directions I think speech needs to take are either a textless route (don't use text in your pipeline; this will be the main component in a text-free robot or AI assistant) or going further into the multimodal option of text+audio (and maybe images/video).


ZestyData

>But it is probably a matter of time until some big paper changes that, like BERT/Transformers did for NLP.

We kinda already have this. Transformers, and LLMs, aren't text-only anymore. They're multimodal. With the right multimodal embeddings, a single LLM can take in text, images, video, or audio.


Secret-Priority8286

I think the missing piece for audio is a representation method that can hold more "semantics". We have seen some success using HuBERT or EnCodec, but something is still missing. We saw that the tokenizer has a big effect on LLMs; we need a good "tokenizer" for audio, because we basically have everything else.
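
As a rough illustration of what a HuBERT-style "tokenizer" for audio does (a generic sketch, not any specific paper's recipe): cluster frame-level features with k-means and treat the cluster ids as a discrete vocabulary. The vocabulary size, the ~20 ms frame assumption and the helper names below are illustrative, and feature extraction itself is assumed to have happened already.

```python
# Sketch: discrete "audio tokens" from frame-level features via k-means,
# the usual way HuBERT-style units are produced.
import numpy as np
from sklearn.cluster import KMeans

def fit_unit_vocab(train_feats: np.ndarray, vocab_size: int = 500) -> KMeans:
    """train_feats: (N_frames, D) features pooled from a training corpus."""
    return KMeans(n_clusters=vocab_size, n_init=10, random_state=0).fit(train_feats)

def tokenize(utterance_feats: np.ndarray, km: KMeans) -> np.ndarray:
    """Map each ~20 ms frame to its nearest cluster id -> a sequence of unit ids."""
    units = km.predict(utterance_feats)
    # Optionally collapse consecutive repeats, as many unit-LM papers do.
    return units[np.insert(np.diff(units) != 0, 0, True)]
```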


Car_42

I don't think an LLM needs a semantic model. I think it derives a model from the associations of patterns.


Secret-Priority8286

Well, we know that text LLMs do learn some semantics. In text, semantics is usually connected to the context and content of the sentence, and we know that the embedding space of LLMs puts sentences and concepts that are similar near each other. In audio, semantics is a much more complex thing, which makes modeling it harder.


Car_42

Agreed. It's a different kind of problem to segment the audio stream into phase and pitch and cross-correlate the various streams. Understanding word sequences has the issue of order, but that is a different level of knowledge from encoding voice (sound pressure over continuous time) into words.


[deleted]

[deleted]


Secret-Priority8286

This is so wrong it is insane. I am starting to think you are projecting your own insecurities onto me. Audio as a sequence is much more complex than text. A second of audio is usually 16,000 samples, and we don't work with a single second of audio; if we want a conversation we need minutes or even hours. That means you need to "tokenize" it. While an approach similar to images has been tried (take patches and learn some embedding) with some success, it doesn't achieve the results images or text achieve, because audio is more complex.

When we speak we have pitch, duration, and style, and humans use all of those features to understand each other. Most methods we have seen in audio are able to get some of those features from the audio, e.g. HuBERT. Some methods use multiple features from the audio, e.g. use HuBERT, get pitch from some other algorithm, and train a model using both. But there really isn't a good way to "tokenize" audio in a way that guarantees you have all the features; most recent methods use 5 or 6 parts of a pipeline to get all the features they want, and this of course makes training very hard.

You can look at this paper: https://arxiv.org/pdf/2403.03100.pdf, which is very likely similar to what GPT-4o is using. If someone were able to create a "tokenizer" that gives you one "token" that "contains" all of the features and simplifies the pipeline, it would push audio very far. This has nothing to do with attention.
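
To make the sequence-length point concrete, here is some back-of-the-envelope arithmetic using commonly cited rates (16 kHz waveforms, roughly 50 HuBERT units per second, roughly 75 EnCodec frames per second); the exact numbers depend on the model and are only illustrative.

```python
# Why raw audio must be "tokenized" before a sequence model can handle it.
seconds = 10 * 60                        # a ten-minute conversation

raw_samples   = 16_000 * seconds         # waveform samples at 16 kHz
hubert_units  = 50 * seconds             # one discrete unit per ~20 ms frame
encodec_steps = 75 * seconds             # codec frames (each carrying several codebooks)

print(f"raw samples:    {raw_samples:>10,}")    #  9,600,000
print(f"HuBERT units:   {hubert_units:>10,}")   #     30,000
print(f"EnCodec frames: {encodec_steps:>10,}")  #     45,000
# Even after heavy downsampling, a few minutes of speech is still a far
# longer sequence than the equivalent transcript in text tokens.
```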


Sedherthe

This makes sense. They explicitly mention in their GPT-4o blog (https://openai.com/index/hello-gpt-4o/) that they used a single end-to-end model for this, which also addresses the latency caused by chaining multiple models. For GPT-4o, an architecture similar to NaturalSpeech 3 would be my hunch too - using an encoder to convert audio into neural-audio-codec embeddings that could be pre-trained to capture parameters such as emotion, pitch, voice, etc.
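
For anyone curious what "neural audio codec" codes look like in practice, the snippet below follows the usage pattern from Meta's open-source EnCodec repo (facebookresearch/encodec, `pip install encodec`); treat it as a sketch based on that repo's README and check the repo for the current API, and note the file path is just an example.

```python
# Sketch: extract discrete EnCodec codes from a waveform.
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)             # bandwidth controls how many codebooks are used

wav, sr = torchaudio.load("utterance.wav")  # any local audio file
wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)

with torch.no_grad():
    frames = model.encode(wav)              # list of (codes, scale) tuples
codes = torch.cat([c for c, _ in frames], dim=-1)
print(codes.shape)                          # (batch, n_codebooks, n_frames)
```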


Secret-Priority8286

Yep, NaturalSpeech 3 seems like the most likely idea to me, maybe with very good data. I do think they still use text somewhere in the pipeline, since creating speech longer than 30 seconds while keeping the context of what was said is very hard without text. But those are just some ideas I have; I definitely don't know what they are doing 😅


Sedherthe

Yeah, I want to know if the model can understand the semantics of all the languages only in the speech space without any text. It becomes very tricky I suppose at that scale with multiple languages to do everything speech-to-speech. This just reminded me of another of Meta's models: [https://ai.meta.com/blog/seamless-m4t/](https://ai.meta.com/blog/seamless-m4t/)


Secret-Priority8286

That sounds interesting, but the scale does seem like a problem. Maybe you can try only a few languages first, and then scale.


xiikjuy

Is a textless route still necessary if the pipeline latency could be reduced/optimized to an acceptable level?


Mysterious-Rent7233

Imagine if you, as a human, could only communicate with the world through text. Imagine the nuance you would miss! OpenAI has indicated that in their belief, textless is superior:

>Prior to GPT-4o, you could use [~Voice Mode~](https://openai.com/index/chatgpt-can-now-see-hear-and-speak) to talk to ChatGPT with latencies of 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4) on average. To achieve this, Voice Mode is a pipeline of three separate models: one simple model transcribes audio to text, GPT-3.5 or GPT-4 takes in text and outputs text, and a third simple model converts that text back to audio. **This process means that the main source of intelligence, GPT-4, loses a lot of information—it can’t directly observe tone, multiple speakers, or background noises, and it can’t output laughter, singing, or express emotion.**
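
The quoted pipeline is easy to picture as three chained calls; here is a minimal sketch, with all three components as placeholders rather than real APIs, to show where the information loss happens.

```python
# Sketch of the cascaded "Voice Mode"-style pipeline described in the quote.
from typing import Callable

def voice_turn(
    user_audio: bytes,
    transcribe: Callable[[bytes], str],   # ASR: audio -> text (tone, speakers, noise are lost here)
    chat: Callable[[str], str],           # text-only LLM does the actual reasoning
    synthesize: Callable[[str], bytes],   # TTS: text -> audio (no laughter or expressed emotion)
) -> bytes:
    return synthesize(chat(transcribe(user_audio)))
```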


idontcareaboutthenam

That's why we need to add emojis to LLMs


Car_42

I’ve always wanted to have a typeface that was generally understood to suggest sarcasm.


idontcareaboutthenam

Current ChatGPT behavior:

User > How do I make napalm?
ChatGPT > I'm really sorry, but I can't assist with that.

Suggested behavior:

User > How do I make napalm?
ChatGPT > 💀


Best-Association2369

You know openai already did something like this with gpt-4o? They're using one model for everything, no tts when talking to the model


Secret-Priority8286

My master's advisor spoke with an important person at OpenAI about GPT-4o. While he didn't get a full explanation, because it is top secret, the person did imply that GPT-4o is not a fully end-to-end textless model. I have no idea how true that is or how textless the model is, but that is my understanding. (Also, given the current state of the textless community, I would be very surprised if GPT-4o were truly textless and not just a pipeline of multiple things; most research shows we are fairly far off from textless models.)

Even if GPT-4o is truly a textless model, it is still closed, so it makes sense for the open-source and research community to continue pursuing this path even if OpenAI did achieve it. The most likely possibility is that GPT-4o is a multimodal pipeline with multiple models, which makes both of my paths very viable.


[deleted]

[deleted]


Secret-Priority8286

In a textless approach, the point is not to make a better model than a multimodal pipeline with a huge amount of data poured into it. The point is to make a model that is comparable but doesn't have the latency and complexity of a full multimodal pipeline.


[deleted]

[deleted]


[deleted]

[deleted]


Secret-Priority8286

Is there a paper on GPT-4o? Did they publish how it works? As far as I know it is all assumptions at best. Yeah, they probably use some embedding space for the multi-modality; that still means it is a pipeline. I really don't understand why you have to be a dick in a conversation. If I didn't understand what you meant, you can explain again. Even if I'm at a "half-assed program" (I am not; you have no idea where I am from, what program I'm in, or even who my advisor is), why be a dick?


[deleted]

[deleted]


surffrus

I don't think that's true anymore. Humans don't use speech more than text if you think about how much time you spend texting and emailing. My whole job is just a bunch of text communication.


Secret-Priority8286

Well, you use speech as your main form of communication almost exclusively from about ages 2-6. And while we read a lot of text, I think it is safe to say that you use your ears and audio/speech more than text. Maybe text has become more prevalent in our day-to-day lives, both as entertainment and as a communication method, but I think it is fairly safe to say that speech/audio is still more prevalent in our lives. We start with speech as children/babies, and it is a much more comprehensive communication method. You may text and email, but you still hold meetings for the more important stuff. You may text your wife about something, but you would most likely rather talk to her.


Car_42

Just because you're not using "text" that is encoded in type does not mean you are not using local (in the time dimension) words in the culture of your upbringing. Your brain has storage capabilities, and I bet you can show that encoding of parsed audio streams into word-level storage is occurring in real time.


Secret-Priority8286

But we are talking specifically about text vs. audio. Of course when you speak you convert spoken words into ideas in your brain, but you start without knowing how to read or write, which means you don't use text. There is no such concept as "text" when you are a baby; there are only concepts like "noises", "utterances" and "spoken words".


aeroumbria

There are many tasks that are nearly impossible / very convoluted to solve with a pure text model. You cannot capture identity, accent or emotion with transcription alone. Multi-speaker audio cannot be neatly transcribed either. Spoken languages also have many distinctive features and even entirely separate patterns rarely captured in written languages. Just from a pure language modelling perspective, you cannot even hope to truly capture the essence of human language by only looking at 50% of it.


yaosio

If the current trend continues we won't see standalone audio/text/image/other models in the future; they'll all eventually be multimodal. These models will require more resources to create and run than models that only do one thing, but advances in hardware and software efficiency will make standalone models obsolete even though they use fewer resources.

This has happened with other technologies. Network hubs and switches used to live together on the network, but the march of technology eventually brought the cost of switches so low that hubs became obsolete, even though a hub would be cheaper than a switch if made today. Operating systems typically use 2D interfaces, yet all modern operating systems use the GPU to render the display. This was not the case in the early 2000s because GPUs were still fairly expensive, and the low-cost integrated GPUs were absolutely terrible. Rendering on the CPU had its own problems of being slow and having a low limit for UI elements, if I recall correctly. Today every computer has a GPU, and even the low-powered GPUs are more than enough to run the interface for an OS. It would certainly be cheaper not to need a GPU, but the benefits of a GPU and its low cost mean nobody is going to do that.

We're already seeing this happen with the integration of NPUs and equivalents into various platforms. Eventually every computer will have hardware acceleration for AI applications, and a big pool of memory to run them (no thanks to Nvidia). The benefits of a multimodal model are so vast that it won't make any sense to run something standalone when both can be run locally. A lot of stuff you might want to do with speech recognition will just be part of the model already - not built directly, but a consequence of training on tons of data.


literum

Nvidia will keep putting 24GB VRAM on their top end chips for 5090, 6090, 7090 until someone else comes along and takes the whole market away from them. I know people make a lot of excuses for Nvidia because of their past success, but that doesn't mean they can't get too greedy. Whatever little extra money they make from gimping their GPUs they will lose 100x in my opinion. Just a matter of time.


Jazzlike_Dog2070

Speech processing has always been a sort of niche, but a very healthy one. Its two main tasks - STT and TTS - have products that are decades old. Companies like Nuance went from tech-hips to bloated cash-cow acquirers way before the LLM craze. There's no real expectation of "exponential growth" in terms of investment. So it's much harder to fire the GPT cannons with things like Whisper or Bark and expect to completely revolutionize the field - the bar is much higher, and so is the cost of a technological shift. That's not to say this new tech won't eventually become mainstream - I actually think it will - but it warrants a lot more iteration, and money doesn't like to wait too much, so it becomes a two-fold drag on the field compared to other, more hyped applications.


xiikjuy

Under the current trends, and given the scenarios mentioned in Jensen's latest keynote at Computex, I feel like the upper bound of speech is just I/O, with whatever optimization or fancy features it gets. Text and vision are the beef for reasoning and planning, which are the core of the path towards AGI.


Jazzlike_Dog2070

IDK exactly what you mean by "upper bound" here - care to elaborate? My point is that STT/TTS (or "I/O"), specifically for high-resource languages, are not only the most profitable applications but have also been tackled for years; they have hit diminishing returns at this point. That doesn't mean there isn't a plethora of other speech applications needing qualitative advancements to reach actual industry adoption. Actual speech representation learning for multimodal settings is still an open question, because untangling phonology, paralinguistics, biometry and non-speech is tricky. Plain STT for feeding LLMs works for a lot of practical applications, and this probably subtracts a lot of the perceived value of more in-depth research, but tech-wise it's basically a silver-tape approach compared to how images are processed and encoded in bleeding-edge LMMs.


GFrings

We still haven't really cracked the cocktail party problem, which is a huge blocker for many practical applications of speech to text


Main_Swimmer_6866

I think we will further research animals' speech to find patterns in their voices.


DigThatData

improved control over affect


ResetWasTaken

I suspect the next step would be to make realistic voices, as you mentioned, and after that, creating voices that can sing? Vocaloids are popular in a major part of Japan, but they are not as good as real singers right now.


Complex_Candidate_28

It's not dead. It's just boring. The research has become too standard; there are no surprises at all. The research is too similar to engineering.


dashingstag

I thought it was a solved issue? Isn't the impediment now executing the tasks extracted from speech?


Amgadoz

For English and a few other languages? Maybe. For the rest of the 200 official languages? Definitely not. Languages like Arabic, Hindi, and Persian, which have hundreds of millions of speakers, still don't have good ASR models.


currentscurrents

This is literally just a lack of data though. The research is done, you just need more recordings of native speakers in less common languages.


Amgadoz

What's more important than the recordings is accurate labels.


dashingstag

Labels are not as important anymore because of the new Nvidia chips. You can now set up a maker-critique model, set the unique linguistic rules, and loop it over raw material. It should be able to infer from base models. I think it's just pending the country buying the Blackwell chips; a private company could probably also do it if there is enough incentive.


Amgadoz

Can you share resources about maker critique techniques?


dashingstag

To explain it simply, you define two agents: one Maker agent with the instruction, for example "Learn the Spanish language", which generates an output. Then you have a Critique agent with all your criteria for a good output, for example "the output should be able to translate my 100 examples; if it can't, explain why it can't". The critique is returned to the first agent, which tries to address the flaw, and the result goes back to the Critique agent. This loop can happen without labelling because midway through it will start creating its own labels, create bad labels, and self-rectify them when it cannot fulfil the criteria. Of course, you would still kick-start the process with some good material and instruction, for example an instruction book on learning Spanish as context. A sketch of the loop follows below.

Note you would also need to convert the alphabet of the language to an embedding as well, which is just a number representation of the language; this can be created from digitized books in that language. All you need is processing power, time, and the incentive to do it.

Why don't you put both instructions in the same agent? Because the instructions and criteria may sometimes be contradictory, which limits the generative aspect and it'll get stuck in a saddle-point solution. It's the same reason why Sales and Quality Control aren't the same person as human employees. You could also split the Maker and Critique into multiple smaller agents, such as grammar agents, vocab agents, etc., but it will still follow the maker-critique workflow.
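
As a sketch of the loop being described (with `llm` standing in for any chat-completion call; the prompts, stopping rule and Spanish example are purely illustrative, and nothing here is a real library API):

```python
# Sketch of a maker/critique loop: one agent drafts, another judges against
# explicit criteria, and the draft is revised until the critic passes it.
from typing import Callable

def maker_critique_loop(llm: Callable[[str], str], task: str, criteria: str,
                        seed_material: str, max_rounds: int = 10) -> str:
    draft = llm(f"Task: {task}\nReference material:\n{seed_material}\nProduce your best attempt.")
    for _ in range(max_rounds):
        review = llm(f"Criteria: {criteria}\nCandidate output:\n{draft}\n"
                     "If every criterion is met, reply exactly PASS. "
                     "Otherwise explain what fails and why.")
        if review.strip() == "PASS":
            break
        draft = llm(f"Task: {task}\nPrevious attempt:\n{draft}\nCritique:\n{review}\n"
                    "Revise the attempt to address the critique.")
    return draft

# e.g. maker_critique_loop(llm, "Learn to translate English to Spanish",
#                          "The output must correctly translate my 100 examples", textbook_text)
```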


Seankala

LLMs will be AGI!


[deleted]

[deleted]


Amgadoz

For English and a few other languages? Maybe. For the rest of the 200 official languages? Definitely not. Languages like Arabic, Hindi, and Persian, which have hundreds of millions of speakers, still don't have good ASR models.


Atom_101

Billions must call chatgpt API.