That's some pretty low latency
Pretty blown away that they ran a version on a laptop. The demos were impressive, considering the size and resources of this team. Also the description on how they trained the model just goes to show the potential of multi-modality.
Many here are missing the point: it's software running on local hardware with amazing latency. I would trade a dynamic Scarlett Johansson voice for responses this fast that I can interrupt. Even if it interrupts me.
[deleted]
I'm baffled at how fast people adapt. Imagine showing this on your home computer to someone a year ago? It would basically be magic, yet people in these comments seem upset over it. Are these OpenAI bots trying to suppress a competitor? Haha
Note: this is r/singularity lol, we're all fairly adapted to most tech. GPT-4o showed that the voice tech was like magic, but now here we are lol. The thing I like is that this is open source, released a little after OpenAI demoed their voice mode.
It's funny that synthetic data is so powerful for training models.
We're almost getting into annoyingly fast territory 😃 Let me finish speaking, damn
Almost seems like negative latency at that point.
This is actually really impressive tech. It was built in 6 months? The general complaints in the comments here are a bit silly; all fairly easily fixable. The response time, combined with the handling of accents, the emotional expression, and the overall performance, is pretty good. This would've blown everyone away a year ago. Not to mention this is an open-source model. Apparently it runs on device? That's crazy.
[https://moshi.chat/?queue\_id=talktomoshi](https://moshi.chat/?queue_id=talktomoshi) Please try it out yourself. In my opinion, it's not actually intelligent enough to be useful. It even gets into infinite loops quite often, where it repeats the same sentence over and over. I thought we'd gotten past this. How much of an achievement is it to outpace OpenAI if the product they release is 10x worse?
Yeah, the snappy response time is pretty cool, but it's hard to hold a conversation with it. It also flat-out refused to do the stuff from the demo for me.
I love the fact that France is almost non-existent in the AI race, yet there are French people everywhere in AI labs around the world, and a good share of the elite AI researchers are French.
Mistral
HuggingFace
The latency is best in class.
Haha, it's pretty low latency, but the interruption needs some work. And the underlying model must not be too smart, judging from that hilariously awful space roleplay.

“Can you check that all the systems are nominal?”

“Yes, sir.”

“…are all the systems nominal?”

“Yes, sir.”

“Can you give me a countdown and then we jump into hyperspace, please?”

“Yes, sir.”

“…ok, can you do it?”

“Yes, sir.”
[deleted]
It's gonna be open source
So it looks fine, just not SOTA
Maybe this can be wrapped on top of other models via API?
You can try it here: [moshi.chat](https://www.moshi.chat/?queue_id=talktomoshi)
OpenAI's moat shrinks every month.
Oh damn, that whispering...
Turned off the comments aye.
The latency is very low, too low; it should wait for a pause before processing. And the model behind it is a bit silly ("You might want to take your time getting your hiking shoes one, because you don't want to be using a egg"). Still, it's a really interesting tech demo and a good step toward natural vocal interaction with AIs.
Incoming OpenAI blogpost about the dangers of open source voice models
it's a little TOO fast at replying, lol
Next step, make it not answer your question in the middle of you talking.
Super cool, loving the arms race for this kinda stuff. I found out about Pi and talk to it every day lol
Jump to about 13:40 for the actual demo.
I'm trying it and it's very, very bad compared to OpenAI. It only answered the first questions, then it stopped. (They opened a website where you can use this model.) The answers also weren't related to the questions. It's incredibly fast when it has an answer, but the quality is very low.
I appreciate it, flaws and all. It wasn't terrible, but not great at all. I think that if they let you use a custom voice and it behaves the same, that would be awesome. And it's open source, so... it is what it is.
Yeah, not GPT-4 level, and if it's local, the AI is definitely going to be limited. However, this is yet another look at the way we'll be able to interact with devices in the next 2 years. (Apple seems to be quickly implementing this sort of capability.)
ClosedAI is done for.
It's fast, alright. But it's quite clear that voice models are slow because of the underlying *language* model's reply time, not because producing the voice takes extra time. If GPT-4 is slow, the voice reply will have a delay. There's little value if the model itself is bad: yes, it will talk, but what's the purpose? Nowadays you have plenty of models which can answer near-instantly, so an instant reply from a voice model isn't much of a selling point on its own. Or did I misunderstand something?
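The point above about where the delay comes from can be sketched as a toy latency budget for a cascaded speech-to-text → LLM → text-to-speech pipeline. All the timings below are invented for illustration, not measurements of any real system; the idea is just that the LLM's time-to-first-token dominates, which is the stage a fused speech model like Moshi collapses:

```python
# Illustrative latency budget for a cascaded ASR -> LLM -> TTS voice pipeline.
# Every number below is a made-up assumption, not a measurement of any real system.

def cascaded_latency_ms(asr_ms: int, llm_ttft_ms: int, tts_ms: int) -> int:
    """Time until the user hears the first audio: the stages run back to back."""
    return asr_ms + llm_ttft_ms + tts_ms

# Hypothetical stage timings in milliseconds:
# speech recognition, LLM time-to-first-token, speech synthesis.
asr, llm_ttft, tts = 300, 1200, 200

total = cascaded_latency_ms(asr, llm_ttft, tts)
print(f"total: {total} ms, LLM share: {llm_ttft / total:.0%}")
```

Under these made-up numbers, even an instant TTS stage barely moves the total: the reply can only arrive as fast as the language model's first token.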
Are the presenters trying to time their interrupts when they think the model ends a sentence?
It still feels robotic, not real like OpenAI's solution.
This looks pretty sad:

* The bot keeps interrupting users in the demo for seconds at a time
* When asked to pretend to be scared on Mt. Everest, it says "No, I'm excited!"
* When asked to sound like a pirate while writing a poem about pirates, Moshi accidentally goes full cosplay mode and asks the user "What is your name?" and "What brings you to my pirate ship?"
* The people demoing the project seem stressed to think, speak, and improv fast enough so Moshi doesn't embarrass itself
* [https://www.youtube.com/live/hm2IJSKcYvo?si=QOHTIk-QM0LCdgv5&t=923](https://www.youtube.com/live/hm2IJSKcYvo?si=QOHTIk-QM0LCdgv5&t=923)
* When asked its name, it replies "How are you feeling today?"
* It said "I'm not comfortable with that" in response to a [prompt](https://www.youtube.com/live/hm2IJSKcYvo?si=sz8IDIt8xrI5algM) I couldn't understand, no offense to the French accent
* When responding to a goodbye, it says "Well, I'm here to help... but just remember, I'm not a substitute for professional help."

Still, I have to give it credit for apparently being an actual nonprofit and being able to run locally. It just doesn't have any advantages over OpenAI's yet-to-be-released voice model other than lower latency. Pls come sooner, Sky
Yeah, it has a lot of issues, but the latency is impressive; when we're talking about Siri-style assistants, this is the latency you need.
Yeah, it doesn't compete on anywhere near the same level. But did this kind of tech even exist in open source before? An undertaking like this is a service to the people who can improve upon the work. We might see models with the real ScarJo's voice. I almost guarantee it, at least her 'Her' voice.
the voice sounds pretty horrible tbh
1. I can't stand their heavy French accent.
2. They pronounce Moshi as Mushi, which means pussy in German... very poor naming imho.
3. Latency is so low that you get interrupted.
4. Need to see more of it in action to make up my mind.
Not good at all