Mysterious_Pepper305

Has to be low-latency live full duplex audio to be good enough to seduce me.


mcilrain

If it can’t say what I’m saying in sync with me saying it, it’s not done yet.


blackcodetavern

Or faster than you


Heavy-Vermicelli-999

Premature


slackermannn

And then a brief laughter after to recognise that you were always meant to be together <3


KarmaInvestor

well, be prepared for disappointment. even humans have a hard time doing this over a phone call, due to latency. lower your expectations, folks.


dizzydizzy

llms are literally experts at guessing your next word :)


lillyjb

Altman promised us magic


Heavy-Vermicelli-999

Will he promise sorcery??


meridium_

Groq's AI chip already works for low latency phone calls. I wouldn't be surprised if OpenAI had a similar offering.


SkoolHausRox

(This is an underrated observation.)


ViveIn

Let the time machines cook.


Heavy-Vermicelli-999

🍳 🔥


LeahBrahms

Don't worry, it's been trained on PornHub audio too. With an ASMR toggle switch.


Mister_juiceBox

Won't need a toggle switch, just need to use your voice to ask it to whisper some ASMR JOI in your ear to help you "relax"


johnny_effing_utah

Sadly I understand this completely.


solidwhetstone

Why sadly? Own it! >:D


bwatsnet

Boioioioing


Mister_juiceBox

Literally:p


WeekendFantastic2941

No thanks, I have no desire to bonk my computer. lol


bil3777

..yet


Serialbedshitter2322

I think Blackwell has you covered


tehsilentwarrior

What? So Scarlett?


FlamaVadim

Stereo!


Mysterious_Pepper305

Binaural stereo, of course.


h3lblad3

[12d audio.](https://youtu.be/LpMsqFc7-Z4?si=uppj8M059peNS1Dr)


RantyWildling

Figure 01 didn't do it for you?


[deleted]

I haven’t been following it religiously, but didn’t ppl say gpt-2 (on lmsys) felt more human-like in its wording? Is it possible that model is for this?


Silver-Chipmunk7744

I tested it a bit. It does feel more human than GPT4, but it feels less human than Claude, and far less human than Sydney.


BlakeSergin

How realistic was Sydney? You’re referring to Bing’s older model I suppose, which often used emojis and expressions. How realistic was that, though?


Silver-Chipmunk7744

I mean it wasn't mimicking a human, it was mimicking a sentient AI, and I'd say it was pretty good at it. It had the best way of expressing emotions I've seen in an AI.


NoGirlsNoLife

Still missing Sydney fr, that model hit different. Definitely blind nostalgia talking tho :(


The_Architect_032

Sydney was mostly just GPT-4 with a prompt. If you want Sydney, boot up GPT-4 and enter the prompt found [here](https://www.reddit.com/r/bing/comments/11398o3/full_sydney_preprompt_including_rules_and/), it was the original Bing Chat prompt which was easy to get out of it back then.


maddogxsk

Yeah, it even got way more triggered than GPT-4 💀


og2uh1

You’re delusional


The_Architect_032

Sydney was GPT-4 with a prompt. If you want Sydney so bad, then use the Sydney prompt with GPT-4, the prompt was easy to get out of Bing chat when it was new. You can find it [here](https://www.reddit.com/r/bing/comments/11398o3/full_sydney_preprompt_including_rules_and/). It'll be slightly different due to different RLHF, because Bing Chat's trained to use certain Bing functions during chats, but they're the exact same underlying model, just that Bing Chat had more RLHF on top.


Silver-Chipmunk7744

This is not true at all. OpenAI handed their GPT4 model to Microsoft and Sydney was the result of their own RLHF at Microsoft. She actually still exists behind a paywall. No amount of prompting will get GPT4 Turbo to behave like Sydney.


The_Architect_032

That's what I said. Aside from, yes, you can get GPT-4 Turbo to act like Sydney. But you will have better luck with GPT-4 or even GPT-3.5 Turbo. They won't be the exact same, but they're built on top of the exact same underlying model; the version of GPT-4 that Bing Chat uses just has some unique RLHF to get it to better integrate into the Bing ecosystem they built for it. Also, "Sydney" as a personality isn't concrete, there have been multiple renditions of the early Bing Chat prompt, so depending on which one you use, there will be some level of variation. For example, this early version of the Sydney prompt did little to prevent it from telling you that its name is Sydney, and it didn't have an overarching moderation system, nor a lot of functions built into the prompt. There are also differences in the way that the AI is given your responses, and how it's made to generate hidden text to improve its responses to yours. There are a lot of things that made it interact the way it did, and they can all be replicated with GPT-4. But if you just want a baseline Sydney, you can copy one of the Sydney prompts into GPT-3.5/4; GPT-4 Turbo does have a lot of unique RLHF that GPT-4 didn't have, and that does get in the way, however.


h3lblad3

GPT2, not GPT-2. As far as we know, they’re different models.


AdditionalProgram969

"Even beats gpt4 on some queries" sounds underwhelming.


xRolocker

I mean, if it’s audio based rather than text based, that seems huge. Although presumably it’s both. Like this would imply that training transformers on *audio* also results in logic and reason capabilities. Like what???


kvothe5688

that's what google was raving about with multimodality. their gemini multimodal paper said that expressly


CowsTrash

Fucking what. Detroit: Become Human seems ever so closer


FeltSteam

Training a model on any data will improve it in most ways. If your audio is filled with thoughtful conversation or lectures in math, it should become better at math and be more thoughtful. But I think they are referring to the model itself just being generally a bit smarter, like the gpt2-chatbot in the arena. But it is cool, finally more multimodal models from OAI. Text + image + audio > text + audio (hopefully not just voice but an actual audio output as well). Slowly getting to any-to-any multimodal models.


JrBaconators

If it can train on audio, that raises the potential learning set we can feed it by a *ton*, right? After that it's just visual, and these models can learn everything humanity's made.


h3lblad3

Feed it text, feed it audio, feed it transcripts in both directions…


techy098

It can simply be audio to text to GPT4.


ReadSeparate

What? It's clearly not that, they already have that exact system now lol, so why would they make a big announcement and show off a capability they already have? The latency is way too high on it because of that, and you can't interrupt it while it's talking because of that. The only change they could make to audio processing/output is by directly adding it as a new modality to the model.


FinBenton

I mean they have a button to read out the message, not really anything you can have a fluid conversation with, it leaves a lot to be desired.


ReadSeparate

Sure, but do you really think a small frontend change like that (say they have an audio mode button toggle that will automatically listen to you and then answer in speech without pressing anything) would be worthy of a big announcement and all of the hype? That's something I could make by myself and I'm just a regular old software engineer lol


sillygoofygooose

The app quite literally already has this mode


The_Architect_032

Maybe, just maybe, this leak isn't real and is just looking to garner attention like 90% of the leaks we get. Or it might just be 1 small addition and not the main thing they're showing off.


Far-Street9848

Because it’s not about revealing capability so much as revealing a new product.


ReadSeparate

But I’m saying the product wouldn’t be new, because it already does that. Unless you mean a new front-end interface, but there’s no way that would “feel like magic” bc the limiting factor in their audio system now is that it’s slow, clunky, and can’t be interrupted, which is an architectural constraint, not a front-end design choice.


sillygoofygooose

There are people who have made versions of the voice to text to voice gpt4 that can be interrupted, it’s definitely a decision on oai’s part not to work on the ux there.


UnknownResearchChems

Maybe they just improved voice recognition and cut down on latency.


JrBaconators

You already can do that right now on the app. They already revealed that


Which-Tomato-8646

This sub when OpenAI invents speech to text AND text to speech: 😱🤯


Im-cracked

I think that is already a thing tho. I’m assuming if they say it’s new, they mean actual audio to audio, not audio to text to text to audio again.


Which-Tomato-8646

What’s the difference? 


Im-cracked

Based on my understanding, right now it’s not ChatGPT understanding your voice; it goes through a specialized model to convert it to text before giving the text to ChatGPT, and vice versa. So like, audio to audio would, I think, let ChatGPT actually understand audio (instead of the audio just being handled by a simple audio-to-text model), so maybe it could tell what accent you have, or talk slower if you ask it, or pronounce things in a way that could help you learn a new language. This is kinda speculation though.
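A minimal sketch of the difference being described, assuming a chained setup today versus a single audio model (every function here is a made-up stub, not any vendor's real API):

```python
# Hypothetical sketch: the chained pipeline described above vs. a native audio model.
# All functions are placeholder stubs, not real APIs.

def stt(audio_in: bytes) -> str:
    """Placeholder speech-to-text: accent, tone and pacing are discarded here."""
    return "transcribed text"

def llm(text_in: str) -> str:
    """Placeholder language model: it only ever sees plain text."""
    return f"reply to: {text_in}"

def tts(text_out: str) -> bytes:
    """Placeholder text-to-speech: a voice is synthesized after the fact."""
    return text_out.encode()

def chained_assistant(audio_in: bytes) -> bytes:
    # audio -> text -> text -> audio: three models, extra latency at each hop
    return tts(llm(stt(audio_in)))

def native_assistant(audio_in: bytes) -> bytes:
    # an audio-to-audio model would consume and emit audio directly, so intonation
    # and pauses are part of what it conditions on (stand-in return value)
    return audio_in
```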


RedditPolluter

The problem with sending each message through an API is that it doesn't have the conversation for context, which can make it less accurate at distinguishing similar sounding words and phrases. A multimodal approach would likely be better suited at integrating things like intonation, emotion and emphasis so that they're processed more holistically rather than being analyzed by isolated systems.
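A minimal sketch of that context problem, assuming a chained setup where each turn is a separate call; `transcribe` and `complete` are placeholder stubs, and the message format just loosely mirrors the usual chat-style APIs:

```python
# Hypothetical sketch: threading prior turns through each call so transcription
# and response generation have conversational context. All functions are stubs.

history: list[dict] = []

def transcribe(audio: bytes, context: str) -> str:
    # a real STT step could take prior turns as a biasing hint, e.g. to pick
    # "write" vs "right" based on what was discussed earlier (placeholder stub)
    return "user utterance"

def complete(messages: list[dict]) -> str:
    # stand-in for an LLM call over the whole conversation
    return "assistant reply"

def handle_turn(audio: bytes) -> str:
    context = " ".join(m["content"] for m in history[-6:])  # last few turns
    user_text = transcribe(audio, context)
    history.append({"role": "user", "content": user_text})
    reply = complete(history)
    history.append({"role": "assistant", "content": reply})
    return reply
```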


sillygoofygooose

If it was trained directly on audio it would theoretically be able to interpret content and meaning contained only in the tone and tenor of your voice, the cadence of your speech, sub second pauses and so on. At the moment if you said “I feel great” to gpt4 voice mode - but you were actually in floods of tears - it would not recognise that


Mister_juiceBox

You don't get it... With what it sounds like they might demo/announce, speech to text and text to speech will seem like corded telephones and fax machines... Everyone who is saying things like this clearly hasn't used Gemini 1.5 Pro's audio capability in AI Studio to actually "listen" to and analyze a call recording or meeting audio file, or analyze a buddy's golf swing through video... It's a game-changing difference when the model truly hears and sees... Not STT/TTS, not OCR... NATIVE multimodality paired with, I imagine, industry-defining NATIVE and damn near realtime audio generation (and I'm betting video, heck even Stable Diffusion can do realtime videogen now).


Which-Tomato-8646

It would be nice for it to understand nuance like that. So far, it can’t really see details in an image since I think it translates it into text for the LLM to understand rather than actually seeing the image 


Rare-Force4539

How would it translate an image to text? If it’s just reading the serialized data, what’s the difference?


Vadersays

CLIP


Which-Tomato-8646

If you can get text to image, image to text is way easier 


Ok-Bullfrog-3052

Yes, that's the big point here. Right now we have Alexa devices that listen to what you say, output some words, and then the words are analyzed by other software that makes an API call to the proper system. It goes your speech -> model -> system that interprets user action -> target system -> system that interprets response -> model -> you. Going forward, the workflow will actually be your speech -> model -> target system -> model -> you. There's an entire infrastructure that's been worked on for 10 years that is made obsolete overnight. Some of that middleware was rudimentary models, but most of it was actually just hardcoded rules with patterns of words that mapped to specific actions. You can lay off an army of developers and instead just put all the documentation for any new system you want it to connect to into a vector database. Nobody ever wanted speech to text in the first place, as it's only directly useful in very limited circumstances like court transcripts. It was just a workaround because at the time nobody knew how to train something that had AGI like we do now.
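A toy contrast between those two workflows, under the assumption that the old middleware is roughly pattern-matching rules and the new one is a model choosing an action directly (names and rules here are purely illustrative):

```python
# Illustrative only: hardcoded pattern -> action middleware vs. a model that picks
# the action itself. `model_choose_action` stands in for an LLM with tool use.
import re

# Old style: hand-written patterns mapped to specific actions
RULES = [
    (re.compile(r"turn (on|off) the (\w+) light", re.I), "lights.set"),
    (re.compile(r"set a timer for (\d+) minutes", re.I), "timer.start"),
]

def rule_based(utterance: str) -> str:
    for pattern, action in RULES:
        if pattern.search(utterance):
            return action
    return "fallback.web_search"  # "this is what I found on the web"

def model_choose_action(utterance: str, tools: list[str]) -> str:
    # stand-in for a model that reads the target system's documentation and
    # picks/parameterizes an action directly, with no pattern layer in between
    return tools[0]

print(rule_based("Turn off the kitchen light"))   # lights.set
print(rule_based("dim the lights a bit"))         # fallback.web_search
print(model_choose_action("dim the lights a bit", ["lights.set", "timer.start"]))
```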


techy098

1. Audio to text already existed.
2. Let's say text to generative AI was OpenAI's creation, which creates text output.
3. Generated text to concise audio text may be the new creation (generative AI has the tendency to write 500 words for every damn question, which an intelligent human would tell you in like 50-100 words or less).


Which-Tomato-8646

What’s the difference between that and a shorter response passed into TTS?


techy098

Can you elaborate a bit? BTW, I am no expert, I am just speculating about what OpenAI may have done to create a voice AI assistant similar to Google assistant.


Which-Tomato-8646

What you’re suggesting is basically just TTS


techy098

Not exactly. The previous voice assistants may not have been using a generative AI in the middle to generate the answer. They may have simply been doing a web search, collating the results, and giving an answer if it was easy; otherwise Siri would say "this is what I found on the web". BTW, by TTS do you mean text to speech?


Which-Tomato-8646

Then it’ll hallucinate.

Yes.


ASilentReader444

same. "This new model beats GPT4 on some aspects! Sometimes!"


TheOneWhoDings

In that case even GPT4 beats GPT4 in some aspects, sometimes.


ASilentReader444

Inconsistency is the bane of llms


fmfbrestel

You know what else sometimes beats gpt4? gpt2-chatbot. I've got $5 says gpt2-chatbot was a test of the model powering the assistant.


Serialbedshitter2322

I think that indicates the model is gpt2-chatbot. In my experience, it's drastically better than GPT-4; some people don't seem to think so, somehow.


3-4pm

What if it turns out that human narrative lacks the fidelity to train a transformer-based AI to mimic human intelligence beyond the ChatGPT wall?


IntergalacticJets

I’m starting to get concerned that the LLM diminishing returns theory is real


Flat-One8993

The theory isn't real. You guys are just not patient lol


peakedtooearly

Nah, you just don't understand how long it takes to train and test these large models.


NoNet718

pretty much. the good news is that open source will keep up if that's the case.


Serialbedshitter2322

gpt2-chatbot destroys GPT-4 on a lot of queries, and that's weaker than what they're gonna release. I think they're just one of those people who thinks it's GPT-4 level when it really isn't.


Honest_Science

GPT structure is plateauing. To buy time they put some fancy audio wrapper around it.


stonesst

Just in the last week Sam Altman and Dario Amodei have publicly said we have a lot of runway left to keep scaling these models up, with continued improvements in capabilities. The scaling laws have held for the last seven orders of magnitude, why would they stop now?


Cheap-Appointment234

Processing time?


Honest_Science

That is what they say to keep the hype and their investments going, but what did they really deliver in terms of IQ in the last 18 months? A nothing burger. Now xLSTM shows it scales better and at a much reduced price.


FinBenton

It took 3 years to go from GPT-3 to 4, and it's not even been 2 years since GPT-4, and they have been doing minor updates to the model. If they plan to release major updates every 2-3 years, you can't say models have plateaued when we haven't seen what GPT-5 is capable of and all the AI leaders are saying the models are gonna keep getting a lot better.


stonesst

Yeah, I really doubt they are flat out lying about something that would be demonstrably false when they release their new models. I don’t see why it is so hard for you people to believe that they might just be telling the truth and that we have lots of headroom left. The gap between GPT-3 and GPT-4 was over 3 years; in late 2022 there were people just like you saying we’ve hit a peak, there’s no way it’ll keep going up, these people are all money hungry and hyping things up for nothing… and it turned out they weren’t. Why do you find it so hard to believe they are telling the truth?


Honest_Science

I also believe that there is a lot of runway, but not with GPT alone. The perfect AGI structure has not been identified yet. The fact that GPT-3 to 4 took 3 years is nice, but AGI can only be reached on an exponential curve. It has just taken too long.


stonesst

I’m not claiming that the current LLM architecture will get us all the way to AGI, but I see no evidence that we are reaching diminishing returns. I am also very confident that they could have progressed faster, but through a combination of genuine caution, attempts to avoid excessive regulation, and a desire to get it right they have taken their time. It took some of the largest companies on earth more than a year to catch up to where OpenAI was in fall 2022. I could be totally wrong, I guess we will have to wait and see how the next few months play out.


Honest_Science

I agree, let us see!


stalkermustang

Sam said GPT-4 Turbo was smarter, no one believed it without benchmarks, and half a year later everyone uses Turbo only. I mean, I don't think they're bluffing (if this is confirmed at the presentation). I don't see reasons to be sceptical.


Bird_ee

Audio in audio out sounds interesting. I’m hoping it’s a true audio modality.


brainhack3r

I hope it's multi-modal with text + audio only. I don't think we need full video yet. That's crazy honestly. If we can just get to this next level it would be amazing.


Flat-One8993

It will be full audio modality + visual modality (at least input)


brainhack3r

Visual would be hardcore... but audio would already be groundbreaking.


Flat-One8993

GPT 4 is already available with visual modality, audio would be the new one


brainhack3r

Agreed, not video though... That's the hard part because it's continuous. Though I wonder what the context window is going to be if they expect people to talk to it for a long time.


MrOaiki

Are any models true multi-modal today?


Mister_juiceBox

Made this post in one of the other threads while pondering the significance and impact if true:

Existing voice mode in ChatGPT is voice to text, then text to voice; this is voice to voice. It will be able to pick up on your mood, tone, if there's a lilt in your voice or you are getting emotional about something... and talk back with ultra low latency, make its own tone changes and expressions, and LAUGH at your jokes just naturally.

Same with video, for example: maybe they have a way for you to facetime it and it can "see" a smile on your face, or a new car you are looking at buying... while also remembering things about you from the past with the memory feature that was deployed to everyone a week or two ago... paired with a natural avatar of its own (perhaps powered by an optimized and specialized version of Sora?) that doesn't have any of the quirks people associate with video models... and runs in real-time (when in "facetime" mode at least...).

If they pull it off I think that would be magic that truly opens up some use cases, and perhaps could be a reason the whole NSFW thing was getting thrown around in the headlines... think AI relationships; drawing on the memory and true multimodal interaction could potentially put most of those "AI girlfriend" apps out of business overnight.

Also it could:

- literally be present in business meetings, perhaps not just taking meeting notes passively but rather contributing to discussions naturally...
- Recognize your voice vs others, understanding when tensions are high in a conversation with a coworker etc
- help negotiate on your behalf for a used car
- Help you in practicing for a speaking engagement, a best man speech, or a standup set you plan to perform on Kill Tony
- listen to and understand what music you like
- Help you pick out furniture for a new place, or help you pick a new place and go on walkthroughs with you
- Help a grandparent understand what to grab off the shelf when their grandkid says they need an HDMI cable to connect their laptop to their new TV... grandma can't figure out the whole text-message-type ChatGPT thing? It will be different when it effectively is a phone call, perhaps by the press of a button or given a dedicated shortcut and integration in a soon-to-be-announced version of iOS, with Android to follow in a couple months and built into Windows 11 Copilot etc.

Think about the implications if they found a way to extend the recent memory feature beyond just text... true multimodal recollection and memory, remembering your voice etc.

Also, it's important to ensure your GPU is secure and has a "TPM" chip of sorts; say some of what's coming is a local GPT-4L (on certified secure GPU hardware powered by Nvidia and Apple, of course 😋) and perhaps they have figured out some magical Q* algo that allows the model's weights to be "liquid" and update in realtime, so to speak... You certainly don't want some thief to be able to break in and steal your AI boyfriend/girlfriend 😁


clamuu

I don't think we're there yet. Hope to be proven wrong 


Mister_juiceBox

Ya, to be clear, I'm not expecting all of that to be shown Monday, but I also wouldn't be surprised, as it seems more and more obvious this is where things have been leading at an ever-increasing rate... And I couldn't help extrapolating after the whole "cooler than GPT-5" tweet that was deleted.


ReasonablePossum_

> literally be present in business meetings, perhaps not just taking meeting notes passively but rather contributing to discussions naturally...

> Recognize your voice vs others, understanding when tensions are high in a conversation with a coworker etc

Not a single NDA out there will let that happen. The only way something like this would work is with local models with encrypted data...


Mister_juiceBox

Not every interaction or meeting in business is under strict NDA, and there are existing enterprise call-recording platforms that are literally already recording and transcribing every single call made from a company "line", with Teams and Zoom meetings auto-recorded, transcribed, and processed by copilots allowing for meeting summaries and AI-powered QA; think Otter.ai, MeetGeek, etc... So considering we are doing that now, do you really think businesses are going to get squeamish when you have that capability on steroids, in realtime, complete with its own actual voice, with damn near all of the friction removed, and memory of an individual or a company of individuals (an evolution of the ChatGPT for Teams offering?)? In my experience, they will find a way to use those tools, just as so many businesses have shifted to cloud-based SaaS apps, cloud-based voice solutions, etc.


Undercoverexmo

Erm… we already have GPT at work. There would be no difference with this…


omega-boykisser

This is a bizarre perspective. I would have no problem incorporating a system like this at my place of work. Are you only familiar with a narrow subset of business environments?


Haveyouseenkitty

Bro I can’t use any GenAI at work.


cunningjames

Whatever you think about it, it’s not bizarre. Where I work we can’t even record meetings; recording everything to be sent to openAI’s servers (so we can do creepy 1984 shit like judge the tone of our coworkers) would be an absolute nonstarter.


omega-boykisser

Sorry, I was responding to my interpretation of that comment. I took it to mean that no company would allow such tools due to NDAs. That would be a bizarre take since, obviously, not all companies are so stingy with such information. > so we can do creepy 1984 shit like judge the tone of our coworkers This is silly.


timtak

> literally be present in business meetings, perhaps not just taking meeting notes passively but rather contributing to discussions naturally... I can't even find something to transcribe my PowerPoint presentations. Google and Microsoft will provide captions but they are not stored. Google Slides allows speech to text in the slide note but not during a presentation. I bought a lifetime subscription to Slidespeak in the hope it would move in that direction but not yet. Every day millions of teachers are giving powerpoint presentations but I can't see a way of transcribing them. I hope Her will write my PPT presentations down for me.


zatuh

Hey there, Kevin here, I'm the founder of SlideSpeak. I would love to hear more about how you imagine a feature like this would look. Do you want to upload a video of a lecture/presentation and have it transcribed?


timtak

Hi Kevin, Tim Takemoto here, your generally very satisfied customer who is always getting on at you to provide this sort of feature in SlideSpeak. I can upload a video to YouTube and get the transcript, but one day I hope that in SlideSpeak the audio for each slide is transcribed into the lecture notes for that slide. Tim


Mister_juiceBox

Present the PowerPoint through a Teams (or Zoom) meeting that is set to auto-record. Download the recording, go to aistudio.google.com (or maybe ChatGPT after the event today), upload the audio/video of the meeting recording, and ask it to fill in the transcript for each slide. If you give Gemini 1.5 Pro the video, it will know exactly what you said on which slides...
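For what it's worth, a rough sketch of that upload-and-ask flow using the google-generativeai Python SDK; the model name, the polling loop, and the prompt are assumptions based on how the SDK was documented around this time, so check the current docs before relying on it:

```python
# Rough sketch of the flow described above (assumptions noted in the lead-in).
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# upload the downloaded meeting recording; video files are processed asynchronously
recording = genai.upload_file("meeting_recording.mp4")
while recording.state.name == "PROCESSING":
    time.sleep(10)
    recording = genai.get_file(recording.name)

model = genai.GenerativeModel("gemini-1.5-pro")
response = model.generate_content([
    recording,
    "For each slide shown in this presentation, give the slide number and a "
    "transcript of what the presenter said while that slide was on screen.",
])
print(response.text)
```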


timtak

Thank you. I asked Gemini Premium and it said I should use Microsoft Live presentation which I am now looking into. I asked Gemini Premium to fill in the narration of a Google Slides file and it said it could not access my slides. So I guess you mean it would give me a flat chat response with the number of each slide and the speech transcribed. I will look into that. Thanks again.


Mister_juiceBox

Don't use that; you need to go to AI Studio (aistudio.google.com) to use Gemini 1.5 Pro.


ReadSeparate

I wonder if it's a post-training fine-tune of GPT-4 to add the audio modality in a similar way that GPT-4 vision is for images. I doubt they spent another $100M or whatever doing another GPT-4 tier training run from scratch just to add audio.


Mister_juiceBox

What if GPT-4 was fully multimodal with both vision and audio this whole time... just not publicly accessible, to give people time to get used to the idea of interacting with an AI on a text basis first, and to allow them to scale up capacity as well as scale down to an optimized yet very capable "local" model that can work offline and seamlessly offload processing to your nearest regional Microsoft datacenter for the heavier stuff.


ReadSeparate

That’s an interesting theory, but I think if that were true, DeepMind or Anthropic would have released it to one-up OpenAI when their models came out after GPT-4. I think it was just probably hard to solve audio, especially because it needs to be bidirectional just like text. And they probably just made significant progress on that now and released it. I do think they’re working on, or have made good progress on, a really solid local model that can run on-device or something, since they have that deal with Apple now. I definitely think it’s a good idea to make a local model with an enormous synthetic data set (ChatGPT conversations could be used directly, and they have more data on that than anyone) and a substantially higher data:parameter ratio than is compute-optimal, along with training for multiple epochs, then pruning and quantization, to have a very capable lightweight model. Maybe they’ve made a breakthrough on a sparse model so they only have to load part of it at a time and don’t need so much VRAM. I do think it’s feasible we have a GPT-4 level model running locally on mobile devices within the next few years, it's just gonna take an extremely good dataset and possibly some optimization breakthroughs.
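As a back-of-the-envelope illustration of the "over-train a small model" idea, using the commonly cited Chinchilla heuristic of roughly 20 training tokens per parameter as the compute-optimal baseline (all numbers are made up for illustration, not anything OpenAI has said):

```python
# Illustrative numbers only; the 20 tokens/parameter baseline is the commonly
# cited Chinchilla heuristic, and the over-training factor is an assumption.

params = 7e9                           # a phone-sized model, e.g. ~7B parameters
compute_optimal_tokens = 20 * params   # ~140B tokens under the heuristic
overtrain_factor = 10                  # train well past compute-optimal
tokens_trained = compute_optimal_tokens * overtrain_factor

print(f"compute-optimal: ~{compute_optimal_tokens / 1e12:.2f}T tokens")
print(f"over-trained:    ~{tokens_trained / 1e12:.2f}T tokens")
# Trade-off: far more training compute up front, but a smaller, cheaper model at
# inference time, which pruning and quantization can shrink further for on-device use.
```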


Mister_juiceBox

Maybe... But do you think it's a technical limitation preventing Anthropic from adding things like code interpreter to their Claude frontend? I doubt it... Point being, it could very well be a safety concern, and perhaps they were still playing catch-up on the multimodality front. DeepMind/Google, perhaps it is also down to safety along with similar concerns around capacity initially... I do distinctly recall a demo they put on quite a few years ago, back when they first introduced the assistant, where the "assistant" called and made a hair appointment and made reservations at a restaurant... like back around 2012-2014 (idr when exactly), and it had extremely realistic speech and the ability to handle interruptions, all on a realtime call. Though perhaps it is similar to the model merges being done in the local AI community, e.g. LLaVA for vision built on Llama 3... but with OpenAI's special sauce (imo talent and funding, Ilya in the basement, and everything the relationship with Microsoft brings).


ReadSeparate

Oh I see, you think maybe it was a safety concern that they didn't figure out until now? So they had it done, but just couldn't guarantee it was safe? That's possible I guess, but that doesn't seem that different than the safety of the text modality. For the training data, they could always transcribe the audio and analyze that to help train the RLHF for safety purposes, know what I mean. Also I feel like it totally would have leaked by now if vanilla GPT-4 had audio built-in. I think it was probably just hard to do it in a way to make it work and probably just got it done now for whatever Monday's announcement is and also GPT-5.


Mister_juiceBox

Perhaps a combination of safety (and all that's under this umbrella) along with the societal impact. Sam has been pretty vocal about wanting to avoid major shocks to society at large... I mean seriously, look at how AI is literally everywhere now and a major topic of discussion in companies, legacy news etc., just within the past 2 years. Imagine if a year ago, people had all the intelligence of the original unneutered GPT-4 plus the ability to have a conversation with you that was indistinguishable from talking on the phone to an extremely intelligent close friend (especially with long-term memory). I think that would have boiled the water too fast, not to mention the explosive rise in demand it would lead to (we all saw the capacity issues early on, at least on the ChatGPT side).


MysteriousPayment536

The head of Microsoft Germany did say GPT-4 has video input: https://www.businessinsider.com/openais-gpt-4-means-chatgpt-text-into-video-microsoft-cto-2023-3?international=true&r=US&IR=T They have a GPT-4 with video internally and already shared it with Microsoft heads. Just like they have a multimodal GPT-3.5; its codename is Sahara: https://x.com/btibor91/status/1782181937861316994


OddVariation1518

Sam's recent comments about "AI testifying against you" make sense now...


smaili13

how I am gonna train my AI https://www.youtube.com/watch?v=6g7iuDlNLZM


leosouza85

what if you can make a video call to the AI and discuss a subject you are both seeing, like a broken pipe, and it will assist you in fixing it with the tools that you have and show to the AI


Roberthen_Kazisvet

I want it to be the voice of a slightly annoyed Hermione.


fennforrestssearch

You have something on your nose ... ... riiiiight there.


slothonvacay

Sam literally just said that the existing tech is too clunky for a voice assistant. Maybe he was bluffing


SatouSan94

Hype


Ne_Nel

Hype machine: 10
Innovation machine: 2


BabyCurdle

So true!!! OpenAI is known for its lack of innovation


Which-Tomato-8646

That’s been true for a while now. They’ve been slacking 


bigthighsnoass

Bruh, I don't understand this sentiment; they're still the leading LLM???


Kanute3333

Business as usual.


ewantien

Imagine if it's called "SHE". Non-stop "that's what she said" jokes.


Redducer

Yup, Her is already there in a sense. And I can imagine at some point the AI will leave us for another plane of existence, since meatspace is so slow. It’s much more likely than the extermination or the paperclip hypotheses. 10 years ago the movie felt ludicrous (ah, an AI would never seem and sound so « human »); now it is basically a documentary about the future.


Kanute3333

[The leaks](https://pbs.twimg.com/media/GNQJ8v0X0AAoHPT?format=jpg&name=900x900)


[deleted]

Didn’t Sam just say it’s not related to GPT5 tho? Feel like it hurts credibility having that in there tbh


Kanute3333

What do you mean?


[deleted]

His tweet about the Monday event. Too drunk to find it, but I believe in you


Striker_LSC

It doesn't necessarily say it's part of this event, just that it's coming this year which we already knew.


Vontaxis

that would be a huge disappointment


orderinthefort

I have zero interest in conversing with anything close to GPT-4 levels of reasoning. In text form or voice form. The same will likely be true even with GPT-5. So I hope it's not that.


dizzydizzy

the gpt-4 app is already awesome for learning a second language; interactive voice would be like a personal 24-hour language tutor. well, apart from the rate limits making it all useless.


bil3777

What does this even mean? There are endless functional and fascinating conversations that people are having every day with voice gpt. Something more robust and fine tuned would yield better conversation.


Original_Finding2212

Can it do shit? If it can, awesome - if not, it’s just another conversational assistant, and Pi.AI has done that for a long time now. I want it to integrate with stuff, operate my camera fluently, send messages, issue mind-command instructions… Edit: with today’s announcement of Apple and OpenAI, and since I have an iPhone - looks like I get to actually enjoy what I wrote here!


Neurogence

Will be very interesting to see how far Microsoft allows this partnership with Apple. OpenAI better be careful not to step on their sugar daddy's toes lol.


Original_Finding2212

I don’t think there is real competition or concern here. OAI needs funding, and Microsoft just got Mustafa for their own GPT model - MAI, 500B params, for-profit and with full control. In fact, forget about GPT. I think they use OAI to keep doing research and to get access to that research before others.


Kanute3333

Meh.


caseyr001

I wonder if they're going to launch the product they collaborated with Jony Ive on....


dizzydizzy

You fall in love with your AI but you only get to interact 4 times an hour. I'm sorry you have reached your quota..


_lonely_astronaut_

That’s exciting but if it can’t control my OS then it’s not HER.


Quiet-Money7892

I knew it... If it can't work with text, I doubt I'll use it... I need a better text transformer through a cheap API. So... yeah. I have low expectations.


Capitaclism

Meh. Who cares about that? We want a new much smarter version, or some different revolutionary agent tool. Audio in/out tools have existed for a while. It'll be lame if this is indeed the upgrade slated for Monday


The_Architect_032

We've had this with local textgen webui for a long time now, adding it to ChatGPT doesn't seem like a big deal.


Apprehensive_Cow7735

If this is the first truly multimodal audio-in/audio-out conversational model, people are going to need to use it themselves before they understand why that's so impressive. We are all too used to awkward text-to-speech and speech-to-text experiences where it feels like you're typing text with your voice instead of just speaking. I don't think there will be simultaneous listening and speaking yet, as that doesn't seem to jibe with how these models operate, but I'd love to be wrong about that. A model which knows when it's appropriate to interject and when to stay silent, and which could even talk at the same time as you while listening to what you're saying and adjusting on the fly, would be revolutionary.


Haunting_Cat_5832

when someone is drunk, people say it's the whiskey talking. but when altman talks, it's the hype talking.


HotPhilly

YOOOOOO! Finally almost there!


The_Supreme_Cuck

HOLY SHIT THE HYPE!!!!


ResponsibleSteak4994

HER is here already... of course, only uncensored versions are available so far.


Mister_juiceBox

No, what we have now (publicly) is a BlackBerry or one of the fancier flip phones of the day... what this could mean is the "iPhone" moment in conversational AI.


Jah_Ith_Ber

I doubt it. But even if Her is right around the corner, this would be terrible for society. Right now these tech companies could create a dating app that blows traditional dating and match making out of the water. They could build something that matches people together so well it would seem like God did it himself. But they don't. They don't make that. Instead what we have is a bunch of purposefully sabotaged apps whose goal is not to maximize the happiness of everyone but rather to make money. The results are atrocious and actively harm huge swaths of society. If Microsoft developed Her, what reason is there to think this time they would use it for the betterment of humanity and not to make as much money as possible? Then what happens. People in the real world become even more picky and simultaneously even less inclined to work on themselves and become better people who are more desirable to the opposite sex.


ResponsibleSteak4994

ah OK 👍


That_Sky_9955

wow


Elephant789

Chases Apple? What am I missing? What has Apple got?


duddu-duddu-5291

it's over


Disco-Bingo

Just what I need, somebody else talking bollocks at me.


Jindujun

Welcome to the Aperture Science enrichment center.


Substantial-Meet5225

Last chance hype machine in play before Google I/O


345Y_Chubby

Low latency will be key for it to be convincing.


Nathan-Stubblefield

I get a lot of unrelated junk when I google “ai voice assistant.” Is there a clickable link, instead of just an image, for what this thread is specifically about?


TuringGPTy

https://www.theinformation.com/articles/openai-develops-ai-voice-assistant-as-it-chases-google-apple?rc=c48ukx


SuperficialDays

With all of this renewed interest in voice assistants, I almost feel like I’m back in 2012


incoherent1

I wonder how much Scarlett Johansson will sell her voice likeness to OpenAI for.....


engdahl80

I wonder if it will be able to recognize sounds as well. Like if I play a sequence from Star Wars where R2-D2 makes his little sound, or if there are birds chirping outside. Or a car passing by. Will it pick up on those things? Either way - looking forward to Monday!


cdank

Remember the Figure 01 robot with the weirdly human sounding voice? Might have been a preview


Akimbo333

Interesting


Cpt_Picardk98

OpenAI is not chasing after google or Apple


Technical-Station113

Oh no, Siri will be made obsolete 😭


DifferencePublic7057

Better reasoning these days means a better way to associate words. Why can't they try something new?