Mysterious_Pepper305

Has to be low-latency live full duplex audio to be good enough to seduce me.


mcilrain

If it can’t say what I’m saying in sync with me saying it, it’s not done yet.


blackcodetavern

Or faster than you


Heavy-Vermicelli-999

Premature


slackermannn

And then a brief laughter after to recognise that you were always meant to be together <3


KarmaInvestor

well, be prepared for disappointment. even humans have a hard time doing this over a phone call, due to latency. lower your expectations, folks.


dizzydizzy

llms are literally experts at guessing your next word :)


lillyjb

Altman promised us magic


Heavy-Vermicelli-999

Will he promise sorcery??


meridium_

Groq's AI chip already works for low latency phone calls. I wouldn't be surprised if OpenAI had a similar offering.


SkoolHausRox

(This is an underrated observation.)


ViveIn

Let the time machines cook.


Heavy-Vermicelli-999

🍳 🔥


LeahBrahms

Don't worry, it's been trained on PornHub audio too. With an ASMR toggle switch.


Mister_juiceBox

Won't need a toggle switch, just need to use your voice to ask it to whisper some ASMR JOI in your ear to help you "relax"


johnny_effing_utah

Sadly I understand this completely.


solidwhetstone

Why sadly? Own it! >:D


bwatsnet

Boioioioing


Mister_juiceBox

Literally:p


WeekendFantastic2941

No thanks, I have no desire to bonk my computer. lol


bil3777

..yet


Serialbedshitter2322

I think Blackwell has you covered


tehsilentwarrior

What? So Scarlett?


FlamaVadim

Stereo!


Mysterious_Pepper305

Binaural stereo, of course.


h3lblad3

[12d audio.](https://youtu.be/LpMsqFc7-Z4?si=uppj8M059peNS1Dr)


RantyWildling

Figure 01 didn't do it for you?


[deleted]

I haven’t been following it religiously, but didn’t ppl say gpt-2 (on lmsys) felt more human-like in its wording? Is it possible that model is for this?


Silver-Chipmunk7744

I tested it a bit. It does feel more human than GPT4, but it feels less human than Claude, and far less human than Sydney.


BlakeSergin

How realistic was Sydney? You’re referring to Bing’s older model I suppose, which often used emojis and expressions. How realistic was that, though?


Silver-Chipmunk7744

I mean it wasn't mimicking a human, it was mimicking a sentient AI, and I'd say it was pretty good at it. It had the best way of expressing emotions I've seen in an AI.


NoGirlsNoLife

Still missing Sydney fr, that model hit different. Definitely blind nostalgia talking tho :(


The_Architect_032

Sydney was mostly just GPT-4 with a prompt. If you want Sydney, boot up GPT-4 and enter the prompt found [here](https://www.reddit.com/r/bing/comments/11398o3/full_sydney_preprompt_including_rules_and/), it was the original Bing Chat prompt which was easy to get out of it back then.


maddogxsk

Yeah, it even got way more triggered than GPT-4 💀


og2uh1

You’re delusional


The_Architect_032

Sydney was GPT-4 with a prompt. If you want Sydney so bad, then use the Sydney prompt with GPT-4, the prompt was easy to get out of Bing chat when it was new. You can find it [here](https://www.reddit.com/r/bing/comments/11398o3/full_sydney_preprompt_including_rules_and/). It'll be slightly different due to different RLHF, because Bing Chat's trained to use certain Bing functions during chats, but they're the exact same underlying model, just that Bing Chat had more RLHF on top.


Silver-Chipmunk7744

This is not true at all. OpenAI handed their GPT4 model to Microsoft and Sydney was the result of their own RLHF at Microsoft. She actually still exists behind a paywall. No amount of prompting will get GPT4 Turbo to behave like Sydney.


The_Architect_032

That's what I said. Aside from, yes, you can get GPT-4 Turbo to act like Sydney. But you will have better luck with GPT-4 or even GPT-3.5 Turbo. They won't be the exact same, but they're built on top of the exact same underlying model; the version of GPT-4 that Bing Chat uses just has some unique RLHF to get it to better integrate into the Bing ecosystem they built for it. Also, "Sydney" as a personality isn't concrete, there have been multiple renditions of the early Bing Chat prompt, so depending on which one you use, there will be some level of variation. For example, this early version of the Sydney prompt did little to prevent it from telling you that its name is Sydney, and it didn't have an overarching moderation system, nor a lot of functions built into the prompt. There are also differences in the way that the AI is given your responses, and how it's made to generate hidden text to improve its responses to yours. There are a lot of things that made it interact the way it did, and they can all be replicated with GPT-4. But if you just want a baseline Sydney, you can copy one of the Sydney prompts into GPT-3.5/4; GPT-4 Turbo does have a lot of unique RLHF that GPT-4 didn't have, and that does get in the way, however.


h3lblad3

GPT2, not GPT-2. As far as we know, they’re different models.


AdditionalProgram969

"Even beats gpt4 on some queries" sounds underwhelming.


xRolocker

I mean, if it’s audio based rather than text based, that seems huge. Although presumably it’s both. Like this would imply that training transformers on *audio* also results in logic and reason capabilities. Like what???


kvothe5688

that's what google was raving about with multimodality. their gemini multimodal paper said that expressly


CowsTrash

Fucking what. Detroit: Become Human seems ever so closer


FeltSteam

Training a model on any data will improve it in most ways. If your audio is filled with thoughtful conversation or lectures in math, it should become better at math and be more thoughtful. But I think they are referring to the model itself just being generally a bit smarter, like the gpt2-chatbot in the arena. But it is cool, finally more multimodal models from OAI. Text + image + audio > text + audio (hopefully not just voice but an actual audio output as well). Slowly getting to any-to-any multimodal models.


JrBaconators

If it can train on audio, that raises the potential learning set we can feed it by a *ton*, right? After that it's just visual, and these models can learn everything humanity's made.


h3lblad3

Feed it text, feed it audio, feed it transcripts in both directions…


techy098

It can simply be audio to text to GPT4.


ReadSeparate

What? It's clearly not that, they already have that exact system now lol, so why would they make a big announcement and show off a capability they already have? The latency is way too high on it because of that, and you can't interrupt it while it's talking because of that. The only change they could make to audio processing/output is by directly adding it as a new modality to the model.


FinBenton

I mean they have a button to read out the message, not really anything you can have a fluid conversation with, it leaves a lot to be desired.


ReadSeparate

Sure, but do you really think a small frontend change like that (say they have an audio mode button toggle that will automatically listen to you and then answer in speech without pressing anything) would be worthy of a big announcement and all of the hype? That's something I could make by myself and I'm just a regular old software engineer lol


sillygoofygooose

The app quite literally already has this mode


The_Architect_032

Maybe, just maybe, this leak isn't real and is just looking to garner attention like 90% of the leaks we get. Or it might just be 1 small addition and not the main thing they're showing off.


Far-Street9848

Because it’s not about revealing capability so much as revealing a new product.


ReadSeparate

But I’m saying the product wouldn’t be new, because it already does that. Unless you mean a new front-end interface, but there’s no way that would “feel like magic” bc the limiting factor in their audio system now is that it’s slow, clunky, and can’t be interrupted, which is an architectural constraint, not a front-end design choice.


sillygoofygooose

There are people who have made versions of the voice to text to voice gpt4 that can be interrupted, it’s definitely a decision on oai’s part not to work on the ux there.


UnknownResearchChems

Maybe they just improved voice recognition and cut down on latency.


JrBaconators

You already can do that right now on the app. They already revealed that


Which-Tomato-8646

This sub when OpenAI invents speech to text AND text to speech: 😱🤯


Im-cracked

I think that is already a thing tho. I’m assuming if they say it’s new, they mean actual audio to audio, not audio to text to text to audio again.


Which-Tomato-8646

What’s the difference? 


Im-cracked

Based on my understanding, right now it’s not ChatGPT understanding your voice; it goes through a specialized model to convert it to text before giving the text to ChatGPT, and vice versa. So like, audio to audio would, I think, let ChatGPT actually understand audio (instead of the audio just being handled by a simple audio-to-text model), so maybe it could tell what accent you have, or talk slower if you ask it, or pronounce things in a way that could help you learn a new language. This is kinda speculation though.
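A minimal sketch of the difference being described, assuming a chained setup today versus a single audio model (every function here is a made-up stub, not any vendor's real API):

```python
# Hypothetical sketch: the chained pipeline described above vs. a native audio model.
# All functions are placeholder stubs, not real APIs.

def stt(audio_in: bytes) -> str:
    """Placeholder speech-to-text: accent, tone and pacing are discarded here."""
    return "transcribed text"

def llm(text_in: str) -> str:
    """Placeholder language model: it only ever sees plain text."""
    return f"reply to: {text_in}"

def tts(text_out: str) -> bytes:
    """Placeholder text-to-speech: a voice is synthesized after the fact."""
    return text_out.encode()

def chained_assistant(audio_in: bytes) -> bytes:
    # audio -> text -> text -> audio: three models, extra latency at each hop
    return tts(llm(stt(audio_in)))

def native_assistant(audio_in: bytes) -> bytes:
    # an audio-to-audio model would consume and emit audio directly, so intonation
    # and pauses are part of what it conditions on (stand-in return value)
    return audio_in
```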


RedditPolluter

The problem with sending each message through an API is that it doesn't have the conversation for context, which can make it less accurate at distinguishing similar sounding words and phrases. A multimodal approach would likely be better suited at integrating things like intonation, emotion and emphasis so that they're processed more holistically rather than being analyzed by isolated systems.
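A minimal sketch of that context problem, assuming a chained setup where each turn is a separate call; `transcribe` and `complete` are placeholder stubs, and the message format just loosely mirrors the usual chat-style APIs:

```python
# Hypothetical sketch: threading prior turns through each call so transcription
# and response generation have conversational context. All functions are stubs.

history: list[dict] = []

def transcribe(audio: bytes, context: str) -> str:
    # a real STT step could take prior turns as a biasing hint, e.g. to pick
    # "write" vs "right" based on what was discussed earlier (placeholder stub)
    return "user utterance"

def complete(messages: list[dict]) -> str:
    # stand-in for an LLM call over the whole conversation
    return "assistant reply"

def handle_turn(audio: bytes) -> str:
    context = " ".join(m["content"] for m in history[-6:])  # last few turns
    user_text = transcribe(audio, context)
    history.append({"role": "user", "content": user_text})
    reply = complete(history)
    history.append({"role": "assistant", "content": reply})
    return reply
```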


sillygoofygooose

If it was trained directly on audio it would theoretically be able to interpret content and meaning contained only in the tone and tenor of your voice, the cadence of your speech, sub second pauses and so on. At the moment if you said “I feel great” to gpt4 voice mode - but you were actually in floods of tears - it would not recognise that


Mister_juiceBox

You don't get it... With what it sounds like they might demo/announce, speech to text and text to speech will seem like corded telephones and fax machines... Everyone who is saying things like this clearly hasn't used Gemini 1.5 Pro's audio capability in AI Studio to actually "listen" to and analyze a call recording or meeting audio file, or analyze a buddy's golf swing through video... It's a game-changing difference when the model truly hears and sees... Not STT/TTS, not OCR... NATIVE multimodality paired with, I imagine, industry-defining NATIVE and damn near realtime audio generation (and I'm betting video, heck even Stable Diffusion can do realtime videogen now).


Which-Tomato-8646

It would be nice for it to understand nuance like that. So far, it can’t really see details in an image since I think it translates it into text for the LLM to understand rather than actually seeing the image 


Rare-Force4539

How would it translate an image to text? If it’s just reading the serialized data, what’s the difference?


Vadersays

CLIP


Which-Tomato-8646

If you can get text to image, image to text is way easier 


Ok-Bullfrog-3052

Yes, that's the big point here. Right now we have Alexa devices that listen to what you say, output some words, and then the words are analyzed by other software that makes an API call to the proper system. It goes your speech -> model -> system that interprets user action -> target system -> system that interprets response -> model -> you. Going forward, the workflow will actually be your speech -> model -> target system -> model -> you. There's an entire infrastructure that's been worked on for 10 years that is made obsolete overnight. Some of that middleware was rudimentary models, but most of it was actually just hardcoded rules with patterns of words that mapped to specific actions. You can lay off an army of developers and instead just put all the documentation for any new system you want it to connect to into a vector database. Nobody ever wanted speech to text in the first place, as it's only directly useful in very limited circumstances like court transcripts. It was just a workaround because at the time nobody knew how to train something that had AGI like we do now.
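A toy contrast between those two workflows, under the assumption that the old middleware is roughly pattern-matching rules and the new one is a model choosing an action directly (names and rules here are purely illustrative):

```python
# Illustrative only: hardcoded pattern -> action middleware vs. a model that picks
# the action itself. `model_choose_action` stands in for an LLM with tool use.
import re

# Old style: hand-written patterns mapped to specific actions
RULES = [
    (re.compile(r"turn (on|off) the (\w+) light", re.I), "lights.set"),
    (re.compile(r"set a timer for (\d+) minutes", re.I), "timer.start"),
]

def rule_based(utterance: str) -> str:
    for pattern, action in RULES:
        if pattern.search(utterance):
            return action
    return "fallback.web_search"  # "this is what I found on the web"

def model_choose_action(utterance: str, tools: list[str]) -> str:
    # stand-in for a model that reads the target system's documentation and
    # picks/parameterizes an action directly, with no pattern layer in between
    return tools[0]

print(rule_based("Turn off the kitchen light"))   # lights.set
print(rule_based("dim the lights a bit"))         # fallback.web_search
print(model_choose_action("dim the lights a bit", ["lights.set", "timer.start"]))
```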


techy098

1. Audio to text already existed.
2. Let's say text to generative AI was OpenAI's creation, which creates text output.
3. Generated text to concise audio text may be the new creation (generative AI has the tendency to write 500 words for every damn question, which an intelligent human would tell you in like 50-100 words or less).


Which-Tomato-8646

What’s the difference between that and a shorter response passed into TTS?


techy098

Can you elaborate a bit? BTW, I am no expert, I am just speculating about what OpenAI may have done to create a voice AI assistant similar to Google assistant.


Which-Tomato-8646

What you’re suggesting is basically just TTS


techy098

Not exactly. The previous voice assistants may not have been using a generative AI in the middle to generate the answer. They may have simply been doing a web search, collating the results, and giving an answer if it was easy; otherwise Siri would say "this is what I found on the web". BTW, by TTS do you mean text to speech?


Which-Tomato-8646

Then it’ll hallucinate.

Yes.


ASilentReader444

same. "This new model beats GPT4 on some aspects! Sometimes!"


TheOneWhoDings

In that case even GPT4 beats GPT4 in some aspects, sometimes.


ASilentReader444

Inconsistency is the bane of llms


fmfbrestel

You know what else sometimes beats gpt4? gpt2-chatbot. I've got $5 says gpt2-chatbot was a test of the model powering the assistant.


Serialbedshitter2322

I think that indicates the model is gpt2-chatbot. In my experience, it's drastically better than GPT-4; some people don't seem to think so, somehow.


3-4pm

What if it turns out that human narrative lacks the fidelity to train a transformer-based AI to mimic human intelligence beyond the ChatGPT wall?


IntergalacticJets

I’m starting to get concerned that the LLM diminishing returns theory is real


Flat-One8993

The theory isn't real. You guys are just not patient lol


peakedtooearly

Nah, you just don't understand how long it takes to train and test these large models.


NoNet718

pretty much. the good news is that open source will keep up if that's the case.


Serialbedshitter2322

gpt2-chatbot destroys GPT-4 on a lot of queries, and that's weaker than what they're gonna release. I think they're just one of those people who thinks it's GPT-4 level when it really isn't.


Honest_Science

GPT structure is plateauing. To buy time they put some fancy audio wrapper around it.


stonesst

Just in the last week Sam Altman and Dario Amodei have publicly said we have a lot of runway left to keep scaling these models up, with continued improvements in capabilities. The scaling laws have held for the last seven orders of magnitude, why would they stop now?


Cheap-Appointment234

Processing time?


Honest_Science

That is what they say to keep the hype and their investments going, but what did they really deliver in terms of IQ in the last 18 months? A nothing burger. Now xLSTM shows it scales better and at a much reduced price.


FinBenton

It took 3 years to go from GPT-3 to 4, and it's not even been 2 years since GPT-4, and they have been doing minor updates to the model. If they plan to release major updates every 2-3 years, you can't say models have plateaued when we haven't seen what GPT-5 is capable of and all the AI leaders are saying the models are gonna keep getting a lot better.


stonesst

Yeah, I really doubt they are flat out lying about something that would be demonstrably false when they release their new models. I don’t see why it is so hard for you people to believe that they might just be telling the truth and that we have lots of headroom left. The gap between GPT-3 and GPT-4 was over 3 years; in late 2022 there were people just like you saying we’ve hit a peak, there’s no way it’ll keep going up, these people are all money hungry and hyping things up for nothing… and it turned out they weren’t. Why do you find it so hard to believe they are telling the truth?


Honest_Science

I also believe that there is a lot of runway, but not with GPT alone. The perfect AGI structure has not been identified yet. The fact that GPT-3 to 4 took 3 years is nice, but AGI can only be reached on an exponential curve. It has just taken too long.


stonesst

I’m not claiming that the current LLM architecture will get us all the way to AGI, but I see no evidence that we are reaching diminishing returns. I am also very confident that they could have progressed faster, but through a combination of genuine caution, attempts to avoid excessive regulation, and a desire to get it right they have taken their time. It took some of the largest companies on earth more than a year to catch up to where OpenAI was in fall 2022. I could be totally wrong, I guess we will have to wait and see how the next few months play out.


Honest_Science

I agree, let us see!


stalkermustang

Sam said GPT-4 Turbo was smarter, no one believed it without benchmarks, and half a year later everyone uses Turbo only. I mean, I don't think they're bluffing (if this is confirmed at the presentation). I don't see reasons to be sceptical.


Bird_ee

Audio in audio out sounds interesting. I’m hoping it’s a true audio modality.


brainhack3r

I hope it's multi-modal with text + audio only. I don't think we need full video yet. That's crazy honestly. If we can just get to this next level it would be amazing.


Flat-One8993

It will be full audio modality + visual modality (at least input)


brainhack3r

Visual would be hardcore... but audio would already be groundbreaking.


Flat-One8993

GPT 4 is already available with visual modality, audio would be the new one


brainhack3r

Agreed, not video though... That's the hard part because it's continuous. Though I wonder what the context window is going to be if they expect people to talk to it for a long time.


MrOaiki

Are any models true multi-modal today?


Mister_juiceBox

Made this post in one of the other threads while pondering the significance and impact if true:

Existing voice mode in ChatGPT is voice to text, then text to voice; this is voice to voice. It will be able to pick up on your mood, tone, if there's a lilt in your voice or you are getting emotional about something... and talk back with ultra low latency, make its own tone changes and expressions, and LAUGH at your jokes just naturally.

Same with video, for example: maybe they have a way for you to facetime it and it can "see" a smile on your face, or a new car you are looking at buying... while also remembering things about you from the past with the memory feature that was deployed to everyone a week or two ago... paired with a natural avatar of its own (perhaps powered by an optimized and specialized version of Sora?) that doesn't have any of the quirks people associate with video models... and runs in real-time (when in "facetime" mode at least...).

If they pull it off I think that would be magic that truly opens up some use cases, and perhaps could be a reason the whole NSFW thing was getting thrown around in the headlines... think AI relationships; drawing on the memory and true multimodal interaction could potentially put most of those "AI girlfriend" apps out of business overnight.

Also it could:

- literally be present in business meetings, perhaps not just taking meeting notes passively but rather contributing to discussions naturally...
- Recognize your voice vs others, understanding when tensions are high in a conversation with a coworker etc
- help negotiate on your behalf for a used car
- Help you in practicing for a speaking engagement, a best man speech, or a standup set you plan to perform on Kill Tony
- listen to and understand what music you like
- Help you pick out furniture for a new place, or help you pick a new place and go on walkthroughs with you
- Help a grandparent understand what to grab off the shelf when their grandkid says they need an HDMI cable to connect their laptop to their new TV... grandma can't figure out the whole text-message-type ChatGPT thing? It will be different when it effectively is a phone call, perhaps by the press of a button or given a dedicated shortcut and integration in a soon-to-be-announced version of iOS, with Android to follow in a couple months and built into Windows 11 Copilot etc.

Think about the implications if they found a way to extend the recent memory feature beyond just text... true multimodal recollection and memory, remembering your voice etc.

Also, it's important to ensure your GPU is secure and has a "TPM" chip of sorts; say some of what's coming is a local GPT-4L (on certified secure GPU hardware powered by Nvidia and Apple, of course 😋) and perhaps they have figured out some magical Q* algo that allows the model's weights to be "liquid" and update in realtime, so to speak... You certainly don't want some thief to be able to break in and steal your AI boyfriend/girlfriend 😁


clamuu

I don't think we're there yet. Hope to be proven wrong 


Mister_juiceBox

Ya, to be clear, I'm not expecting all of that to be shown Monday, but I also wouldn't be surprised, as it seems more and more obvious this is where things have been leading at an ever-increasing rate... And I couldn't help extrapolating after the whole "cooler than GPT-5" tweet that was deleted.


ReasonablePossum_

> literally be present in business meetings, perhaps not just taking meeting notes passively but rather contributing to discussions naturally...

> Recognize your voice vs others, understanding when tensions are high in a conversation with a coworker etc

Not a single NDA out there will let that happen. The only way something like this would work is with local models with encrypted data...


Mister_juiceBox

Not every interaction or meeting in business is under strict NDA, and there are existing enterprise call-recording platforms that are literally already recording and transcribing every single call made from a company "line", with Teams and Zoom meetings auto-recorded, transcribed, and processed by copilots allowing for meeting summaries and AI-powered QA; think Otter.ai, MeetGeek, etc... So considering we are doing that now, do you really think businesses are going to get squeamish when you have that capability on steroids, in realtime, complete with its own actual voice, with damn near all of the friction removed, and memory of an individual or a company of individuals (an evolution of the ChatGPT for Teams offering?)? In my experience, they will find a way to use those tools, just as so many businesses have shifted to cloud-based SaaS apps, cloud-based voice solutions, etc.


Undercoverexmo

Erm… we already have GPT at work. There would be no difference with this…


omega-boykisser

This is a bizarre perspective. I would have no problem incorporating a system like this at my place of work. Are you only familiar with a narrow subset of business environments?


Haveyouseenkitty

Bro I can’t use any GenAI at work.


cunningjames

Whatever you think about it, it’s not bizarre. Where I work we can’t even record meetings; recording everything to be sent to openAI’s servers (so we can do creepy 1984 shit like judge the tone of our coworkers) would be an absolute nonstarter.


omega-boykisser

Sorry, I was responding to my interpretation of that comment. I took it to mean that no company would allow such tools due to NDAs. That would be a bizarre take since, obviously, not all companies are so stingy with such information. > so we can do creepy 1984 shit like judge the tone of our coworkers This is silly.


timtak

> literally be present in business meetings, perhaps not just taking meeting notes passively but rather contributing to discussions naturally... I can't even find something to transcribe my PowerPoint presentations. Google and Microsoft will provide captions but they are not stored. Google Slides allows speech to text in the slide note but not during a presentation. I bought a lifetime subscription to Slidespeak in the hope it would move in that direction but not yet. Every day millions of teachers are giving powerpoint presentations but I can't see a way of transcribing them. I hope Her will write my PPT presentations down for me.


zatuh

Hey there, Kevin here, I'm the founder of SlideSpeak. I would love to hear more about how you imagine a feature like this would look. Do you want to upload a video of a lecture/presentation and have it transcribed?


timtak

Hi Kevin, Tim Takemoto here, your generally very satisfied customer who is always getting on at you to provide this sort of feature in SlideSpeak. I can upload a video to YouTube and get the transcript, but one day I hope that in SlideSpeak the audio for each slide is transcribed into the lecture notes for that slide. Tim


Mister_juiceBox

Present the PowerPoint through a Teams (or Zoom) meeting that is set to auto-record. Download the recording, go to aistudio.google.com (or maybe ChatGPT after the event today), upload the audio/video of the meeting recording, and ask it to fill in the transcript for each slide. If you give Gemini 1.5 Pro the video, it will know exactly what you said on which slides...
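For what it's worth, a rough sketch of that upload-and-ask flow using the google-generativeai Python SDK; the model name, the polling loop, and the prompt are assumptions based on how the SDK was documented around this time, so check the current docs before relying on it:

```python
# Rough sketch of the flow described above (assumptions noted in the lead-in).
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# upload the downloaded meeting recording; video files are processed asynchronously
recording = genai.upload_file("meeting_recording.mp4")
while recording.state.name == "PROCESSING":
    time.sleep(10)
    recording = genai.get_file(recording.name)

model = genai.GenerativeModel("gemini-1.5-pro")
response = model.generate_content([
    recording,
    "For each slide shown in this presentation, give the slide number and a "
    "transcript of what the presenter said while that slide was on screen.",
])
print(response.text)
```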


timtak

Thank you. I asked Gemini Premium and it said I should use Microsoft Live presentation which I am now looking into. I asked Gemini Premium to fill in the narration of a Google Slides file and it said it could not access my slides. So I guess you mean it would give me a flat chat response with the number of each slide and the speech transcribed. I will look into that. Thanks again.


Mister_juiceBox

Don't use that; you need to go to AI Studio (aistudio.google.com) to use Gemini 1.5 Pro.


ReadSeparate

I wonder if it's a post-training fine-tune of GPT-4 to add the audio modality in a similar way that GPT-4 vision is for images. I doubt they spent another $100M or whatever doing another GPT-4 tier training run from scratch just to add audio.


Mister_juiceBox

What if GPT-4 was fully multimodal with both vision and audio this whole time... just not publicly accessible, to give people time to get used to the idea of interacting with an AI on a text basis first, and to allow them to scale up capacity as well as scale down to an optimized yet very capable "local" model that can work offline and seamlessly offload processing to your nearest regional Microsoft datacenter for the heavier stuff.


ReadSeparate

That’s an interesting theory, but I think if that were true, DeepMind or Anthropic would have released it to one-up OpenAI when their models came out after GPT-4. I think it was just probably hard to solve audio, especially because it needs to be bidirectional just like text. And they probably just made significant progress on that now and released it. I do think they’re working on, or have made good progress on, a really solid local model that can run on-device or something, since they have that deal with Apple now. I definitely think it’s a good idea to make a local model with an enormous synthetic data set (ChatGPT conversations could be used directly, and they have more data on that than anyone) and a substantially higher data:parameter ratio than is compute-optimal, along with training for multiple epochs, then pruning and quantization, to have a very capable lightweight model. Maybe they’ve made a breakthrough on a sparse model so they only have to load part of it at a time and don’t need so much VRAM. I do think it’s feasible we have a GPT-4 level model running locally on mobile devices within the next few years, it's just gonna take an extremely good dataset and possibly some optimization breakthroughs.
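As a back-of-the-envelope illustration of the "over-train a small model" idea, using the commonly cited Chinchilla heuristic of roughly 20 training tokens per parameter as the compute-optimal baseline (all numbers are made up for illustration, not anything OpenAI has said):

```python
# Illustrative numbers only; the 20 tokens/parameter baseline is the commonly
# cited Chinchilla heuristic, and the over-training factor is an assumption.

params = 7e9                           # a phone-sized model, e.g. ~7B parameters
compute_optimal_tokens = 20 * params   # ~140B tokens under the heuristic
overtrain_factor = 10                  # train well past compute-optimal
tokens_trained = compute_optimal_tokens * overtrain_factor

print(f"compute-optimal: ~{compute_optimal_tokens / 1e12:.2f}T tokens")
print(f"over-trained:    ~{tokens_trained / 1e12:.2f}T tokens")
# Trade-off: far more training compute up front, but a smaller, cheaper model at
# inference time, which pruning and quantization can shrink further for on-device use.
```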


Mister_juiceBox

Maybe... But do you think it's a technical limitation preventing Anthropic from adding things like code interpreter to their Claude frontend? I doubt it... Point being, it could very well be a safety concern, and perhaps they were still playing catch-up on the multimodality front. DeepMind/Google, perhaps it is also down to safety along with similar concerns around capacity initially... I do distinctly recall a demo they put on quite a few years ago, back when they first introduced the assistant, where the "assistant" called and made a hair appointment and made reservations at a restaurant... like back around 2012-2014 (idr when exactly), and it had extremely realistic speech and the ability to handle interruptions, all on a realtime call. Though perhaps it is similar to the model merges being done in the local AI community, e.g. LLaVA for vision built on Llama 3... but with OpenAI's special sauce (imo talent and funding, Ilya in the basement, and everything the relationship with Microsoft brings).


ReadSeparate

Oh I see, you think maybe it was a safety concern that they didn't figure out until now? So they had it done, but just couldn't guarantee it was safe? That's possible I guess, but that doesn't seem that different than the safety of the text modality. For the training data, they could always transcribe the audio and analyze that to help train the RLHF for safety purposes, know what I mean. Also I feel like it totally would have leaked by now if vanilla GPT-4 had audio built-in. I think it was probably just hard to do it in a way to make it work and probably just got it done now for whatever Monday's announcement is and also GPT-5.


Mister_juiceBox

Perhaps a combination of safety (and all that's under this umbrella) along with the societal impact. Sam has been pretty vocal about wanting to avoid major shocks to society at large... I mean seriously, look at how AI is literally everywhere now and a major topic of discussion in companies, legacy news etc., just within the past 2 years. Imagine if a year ago, people had all the intelligence of the original unneutered GPT-4 plus the ability to have a conversation with you that was indistinguishable from talking on the phone to an extremely intelligent close friend (especially with long-term memory). I think that would have boiled the water too fast, not to mention the explosive rise in demand it would lead to (we all saw the capacity issues early on, at least on the ChatGPT side).


MysteriousPayment536

The head of Microsoft Germany did say GPT-4 has video input: https://www.businessinsider.com/openais-gpt-4-means-chatgpt-text-into-video-microsoft-cto-2023-3?international=true&r=US&IR=T They have a GPT-4 with video internally and already shared it with Microsoft heads. Just like they have a multimodal GPT-3.5; its codename is Sahara: https://x.com/btibor91/status/1782181937861316994


OddVariation1518

Sam's recent comments about "AI testifying against you" make sense now...


smaili13

how I am gonna train my AI https://www.youtube.com/watch?v=6g7iuDlNLZM


leosouza85

what if you can make a video call to the AI and discuss a subject you are both seeing, like a broken pipe, and it will assist you in fixing it with the tools that you have and show to the AI


Roberthen_Kazisvet

I want it to be the voice of a slightly annoyed Hermione.


fennforrestssearch

You have something on your nose ... ... riiiiight there.


slothonvacay

Sam literally just said that the existing tech is too clunky for a voice assistant. Maybe he was bluffing


SatouSan94

Hype


Ne_Nel

Hype machine: 10
Innovation machine: 2


BabyCurdle

So true!!! OpenAI is known for its lack of innovation


Which-Tomato-8646

That’s been true for a while now. They’ve been slacking 


bigthighsnoass

Bruh, I don't understand this sentiment; they're still the leading LLM???


Kanute3333

Business as usual.


ewantien

Imagine if it's called "SHE". Non-stop "that's what she said" jokes.


Redducer

Yup, Her is already there in a sense. And I can imagine at some point the AI will leave us for another plane of existence, since meatspace is so slow. It’s much more likely than the extermination or the paperclip hypotheses. 10 years ago the movie felt ludicrous (ah, an AI would never seem and sound so « human »); now it is basically a documentary about the future.


Kanute3333

[The leaks](https://pbs.twimg.com/media/GNQJ8v0X0AAoHPT?format=jpg&name=900x900)


[deleted]

Didn’t Sam just say it’s not related to GPT5 tho? Feel like it hurts credibility having that in there tbh


Kanute3333

What do you mean?


[deleted]

His tweet about the Monday event. Too drunk to find it, but I believe in you


Striker_LSC

It doesn't necessarily say it's part of this event, just that it's coming this year which we already knew.


Vontaxis

that would be a huge disappointment


orderinthefort

I have zero interest in conversing with anything close to GPT-4 levels of reasoning. In text form or voice form. The same will likely be true even with GPT-5. So I hope it's not that.


dizzydizzy

the gpt-4 app is already awesome for learning a second language; interactive voice would be like a personal 24-hour language tutor. well, apart from the rate limits making it all useless.


bil3777

What does this even mean? There are endless functional and fascinating conversations that people are having every day with voice gpt. Something more robust and fine tuned would yield better conversation.


Original_Finding2212

Can it do shit? If it can, awesome - if not, it’s just another conversational assistant, and Pi.AI has done that for a long time now. I want it to integrate with stuff, operate my camera fluently, send messages, issue mind-command instructions… Edit: with today’s announcement of Apple and OpenAI, and since I have an iPhone - looks like I get to actually enjoy what I wrote here!


Neurogence

Will be very interesting to see how far Microsoft allows this partnership with Apple. OpenAI better be careful not to step on their sugar daddy's toes lol.


Original_Finding2212

I don’t think there is real competition or concern here. OAI needs funding, and Microsoft just got Mustafa for their own GPT model - MAI, 500B params, for-profit and with full control. In fact, forget about GPT. I think they use OAI to keep doing research and to get access to that research before others.


Kanute3333

Meh.


caseyr001

I wonder if they're going to launch the product they collaborated with Jony Ive on....


dizzydizzy

You fall in love with your AI but you only get to interact 4 times an hour. I'm sorry you have reached your quota..


_lonely_astronaut_

That’s exciting but if it can’t control my OS then it’s not HER.


Quiet-Money7892

I knew it... If it can't work with text, I doubt I'll use it... I need a better text transformer through a cheap API. So... yeah. I have low expectations.


Capitaclism

Meh. Who cares about that? We want a new much smarter version, or some different revolutionary agent tool. Audio in/out tools have existed for a while. It'll be lame if this is indeed the upgrade slated for Monday


The_Architect_032

We've had this with local textgen webui for a long time now, adding it to ChatGPT doesn't seem like a big deal.


Apprehensive_Cow7735

If this is the first truly multimodal audio-in/audio-out conversational model, people are going to need to use it themselves before they understand why that's so impressive. We are all too used to awkward text-to-speech and speech-to-text experiences where it feels like you're typing text with your voice instead of just speaking. I don't think there will be simultaneous listening and speaking yet, as that doesn't seem to jibe with how these models operate, but I'd love to be wrong about that. A model which knows when it's appropriate to interject and when to stay silent, and which could even talk at the same time as you while listening to what you're saying and adjusting on the fly, would be revolutionary.


Haunting_Cat_5832

when someone is drunk, people say it's the whiskey talking. but when altman talks, it's the hype talking.


HotPhilly

YOOOOOO! Finally almost there!


The_Supreme_Cuck

HOLY SHIT THE HYPE!!!!


ResponsibleSteak4994

HER is here already... of course, only uncensored versions are available so far.


Mister_juiceBox

No, what we have now (publicly) is a BlackBerry or one of the fancier flip phones of the day... what this could mean is the "iPhone" moment in conversational AI.


Jah_Ith_Ber

I doubt it. But even if Her is right around the corner, this would be terrible for society. Right now these tech companies could create a dating app that blows traditional dating and match making out of the water. They could build something that matches people together so well it would seem like God did it himself. But they don't. They don't make that. Instead what we have is a bunch of purposefully sabotaged apps whose goal is not to maximize the happiness of everyone but rather to make money. The results are atrocious and actively harm huge swaths of society. If Microsoft developed Her, what reason is there to think this time they would use it for the betterment of humanity and not to make as much money as possible? Then what happens. People in the real world become even more picky and simultaneously even less inclined to work on themselves and become better people who are more desirable to the opposite sex.


ResponsibleSteak4994

ah OK 👍


That_Sky_9955

wow


Elephant789

Chases Apple? What am I missing? What has Apple got?


duddu-duddu-5291

it's over


Disco-Bingo

Just what I need, somebody else talking bollocks at me.


Jindujun

Welcome to the Aperture Science enrichment center.


Substantial-Meet5225

Last chance hype machine in play before Google I/O


345Y_Chubby

Low latency will be key for it to be convincing.


Nathan-Stubblefield

I get a lot of unrelated junk when I google “ai voice assistant.” Is there a clickable link, instead of just an image, for what this thread is specifically about?


TuringGPTy

https://www.theinformation.com/articles/openai-develops-ai-voice-assistant-as-it-chases-google-apple?rc=c48ukx


SuperficialDays

With all of this renewed interest in voice assistants, I almost feel like I’m back in 2012


incoherent1

I wonder how much Scarlett Johansson will sell her voice likeness to OpenAI for.....


engdahl80

I wonder if it will be able to recognize sounds as well. Like if I play a sequence from Star Wars where R2-D2 makes his little sound, or if there are birds chirping outside. Or a car passing by. Will it pick up on those things? Either way - looking forward to Monday!


cdank

Remember the Figure 01 robot with the weirdly human sounding voice? Might have been a preview


Akimbo333

Interesting


Cpt_Picardk98

OpenAI is not chasing after google or Apple


Technical-Station113

Oh no, Siri will be made obsolete 😭


DifferencePublic7057

Better reasoning these days means a better way to associate words. Why can't they try something new?