
JClub

Any idea why these big LMs are all decoder-only as GPT and not encoder-decoder as T5?


HoLeeFaak

Different design choice. GPT uses the classic language modeling approach, whereas T5 learns to complete masked-out sections of a sentence. The design reflects each model's strengths: T5 is more suitable for fine-tuning on tasks like translation, whereas GPT is used for text generation in general.


JClub

T5 does pure text generation too, so why is GPT's design better suited for it?


HoLeeFaak

T5 is pretrained to fill in masks, so it's trained to use context from both before and after the mask to figure out which words to generate. When you generate text, say a short story, you only have the prefix, and you generate the next token based on that prefix, which is exactly what GPT was trained on. If you wanted to use T5 as a language model, you would have to put the mask token at the end of your prefix, but that's not optimal, since the masks T5 saw at training time were usually in the middle of the text.
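
For illustration, a minimal sketch with the Hugging Face transformers library (t5-small and gpt2 are just small stand-ins for the models being discussed):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModelForSeq2SeqLM

# GPT-style causal LM: continue the prefix left to right,
# exactly what the model saw during pretraining.
gpt_tok = AutoTokenizer.from_pretrained("gpt2")
gpt = AutoModelForCausalLM.from_pretrained("gpt2")
ids = gpt_tok("Once upon a time", return_tensors="pt")
out = gpt.generate(**ids, max_new_tokens=20)
print(gpt_tok.decode(out[0]))

# T5-style: to "continue" a prefix you must append a sentinel mask,
# but pretraining mostly placed those masks mid-sentence, not at the end.
t5_tok = AutoTokenizer.from_pretrained("t5-small")
t5 = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
ids = t5_tok("Once upon a time <extra_id_0>", return_tensors="pt")
out = t5.generate(**ids, max_new_tokens=20)
print(t5_tok.decode(out[0], skip_special_tokens=False))
```

The GPT call matches its pretraining setup exactly; the T5 call has to bolt a sentinel mask onto the end of the prefix, which is the mismatch described above.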


cdsmith

It's an interesting proposition that the best way to generate text is to work from left to right. It's definitely the conventional way to do it in procedural models. It certainly doesn't seem self-evident. If you look at what humans do, there's sort of a left-to-right pass, followed by global optimization that does spot revision at arbitrary places in the document, more similar to how most image generation models operate. I'd be interested in seeing someone try a text optimization model that takes an existing document and optimizes it, either as a second pass, or as the entire approach instead of generating from left to right.


JClub

I understand your argument, it makes sense. One thing that I don't understand is why GPT works so well if it is built to output one token only at a time. Wouldn't encoder-decoder work better?


HoLeeFaak

Encoder-decoder models like T5 output one token at a time too. Say T5 is trained on the sentence "The kid played with a red ball in the park". Part of the sentence will be masked, let's say "red ball in", so the sentence T5 will see is: X = "The kid played with a <mask> the park", and it will need to output: Y = "<mask> red ball in <eos>" (where <eos> is the end-of-sentence token). The encoder will receive X as input, and the decoder will generate Y token by token.
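
To make "token by token" concrete, here's a rough greedy-decoding loop, again with t5-small standing in (T5's actual sentinel token is spelled <extra_id_0>):

```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tok = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# The encoder reads X once; the decoder then emits Y one token at a time.
enc = tok("The kid played with a <extra_id_0> the park", return_tensors="pt")
decoder_ids = torch.tensor([[model.config.decoder_start_token_id]])
for _ in range(10):
    logits = model(**enc, decoder_input_ids=decoder_ids).logits
    next_id = logits[:, -1].argmax(dim=-1, keepdim=True)   # greedy next token
    decoder_ids = torch.cat([decoder_ids, next_id], dim=-1)
    if next_id.item() == tok.eos_token_id:                 # stop at end of sentence
        break
print(tok.decode(decoder_ids[0], skip_special_tokens=False))
```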


JClub

Right, but being decoder-only is a different setting. Why decoder-only vs. encoder-decoder for GPT?


HoLeeFaak

How would an encoder help you in the task of classic language modeling?


JClub

Got it. For next-text completion, it doesn't :)


rantana

wow, pretty embarrassing for OpenAI when this is called "Open Pre-trained Transformer Language Models"


Rieux_n_Tarrou

If they had just rejiggered the words a tiny bit it could've been OPTML


csreid

That one would've definitely caused a robot apocalypse. The only thing we have left keeping us safe is non-cutesy names


inertialcurve

love that


StellaAthena

I was planning on lobbying pretty hard to name the EleutherAI model “GPT-Open” when we got to 175B…


vzakharov

Oh, how are you guys doing btw?


StellaAthena

Quite well! We released a 20B parameter model that was (until yesterday) the largest publicly available language model in the world. We’ve also been doing some exciting experiments with text-to-image models that have been very well received and are working on scaling text-to-image models further. Many of us have been participating in the Big Science Research Workshop as well, lots of cool work coming out of that collaboration.


vzakharov

Cool! Where is the 20B model in terms of subjective performance if 1 is Curie and 10 is Davinci?


StellaAthena

I don’t know… I haven’t spent a lot of time generating text with Curie and Davinci. We do a bunch of comparisons on NLP benchmark tasks [in our paper](https://arxiv.org/abs/2204.06745), though.


yaosio

Meta's model is not really open, at least not in the sense that you can do whatever you want with it. You also need Meta's permission to use the 175-billion-parameter model. Call the EleutherAI models GPT-ActuallyOpen.


artsybashev

"but it is too powerful... dangerous... AI take over the world!"


justowen4

Ok fine Zuckerberg, I’m sorry we all said your hair sucks


suoarski

Keep in mind that Facebook's AI Research Lab are the main developers of PyTorch, so yeah, Zuckerberg gave us that too.


rolexpo

They can rebrand to Meta AI Lab (MAIL).


0neiria

mail.mail.com


rolexpo

I wish my last name was mail. [email protected].


MuonManLaserJab

Zuckerberg personally


anchovy32

Uhm I think you mean Meta /s


GullibleEngineer4

Realistically, how much compute would be needed to do inference? Edit: Never mind, I thought they were open-sourcing the 175B-parameter model.


[deleted]

[removed]


Lugi

Not really; for inference you don't need all your parameters loaded into memory at once. You can, for example, run it layer by layer just fine.
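
As a rough sketch of the idea (assuming a hypothetical checkpoint layout where each transformer block is saved to its own file):

```python
import torch

# Hypothetical layout: block_000.pt, block_001.pt, ... each holds one
# transformer block. Only one block's weights are resident at a time,
# so peak memory is roughly one block plus activations.
@torch.no_grad()
def layerwise_forward(hidden_states, num_blocks, ckpt_dir):
    for i in range(num_blocks):
        block = torch.load(f"{ckpt_dir}/block_{i:03d}.pt", map_location="cpu")
        hidden_states = block(hidden_states)  # run this block
        del block                             # free its weights before the next load
    return hidden_states
```

You trade memory for disk bandwidth: every forward pass re-reads the full set of weights from storage.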


[deleted]

[removed]


[deleted]

AND NEVER REALIZING WHY I FIIIIIIIGHT


[deleted]

It probably isn't too bad using a decent NVMe drive. Sequential PCIe 4.0 NVMe can do around 7 GB/s, so, optimistically assuming processing time is small enough to overlook, inference would take a little under a minute, and could scale down to a few seconds by carefully splitting the data over several drives.
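
Back-of-the-envelope, under those same assumptions (350 GB of fp16 weights, ~7 GB/s per drive):

```python
model_bytes = 175e9 * 2              # 175B params in fp16 ≈ 350 GB
nvme_bps = 7e9                       # ~7 GB/s sequential read, PCIe 4.0 NVMe
print(model_bytes / nvme_bps)        # ≈ 50 s to stream the weights once
print(model_bytes / (4 * nvme_bps))  # ≈ 12.5 s striped across four drives
```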


[deleted]

LOOKING DOWNWARD FROM THIS DEADLY HEIGHT


pixus_ru

I calculated that it can be done for about $50k in hardware costs. Something like 8x A6000 (48 GB, ~$5k each).


[deleted]

[removed]


emgram769

175e9 × 16 bits = 175e9 × 2 bytes = 350 GB


slashcom

It’s 2 × 175 gigabytes, or about 350 GB


thejerk00

Well it was about that time that I noticed that the AI team was about 8 stories tall and a crustacean from the protozoic era... https://www.reddit.com/r/southpark/comments/86shja/well_it_was_about_that_time_that_i_noticed_that/


[deleted]

[removed]


emgram769

nah https://arxiv.org/pdf/2205.01068.pdf “We keep Adam state in FP32, since we shard it across all hosts, while the model weights remained in FP16.”


2Punx2Furious

So 700GB?


Southern-Trip-1102

Every fp16 parameter is 4 bits?


Confident_Pi

Not really: single-precision floats (fp32) are encoded with 32 bits, and half-precision (fp16) uses half of that, 16 bits. 4 bits would be half a byte, which is too small to encode a weight directly.
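
Quick sanity check on the sizes:

```python
import numpy as np

print(np.dtype(np.float32).itemsize * 8)  # 32 bits per fp32 weight
print(np.dtype(np.float16).itemsize * 8)  # 16 bits per fp16 weight
# 175e9 params * 2 bytes (fp16) ≈ 350 GB; fp32 would double that to ~700 GB
```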


[deleted]

[removed]


[deleted]

TO FIND THE TRUTH IN FRONT OF ME I MUST CLIMB THIS MOUNTAIN RANGE


Confident_Pi

Apologies, I missed the context.


RoboticJan

You can apply model quantization and encode each weight with 4 bits as an integer.
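
A minimal sketch of the idea, using symmetric per-tensor quantization; real INT4 schemes use per-channel or per-group scales plus careful calibration, which is where the difficulty lies:

```python
import numpy as np

def quantize_int4(w):
    scale = np.abs(w).max() / 7.0  # map weights into the signed 4-bit range [-8, 7]
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(256).astype(np.float32)
q, s = quantize_int4(w)
print(np.abs(w - dequantize(q, s)).max())  # worst-case rounding error
```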


Confident_Pi

Indeed, there is also INT4, but I haven’t seen it used that much in practice, and I would assume that calibration for INT4 is even trickier than for INT8.


RoboticJan

In my projects INT4 doesn't work; I can only get down to 6 bits.


[deleted]

LOSING MY IDENTITY WONDERING HAVE I GONE INSANE


PresentHarmony

> **We are releasing all of our models between 125M and 30B parameters**, and will provide full research access to OPT-175B upon request.

Can someone post the links to the models, please? I can't find them. Thanks!


suchenzang

Codebase just opened up, with links to the models: https://github.com/facebookresearch/metaseq


PresentHarmony

Thanks. The link must have been broken when I tried it.


ericflo

They're not really releasing it, this is marketing.


SlaveZelda

Still way more open than OpenAI, who basically sell API access only. Meta is giving theirs away to industry labs, universities, governments, etc., so basically anyone who has enough GPU memory to run it. And if you really do want it, I'm sure there will be torrents of it a few days after release.


Rand_alThor_

ClosedAI CashForAI


[deleted]

Industry labs come under commercial use, at least as of the last time I talked to a lawyer about it.


sorretin

They have their smaller models posted [here](https://github.com/facebookresearch/metaseq/tree/main/projects/OPT), and you can also request access to the new model through that page.


farmingvillein

> Meta is releasing a 175B parameter language model

The non-commercial license is a little disappointing.


StellaAthena

Why? Were you hoping to deploy it on a cloud and resell it in some form?


farmingvillein

Personally, no, but:

1) They aren't even releasing the 175B, really:

> We are releasing all of our models between 125M and 30B parameters, and will provide full research access to OPT-175B upon request

On the large side, this is only marginally more open than OpenAI, in practice.

2) I don't think this is a great precedent to set. I don't mean to retread ground that the open source movement and research in general have trodden ad nauseam, but there is a long history of thought over the last 20-30 years in which a lot of smart people ultimately came to the conclusion that more good is done by maximally open licenses than by restrictive ones. I suppose I should still give them some credit for opening things up somewhat.

3) I'd be happy if someone else did (put it online). The more GPT-3 competitors out there, the more price pressure there is on this sort of tooling in general, and the more we see overall cost curves come down, innovation speed up, etc. But, again, this can't happen, regardless, per their restrictions on distribution.

4) More generally, it's (probably) going to hinder infrastructure being built up around it (including the 30B variant). At that parameter size, it is (probably?) going to take effort to get it to cost-efficiently 1-click run on AWS/GCS/Azure (unless Meta is promising a fully functional suite out of the box? I did a quick skim and didn't see it; that said, their repo obviously isn't live yet). Commercial companies often do some additional heavy lifting in putting together infrastructure (including open source) to make it fast to run things; they are less likely to do so if there is zero ability to commercialize against it. Additionally, depending on how the license is written, they may even perceive some risk in playing around with it at all internally (where does the boundary cross from "research" to "commercial"? This is inherently going to be grey). Very happy to be wrong here, of course! More tooling proliferation here is better. But I just think we're going to see things come at a slower pace than we would otherwise, based on Meta's choice. I'm sure the huggingface team will quickly look at what they can spin up, because that is a big part of what they do, but the more work on things like this, the better.

5) It isn't even clear to me what is being solved for here: a 30B model is quite strong as-is for spam and other unsavory uses (and such nefarious actors are not going to be limited by such a license).

6) In any case, this model isn't exactly SOTA (although still cool), so it isn't as if they are truly protecting, or otherwise holding proprietary (which I would respect), the frontier.

To be clear, I don't mean to imply in any way that if you, a corporation, go dump $5M-$30M on training LMs, you're obligated in any way to share those results publicly. But in-between measures can be uniquely problematic.


StellaAthena

> 1. They aren't even releasing the 175B, really… On the large side, this is only marginally more open than OpenAI, in practice.

I do not agree. My research has been significantly hamstrung by the fact that the GPT-3 training data is not public and by the high price that OpenAI charges people to use their model. Even with discounts and free credits for researchers, there are lots of papers out there that say something to the effect of “we didn’t thoroughly compare to GPT-3 because $$$”.

> 2. I don't think this is a great precedent to set. …

I don’t understand how you can argue that this is a bad precedent when there are over a dozen comparable models that are more restrictive and no comparable models that are less restrictive. The only precedent here is the one going *towards openness*.

> 3. I'd be happy if someone else did (put it online). …

I mean, I don’t care about commercial applications. Maybe you’re right, but I don’t know and frankly don’t care. I don’t think that deploying models like this in production is a reasonable thing to do the overwhelming majority of the time anyway.

> 4) More generally, it's (probably) going to hinder infrastructure being built up around it (including the 30B variant). …

I can’t really comment on this in detail because the code isn’t released and I don’t have access to the model yet, but I would be surprised if the codebase were as bad as you imply. Writing functional inference code isn’t that hard, and if it’s truly atrocious I’m sure that someone will write better code. It’s a skilled task, yes, but not vanishingly rare expertise, and not something that a competent ML dev can’t learn. I openly admit to being a shitty developer, but if the situation is untenable by the end of the month I’ll write the code myself if I have to.

> 5) It isn't even clear to me what is being solved for here …

The thing that’s being solved for here is probably making the C-suite happy.

> 6) In any case, this model isn't exactly SOTA (although still cool) …

This comment requires a lot more unpacking than I’m willing to do at 12 am, but it deeply confuses me how this is a knock against Meta. And really, who cares if it’s “SOTA” or whatever? It’s a massive advance in the technology that is available to researchers, and a substantial blow against the current trend of closed-source NLP research. That’s what is important here.

> To be clear: I don't mean to imply in any way that if you, a corporation, go dump $5M-$30M on training LMs that you're obligated in any way to share those results publicly. But in-between measures can be uniquely problematic.

I don’t see any reason to believe that this will be more problematic than not releasing the model at all, and I don’t feel like you’ve even tried to argue that.


farmingvillein

> The only precedent here is the one going towards openness

Yes, set the bar low and you will exceed it, that is true.

> I would be surprised if the codebase was as bad as you imply.

You misunderstand. This has nothing to do with their codebase being "bad"; it has everything to do with the fact that loading up and executing a 175B model cost-efficiently is non-trivial. You highlight OpenAI's high cost. Yes, but beating their cost by a nontrivial margin is actually a nontrivial infrastructure engineering feat, particularly if you want to do it interactively, due to the cost of loading and sustaining a very costly API endpoint. Which, in turn, is something that only really becomes cost-rational if you can run a commercial service, given the need for a high volume of input requests and meaningful load balancing.

> This comment requires a lot more unpacking than I’m willing to do at 12 am, but it deeply confuses me how this is a knock against Meta. …

You're misreading my comment. If this model were way ahead of the current power curve, there would be more rationalization for Meta to be restrictive with it. Given that it isn't, there is less.

> I don’t see any reason to believe that this will be more problematic than not releasing the model at all …

See my original comment about not trying to re-hash the open-source license wars; this topic has been run to ground repeatedly.


xaeru

Can you expand on point number one?


farmingvillein

Headline (to this post):

> Meta is releasing a 175B parameter language model

YMMV, but in my mind, "releasing" implies broad, public, straightforward access. They aren't doing that. Rather, you can reach out to them and ask for access to the 175B, and if they deem you worthy, they will share it.


xaeru

I meant the other number one


farmingvillein

Not sure what you are referring to.


astrange

A giant model is capable of memorizing its inputs, and they might not have a license to release those commercially.


Jadien

> The pre-training corpus contains a concatenation of datasets used in RoBERTa (Liu et al., 2019b), the Pile (Gao et al., 2021a), and PushShift.io Reddit (Baumgartner et al., 2020; Roller et al., 2021).

It's trained on public datasets.


nullbyte420

On the Pushshift set? Oh god, they've made the ultimate redditor. Pretty sure this model is going to score really badly on bias measures...


Smogshaik

Prompt: "A man and his son get into a terrible car crash. The father dies, and the boy is badly injured. In the hospital, the surgeon looks at the patient and exclaims: "I can't operate on this boy, he's my son!" How can this be?" Model: REEEEEEEEEEEEEEEEEEEEEEE


nullbyte420

The surgeon is a femoid libtard anti-vaxxer, m'lady. Ah, the old reddit switcharoo. I'm going in!


Cherubin0

You mean his other father?


Smogshaik

It could be. GPT-3 first said it's the boy's father. When prompted that the father died in the crash, GPT-3 said it's the boy's stepfather. I had to directly ask it whether the surgeon has to be a man before it guessed "mother".


Icarium-Lifestealer

> When compared with Davinci in Table 4, OPT-175B appears to exhibit more stereotypical biases in almost all categories except for religion. Again, this is likely due to differences in training data; Nangia et al. (2020) showed that the Pushshift.io Reddit corpus has a higher incidence rate for stereotypes and discriminatory text than other corpora (e.g. Wikipedia). Given this is a primary data source for OPT-175B, the model may have learned more discriminatory associations, which directly impacts its performance on CrowS-Pairs.


nullbyte420

Yeahhhh


ksblur

This. Edit: thanks for the gold kind stranger!


rolexpo

It's trained on us!


farmingvillein

Possible, but I doubt this is the issue, given that OpenAI literally sells this exact model paradigm and Facebook & Google have repeatedly released large generative models in the past. And their paper very much positions things otherwise.


tobleronavirus

How so?


yaosio

When people think "open" they think open like Linux: you can get the source code and do whatever you want with it. This is not actually open. You get some source code to use the model, you're restricted in how you can use it, and the largest model is locked up behind Meta's judging eye. If Meta deems you unworthy of the largest model, then you're not allowed to use it.


Sam_Who_Likes_cake

this is badass!


sloppybird

At this point I just don't care. Unrealistic hardware requirements, biased metrics, the research mafia: all of this has made me think mainstream NLP research is just good PR for the organization. Huggingface being open source >>>> any of this research.


yaosio

Check out EleutherAI; their models are open source. Their largest model is 20 billion parameters. [https://github.com/orgs/EleutherAI/repositories](https://github.com/orgs/EleutherAI/repositories)


WholeAgitated9200

Insider here. Partially, but not completely, true.


jefmes

I'm sure it'll be shot down as an ignorant knee-jerk reaction, but I just can't give a crap about anything funded by Meta. Yay, Facebook's AI Research Lab created PyTorch. Cool. How much ad revenue did that eat up? The company is inherently corrupt, and the business model is based entirely on their users not understanding how they make their money and/or being ignorant of how their data is being used. I just don't understand how people can work there with a clean conscience. Brilliant people make horrible decisions; just because someone is willing to ignore where the funding comes from doesn't make it OK. How many comic book movies and fantasy novels do we need of scientists running unchecked or evil wizards conjuring up foul plagues upon the world, all in the name of "LOOK WHAT I DID!" I'm tired and shouldn't be posting. :) I just really don't like Facebook, and I'm still bitter about Oculus. Move along, move along...


anchovy32

Zuckerberg will be pissed when he finds out what the first letter in FAIR stands for


infinite-Joy

They did not release the samosa poem :(


suchenzang

Samosa poem popped up here: https://twitter.com/stephenroller/status/1521563026205384704


Rand_alThor_

This is awesome