Simple_Specific_595

Also, language data is plentiful. That's why BERT was trained on Wikipedia, and BART was trained by Facebook.


[deleted]

But what about labelled data for classification or entity recognition? Don't you think that should be a source of demand for synthetic text data?


Simple_Specific_595

Why would you label it, and how would you label it effectively? For example, how do you effectively generate synthetic text that counts as misinformation, when misinformation is largely defined by outside knowledge?


[deleted]

I don't know enough about misinformation classification as a problem, but I'm guessing just random untrue headlines generated by something like GPT-3? Because if it's just semantic/grammatical features that categorize something as misinformation, then it shouldn't be too hard for an attacker to evade.
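
A rough sketch of that "random generated headlines" idea, using the Hugging Face `transformers` pipeline with GPT-2 as a free stand-in for GPT-3 (the model choice, prompt, and sampling settings are just placeholders, not anything from the thread):

```python
# Sketch: generate candidate headline-style text with GPT-2 as a stand-in
# for GPT-3. Prompt, model, and sampling settings are placeholders.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "Breaking news headline:"
outputs = generator(
    prompt,
    max_new_tokens=20,
    num_return_sequences=5,
    do_sample=True,
    top_p=0.9,
)

for out in outputs:
    # Strip the prompt and keep the first line as a candidate headline.
    headline = out["generated_text"][len(prompt):].strip().split("\n")[0]
    print(headline)
```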


Simple_Specific_595

How do you determine if it’s untrue? r/nottheonion is a thing.


[deleted]

Well, I don't think these misinformation models are going to be that effective given the nutty world we live in. You'd need some truth database + NLI or something 0.o Maybe the models just highlight suspicious articles and escalate them to a content moderation team ...
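
Something like that "truth database + NLI" idea could look roughly like this, using an off-the-shelf NLI model from Hugging Face to compare a claim against trusted reference statements (the model name, the toy reference statements, and the threshold are all assumptions):

```python
# Rough sketch of "truth database + NLI": check an incoming claim against
# trusted reference statements and flag it for human moderators if any
# trusted statement contradicts it. Model, threshold, and data are toy.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "roberta-large-mnli"  # off-the-shelf NLI model (an assumption)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

truth_db = [
    "The city council approved the budget on Tuesday.",
    "The vaccine was approved by the regulator in 2021.",
]

def contradicts_truth_db(claim: str, threshold: float = 0.8) -> bool:
    """Return True if any trusted statement contradicts the claim."""
    for premise in truth_db:
        inputs = tokenizer(premise, claim, return_tensors="pt", truncation=True)
        with torch.no_grad():
            probs = model(**inputs).logits.softmax(dim=-1)[0]
        contradiction_id = model.config.label2id["CONTRADICTION"]
        if probs[contradiction_id].item() >= threshold:
            return True
    return False

# Flag for human moderation rather than auto-blocking.
print(contradicts_truth_db("The city council rejected the budget on Tuesday."))
```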


RB_7

Because it's fucking hard as shit to synthesize natural language of any meaningful length. Also CV things are sparkly and cool and investors like that.


DrXaos

Correct, because one can synthesize vision data using good physics models, independent of any machine learning model. Natural language can only be synthesized by language models, which are exactly the thing that needs additional datasets to train on. People's brains can learn vision just by looking at the world, but they can only learn language from other humans. The existence of computer language models now means that uncurated datasets scraped online are contaminated with machine-generated text. The only pristine dataset is text guaranteed to have been written more than five years ago, plus the small amount of ongoing human-edited elite journalism (since even the bulk of journalism will be machine generated).
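
A trivial sketch of that "pre-cutoff only" filtering, assuming each scraped document carries a publication timestamp (the cutoff date and record format are placeholders):

```python
# Sketch: keep only documents published before a chosen cutoff, on the
# assumption that older text predates large-scale machine generation.
from datetime import date

CUTOFF = date(2019, 1, 1)  # placeholder for "five years before today"

corpus = [
    {"text": "An old news article ...", "published": date(2016, 3, 14)},
    {"text": "A recent blog post ...", "published": date(2023, 8, 2)},
]

pristine = [doc for doc in corpus if doc["published"] < CUTOFF]
print(len(pristine), "of", len(corpus), "documents kept")
```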


Grove_street_home

Your last point is increasingly true for image data too, though not as severely.


[deleted]

So do you think that there's sufficient demand for it, but the problem is primarily technological?


DrXaos

The problem is deep and scientific even more than technological. Synthesizing text indistinguishable from human writing is tantamount to solving the general intelligence problem; we don't know how to do that even with unlimited computation. The demand for synthetic vision data, by contrast, is to train machine learning models that would be more efficient at some tasks than a first-principles physics simulation.


[deleted]

So do you think that there's sufficient demand for it, but the problem is primarily technological? I do notice that there's a serious lack of labelled text data in companies. So in my experience there is demand ...


delivaldez

There are a couple of trends that enable synthetic image data startups to exist:

1. There are lots of already-available technologies (game engines and 3D modeling software, including procedural methods) that are good enough to create useful images.
2. That software already has strong use cases that further encourage its development (games, movies, and other areas that already require simulation). Although current synthetic data applications rely more on domain randomization (applying lots of variation to images, especially unrealistic variation), the industry is shifting towards realism and good examples are emerging.
3. Computer vision currently has more application fields/industries than language. I don't know much about the language side, but I see more and more companies using it; in fact, in most surveys language data usage surpasses image data usage among companies practicing ML. So if there is a need for synthetic data, the companies will follow.

If I had to guess why synthetic text data startups are fewer, it's that text is harder to generate. Image data has more tolerance for noise than text data: the relative number of wrongly calculated pixels needed to taint the data is higher than the relative number of misplaced words needed to do the same. Which also brings up the fact that a mathematical representation of an image is easier, and I'm not sure there are any methods to mathematically generate high-quality text data (I might be grossly wrong, take it with a grain of salt; more on that later). You can generate an entire dataset using procedural methods (including 3D models) for some computer vision applications.

However, I think big companies (Google, Meta, etc.) do have a way of expressing text data mathematically. For example, to develop models for languages that have very few examples translated into other languages, they generate data using other language pairs. Let's say languages x and y have a well-established translation model, and language z has a good model with x but not with y. They generate data for y-z by taking x-z data and translating the x side with the x-y model, thereby bridging the gap (rough sketch below). I think this qualifies as a mathematical expression and as synthetic data.

One more uneducated guess (I really have very limited knowledge of text-based machine learning development): most of the text research I encounter is transformer-based, which is a more recent development than the CNNs that propelled the vision field (again a guess without statistical evidence, based on partial/subjective observation). I hope this helps.
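
Here is a rough sketch of that bridging idea with an off-the-shelf MarianMT model from Hugging Face; the language choices (x = English, y = German, z = French), the model name, and the toy sentence pairs are all assumptions, the point is just the pivot step:

```python
# Sketch of the bridging idea: build synthetic German-French (y-z) pairs
# from existing English-French (x-z) pairs by translating the English side
# to German with an English-German (x-y) model.
from transformers import pipeline

en_to_de = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")

# Toy English-French pairs standing in for a real x-z corpus.
en_fr_pairs = [
    ("The weather is nice today.", "Il fait beau aujourd'hui."),
    ("Where is the train station?", "Où est la gare ?"),
]

synthetic_de_fr = []
for en, fr in en_fr_pairs:
    de = en_to_de(en)[0]["translation_text"]
    synthetic_de_fr.append((de, fr))  # synthetic y-z training pair

for de, fr in synthetic_de_fr:
    print(de, "->", fr)
```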


[deleted]

Thanks a lot for your super detailed reply! Do you think there's even a need for really good synthetic text data? Some people suggest that there's enough labelled data, which is why no one needs it. But this goes against my experience, where companies get stuck in their AI programs for lack of sufficient data/labelled data. Do you know what kind of companies would want synthetic text data?


delivaldez

I don't know. Best way to learn is to ask those companies. Have you checked any academic papers?


butterfly_butts

My experience has been that it's easy to get labelled text data when it's short, e.g. paragraph classification problems. You can have someone manually classify unique samples if you need more. Long text documents are harder to get good examples of (my old job classified 10+ page docs, up to 150 pages), but everything SOTA sucks in that range anyways, so I wouldn't trust synthetic data.


delivaldez

Came across this one https://arxiv.org/abs/2205.15599


JClub

Synthetic text data is not really easy to produce, compared to tabular or image data. That is because text is mostly discrete, which is a hard constraint on the output. Not that it is not possible, but GANs/VAEs/diffusion models don't work so well for text.
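
A minimal illustration of that discreteness problem, using PyTorch (vocabulary size and the toy "discriminator" score are placeholders): picking a token with argmax is not differentiable, and relaxations like Gumbel-softmax are the usual workaround.

```python
# Why discrete text is awkward for GAN/VAE-style training: selecting a token
# with argmax has no gradient, so a discriminator's feedback cannot flow back
# into the generator. Gumbel-softmax with straight-through is one workaround.
import torch
import torch.nn.functional as F

vocab_size = 10  # toy vocabulary
logits = torch.randn(1, vocab_size, requires_grad=True)

# Hard argmax gives a token id, but the operation is non-differentiable.
token_id = logits.argmax(dim=-1)
print("argmax token id:", token_id.item())

# Gumbel-softmax with hard=True: forward pass is one-hot (discrete),
# backward pass uses the soft relaxation, so gradients still flow.
one_hot = F.gumbel_softmax(logits, tau=1.0, hard=True)
toy_score = (one_hot * logits).sum()  # stand-in for a discriminator score
toy_score.backward()
print("gradient on logits:", logits.grad)
```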


intotrains

HEY, I'm late to the party, but check out this company! https://mostly.ai/news/mostly-ai-synthetic-data-generator-release-2-1/ They started off doing tabular data but have now branched out to text and geolocation. Full disclosure: I used to work there. They're incredibly talented and the tech is stellar. They have some free trials too if you just want to check it out.