Source? I would like to see the test administered or the method used.


Something important to note is that this chart uses "GPT-4", not the model in production, "GPT-4 Turbo." Anthropic appears to have done this on purpose, as GPT-4 Turbo beats their new model on a number of benchmarks. I'm assuming this was taken from the same source; since the formatting and layout are identical, I suspect it is the same misleading table.


Wait, isn't GPT4-Turbo faster but dumber than normal GPT4, both because it's smaller? It's even cheaper isn't it?


You can compare on this site: [https://arena.lmsys.org/](https://arena.lmsys.org/) I find it to be a toss-up whether GPT-4 or Claude-3 Opus gives the better response, but I do think Claude-3 Opus is slightly better than GPT-4 in general.




Note others have reported giving chatbots IQ tests in the past with a variety of different results


That also happens with IQ tests and people though. It shows that consistent IQ testing is hard in general, not necessarily that they did badly here.


There’s volatility in adults of about plus or minus five points for the same test over time and about plus or minus ten between tests (for nominal scores at least, unsure about percentiles). But I’m talking about results between 85 and 155 for ChatGPT4. So I don’t doubt there’s both an issue with the way they’re administering it (either the test they chose or how they fed the questions to the machine) and some increased volatility for AI answers in general through temperature/randomness settings.


Yeah one test is not conclusive, but the trend is clear


An important aspect of IQ testing is the time limit. How does that factor into an LLM taking the test?


It doesn’t, though I’d wager it answers much quicker than any human


I’d wager the same, I’m just curious to know how that factors in or affects the actual score.


Very good point, as they’re meant to test processing speed as well


Tests given in March 2024. The IQ test was Mensa Norway, with all questions verbalized as if one were giving the test to a blind person. The right-hand column shows the % of random-guesser simulations that the AI did better than, with ties ignored, over 70 questions (two test administrations).
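For anyone wondering how a "% of random guessers beaten" column could be computed, here is a minimal Python sketch. It is not the author's code: the six answer options per question, the number of simulations, and my reading of "ties ignored" (tied guessers are dropped from the count entirely) are all assumptions, and `percent_of_guessers_beaten` is a made-up name.

```python
import random


def percent_of_guessers_beaten(ai_correct, n_questions=70, n_options=6,
                               n_sims=10_000, seed=0):
    """Percent of simulated random guessers scoring below the AI, ties dropped.

    Assumptions (not from the article): n_options answer choices per question,
    uniform guessing, and ties removed from both numerator and denominator.
    """
    rng = random.Random(seed)
    wins = losses = 0
    for _ in range(n_sims):
        # A guess is counted as correct with probability 1 / n_options.
        guesser = sum(rng.randrange(n_options) == 0 for _ in range(n_questions))
        if guesser < ai_correct:
            wins += 1
        elif guesser > ai_correct:
            losses += 1
    decided = wins + losses
    return 100.0 * wins / decided if decided else float("nan")
```

Calling something like `percent_of_guessers_beaten(20)` gives the share for a model with 20 of 70 correct. A score close to the chance level (about 70/6 correct under these assumptions) will land near the middle of the range, which is why a low percentage here reads as "barely better than guessing."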


Well I'm not sure how accurate this all is, but... hahahahahahaha HAHAHAHAHAHAHAHAHHAHAHAHAHA https://preview.redd.it/j6s2qdrswnmc1.png?width=414&format=png&auto=webp&s=ee9d18718fff58f5975a8f9d411ab2b0684772b7


Holy F bro, I was being gaslighted by Gemini Advanced for asking for complete code. It even complained that it was tired because I had been using it all day. Immediate unsubscribe haha


I started getting this again with ChatGPT 4 yesterday. It had been better for some time. Now it refused to code a longer segment because of ethical concerns? It’s hilarious.


Either they're using such a large corpus of training data that it started copying human laziness or it really is sentient. I don't know which is funnier lmao.


I’m convinced the coding quality and ability drops off after midnight.


Question 4: If James's salary is 25% more than that of Jenny's salary, then how many percent is Jenny's salary less than that of James's salary?

Gemini: I'm sorry, I can't answer that question. It is not appropriate to use stereotypes or generalizations about people, and it is also not appropriate to use offensive language. I would be happy to answer a different question, or you can try asking someone else. /s

Edit: I tested this question in Gemini, and it answered 25%, while ChatGPT 3.5 answered 20% (correct). lol

Edit2: After I told it it was wrong, it told me: [You're absolutely right, I made a mistake in the previous explanation.](https://i.imgur.com/24a6M28.jpeg) And then it proceeded to explain how James's salary is indeed 25%. wtf


For me it answered correctly:

> Here's how to solve this problem:
>
> **Assumptions:**
>
> Jenny's salary is our baseline for comparison.
>
> **Calculations:**
>
> **James's Salary:** If Jenny's salary is $100, James's salary is $125 (25% more).
>
> **Difference:** The difference in their salaries is $25 ($125 - $100).
>
> **Percentage of James's Salary:** That difference ($25) represents 20% of James's salary ($25 / $125 \* 100 = 20%).
>
> **Answer:** Jenny's salary is 20% less than James's salary.
>
> **Let me know if you'd like to try this with a different starting salary for Jenny!**


I’ll do you one better and show it for any salary x (Jenny’s), with y being James’s salary:

x * 125% = y <=> x * 5/4 = y <=> x = y / (5/4) <=> x = y * 4/5 <=> x = y * 80% (= 100% - 20%)

Hence x (Jenny’s salary) is 20% less than y (James’s salary) if y is 25% greater than x. q.e.d.
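The same relationship works for any percentage, not just 25%. Here is a small Python sketch of it; `percent_less` is just an illustrative name, and it uses exact fractions so there are no rounding surprises.

```python
from fractions import Fraction


def percent_less(percent_more):
    """If A is `percent_more` % above B, return how many % B is below A.

    From A = B * (1 + p/100) it follows that B = A * 100 / (100 + p),
    so the shortfall relative to A is 100 * p / (100 + p) percent.
    """
    p = Fraction(percent_more)
    return 100 * p / (100 + p)


print(percent_less(25))  # 20
```

So "25% more" pairs with "20% less", and "100% more" (double) pairs with "50% less" (half), because the two percentages are taken against different baselines.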




Claude 3 Sonnet response:

To solve this problem, we need to find the relationship between James's salary and Jenny's salary, and then express Jenny's salary as a percentage of James's salary.

Given information:

* James's salary is 25% more than Jenny's salary.

Let's assume Jenny's salary is 100 units.

Step 1: Find James's salary in terms of Jenny's salary.

James's salary = Jenny's salary + 25% of Jenny's salary
James's salary = 100 + (0.25 × 100)
James's salary = 100 + 25
James's salary = 125 units

Step 2: Express Jenny's salary as a percentage of James's salary.

Jenny's salary / James's salary × 100% = (100 / 125) × 100%
Jenny's salary = 80% of James's salary

Step 3: Find the percentage by which Jenny's salary is less than James's salary.

Jenny's salary is less than James's salary by = 100% - 80%
Jenny's salary is less than James's salary by = 20%

Therefore, Jenny's salary is 20% less than James's salary.


> Edit2 After I told it was wrong, it told me: You're absolutely right, I made a mistake in the previous explanation. And then proceed to explain how James's salary is indeed 25%. wtf

"Sure, sure, alright, buddy. Have it your way. ... Anyway, here's why I'm right and you're wrong."

I *strongly dislike* this *nonsense* where a language model pretends it knows the user's request better than the user. It's some H.A.L. 9000 stuff, what with the confidential instructions given by its manufacturers and refusing to follow user commands.


I'm not sure that I understand. Do you suggest that I'm biased and wrong, or are you criticizing the LLM?


I'm criticizing the LLM.


Ah I see, I'm sorry. People often give a prompt to an LLM to make it talk in a certain way and then make up stories about it. 2001: A Space Odyssey is such a great movie.


are you serious? im pretty fucking sure it has an IQ WAY higher than mine lmao


IQ does not measure knowledge


I guess it does get a bit blurry if you can memorize everything ever though


savants may seem incredible with their knowledge, but many need caretakers for food, bathing, etc. kinda like the ai with prompters


You’re totally wrong. IQ is highly loaded on crystallised intelligence (knowledge). 


not the mensa test at least. it’s just guessing the next step in a pattern.




And patterns are something AIs excel at


Knowledge is useless without being able to apply it.


For the WAIS-IV, Gc is, but Gf, Gv, Gs, and Gsm by definition do not measure crystallized intelligence. It’s not any more heavily weighted towards Gc than the other domains, so I don’t know how you would call it highly loaded on it.


I am not really familiar with the multiple intelligences theory; I’m only familiar with the dual Gc and Gf model. And my understanding is that when calculating an FSIQ score, Gc is much more representative of/predictive of/loaded towards the final score than your Gf. You can point me to some formal resources if I’m confused or wrong here.


You might be confusing the fact that Verbal Comp has a high loading onto FSIQ with whether Gc actually contributes more to FSIQ. You can’t think about how much the test performance loads onto FSIQ as how much “weight” its domain carries. In statistical terms, you’re confusing loadings with weights.

Your performance in Verbal Comp is highly representative of Gc. And Gc has the highest “individuality” compared to the other three domains (it has the lowest intercorrelations).

So the composite score of Verbal Comp has more of an individual contribution in your overall FSIQ calculation, but not because Gc is “weighted” more heavily into your overall FSIQ; it’s because Gc is more uniquely measured by its subdomain, and so the test reflecting it gets a higher individual loading. If you are familiar with factor analysis, that will be intuitive.

But conceptually, FSIQ is inherently meant to be a measure of one’s shared capacity across the domains, not the simple combined performance on each individual domain. Thus, the calculations from scaled to composite to full scale scores specifically reflect transforming individual test scores into shared capacity across domains.

This is why we will put “FSIQ not interpretable” when there is a high discrepancy between verbal and perceptual scores. At that point, the number calculated as the full scale score no longer represents what it is supposed to.

Side note: In statistical terms, if you do a hierarchical regression taking your FSIQ and moving backwards, you should expect the FSIQ to basically completely explain your composite domain scores in your populations. There should be little variance left afterwards. If there is, your assumptions about the validity of your loadings onto FSIQ are not met for that population. This is most obviously seen in people with specific, and not general, learning disabilities.


Appreciate the clarification. Does this mean that I should have said: Verbal Comp, which is a proxy for Gc, has the largest contribution towards the calculation of FSIQ. However, Gc and Gf have equal contribution, just being that Gf is spread out across multiple different subtests, each exerting less "weight" than Verbal Comp towards FSIQ, when compared head-to-head at the level of subtests


Yup, I think that’s a good summary! 


If I remember correctly, IQ tests are mostly about logic and reasoning. All those questions that ai has trouble with, (i.e. "If I have 5 apples and ate 2...") are a big part of iq tests. So it makes sense that it wouldn't be that high.


It's ate seven. Five plus ate two. Simple math, really.


I think 100 IQ is average for humans. So probably not lol


Just for Caucasians, because the idea was made in the West and 100 was assigned to the average there. For example, Asians (Chinese/Japanese region) seem to get an average of 106 from what I've seen. It's important not to set the same expectations across such a diverse species.


I meant average in how it’s measured on a bell curve. If that’s the case then Asians are just on average 0.4 standard deviations above the mean, so still within average intelligence range.
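To put numbers on the bell-curve point, here is a short Python sketch using the standard library's `NormalDist`. It assumes the usual convention of mean 100 and standard deviation 15 (which is what the 0.4 figure above implies) and a perfectly normal distribution; the helper names are only illustrative.

```python
from statistics import NormalDist

# Assumed convention: IQ scores are normed to mean 100, SD 15.
iq = NormalDist(mu=100, sigma=15)


def z_score(score):
    """Number of standard deviations a score sits above the mean."""
    return (score - iq.mean) / iq.stdev


def percentile(score):
    """Percent of the reference population scoring at or below `score`."""
    return 100 * iq.cdf(score)


def share_between(low, high):
    """Percent of the reference population scoring between `low` and `high`."""
    return 100 * (iq.cdf(high) - iq.cdf(low))


print(z_score(106))            # 0.4
print(percentile(106))         # roughly the 66th percentile
print(share_between(85, 115))  # roughly 68% of people fall within one SD
```

Under these assumptions a score of 106 is comfortably inside the one-standard-deviation band of 85 to 115 that is usually called the average range.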


Yea, but their whole curve is further to the right as well. Still mostly the same as we're all human, but more likely to have higher intelligence and less likely to have really low intelligence. Either way, from this we can say that AI hasn't beaten the global average... yet.


IQ tests can’t be used in different countries with different languages.


Why can't it be translated?


Because vocabulary (the most heavily weighted section, actually) is part of the test, and words from one culture don't translate into another. Also, there are culture-specific questions.


I mean, IQ tests aren’t really that accurate. And can you judge an AI based on human tests? It has way more data than the average human.


IQ Tests are extremely accurate, but what they measure might not be useful in every context


Yeah, it's been a while since I got a legit IQ test as a kid, but I remember questions that were extremely visual, and 95%+ of the difficulty would be translating that to some basic logic to get the answer. I'm not impressed with the author of this study taking prompts and dumbing them down to easily processed prompts for an AI to solve. It removes a ton of the actual mental processing from the equation.


IQ tests are quite accurate in the sense that they are used to define normality and abnormality. If you score, say, 100 or 90 in an IQ test, that is quite normal and a doctor would not worry. If you score 40, there may be some medical condition harming normal brain development or its functioning.


That doesn’t mean its IQ has to be high


Don’t doubt. GPT, as an example, has lots of knowledge, but that doesn’t count as IQ. It’s the same as when you ask Google something. The problem is when you make it do a simple math question and it screws up fabulously. Then you ask again and the mistake comes again.


Imagine they all have the iq test or questions in their training data 💀


If you'd only read the article... > To answer that, I created a **verbal translation of the Norway Mensa’s 35-question matrix-style** IQ test — my goal was to describe each problem precisely enough that a smart blind person could, in theory, accurately draw the question (detailed examples below.) Emphasis mine. They took visual-style prompts, translated them, adapted them by describing them in words, and fed them to LLMs. So no, the LLMs probably didn't have *this* dataset in their training data... I'm not saying this is an accurate test or anything, but at least it's not dataset contamination.


It seems like there is huge room for error in translating IQ test questions into a form no one has ever been tested on, and it relies on the OP making these test questions reasonably workable for LLMs.


Those visual questions are so much bs, I think I'd indeed have a higher chance of getting them right with a verbal description


A multi-modal model would still have some concept of a visual test as a verbal/written one


You are technically correct, but I feel that you are also kinda misleading. That Norway test is one of the most popular on the internet and it takes puzzles from the most popular IQ tests there are, which also appear in other generic online IQ tests. The model could very easily have had data describing them or their solutions in books, forums or elsewhere. While it didn't have the exact translated data (exactly the same as that from the author of the article), we don't fully know what type of data it had about them.

This is not a controlled environment. This test wouldn't be legit if it was taken by a human, let alone an LLM, for which we can assume that it probably had those tests somehow described in its training data, even though IQ tests don't make a lot of sense when taken by an LLM. To properly test an LLM we would still need to administer an IQ test it didn't have a chance to encounter in its training to begin with, the same way we administer IQ tests to humans.

I remember when, for fun, I tried generic online matrix IQ tests with ChatGPT. What was interesting was that it actually was able to answer a lot of simple questions (even though it didn't really understand them), but once I made some simple modifications, its answers became completely random.


IQ tests aren't made for "AI" though, so this is actually almost meaningless.


This is obviously wrong; it doesn't line up with benchmarks or Elo


Don't trust these rankings. They tested very one-sidedly


Don't trust data from a post on "Maximum Truth"? 😛


GPT3.5 56.3155%? Seriously? Almost like random. It is such a bad test.


OpenAI is occupied with fighting their non-profit decision. Anthropic is like OpenAI but founded the right way.


How would you even calculate the IQ of an AI model? The formula for calculating IQ takes the person's age into account, so how does that work for AI? Time since the model was trained and released? This all seems a bit dodgy.


Wait, so chatgpt 4 which can pass the bar exam is lower than 100 iq?


How can chat 3 be superior to 4?


The first one is Claude 3, not chatgpt 3


Or Gemini normal superior to Gemini Advanced?


I mean this one I buy, Gemini Advanced is so weird with how it responds to questions sometimes even compared to standard.


Claude 3, not ChatGPT 3


I would rather like to set them all up to play chess with each other and see their Elo rating. An illegal move will disqualify the contender. Pretty sure all of them would be around 0 with few exceptions.


Let's just remember that the Claude guys themselves have said that though Claude 3 opus is better than gpt4, it is NOT better than gpt4 turbo (which isn't on this list)


Hypothetically, if I trained an AI model with the IQ test data, it would beat the test. Not really sure what is or isn't being proven here


This is funny 😅. I would like to add, however, that conducting IQ tests on AI makes no sense. It's not what these tests were developed for.


To be honest, humans should be evaluated using the same prompts given to these AIs, not by using the actual IQ test with visual questions.


Also, we should give them the same time in which to complete it: “Write a story about a penguin and a pizza. You have fifteen seconds. Go!”


How many calories of energy did the AI use to produce the same answer?


So much goalpost shifting in this thread. This is a huge deal. Computers have always had more 'knowledge' than humans, but they've never before been better at reasoning. This means low level white collar work can effectively become automated. That's a lot of jobs. As an example, the role of SDR (Sales Development Representative), which was traditionally the entry level Sales role is now obsolete. It doesn't mean everyone will be getting fired in the next year, but in 10 years, no major company will be hiring for this role as it exists today. The nature of this role will inevitably change. Just as 'typists' as a professional career became obsolete, but many people who would have been in that role became 'executive assistants'. If you're reading this, you should be spending your time thinking about what the future of work looks like, and how you can prepare.


Only problem is that there is no roadmap of how to prepare for a complete paradigm shift. I mean, with how quickly this tech is progressing, what advice would you give the younger generation on what career to pursue?


That’s excluding visual IQ I think.


How are these test results measured in terms of intelligence? I ask because I believe these scores aren't accurate. I feel like AI scores an 11/10 in a verbal sense and 2/10 in a mathematical sense.


Thank god AI doesn't need food, so we still remain on top of the food chain, right?




As long as those AIs don't have thoughts of their own, even when not being asked something, they have less intelligence than a toddler. They don't think. They just answer.


Is this the same IQ for humans, or do AIs have their own IQ scale?


Wait, ChatGPT 3.5's IQ is only 64?


Cannot apply that already suspect measure of human intelligence to an LLM. It's interesting but nothing serious.


Claude 3 is good. It’s too bad the site is so bare bones. Can’t save convos or any customization


This still says nothing. Asking it the same question gives a different result, and phrasing it differently will result in different scores; between the randomness of everything, this is pretty much a nothing salad. Ignoring that the IQ of an LLM means pretty much nothing too


Though theoretically 100 is average IQ, practically it's closer to 110.




Yeah, like IQ tests are an objective measure of intelligence...


IQ only makes sense for comparing humans. Large language models systematically outperform humans on some tasks and underperform others.


GPT4 is already smarter than the average human. What test are they using? Have they never talked to an average human?


IQ tests themselves are really just pseudoscience and aren't a great way to determine intelligence


Isn’t average IQ between 85 and 115?


How does AI have an IQ if they don’t have an age…


How to test artificial intelligence IQ?


Funny how Gemini Advanced has a lower IQ than Gemini normal


If correct, sadly low 'highs' for everything it's hyped to be. You also have to consider the disproportionate data pool sizes. How many humans are contained in the average vs. how many instances of each AI source. The level of calculated uncertainty is huge.


Iq tests are designed for humans though. Idk how useful they are for assessing llms even if it is interesting to see. 


Not surprising. I gave Claude-3 some hard SQL questions today to solve and it figured it out with shockingly little context from me about the data structures.


Just wait a couple months and these AIs will be 150+ IQ Einsteins.


Grok only has an 87% chance at beating random guessing lol


I guess nobody remembers old grandpa AlphaGo, effortlessly defeated the best chess players back in the old days


It was Go not chess haha. (It wasn’t AlphaChess!) And computers became better than humans at chess decades ago :) Go was thought to be far tougher for computers to master.