The Google destroyer, the Perplexity crusher? Or just hype? ChatGPT with Search is here, and simultaneously Altman and co did …
Love the EPL test.
It’s also available for all searchgpt waitlist users like me. Even on the free plan.
SearchGPT isn't just for paid users… everyone who was on the waitlist also got it (aka free users too)
Are you concerned that putting simple bench online can allow "training for the test"?
Hi, I enjoy your YT vids and decided to check out Simple Bench. I happen to have a concern with the first question that came up, and I think you should give it another look and maybe revise it.
The question: "Beth places four whole ice cubes in a frying pan at the start of the first minute, then five at the start of the second minute and some more at the start of the third minute, but none in the fourth minute. If the average number of ice cubes per minute placed in the pan while it was frying a crispy egg was five, how many whole ice cubes can be found in the pan at the end of the third minute?"
Ok, the reason I think it's problematic is that it is a complete assumption that the "frying pan" is sufficiently heated to melt the ice cubes in less than a minute. Pans are often referred to as "frying pans" as a descriptor of the type of pan, not because they are currently frying something at high heat. "While it was frying a crispy egg" could be at any time, and doesn't necessarily refer to the first three minutes. Because there is no direct piece of evidence that the frying pan is hot and could quickly melt an ice cube, the reason an intelligence would select the correct answer "0" is that it's inferring that the question is quite silly and nonsensical, that there isn't sufficient evidence to give a correct answer, and that it's likely a trick question. If that is your intention with the question, then my apologies. However, I get the feeling you want the question to be more accessible and to test whether the intelligence can figure out that the pan is indeed hot enough to melt the ice through real-world context clues. To do that, it needs to be clear, not assumed or inferred, that the pan is in use and at high heat from the start.
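For reference, the arithmetic part of the question can be checked with a minimal sketch. It assumes the average of five is taken over all four minutes mentioned (the question does not state this explicitly), and the function name is made up for illustration. It only gives the number of cubes placed in the third minute; whether any are still whole at the end of that minute is the physical reasoning step debated above.

```python
def cubes_placed_in_third_minute(first, second, fourth, average, minutes=4):
    """Solve first + second + x + fourth = average * minutes for x.

    Assumption: the average is taken over `minutes` minutes (four here).
    """
    total_placed = average * minutes
    return total_placed - (first + second + fourth)


# Values from the quoted question: 4 cubes, then 5, then "some more", then none.
placed_third = cubes_placed_in_third_minute(first=4, second=5, fourth=0, average=5)
print(placed_third)
```

The naive answer is that count plus the earlier cubes; the answer the comment calls correct ("0") depends entirely on the pan being hot.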
fine-tuning YT algorithm.
Nobody has heard of the Premier League. Not even GPT
Test Simple Bench on mobile: you may want to fix the leaderboard table (horizontal scroll) and fix the title, which reads "SimpleBenc" with the "h" wrapped onto the next line.
you have such a talent to summarize this frontier of info and share it here. Thank you
The problem I see with the various AI benchmarks (or human benchmarks used with an AI): The training data of the foundation models will of course also include descriptions/questions for all kinds of AI benchmarks as well as all the discussions about them. So if an AI can use parts of its knowledge when being benchmarked, this will already change the result. We'd need offline benchmarks on systems not connected to the internet, vetted by professionals who are under NDA (= won't write about the details) to really know what they are capable of, right?
Nice
"AMA", then proceeds to answer nothing. Well, that's on us: it's "ask me anything", not "answer anything" (at all).
Hey man, nice video. Just some appreciation from your subscriber for how you reaffirm or debunk the hype or breakthroughs made in this AI age. Just a small request: in future videos it would be great if you could include how your Simple Bench worked or the models launched in the YT videos, and also some tutorials on how your website works.
Anyways great work man👍
Woo, new simple bench
Daaamn those simple bench questions are really tugging at the edge of what a human can reliably figure out without properly sitting down and studying them. I am embarrassed to say I scored only 80% on the try-it-yourself – I was fully out-foxed by the man seeing the light fall in the bathroom question, and the glove falling out of the car question. The glove I might have missed by doing it too quickly, but the light in the bathroom one I sat and thought on for several minutes and only figured it out once I knew my first guess was wrong 😩
I fed both questions I got wrong and their multi-choice answers into o1 preview, prefaced by only "think carefully, the following might be a trick question". It got both of them absolutely right in one go, and perfectly figured out the trick in each. I am mortified.
I feel like these questions really rely on visualisation and a robust world-model. I might have gotten them all right if I'd really tried to build a visual mental image of the situations a bit more carefully, or even drawn them out. If these LLMs were trained more extensively on spatial understanding using physical 3D model simulations, I wouldn't be surprised if they were able to smash through most of these.
lol! Loved Perplexity result about Simple Bench!
I love AI Explained. Funny, short, concise, and extremely informative.
I've wasted two years of my life following this AI con. What's changed since then? Chatbots that are stumbling over the same questions. AGI still "5-10 years away." Oh, but now it has a new search frontend! All these moving goal posts. I'm done.
Cool
I use Brave search by default and it’s often insufficient but since they added the AI summary last year, it has proven extremely helpful in most of my daily queries (mostly development and general tech queries)
Love this video, can’t wait to see more
Waiting for the new video.
❤
I think the SimpleBench evaluation is unfair because the human evaluators could remember the previous questions while the LLMs don't know the other questions.
Please eval again with history for the LLMs
I disagree on reliability being a "must". The average human is a dumb mess and we call it the norm
The hallucination thing is worrying. It's the thing that I'd most want to see properly addressed. I think it makes so much difference, not just for economic growth, but for everyday use as well. It is tiring to have to verify everything and deal with annoying mistakes which crop up out of nowhere whenever you're doing something. But it is what I suspected, to be honest, AI companies do not have a good answer for it. And until they do, I don't think we can ever claim they're more intelligent than humans, because reliability is implied in that claim, in my view
Are you the creator of SimpleBench? Congratulations, it's a great benchmark. You could run an interesting little experiment with it. Change some details of the questions (without making any difference to the reasoning), like the order the names appear in those questions, the names of people or the shapes and colors of objects, to see if the models' performance decreases.
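A rough sketch of what that perturbation experiment could look like, covering name swaps only. Everything here is hypothetical: the function name, the sample sentence (shortened from the ice cube question quoted earlier in this thread) and the replacement name are illustrative, not part of SimpleBench.

```python
import re


def swap_names(question, mapping):
    """Return a variant of `question` with whole-word names replaced.

    `mapping` maps original names to replacements; the wording that
    carries the reasoning is left untouched.
    """
    if not mapping:
        return question
    # Match any of the original names as whole words only.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in mapping) + r")\b"
    )
    return pattern.sub(lambda match: mapping[match.group(1)], question)


original = "Beth places four whole ice cubes in a frying pan."
variant = swap_names(original, {"Beth": "Maria"})
print(variant)
```

Scoring a model on both the original and the variant set, then comparing accuracy, would show whether performance depends on surface details.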
Yes, reliability, but does it have to be zero hallucinations? We have sufficient reliability in the modern world despite human hallucinations. I sense goalpost-moving on this platform. Give us an operational definition of reliability!