JellyYelly 14 hours ago [-]
They say it's Mythos-like without actually comparing it to Mythos (fair enough, it's not public), but the bar for a model to be Mythos-like has to be that it can produce as many novel, high-severity security vulns as those outlined in the Mythos red-team blog. I haven't seen any other lab produce a report like that yet. The proof is in the pudding.
cassianoleal 6 hours ago [-]
> The proof is in the pudding.
Funny you say that, when the Mythos team have produced no proof either.
It's "open to all" except it isn't, because as soon as you try to use it for security purposes it shuts down and silently routes you to a worse model. I was trying to use GPT 5.3 for reverse engineering and got an account warning.
immanuwell 5 hours ago [-]
Those miss-rate numbers are genuinely eye-opening. Dropping from 40% to 10% in what sounds like a single generation is no joke, though it's worth taking any vendor-adjacent benchmark with a grain of salt until the broader security community kicks the tires.
WhiteDawn 15 hours ago [-]
First you need to get through the safety net. I've had many productive GPT 5.4 sessions hit a roadblock of "ethicality" and pollute the context with multiple rounds of trying to convince it to continue.
nsingh2 16 hours ago [-]
These plots are terrible. Why is categorical data connected across categories with lines? Why not just use bar plots?
Like in the "Web Vulns in OSS" plot, white box data for Opus 4.7 is not available, but the absurd linear interpolation across categories implies it should be near 60.
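For what it's worth, grouped bars handle a missing category cleanly, since a gap stays a gap instead of being bridged by an interpolated line. A minimal matplotlib sketch (all model names and numbers here are made up for illustration):

```python
# Grouped bar chart for categorical benchmark data; a missing
# category is simply skipped rather than connected across.
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

categories = ["Web Vulns in OSS", "Memory Safety", "Crypto Misuse"]
scores = {
    "Model A (white box)": [62, 48, 55],
    "Model B (white box)": [None, 51, 60],  # first category not measured
}

x = np.arange(len(categories))
width = 0.35
fig, ax = plt.subplots()
for i, (name, vals) in enumerate(scores.items()):
    # Keep only the categories that actually have data.
    xs = [xi + i * width for xi, v in zip(x, vals) if v is not None]
    ys = [v for v in vals if v is not None]
    ax.bar(xs, ys, width, label=name)
ax.set_xticks(x + width / 2)
ax.set_xticklabels(categories)
ax.set_ylabel("Detection rate (%)")
ax.legend()
fig.savefig("vulns.png")
```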
scottyah 16 hours ago [-]
It's just an ad thinly disguised as useful data.
wmf 16 hours ago [-]
I think the x axis is meant to be time but they screwed it up.
strange_quark 16 hours ago [-]
Wasn't it already confirmed that small open-weight models were able to detect most of the same headline vulns as Mythos? How is this any different?
stanfordkid 16 hours ago [-]
No, they are able to detect errors when pointed at them but they have a lot of false positives... making them functionally useless for a large unknown codebase. They also can't build and run an exploit post-identification. Mythos can find vulnerabilities (purportedly) and actually validate them by building and running exploits. This makes it functional and usable for hacking.
adrian_b 3 hours ago [-]
The only significant difference between Mythos and the older open-weights models was that Mythos found all the bugs alone, while with the older models you had to run many of them, because each one found only a subset of the bugs.
For the open-weights models, we know the exact prompts that were used to find the bugs. While the prompts had to be rather specific, a good bug-finding harness should be able to generate such prompts automatically, i.e. by repeatedly running a model and asking it to find various classes of bugs.
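Such a harness could be sketched roughly like this; `ask_model` is a hypothetical stand-in for whatever model API is used, and the bug classes are illustrative:

```python
# Sketch of a harness that sweeps bug classes per source file.
# `ask_model(prompt)` is a hypothetical callable returning a set
# of findings for one model run; it is not any real API.
BUG_CLASSES = [
    "buffer overflows",
    "integer overflows",
    "use-after-free",
    "SQL injection",
    "path traversal",
]

def make_prompts(source_file: str) -> list[str]:
    """Generate one specific prompt per bug class for a file."""
    return [
        f"Examine {source_file} strictly for {bug_class}. "
        f"Report each finding with line numbers and a justification."
        for bug_class in BUG_CLASSES
    ]

def sweep(source_files, ask_model, runs_per_prompt=3):
    """Run every prompt several times and union the findings,
    since each individual run may surface only some of the bugs."""
    findings = set()
    for path in source_files:
        for prompt in make_prompts(path):
            for _ in range(runs_per_prompt):
                findings.update(ask_model(prompt))
    return findings
```

The union over repeated runs matches the observation above that any single run finds only part of the bugs.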
For Mythos, we do not know what prompts were used, but Anthropic has admitted that the process was nothing like asking "find the bugs in this project". They ran Mythos many times on each source file, starting with more generic prompts to identify whether a file was likely to have bugs, then following with more and more specific prompts. Once a certain kind of bug became likely, Mythos was run one last time with a prompt requiring confirmation that the bug exists and, possibly, generation of an exploit or patch.
So Mythos, too, must be pointed at an error. Using it naively will not produce results like those reported.
There is no doubt that both Mythos and GPT 5.5 are superior to older models, because you can use a single model and hope for adequate bug coverage. But the difference between them and older models has been exaggerated. If you run older models on your own hardware, you can afford to run many models many times on each file. A serious bug search with Mythos or GPT 5.5 is likely to be very expensive while providing the same results in most cases.
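The multi-pass process described above (generic triage, then narrower prompts, then a final confirming run) could be sketched as a simple pipeline. Everything here is an assumption for illustration: `ask_model` and its `(likelihood, details)` return shape are not Anthropic's actual interface.

```python
def triage_then_confirm(source_file, ask_model, likelihood_threshold=0.5):
    """Escalate from generic triage to a confirming run.
    `ask_model(prompt)` is a hypothetical wrapper returning a
    (likelihood, details) pair for one model invocation."""
    # Pass 1: generic triage -- is this file likely buggy at all?
    likelihood, _ = ask_model(f"How likely is {source_file} to contain bugs?")
    if likelihood < likelihood_threshold:
        return None

    # Pass 2: narrow down the bug class with more specific prompts.
    best = max(
        (ask_model(f"How likely is {source_file} to contain {cls}?")
         for cls in ("memory-safety bugs", "injection bugs", "logic bugs")),
        key=lambda r: r[0],
    )
    if best[0] < likelihood_threshold:
        return None

    # Pass 3: one last run that must confirm the bug, ideally with
    # a proof-of-concept exploit or a patch.
    return ask_model(
        f"Confirm the suspected bug in {source_file}: {best[1]}. "
        "Produce a proof-of-concept exploit or a patch."
    )
```

The early exits are where the cost savings come from: most files never reach the expensive confirmation pass.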
dlahoda 12 hours ago [-]
I casually asked Gemini and Codex (200 USD subs) to find and verify bugs for weeks.
They wrote tests, injected mutations, and verified fixes.
Just prompts.
I also had to proxy the remote mainnet through localhost to force them to do penetration and DoS testing.
Mythos is nothing new.
nardons 16 hours ago [-]
Do you have a source for this? Not doubting it, but I would like to have something concrete the next time the Mythos horse manure is cited.
I don't have a strong opinion on that.
There was just an article on this phenomenon today: https://news.ycombinator.com/item?id=47890235
https://news.ycombinator.com/item?id=47732020
“Small models also found the vulnerabilities that Mythos found” (aisle.com)
I think it's also self-aggrandizing.