Discussion about this post

huhvgf6554

Looking at task completion % for tasks of set lengths (10 min, 30 min, 2 h) is a great idea and would be a very useful eval for keeping people's expectations grounded. It reminds me of Anthropic demonstrating Claude 3.7 playing Pokemon and getting through 3 gym leaders. Our evals all focus on "immediate" intelligence, like the output, but that's for the chatbot use case. Agents need to be able to operate independently for increasingly long periods of time; that's what will make AI autonomous and closer to the idea of a drop-in worker. Compute and context length do seem like they might be a very big hurdle, especially if staying coherent for hours requires exponentially more compute, e.g. a 5 h task requiring 100x more compute than a 30 min task. Though maybe not many tasks truly require 5 straight hours; instead they can be broken down into multiple discrete 30 min tasks. Still interesting, and it certainly stretches my timelines a little bit: proliferation is still probably around 5-10 years away at the earliest.

Joshua Blake

This essay is great. My challenge is whether it's overfit to the last few months of progress. It seems very anchored on inference-time compute, but that wouldn't have been the focus 6 months ago. Is it reasonable to think that this will remain the important axis for scaling for more than a year or two?
