The Differential
There seems to be a rising belief that by scaling reinforcement learning (RL), we will be able to solve many tasks whose outcomes feed back into the model weights. This paradigm, coupled with features like updating the context with lessons learnt on the fly, might help complete the picture. However, there is also a common fear that most model labs seem to be converging on similar model features.
At the recently concluded IMO, where two models achieved gold-medal performance, I read a comment that everyone's Lee Sedol moment will come soon. This got me thinking: what will bring that moment to every common person? If multiple models satisfy the correctness criteria, how will we decide which model got us there? I felt the answer lay in a trend I noticed, where people compared the gold-medal models in terms of the quality of their solutions.
Introspecting, I drew an analogy. If I feel hungry and the task is to satisfy my hunger, I can eat anything and succeed at the task. However, that is not how I approach it when I do have the time: I optimize for parameters like the flavours and the nutritive value of the food, in addition to the effort that went into making it. These are the non-functional requirements that give me the dopamine hit while eating the cooked food that day, beyond merely satisfying my hunger. I wonder which paradigm of learning will fulfil these non-functional requirements and give us personalization in the era of experience.
Right now, it might seem that we are bottlenecked by good task definition or reward formulation, which delays model success. But when RL scales to internet-level data, perhaps through some form of next-token prediction that learns by doing through interaction with the internet, or when RL lets us hill-climb verifiable benchmarks (can’t every task be converted into some form of verification if you think about it?) and correctness of solutions becomes trivial, the axis of scaling will shift from the 'what' to the 'how'. We have seen this play out with LLMs for coding: earlier, the correctness of the solution was questioned, while now everyone is a high-taste tester commenting on their personal experience of code quality across compound systems like Claude Code or Cursor. When correctness slowly becomes a basic assumption, what will differentiate the models? The quality of their solutions. We do know that naively running RL on user feedback for chats is not the solution, as that led to a more sycophantic model for OpenAI. Over a long enough horizon, as all models converge to similar models, i.e. the global minima of the model landscape, it will be the hyperpersonalized persona a model develops in its interactions that differentiates it.
Let’s take a step back and think about what makes each human unique: our learning priors, which can be equated to our childhood backgrounds and the data we see as kids. Our pretraining stage is nearly the same in terms of knowledge of the world, as we all learn the global basics from a blank slate. Models, trained on raw or rewritten internet data or on synthetic data from a teacher model, will not differ much there either. Once our priors are fixed in childhood, we all take diverse trajectories in life to solve similar global tasks. Someone takes up history as a major, while someone else takes up law, each choosing their own path toward the long-term reward of financial stability or career growth. Someone develops running as a hobby while someone else loves writing. All these experiences shape us as humans and help update our priors on how we view the world as life goes on. If we are only focused on achieving correctness on our tasks, will we ever be shaped by our experiences into being one of a kind in the human race? In the model experience phase, how is that going to be achieved? Will models get different characters solely based on the amount of task exploration they do in their experience phase, or will it also depend on the quality of interaction they have with a specific user and the intuition they develop about them?
There is a famous saying that the wrinkles on a person's face sum up their life. When each human attributes their uniqueness in this world to the different trajectories they have taken, I wonder why we would not see the same play out with models. So who is going to bake Occam’s Razor into their models and let model character develop from its experiences, in addition to training the models to succeed at tasks? How will that world look?
Thanks to Moksh Jain and Tanay Anand for reading drafts of this.