
Aible Intern Model

Teaching GenAI Agents via Feedback on its Reasoning

Executive Summary: Language models will always struggle in enterprise use cases, because the model has never seen each enterprise's private data and does not understand each business’s unique terminology and ways of doing things. Post-training, which incorporates business user feedback and enterprise-specific information to specialize a model, can significantly improve a model's performance on enterprise-specific use cases with minimal investment.

At GTC 2025, Aible demonstrated that a smaller, 8-billion-parameter model, with just $5 of post-training, could outperform much larger models on specific tasks. The specialized model did perform worse on other tasks, so there was no free lunch. But many specialized variants of a model can be run equally efficiently on NVIDIA GPUs simultaneously, resulting in an insignificant cost-performance impact.

Recently there has been a lot of debate regarding whether model post-training (such as fine-tuning and reinforcement learning) is more important than model pre-training, especially when it comes to enterprise use cases. The DeepSeek R1 model and the Stanford s1 model showed the significant benefits of post-training techniques, especially in adding reasoning capabilities to a base model via fine-tuning and reinforcement learning. The Stanford project showed that just 1000 examples of reasoning and $50 of compute were sufficient to improve a 32-billion-parameter base model to outperform the newer 500+ billion parameter OpenAI o1-preview model.

Aible revealed its initial test results at GTC 2025, where a 3-billion-parameter model, with just 1000 improvement examples, performed as well as a 100X larger OpenAI o3 model.

This indicates that adding reasoning to a model can significantly improve its accuracy. It also opens up the tantalizing possibility that end-users can provide feedback on reasoning steps of the model instead of just the output. Traditionally, we have trained AI like we train a pet (relying on simple rewards or punishments based on its output) but now we can start training it like an intern (providing it feedback on the way it reasoned and offering it guidance on the tools it can use to get better results, such as what enterprise API it can invoke to access necessary information).
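To make the intern analogy concrete, here is a minimal sketch of what a single piece of reasoning-step feedback might look like as data. The field names and example content are illustrative assumptions, not Aible's actual schema.

```python
# Illustrative only: one user-feedback record on a single reasoning step.
# Field names are hypothetical, not Aible's actual schema.
from dataclasses import dataclass


@dataclass
class ReasoningStepFeedback:
    question: str                  # what the user asked
    step_index: int                # which reasoning step the feedback targets
    model_step: str                # the reasoning step the model produced
    user_correction: str           # how the user says the step should read
    tool_hint: str | None = None   # optional guidance, e.g. which enterprise API to call


example = ReasoningStepFeedback(
    question="Which driver had the most podium finishes in 2017?",
    step_index=2,
    model_step="Filter rows where rank <= 3 to find podium finishes.",
    user_correction="Podiums come from the finishing position column, not rank; "
                    "positions are stored as text, so cast them to integers first.",
    tool_hint="Use the race results data rather than the qualifying data.",
)
```

Records like this can then be turned into post-training examples, so the model learns the corrected reasoning rather than just receiving a thumbs-up or thumbs-down on its final answer.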

In our testing, users significantly preferred providing feedback on specific reasoning steps (and seeing the model output change as a result) over providing feedback on the end result and hoping that somehow that feedback would translate into the model understanding exactly what it got wrong. Moreover, several users felt that without the reasoning steps they might not even have noticed that the AI got a crucial step wrong, because the final answer still looked correct even though it was actually wrong. Adding reasoning steps and enabling user feedback on each such step both improved users' trust in the AI's output and gave them better control over the AI assisting them. This was especially true in agentic use cases, which were highlighted at GTC 2025.

Note that the DeepSeek model has been criticized for potentially using the output of OpenAI models in a manner that violates the OpenAI license. The Stanford project was an academic one and was thus able to leverage data from a Google model, but using Stanford's s1 for commercial purposes would violate that license. We restricted ourselves to using open source base models, human-generated open source data, and user feedback from a wide variety of organizations that participated in this project - including multiple Fortune 500s, the CIO of the State of Nebraska, the Principal Scientist for Strategic Planning and Analysis from the Air Force Research Lab Aerospace Systems Directorate, etc. We also restricted ourselves to only US-based models and datasets to avoid any concerns about Chinese content guardrails.
We tested the following:
Test 1: Post-trained models vs. larger pre-trained models

Conclusion: Post-trained models are significantly better for specific use cases.

With post-training on just 1000 examples of user feedback and about $11 of compute, a base model that scored only 16% on a custom benchmark shot up to 84% accuracy and started exhibiting Chain of Thought reasoning, which in turn enabled users to provide more targeted feedback.

Test 2: No free lunch hypothesis

Conclusion: Specializing a model for a specific task can degrade its performance on unrelated benchmarks.

While a post-trained model improved from 16% to 84% accuracy on a custom benchmark, its scores dropped from 70% to 16% on the MMLU-Pro benchmark and from 86% to 49% on the MATH benchmark.

Test 3: Cost-effective serving of multiple specialized models

Conclusion: Serving multiple use-case-specific variants of a model is cost-effective.

There was no significant degradation in the cost-performance curves as more and more post-trained variants of the same model were run in parallel on the same GPU. However, the optimal concurrency setting might change somewhat as more variants are added.
Here are the detailed results of each test:
Test 1

Are post-trained models significantly better than much larger pre-trained models for specific use cases? How does the model perform after adding more reasoning examples during the post-training process? 

Key Considerations: Enterprise customers often complain that models that score well on standard benchmarks perform badly when applied to their enterprise data. This should not be a surprise. Language models are trained on public datasets, so they never get to learn an enterprise's unique terminology. To put these models through a realistic challenge, we worked with our customers to choose a public dataset we could test against, and we chose the historical Formula 1 (F1) dataset at www.kaggle.com/datasets/cjgdev/formula-1-race-data-19502017. This dataset offers some unique challenges because the AI needs to understand F1 terminology such as “grid” position, the meanings of various finishing statuses, whether capitalization matters, etc. However, it became clear that some of the large language models had already trained on this public dataset. We addressed this by changing the data structure: pre-joining the tables into a single file, withholding supporting metadata for ambiguous and duplicative column names, and slightly changing some of the variable names. This experiment was designed to replicate the natural variation and ambiguity that we see in real-world enterprise data. We came up with 50 representative natural language questions and the corresponding SQL code for our custom F1-SQL benchmark.
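For illustration, a benchmark entry in the spirit of the F1-SQL benchmark might look like the following. This is a hypothetical example, not one of the actual 50 questions, and the table and column names are assumptions about the pre-joined file.

```python
# Hypothetical F1-SQL benchmark entry (not one of the actual 50 questions).
# The table and column names are assumptions about the pre-joined file.
benchmark_entry = {
    "question": "How many races did each constructor win in the 2016 season?",
    "gold_sql": (
        "SELECT constructor_name, COUNT(*) AS wins "
        "FROM f1_results "
        "WHERE season = 2016 AND finishing_position = 1 "
        "GROUP BY constructor_name"
    ),
}
```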

Results: The standard models that perform very well on public benchmarks performed terribly on the custom F1-SQL benchmark. We were concerned about these results, so we manually reviewed the output and found examples where the model gave the correct output but added some extraneous information and was therefore scored as incorrect. This is a common problem when evaluating language models: data science metrics that score overlap against golden answers can differ significantly from what a business user would consider correct. You can read more about this in Bridging the Gap: Why AI Metrics Often Miss the Mark in Business.

As such, we separately added a manual evaluation score where we gave each AI credit for responses that a business user would consider accurate even if the generated SQL was not identical to the benchmark answer. The standard LLMs still did terribly. However, the version of Llama-3.3-70B-Instruct post-trained with just 1000 examples for $11.30 delivered 84% accuracy in the automated evaluation and 92% in the manual evaluation. The much smaller 8B-parameter variant, which performed terribly before post-training (just 6% in the automated evaluation and 12% in the manual evaluation), performed quite well after just $4.58 of post-training, hitting 64% accuracy in the automated evaluation and 82% in the manual evaluation. Depending on the use case, the 82% accuracy of the post-trained 8B model could make it very attractive on a cost-performance basis.
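A hedged, automatable approximation of this kind of credit is to compare what the generated and golden queries return rather than how they are written. Below is a minimal sketch, assuming the pre-joined benchmark data has been loaded into a SQLite database; this is not Aible's evaluation code.

```python
import sqlite3
from collections import Counter


def same_result(conn: sqlite3.Connection, generated_sql: str, gold_sql: str) -> bool:
    """Score a generated query as correct if it returns the same rows as the
    gold query, ignoring row order. This tolerates cosmetic differences such
    as aliases and extra whitespace that string-overlap metrics penalize."""
    try:
        generated_rows = conn.execute(generated_sql).fetchall()
    except sqlite3.Error:
        return False  # the generated query does not even run
    gold_rows = conn.execute(gold_sql).fetchall()
    return Counter(map(tuple, generated_rows)) == Counter(map(tuple, gold_rows))
```

Execution-based comparison is still only an approximation of what a business user would accept, which is why the manual review remains important.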

[Table: F1-SQL benchmark accuracy, automated and manual evaluation, for the standard models and the post-trained Llama-3.3-70B and Llama-3.1-8B variants]

Test 2
As the model improves its accuracy on the specific use case we are specializing it for, how does its accuracy degrade on other, unrelated benchmarks?

Key Considerations: The underlying model size is not being changed in our tests. So it is reasonable to expect that as the model specializes to do well on a specific task, it becomes less generalized and thus performs worse on other tasks. A corollary is that, due to the current obsession with benchmarks, labs may be incentivized to train models that do well on benchmarks as opposed to performing useful business tasks well. Essentially, if you only obsess over a student's SAT scores, you may produce a fantastic test taker with a very surface-level and potentially flawed understanding of the underlying subjects. You can see this in DeepSeek's benchmark results, where it performs badly (e.g., 17% accuracy on the NewsGuard test) when the use case is not related to math, logic, or similar topics.

Results: For both the Llama-3.1-8B model and the Llama-3.3-70B model, post-training significantly improved the quality of the models on the custom F1-SQL benchmark but significantly degraded their performance on standard benchmarks.

[Table: standard benchmark scores for Llama-3.1-8B and Llama-3.3-70B before and after post-training]

This level of model degradation from post-training makes it clear that enterprises will need many different variants of a set of base models - specialized for individual business use cases. That means they will need scalable ways to collect end-user feedback, post-train models, and cost-effectively run multiple variants of the same models.

Test 3 
Can we cost-effectively serve many use-case-specific variants of a model?

Key Considerations: If post-training on specific use cases can improve models to such an extent but degrade them for unrelated functions, then we are likely to end up with thousands of specialized models in a large enterprise. NVIDIA GPUs and other accelerators offer capabilities that can batch multiple user requests together to achieve far better cost-performance characteristics at higher levels of concurrency. Aible previously tested this approach on a base model with LoRA fine-tuning weights and confirmed the cost-performance curve shown below. We wanted to check whether we could still batch together user requests to different use-case-specific variants of a base model while achieving similar concurrency benefits. Otherwise, the accuracy benefits of the use-case-specific model variants might be offset by a cost increase due to the inability to effectively leverage concurrency.
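For context, one way such variants can share a GPU is multi-LoRA serving, where one copy of the base model stays resident and each use-case-specific variant is a small adapter loaded alongside it. Below is a minimal sketch using vLLM's LoRA support; the model name, adapter names, and paths are placeholders, and this is not necessarily the serving stack behind the measurements that follow.

```python
# Minimal sketch: serving several LoRA-specialized variants of one base model
# on a single GPU. Model name and adapter paths are placeholders.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# One copy of the base model weights stays on the GPU; each variant is a
# small LoRA adapter that can be activated per request.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder base model
    enable_lora=True,
    max_loras=4,  # how many adapters may be active in a batch at once
)
params = SamplingParams(temperature=0.0, max_tokens=256)

# Hypothetical adapters, e.g. one post-trained on F1-SQL feedback and one on
# a different business use case.
f1_sql = LoRARequest("f1_sql_variant", 1, "/adapters/f1_sql")
finance = LoRARequest("finance_variant", 2, "/adapters/finance")

# Each call routes its prompts through the requested adapter. In a live
# deployment, the vLLM server batches concurrent requests across adapters;
# here we simply call generate once per variant for clarity.
sql_out = llm.generate(
    ["How many races did each constructor win in 2016?"], params, lora_request=f1_sql
)
fin_out = llm.generate(
    ["Summarize the main drivers of Q3 revenue."], params, lora_request=finance
)
print(sql_out[0].outputs[0].text)
print(fin_out[0].outputs[0].text)
```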

[Chart: cost-per-request vs. average latency at increasing concurrency for a single post-trained model]

Results: We evaluated the cost-per-request vs. average latency curve for a single model and for two, three, and ten variants (different post-trained versions) of the same base model. The curves were fairly consistent. This shows that even if we end up with multiple specialized variants for different use cases, they can be run efficiently on GPUs at higher concurrencies, achieving close to the cost-performance characteristics of a single model variant. Of course, the optimal concurrency settings would differ depending on the number of variants being run in parallel.
[Chart: cost-per-request vs. average latency for one, two, three, and ten post-trained variants of the same base model]

The numbers on the dots indicate the concurrency level, i.e., the number of requests that were batched together and processed at the same time on a single H100 GPU.
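For readers who want to relate the dots to the cost axis, a simple way to approximate cost per request from a measured point on these curves is to divide the GPU's hourly price by the throughput implied by the concurrency and latency. The hourly price and the sample points below are placeholders, not the figures behind the charts above.

```python
def cost_per_request(gpu_hourly_cost: float, concurrency: int, avg_latency_s: float) -> float:
    """Approximate cost per request for one GPU that keeps `concurrency`
    requests in flight, each taking `avg_latency_s` seconds on average.
    Assumes the GPU stays saturated, so throughput is roughly
    concurrency / avg_latency_s requests per second."""
    requests_per_hour = (concurrency / avg_latency_s) * 3600
    return gpu_hourly_cost / requests_per_hour


# Placeholder numbers for illustration only.
for concurrency, latency_s in [(1, 0.8), (8, 1.2), (32, 2.5)]:
    print(concurrency, f"${cost_per_request(4.00, concurrency, latency_s):.5f} per request")
```

The same arithmetic explains why higher-concurrency points sit lower on the cost axis even though their average latency is higher.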

Final Thoughts
But is this level of model improvement a fluke? The Stanford s1 paper explains: “Why does supervised finetuning on just 1,000 samples lead to such performance gains? We hypothesize that the model is already exposed to large amounts of reasoning data during pretraining which spans trillions of tokens. Thus, the ability to perform reasoning is already present in our model. Our sample-efficient finetuning stage just activates it and we scale it further at test time with budget forcing.” In other words, these language models already have a lot of information built into them, and targeted guidance from a human makes a huge difference. This is no different than what you do for a new employee - explain to them how your company works, where to go to get information, how to think through problems, etc.

Language models will always struggle in enterprise use cases, because the model has never seen each enterprise's private data and does not understand each business's unique terminology and ways of doing things. But now users can easily train that AI employee on specific business tasks just by providing some guidance on the reasoning steps taken by the AI.

Moreover, because these small models can easily run in your own cloud or on edge devices, and be post-trained on the same devices, you don't have to send your data to a shared large model. Jensen Huang recently said, “The IT department of every company is going to be the HR department of AI agents in the future.” If I may add… each of us is going to be the manager of the AI agents in our lives.

Now anyone can make AI their own. They are Aible.

Attendees of GTC 2025 can sign up to get involved in the Aible Intern Project at www.aible.com.

Contributors
The Aible Intern Project involved Aible Customer Advisory Board participation and user feedback from a wide variety of organizations - including multiple Fortune 500s, the CIO of the State of Nebraska, the Principal Scientist for Strategic Planning and Analysis from the Air Force Research Lab Aerospace Systems Directorate, etc. As we receive permission from specific contributors to mention their names and organizations, we will list them here.