Originally posted on LinkedIn on May 14, 2025
Last week, OpenAI released an important new feature for their reasoning models (o-series): Reinforcement Fine-Tuning. This will be very meaningful for organizations that do a lot of data entry into their systems.
In short, it is a process that gives you a tailor-made, fine-tuned reasoning model that performs a specific task much better than before, given the prompts you supply and the (business) rules to check against.
This is different from Supervised Fine-Tuning, which requires pairs of input prompts and output responses.
It is specifically made for tasks whose output you can *grade*, meaning you can assign a score to the result of the completed task. The grading must be unambiguous: you directly compare the output structure with the desired structure and return a number as the 'reward'. Internally, the fine-tuning process challenges itself by generating multiple candidate outputs, grading each of them, and using those grades to determine how to score better on the next round.
It is imperative, though, to have a great grader. Like a child, the fine-tuning process will try to find edge cases or loopholes to achieve a better score, which would invalidate the results. You will need to ensure the structural requirements are robust and extensively covered.
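To make that concrete, here is a minimal sketch of what such a grader could look like, assuming the model is asked to return JSON with a fixed set of fields (the field names, types, and per-field weighting are illustrative assumptions, not taken from OpenAI's documentation):

```python
import json

# Illustrative schema: the fields we expect the model to return.
# isinstance() accepts a tuple of types, so amounts may be int or float.
REQUIRED_FIELDS = {
    "invoice_number": str,
    "invoice_date": str,
    "total_amount": (int, float),
}

def grade(model_output: str) -> float:
    """Return a reward between 0.0 and 1.0 for a single model output."""
    try:
        data = json.loads(model_output)
    except json.JSONDecodeError:
        return 0.0  # unparseable output earns no reward
    if not isinstance(data, dict):
        return 0.0  # output must be a JSON object

    per_field = 1.0 / len(REQUIRED_FIELDS)
    score = 0.0
    for field, expected_type in REQUIRED_FIELDS.items():
        if isinstance(data.get(field), expected_type):
            score += per_field  # partial credit per well-typed field
    return round(score, 4)
```

In a real run the grader would also compare the extracted values against the labelled reference for each training example; this sketch only rewards structure and types, which is exactly the kind of leniency a loophole-hunting model will exploit.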
Reinforcement Fine-Tuning works great for:
- Forms with clear schemas: invoices, expense receipts, insurance claims
- Catalog enrichment: databases with strict type or format rules
- Regulatory filings: documents with hard checks, mandatory fields, and calculation consistency requirements (see the sketch after this list)
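Those 'hard checks' lend themselves to the same treatment. As a purely illustrative sketch (field names and tolerance are my assumptions), a grader could verify that extracted line items actually add up to the stated total:

```python
import json

def grade_totals(model_output: str, tolerance: float = 0.01) -> float:
    """Reward outputs whose line items sum to the stated invoice total."""
    try:
        data = json.loads(model_output)
        line_items = data["line_items"]  # expected: list of {"amount": ...}
        stated_total = float(data["total_amount"])
        computed_total = sum(float(item["amount"]) for item in line_items)
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return 0.0  # missing or malformed fields earn no reward

    # All-or-nothing reward: the calculation must be internally consistent.
    return 1.0 if abs(computed_total - stated_total) <= tolerance else 0.0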
Given how much structured data processing still relies on manual input, this new feature is low-hanging fruit for many organizations that already have tight schema validation or business rules in place.
The use case guide is a great resource to look at: it has real-world examples, including prompts and grader code, from organizations already deploying this in production: