“DeepSeek is working with Tsinghua University on reducing the amount of training its AI models need, in an effort to lower operational costs.
The new method aims to help artificial intelligence models better adhere to human preferences by offering rewards for more accurate and understandable responses, the researchers wrote. Expanding [reinforcement learning] to more general applications has proven challenging — and that’s the problem that DeepSeek’s team is trying to solve with something it calls self-principled critique tuning. The strategy outperformed existing methods and models on various benchmarks, and the results showed better performance with fewer computing resources, according to the paper.
DeepSeek is calling these new models DeepSeek-GRM — short for “generalist reward modeling” — and will release them on an open source basis, the company said. Other AI developers, including Alibaba and OpenAI, are also pushing into a new frontier of improving reasoning and self-refining capabilities while an AI model is performing tasks in real time.”
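To make the "reward modeling" idea in the quote concrete, here's a toy sketch of reward-model-guided selection (best-of-N). This illustrates the general concept of a reward model scoring candidate responses only; it is not DeepSeek's actual SPCT method, and the scoring heuristic (`reward_model`) is invented for illustration.

```python
# Toy sketch: a reward model scores candidate responses, and the
# highest-scoring one is selected (best-of-N sampling).
# NOTE: the scoring heuristic below is a made-up stand-in, not a
# trained reward model and not DeepSeek's method.

def reward_model(prompt: str, response: str) -> float:
    """Hypothetical reward model: a stand-in heuristic that favors
    responses sharing more words with the prompt."""
    key_terms = set(prompt.lower().split())
    hits = sum(1 for word in response.lower().split() if word in key_terms)
    return hits / max(len(response.split()), 1)

def best_of_n(prompt: str, candidates: list[str]) -> str:
    """Return the candidate the reward model scores highest."""
    return max(candidates, key=lambda r: reward_model(prompt, r))

prompt = "explain reinforcement learning rewards"
candidates = [
    "Cats are great pets.",
    "Reinforcement learning assigns rewards to guide behavior.",
]
print(best_of_n(prompt, candidates))
```

In a real pipeline the heuristic would be replaced by a trained model, and the scores would feed back into reinforcement-learning updates rather than just selection.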
GrinNGrit
Isn’t this a little misleading? It’s only self-improving in the sense that they built a feedback loop into the model so it continuously gets better rather than performing a batch retraining every so-many months. It’s like the algorithm feeding you trash videos on Instagram “self-improving” based on how long you watch, how much you interact, etc.
I don’t see this as being novel or interesting; it just trades curated training data for faster updates. It becomes easier to poison the model now.
spirit8ball
meanwhile OpenAI is thinking about how to charge their clients more