DeepSeek‑R1’s big idea is simple but radical: teach reasoning with pure reinforcement learning (RL) that rewards only the final, verifiably correct answer, with no human step‑by‑step traces. That freed the model to develop its own strategies (self‑checking, verification, even “wait/redo” moments) and unlocked a leap in math and coding performance. (Nature)
What changed the game
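Answer‑only RL → emergent reasoning. R1‑Zero learns via GRPO (Group Relative Policy Optimization, a PPO variant that scores each sampled answer against the others in its group instead of using a learned value model), with rule‑based rewards for correctness and format and no human chain‑of‑thought (CoT) labels. R1 then adds a small “cold‑start” supervised fine‑tuning (SFT) stage for readability and a final RL pass that retains the reasoning while improving the language. (Nature)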
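To make the "rule‑based reward plus group‑relative scoring" idea concrete, here is a minimal sketch in Python. It is an illustration only, not DeepSeek's code: the function names, the `<answer>` tag parsing, and the 0.1 format bonus are assumptions chosen for the example. It shows only the advantage computation, where each sampled answer's reward is normalized against the mean and standard deviation of its own group; the policy‑gradient update itself is omitted.

```python
import re
import statistics


def rule_based_reward(completion: str, reference_answer: str) -> float:
    """Hypothetical reward: a small format bonus plus 1.0 for a correct final answer.

    The tag format and the reward values are assumptions for this sketch.
    """
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0  # no parseable final answer: no reward at all
    format_bonus = 0.1
    is_correct = match.group(1).strip() == reference_answer.strip()
    return format_bonus + (1.0 if is_correct else 0.0)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group (mean 0, unit scale)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # One prompt, a group of three sampled completions.
    group = [
        "<think>6 * 7</think><answer>42</answer>",
        "<think>6 * 7, roughly</think><answer>41</answer>",
        "I think it is 42",  # correct number, but not in the expected format
    ]
    rewards = [rule_based_reward(c, "42") for c in group]
    print(rewards)
    print(group_relative_advantages(rewards))
```

Completions that beat their group's average get a positive advantage and are reinforced; the rest are pushed down. No step of the reasoning itself is graded, which is the point of the answer‑only setup.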
Measured gains. On AIME 2024, R1‑Zero’s pass@1 rose from 15.6% to 77.9% during RL, and reached 86.7% with self‑consistency decoding. That is evidence that final‑answer rewards alone can grow real reasoning. (Nature)
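For readers unfamiliar with the two metrics, the sketch below shows one common way to compute them. It is a toy illustration under assumptions: the function names and sample data are made up, pass@1 is taken as the average correctness of individual samples, and self‑consistency is taken as a majority vote over each problem's sampled answers.

```python
from collections import Counter


def pass_at_1(samples_per_problem, references):
    """Mean fraction of individually correct samples, averaged over problems."""
    per_problem = [
        sum(answer == ref for answer in samples) / len(samples)
        for samples, ref in zip(samples_per_problem, references)
    ]
    return sum(per_problem) / len(per_problem)


def majority_vote_accuracy(samples_per_problem, references):
    """Self-consistency: keep the most frequent answer per problem, then grade it."""
    hits = 0
    for samples, ref in zip(samples_per_problem, references):
        voted_answer, _count = Counter(samples).most_common(1)[0]
        hits += voted_answer == ref
    return hits / len(references)


if __name__ == "__main__":
    # Made-up final answers: three problems, four samples each.
    samples = [
        ["7", "7", "3", "7"],
        ["12", "10", "12", "12"],
        ["6", "6", "5", "9"],
    ]
    references = ["7", "12", "5"]
    print(pass_at_1(samples, references))
    print(majority_vote_accuracy(samples, references))
```

Voting helps when the correct answer is the most common one even though individual samples are noisy, which is why the self‑consistency figure can sit above pass@1.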
Cost shock. The supplement to DeepSeek’s peer‑reviewed Nature paper reports an R1 training cost of about $294k on 512 H800 GPUs, far below what many assumed for “elite” reasoning models. (Reuters)
Why markets cared: When R1 landed in January 2025, investors suddenly had to price in cheaper, open reasoning models. NVIDIA closed down about 17% (roughly $590B in market value) in a single session, broader AI supply‑chain names slid, and the Nasdaq fell about 3% and the S&P 500 about 1.5%. Call it the “DeepSeek moment.” (Reuters, Forbes)
Bottom line
Nature featured the work (published online Sept 17, 2025) because it breaks the human‑trace ceiling: answer‑graded RL can produce reasoning behaviors that can then be distilled into smaller, open models. That makes it both scientifically significant and disruptive for the industry. (Nature)
Great rundown by Rohan Paul; his full Substack article is linked below. The 83‑page Supplementary Information is also worth skimming if you’re hands‑on. (Nature)
#HealthcareAI #HealthTech #AI #DeepSeek #ReinforcementLearning #Nature #LLMs #GRPO #R1 #AIGovernance
Read Rohan Paul's full Substack article here: https://www.rohan-paul.com/p/deepseek-r1s-original-paper-was-re