You don't pick the RL algorithm — SIA's Feedback loop does

SIA (Self Improving AI), released by Hexo Labs on May 26, 2026, is the first open-source framework that co-evolves both an agent's scaffold and its model weights inside a single iterative loop. The MIT-licensed code is at github.com/hexo-ai/sia. This tutorial walks through the feedback loop logic, prerequisites, and a runnable five-generation LawBench experiment.

The Feedback Loop That Decides PPO, GRPO, or EAW

SIA's Feedback-Agent reads full execution trajectories, reward metrics, and task descriptions each generation, then decides whether the next step should be a scaffold edit, a LoRA weight update, or both. It also selects the RL algorithm automatically, based on the reward shape of the current task. Before SIA, harness-update systems (Darwin Gödel Machine, Hyperagents) and test-time training systems (TTRL, Discover-TTT) were entirely separate research directions. Per the SIA paper (arXiv:2605.27276), SIA is the first framework to combine both levers in a single self-improving loop.
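To make "selects the RL algorithm based on reward shape" concrete, here is a minimal Python sketch of what such a selection step could look like. It is an illustration only, not SIA's code: the article does not spell out the actual rule, and the Feedback-Agent may well reason over trajectories in ways a fixed heuristic cannot capture. The names `RewardStats` and `select_algorithm`, the mapping from reward shape to algorithm, and the 3-sigma threshold are all assumptions made for this example.

```python
from dataclasses import dataclass
from enum import Enum
from statistics import mean, pstdev


class Algo(Enum):
    PPO_GAE = "PPO+GAE"
    GRPO = "GRPO"
    EAW = "Entropic Advantage Weighting"


@dataclass
class RewardStats:
    """Hypothetical summary of one generation's rewards (not a SIA type)."""
    per_step: bool          # True if the task emits a reward at every step
    rewards: list[float]    # one scalar return per trajectory


def select_algorithm(stats: RewardStats) -> Algo:
    """Pick an RL algorithm from the reward shape.

    ASSUMPTION: this mapping and the 3-sigma threshold are invented for
    illustration; they are not taken from the SIA paper or repository.
    """
    # Dense per-step rewards: a value baseline with GAE can use them.
    if stats.per_step:
        return Algo.PPO_GAE

    mu = mean(stats.rewards)
    sigma = pstdev(stats.rewards)

    # Outcome-only rewards where a few trajectories sit far above the
    # mean: up-weight those rare successes.
    if sigma > 0 and max(stats.rewards) - mu > 3 * sigma:
        return Algo.EAW

    # Outcome-only rewards with a balanced spread: group-relative baseline.
    return Algo.GRPO


if __name__ == "__main__":
    balanced = RewardStats(per_step=False, rewards=[0.0, 1.0, 0.0, 1.0])
    print(select_algorithm(balanced).value)
```

The point of the sketch is the shape of the decision: the choice is a function of observed reward statistics rather than a user-supplied flag.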

Quick Answer: SIA (arXiv:2605.27276, MIT license, May 2026) co-evolves agent scaffold and LoRA weights in a single loop. Run sia --task lawbench --max_gen 5; the Feedback-Agent picks PPO+GAE, GRPO, or Entropic Advantage Weighting (EAW) based on reward shape, so you do not choose the RL algorithm yourself. On LawBench, the combined harness+weights variant reached 70.1% accuracy, 25.1 percentage points over prior SOTA.
