continualcode

github · tweet · design doc

A coding agent that updates its own weights from your corrections. Built on Tinker and SDPO, where the model conditioned on your correction becomes its own teacher. You deny a tool call, type what went wrong, gradient step, retry.

pip install continualcode
You: "fix the test"
Agent: write(test.py, ...)       # overwrites the file
You: n → "use edit_lines; don't overwrite"
  → SDPO step runs (~2s)
  → weights updated, agent retries
Agent: edit_lines(test.py, 14, 17, ...)
You: y

How it works

The agent has seven tools: read, write, edit, edit_lines, glob, grep, bash. One tool call per turn, so each correction maps to exactly one set of generated tokens.
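
A sketch of the per-turn record this implies; the Turn container and its field names are illustrative, not the project's internal types:

from dataclasses import dataclass

@dataclass
class Turn:
    prompt_ids: list[int]    # context the model was conditioned on
    sampled_ids: list[int]   # tokens of the single tool call it generated
    tool_call: dict          # parsed call, e.g. {"name": "write", "args": {...}}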

Four things can happen at each step:

Approve — tool call executes. No training.
Deny with correction — the main learning event. One gradient step, then retry.
Edit — you fix the tool call args directly. The diff is the correction signal.
Intermediary feedback — free-form notes ("this project uses Poetry not pip") that accumulate in session context.
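
Roughly, the review loop branches on those four outcomes. A sketch, where every helper (prompt_user, execute, sdpo_step, apply_user_edit, diff) is an illustrative stand-in rather than the project's actual API:

def review_loop(agent, turn, session):
    # turn.tool_call is the single proposed call; turn.sampled_ids are its tokens.
    verdict, text = prompt_user(turn.tool_call)        # y / n + reason / edit / note
    if verdict == "approve":
        return execute(turn.tool_call)                 # runs the tool; no training
    if verdict == "deny":
        sdpo_step(agent, turn, correction=text)        # one gradient step on this turn's tokens
        return agent.retry(session)                    # re-sample with updated weights
    if verdict == "edit":
        fixed = apply_user_edit(turn.tool_call, text)  # user-corrected args
        sdpo_step(agent, turn, correction=diff(turn.tool_call, fixed))
        return execute(fixed)
    if verdict == "note":                              # intermediary feedback
        session.context.append(text)                   # accumulates in context; no weight update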

Why self-distillation

The standard way to train from human feedback is RL: sample a bunch of completions, score them with a reward, update the policy. GRPO samples 64 completions per prompt. Showing a developer 64 tool calls and asking them to pick the best one is not a product.

The deeper problem is signal density. A scalar reward gives you one bit per episode: good or bad. Every token in the sequence gets the same gradient, whether it was the token that caused the bug or a perfectly fine import statement. That's O(1) bits of learning from each interaction.

SDPO (Hübotter et al. 2026) showed there's a better way. Instead of scoring completions with an external reward, you use the model itself as the teacher. The trick: give the same model your correction as additional context. Now it can re-evaluate every token it generated, knowing what went wrong. The logprob gap between "model with correction" and "model without" is a per-token advantage:

advantage[t] = teacher_logprob[t] - student_logprob[t]

Positive where the teacher agrees more strongly with that token. Negative where it would have done something different. O(N) bits from a single correction.
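
In code, the advantage is two scoring passes over the same sampled tokens, one with the correction appended to the context. A minimal sketch, assuming a logprobs(context_ids, target_ids) call that returns per-token logprobs; the real API may look different:

def per_token_advantage(model, prompt_ids, sampled_ids, correction_ids):
    # Student view: score the sampled tool-call tokens given the original prompt.
    student_lp = model.logprobs(prompt_ids, sampled_ids)
    # Teacher view: same weights, same tokens, but with the correction in context.
    teacher_lp = model.logprobs(prompt_ids + correction_ids, sampled_ids)
    # Positive where the correction makes a token more likely, negative where
    # the corrected model would have generated something else.
    return [t - s for t, s in zip(teacher_lp, student_lp)]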

[Figure: GRPO vs. SDPO advantage assignment. GRPO assigns one scalar reward, so every token gets the same advantage; SDPO computes a per-token advantage from the self-teacher. Red = teacher disagrees (suppress), blue = teacher agrees more strongly (reinforce).]

This matters because the teacher isn't a separate model. It's the same weights, same architecture — just conditioned on richer context. There's no reward model to train, no critic network, no preference pairs to collect. The model teaches itself by comparing what it knew before your correction to what it knows after.

On-policy distillation

Why does this work better than just fine-tuning on the corrected output (SFT)?

SFT minimizes forward KL — it forces the model to assign high probability to the correction data. This is mode-covering: the model has to spread probability mass over all corrections it's seen, even if they conflict with its existing knowledge. Shenfeld and Pari (2025) showed that forward KL between the fine-tuned and base model predicts catastrophic forgetting with R² ≈ 0.96.

Self-distillation minimizes reverse KL — mode-seeking. The model only needs to match the teacher on tokens it would actually generate. It can learn the new pattern without flattening out everything else. On-policy means the training data is the model's own completions, not an external dataset. This avoids the compounding distribution mismatch that Agarwal et al. (2023) identified as the core failure mode of off-policy distillation.
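
A toy illustration of the difference (made-up numbers, not training code): forward KL is an expectation under the data, so it grows wherever the student misses a mode; reverse KL is an expectation under the student, so it only penalizes what the student actually generates.

import math

p = [0.45, 0.45, 0.10]   # "data"/teacher distribution with two modes
q = [0.90, 0.05, 0.05]   # student collapsed onto one mode

forward_kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))   # KL(p || q): mode-covering
reverse_kl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))   # KL(q || p): mode-seeking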

The update uses importance-sampled policy gradients with a KL penalty to the reference adapter:

is_ratio = π_current(t) / π_old(t)    # clamped [0.5, 2.0]
loss     = -mean(is_ratio · adv · log π(t)) + β · KL(π_θ ‖ π_ref)

Only LoRA parameters get updated. Rank 16, attention projections — about 0.1% of total weights. The low-rank constraint itself prevents catastrophic forgetting: the base model stays frozen, the adapter encodes corrections.
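
Spelled out per token in PyTorch-style pseudocode, with 1-D tensors of per-token logprobs and an arbitrary beta; this is a sketch, not the actual Tinker training call:

import torch

def sdpo_loss(logp_current, logp_old, logp_ref, advantage, beta=0.1):
    # All inputs are per-token values for one tool call, shape [T].
    is_ratio = torch.exp(logp_current.detach() - logp_old).clamp(0.5, 2.0)  # importance weight, treated as a constant
    pg_loss = -(is_ratio * advantage * logp_current).mean()                 # per-token policy gradient
    kl_term = (logp_current - logp_ref).mean()                              # token-level reverse-KL surrogate
    return pg_loss + beta * kl_term                                         # gradient reaches only the LoRA adapter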

Why not DPO / GRPO / PPO / SFT

DPO needs preference pairs. In a CLI you see one tool call, not two.
GRPO needs 64 samples per prompt. Not a product.
PPO doubles memory with a critic. Clipping drops rare-but-important tokens.
SFT on corrections is off-policy. Forward KL overwrites existing capabilities.

Limitations

The training API exposes only scalar logprobs, not full distributions, so there's no exact JSD or forward KL, just a token-level reverse-KL surrogate. There's no EMA teacher either (no weight-level API access), so the teacher changes instantly with each update rather than being smoothed; as a partial substitute, an optional trust-region regularizer mixes the current teacher with a frozen reference. Credit assignment assumes one tool call per message.
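
One plausible shape for that trust-region mixing, with an assumed interpolation weight alpha; the actual implementation may differ:

def mixed_teacher_logprobs(teacher_lp, ref_lp, alpha=0.8):
    # Blend the live teacher's per-token logprobs with a frozen reference
    # adapter's; alpha=1.0 recovers the plain self-teacher, lower values keep
    # the advantage estimates closer to the reference.
    return [alpha * t + (1 - alpha) * r for t, r in zip(teacher_lp, ref_lp)]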

Lineage

Context distillation (Askell et al. 2021) showed you can bake prompted behavior into weights. GKD (Agarwal et al. 2023) showed on-policy distillation fixes the distribution mismatch that breaks SFT. SDPO (Hübotter et al. 2026) used environmental feedback as the teacher's privileged context: 10x faster than GRPO, with 7x shorter traces. SDFT (Shenfeld et al. 2026) showed that self-distillation enables continual learning without catastrophic forgetting.

Install

pip install continualcode
export TINKER_API_KEY=<your-key>
continualcode

Flags: enable_training=false for inference only, lora_rank=64, save_every=10.



© 2026