Decoder vs Encoder-Decoder
Note: This is a work in progress to help me understand the differences between the two architectures.
Decoder
- Much easier to train: take any text and train the model to predict the next token (see the sketch after this list).
- Works better if all you're doing is unsupervised pre-training and hoping it generalizes.
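A minimal sketch of the decoder-only objective, assuming the Hugging Face `transformers` library (with `gpt2` as a stand-in model). The key point: any text is a training example, and passing the input ids as labels makes the library compute the shifted next-token cross-entropy loss.

```python
# Minimal sketch, assuming the Hugging Face transformers library.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Any text at all works as training data; no task definition needed.
inputs = tokenizer("any text at all works as training data", return_tensors="pt")

# labels = input_ids: the model shifts them internally and computes
# cross-entropy on next-token prediction.
outputs = model(**inputs, labels=inputs["input_ids"])
outputs.loss.backward()  # one unsupervised pre-training step
```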
Encoder-Decoder
- More difficult to train: you have to decide up front what counts as the input and what counts as the output (see the sketch after this list).
- Works better if you can do multi-task fine-tuning, and it's not too hard to convert models between the two architectures.
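A matching sketch for the encoder-decoder objective, again assuming Hugging Face `transformers` (with `t5-small` as a stand-in). Unlike the decoder-only case, each training example is an explicit (input, output) pair, so someone has to define the task.

```python
# Minimal sketch, assuming the Hugging Face transformers library.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Input and output must be defined up front (here, a translation task).
inputs = tokenizer("translate English to German: Hello", return_tensors="pt")
labels = tokenizer("Hallo", return_tensors="pt").input_ids

# The encoder reads the input; the decoder is trained to emit the labels.
outputs = model(**inputs, labels=labels)
outputs.loss.backward()  # one supervised training step
```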
Relevant Papers
It's possible that decoder-only + RLHF (e.g. ChatGPT) is better than encoder-decoder + multi-task fine-tuning (e.g. Flan-T5).