Decoder vs Encoder-Decoder
Note: This is a work in progress to help me understand the differences between the two architectures.
Decoder
- Much easier to train: take any text and train the model to predict the next token (see the sketch after this list).
- Works better if all you're doing is unsupervised pre-training and hoping it generalizes.
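A minimal sketch of the decoder-only objective, assuming the Hugging Face `transformers` library (with `gpt2` as a stand-in model). The key point: any text is a training example, and passing the input ids as labels makes the library compute the shifted next-token cross-entropy loss.

```python
# Minimal sketch, assuming the Hugging Face transformers library.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Any text at all works as training data; no task definition needed.
inputs = tokenizer("any text at all works as training data", return_tensors="pt")

# labels = input_ids: the model shifts them internally and computes
# cross-entropy on next-token prediction.
outputs = model(**inputs, labels=inputs["input_ids"])
outputs.loss.backward()  # one unsupervised pre-training step
```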
Encoder-Decoder
- More difficult to train: you have to decide up front what counts as the input and what counts as the output (see the sketch after this list).
- Works better if you can do multi-task fine-tuning, and it's not too hard to convert models between the two architectures.
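A matching sketch for the encoder-decoder objective, again assuming Hugging Face `transformers` (with `t5-small` as a stand-in). Unlike the decoder-only case, each training example is an explicit (input, output) pair, so someone has to define the task.

```python
# Minimal sketch, assuming the Hugging Face transformers library.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Input and output must be defined up front (here, a translation task).
inputs = tokenizer("translate English to German: Hello", return_tensors="pt")
labels = tokenizer("Hallo", return_tensors="pt").input_ids

# The encoder reads the input; the decoder is trained to emit the labels.
outputs = model(**inputs, labels=labels)
outputs.loss.backward()  # one supervised training step
```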
Relevant Papers
It's possible that decoder-only + RLHF (e.g. ChatGPT) is better than encoder-decoder + multi-task fine-tuning (e.g. Flan-T5).