What is the difference between instruction tuning and normal fine-tuning for large language models?
Also the instruction-tuning I'm referring to isn't the in-context/prompt one.
All the recent papers about fine-tuning seem to be about instruction tuning.
I have looked at a couple of papers about fine-tuning/instruction tuning (e.g. FLAN) and none really describe the difference between instruction tuning and the alternatives (whatever the alternatives are).
I understand instruction-tuning is a form of fine-tuning but with an instruction dataset. But are all datasets not instruction datasets? What other kinds are there?
As you said, fine-tuning and instruction tuning are not mutually exclusive: instruction tuning is a form of (supervised) fine-tuning. There is therefore no distinguishing feature of fine-tuning that sets it apart from instruction tuning; the distinction only works the other way around. So the answer to your first question is "No" (I read it as "Is every dataset an instruction dataset?", not as "Do instruction datasets even exist?").
What is special about instruction tuning is that the model is fine-tuned for an instruction-following task, i.e. a task that consists of instructing the model to perform another task. You thus have a second "level" of tasks (e.g. "Split the following number into digits") that is defined only in the instructions, which are part of the model's input sequence.
In classical types of supervised fine-tuning, you have no instructions but directly tune the model to perform a single downstream task, e.g. to split an input number into digits, without the model being explicitly told to do so in its input. (However, there are also hybrid approaches that involve both fine-tuning and explicit instructions.)
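To make the contrast concrete, here is a minimal sketch of what a single training example might look like in each setting. The field names and the digit-splitting task are just illustrative assumptions, not a fixed format:

```python
# Instruction tuning: the task is stated in the input itself.
# One dataset can mix many different instructions/tasks.
instruction_example = {
    "input": "Split the following number into digits: 4096",
    "output": "4 0 9 6",
}

# Classical supervised fine-tuning: no instruction in the input.
# The whole dataset implicitly encodes a single task (digit splitting),
# which the model learns from the input-output mapping alone.
classic_example = {
    "input": "4096",
    "output": "4 0 9 6",
}
```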
So although the word "task" is often used to refer to either, it is essential to conceptually distinguish between the instruction-following task itself (the downstream task the model is directly fine-tuned for) and the tasks defined by the instructions inside the model's input (which the model only encounters at inference time).
In summary, one could say that in instruction following, the actual task is determined dynamically, at inference time, while in the classical fine-tuning approach without instructions or similar devices, the actual task is determined statically, at training time.
Your confusion might be connected to the fact that prompting, which is another widespread adaptation technique, can involve an abstract description of the task (e.g. in zero-shot prompting), which can be formulated as an instruction.
But again, this is not necessary: in few-shot prompting, the prompt may consist only of input-output examples of the task, plus the input for which the model should predict the output, without any abstract description of the task.
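For illustration, a rough sketch of the two prompt styles (the wording and the task are made up):

```python
# Zero-shot prompt: contains an abstract task description (an instruction),
# but no examples of the task.
zero_shot_prompt = "Split the following number into digits: 4096"

# Few-shot prompt: contains no abstract task description at all,
# only input-output examples of the task followed by the query input.
few_shot_prompt = (
    "123 -> 1 2 3\n"
    "58 -> 5 8\n"
    "4096 -> "
)
```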
To answer your second question: You can find many datasets/benchmarks on the Hugging Face Hub. If you click through a few of them at random, you will see in the preview that most of them don't contain any instructions.
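If you want to check this programmatically, here is a sketch using the `datasets` library; the two dataset names are just examples I believe are on the Hub, so substitute any datasets you like:

```python
from datasets import load_dataset

# A classical (non-instruction) dataset: plain text plus a class label,
# with no instructions anywhere in the examples.
ag_news = load_dataset("ag_news", split="train")
print(ag_news[0])  # e.g. {'text': '...', 'label': 2}

# An instruction dataset: each example carries its own instruction.
alpaca = load_dataset("tatsu-lab/alpaca", split="train")
print(alpaca[0])  # e.g. {'instruction': '...', 'input': '...', 'output': '...'}
```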
EDIT: I forgot to mention one important aspect of instruction tuning: Depending on the application or research question, it is often a goal of instruction tuning to generalize instruction following across tasks. That is, the model should learn to follow instructions based on the implicit knowledge it accumulated during pre-training, and not only based on the instructions it saw during instruction tuning. To measure this cross-task generalization capability, instruction datasets are often divided into multiple tasks. Some of these tasks (not only some split of each task) are held out during instruction tuning and are used during evaluation only.
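To sketch what such a task-level split might look like, assuming a hypothetical "task" field that names the task each example's instruction defines:

```python
import random

# Hypothetical instruction dataset: each example is tagged with the
# name of the task its instruction defines.
examples = [
    {"task": "digit_splitting", "input": "Split 4096 into digits.", "output": "4 0 9 6"},
    {"task": "translation", "input": "Translate 'Hund' to English.", "output": "dog"},
    # ... many more examples across many tasks ...
]

# Hold out entire tasks, not just a random split of each task,
# so evaluation measures cross-task generalization.
tasks = sorted({ex["task"] for ex in examples})
random.shuffle(tasks)
held_out_tasks = set(tasks[: max(1, len(tasks) // 5)])  # e.g. ~20% of tasks

train_set = [ex for ex in examples if ex["task"] not in held_out_tasks]
eval_set = [ex for ex in examples if ex["task"] in held_out_tasks]
```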