Alpaca Instruction Following Dataset

Motivation

For what purpose was the dataset created?

To enable more open-source research on instruction-following large language models, we generated 52K instruction-following demonstrations using OpenAI's text-davinci-003 model.

The instruction-following demonstrations are bootstrapped from the seed set released by the self-instruct project. Because the dataset is machine-generated, it is difficult to pinpoint who or what the instances represent.

How many instances are there in total?

In total, there are 52,002 instances in the dataset.

Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set?

Not applicable.

What data does each instance consist of?

- `instruction`: `str`, describes the task the model should perform. Each of the 52K instructions is unique.
- `input`: `str`, optional context or input for the task. For example, when the instruction is "Summarize the following article", the input is the article. Around 40% of the examples have an input.
- `output`: `str`, the answer to the instruction as generated by `text-davinci-003`.
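As a rough illustration of the three fields described above, the sketch below models one instance as a typed dictionary and counts how many records carry an input. The two records are made up for illustration and are not taken from the dataset. Storing the records as a JSON list, and using an empty string for a missing `input`, are both assumptions rather than something this datasheet states.

```python
import json
from typing import List, TypedDict


class AlpacaInstance(TypedDict):
    instruction: str  # task the model should perform
    input: str        # optional context; assumed to be "" when absent
    output: str       # answer generated by text-davinci-003


# Two made-up records in the shape described above (not real dataset rows).
raw = """
[
  {"instruction": "Summarize the following article.",
   "input": "Placeholder article text.",
   "output": "Placeholder summary."},
  {"instruction": "Name three primary colors.",
   "input": "",
   "output": "Red, blue, and yellow."}
]
"""

records: List[AlpacaInstance] = json.loads(raw)


def has_input(record: AlpacaInstance) -> bool:
    """Return True when the optional input field carries any text."""
    return bool(record["input"].strip())


# Share of records whose input field is non-empty.
fraction_with_input = sum(has_input(r) for r in records) / len(records)
print(f"{fraction_with_input:.0%} of records have an input")
```

Run against the full dataset, the same calculation would be expected to land near the 40% figure quoted above, provided the empty-string assumption holds.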

Is any information missing from individual instances?

No.

Not applicable.

Is there a label or target associated with each instance?

The fine-tuning target is the response generated by `text-davinci-003`.
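To make the idea of a fine-tuning target concrete, here is a minimal sketch that splits one record into a prompt and a target. The function name `build_training_pair` and the prompt wording are hypothetical. They are not the template used to train the Alpaca models, which this datasheet does not specify.

```python
from typing import Dict, Tuple


def build_training_pair(record: Dict[str, str]) -> Tuple[str, str]:
    """Split one record into (prompt, target) using a hypothetical template."""
    prompt = f"Instruction: {record['instruction']}\n"
    # Include the input line only when the optional field is non-empty.
    if record.get("input", "").strip():
        prompt += f"Input: {record['input']}\n"
    prompt += "Response: "
    # The target is the generated response stored in the output field.
    target = record["output"]
    return prompt, target


# Made-up record for illustration only.
example = {
    "instruction": "Name three primary colors.",
    "input": "",
    "output": "Red, blue, and yellow.",
}
prompt, target = build_training_pair(example)
print(prompt + target)
```

Only the `output` text serves as the target. The instruction and the optional input form the conditioning prompt.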

Are there recommended data splits (e.g., training, development/validation, testing)?

The Alpaca models (both the demo and the ones that will be released) are trained on all 52K examples. There is no recommended data split for the dataset.

Are there any errors, sources of noise, or redundancies in the dataset?

All 52K instructions are unique. However, some generated instructions may not be sensible, i.e., there may be no good response to the instruction.
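The uniqueness claim is straightforward to check mechanically. The sketch below is one way to do it, assuming records are dictionaries with an `instruction` key. The helper name `duplicated_instructions` and the sample records are made up for illustration. It compares instructions after trimming whitespace and does not attempt to detect instructions that are not sensible.

```python
from collections import Counter
from typing import Dict, List


def duplicated_instructions(records: List[Dict[str, str]]) -> List[str]:
    """Return instructions that occur more than once, after trimming whitespace."""
    counts = Counter(r["instruction"].strip() for r in records)
    return [text for text, n in counts.items() if n > 1]


# Made-up sample containing one deliberately repeated instruction.
sample = [
    {"instruction": "Translate to French.", "input": "Hello", "output": "Bonjour"},
    {"instruction": "Translate to French.", "input": "Goodbye", "output": "Au revoir"},
    {"instruction": "List two fruits.", "input": "", "output": "Apple, pear"},
]
print(duplicated_instructions(sample))
```

An empty list returned for the full dataset would be consistent with the statement that all instructions are unique.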