<h1 align="center">
<a href="https://prompts.chat">
To enable more open-source research on instruction-following large language models, we generate 52K instruction-following demonstrations using OpenAI's text-davinci-003 model.
The instruction-following demonstrations are bootstrapped from the seed set released by the self-instruct project. Because the dataset is generated, it is difficult to pinpoint who or what the instances represent.
In total, there are 52,002 instances in the dataset.
not applicable.
instruction: str, describes the task the model should perform. Each of the 52K instructions is unique.
input: str, optional context or input for the task. For example, when the instruction is "Summarize the following article", the input is the article. Around 40% of the examples have an input.
output: str, the answer to the instruction as generated by text-davinci-003.
no.
not applicable.
The fine-tuning target is the response generated by text-davinci-003.
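The three fields described above, with `output` as the fine-tuning target, could be modeled roughly as follows. This is a minimal sketch, not code from the repository: the names `AlpacaRecord`, `has_input`, and `input_fraction` are hypothetical, the sample records are invented for illustration, and it assumes that a missing input is stored as an empty string.

```python
from typing import TypedDict


class AlpacaRecord(TypedDict):
    """One dataset instance; field names follow the description above."""
    instruction: str  # task the model should perform
    input: str        # optional context; assumed to be "" when absent
    output: str       # response generated by text-davinci-003 (fine-tuning target)


def has_input(record: AlpacaRecord) -> bool:
    """Return True when the record carries non-empty context."""
    return bool(record["input"].strip())


def input_fraction(records: list[AlpacaRecord]) -> float:
    """Fraction of records whose input field is non-empty."""
    if not records:
        return 0.0
    return sum(has_input(r) for r in records) / len(records)


# Hypothetical records written for illustration only (not taken from the dataset).
examples: list[AlpacaRecord] = [
    {
        "instruction": "Summarize the following article.",
        "input": "Some article text.",
        "output": "A short summary.",
    },
    {
        "instruction": "Name three primary colors.",
        "input": "",
        "output": "Red, blue, and yellow.",
    },
]
```

Running `input_fraction` over the full dataset is one way to check the "around 40%" figure quoted above, provided the empty-string assumption holds.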
The Alpaca models (both the demo model and the ones that will be released) are trained on all 52K examples. There is no recommended data split for the dataset.
All 52K instructions are unique. However, some generated instructions may not be sensible, i.e., there may not exist any good response to the instruction.
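The uniqueness claim can be checked with a few lines of code. The sketch below is illustrative only: `duplicate_instructions` is a hypothetical helper, and it assumes the records are dictionaries with an `instruction` key and that uniqueness means exact string equality.

```python
from collections import Counter


def duplicate_instructions(records):
    """Map each instruction that occurs more than once to its count."""
    counts = Counter(record["instruction"] for record in records)
    return {text: n for text, n in counts.items() if n > 1}
```

An empty result means every instruction is distinct. Near-duplicates that differ only in wording or whitespace would not be detected by this exact-match check.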
The dataset is self-contained.
no.
The generated data may contain a few inappropriate responses. In our preliminary testing, we have not encountered any offensive responses.
The GitHub repository contains the code to generate the dataset.
The dataset is used to train the Alpaca models, both the one used for the demo and the ones released.
Please see https://github.com/tatsu-lab/stanford_alpaca
This dataset is generated using OpenAI's API. Therefore, it cannot be used for commercial purposes that compete with OpenAI.
The dataset should not be used for commercial purposes that compete with OpenAI.
The dataset can be freely downloaded.
The dataset can be downloaded from the GitHub repository as a JSON file.
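Once downloaded, the JSON file can be read with the standard library. The following is a minimal sketch under stated assumptions: `load_records` is a hypothetical helper, and it assumes the file holds a single top-level JSON array of objects. The file name in the trailing comment is also an assumption and should be replaced with the actual path of the downloaded file.

```python
import json
from pathlib import Path


def load_records(path):
    """Load the dataset, assuming the file is a JSON array of objects."""
    with Path(path).open(encoding="utf-8") as f:
        records = json.load(f)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    return records


# Example call (file name is an assumption, adjust to the downloaded file):
# records = load_records("alpaca_data.json")
```

Reading the whole file into memory at once is reasonable for a dataset of about 52K short records.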
This dataset is distributed under the ODC-By license.
no
no
The dataset is hosted on GitHub, and the GitHub repository is maintained by Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, and Xuechen Li.
Please open an issue in the GitHub repository.
We do not plan to update the dataset.