<h1 align="center">Baichuan-7B</h1>
🤗 Hugging Face • 🤖 ModelScope • 💬 WeChat
English | 中文
Baichuan-7B is an open-source, large-scale pre-trained language model developed by Baichuan Intelligent Technology. Based on the Transformer architecture, it contains 7 billion parameters and was trained on approximately 1.2 trillion tokens. It supports both Chinese and English with a context window of 4,096 tokens, and it achieves the best performance among models of the same size on standard Chinese and English benchmarks (C-Eval, MMLU, etc.).
C-Eval is a comprehensive Chinese evaluation benchmark for language models, covering 52 subjects and four difficulty levels. We used the dev split of this dataset as the source of few-shot examples and ran a 5-shot evaluation on the test split.
Change OPENMODEL_PATH and CEVAL_DATA_PATH in evaluate_zh.py to the directories of the model and the C-Eval dataset, respectively, then run:

```shell
cd evaluation
python evaluate_zh.py --model_name_or_path 'your/model/path'
```
| Model 5-shot | Average | Avg(Hard) | STEM | Social Sciences | Humanities | Others |
|---|---|---|---|---|---|---|
| GPT-4 | 68.7 | 54.9 | 67.1 | 77.6 | 64.5 | 67.8 |
| ChatGPT | 54.4 | 41.4 | 52.9 | 61.8 | 50.9 | 53.6 |
| Claude-v1.3 | 54.2 | 39.0 | 51.9 | 61.7 | 52.1 | 53.7 |
| Claude-instant-v1.0 | 45.9 | 35.5 | 43.1 | 53.8 | 44.2 | 45.4 |
| BLOOMZ-7B | 35.7 | 25.8 | 31.3 | 43.5 | 36.6 | 35.6 |
| ChatGLM-6B | 34.5 | 23.1 | 30.4 | 39.6 | 37.4 | 34.5 |
| Ziya-LLaMA-13B-pretrain | 30.2 | 22.7 | 27.7 | 34.4 | 32.0 | 28.9 |
| moss-moon-003-base (16B) | 27.4 | 24.5 | 27.0 | 29.1 | 27.2 | 26.9 |
| LLaMA-7B-hf | 27.1 | 25.9 | 27.1 | 26.8 | 27.9 | 26.3 |
| Falcon-7B | 25.8 | 24.3 | 25.8 | 26.0 | 25.8 | 25.6 |
| TigerBot-7B-base | 25.7 | 27.0 | 27.3 | 24.7 | 23.4 | 26.1 |
| Aquila-7B* | 25.5 | 25.2 | 25.6 | 24.6 | 25.2 | 26.6 |
| Open-LLaMA-v2-pretrain (7B) | 24.0 | 22.5 | 23.1 | 25.3 | 25.2 | 23.2 |
| BLOOM-7B | 22.8 | 20.2 | 21.8 | 23.3 | 23.9 | 23.3 |
| Baichuan-7B | 42.8 | 31.5 | 38.2 | 52.0 | 46.2 | 39.3 |
Gaokao is an evaluation dataset built from questions used in the Chinese College Entrance Examination (Gaokao), designed to assess large language models' language ability and logical reasoning skills. We processed the dataset so that it contains only single-answer multiple-choice questions, and we ran a 5-shot test on all models.
| Model | Average |
|---|---|
| BLOOMZ-7B | 28.72 |
| LLaMA-7B | 27.81 |
| BLOOM-7B | 26.96 |
| TigerBot-7B-base | 25.94 |
| Falcon-7B | 23.98 |
| Ziya-LLaMA-13B-pretrain | 23.17 |
| ChatGLM-6B | 21.41 |
| Open-LLaMA-v2-pretrain | 21.41 |
| Aquila-7B* | 24.39 |
| Baichuan-7B | 36.24 |
AGIEval is a dataset aimed at evaluating a model's general abilities in cognition and problem solving. We ran a 5-shot test on all models.
| Model | Average |
|---|---|
| BLOOMZ-7B | 30.27 |
| LLaMA-7B | 28.17 |
| Ziya-LLaMA-13B-pretrain | 27.64 |
| Falcon-7B | 27.18 |
| BLOOM-7B | 26.55 |
| Aquila-7B* | 25.58 |
| TigerBot-7B-base | 25.19 |
| ChatGLM-6B | 23.49 |
| Open-LLaMA-v2-pretrain | 23.49 |
| Baichuan-7B | 34.44 |
* Aquila-7B is not yet available on Hugging Face, so we obtained the model from https://model.baai.ac.cn/model-detail/100098; the results may therefore differ from the official ones.
In addition to Chinese, we also evaluated the models' performance in English. MMLU is an English evaluation dataset comprising 57 multiple-choice tasks, covering elementary mathematics, American history, computer science, law, and more. The difficulty ranges from high-school to expert level, making it a mainstream evaluation dataset for large language models (LLMs).
We adopted the open-source evaluation scheme from https://github.com/hendrycks/test; the final results are shown below:
| Model | Humanities | Social Sciences | STEM | Other | Average |
|---|---|---|---|---|---|
| ChatGLM-6B<sup>0</sup> | 35.4 | 41.0 | 31.3 | 40.5 | 36.9 |
| BLOOMZ-7B<sup>0</sup> | 31.3 | 42.1 | 34.4 | 39.0 | 36.1 |
| mpt-7B<sup>1</sup> | - | - | - | - | 35.6 |
| LLaMA-7B<sup>2</sup> | 34.0 | 38.3 | 30.5 | 38.1 | 35.1 |
| Falcon-7B<sup>1</sup> | - | - | - | - | 35.0 |
| moss-moon-003-sft (16B)<sup>0</sup> | 30.5 | 33.8 | 29.3 | 34.4 | 31.9 |
| BLOOM-7B<sup>0</sup> | 25.0 | 24.4 | 26.5 | 26.4 | 25.5 |
| moss-moon-003-base (16B)<sup>0</sup> | 24.2 | 22.8 | 22.4 | 24.4 | 23.6 |
| Baichuan-7B<sup>0</sup> | 38.4 | 48.9 | 35.6 | 48.1 | 42.3 |
0: Our implementation
1: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
2: https://paperswithcode.com/sota/multi-task-language-understanding-on-mmlu
The MMLU evaluation can be reproduced with:

```shell
git clone https://github.com/hendrycks/test
cd test
wget https://people.eecs.berkeley.edu/~hendrycks/data.tar
tar xf data.tar
mkdir results
cp ../evaluate_mmlu.py .
python evaluate_mmlu.py -m /path/to/Baichuan-7B
```
(Figures: detailed results for each of the 57 MMLU tasks, and a comparison across 21 subjects.)
The model can be loaded and run for inference with Hugging Face Transformers:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("baichuan-inc/Baichuan-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("baichuan-inc/Baichuan-7B", device_map="auto", trust_remote_code=True)
inputs = tokenizer('Hamlet->Shakespeare\nOne Hundred Years of Solitude->', return_tensors='pt')
inputs = inputs.to('cuda:0')
pred = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
```
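The prompt in this example is a one-shot completion pattern: given the work-to-author pair for Hamlet, the model is expected to continue the second line with the author of One Hundred Years of Solitude.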
(Figure: the overall data processing pipeline.)
We use byte pair encoding (BPE) from SentencePiece as the tokenization algorithm, along with several additional optimizations. A comparison with other tokenizers is shown in the table below, followed by a rough sketch of how such a comparison can be reproduced:
| Model | Baichuan-7B | LLaMA | Falcon | mpt-7B | ChatGLM | moss-moon-003 |
|---|---|---|---|---|---|---|
| Compression Rate | 0.737 | 1.312 | 1.049 | 1.206 | 0.631 | 0.659 |
| Vocab Size | 64,000 | 32,000 | 65,024 | 50,254 | 130,344 | 106,029 |
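The compression rates above are taken from the authors' report. As a rough, non-authoritative illustration of how such a tokenizer comparison can be reproduced, the sketch below loads tokenizers through Hugging Face `transformers` and measures tokens per character on a sample sentence; the sample text, the second repository id (`huggyllama/llama-7b`), and the tokens-per-character proxy are assumptions for illustration and need not match the exact metric used in the table.

```python
# Rough illustration only: compare how many tokens different tokenizers
# produce for the same text. "Tokens per character" is an assumed proxy
# and is not necessarily the "Compression Rate" metric reported above.
from transformers import AutoTokenizer

sample = "百川智能发布了开源可商用的大规模预训练语言模型 Baichuan-7B。"  # assumed sample sentence

# The second repo id is an assumption, used only for comparison purposes.
for name in ["baichuan-inc/Baichuan-7B", "huggyllama/llama-7b"]:
    tok = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
    n_tokens = len(tok.tokenize(sample))
    print(f"{name}: vocab_size={tok.vocab_size}, tokens={n_tokens}, "
          f"tokens/char={n_tokens / len(sample):.3f}")
```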
The overall model is based on the standard Transformer structure, and we have adopted a model design similar to that of LLaMA.
We made numerous modifications to the original LLaMA framework to improve throughput during training.
With these optimization techniques, we achieved a throughput of 182 TFLOPS for the 7B model on a cluster of 1,000 A800 GPUs, with a peak GPU compute utilization of up to 58.3%.
(Figure: the final training loss of the model.)
To prepare the training environment, install the dependencies:

```shell
pip install -r requirements.txt
```
Split the training corpus evenly into a number of UTF-8 text files that is a multiple of the total number of ranks, and place them in the corpus directory (`data_dir` by default). Each rank reads a different subset of the files in the corpus directory, loads them all into memory, and then starts the subsequent training process. This is a simplified demonstration flow; for real training tasks, users are advised to adapt the data preparation logic to their own needs.
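As a minimal sketch of the splitting step described above, the snippet below divides a single corpus file into `world_size` UTF-8 shards under `data_dir`. The file names (`corpus.txt`, `shard_{i}.txt`) and the line-level round-robin split are assumptions for illustration, not the repository's actual preprocessing code.

```python
# Minimal sketch: split one large UTF-8 corpus into N shards (one per rank).
# File names and the round-robin split are illustrative assumptions.
from pathlib import Path

def split_corpus(src: str = "corpus.txt", data_dir: str = "data_dir", world_size: int = 8) -> None:
    out_dir = Path(data_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    shards = [open(out_dir / f"shard_{i}.txt", "w", encoding="utf-8") for i in range(world_size)]
    try:
        with open(src, encoding="utf-8") as f:
            for i, line in enumerate(f):
                shards[i % world_size].write(line)  # round-robin keeps shard sizes balanced
    finally:
        for s in shards:
            s.close()

if __name__ == "__main__":
    split_corpus()
```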
You can download our `tokenizer.model` from Hugging Face and place it in the root directory of the repository.
The demo code uses the DeepSpeed framework for training. Modify `config/hostfile` according to your cluster configuration.
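For reference, a DeepSpeed hostfile conventionally lists one node per line in the form `hostname slots=<num_gpus>`; the host names and slot counts are cluster-specific and must be filled in by the user. This note describes DeepSpeed's standard hostfile convention, not a file shipped with this repository.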
Then launch the training script:

```shell
scripts/train.sh
```
The use of the source code in this repository is governed by the Apache 2.0 open-source license.
The use of the Baichuan-7B model weights, however, must follow the Baichuan-7B Model License Agreement (《Baichuan-7B 模型许可协议》).