<h1 align="center">Baichuan-7B</h1>
🤗 Hugging Face • 🤖 ModelScope • 💬 WeChat
English | 中文
Baichuan-7B is an open-source, large-scale pre-trained language model developed by Baichuan Intelligent Technology. Based on the Transformer architecture, it contains 7 billion parameters and was trained on approximately 1.2 trillion tokens. It supports both Chinese and English with a context window of 4,096 tokens, and it achieves the best performance among models of the same size on standard Chinese and English benchmarks (C-Eval, MMLU, etc.).
C-Eval is a comprehensive Chinese evaluation benchmark for language models, covering 52 subjects and four difficulty levels. We used the dev split of this dataset as the source of few-shot examples and ran a 5-shot evaluation on the test split.
Change OPENMODEL_PATH and CEVAL_DATA_PATH in evaluate_zh.py to the directories of the model and the C-Eval dataset, respectively, then run:

```shell
cd evaluation
python evaluate_zh.py --model_name_or_path 'your/model/path'
```
| Model 5-shot | Average | Avg(Hard) | STEM | Social Sciences | Humanities | Others |
|---|---|---|---|---|---|---|
| GPT-4 | 68.7 | 54.9 | 67.1 | 77.6 | 64.5 | 67.8 |
| ChatGPT | 54.4 | 41.4 | 52.9 | 61.8 | 50.9 | 53.6 |
| Claude-v1.3 | 54.2 | 39.0 | 51.9 | 61.7 | 52.1 | 53.7 |
| Claude-instant-v1.0 | 45.9 | 35.5 | 43.1 | 53.8 | 44.2 | 45.4 |
| BLOOMZ-7B | 35.7 | 25.8 | 31.3 | 43.5 | 36.6 | 35.6 |
| ChatGLM-6B | 34.5 | 23.1 | 30.4 | 39.6 | 37.4 | 34.5 |
| Ziya-LLaMA-13B-pretrain | 30.2 | 22.7 | 27.7 | 34.4 | 32.0 | 28.9 |
| moss-moon-003-base (16B) | 27.4 | 24.5 | 27.0 | 29.1 | 27.2 | 26.9 |
| LLaMA-7B-hf | 27.1 | 25.9 | 27.1 | 26.8 | 27.9 | 26.3 |
| Falcon-7B | 25.8 | 24.3 | 25.8 | 26.0 | 25.8 | 25.6 |
| TigerBot-7B-base | 25.7 | 27.0 | 27.3 | 24.7 | 23.4 | 26.1 |
| Aquila-7B* | 25.5 | 25.2 | 25.6 | 24.6 | 25.2 | 26.6 |
| Open-LLaMA-v2-pretrain (7B) | 24.0 | 22.5 | 23.1 | 25.3 | 25.2 | 23.2 |
| BLOOM-7B | 22.8 | 20.2 | 21.8 | 23.3 | 23.9 | 23.3 |
| Baichuan-7B | 42.8 | 31.5 | 38.2 | 52.0 | 46.2 | 39.3 |
Gaokao is an evaluation dataset built from questions used in the Chinese College Entrance Examination (Gaokao), designed to assess large language models' language ability and logical reasoning skills. We processed the dataset so that it contains only single-answer multiple-choice questions, and we ran a 5-shot test on all models.
| Model | Average |
|---|---|
| BLOOMZ-7B | 28.72 |
| LLaMA-7B | 27.81 |
| BLOOM-7B | 26.96 |
| TigerBot-7B-base | 25.94 |
| Falcon-7B | 23.98 |
| Ziya-LLaMA-13B-pretrain | 23.17 |
| ChatGLM-6B | 21.41 |
| Open-LLaMA-v2-pretrain | 21.41 |
| Aquila-7B* | 24.39 |
| Baichuan-7B | 36.24 |
AGIEval is a dataset aimed at evaluating a model's general abilities in cognition and problem solving. We ran a 5-shot test on all models.
| Model | Average |
|---|---|
| BLOOMZ-7B | 30.27 |
| LLaMA-7B | 28.17 |
| Ziya-LLaMA-13B-pretrain | 27.64 |
| Falcon-7B | 27.18 |
| BLOOM-7B | 26.55 |
| Aquila-7B* | 25.58 |
| TigerBot-7B-base | 25.19 |
| ChatGLM-6B | 23.49 |
| Open-LLaMA-v2-pretrain | 23.49 |
| Baichuan-7B | 34.44 |
* Aquila-7B is not yet available on Hugging Face, so we obtained the model from https://model.baai.ac.cn/model-detail/100098; the results may therefore differ from the official ones.
In addition to Chinese, we also evaluated the models' performance in English. MMLU is an English evaluation dataset comprising 57 multiple-choice tasks, covering elementary mathematics, American history, computer science, law, and more. The difficulty ranges from high-school to expert level, making it a mainstream evaluation dataset for large language models (LLMs).
We adopted the open-source evaluation scheme from https://github.com/hendrycks/test; the final results are shown below:
| Model | Humanities | Social Sciences | STEM | Other | Average |
|---|---|---|---|---|---|
| ChatGLM-6B<sup>0</sup> | 35.4 | 41.0 | 31.3 | 40.5 | 36.9 |
| BLOOMZ-7B<sup>0</sup> | 31.3 | 42.1 | 34.4 | 39.0 | 36.1 |
| mpt-7B<sup>1</sup> | - | - | - | - | 35.6 |
| LLaMA-7B<sup>2</sup> | 34.0 | 38.3 | 30.5 | 38.1 | 35.1 |
| Falcon-7B<sup>1</sup> | - | - | - | - | 35.0 |
| moss-moon-003-sft (16B)<sup>0</sup> | 30.5 | 33.8 | 29.3 | 34.4 | 31.9 |
| BLOOM-7B<sup>0</sup> | 25.0 | 24.4 | 26.5 | 26.4 | 25.5 |
| moss-moon-003-base (16B)<sup>0</sup> | 24.2 | 22.8 | 22.4 | 24.4 | 23.6 |
| Baichuan-7B<sup>0</sup> | 38.4 | 48.9 | 35.6 | 48.1 | 42.3 |
0: Our implementation
1: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
2: https://paperswithcode.com/sota/multi-task-language-understanding-on-mmlu
The MMLU evaluation can be reproduced with:

```shell
git clone https://github.com/hendrycks/test
cd test
wget https://people.eecs.berkeley.edu/~hendrycks/data.tar
tar xf data.tar
mkdir results
cp ../evaluate_mmlu.py .
python evaluate_mmlu.py -m /path/to/Baichuan-7B
```
(Figures: detailed results for each of the 57 MMLU tasks, and a comparison across 21 subjects.)
The model can be loaded and run for inference with Hugging Face Transformers:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("baichuan-inc/Baichuan-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("baichuan-inc/Baichuan-7B", device_map="auto", trust_remote_code=True)
inputs = tokenizer('Hamlet->Shakespeare\nOne Hundred Years of Solitude->', return_tensors='pt')
inputs = inputs.to('cuda:0')
pred = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
```
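The prompt in this example is a one-shot completion pattern: given the work-to-author pair for Hamlet, the model is expected to continue the second line with the author of One Hundred Years of Solitude.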
(Figure: the overall data processing pipeline.)
We use byte pair encoding (BPE) from SentencePiece as the tokenization algorithm, along with several additional optimizations. A comparison with other tokenizers is shown in the table below, followed by a rough sketch of how such a comparison can be reproduced:
| Model | Baichuan-7B | LLaMA | Falcon | mpt-7B | ChatGLM | moss-moon-003 |
|---|---|---|---|---|---|---|
| Compression Rate | 0.737 | 1.312 | 1.049 | 1.206 | 0.631 | 0.659 |
| Vocab Size | 64,000 | 32,000 | 65,024 | 50,254 | 130,344 | 106,029 |
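The compression rates above are taken from the authors' report. As a rough, non-authoritative illustration of how such a tokenizer comparison can be reproduced, the sketch below loads tokenizers through Hugging Face `transformers` and measures tokens per character on a sample sentence; the sample text, the second repository id (`huggyllama/llama-7b`), and the tokens-per-character proxy are assumptions for illustration and need not match the exact metric used in the table.

```python
# Rough illustration only: compare how many tokens different tokenizers
# produce for the same text. "Tokens per character" is an assumed proxy
# and is not necessarily the "Compression Rate" metric reported above.
from transformers import AutoTokenizer

sample = "百川智能发布了开源可商用的大规模预训练语言模型 Baichuan-7B。"  # assumed sample sentence

# The second repo id is an assumption, used only for comparison purposes.
for name in ["baichuan-inc/Baichuan-7B", "huggyllama/llama-7b"]:
    tok = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
    n_tokens = len(tok.tokenize(sample))
    print(f"{name}: vocab_size={tok.vocab_size}, tokens={n_tokens}, "
          f"tokens/char={n_tokens / len(sample):.3f}")
```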
The overall model is based on the standard Transformer structure, and we have adopted a model design similar to that of LLaMA.
We made numerous modifications to the original LLaMA framework to improve throughput during training.
With these optimization techniques, we achieved a throughput of 182 TFLOPS for the 7B model on a cluster of 1,000 A800 GPUs, with a peak GPU compute utilization of up to 58.3%.
(Figure: the final training loss of the model.)
To prepare the training environment, install the dependencies:

```shell
pip install -r requirements.txt
```
Split the training corpus evenly into a number of UTF-8 text files that is a multiple of the total number of ranks, and place them in the corpus directory (`data_dir` by default). Each rank reads a different subset of the files in the corpus directory, loads them all into memory, and then starts the subsequent training process. This is a simplified demonstration flow; for real training tasks, users are advised to adapt the data preparation logic to their own needs.
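As a minimal sketch of the splitting step described above, the snippet below divides a single corpus file into `world_size` UTF-8 shards under `data_dir`. The file names (`corpus.txt`, `shard_{i}.txt`) and the line-level round-robin split are assumptions for illustration, not the repository's actual preprocessing code.

```python
# Minimal sketch: split one large UTF-8 corpus into N shards (one per rank).
# File names and the round-robin split are illustrative assumptions.
from pathlib import Path

def split_corpus(src: str = "corpus.txt", data_dir: str = "data_dir", world_size: int = 8) -> None:
    out_dir = Path(data_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    shards = [open(out_dir / f"shard_{i}.txt", "w", encoding="utf-8") for i in range(world_size)]
    try:
        with open(src, encoding="utf-8") as f:
            for i, line in enumerate(f):
                shards[i % world_size].write(line)  # round-robin keeps shard sizes balanced
    finally:
        for s in shards:
            s.close()

if __name__ == "__main__":
    split_corpus()
```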
You can download our `tokenizer.model` from Hugging Face and place it in the root directory of the repository.
The demo code uses the DeepSpeed framework for training. Modify `config/hostfile` according to your cluster configuration.
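For reference, a DeepSpeed hostfile conventionally lists one node per line in the form `hostname slots=<num_gpus>`; the host names and slot counts are cluster-specific and must be filled in by the user. This note describes DeepSpeed's standard hostfile convention, not a file shipped with this repository.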
Then launch the training script:

```shell
scripts/train.sh
```
The use of the source code in this repository is governed by the Apache 2.0 open-source license.
The use of the Baichuan-7B model weights, however, must follow the Baichuan-7B Model License Agreement (《Baichuan-7B 模型许可协议》).