<h1 align="center">LLaMA Factory</h1>

👋 Join our WeChat, NPU, Lab4AI, or LLaMA Factory Online user groups.
[ English | 中文 ]
Fine-tuning a large language model can be as easy as...
https://github.com/user-attachments/assets/3991a3a8-4276-4d30-9cab-4cb0c4b9b99e
Start local training:
Start cloud training:
Read technical notes:
[!NOTE] Apart from the links above, all other websites are unauthorized third-party websites. Please use them with caution.
| Support Date | Model Name |
|---|---|
| Day 0 | Qwen3 / Qwen2.5-VL / Gemma 3 / GLM-4.1V / InternLM 3 / MiniCPM-o-2.6 |
| Day 1 | Llama 3 / GLM-4 / Mistral Small / PaliGemma2 / Llama 4 |
[!TIP] Now we have a dedicated blog for LLaMA Factory!
Website: https://blog.llamafactory.net/en/
[25/10/26] We supported the Megatron-core training backend via mcore_adapter. See PR #9237 to get started.
[25/08/22] We supported OFT and OFTv2. See examples for usage.
[25/08/20] We supported fine-tuning the Intern-S1-mini models. See PR #8976 to get started.
[25/08/06] We supported fine-tuning the GPT-OSS models. See PR #8826 to get started.
[25/07/02] We supported fine-tuning the GLM-4.1V-9B-Thinking model.
[25/04/28] We supported fine-tuning the Qwen3 model family.
[25/04/21] We supported the Muon optimizer. See examples for usage. Thanks to @tianshijing's PR.
[25/04/16] We supported fine-tuning the InternVL3 model. See PR #7258 to get started.
[25/04/14] We supported fine-tuning the GLM-Z1 and Kimi-VL models.
[25/04/06] We supported fine-tuning the Llama 4 model. See PR #7611 to get started.
[25/03/31] We supported fine-tuning the Qwen2.5 Omni model. See PR #7537 to get started.
[25/03/15] We supported SGLang as an inference backend. Try `infer_backend: sglang` to accelerate inference.
[25/03/12] We supported fine-tuning the Gemma 3 model.
[25/02/24] Announcing EasyR1, an efficient, scalable, multi-modality RL training framework for GRPO training.
[25/02/11] We supported saving the Ollama modelfile when exporting the model checkpoints. See examples for usage.
[25/02/05] We supported fine-tuning the Qwen2-Audio and MiniCPM-o-2.6 models on audio understanding tasks.
[25/01/31] We supported fine-tuning the DeepSeek-R1 and Qwen2.5-VL models.
[25/01/15] We supported the APOLLO optimizer. See examples for usage.
[25/01/14] We supported fine-tuning the MiniCPM-o-2.6 and MiniCPM-V-2.6 models. Thanks to @BUAADreamer's PR.
[25/01/14] We supported fine-tuning the InternLM 3 models. Thanks to @hhaAndroid's PR.
[25/01/10] We supported fine-tuning the Phi-4 model.
[24/12/21] We supported using SwanLab for experiment tracking and visualization. See this section for details.
[24/11/27] We supported fine-tuning the Skywork-o1 model and training on the OpenO1 dataset.
[24/10/09] We supported downloading pre-trained models and datasets from the Modelers Hub. See this tutorial for usage.
[24/09/19] We supported fine-tuning the Qwen2.5 models.
[24/08/30] We supported fine-tuning the Qwen2-VL models. Thanks to @simonJJJ's PR.
[24/08/27] We supported Liger Kernel. Try `enable_liger_kernel: true` for efficient training.
[24/08/09] We supported the Adam-mini optimizer. See examples for usage. Thanks to @relic-yuexi's PR.
[24/07/04] We supported contamination-free packed training. Use `neat_packing: true` to activate it. Thanks to @chuan298's PR.
[24/06/16] We supported the PiSSA algorithm. See examples for usage.
[24/06/07] We supported fine-tuning the Qwen2 and GLM-4 models.
[24/05/26] We supported the SimPO algorithm for preference learning. See examples for usage.
[24/05/20] We supported fine-tuning the PaliGemma series models. Note that the PaliGemma models are pre-trained models; you need to fine-tune them with the `paligemma` template for chat completion.
[24/05/18] We supported the KTO algorithm for preference learning. See examples for usage.
[24/05/14] We supported training and inference on Ascend NPU devices. Check the installation section for details.
[24/04/26] We supported fine-tuning the LLaVA-1.5 multimodal LLMs. See examples for usage.
[24/04/22] We provided a Colab notebook for fine-tuning the Llama-3 model on a free T4 GPU. Two Llama-3-derived models fine-tuned using LLaMA Factory are available at Hugging Face, check Llama3-8B-Chinese-Chat and Llama3-Chinese for details.
[24/04/21] We supported Mixture-of-Depths according to AstraMindAI's implementation. See examples for usage.
[24/04/16] We supported BAdam optimizer. See examples for usage.
[24/04/16] We supported unsloth's long-sequence training (Llama-2-7B-56k within 24GB). It achieves 117% speed and 50% memory compared with FlashAttention-2; more benchmarks can be found on this page.
[24/03/31] We supported ORPO. See examples for usage.
[24/03/21] Our paper "LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models" is available at arXiv!
[24/03/20] We supported FSDP+QLoRA that fine-tunes a 70B model on 2x24GB GPUs. See examples for usage.
[24/03/13] We supported LoRA+. See examples for usage.
[24/03/07] We supported GaLore optimizer. See examples for usage.
[24/03/07] We integrated vLLM for faster and concurrent inference. Try `infer_backend: vllm` to enjoy 270% inference speed.
[24/02/28] We supported weight-decomposed LoRA (DoRA). Try `use_dora: true` to activate DoRA training.
[24/02/15] We supported block expansion proposed by LLaMA Pro. See examples for usage.
[24/02/05] Qwen1.5 (Qwen2 beta version) series models are supported in LLaMA-Factory. Check this blog post for details.
[24/01/18] We supported agent tuning for most models, equipping the model with tool-using abilities by fine-tuning with `dataset: glaive_toolcall_en`.
[23/12/23] We supported unsloth's implementation to boost LoRA tuning for the LLaMA, Mistral and Yi models. Try the `use_unsloth: true` argument to activate the unsloth patch. It achieves 170% speed in our benchmark; check this page for details.
[23/12/12] We supported fine-tuning the latest MoE model Mixtral 8x7B in our framework. See hardware requirement here.
[23/12/01] We supported downloading pre-trained models and datasets from the ModelScope Hub. See this tutorial for usage.
[23/10/21] We supported the NEFTune trick for fine-tuning. Try the `neftune_noise_alpha: 5` argument to activate NEFTune.
[23/09/27] We supported $S^2$-Attn proposed by LongLoRA for the LLaMA models. Try the `shift_attn: true` argument to enable shift short attention.
[23/09/23] We integrated MMLU, C-Eval and CMMLU benchmarks in this repo. See examples for usage.
[23/09/10] We supported FlashAttention-2. Try the `flash_attn: fa2` argument to enable FlashAttention-2 if you are using RTX 4090, A100 or H100 GPUs.
[23/08/12] We supported RoPE scaling to extend the context length of the LLaMA models. Try the `rope_scaling: linear` argument in training and the `rope_scaling: dynamic` argument at inference to extrapolate the position embeddings.
[23/08/11] We supported DPO training for instruction-tuned models. See examples for usage.
[23/07/31] We supported dataset streaming. Try the `streaming: true` and `max_steps: 10000` arguments to load your dataset in streaming mode.
[23/07/29] We released two instruction-tuned 13B models at Hugging Face. See these Hugging Face Repos (LLaMA-2 / Baichuan) for details.
[23/07/18] We developed an all-in-one Web UI for training, evaluation and inference. Try `train_web.py` to fine-tune models in your Web browser. Thanks to @KanadeSiina and @codemayq for their efforts in the development.
[23/07/09] We released FastEdit ⚡🩹, an easy-to-use package for editing the factual knowledge of large language models efficiently. Please follow FastEdit if you are interested.
[23/06/29] We provided a reproducible example of training a chat model using instruction-following datasets, see Baichuan-7B-sft for details.
[23/06/22] We aligned the demo API with OpenAI's format, so you can plug the fine-tuned model into arbitrary ChatGPT-based applications.
[23/06/03] We supported quantized training and inference (aka QLoRA). See examples for usage.
[!TIP] If you cannot use the latest feature, please pull the latest code and install LLaMA-Factory again.
| Model | Model size | Template |
|---|---|---|
| BLOOM/BLOOMZ | 560M/1.1B/1.7B/3B/7.1B/176B | - |
| DeepSeek (LLM/Code/MoE) | 7B/16B/67B/236B | deepseek |
| DeepSeek 3-3.2 | 236B/671B | deepseek3 |
| DeepSeek R1 (Distill) | 1.5B/7B/8B/14B/32B/70B/671B | deepseekr1 |
| ERNIE-4.5 | 0.3B/21B/300B | ernie_nothink |
| Falcon/Falcon H1 | 0.5B/1.5B/3B/7B/11B/34B/40B/180B | falcon/falcon_h1 |
| Gemma/Gemma 2/CodeGemma | 2B/7B/9B/27B | gemma/gemma2 |
| Gemma 3/Gemma 3n | 270M/1B/4B/6B/8B/12B/27B | gemma3/gemma3n |
| GLM-4/GLM-4-0414/GLM-Z1 | 9B/32B | glm4/glmz1 |
| GLM-4.5/GLM-4.5(6)V | 9B/106B/355B | glm4_moe/glm4_5v |
| GPT-2 | 0.1B/0.4B/0.8B/1.5B | - |
| GPT-OSS | 20B/120B | gpt_oss |
| Granite 3-4 | 1B/2B/3B/7B/8B | granite3/granite4 |
| Hunyuan/Hunyuan1.5 (MT) | 0.5B/1.8B/4B/7B/13B | hunyuan/hunyuan_small |
| InternLM 2-3 | 7B/8B/20B | intern2 |
| InternVL 2.5-3.5 | 1B/2B/4B/8B/14B/30B/38B/78B/241B | intern_vl |
| Intern-S1-mini | 8B | intern_s1 |
| Kimi-VL | 16B | kimi_vl |
| Ling 2.0 (mini/flash) | 16B/100B | bailing_v2 |
| LFM 2.5 (VL) | 1.2B/1.6B | lfm2/lfm2_vl |
| Llama | 7B/13B/33B/65B | - |
| Llama 2 | 7B/13B/70B | llama2 |
| Llama 3-3.3 | 1B/3B/8B/70B | llama3 |
| Llama 4 | 109B/402B | llama4 |
| Llama 3.2 Vision | 11B/90B | mllama |
| LLaVA-1.5 | 7B/13B | llava |
| LLaVA-NeXT | 7B/8B/13B/34B/72B/110B | llava_next |
| LLaVA-NeXT-Video | 7B/34B | llava_next_video |
| MiMo | 7B/309B | mimo/mimo_v2 |
| MiniCPM 4 | 0.5B/8B | cpm4 |
| MiniCPM-o/MiniCPM-V 4.5 | 8B/9B | minicpm_o/minicpm_v |
| MiniMax-M1/MiniMax-M2 | 229B/456B | minimax1/minimax2 |
| Ministral 3 | 3B/8B/14B | ministral3 |
| Mistral/Mixtral | 7B/8x7B/8x22B | mistral |
| PaliGemma/PaliGemma2 | 3B/10B/28B | paligemma |
| Phi-3/Phi-3.5 | 4B/14B | phi |
| Phi-3-small | 7B | phi_small |
| Phi-4-mini/Phi-4 | 3.8B/14B | phi4_mini/phi4 |
| Pixtral | 12B | pixtral |
| Qwen2 (Code/Math/MoE/QwQ) | 0.5B/1.5B/3B/7B/14B/32B/72B/110B | qwen |
| Qwen3 (MoE/Instruct/Thinking/Next) | 0.6B/1.7B/4B/8B/14B/32B/80B/235B | qwen3/qwen3_nothink |
| Qwen2-Audio | 7B | qwen2_audio |
| Qwen2.5-Omni | 3B/7B | qwen2_omni |
| Qwen3-Omni | 30B | qwen3_omni |
| Qwen2-VL/Qwen2.5-VL/QVQ | 2B/3B/7B/32B/72B | qwen2_vl |
| Qwen3-VL | 2B/4B/8B/30B/32B/235B | qwen3_vl |
| Seed (OSS/Coder) | 8B/36B | seed_oss/seed_coder |
| StarCoder 2 | 3B/7B/15B | - |
| TeleChat 2-2.5 | 3B/7B/35B/115B | telechat2 |
| Yuan 2 | 2B/51B/102B | yuan |
[!NOTE] For the "base" models, the `template` argument can be chosen from `default`, `alpaca`, `vicuna`, etc. But make sure to use the corresponding template for the "instruct/chat" models.
If the model has both reasoning and non-reasoning versions, please use the `_nothink` suffix to distinguish between them, e.g., `qwen3` and `qwen3_nothink`.
Remember to use the SAME template in training and inference.
*: You should install `transformers` from the main branch and use `DISABLE_VERSION_CHECK=1` to skip the version check.
**: You need to install a specific version of `transformers` to use the corresponding model.
Please refer to constants.py for a full list of models we supported.
You can also add a custom chat template to template.py.
| Approach | Full-tuning | Freeze-tuning | LoRA | QLoRA | OFT | QOFT |
|---|---|---|---|---|---|---|
| Pre-Training | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Supervised Fine-Tuning | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Reward Modeling | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| PPO Training | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| DPO Training | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| KTO Training | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| ORPO Training | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| SimPO Training | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
[!TIP] The implementation details of PPO can be found in this blog.
Some datasets require confirmation before using them, so we recommend logging in with your Hugging Face account using these commands.
pip install "huggingface_hub<1.0.0" huggingface-cli login
| Mandatory | Minimum | Recommended |
|---|---|---|
| python | 3.9 | 3.10 |
| torch | 2.0.0 | 2.6.0 |
| torchvision | 0.15.0 | 0.21.0 |
| transformers | 4.49.0 | 4.50.0 |
| datasets | 2.16.0 | 3.2.0 |
| accelerate | 0.34.0 | 1.2.1 |
| peft | 0.14.0 | 0.15.1 |
| trl | 0.8.6 | 0.9.6 |
| Optional | Minimum | Recommended |
|---|---|---|
| CUDA | 11.6 | 12.2 |
| deepspeed | 0.10.0 | 0.16.4 |
| bitsandbytes | 0.39.0 | 0.43.1 |
| vllm | 0.4.3 | 0.8.2 |
| flash-attn | 2.5.6 | 2.7.2 |
* estimated
| Method | Bits | 7B | 14B | 30B | 70B | `x`B |
|---|---|---|---|---|---|---|
| Full (`bf16` or `fp16`) | 32 | 120GB | 240GB | 600GB | 1200GB | `18x`GB |
| Full (`pure_bf16`) | 16 | 60GB | 120GB | 300GB | 600GB | `8x`GB |
| Freeze/LoRA/GaLore/APOLLO/BAdam/OFT | 16 | 16GB | 32GB | 64GB | 160GB | `2x`GB |
| QLoRA / QOFT | 8 | 10GB | 20GB | 40GB | 80GB | `x`GB |
| QLoRA / QOFT | 4 | 6GB | 12GB | 24GB | 48GB | `x/2`GB |
| QLoRA / QOFT | 2 | 4GB | 8GB | 16GB | 24GB | `x/4`GB |
[!IMPORTANT] Installation is mandatory.
```bash
git clone --depth 1 https://github.com/hiyouga/LlamaFactory.git
cd LlamaFactory
pip install -e .
pip install -r requirements/metrics.txt
```
Optional dependencies available: `metrics`, `deepspeed`. Install with:

```bash
pip install -e . && pip install -r requirements/metrics.txt -r requirements/deepspeed.txt
```
Additional dependencies for specific features are available in `examples/requirements/`.
```bash
docker run -it --rm --gpus=all --ipc=host hiyouga/llamafactory:latest
```
This image is built on Ubuntu 22.04 (x86_64), CUDA 12.4, Python 3.11, PyTorch 2.6.0, and Flash-attn 2.7.4.
Find the pre-built images: https://hub.docker.com/r/hiyouga/llamafactory/tags
Please refer to build docker to build the image yourself.
Create an isolated Python environment with uv:
```bash
uv run llamafactory-cli webui
```
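If the project has not been installed into the environment yet, one possible setup is sketched below (assuming `uv` is installed and the commands run from the repository root):

```bash
# Sketch: provision an isolated environment with uv, then launch the Web UI.
uv venv --python 3.10        # create .venv with a pinned Python version
uv pip install -e .          # install LLaMA Factory into the environment
uv run llamafactory-cli webui
```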
You need to manually install the GPU version of PyTorch on the Windows platform. Please refer to the official website and the following commands to install PyTorch with CUDA support:
```bash
pip uninstall torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
python -c "import torch; print(torch.cuda.is_available())"
```
If you see `True`, then you have successfully installed PyTorch with CUDA support.
Try `dataloader_num_workers: 0` if you encounter a `Can't pickle local object` error.
If you want to enable quantized LoRA (QLoRA) on the Windows platform, you need to install a pre-built version of the `bitsandbytes` library, which supports CUDA 11.1 to 12.2. Please select the appropriate release version based on your CUDA version.
```bash
pip install https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.41.2.post2-py3-none-win_amd64.whl
```
To enable FlashAttention-2 on the Windows platform, please use the script from flash-attention-windows-wheel to compile and install it by yourself.
To install LLaMA Factory on Ascend NPU devices, please upgrade Python to version 3.10 or higher and run `pip install -r requirements/npu.txt`. Additionally, you need to install the Ascend CANN Toolkit and Kernels. Please follow the installation tutorial.
You can also download the pre-built Docker images:
```bash
# Docker Hub
docker pull hiyouga/llamafactory:latest-npu-a2
docker pull hiyouga/llamafactory:latest-npu-a3

# quay.io
docker pull quay.io/ascend/llamafactory:latest-npu-a2
docker pull quay.io/ascend/llamafactory:latest-npu-a3
```
To use QLoRA based on bitsandbytes on Ascend NPU, please follow these 3 steps:
```bash
# Install bitsandbytes from source
# Clone the bitsandbytes repo; the Ascend NPU backend is currently enabled on the multi-backend-refactor branch
git clone -b multi-backend-refactor https://github.com/bitsandbytes-foundation/bitsandbytes.git
cd bitsandbytes/

# Install dependencies
pip install -r requirements-dev.txt

# Install the dependencies for the compilation tools. Note that the commands for this step
# may vary depending on the operating system. The following are provided for reference.
apt-get install -y build-essential cmake

# Compile & install
cmake -DCOMPUTE_BACKEND=npu -S .
make
pip install .
```
```bash
# Install transformers from the main branch
git clone -b main https://github.com/huggingface/transformers.git
cd transformers
pip install .
```
Set `double_quantization: false` in the configuration. You can refer to the example.

Please refer to data/README.md for details about the format of dataset files. You can use datasets on the HuggingFace / ModelScope / Modelers hubs, load a dataset from local disk, or specify a path to s3/gcs cloud storage.
[!NOTE] Please update `data/dataset_info.json` to use your custom dataset.
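For instance, a minimal sketch of a `data/dataset_info.json` entry for a local alpaca-format file (`my_dataset` and `my_dataset.json` are placeholder names; see data/README.md for the full set of supported fields):

```json
{
  "my_dataset": {
    "file_name": "my_dataset.json",
    "columns": {
      "prompt": "instruction",
      "query": "input",
      "response": "output"
    }
  }
}
```

You can then select it in a training config with `dataset: my_dataset`.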
You can also use Easy Dataset, DataFlow and GraphGen to create synthetic data for fine-tuning.
Use the following 3 commands to run LoRA fine-tuning, inference and merging of the Qwen3-4B-Instruct model, respectively.
```bash
llamafactory-cli train examples/train_lora/qwen3_lora_sft.yaml
llamafactory-cli chat examples/inference/qwen3_lora_sft.yaml
llamafactory-cli export examples/merge_lora/qwen3_lora_sft.yaml
```
See examples/README.md for advanced usage (including distributed training).
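For orientation, here is an illustrative excerpt of what such a LoRA SFT config typically contains; the fields and values below are placeholders, and the shipped `examples/train_lora/qwen3_lora_sft.yaml` is authoritative:

```yaml
model_name_or_path: Qwen/Qwen3-4B-Instruct-2507  # placeholder model ID
stage: sft                   # supervised fine-tuning
do_train: true
finetuning_type: lora
lora_target: all             # attach LoRA adapters to all linear layers
dataset: identity,alpaca_en_demo
template: qwen3              # use the template matching your model (see the note above)
output_dir: saves/qwen3-4b/lora/sft
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
```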
[!TIP] Use `llamafactory-cli help` to show help information.
Read FAQs first if you encounter any problems.
```bash
llamafactory-cli webui
```
Read our documentation.
For CUDA users:
```bash
cd docker/docker-cuda/
docker compose up -d
docker compose exec llamafactory bash
```
For Ascend NPU users:
```bash
cd docker/docker-npu/
docker compose up -d
docker compose exec llamafactory bash
```
For AMD ROCm users:
```bash
cd docker/docker-rocm/
docker compose up -d
docker compose exec llamafactory bash
```
For CUDA users:
```bash
docker build -f ./docker/docker-cuda/Dockerfile \
    --build-arg PIP_INDEX=https://pypi.org/simple \
    -t llamafactory:latest .

docker run -dit --ipc=host --gpus=all \
    -p 7860:7860 \
    -p 8000:8000 \
    --name llamafactory \
    llamafactory:latest

docker exec -it llamafactory bash
```
For Ascend NPU users:
```bash
docker build -f ./docker/docker-npu/Dockerfile \
    --build-arg PIP_INDEX=https://pypi.org/simple \
    -t llamafactory:latest .

docker run -dit --ipc=host \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -p 7860:7860 \
    -p 8000:8000 \
    --device /dev/davinci0 \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    --name llamafactory \
    llamafactory:latest

docker exec -it llamafactory bash
```
For AMD ROCm users:
```bash
docker build -f ./docker/docker-rocm/Dockerfile \
    --build-arg PIP_INDEX=https://pypi.org/simple \
    -t llamafactory:latest .

docker run -dit --ipc=host \
    -p 7860:7860 \
    -p 8000:8000 \
    --device /dev/kfd \
    --device /dev/dri \
    --name llamafactory \
    llamafactory:latest

docker exec -it llamafactory bash
```
You can uncomment `VOLUME [ "/root/.cache/huggingface", "/app/shared_data", "/app/output" ]` in the Dockerfile to use data volumes.
When running the Docker container, use the `-v ./hf_cache:/root/.cache/huggingface` argument to mount a local directory into the container. The following data volumes are available:
- `hf_cache`: Utilize the Hugging Face cache on the host machine.
- `shared_data`: The directory to store datasets on the host machine.
- `output`: Set the export dir to this location so that the merged result can be accessed directly on the host machine.

```bash
API_PORT=8000 llamafactory-cli api examples/inference/qwen3.yaml infer_backend=vllm vllm_enforce_eager=true
```
[!TIP] Visit this page for the API documentation.
Examples: Image understanding | Function calling
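Because the endpoint follows OpenAI's API format, any OpenAI-compatible client can call it. A minimal sketch against the server started above (the `model` value is a placeholder):

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3", "messages": [{"role": "user", "content": "Hello!"}]}'
```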
If you have trouble with downloading models and datasets from Hugging Face, you can use ModelScope.
```bash
export USE_MODELSCOPE_HUB=1 # `set USE_MODELSCOPE_HUB=1` for Windows
```
Train the model by specifying a model ID of the ModelScope Hub as the `model_name_or_path`. You can find a full list of model IDs at ModelScope Hub, e.g., `LLM-Research/Meta-Llama-3-8B-Instruct`.
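For example, in your training yaml:

```yaml
model_name_or_path: LLM-Research/Meta-Llama-3-8B-Instruct  # ModelScope model ID
```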
You can also use Modelers Hub to download models and datasets.
```bash
export USE_OPENMIND_HUB=1 # `set USE_OPENMIND_HUB=1` for Windows
```
Train the model by specifying a model ID of the Modelers Hub as the `model_name_or_path`. You can find a full list of model IDs at Modelers Hub, e.g., `TeleAI/TeleChat-7B-pt`.
To use Weights & Biases for logging experimental results, you need to add the following arguments to yaml files.
```yaml
report_to: wandb
run_name: test_run # optional
```
Set `WANDB_API_KEY` to your key when launching training tasks to log in with your W&B account.
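For example (the key value is a placeholder; the config path is the quickstart one):

```bash
WANDB_API_KEY=xxxxxxxxxxxxxxxx llamafactory-cli train examples/train_lora/qwen3_lora_sft.yaml
```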
To use SwanLab for logging experimental results, you need to add the following arguments to yaml files.
```yaml
use_swanlab: true
swanlab_run_name: test_run # optional
```
When launching training tasks, you can log in to SwanLab in three ways:
- Add `swanlab_api_key=<your_api_key>` to the yaml file and set it to your API key.
- Set the environment variable `SWANLAB_API_KEY` to your API key.
- Use the `swanlab login` command to complete the login.

If you have a project that should be incorporated, please contact via email or create a pull request.
This repository is licensed under the Apache-2.0 License.
Please follow the model licenses to use the corresponding model weights: BLOOM / DeepSeek / Falcon / Gemma / GLM-4 / GPT-2 / Granite / InternLM / Llama / Llama 2 / Llama 3 / Llama 4 / MiniCPM / Mistral/Mixtral/Pixtral / Phi-3/Phi-4 / Qwen / StarCoder 2 / TeleChat2 / Yuan 2
If this work is helpful, please kindly cite as:
```bibtex
@inproceedings{zheng2024llamafactory,
  title={LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models},
  author={Yaowei Zheng and Richong Zhang and Junhao Zhang and Yanhan Ye and Zheyan Luo and Zhangchi Feng and Yongqiang Ma},
  booktitle={Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)},
  address={Bangkok, Thailand},
  publisher={Association for Computational Linguistics},
  year={2024},
  url={http://arxiv.org/abs/2403.13372}
}
```
This repo benefits from PEFT, TRL, QLoRA and FastChat. Thanks for their wonderful work.