This guide is compiled from reddit comments, but it will also eventually be outdated again.
```bash
# 1 install WSL2 on Windows 11, then:
sudo apt update
sudo apt-get install build-essential
sudo apt install git -y

# optional: install a better terminal experience, otherwise skip to step 4

# 2 install brew
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
(echo; echo 'eval "$(/home/linuxbrew/.linuxbrew/bin/brew shellenv)"') >> /home/$USER/.bashrc
eval "$(/home/linuxbrew/.linuxbrew/bin/brew shellenv)"
brew doctor

# 3 install oh-my-posh
brew install jandedobbeleer/oh-my-posh/oh-my-posh
echo $(brew --prefix oh-my-posh)/themes
# copy the printed path and add it below to the second eval line:
sudo nano ~/.bashrc
# add this to the end:
# eval "$(/home/linuxbrew/.linuxbrew/bin/brew shellenv)"
# eval "$(oh-my-posh init bash --config '/home/linuxbrew/.linuxbrew/opt/oh-my-posh/themes/atomic.omp.json')"
# plugins=(
#   git
#   # other plugins
# )
# CTRL+X to end editing
# Y to save changes
# ENTER to finally exit
source ~/.bashrc
exec bash

# 4 install mamba instead of conda, because it's faster
# https://mamba.readthedocs.io/en/latest/installation.html
mkdir github
mkdir downloads
cd downloads
wget https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_64.sh
bash Mambaforge-$(uname)-$(uname -m).sh

# 5 install the correct cuda toolkit 11.7, not 12.x
wget https://developer.download.nvidia.com/compute/cuda/11.7.0/local_installers/cuda_11.7.0_515.43.04_linux.run
sudo sh cuda_11.7.0_515.43.04_linux.run
nano ~/.bashrc
# add the following line after the plugins=() code block, above conda initialize,
# in order to add the cuda library to the environment variable:
# export LD_LIBRARY_PATH=/usr/lib/wsl/lib:$LD_LIBRARY_PATH
# CTRL+X to end editing
# Y to save changes
# ENTER to finally exit
source ~/.bashrc
cd ..

# 6 install ooba's textgen
mamba create --name textgen python=3.10.9
mamba activate textgen
pip install torch==2.0.1+cu117 torchvision==0.15.2+cu117 torchaudio -f https://download.pytorch.org/whl/cu117/torch_stable.html
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
pip install -r requirements.txt

# 7 install 4bit support through GPTQ-for-LLaMa
mkdir repositories
cd repositories
# choose ONE of the following:
# A) for fast triton
# https://www.reddit.com/r/LocalLLaMA/comments/13g8v5q/fastest_inference_branch_of_gptqforllama_and/
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa -b fastest-inference-4bit
# B) for triton
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa -b triton
# C) for newer cuda
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa -b cuda
# D) for widely compatible old cuda
git clone https://github.com/oobabooga/GPTQ-for-LLaMa.git -b cuda
#
# groupsize, act-order, true-sequential:
# --act-order: quantizing columns in order of decreasing activation size
# --true-sequential: performing sequential quantization even within a single Transformer block
# Those fix GPTQ's strangely bad performance on the 7B model (from 7.15 to 6.09 Wiki2 PPL)
# and lead to slight improvements on most models/settings in general.
# --groupsize: currently, groupsize and act-order do not work together and you must choose one of them.
# (Ooba: there is a pytorch branch from qwop that allows you to use groupsize and act-order together.)
# Models without group-size: better for the 7b model
# Models with group-size: better from 13b upwards
cd GPTQ-for-LLaMa
pip install -r requirements.txt
python setup_cuda.py install
cd ..
cd ..
```
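If you want to produce a 4-bit checkpoint yourself instead of downloading one, the flags above are passed to GPTQ-for-LLaMa's `llama.py`. A minimal sketch, assuming a HF-format LLaMA model at a placeholder path (the output filename is illustrative):

```bash
cd repositories/GPTQ-for-LLaMa
# quantize with act-order + true-sequential; swap --act-order for --groupsize 128
# if you want a grouped model instead, since the two flags don't combine
python llama.py /path/to/llama-7b-hf c4 --wbits 4 --true-sequential --act-order --save llama7b-4bit.pt
cd ../..
```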
```bash
# 8 test ooba with a 4bit GPTQ model
python download-model.py 4bit/WizardLM-13B-Uncensored-4bit-128g
python server.py --wbits 4 --model_type llama --groupsize 128 --chat

# 9 install llama.cpp
cd repositories
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
nano ~/.bashrc
# add the cuda bin folder to the path environment variable, after the
# export LD_LIBRARY_PATH line, in order for make to find nvcc:
# export PATH=/usr/local/cuda/bin:$PATH
# CTRL+X to end editing
# Y to save changes
# ENTER to finally exit
source ~/.bashrc
make LLAMA_CUBLAS=1
cd models
wget https://huggingface.co/TheBloke/WizardLM-13B-Uncensored-GGML/resolve/main/wizardLM-13B-Uncensored.ggmlv3.q4_0.bin
cd ..

# 10 test llama.cpp with GPU support
./main -t 8 -m models/wizardLM-13B-Uncensored.ggmlv3.q4_0.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### Instruction: write a story about llamas ### Response:" --n-gpu-layers 30
cd ..
cd ..

# 11 prepare ooba's textgen for llama.cpp support, by compiling llama-cpp-python with cuda GPU support
pip uninstall -y llama-cpp-python
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --no-cache-dir
```
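With llama-cpp-python rebuilt for CUDA, ooba can also load the GGML file directly. A sketch, assuming you first make the model visible in ooba's own models folder (run from text-generation-webui; `--n-gpu-layers` assumes a webui build recent enough to expose llama.cpp GPU offload):

```bash
# illustrative: copy the GGML model into ooba's models folder, then serve it with GPU offload
cp repositories/llama.cpp/models/wizardLM-13B-Uncensored.ggmlv3.q4_0.bin models/
python server.py --model wizardLM-13B-Uncensored.ggmlv3.q4_0.bin --n-gpu-layers 30 --chat
```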
Installation guide from 2023-03-01 (outdated)
```bash
wsl --install
# You may be prompted to restart your computer. If so, save your work and restart.
sudo apt update
sudo apt upgrade
sudo apt install git
sudo apt install wget
mkdir downloads
cd downloads/
wget https://repo.anaconda.com/archive/Anaconda3-2023.03-1-Linux-x86_64.sh
chmod +x ./Anaconda3-2023.03-1-Linux-x86_64.sh
./Anaconda3-2023.03-1-Linux-x86_64.sh
# follow the defaults
sudo apt install build-essential
cd ..
conda create -n textgen python=3.10.9
conda activate textgen
pip3 install torch torchvision torchaudio
mkdir github
cd github
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
pip install -r requirements.txt
pip install chardet cchardet
```

If you want to try the triton branch, skip to Newer GPTQ-Triton.
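Before building any GPTQ kernels, it's worth confirming that the freshly installed PyTorch can actually see your GPU:

```bash
# should print True; if it prints False, fix the CUDA/driver setup before continuing
python -c "import torch; print(torch.cuda.is_available())"
```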
For `--groupsize 128 --wbits 4` and no-act-order models:

```bash
mkdir repositories
cd repositories
git clone https://github.com/oobabooga/GPTQ-for-LLaMa.git -b cuda
# (or try the newer https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/cuda build)
cd GPTQ-for-LLaMa
python -m pip install -r requirements.txt
python setup_cuda.py install
# if this gives an error about g++, try installing the correct g++ version:
conda install -y -k gxx_linux-64=11.2.0
cd ../..
```

This triton branch or this one,
for `--groupsize 128 --wbits 4` flags and act-order models:

```bash
mkdir repositories
cd repositories
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa
# (or try https://github.com/fpgaminer/GPTQ-triton)
cd GPTQ-for-LLaMa
pip install -r requirements.txt
cd ../..
```
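The triton branches depend on the `triton` package, which only runs on Linux/WSL2; a quick import check tells you whether it is usable before you build anything on top of it:

```bash
# an ImportError here means the triton branches won't work in this environment
python -c "import triton; print(triton.__version__)"
```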
Alternatively you can try AutoGPTQ to install the cuda, older llama-cuda, or triton variants:

```bash
pip install auto-gptq            # cuda branch, for newer models
pip install auto-gptq[llama]     # if your transformers is outdated or you are using older models that don't support it
pip install auto-gptq[triton]    # triton branch, for triton compatible models
cd ../..
```

If you want to open the webui from within your home network, enable port forwarding on your Windows machine, with this command in an administrator terminal:
```
netsh interface portproxy add v4tov4 listenaddress=0.0.0.0 listenport=7860 connectaddress=localhost connectport=7860
```
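Windows Defender Firewall may still block connections from other devices; if so, you can additionally allow inbound TCP 7860 (the rule name below is just an example), and delete the proxy again once you no longer need it:

```
:: allow inbound connections on port 7860 (rule name "textgen" is arbitrary)
netsh advfirewall firewall add rule name="textgen" dir=in action=allow protocol=TCP localport=7860
:: remove the port proxy when you no longer need it
netsh interface portproxy delete v4tov4 listenaddress=0.0.0.0 listenport=7860
```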
```bash
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/lib/wsl/lib   # before running the server.py below
pip install -i https://test.pypi.org/simple/ bitsandbytes-cuda113
# optional: allows for faster, but non-deterministic inference
# (add the --xformers flag later, when running the server.py below)
pip install xformers
```

You're done with the Ubuntu / WSL2 installation; you can skip to the Download models section.
Open the Anaconda Prompt (Miniconda 3) from the Start Menu, then from `C:\Users\yourusername>`:

```
mkdir github
cd github
conda create --name textgen python=3.10
conda activate textgen
conda install pip
conda install -y -k pytorch[version=2,build=py3.10_cuda11.7*] torchvision torchaudio pytorch-cuda=11.7 cuda-toolkit ninja git -c pytorch -c nvidia/label/cuda-11.7.0 -c nvidia
git clone https://github.com/oobabooga/text-generation-webui.git
python -m pip install https://github.com/jllllll/bitsandbytes-windows-webui/raw/main/bitsandbytes-0.38.1-py3-none-any.whl
cd text-generation-webui
pip install -r requirements.txt --upgrade
mkdir repositories
cd repositories
git clone https://github.com/oobabooga/GPTQ-for-LLaMa.git -b cuda
cd GPTQ-for-LLaMa
python -m pip install -r requirements.txt
python setup_cuda.py install
:: might fail; continue with the next command if so
pip install https://github.com/jllllll/GPTQ-for-LLaMa-Wheels/raw/main/quant_cuda-0.0.0-cp310-cp310-win_amd64.whl
:: skip this command, if the previous one didn't fail
cd ..\..
:: (go back to text-generation-webui)
pip install faust-cchardet
pip install chardet
python download-model.py TheBloke/wizardLM-7B-GGML
:: and let it download the model files
```
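`quant_cuda` is the extension module that setup_cuda.py builds (and that the prebuilt wheel ships), so an import check tells you whether the 4-bit kernel actually installed:

```
python -c "import quant_cuda; print('quant_cuda OK')"
```

An ImportError here means neither setup_cuda.py nor the wheel installed correctly.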
The base command to run. You have to add further flags, depending on the model and environment you want to run in:

```bash
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/lib/wsl/lib   # always, before running the server.py
python server.py --model-menu --chat
```

- `--model-menu` allows changing models in the UI
- `--chat` loads the chat instead of the text completion UI
- `--wbits 4` loads a 4-bit quantized model
- `--groupsize 128` add this parameter if the model specifies a groupsize
- `--model_type llama` if the model name is unknown, specify its base model. If you run llama-derived models like vicuna, alpaca, gpt4-x, codecapybara or wizardLM, you have to define it as llama. If you load OPT or GPT-J models, set the flag accordingly.
- `--xformers` if you have properly installed xformers and want faster but nondeterministic answer generation

If you get a `cuda lib not found` error, especially on Windows WSL2 Ubuntu, try executing `export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/lib/wsl/lib` before running the server.py above.
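Rather than typing that export in every session, you can append it to `~/.bashrc` once, so every new shell sets it automatically:

```bash
# one-time: persist the library path for all future shells
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/lib/wsl/lib' >> ~/.bashrc
source ~/.bashrc
```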
If the server complains about chardet or cchardet, try:

```bash
pip install faust-cchardet
pip install chardet
```

or the other way around. Then try to start the server again.
On Windows Native, try:
```
pip uninstall bitsandbytes
pip install git+https://github.com/Keith-Hon/bitsandbytes-windows.git
```

Or try these prebuilt wheels on Windows:
Still having problems? Try to manually copy the libraries.
On Linux or Windows WSL2 Ubuntu, try:
```bash
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/lib/wsl/lib   # before running the server.py, every time!
pip install -i https://test.pypi.org/simple/ bitsandbytes-cuda113
# and see if it works without the above command
pip install xformers==0.0.16rc425
```

If none of that helps, use llama.cpp instead (HN discussion).
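A quick way to see whether bitsandbytes loads at all after any of the fixes above (older versions print a "CUDA SETUP" banner on import, which shows what was found):

```bash
# a clean import with no CUDA SETUP errors means the library found your CUDA runtime
python -c "import bitsandbytes; print('bitsandbytes OK')"
```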
See an up-to-date list of most models you can run locally: awesome-ai open-models.
See the awesome-ai LLM section for more tools, GUIs, etc.