Expected layout when populating this repo:
```
llama3_inference/
├─ README.md                  # Profiling plan and examples
├─ bench_vllm_llama3.py       # Offline grid runner (CSV output)
├─ client_stream_profile.py   # Serve client (TTFT/e2e)
├─ scripts/                   # Helper shell scripts
│  ├─ run_prefill.sh
│  ├─ run_decode.sh
│  ├─ run_mixed.sh
│  └─ dmon_wrap.sh
├─ Makefile                   # Common targets (prefill/decode/etc.)
├─ results/                   # Generated CSV/metrics (gitignored)
└─ plots/                     # Generated figures (gitignored)
```
## Setup

```
uv venv
uv pip install "vllm>=0.5.0" "torch>=2.3" transformers pynvml psutil pandas
```

Sanity check: run `nvidia-smi`, then `python -c 'import torch;print(torch.cuda.is_available())'`.

## Example runs

Offline prefill grid:

```
python bench_vllm_llama3.py --model /path/to/Meta-Llama-3-8B-Instruct --tp 1 --devices 7 \
  --B 1,2,4 --Lp 512,2048 --Lo 16 --repeats 1 --out results/prefill_tp1.csv
```

Serving:

```
vllm serve /path/to/model --tensor-parallel-size 4 --port 2234 --metrics-port 2235
```

## Code style

- Python names in `lower_snake_case`; constants in `UPPER_SNAKE_CASE`.
- CLIs use `argparse` with explicit defaults; use deterministic sampling (`temperature=0.0`, `top_p=1.0`) unless testing variability.
- Shell scripts start with `#!/usr/bin/env bash` and `set -euo pipefail`; scripts are named `run_*.sh`.
- Write CSVs to `results/` and figures to `plots/`. Do not hardcode device IDs; use `CUDA_VISIBLE_DEVICES`.

## Quick check

Run the smallest grid, `B=1 Lp=128 Lo=16 repeats=1`, producing `results/quickcheck.csv`. When testing changes, prefer `--dry-run` or the smallest grid.

## Commits and docs

- Commit message prefixes: `feat:`, `fix:`, `docs:`, `refactor:`, `perf:`, `chore:`.
- Update `README.md` when flags, directory layout, or Makefile targets change. Exclude large artifacts; commit small samples only.

## Configuration

- Pass model paths via `--model` and environment variables.
- Set `TOKENIZERS_PARALLELISM=false`; consider `VLLM_NVTX_PROFILE=1` for profiling. Avoid unstable Nsight flags not supported in your environment.
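The code-style rules above (argparse with explicit defaults, deterministic sampling) can be sketched as a minimal parser. The flag names mirror the example command, but the defaults and structure here are illustrative assumptions, not the actual `bench_vllm_llama3.py`:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Illustrative parser: flag names follow the example command above,
    # but the defaults are assumptions, not the repo's actual values.
    p = argparse.ArgumentParser(description="vLLM Llama-3 offline grid runner (sketch)")
    p.add_argument("--model", required=True, help="Path to model checkpoint")
    p.add_argument("--tp", type=int, default=1, help="Tensor-parallel degree")
    p.add_argument("--devices", default="0", help="Comma-separated GPU ids")
    p.add_argument("--B", default="1", help="Comma-separated batch sizes")
    p.add_argument("--Lp", default="128", help="Comma-separated prompt lengths")
    p.add_argument("--Lo", default="16", help="Comma-separated output lengths")
    p.add_argument("--repeats", type=int, default=1)
    p.add_argument("--out", default="results/quickcheck.csv")
    # Deterministic sampling by default, per the style guide.
    p.add_argument("--temperature", type=float, default=0.0)
    p.add_argument("--top-p", type=float, default=1.0, dest="top_p")
    return p

args = build_parser().parse_args(
    ["--model", "/path/to/Meta-Llama-3-8B-Instruct", "--B", "1,2,4"]
)
print(args.B, args.temperature)  # → 1,2,4 0.0
```

Note that every flag carries an explicit default, so omitting it in a quick check still yields the smallest grid.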
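The shell-script conventions above (shebang, `set -euo pipefail`, device selection via `CUDA_VISIBLE_DEVICES`) might look like this skeleton for a `run_*.sh` helper; the quick-check grid values come from the guide, the rest is an illustrative sketch with the benchmark invocation left commented out:

```shell
#!/usr/bin/env bash
# Sketch of a run_*.sh helper; the benchmark call itself is commented out
# so this template can run without a GPU or model checkout.
set -euo pipefail

# Select GPUs via the environment, never hardcoded device ids.
export CUDA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES:-0}"
export TOKENIZERS_PARALLELISM=false

OUT_DIR="results"
mkdir -p "${OUT_DIR}"

echo "devices=${CUDA_VISIBLE_DEVICES} out=${OUT_DIR}/quickcheck.csv"
# python bench_vllm_llama3.py --model "$1" --B 1 --Lp 128 --Lo 16 \
#   --repeats 1 --out "${OUT_DIR}/quickcheck.csv"
```

`set -euo pipefail` makes the script abort on the first failed command, unset variable, or broken pipe, which keeps long grid runs from silently producing partial CSVs.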
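`client_stream_profile.py` reports TTFT and end-to-end latency; both reduce to differences between the request start time and per-token arrival timestamps. A minimal sketch of that arithmetic, decoupled from any HTTP client (the function and dataclass names are assumptions, not the script's actual API):

```python
from dataclasses import dataclass

@dataclass
class StreamTiming:
    """Latency summary for one streamed request (times in seconds)."""
    ttft: float          # time to first token: first arrival - request start
    e2e: float           # end-to-end: last arrival - request start
    tokens_per_s: float  # decode throughput after the first token

def summarize_stream(start: float, token_arrivals: list[float]) -> StreamTiming:
    # token_arrivals are wall-clock timestamps, one per streamed token.
    ttft = token_arrivals[0] - start
    e2e = token_arrivals[-1] - start
    decode_time = e2e - ttft
    n_decode = len(token_arrivals) - 1
    tps = n_decode / decode_time if decode_time > 0 else 0.0
    return StreamTiming(ttft=ttft, e2e=e2e, tokens_per_s=tps)

# Example: request at t=0.0, first token at 0.5 s, then one token every 0.1 s.
t = summarize_stream(0.0, [0.5, 0.6, 0.7, 0.8])
print(t.ttft, t.e2e)  # → 0.5 0.8
```

In the real client, `start` would be captured just before the request is sent and `token_arrivals` appended as each streamed chunk arrives.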