Expected layout when populating this repo:
```
llama3_inference/
├─ README.md                  # Profiling plan and examples
├─ bench_vllm_llama3.py       # Offline grid runner (CSV output)
├─ client_stream_profile.py   # Serve client (TTFT/e2e)
├─ scripts/                   # Helper shell scripts
│  ├─ run_prefill.sh
│  ├─ run_decode.sh
│  ├─ run_mixed.sh
│  └─ dmon_wrap.sh
├─ Makefile                   # Common targets (prefill/decode/etc.)
├─ results/                   # Generated CSV/metrics (gitignored)
└─ plots/                     # Generated figures (gitignored)
```
## Setup

```
uv venv
uv pip install "vllm>=0.5.0" "torch>=2.3" transformers pynvml psutil pandas
```

Sanity check: run `nvidia-smi`, then `python -c 'import torch;print(torch.cuda.is_available())'`.

## Example runs

Offline prefill grid:

```
python bench_vllm_llama3.py --model /path/to/Meta-Llama-3-8B-Instruct --tp 1 --devices 7 \
  --B 1,2,4 --Lp 512,2048 --Lo 16 --repeats 1 --out results/prefill_tp1.csv
```

Serving:

```
vllm serve /path/to/model --tensor-parallel-size 4 --port 2234 --metrics-port 2235
```

## Code style

- Python names in `lower_snake_case`; constants in `UPPER_SNAKE_CASE`.
- CLIs use `argparse` with explicit defaults; use deterministic sampling (`temperature=0.0`, `top_p=1.0`) unless testing variability.
- Shell scripts start with `#!/usr/bin/env bash` and `set -euo pipefail`; scripts are named `run_*.sh`.
- Write CSVs to `results/` and figures to `plots/`. Do not hardcode device IDs; use `CUDA_VISIBLE_DEVICES`.

## Quick check

Run the smallest grid, `B=1 Lp=128 Lo=16 repeats=1`, producing `results/quickcheck.csv`. When testing changes, prefer `--dry-run` or the smallest grid.

## Commits and docs

- Commit message prefixes: `feat:`, `fix:`, `docs:`, `refactor:`, `perf:`, `chore:`.
- Update `README.md` when flags, directory layout, or Makefile targets change. Exclude large artifacts; commit small samples only.

## Configuration

- Pass model paths via `--model` and environment variables.
- Set `TOKENIZERS_PARALLELISM=false`; consider `VLLM_NVTX_PROFILE=1` for profiling. Avoid unstable Nsight flags not supported in your environment.
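The code-style rules above (argparse with explicit defaults, deterministic sampling) can be sketched as a minimal parser. The flag names mirror the example command, but the defaults and structure here are illustrative assumptions, not the actual `bench_vllm_llama3.py`:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Illustrative parser: flag names follow the example command above,
    # but the defaults are assumptions, not the repo's actual values.
    p = argparse.ArgumentParser(description="vLLM Llama-3 offline grid runner (sketch)")
    p.add_argument("--model", required=True, help="Path to model checkpoint")
    p.add_argument("--tp", type=int, default=1, help="Tensor-parallel degree")
    p.add_argument("--devices", default="0", help="Comma-separated GPU ids")
    p.add_argument("--B", default="1", help="Comma-separated batch sizes")
    p.add_argument("--Lp", default="128", help="Comma-separated prompt lengths")
    p.add_argument("--Lo", default="16", help="Comma-separated output lengths")
    p.add_argument("--repeats", type=int, default=1)
    p.add_argument("--out", default="results/quickcheck.csv")
    # Deterministic sampling by default, per the style guide.
    p.add_argument("--temperature", type=float, default=0.0)
    p.add_argument("--top-p", type=float, default=1.0, dest="top_p")
    return p

args = build_parser().parse_args(
    ["--model", "/path/to/Meta-Llama-3-8B-Instruct", "--B", "1,2,4"]
)
print(args.B, args.temperature)  # → 1,2,4 0.0
```

Note that every flag carries an explicit default, so omitting it in a quick check still yields the smallest grid.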
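The shell-script conventions above (shebang, `set -euo pipefail`, device selection via `CUDA_VISIBLE_DEVICES`) might look like this skeleton for a `run_*.sh` helper; the quick-check grid values come from the guide, the rest is an illustrative sketch with the benchmark invocation left commented out:

```shell
#!/usr/bin/env bash
# Sketch of a run_*.sh helper; the benchmark call itself is commented out
# so this template can run without a GPU or model checkout.
set -euo pipefail

# Select GPUs via the environment, never hardcoded device ids.
export CUDA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES:-0}"
export TOKENIZERS_PARALLELISM=false

OUT_DIR="results"
mkdir -p "${OUT_DIR}"

echo "devices=${CUDA_VISIBLE_DEVICES} out=${OUT_DIR}/quickcheck.csv"
# python bench_vllm_llama3.py --model "$1" --B 1 --Lp 128 --Lo 16 \
#   --repeats 1 --out "${OUT_DIR}/quickcheck.csv"
```

`set -euo pipefail` makes the script abort on the first failed command, unset variable, or broken pipe, which keeps long grid runs from silently producing partial CSVs.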
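`client_stream_profile.py` reports TTFT and end-to-end latency; both reduce to differences between the request start time and per-token arrival timestamps. A minimal sketch of that arithmetic, decoupled from any HTTP client (the function and dataclass names are assumptions, not the script's actual API):

```python
from dataclasses import dataclass

@dataclass
class StreamTiming:
    """Latency summary for one streamed request (times in seconds)."""
    ttft: float          # time to first token: first arrival - request start
    e2e: float           # end-to-end: last arrival - request start
    tokens_per_s: float  # decode throughput after the first token

def summarize_stream(start: float, token_arrivals: list[float]) -> StreamTiming:
    # token_arrivals are wall-clock timestamps, one per streamed token.
    ttft = token_arrivals[0] - start
    e2e = token_arrivals[-1] - start
    decode_time = e2e - ttft
    n_decode = len(token_arrivals) - 1
    tps = n_decode / decode_time if decode_time > 0 else 0.0
    return StreamTiming(ttft=ttft, e2e=e2e, tokens_per_s=tps)

# Example: request at t=0.0, first token at 0.5 s, then one token every 0.1 s.
t = summarize_stream(0.0, [0.5, 0.6, 0.7, 0.8])
print(t.ttft, t.e2e)  # → 0.5 0.8
```

In the real client, `start` would be captured just before the request is sent and `token_arrivals` appended as each streamed chunk arrives.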