I’m running this:
NGPU=4 CONFIG_FILE="./train_configs/llama3_3b.toml" ./run_llama_train.sh --float8.enable_float8_linear
and the training loop freezes at the model forward call, i.e.
pred = model(input_ids)
when using NGPU>1.
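To see where each rank is stuck once it freezes, I can add a watchdog-style stack dump near the top of train.py (plain Python stdlib faulthandler; the placement is my own addition, not part of torchtitan):

import faulthandler

# Dump the stacks of all threads every 60 s without killing the run
# (exit=False), so the blocked frame shows up in the per-rank logs.
faulthandler.dump_traceback_later(timeout=60, repeat=True, exit=False)

Full launcher output up to the hang: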
+ NGPU=4
+ LOG_RANK=0
+ CONFIG_FILE=./train_configs/llama3_3b.toml
+ overrides=
+ '[' 1 -ne 0 ']'
+ overrides=--float8.enable_float8_linear
+ torchrun --nproc_per_node=4 --rdzv_backend c10d --rdzv_endpoint=localhost:0 --local-ranks-filter 0 --role rank --tee 3 train.py --job.config_file ./train_configs/llama3_3b.toml --float8.enable_float8_linear
W0808 15:35:20.973000 1110135 torch/distributed/run.py:793]
W0808 15:35:20.973000 1110135 torch/distributed/run.py:793] *****************************************
W0808 15:35:20.973000 1110135 torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0808 15:35:20.973000 1110135 torch/distributed/run.py:793] *****************************************
[rank0]:2024-08-08 15:35:23,985 - root - INFO - Starting job: Llama 3 3B training
[rank0]:2024-08-08 15:35:25,681 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
[rank0]:2024-08-08 15:35:25,683 - root - INFO - GPU capacity: NVIDIA H100 80GB HBM3 (0) with 79.11GiB memory
[rank0]:2024-08-08 15:35:25,683 - root - INFO - Building 1-D device mesh with ['dp'], [4]
[rank0]:2024-08-08 15:35:25,685 - root - INFO - Building tiktoken tokenizer locally from /u/tmhoangt/tokenizers/llama3/original/tokenizer.model
[rank0]:2024-08-08 15:35:25,855 - root - INFO - TikTokenizer built: #words 128256, BOS ID 128000, EOS ID 128001
[rank0]:2024-08-08 15:35:25,855 - root - INFO - Preparing c4 dataset from allenai/c4
[rank0]:2024-08-08 15:35:30,775 - root - INFO - Building llama3 3B with ModelArgs(dim=4096, n_layers=12, n_heads=32, n_kv_heads=8, vocab_size=128256, multiple_of=1024, ffn_dim_multiplier=1.3, norm_eps=1e-05, rope_theta=500000, max_batch_size=32, max_seq_len=4096, depth_init=True, norm_type='rmsnorm')
[rank0]:2024-08-08 15:35:30,902 - root - INFO - lm_eval is not installed, GPTQ may not be usable
[rank0]:2024-08-08 15:35:30,907 - root - INFO - Float8 training active
[rank0]:2024-08-08 15:35:30,917 - root - INFO - Swapped to Float8Linear layers with enable_fsdp_float8_all_gather=False
[rank0]:2024-08-08 15:35:30,917 - root - INFO - Model llama3 3B size: 3,668,021,248 total parameters
[rank0]:2024-08-08 15:35:30,918 - root - INFO - Applied selective activation checkpointing to the model
[rank0]:2024-08-08 15:35:30,946 - root - INFO - Applied FSDP to the model
[rank0]:2024-08-08 15:35:34,855 - root - INFO - GPU memory usage for model: 3.44GiB(4.35%)
[rank0]:2024-08-08 15:35:34,857 - root - INFO - Training starts at step 1, with local batch size 2, global batch size 8, sequence length 4096, total steps 1000 (warmup 200)
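Nothing is printed after this line; the job just sits there. Happy to rerun with more distributed debugging enabled if that would help, e.g. with the standard PyTorch/NCCL knobs:

NCCL_DEBUG=INFO TORCH_DISTRIBUTED_DEBUG=DETAIL NGPU=4 CONFIG_FILE="./train_configs/llama3_3b.toml" ./run_llama_train.sh --float8.enable_float8_linear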
The packages I use are the CUDA 12.1 nightly builds:
pip install https://download.pytorch.org/whl/nightly/cpu/torchdata-0.9.0.dev20240808%2Bcpu-cp311-cp311-linux_x86_64.whl
pip install https://download.pytorch.org/whl/nightly/cu121/torchaudio-2.4.0.dev20240808%2Bcu121-cp311-cp311-linux_x86_64.whl
pip install https://download.pytorch.org/whl/nightly/cu121/torchvision-0.20.0.dev20240808%2Bcu121-cp311-cp311-linux_x86_64.whl
pip install https://download.pytorch.org/whl/nightly/pytorch_triton-3.0.0%2Bdedb7bdf33-cp311-cp311-linux_x86_64.whl
pip install https://download.pytorch.org/whl/nightly/cu121/torch-2.5.0.dev20240808%2Bcu121-cp311-cp311-linux_x86_64.whl
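The versions the run actually picks up can be double-checked with (generic check, nothing torchtitan-specific):

python -c "import torch; print(torch.__version__, torch.version.cuda)"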