* test: add more p2p tests
* fix: remove send_forward_recv_forward, as a p2p op list needs to use the same group (see the sketch after this list)
* fix: make send and receive atomic
* feat: update P2PComm fn
* feat: add metadata cache in 1f1b
* feat: add metadata cache in interleaved pp
* feat: modify is_xx_stage fn
* revert: add _broadcast_object_list
* feat: add interleaved pp in llama policy
* feat: set NCCL_BUFFSIZE in HybridParallelPlugin (see the sketch after this list)
* [npu] setup device utils (#5047)
* [npu] add npu device support
* [npu] support low level zero
* [test] update npu zero plugin test
* [hotfix] fix import
* [test] recover tests
* [npu] gemini support npu (#5052)
* [npu] refactor device utils
* [gemini] support npu
* [example] llama2+gemini support npu
* [kernel] add arm cpu adam kernel (#5065)
* [kernel] add arm cpu adam
* [optim] update adam optimizer
* [kernel] arm cpu adam remove bf16 support
* Use p2p
* Cannot send p2p bidirectionally
* Refactor tensor creation and serialization in P2P communication
* Fix llama forward args in flash attention
* Add flop estimate from megatron (see the sketch after this list)
* Support loading weights not in weight_map when strict=False in hybrid_parallel
* Use send_forward_recv_backward, etc in 1f1b
* Use dataclass for metadata (see the sketch after this list)
Remove torch.cuda.synchronize() as suggested
* Add comment about torch.cuda.synchronize() for a potential error
* Fix typo
* Update hybrid_parallel_checkpoint_io.py
* Update p2p.py
* Update one_f_one_b.py
* Update p2p.py
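A minimal sketch of the batched-p2p constraint behind the two fixes above: `torch.distributed.batch_isend_irecv` requires every `P2POp` in one batch to use the same process group, which is why the mixed-group `send_forward_recv_forward` helper was removed and send/recv are posted as one atomic batch. Names and shapes here are illustrative, not the actual `P2PComm` code.

```python
import torch
import torch.distributed as dist

def send_recv_atomic(send_tensor: torch.Tensor, recv_tensor: torch.Tensor,
                     send_peer: int, recv_peer: int, group=None):
    # Both ops go into a single batch_isend_irecv call, so they are
    # posted together (atomic) and must use the same process group.
    ops = [
        dist.P2POp(dist.isend, send_tensor, send_peer, group=group),
        dist.P2POp(dist.irecv, recv_tensor, recv_peer, group=group),
    ]
    for req in dist.batch_isend_irecv(ops):
        req.wait()
```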
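For the NCCL_BUFFSIZE commit: NCCL reads this variable when communicators are created, so a plugin has to export it before `init_process_group` runs. A sketch under that assumption (the 128 MiB value is illustrative, not the plugin's actual default):

```python
import os

# Must be set before torch.distributed.init_process_group creates NCCL
# communicators; otherwise NCCL falls back to its built-in default.
os.environ.setdefault("NCCL_BUFFSIZE", str(128 * 1024 * 1024))  # hypothetical 128 MiB
```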
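The "flop estimate from megatron" presumably refers to the per-iteration model-FLOPs formula from the Megatron-LM paper (Narayanan et al., 2021); a sketch, where the factor 96 assumes full activation recomputation (72 without it):

```python
def megatron_flops(batch_size: int, seq_len: int, num_layers: int,
                   hidden_size: int, vocab_size: int) -> float:
    """FLOPs per iteration: 96 * B * s * l * h^2 * (1 + s/(6h) + V/(16*l*h))."""
    b, s, l, h, v = batch_size, seq_len, num_layers, hidden_size, vocab_size
    return 96 * b * s * l * h ** 2 * (1 + s / (6 * h) + v / (16 * l * h))
```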
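For the metadata dataclass: something along these lines lets the shape/dtype handshake that precedes each p2p tensor transfer be typed and cached (field names are hypothetical, not the actual definition in p2p.py):

```python
from dataclasses import dataclass
from typing import Tuple

import torch

@dataclass
class TensorMetadata:
    # Hypothetical fields: exchanged once per stage pair, then cached
    # (the "metadata cache" commits) so later sends skip the handshake.
    shape: Tuple[int, ...]
    dtype: torch.dtype
    requires_grad: bool = False
```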
---------
Co-authored-by: flybird11111 <1829166702@qq.com>