ColossalAI/colossalai/checkpoint_io
Elsa Granger b2ad0d9e8f
[pipeline,shardformer] Fix p2p efficiency in pipeline, allow skipping loading weight not in weight_map when `strict=False`, fix llama flash attention forward, add flop estimation by megatron in llama benchmark (#5017)
* Use p2p

* Cannot bidirectional send p2p

* Refactor tensor creation and serialization in P2P communication

* Fix llama forward args in flash attention

* Add flop estimate from megatron

* Support loading weight not in weight_map when strict=False in hybrid_parallel

* Use send_forward_recv_backward, etc. in 1f1b

* Use dataclass for metadata
Remove torch.cuda.synchronize() as suggested

* Add comment about the torch.cuda.synchronize for potential error

* Typo

* Update hybrid_parallel_checkpoint_io.py

* Update p2p.py

* Update one_f_one_b.py

* Update p2p.py

---------

Co-authored-by: flybird11111 <1829166702@qq.com>
2023-11-16 20:15:59 +08:00
..
__init__.py [misc] update pre-commit and run all files (#4752) 2023-09-19 14:20:26 +08:00
checkpoint_io_base.py [shardformer] fix master param sync for hybrid plugin/rewrite unwrapping logic (#4758) 2023-09-20 18:29:37 +08:00
general_checkpoint_io.py [shardformer] fix master param sync for hybrid plugin/rewrite unwrapping logic (#4758) 2023-09-20 18:29:37 +08:00
hybrid_parallel_checkpoint_io.py [pipeline,shardformer] Fix p2p efficiency in pipeline, allow skipping loading weight not in weight_map when `strict=False`, fix llama flash attention forward, add flop estimation by megatron in llama benchmark (#5017) 2023-11-16 20:15:59 +08:00
index_file.py [misc] update pre-commit and run all files (#4752) 2023-09-19 14:20:26 +08:00
utils.py [shardformer] Fix serialization error with Tensor Parallel state saving (#5018) 2023-11-09 17:00:25 +08:00