* update to fully overlap, still debugging
* improve interface
* fixed deadlock bug
* debug NaN loss
* (experimental) use one comm group for send_fw_recv_fw to fix NaN
* cleaned up interfaces; use one batch p2p for all
* clean up; removed the double p2p batch case
* p2p test passed
* improve overlap: send fwd before backward
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* tentatively use 2 p2p batches
* remove two p2p batches
* fix typos
* remove pp.sh
---------
Co-authored-by: Edenzzzz <wtan45@wisc.edu>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: root <root@notebook-c55824c0-7742-45e8-9591-c855bb77ad29-0.notebook-c55824c0-7742-45e8-9591-c855bb77ad29.colossal-ai.svc.cluster.local>