* [pipeline inference] pipeline inference (#4492)
* add pp stage manager as circle stage
* fix a bug when create process group
* add ppinfer basic framework
* add micro batch manager and support kvcache-pp gpt2 fwd
* add generate schedule
* use mb size to control mb number
* support generate with kv cache
* add output, remove unused code
* add test
* reuse shardformer to build model
* refactor some code and use the same attribute name of hf
* fix review and add test for generation
* remove unused file
* fix CI
* add cache clear
* fix code error
* fix typo
* [Pipeline inference] Modify to tieweight (#4599)
* add pp stage manager as circle stage
* fix a bug when create process group
* add ppinfer basic framework
* add micro batch manager and support kvcache-pp gpt2 fwd
* add generate schedule
* use mb size to control mb number
* support generate with kv cache
* add output, remove unused code
* add test
* reuse shardformer to build model
* refactor some code and use the same attribute name of hf
* fix review and add test for generation
* remove unused file
* modify the way of saving newtokens
* modify to tieweight
* modify test
* remove unused file
* solve review
* add docstring
* [Pipeline inference] support llama pipeline inference (#4647)
* support llama pipeline inference
* remove tie weight operation
* [pipeline inference] Fix the blocking of communication when ppsize is 2 (#4708)
* add benchmark verbose
* fix export tokens
* fix benchmark verbose
* add P2POp style to do p2p communication
* modify schedule as p2p type when ppsize is 2
* remove unused code and add docstring
* [Pipeline inference] Refactor code, add docsting, fix bug (#4790)
* add benchmark script
* update argparse
* fix fp16 load
* refactor code style
* add docstring
* polish code
* fix test bug
* [Pipeline inference] Add pipeline inference docs (#4817)
* add readme doc
* add a ico
* Add performance
* update table of contents
* refactor code (#4873)
* fix zero ckptio with offload
* fix load device
* saved tensors in ckpt should be on CPU
* fix unit test
* fix unit test
* add clear cache
* save memory for CI
* add APIs
* implement save_sharded_model
* add test for hybrid checkpointio
* implement naive loading for sharded model
* implement efficient sharded model loading
* open a new file for hybrid checkpoint_io
* small fix
* fix circular importing
* fix docstring
* arrange arguments and apis
* small fix
* sharded optimizer checkpoint for gemini plugin
* modify test to reduce testing time
* update doc
* fix bug when keep_gatherd is true under GeminiPlugin
* [plugin] torch ddp plugin add save sharded model
* [test] fix torch ddp ckpt io test
* [test] fix torch ddp ckpt io test
* [test] fix low level zero plugin test
* [test] fix low level zero plugin test
* [test] add debug info
* [test] add debug info
* [test] add debug info
* [test] add debug info
* [test] add debug info
* [test] fix low level zero plugin test
* [test] fix low level zero plugin test
* [test] remove debug info
* [test] fix flop tensor test
* [test] fix autochunk test
* [test] fix lazyinit test
* [devops] update torch version of CI
* [devops] enable testmon
* [devops] fix ci
* [devops] fix ci
* [test] fix checkpoint io test
* [test] fix cluster test
* [test] fix timm test
* [devops] fix ci
* [devops] fix ci
* [devops] fix ci
* [devops] fix ci
* [devops] force sync to test ci
* [test] skip fsdp test