added branch context;
added vocab parallel layers;
moved split_batch from load_batch to tensor parallel embedding layers;
updated gpt model;
updated unit test cases;
fixed a few collective communicator bugs
* add pipeline shared module wrapper and update load_batch
* added model parallel process group for amp and clip grad (#86)
* update amp and clip grad to use the model parallel process group
* remove pipeline_prev/next group (#88)
* micro batch offload
* optimize pipeline GPU memory usage
* pipeline can receive tensor shape (#93)
* fix grad accumulation step counter
* rename classes and functions
Co-authored-by: Frank Lee <somerlee.9@gmail.com>