
References

The Colossal-AI project aims to provide a wide array of parallelism techniques for the machine learning community in the big-model era. This project is inspired by a number of research works, some conducted by our own developers and others open-sourced by other organizations. We would like to credit these amazing projects below in IEEE citation format.

By Our Team

  • Q. Xu, S. Li, C. Gong, and Y. You, An Efficient 2D Method for Training Super-Large Deep Learning Models. arXiv, 2021.

  • Z. Bian, Q. Xu, B. Wang, and Y. You, Maximizing Parallelism in Distributed Training for Huge Neural Networks. arXiv, 2021.

  • S. Li, F. Xue, C. Baranwal, Y. Li, and Y. You, Sequence Parallelism: Long Sequence Training from System Perspective. arXiv, 2021.

  • S. Li et al., Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training. arXiv, 2021.

  • B. Wang, Q. Xu, Z. Bian, and Y. You, Tesseract: Parallelize the Tensor Parallelism Efficiently, in Proceedings of the 51st International Conference on Parallel Processing, 2022.

  • J. Fang et al., A Frequency-aware Software Cache for Large Recommendation System Embeddings. arXiv, 2022.

  • J. Fang et al., Parallel Training of Pre-Trained Models via Chunk-Based Dynamic Memory Management, IEEE Transactions on Parallel and Distributed Systems, vol. 34, no. 1, pp. 304–315, 2023.

  • Y. Liu, S. Li, J. Fang, Y. Shao, B. Yao, and Y. You, Colossal-Auto: Unified Automation of Parallelization and Activation Checkpoint for Large-scale Models. arXiv, 2023.

By Other Organizations

  • M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro, Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv, 2019.

  • S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He, ZeRO: Memory Optimizations toward Training Trillion Parameter Models, in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2020.

  • J. Rasley, S. Rajbhandari, O. Ruwase, and Y. He, DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters, in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual Event, CA, USA, 2020, pp. 3505–3506.

  • D. Narayanan et al., Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM, in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, St. Louis, Missouri, 2021.

  • J. Ren et al., ZeRO-Offload: Democratizing Billion-Scale Model Training, in 2021 USENIX Annual Technical Conference (USENIX ATC 21), 2021.

  • S. Rajbhandari, O. Ruwase, J. Rasley, S. Smith, and Y. He, ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning, in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, St. Louis, Missouri, 2021.

  • L. Zheng et al., Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning, in 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), 2022, pp. 559–578.