180 Commits (ae71036cd2210b6e60805357f4bd059674e316bc)

Author SHA1 Message Date
ver217 ae71036cd2
[utils] refactor parallel layers checkpoint and bcast model on loading checkpoint (#1548)
* refactor parallel layer

* broadcast rank0 model after load ckpt
2022-09-06 20:18:35 +08:00
Jiarui Fang 64169f3e8f
[embedding] polish parallel embedding tablewise (#1545) 2022-09-06 10:41:20 +08:00
CsRic 964123ae0f
[embedding] freq_aware_embedding: add small functions for caller application (#1537) 2022-09-05 15:12:53 +08:00
Jiarui Fang 521078ffc9
[embedding] fix a bug in table wise sharding (#1538) 2022-09-02 15:48:35 +08:00
Jiarui Fang 87134524fd
[embedding] tablewise sharding polish (#1535) 2022-09-02 11:09:37 +08:00
CsRic 5156d5b4f8
[embedding] add tablewise sharding for FAW (#1526) 2022-09-01 17:55:41 +08:00
Jiarui Fang 4537d39df9
[doc] docstring for FreqAwareEmbeddingBag (#1525) 2022-08-31 13:52:30 +08:00
Jiarui Fang 9a9ef65313
[FAW] cpu caching operations (#1520) 2022-08-30 14:50:02 +08:00
Jiarui Fang af5438caa2
[FAW] refactor reorder() for CachedParamMgr (#1514) 2022-08-29 14:22:07 +08:00
Jiarui Fang 9feee6d06b
[FAW] LFU initialize with dataset freq (#1513) 2022-08-29 12:52:53 +08:00
CsRic 1b8fee8e9c
[FAW] shrink freq_cnter size (#1509) 2022-08-29 11:44:55 +08:00
Jiarui Fang ba61109b6c
[FAW] remove code related to chunk (#1501) 2022-08-26 14:23:30 +08:00
Jiarui Fang d5085bb317
[FAW] add more docs and fix a warning (#1500) 2022-08-26 14:10:21 +08:00
CsRic 0ed2f46131
[FAW] FAW embedding use LRU as eviction strategy initialized with dataset stats (#1494) 2022-08-26 11:24:12 +08:00
CsRic b8d0e39eaf
[FAW] LFU cache for the FAW 2022-08-25 13:08:46 +08:00
Jiarui Fang cde7b8a5b8
[FAW] init an LFU implementation for FAW (#1488) 2022-08-24 17:37:22 +08:00
Geng Zhang 0aad53c62b
[FCE] update interface for frequency statistics in FreqCacheEmbedding (#1462) 2022-08-23 17:38:24 +08:00
Jiarui Fang a1476ea882
[NFC] polish doc style for ColoTensor (#1457) 2022-08-16 09:21:05 +08:00
ver217 367c615818
fix nvme docstring (#1450) 2022-08-12 18:01:02 +08:00
Geng Zhang 9f3eed66eb
[FAW] reorganize the inheritance struct of FreqCacheEmbedding (#1448) 2022-08-12 15:55:46 +08:00
Frank Lee ae1b58cd16
[tensor] added linear implementation for the new sharding spec (#1416)
* [tensor] added linear implementation for the new sharding spec

* polish code
2022-08-12 11:33:09 +08:00
Jiarui Fang 30b4dd17c0
[FAW] export FAW in _ops (#1438) 2022-08-11 13:43:24 +08:00
Jiarui Fang c9427a323f
hotfix #1434 (#1437) 2022-08-11 13:14:25 +08:00
Jiarui Fang 10b3df65c8
[FAW] move coloparam setting in test code. (#1429) 2022-08-10 14:31:53 +08:00
Jiarui Fang cb98cf5558
[FAW] parallel FreqAwareEmbedding (#1424) 2022-08-10 13:44:30 +08:00
Jiarui Fang d209aff684
Add FreqAwareEmbeddingBag (#1421) 2022-08-09 16:26:12 +08:00
Jiarui Fang 504419d261
[FAW] add cache manager for the cached embedding (#1419) 2022-08-09 15:17:17 +08:00
ver217 12b4887097
[hotfix] fix CPUAdam kernel nullptr (#1410) 2022-08-05 19:45:45 +08:00
ver217 04c9a86af8
[zero] ZeroDDP supports controlling outputs' dtype (#1399) 2022-08-02 17:49:11 +08:00
HELSON 4e98e938ce
[zero] alleviate memory usage in ZeRODDP state_dict (#1398) 2022-08-02 15:49:13 +08:00
HELSON c7221cb2d4
[hotfix] adapt ProcessGroup and Optimizer to ColoTensor (#1388) 2022-07-29 19:33:24 +08:00
ver217 83328329dd
[hotfix] fix zero ddp buffer cast (#1376)
* fix zero ddp buffer cast

* fix zero ddp ignore params
2022-07-28 10:54:44 +08:00
ver217 5d5031e946
fix zero ddp state dict (#1378) 2022-07-28 09:31:42 +08:00
ver217 c415240db6
[nvme] CPUAdam and HybridAdam support NVMe offload (#1360)
* impl nvme optimizer

* update cpu adam

* add unit test

* update hybrid adam

* update docstr

* add TODOs

* update CI

* fix CI

* fix CI

* fix CI path

* fix CI path

* fix CI path

* fix install tensornvme

* fix CI

* fix CI path

* fix CI env variables

* test CI

* test CI

* fix CI

* fix nvme optim __del__

* fix adam __del__

* fix nvme optim

* fix CI env variables

* fix nvme optim import

* test CI

* test CI

* fix CI
2022-07-26 17:25:24 +08:00
HELSON 87775a0682
[colotensor] use cpu memory to store state_dict (#1367) 2022-07-26 14:13:38 +08:00
ver217 d068af81a3
[doc] update rst and docstring (#1351)
* update rst

* add zero docstr

* fix docstr

* remove fx.tracer.meta_patch

* fix docstr

* fix docstr

* update fx rst

* fix fx docstr

* remove useless rst
2022-07-21 15:54:53 +08:00
HELSON 7a8702c06d
[colotensor] add Tensor.view op and its unit test (#1343)
[colotensor] add megatron initialization for gpt2
2022-07-21 10:53:15 +08:00
ver217 0c51ff2c13
[hotfix] ZeroDDP use new process group (#1333)
* process group supports getting ranks in group

* chunk mgr receives a process group

* update unit test

* fix unit tests
2022-07-18 14:14:52 +08:00
HELSON 1b41686461
[hotfix] fix unit test test_module_spec (#1321) 2022-07-15 14:02:32 +08:00
Jiarui Fang 9e4c6449b0
[checkpoint] add ColoOptimizer checkpointing (#1316) 2022-07-15 09:52:55 +08:00
Jiarui Fang 85f933b58b
[Optimizer] Remove useless ColoOptimizer (#1312) 2022-07-14 16:57:48 +08:00
Jiarui Fang 9f10524313
[Optimizer] polish the init method of ColoOptimizer (#1310) 2022-07-14 16:37:33 +08:00
HELSON 260a55804a
[hotfix] fix shape error in backward when using ColoTensor (#1298) 2022-07-13 23:06:12 +08:00
runluo f83c4d6597
[NFC] polish colossalai/nn/layer/wrapper/pipeline_wrapper.py code style (#1303) 2022-07-13 19:01:07 +08:00
XYE e83b2ce853
[NFC] polish colossalai/nn/layer/vanilla/layers.py code style (#1295) 2022-07-13 12:08:21 +08:00
Liping233 1000a41fd5
[NFC] polish colossalai/nn/layer/vanilla/__init__.py code style (#1293) 2022-07-13 12:08:21 +08:00
Wangbo Zhao(黑色枷锁) 552667825b
[NFC] polish colossalai/nn/layer/parallel_1d/layers.py code style (#1290) 2022-07-13 12:08:21 +08:00
Jiatong Han 38e3ccd1e9
[NFC] polish colossalai/nn/layer/parallel_sequence/layers.py code style (#1280)
Co-authored-by: JThh <jiatong.han@u.nus.edu>
2022-07-13 12:08:21 +08:00
Boyuan Yao b414eaa5db
[NFC] polish colossalai/nn/optimizer/lamb.py code style (#1275) 2022-07-13 12:08:21 +08:00
Super Daniel 52d145a342
[NFC] polish colossalai/nn/lr_scheduler/onecycle.py code style (#1269) 2022-07-13 12:08:21 +08:00