ColossalAI

Commit Graph

Author	SHA1	Message	Date
digger yu	9265f2d4d7	[NFC]fix typo colossalai/auto_parallel nn utils etc. (#3779 ) * fix typo colossalai/autochunk auto_parallel amp * fix typo colossalai/auto_parallel nn utils etc.	2 years ago
digger-yu	b9a8dff7e5	[doc] Fix typo under colossalai and doc(#3618 ) * Fixed several spelling errors under colossalai * Fix the spelling error in colossalai and docs directory * Cautious Changed the spelling error under the example folder * Update runtime_preparation_pass.py revert autograft to autograd * Update search_chunk.py utile to until * Update check_installation.py change misteach to mismatch in line 91 * Update 1D_tensor_parallel.md revert to perceptron * Update 2D_tensor_parallel.md revert to perceptron in line 73 * Update 2p5D_tensor_parallel.md revert to perceptron in line 71 * Update 3D_tensor_parallel.md revert to perceptron in line 80 * Update README.md revert to resnet in line 42 * Update reorder_graph.py revert to indice in line 7 * Update p2p.py revert to megatron in line 94 * Update initialize.py revert to torchrun in line 198 * Update routers.py change to detailed in line 63 * Update routers.py change to detailed in line 146 * Update README.md revert random number in line 402	2 years ago
Hongxin Liu	4341f5e8e6	[lazyinit] fix clone and deepcopy (#3553 )	2 years ago
Hongxin Liu	152239bbfa	[gemini] gemini supports lazy init (#3379 ) * [gemini] fix nvme optimizer init * [gemini] gemini supports lazy init * [gemini] add init example * [gemini] add fool model * [zero] update gemini ddp * [zero] update init example * add chunk method * add chunk method * [lazyinit] fix lazy tensor tolist * [gemini] fix buffer materialization * [misc] remove useless file * [booster] update gemini plugin * [test] update gemini plugin test * [test] fix gemini plugin test * [gemini] fix import * [gemini] fix import * [lazyinit] use new metatensor * [lazyinit] use new metatensor * [lazyinit] fix __set__ method	2 years ago
Frank Lee	80eba05b0a	[test] refactor tests with spawn (#3452 ) * [test] added spawn decorator * polish code * polish code * polish code * polish code * polish code * polish code	2 years ago
ver217	26b7aac0be	[zero] reorganize zero/gemini folder structure (#3424 ) * [zero] refactor low-level zero folder structure * [zero] fix legacy zero import path * [zero] fix legacy zero import path * [zero] remove useless import * [zero] refactor gemini folder structure * [zero] refactor gemini folder structure * [zero] refactor legacy zero import path * [zero] refactor gemini folder structure * [zero] refactor gemini folder structure * [zero] refactor gemini folder structure * [zero] refactor legacy zero import path * [zero] fix test import path * [zero] fix test * [zero] fix circular import * [zero] update import	2 years ago
ver217	f8289d4221	[lazyinit] combine lazy tensor with dtensor (#3204 ) * [lazyinit] lazy tensor add distribute * [lazyinit] refactor distribute * [lazyinit] add test dist lazy init * [lazyinit] add verbose info for dist lazy init * [lazyinit] fix rnn flatten weight op * [lazyinit] polish test * [lazyinit] polish test * [lazyinit] fix lazy tensor data setter * [lazyinit] polish test * [lazyinit] fix clean * [lazyinit] make materialize inplace * [lazyinit] refactor materialize * [lazyinit] refactor test distribute * [lazyinit] fix requires_grad * [lazyinit] fix tolist after materialization * [lazyinit] refactor distribute module * [lazyinit] polish docstr * [lazyinit] polish lazy init context * [lazyinit] temporarily skip test * [lazyinit] polish test * [lazyinit] add docstr	2 years ago
ver217	6ae8ed0407	[lazyinit] add correctness verification (#3147 ) * [lazyinit] fix shared module * [tests] add lazy init test utils * [tests] add torchvision for lazy init * [lazyinit] fix pre op fn * [lazyinit] handle legacy constructor * [tests] refactor lazy init test models * [tests] refactor lazy init test utils * [lazyinit] fix ops don't support meta * [tests] lazy init test timm models * [lazyinit] fix set data * [lazyinit] handle apex layers * [tests] lazy init test transformers models * [tests] lazy init test torchaudio models * [lazyinit] fix import path * [tests] lazy init test torchrec models * [tests] update torch version in CI * [tests] revert torch version in CI * [tests] skip lazy init test	2 years ago
ver217	ed8f60b93b	[lazyinit] refactor lazy tensor and lazy init ctx (#3131 ) * [lazyinit] refactor lazy tensor and lazy init ctx * [lazyinit] polish docstr * [lazyinit] polish docstr	2 years ago
ver217	823f3b9cf4	[doc] add deepspeed citation and copyright (#2996 ) * [doc] add deepspeed citation and copyright * [doc] add deepspeed citation and copyright * [doc] add deepspeed citation and copyright	2 years ago
YH	a848091141	Fix port exception type (#2925 )	2 years ago
Nikita Shulga	01066152f1	Don't use `torch._six` (#2775 ) * Don't use `torch._six` This is a private API which is gone after https://github.com/pytorch/pytorch/pull/94709 * Update common.py	2 years ago
ver217	f0aa191f51	[gemini] fix colo_init_context (#2683 )	2 years ago
HELSON	552183bb74	[polish] polish ColoTensor and its submodules (#2537 )	2 years ago
Super Daniel	35c0c0006e	[utils] lazy init. (#2148 ) * [utils] lazy init. * [utils] remove description. * [utils] complete. * [utils] finalize. * [utils] fix names.	2 years ago
HELSON	7829aa094e	[ddp] add is_ddp_ignored (#2434 ) [ddp] rename to is_ddp_ignored	2 years ago
Frank Lee	40d376c566	[setup] support pre-build and jit-build of cuda kernels (#2374 ) * [setup] support pre-build and jit-build of cuda kernels * polish code * polish code * polish code * polish code * polish code * polish code	2 years ago
Jiarui Fang	355ffb386e	[builder] unified cpu_optim fused_optim inferface (#2190 )	2 years ago
Jiarui Fang	9587b080ba	[builder] use runtime builder for fused_optim (#2189 )	2 years ago
BlueRum	b3f73ce1c8	[Gemini] Update coloinit_ctx to support meta_tensor (#2147 )	2 years ago
Jiarui Fang	8e14344ec9	[hotfix] fix a type in ColoInitContext (#2106 )	2 years ago
Jiarui Fang	05545bfee9	[ColoTensor] throw error when ColoInitContext meets meta parameter. (#2105 )	2 years ago
HELSON	f6178728a0	[gemini] fix init bugs for modules (#2047 ) * [gemini] fix init bugs for modules * fix bugs	2 years ago
Jiarui Fang	31c644027b	[hotfix] hotfix Gemini for no leaf modules bug (#2043 )	2 years ago
ver217	f8a7148dec	[kernel] move all symlinks of kernel to `colossalai._C` (#1971 )	2 years ago
Jiarui Fang	7e24b9b9ee	[Gemini] clean no used MemTraceOp (#1970 )	2 years ago
Jiarui Fang	52c6ad26e0	[ColoTensor] reconfig ColoInitContext, decouple default_pg and default_dist_spec. (#1953 )	2 years ago
Jiarui Fang	9f4fb3f28a	[ColoTensor] ColoInitContext initialize parameters in shard mode. (#1937 )	2 years ago
Frank Lee	e6ec99d389	[utils] fixed lazy init context (#1867 )	2 years ago
Jiarui Fang	3ce4463fe6	[utils] remove lazy_memory_allocate from ColoInitContext (#1844 )	2 years ago
ver217	99870726b1	[CheckpointIO] a uniform checkpoint I/O module (#1689 )	2 years ago
HELSON	1468e4bcfc	[zero] add constant placement policy (#1705 ) * fixes memory leak when paramter is in fp16 in ZeroDDP init. * bans chunk releasement in CUDA. Only when a chunk is about to offload, it is allowed to release. * adds a constant placement policy. With it, users can allocate a reserved caching memory space for parameters.	2 years ago
Kirigaya Kazuto	3b2a59b0ba	[pipeline/rank_recorder] fix bug when process data before backward | add a tool for multiple ranks debug (#1681 ) * [pipeline/tuning] improve dispatch performance both time and space cost * [pipeline/converge] add interface for testing convergence * [NFC] polish colossalai/utils/multi_tensor_apply/multi_tensor_apply.py code style * Update PipelineBase.py * [pipeline/chimera] reconstruct PipelineBase and Worker to support more feasible custom schedule | finish Chimera * [pipeline/chimera] test chimera | fix bug of initializing * [pipeline/pytree] add pytree to process args and kwargs | provide to process args and kwargs after forward	2 years ago
CsRic	2ac46f7be4	[NFC] polish utils/tensor_detector/__init__.py code style (#1573 ) Co-authored-by: ric <mkkt_bkkt@mail.ustc.edu.cn>	2 years ago
LuGY	c7d4932956	[NFC] polish colossalai/utils/tensor_detector/tensor_detector.py code style (#1566 )	2 years ago
Kirigaya Kazuto	318fbf1145	[NFC] polish colossalai/utils/multi_tensor_apply/multi_tensor_apply.py code style (#1559 )	2 years ago
ver217	ae71036cd2	[utils] refactor parallel layers checkpoint and bcast model on loading checkpoint (#1548 ) * refactor parallel layer * broadcast rank0 model after load ckpt	2 years ago
ver217	2bed096848	[utils] optimize partition_tensor_parallel_state_dict (#1546 )	2 years ago
ver217	a203b709d5	[hotfix] fix init context (#1543 ) * fix init context * fix lazy init ctx	2 years ago
Boyuan Yao	47fd8e4a02	[utils] Add use_reetrant=False in utils.activation_checkpoint (#1460 ) * [utils] Add use_reetrant=False into colossalai checkpoint * [utils] add some annotation in utils.activaion_checkpoint * [test] add reset_seed at the beginning of tests in test_actiavion_checkpointing.py * [test] modify test_activation_checkpoint.py * [test] modify test for reentrant=False	2 years ago
Frank Lee	5a52e21fe3	[test] fixed the activation codegen test (#1447 ) * [test] fixed the activation codegen test * polish code	2 years ago
ver217	821c6172e2	[utils] Impl clip_grad_norm for ColoTensor and ZeroOptimizer (#1442 )	2 years ago
HELSON	527758b2ae	[hotfix] fix a running error in test_colo_checkpoint.py (#1387 )	2 years ago
HELSON	b6fd165f66	[checkpoint] add kwargs for load_state_dict (#1374 )	2 years ago
Frank Lee	0c1a16ea5b	[util] standard checkpoint function naming (#1377 )	2 years ago
Super Daniel	be229217ce	[fx] add torchaudio test (#1369 ) * [fx]add torchaudio test * [fx]add torchaudio test * [fx] add torchaudio test * [fx] add torchaudio test * [fx] add torchaudio test * [fx] add torchaudio test * [fx] add torchaudio test * [fx] add torchaudio test and test patches * Delete ~ * [fx] add patches and patches test * [fx] add patches and patches test * [fx] fix patches * [fx] fix rnn patches * [fx] fix rnn patches * [fx] fix rnn patches * [fx] fix rnn patches * [fx] merge upstream * [fx] fix import errors	2 years ago
HELSON	8463290642	[checkpoint] use args, kwargs in save_checkpoint, load_checkpoint (#1368 )	2 years ago
HELSON	87775a0682	[colotensor] use cpu memory to store state_dict (#1367 )	2 years ago
HELSON	943a96323e	[hotfix] fix no optimizer in save/load (#1363 )	2 years ago
HELSON	7a8702c06d	[colotensor] add Tensor.view op and its unit test (#1343 ) [colotensor] add megatron initialization for gpt2	2 years ago

186 Commits (c173a69b3e1839546ad5db6840bfdeff0a09f0f9)