ColossalAI/colossalai/shardformer/model/modeling_bert.py

from typing import Any, Dict, List, Type
import torch
import torch.nn as nn
from torch.nn import CrossEntropyLoss
from transformers import BertForMaskedLM
from transformers.models.bert.modeling_bert import MaskedLMOutput
from ..layer.dist_crossentropy import applyDistCrossEntropy


class BertForMaskedLM_(BertForMaskedLM):
    """``BertForMaskedLM`` whose ``forward`` is injected by ShardFormer so that the
    masked-LM loss goes through the distributed cross-entropy helper
    (``applyDistCrossEntropy``) instead of the vanilla ``CrossEntropyLoss``.
    """

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
        labels=None,
        output_attentions=None,
        output_hidden_states=None,
        return_dict=None,
        **kwargs,
    ):
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        # Run the shared BERT encoder exactly as the upstream implementation does.
        outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            encoder_hidden_states=encoder_hidden_states,
            encoder_attention_mask=encoder_attention_mask,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        sequence_output = outputs[0]
        prediction_scores = self.cls(sequence_output)

        masked_lm_loss = None
        if labels is not None:
            # Compute the masked-LM loss with the distributed cross-entropy helper.
            masked_lm_loss = applyDistCrossEntropy(prediction_scores, labels)
        # Original single-device loss, kept for reference:
        #   loss_fct = CrossEntropyLoss()  # -100 index = padding token
        #   masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))

        if not return_dict:
            output = (prediction_scores,) + outputs[2:]
            return ((masked_lm_loss,) + output) if masked_lm_loss is not None else output

        return MaskedLMOutput(
            loss=masked_lm_loss,
            logits=prediction_scores,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )