
Robustness of Masked Language Models to Damage

In this article, we explore the robustness of different language models to neuron deletions and pruning to observe any multilingual advantages.

Authors: Kavya Duvedi, Tejal Kulkarni, Megan Nguyen, Meera Ruxmohan, Neeti Shah, Roger Terrazas

I. Introduction

Our experiment is motivated by a controversial claim in psychology and neuroscience: that people who are multilingual tend to be more resilient to brain deficiencies than those who are monolingual. Studies have found that multilingual people tend to have a decreased risk of Alzheimer’s, dementia, and cognitive loss from brain damage. Confirming this claim is difficult because few participants fit the qualifications and it is unclear how to measure the robustness of a human brain accurately. We wanted to see whether we could capture the same effect in NLP deep learning models by applying various types of neuron manipulation.

Hypothesis: Multilingual models are more robust than monolingual models.

We predict that the performance of a model trained on multiple languages will decrease at a lower rate than that of a model trained only on English.
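One way to make this prediction concrete is to fit a straight line to accuracy as a function of damage and compare the slopes. The sketch below is only an illustration of that idea: the function name `decline_rate` and all of the numbers are hypothetical, and nothing here is taken from the experiments or plots in this article.

```python
def decline_rate(damage_levels, accuracies):
    """Least-squares slope of accuracy versus damage (negative means decline)."""
    n = len(damage_levels)
    mean_x = sum(damage_levels) / n
    mean_y = sum(accuracies) / n
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(damage_levels, accuracies))
    var = sum((x - mean_x) ** 2 for x in damage_levels)
    return cov / var

# Hypothetical numbers for illustration only, not experimental results.
damage = [0.0, 0.2, 0.4, 0.6]
model_a = [0.85, 0.80, 0.75, 0.70]  # loses 0.25 accuracy per unit of damage
model_b = [0.80, 0.78, 0.76, 0.74]  # loses 0.10 accuracy per unit of damage

# Under this measure, the model with the slope closer to zero is more robust.
more_robust = "model_b" if decline_rate(damage, model_b) > decline_rate(damage, model_a) else "model_a"
print(more_robust)
```

A slope only summarizes the curve well while accuracy is still above chance, so any such fit would need to stop before the models start guessing.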


II. Model Architecture and Benchmarks

BERT is a deep learning model that can be applied to a variety of natural language processing tasks. BERT uses 12 transformer layers, 12 attention heads, and 110 million parameters, and is pre-trained on monolingual Masked Language Modelling and Next Sentence Prediction tasks. In a Masked Language Modelling task, BERT predicts a hidden word in a given sequence of words. In a Next Sentence Prediction task, BERT predicts whether or not two sentences are consecutive. Users can then fine-tune BERT’s weights for a downstream task instead of retraining BERT from scratch. MBERT and BERT share the same internal architecture and training process; however, MBERT is a multilingual model and has therefore been trained on more than 100 languages.
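To illustrate what a Masked Language Modelling input looks like, here is a simplified sketch that hides tokens at random and records which words the model would have to predict. It is not BERT’s actual preprocessing (which also sometimes substitutes a random token or leaves the chosen token unchanged); the function name `mask_tokens` and the example sentence are hypothetical.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Replace each token with mask_token with probability mask_prob.

    Returns the masked sequence and a dict mapping each masked
    position to the original token the model should recover.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets

tokens = "the cat sat on the mat because it was tired".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3)
print(masked)
print(targets)
```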

XLM is a cross-lingual language model that allows surrounding tokens in one language to help predict tokens in another language. This model is meant to extend BERT for multilingual purposes and relies on the same baseline architecture (with a few modifications). The bilingual masked language models that fall under XLM have 6 transformer layers and 8 attention heads, whereas the 17-language XLM model we used has 16 transformer layers and 16 attention heads. The two bilingual models we focus on are xlm-mlm-ende-1024 (trained on English and German Wikipedia text) and xlm-mlm-enfr-1024 (trained on English and French Wikipedia text). Although all of the XLM models were fine-tuned for each of the three downstream tasks listed below, we focus mostly on how they performed on MRPC.

To test the models’ robustness to damage, we measured accuracy and loss on the following 3 GLUE benchmarks:

  1. Microsoft Research Paraphrase Corpus (MRPC): determines whether sentence pairs extracted from news sources are semantically equivalent
  2. Recognizing Textual Entailment (RTE): determines whether one sentence entails another; the dataset is drawn from news and Wikipedia text
  3. Winograd Schema Challenge (WNLI): tests reading comprehension by requiring the model to read a sentence containing a pronoun and select the referent of that pronoun from a list of choices

III. Pruning

In neural networks, pruning is defined as “removing weights, filters, neurons, or other structures” and is a common technique used to reduce the number of parameters, study the differences between over- and under-parameterization, determine how much sparse weight tensors damage a model, and reduce resource demands and latency (Renda et al., 2020). Thus, the more a model is pruned, the more of its weights and neurons are removed.

The brain, much like machine learning models, also goes through pruning. Synaptic pruning is the process of eliminating extra synapses in the brain from early childhood to adulthood. These synapses closely parallel the connections in a neural network, transmitting data from one neuron to another. Constant stimulation makes certain synapses more permanent, while weak synapses that have not had much interaction are deleted. Though a significant amount of pruning happens in the earlier stages of development, some people experience over-pruning, which scientists have found to be linked to schizophrenia and other mental disorders that increase difficulty in learning. However, those who are multilingual usually have significantly more strong, active synapses and are therefore less likely to develop neurodegenerative diseases.

Iterative Pruning

Iterative pruning removes the N lowest L1-norm units for an individual layer, where N is a specified amount. Typically, the model is then retrained in order to recover from having fewer connections. Repeating this pruning and retraining cycle for each layer should minimize the drop in accuracy while still increasing sparsity and speed. For this project, we pruned the model weights with the PyTorch function random_unstructured and the biases with l1_unstructured, for either all layers or specific layers. We did not fine-tune the model again, however, since the brain does not necessarily re-train itself after synaptic pruning.
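The difference between the two pruning criteria mentioned above can be sketched in plain Python. This is a conceptual illustration only, not the code used for the experiments: it operates on a flat list instead of a tensor, and the helper names `prune_smallest_magnitude` and `prune_random` are hypothetical stand-ins for what l1_unstructured and random_unstructured do to a single parameter.

```python
import random

def prune_smallest_magnitude(weights, amount):
    """Zero the `amount` fraction of entries with the smallest absolute value."""
    n_prune = round(amount * len(weights))
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned

def prune_random(weights, amount, seed=0):
    """Zero a randomly chosen `amount` fraction of entries."""
    n_prune = round(amount * len(weights))
    rng = random.Random(seed)
    pruned = list(weights)
    for i in rng.sample(range(len(weights)), n_prune):
        pruned[i] = 0.0
    return pruned

layer = [0.9, -0.05, 0.4, -0.7, 0.01, 0.3]
print(prune_smallest_magnitude(layer, 0.5))  # [0.9, 0.0, 0.4, -0.7, 0.0, 0.0]
print(prune_random(layer, 0.5))              # three entries zeroed, chosen at random
```

Magnitude pruning always removes the same entries for a given set of weights, whereas random pruning can remove large, important weights, so the two can damage a model very differently at the same sparsity.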

We first pruned iteratively across all available layers. For the BERT models, this entailed 12 layers, while the XLM bilingual models had only 6. We then pruned select layers that are known to have great importance and/or influence on the models. Specifically, we experimented with the first layer, the middle layers (6–9 on BERT), and the last layer.


For pruning all layers, the evaluation accuracy and loss results on the MRPC benchmark are shown below. The BERT model starts at and maintains a higher accuracy than MBERT until about 60% of the weights are pruned. However, since we estimate that the models are effectively guessing at random on MRPC after 56.48% damage, any difference between the two models beyond that point may be due to randomness. It is notable, though, that MBERT’s accuracy declines considerably more slowly than BERT’s. Additionally, MBERT consistently suffers a much smaller evaluation loss than BERT. On the other hand, the XLM models give inconclusive results. Both their losses and accuracies are suspiciously stagnant despite the increase in damage probability, which leads us to believe that there may be an unknown bug with these models.

Figure A. Pruning All Layers

When pruning just the first layer, as shown in Figure B, BERT performed consistently better than MBERT in terms of both accuracy and loss. Since both models remained relatively stagnant throughout, the weights of the first layer do not appear to be very significant to the final models.

Figure B. Pruning Layer 0

Next, we decided to prune intermediate layers in order to determine whether or not they had a more significant impact on the overall evaluation loss for MBERT and BERT. The existing study, “What’s so special about BERT’s layers? A closer look at the NLP pipeline in monolingual and multilingual models” explores the relationship between BERT’s layers and the impact of each layer on BERT’s output. The study concludes that layer impact on the output is largest between layers five through nine for a monolingual BERT model, and that multilingual BERT models peak earlier than monolingual BERT. Therefore, we decided to prune layers that we expected to be significant to BERT. By pruning layers six through nine, we expected that MBERT would be more robust than BERT, as we predicted that these layers are less significant to MBERT.

Figure C. Pruning Layers 6–9

From an accuracy standpoint, BERT outperforms MBERT on the MRPC benchmark. Because MBERT’s accuracy stays relatively constant, we conclude that pruning these intermediate layers impacts BERT more than MBERT. MBERT also has far less loss than BERT, further suggesting the significance of these layers to BERT. We suspect that MBERT may be more robust to the pruning of these layers, given its flatter curves for both metrics. Although we are unable to conclude anything about the overall robustness of BERT and MBERT, the results of this benchmark suggest that certain layers are more significant to BERT than to MBERT, making MBERT more robust in this case.

Lastly, pruning just the last layer on the MRPC benchmark also resulted in relatively constant accuracy and loss values. BERT appears to perform better in both of these metrics as well.

Figure D. Pruning Layer 11


Pruning all layers on the RTE benchmark produced inconclusive results. Except for BERT, all accuracies and losses remained relatively steady. Once again, the constant lines for several of the bilingual/multilingual models are suspicious; they may be due to some underlying bug and should be looked into further. However, comparing just the BERT and XLM17 models, since they seemingly have more valid data, the multilingual model does appear more robust. Its accuracy and loss do not degrade as quickly before hitting the point of randomness.

Figure E. Pruning All Layers

We reached a similar conclusion when pruning layers six through nine on the RTE benchmark. As before, MBERT’s accuracy remained relatively steady while BERT’s accuracy changed significantly. MBERT’s nearly constant accuracy may be a cause for concern, especially compared to BERT’s drastic change in accuracy value. The data is therefore inconclusive, and we would need to ensure the validity of the MBERT accuracy data prior to making any firm conclusions.

Figure F. Pruning Layers 6–9


Lastly, we ran the WNLI benchmark for all 6 models. Notably, the bilingual models performed the best in terms of both accuracy and loss throughout the increase in damage probability. BERT and MBERT had similar accuracies initially. However, before the 60% damage probability mark, BERT began to deteriorate more rapidly than MBERT. After 60%, BERT had a considerable gain in accuracy, but this increase may be due to randomness. Additionally, the losses for MBERT were lower and more constant than those of BERT (before reaching 60% damage), suggesting that MBERT may ultimately be more robust on this benchmark.

Figure G. Pruning All Layers

Global Pruning

Global pruning prunes the entire model holistically rather than optimizing at each layer. The result is a significantly accelerated model in which, for each N percent of damage, the model as a whole has N percent fewer neural connections. Because of this, sparsity can be unevenly distributed across weight matrices, which gives different results from other methods of pruning. Likewise, the sparsity of each individual pruned parameter (in these trials, the query, key, and value parameters) will generally not equal N percent, but the global sparsity will match that value.
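The uneven distribution of sparsity can be seen in a small pure-Python sketch. It assumes a magnitude (L1) criterion applied across all parameters at once; the weights are made up, and the names `global_magnitude_prune` and `sparsity` are hypothetical helpers rather than the functions used in the experiments.

```python
def global_magnitude_prune(layers, amount):
    """Zero the `amount` fraction of smallest-magnitude weights across ALL layers.

    layers: dict mapping a parameter name to a flat list of weights.
    """
    entries = [(abs(w), name, i)
               for name, ws in layers.items()
               for i, w in enumerate(ws)]
    entries.sort(key=lambda e: e[0])
    n_prune = round(amount * len(entries))
    pruned = {name: list(ws) for name, ws in layers.items()}
    for _, name, i in entries[:n_prune]:
        pruned[name][i] = 0.0
    return pruned

def sparsity(weights):
    """Fraction of entries that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)

# Made-up weights: "query" holds large values, "key" mostly small ones.
layers = {
    "query": [0.9, 0.8, -0.7, 0.6],
    "key": [0.05, -0.02, 0.5, 0.01],
    "value": [0.3, -0.04, 0.45, 0.2],
}
pruned = global_magnitude_prune(layers, 0.5)
for name, ws in pruned.items():
    print(name, sparsity(ws))  # query 0.0, key 0.75, value 0.75
all_weights = [w for ws in pruned.values() for w in ws]
print("global", sparsity(all_weights))  # global 0.5
```

Because a magnitude criterion has no random component, repeating the procedure on the same weights removes exactly the same entries each time.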


Running global pruning on the MRPC benchmark resulted in BERT performing better overall than MBERT until a pruning fraction of around 0.75. At that point, however, the accuracies of both models plummet, leaving them at around the same accuracy level. It is also notable that the accuracy graph shows no variation in values even though it was run with 10 trials at each testing point. This could be because global pruning potentially eliminates the same weak neurons each time. The loss graphs for both models are quite similar until around 0.65, after which BERT’s loss is significantly higher than MBERT’s, suggesting that around this threshold MBERT is less drastically affected than BERT.

Figure H. MRPC Prune All Layers


For the WNLI benchmark, the pruned models were run with 10 trials for each probability step to make the results more reliable and account for any noise that may have occurred. Figure I shows all layers pruned, while Figure J shows only the last layer pruned. It is notable that the results are quite similar between the two, suggesting that when the whole model was pruned, most of the pruning probably happened in the last layer. MBERT’s accuracy is stagnant in both settings regardless of the damage level, which could indicate a bug. From the evaluation loss, though, it seems BERT has greater loss overall than MBERT.

Figure I. WNLI Prune All Layers
Figure J. WNLI Prune Last Layer


The RTE benchmark was also run with 10 trials for each damage level, but it resulted in a nearly flat BERT curve. MBERT can be seen oscillating quite a bit, which could be associated with unwanted noise. The results of this benchmark make it difficult to come to a sound conclusion, though they did follow the trend seen in previous benchmarks, with BERT having greater accuracy at the beginning and then sharply declining.

Figure K. RTE Prune All Layers

IV. Neuron Deletion

As a separate approach from pruning, our team also tried random neuron deletion from both monolingual and multilingual natural language processing models. The trial ran such that, given a probability, that percentage of neurons would be randomly chosen from the model and have their weights set to zero. We applied this manipulation to try and simulate the same effects that take place during Alzheimer’s disease, which causes neurons to lose their connections all across the brain.
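A minimal sketch of this kind of random neuron deletion is shown below. It rests on assumptions that the description above leaves open: a neuron is treated as one row of a weight matrix, exactly the requested fraction of rows is deleted, and the function name `delete_neurons` is hypothetical.

```python
import random

def delete_neurons(weight_rows, probability, seed=None):
    """Zero all weights of a randomly chosen fraction of neurons.

    weight_rows: list of rows, one row of weights per neuron (an assumption).
    probability: fraction of neurons to delete, between 0.0 and 1.0.
    """
    rng = random.Random(seed)
    n_delete = round(probability * len(weight_rows))
    doomed = set(rng.sample(range(len(weight_rows)), n_delete))
    return [[0.0] * len(row) if i in doomed else list(row)
            for i, row in enumerate(weight_rows)]

layer = [[0.2, -0.1], [0.5, 0.3], [-0.4, 0.8], [0.1, 0.9]]
damaged = delete_neurons(layer, 0.5, seed=0)
zeroed = sum(1 for row in damaged if all(w == 0.0 for w in row))
print(zeroed)  # 2 of the 4 neurons are deleted
```

Unlike magnitude pruning, the neurons removed here differ from trial to trial, which is one reason repeated trials at each damage level matter.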


For the first trial, our team wanted to see how Multilingual BERT compared to standard BERT on the MRPC task. To test this, we evaluated each of the models with the probability of neuron deletion ranging from 0.0 to 1.0 with step size 0.01. We then repeated this 15 times on each model to produce 1500 evaluations per model. The results are shown below, with the blue line representing BERT and the orange line representing Multilingual BERT:

Figure L. Accuracy by increasing neuron deletion

As the two graphs show, there is moderate indication that Multilingual BERT decreases in performance at a lower rate than BERT. Multilingual BERT does, however, start to outperform BERT in the 0.2–0.6 probability of deletion range by a substantial margin. It is unclear what can be concluded from this trial because the plot does not meet the definition of robustness we set out for our experiment.
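The evaluation grid for this first trial can be sketched as follows. The sketch assumes 100 deletion probabilities (0.00 through 0.99) so that 15 repeats give the 1500 evaluations mentioned above; `fake_evaluate` is a hypothetical stand-in for damaging a model and scoring it on MRPC, and its numbers are invented.

```python
import random
from statistics import mean

def sweep(evaluate, n_steps=100, step=0.01, n_trials=15):
    """Evaluate at each damage probability n_trials times."""
    results = {}
    for i in range(n_steps):
        p = round(i * step, 2)
        results[p] = [evaluate(p, trial) for trial in range(n_trials)]
    return results

def fake_evaluate(p, trial):
    # Hypothetical placeholder: a real run would delete neurons with
    # probability p and return the model's accuracy on the evaluation set.
    rng = random.Random(1000 * trial + i_from(p))
    return max(0.0, 0.85 - 0.5 * p + rng.uniform(-0.02, 0.02))

def i_from(p):
    """Map a probability such as 0.37 back to its integer step index."""
    return int(round(p * 100))

results = sweep(fake_evaluate)
total = sum(len(v) for v in results.values())
mean_by_level = {p: mean(v) for p, v in results.items()}
print(total)  # 1500
```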

Developed by Facebook’s AI Research group, XLM attempts to resolve BERT’s challenges in sharing common subwords across multiple languages. This is achieved through byte-pair encoding and modifications to the BERT architecture. In order to better understand how XLM reacts to different percentages of neuron deletion, we performed the same set of experiments (with a consistent number of samples and step size) across various XLM and BERT models. First, we fine-tuned each of the models on a specified downstream task (MRPC, RTE, or WNLI). Then, we deleted a percentage of neurons from each and measured their accuracies on the benchmark’s evaluation dataset. Since pre-trained bilingual BERT models were not available for use on HuggingFace, we used bilingual and 17-language XLM models to see how they would compare against monolingual BERT (uncased and trained only on English text) and MBERT (trained on cased text for 104 languages and uncased text for 102 languages). We were curious to see whether the addition of languages for different masked language models would increase robustness to neural damage.


We first evaluated 3 XLM and 3 BERT models after fine-tuning them on the MRPC downstream task. While there were extreme fluctuations in accuracy past the first quarter of damage probabilities, BERT experienced the greatest drop in accuracy for fewer neuron deletions. XLM models required more neuron deletion damage to be applied before fluctuations began. Since each of the models produced a V-shaped oscillating pattern rather than a consistent drop across the board, we cannot draw a firm conclusion. Once all of the models reach their lowest point, there is a complete shift in accuracy as each of the models begins to increase dramatically. The ongoing pattern past 30% of neuron deletions makes it difficult to discern whether XLM models are more robust than BERT models; however, the initial portion of the plot below seems to indicate that there is potential for greater robustness with XLM models.

Figure M. Neuron Deletion After MRPC Fine Tuning

In order to better understand the performance of XLM models on the MRPC benchmark, we measured the accuracy of the bilingual (English and French) and 17-language XLM models against the damage probabilities used for neuron deletion. Both models were evaluated on the MRPC benchmark.

Figure N. Bilingual XLM (left) and 17-language XLM (right)

More deletions had to be applied to bilingual XLM than to 17-language XLM in order to break the nearly constant lines for accuracy. A significant amount of noise appears afterwards, which is either an indication of reaching a damage threshold for random accuracy guesses or a flaw in fine-tuning and testing. Based on the previous experiment, it is apparent that for the first quarter of the plot, there are oscillations for both XLM models. However, these are minor compared to the extreme fluctuations that appear after deleting 30% of neurons for the 17-language XLM model and 50% of neurons for the bilingual XLM model.


With the RTE benchmark, it becomes clearer that BERT (uncased) is not as robust as the remaining models, given its sharp decline after roughly 10% of neurons have been deleted. For the RTE benchmark, we were able to run 2 XLM models (removing the other bilingual model that was measured with the MRPC benchmark). However, it remains unclear whether XLM performs better than MBERT (cased or uncased), given the diagrams shown below. There appear to be small oscillations in accuracy; however, BERT remains consistently lower in accuracy than the rest of the models after 20% of its neurons have been deleted. Due to the amount of noise observed in the other models, these results are also inconclusive and will be examined further once more samples have been taken per step.

Figure O. Neuron Deletion After RTE Fine Tuning

Based on the results from Figure O, more noise is observed when 10 samples are taken with a step size of 0.01 than when 5 samples are taken with a step size of 0.05. As we zoom closer into the changes in accuracy (as seen in the left figure), there appears to be a pattern of increases and decreases in accuracy for each of the models. The unexpected increases in BERT’s accuracy when roughly 6% or 1.1% of the neurons have been deleted may disappear as more samples are taken or when a specific set of neurons is deleted.
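One common way to judge whether such bumps are real or just noise is to report the mean accuracy at each damage level together with its standard error across trials. The sketch below shows that calculation; the helper name `summarize` and the accuracy values are hypothetical and are not measurements from these experiments.

```python
from math import sqrt
from statistics import mean, stdev

def summarize(trial_accuracies):
    """Return (mean, standard error of the mean) for one damage level."""
    m = mean(trial_accuracies)
    se = stdev(trial_accuracies) / sqrt(len(trial_accuracies))
    return m, se

# Hypothetical accuracies from 5 trials at a single damage level.
trials = [0.70, 0.72, 0.68, 0.71, 0.69]
m, se = summarize(trials)
print(round(m, 3), round(se, 4))  # 0.7 0.0071
```

Since the standard error shrinks roughly with the square root of the number of trials, a bump that is smaller than a couple of standard errors is a reasonable candidate for noise rather than a real effect.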


Out of the 3 benchmarks used, WNLI results in the greatest amount of noise (even when zoomed out). Only a single XLM model was compared against the remaining BERT models, but it outperformed the BERT models after each of them was fine-tuned on this downstream task and subjected to a range of neuron deletions. BERT (uncased) appears to have unexpected increases beginning when 10%, 20%, 30%, and 50% of the neurons have been deleted. Overall, it is difficult to determine whether MBERT actually performs better than BERT here; however, XLM consistently achieves higher accuracy than the remaining models only on the WNLI benchmark.

Figure P. Neuron Deletion After WNLI Fine Tuning

Since these plots indicate that there may be problems in evaluating these models after fine-tuning them on the various downstream tasks (MRPC, RTE, and WNLI), we will focus next on deciding which layers to delete neurons from, determining the number of samples to take at each step to reduce the amount of noise, and determining whether these models perform better than random guessing.
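For the last point, a simple reference is the accuracy obtained by always predicting the most frequent label in the evaluation set. The sketch below computes that baseline; the labels are made up for illustration, and the helper names `majority_baseline` and `beats_baseline` are hypothetical.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy of always predicting the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

def beats_baseline(accuracy, labels):
    """True if a model's accuracy exceeds the majority-class baseline."""
    return accuracy > majority_baseline(labels)

# Made-up binary labels, not taken from MRPC, RTE, or WNLI.
labels = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
print(majority_baseline(labels))     # 0.7
print(beats_baseline(0.65, labels))  # False
```

On an imbalanced dataset, a damaged model that collapses to a single prediction can still score well above 50%, so this baseline is a stricter check than comparing against a coin flip.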

V. Adding Distractions

In several studies evaluating the advantages of bilingualism, bilingual speakers were found to perform better than monolingual speakers on tasks that required inhibiting irrelevant or distracting information. Bilingual speakers engage executive functions more often than monolingual speakers since they must be able to switch between languages quickly and frequently. This constant practice of inhibiting the use of one language when using another allows bilingual speakers to suppress distractions more easily during tasks that require the use of executive function.

We tested uncased BERT and MBERT’s robustness against distraction to see if this bilingual advantage could be observed in natural language processing. We recorded the test accuracy of BERT and MBERT in response to increasing the number of random English words added to the MRPC test dataset. The random words were selected from a dataset of 194,434 English words.

We added random words to each sentence in proportion to its length (e.g., a random percentage of 10% means adding 0.1 * len(sentence) random words to the sentence).
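A minimal sketch of this procedure is given below. It makes several assumptions that are not spelled out above: sentence length is counted in whitespace-separated words, the count of added words is rounded down, and the distractors are inserted at random positions. The function name `add_distractors` and the tiny vocabulary are hypothetical.

```python
import random

def add_distractors(sentence, percent, vocabulary, seed=None):
    """Insert random words amounting to `percent` of the sentence length."""
    rng = random.Random(seed)
    words = sentence.split()
    n_extra = len(words) * percent // 100  # rounded down (an assumption)
    for _ in range(n_extra):
        position = rng.randint(0, len(words))  # may also append at the end
        words.insert(position, rng.choice(vocabulary))
    return " ".join(words)

vocabulary = ["apple", "river", "cloud"]  # stand-in for the English word list
sentence = "the quick brown fox jumps over the lazy sleeping dog"
noisy = add_distractors(sentence, 20, vocabulary, seed=0)
print(noisy)  # the 10 original words plus 2 random ones
```

Calling the function with a different seed for every trial yields a fresh randomized dataset at each percentage.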

We ran 25 trials for each percentage of random words from 0 to 30 with step size 2 for each model. At each percentage of random words, we created a new randomized dataset for each trial and tested each model on that dataset.

Figure Q. Test accuracy after adding distractions

For the MRPC task, there was no significant multilingual advantage of suppressing distracting information when random words were added to the test data. MBERT’s slight gradient increase around the 5–10 random percentage range as opposed to BERT’s strictly linear decrease could possibly indicate that MBERT could suppress distractions better around this range. However, the difference is small, so we could not confidently prove a multilingual advantage.

Future work comparing the robustness of BERT and MBERT against distraction could repeat these experiments on other downstream tasks.

VI. Conclusion

Based on our accuracy and cross entropy loss results, multilingual deep learning models (MBERT and XLM) appear to be generally more robust than monolingual models (BERT) to layer pruning and neuron deletion damage.

In the future, we hope to test this notion further and to draw on cognitive science papers to see what role executive function may play in the differences between monolingual and multilingual models.

  1. Alex Renda, Jonathan Frankle, and Michael Carbin. Comparing rewinding and fine-tuning in neural network pruning. In International Conference on Learning Representations, 2020.
  2. “Glue: TensorFlow Datasets,” TensorFlow.
  3. Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291.
  4. R. Horev, XLM — Enhancing BERT for Cross-lingual Language Model,” Towards Data Science, 11-Feb-2019.
  5. Wietse de Vries, Andreas van Cranenburgh, and Malvina Nissim. 2020. What’s so special about BERT’s layers? A closer look at the NLP pipeline in monolingual and multilingual models. arXiv:2004.06499 [cs].
  6. Y. Seth, “BERT Explained — A list of Frequently Asked Questions,” Let the Machines Learn, 12-Jun-2019.

