Experiment Report: BERT Pre-trained Model

This report describes the use of open-source BERT (Bidirectional Encoder Representations from Transformers) pre-trained models for Chinese text classification. By fine-tuning these models on a news dataset, we evaluate their classification performance and accuracy. The report covers the basic principles, architectural design, pre-training tasks, and fine-tuning method of BERT, and presents experimental results to assess model performance.

Model Architecture

BERT is a Transformer-based language model proposed by the Google AI team in 2018 that substantially improved performance on a wide range of natural language processing tasks through the pre-training and fine-tuning paradigm (Devlin et al., 2019). The model is a stack of Transformer encoder layers, each combining a multi-head self-attention sublayer with a feed-forward sublayer. In this experiment, we adopted the BERT-base architecture, which has 12 layers, 768 hidden units, 12 attention heads, and about 110 million parameters in total. Notably, BERT-base matches the model size of OpenAI's GPT but is trained with a bidirectional language modeling objective rather than a left-to-right one.
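These architecture numbers can be read directly off a loaded checkpoint. The sketch below is an illustration only, assuming the Hugging Face transformers library; "bert-base-chinese" is used as an example checkpoint name and is not necessarily the exact set of weights used in this experiment.

```python
# Minimal sketch: inspect the BERT-base configuration (assumes the Hugging Face
# transformers library; "bert-base-chinese" is an illustrative checkpoint name).
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-chinese")
config = model.config

print(config.num_hidden_layers)    # 12 Transformer encoder layers
print(config.hidden_size)          # 768 hidden units
print(config.num_attention_heads)  # 12 attention heads
# Roughly 110M parameters in total (the exact count depends on the vocabulary size).
print(sum(p.numel() for p in model.parameters()))
```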

Pre-training Tasks

To better handle Chinese natural language processing tasks, we selected the following BERT-family pre-trained models for this experiment; their key properties are summarized below:

Model                              | Dataset  | Mask      | Pre-training Tasks
BERT (Devlin et al., 2019)         | wiki     | WordPiece | MLM + NSP
BERT-wwm (Cui et al., 2021)        | wiki     | WWM       | MLM + NSP
BERT-wwm-ext (Cui et al., 2021)    | wiki+ext | WWM       | MLM + NSP
RoBERTa-wwm-ext (Cui et al., 2021) | wiki+ext | WWM       | DMLM
  • Dataset: wiki represents Chinese Wikipedia, ext represents other encyclopedias, news, QA datasets, etc.
  • Mask: WordPiece denotes subword-level masking that ignores traditional Chinese word segmentation (CWS); WWM (Whole Word Masking) uses HIT's LTP toolkit to segment the text and masks all characters belonging to the same word (Cui et al., 2021). A toy illustration of the two strategies is given after this list.
  • Pre-training Tasks: MLM (Masked Language Model) randomly masks some input tokens and trains the model to predict them; NSP (Next Sentence Prediction) predicts whether two sentences are consecutive; DMLM (dynamic masking MLM, following RoBERTa) regenerates the masking pattern each time a sequence is fed to the model instead of fixing it once during data preprocessing.
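The following toy example illustrates the difference between the two masking strategies. The sentence and its word segmentation are written by hand (in practice Cui et al. (2021) obtain the segmentation with LTP); this is not the actual pre-training data pipeline.

```python
# Toy illustration of character-level (WordPiece) masking vs. whole word masking (WWM).
# For Chinese, WordPiece effectively operates at the character level, so it may mask
# only part of a word; WWM masks every character of the selected word together.

sentence = "使用语言模型来预测下一个词"
words = ["使用", "语言", "模型", "来", "预测", "下一个", "词"]  # hand-made CWS segmentation
target = "模型"  # suppose the masking procedure selects (part of) this word

# WordPiece-style masking: a single character can be masked on its own.
wordpiece_masked = sentence.replace("模", "[MASK]", 1)

# Whole word masking: all characters of the selected word are masked together.
wwm_masked = sentence.replace(target, "[MASK]" * len(target), 1)

print("Original :", sentence)
print("WordPiece:", wordpiece_masked)  # 使用语言[MASK]型来预测下一个词
print("WWM      :", wwm_masked)        # 使用语言[MASK][MASK]来预测下一个词
```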

Fine-tuning Tasks

In the downstream text classification task of this experiment, we adopted the method proposed by Devlin et al. (2019). We take the output vector \(\mathbf{C} \in \mathbb{R}^{H}\) of the [CLS] token from the last layer of the BERT model, feed it into a fully connected layer with weight matrix \(\mathbf{W} \in \mathbb{R}^{K \times H}\), where \(K\) is the number of labels, and compute the standard classification loss, i.e., the cross-entropy between \(\text{softmax}(\mathbf{C}\mathbf{W}^{\top})\) and the true label.
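A minimal sketch of this classification head is shown below, assuming the Hugging Face transformers library; the checkpoint name and NUM_LABELS constant are placeholders rather than the exact configuration of this experiment (the actual training code is in the repository linked at the end of the report).

```python
# Sketch of the [CLS]-based classification head described above.
import torch
import torch.nn as nn
from transformers import BertModel

NUM_LABELS = 32  # K: number of news categories in this experiment


class BertClassifier(nn.Module):
    def __init__(self, pretrained="bert-base-chinese", num_labels=NUM_LABELS):
        super().__init__()
        self.bert = BertModel.from_pretrained(pretrained)
        # Fully connected layer W: maps the H-dimensional [CLS] vector to K logits.
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask, labels=None):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # C: final-layer hidden state of the [CLS] token (position 0).
        cls_vec = outputs.last_hidden_state[:, 0]
        logits = self.classifier(cls_vec)
        if labels is not None:
            # Standard classification loss: cross-entropy against the true label.
            loss = nn.functional.cross_entropy(logits, labels)
            return loss, logits
        return logits
```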

Experimental Results

The dataset for this experiment consists of news headlines and their corresponding category labels, including a development set and a test set. The development set contains 47,952 news items, and the test set contains 15,986 news items. There are a total of 32 category labels, including finance, education, technology, sports, games, etc.

We further split the development set into a training set and a validation set in an 8:2 ratio and employed an early-stopping strategy: training stops when the accuracy on the validation set has not improved for three consecutive epochs. The remaining settings are as follows: Adam optimizer, learning rate 5e-5, batch size 64. The experiment was run on a MacBook Pro (2021) using the PyTorch deep learning framework.
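The early-stopping rule can be summarized by the self-contained sketch below; the per-epoch accuracy values are invented solely to keep the example runnable on its own and are not results from this experiment.

```python
# Sketch of the early-stopping rule: stop once validation accuracy has not
# improved for 3 consecutive epochs. Accuracy values below are made up.
val_acc_per_epoch = [0.810, 0.840, 0.852, 0.858, 0.857, 0.856, 0.855, 0.860]

patience = 3
best_acc, bad_epochs = 0.0, 0

for epoch, val_acc in enumerate(val_acc_per_epoch, start=1):
    # ... one training epoch with Adam (lr = 5e-5, batch size 64) would run here ...
    if val_acc > best_acc:
        best_acc, bad_epochs = val_acc, 0  # the best checkpoint would be saved here
    else:
        bad_epochs += 1
    print(f"epoch {epoch}: val_acc={val_acc:.3f}, best={best_acc:.3f}, bad_epochs={bad_epochs}")
    if bad_epochs >= patience:
        print("early stopping")
        break
```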

To ensure the reliability of the results, we ran each model three times with different random seeds and report both the maximum and the average accuracy (the average is given in parentheses).

Model           | Validation Set Accuracy, max (mean) | Test Set Accuracy, max (mean)
BERT            | 0.8558 (0.8533)                     | 0.8588 (0.8529)
BERT-wwm        | 0.8599 (0.8569)                     | 0.8568 (0.8553)
BERT-wwm-ext    | 0.8608 (0.8592)                     | 0.8636 (0.8592)
RoBERTa-wwm-ext | 0.8637 (0.8604)                     | 0.8608 (0.8588)

The above experimental results indicate that the BERT-wwm-ext model achieved the best performance on the test set, with an accuracy of 86.36%. The RoBERTa-wwm-ext model performed best on the validation set, with an accuracy of 86.37%.

The experiment code has been open-sourced and is available at: https://github.com/SignorinoY/bert-classification.

References

Cui, Y., Che, W., Liu, T., Qin, B., & Yang, Z. (2021). Pre-training with whole word masking for Chinese BERT. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29, 3504–3514. https://doi.org/10.1109/TASLP.2021.3124365
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In J. Burstein, C. Doran, & T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 4171–4186). Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1423