Experiment Report: BERT Pre-trained Model
This experiment report explores the process of using the open-source BERT (Bidirectional Encoder Representations from Transformers) model for Chinese text classification. By fine-tuning this model on a specific news dataset, we evaluate its classification performance and accuracy. The report systematically analyzes the basic principles, architectural design, pre-training tasks, and fine-tuning methods of the BERT pre-trained model while providing experimental results to assess the model's performance.
Model Architecture
BERT is a Transformer-based language model proposed by the Google AI team in 2018 that significantly improved the performance of a wide range of natural language processing tasks through pre-training and fine-tuning (Devlin et al., 2019). The model stacks multiple Transformer encoder layers, each consisting of a multi-head self-attention sublayer and a feed-forward sublayer. In this experiment, we adopted the BERT-base architecture, which has 12 layers, 768 hidden units, 12 attention heads, and about 110 million parameters in total. Notably, BERT-base is architecturally similar to OpenAI's GPT model but is trained with a bidirectional language modeling objective.
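As a reference point, the short sketch below loads a BERT-base model and prints the dimensions listed above. It assumes the Hugging Face transformers library and the bert-base-chinese checkpoint, which the report does not name explicitly; it is an illustration, not the experiment's own loading code.

```python
# Minimal sketch: inspect the BERT-base architecture used in this experiment.
# Assumption: Hugging Face `transformers` and the `bert-base-chinese` checkpoint.
from transformers import BertConfig, BertModel

config = BertConfig.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")

print(config.num_hidden_layers)    # 12 Transformer encoder layers
print(config.hidden_size)          # 768 hidden units
print(config.num_attention_heads)  # 12 attention heads
print(sum(p.numel() for p in model.parameters()))  # roughly 110M parameters
```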
Pre-training Tasks
To better handle Chinese natural language processing tasks, we selected the following BERT pre-trained models for this experiment:
Model | Dataset | Mask | Pre-training Tasks |
---|---|---|---|
BERT (Devlin et al., 2019) | wiki | Word Piece | MLM+NSP |
BERT-wwm (Cui et al., 2021) | wiki | WWM | MLM+NSP |
BERT-wwm-ext (Cui et al., 2021) | wiki+ext | WWM | MLM+NSP |
RoBERTa-wwm-ext (Cui et al., 2021) | wiki+ext | WWM | DMLM |
- Dataset: `wiki` denotes Chinese Wikipedia; `ext` denotes additional corpora such as other encyclopedias, news, and QA datasets.
- Mask: `Word Piece` indicates subword tokenization that ignores traditional Chinese word segmentation (CWS); `WWM` (Whole Word Masking) uses HIT's LTP tool to segment words and masks all characters belonging to the same word (Cui et al., 2021).
- Pre-training Tasks: `MLM` (Masked Language Model) randomly masks some input tokens and predicts the masked tokens (see the sketch after this list); `NSP` (Next Sentence Prediction) predicts whether two sentences are consecutive; `DMLM` (Dynamic Masked Language Model) is the MLM objective with dynamic masking, i.e., a new masking pattern is generated each time a sequence is fed to the model rather than being fixed once at preprocessing time.
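To make the MLM objective concrete, the sketch below masks roughly 15% of the tokens in a sentence and computes the masked-token prediction loss. It is only an illustration under stated assumptions (the Hugging Face transformers library and the bert-base-chinese checkpoint), not the pre-training code of the models listed above, and for simplicity it always substitutes `[MASK]` instead of the original 80/10/10 replacement recipe.

```python
# Illustrative sketch of the MLM objective with ~15% random masking.
# Assumptions: Hugging Face `transformers`, `bert-base-chinese` checkpoint.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")

inputs = tokenizer("今天天气很好", return_tensors="pt")  # "The weather is nice today"
labels = inputs["input_ids"].clone()

# Pick ~15% of the non-special tokens to mask.
special = tokenizer.get_special_tokens_mask(
    labels[0].tolist(), already_has_special_tokens=True
)
probs = torch.full(labels.shape, 0.15)
probs[0, torch.tensor(special, dtype=torch.bool)] = 0.0
masked = torch.bernoulli(probs).bool()

# Replace the chosen positions with [MASK] (simplified: no 80/10/10 split)
# and compute the loss only on those positions.
inputs["input_ids"][masked] = tokenizer.mask_token_id
labels[~masked] = -100

loss = model(**inputs, labels=labels).loss
```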
Fine-tuning Tasks
In the downstream task of text classification in this experiment, we
adopted the method proposed by Devlin et al. (2019). We extracted the output vector
\(\mathbf{C} \in \mathbb{R}^{H}\) of
the [CLS]
token from the last layer of the BERT model, then
input it into a fully connected layer \(\mathbf{W} \in \mathbb{R}^{H \times K}\),
where \(K\) is the number of labels,
and calculated the corresponding standard classification loss \(\log(\text{softmax}(\mathbf{C}\mathbf{W}^{\top}))\).
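A minimal sketch of this fine-tuning head follows, assuming the Hugging Face transformers library; the class and parameter names are illustrative placeholders rather than names from the released code.

```python
# Sketch of the [CLS]-based classification head described above.
import torch.nn as nn
from transformers import BertModel

class BertClassifier(nn.Module):
    """BERT encoder plus a fully connected layer W (H x K) on the [CLS] vector C."""

    def __init__(self, num_labels: int, pretrained: str = "bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(pretrained)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)  # W

    def forward(self, input_ids, attention_mask, labels=None):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_vector = outputs.last_hidden_state[:, 0]  # C: last-layer [CLS] vector
        logits = self.classifier(cls_vector)          # C W^T + b
        if labels is not None:
            # Cross-entropy, i.e. the negative log-softmax of the true class.
            return nn.functional.cross_entropy(logits, labels), logits
        return logits
```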
Experimental Results
The dataset for this experiment consists of news headlines and their corresponding category labels, including a development set and a test set. The development set contains 47,952 news items, and the test set contains 15,986 news items. There are a total of 32 category labels, including finance, education, technology, sports, games, etc.
We further split the development set into a training set and a validation set in an 8:2 ratio and employed an early stopping strategy, stopping training when the accuracy on the validation set did not improve for three consecutive epochs. The remaining hyperparameters were: Adam optimizer, learning rate 5e-5, batch size 64. The experiment was conducted on a MacBook Pro (2021) using the PyTorch deep learning framework.
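The sketch below reproduces this training setup under stated assumptions: `dev_dataset`, `model`, and `evaluate()` are placeholders (a tokenized dataset yielding dict batches, a classifier such as the one sketched above, and an accuracy helper), not names taken from the released code.

```python
# Sketch of the training setup: 8:2 split, Adam with lr 5e-5, batch size 64,
# early stopping after 3 epochs without validation-accuracy improvement.
import torch
from torch.utils.data import DataLoader, random_split

train_size = int(0.8 * len(dev_dataset))  # dev_dataset: placeholder dataset
train_set, val_set = random_split(dev_dataset, [train_size, len(dev_dataset) - train_size])
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
val_loader = DataLoader(val_set, batch_size=64)

optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)  # model: placeholder classifier

best_acc, patience, epochs_without_improvement = 0.0, 3, 0
for epoch in range(100):
    model.train()
    for batch in train_loader:  # batch: dict with input_ids, attention_mask, labels
        optimizer.zero_grad()
        loss, _ = model(**batch)
        loss.backward()
        optimizer.step()

    val_acc = evaluate(model, val_loader)  # evaluate(): placeholder accuracy helper
    if val_acc > best_acc:
        best_acc, epochs_without_improvement = val_acc, 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:  # early stopping
            break
```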
To ensure the reliability of the results, we ran each model three times with different random seeds and report both the maximum and the average accuracy (the average is given in parentheses).
Model | Validation Set Accuracy | Test Set Accuracy |
---|---|---|
BERT | 0.8558 (0.8533) | 0.8588 (0.8529) |
BERT-wwm | 0.8599 (0.8569) | 0.8568 (0.8553) |
BERT-wwm-ext | 0.8608 (0.8592) | 0.8636 (0.8592) |
RoBERTa-wwm-ext | 0.8637 (0.8604) | 0.8608 (0.8588) |
The above experimental results indicate that the BERT-wwm-ext model achieved the best performance on the test set, with an accuracy of 86.36%. The RoBERTa-wwm-ext model performed best on the validation set, with an accuracy of 86.37%.
The experiment code has been open-sourced and is available at: https://github.com/SignorinoY/bert-classification.