Optimizing Named Entity Recognition (NER) with BERT: A Case Study
In the ever-evolving field of Natural Language Processing (NLP), Named Entity Recognition (NER) remains a crucial task with applications ranging from information retrieval to automated customer support. Recently, I had the opportunity to work on an interesting NER case study that allowed me to explore both traditional machine learning techniques and cutting-edge deep learning models.
Project Overview
The goal of this case study was to build an NER model capable of tagging entities in text, such as names of persons, organizations, locations, and more. The dataset consisted of sentences with corresponding words labelled using the IOB2 tagging scheme.
Here’s a quick look at the structure of the dataset:
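In the IOB2 scheme, the B- prefix marks the first token of an entity, I- marks tokens that continue it, and O marks tokens outside any entity. The snippet below is a purely illustrative example of that layout (hypothetical tokens, not rows from the actual dataset):

# Hypothetical IOB2-labelled sentence for illustration (not from the dataset)
sentence = [
    ("John",   "B-PER"),  # first token of a person entity
    ("Smith",  "I-PER"),  # continuation of the same entity
    ("joined", "O"),      # outside any entity
    ("Google", "B-ORG"),  # organization entity
    ("in",     "O"),
    ("London", "B-LOC"),  # location entity
    (".",      "O"),
]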
Data Preprocessing
Before diving into model development, the data required significant preprocessing: tokenizing each sentence, converting tokens to their vocabulary IDs, padding every sequence to a fixed length, and encoding the IOB2 tags as integer labels.
Here’s a snippet of the code used for tokenization and padding:
from keras.preprocessing.sequence import pad_sequences
from transformers import BertTokenizer

MAX_LEN = 128  # maximum sequence length (example value)

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize sentences (assumes `sentences` is a list of raw sentence strings)
tokenized_texts = [tokenizer.tokenize(sent) for sent in sentences]

# Convert tokens to their corresponding vocabulary IDs
input_ids = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_texts]

# Pad (or truncate) every sequence to MAX_LEN
input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, dtype="long",
                          truncating="post", padding="post")
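The tag sequences need the same treatment so that the labels stay aligned with the padded inputs. Here is a minimal sketch of that step, assuming labels holds the per-sentence IOB2 tag lists and tag2idx maps each tag to an integer ID:

# Encode the tags as integer IDs and pad to MAX_LEN
# (sketch; the name `labels` and padding with the "O" tag are assumptions)
tag_ids = [[tag2idx[tag] for tag in sent_tags] for sent_tags in labels]
tag_ids = pad_sequences(tag_ids, maxlen=MAX_LEN, value=tag2idx["O"],
                        dtype="long", truncating="post", padding="post")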
Baseline Model: Conditional Random Fields (CRF)
To set a benchmark, I first developed a baseline model using Conditional Random Fields (CRF). CRFs are well-suited for sequence tagging tasks like NER due to their ability to model dependencies between labels.
import sklearn_crfsuite
from sklearn_crfsuite import metrics
# Define the feature extraction function
def word2features(sent, i):
    word = sent[i][0]
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
    }
    if i > 0:
        word1 = sent[i-1][0]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
        })
    else:
        features['BOS'] = True  # Beginning of a sentence
    if i < len(sent)-1:
        word1 = sent[i+1][0]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
        })
    else:
        features['EOS'] = True  # End of a sentence
    return features
# Helpers to turn a whole sentence into features / labels
# (assumes each sentence is a list of (word, tag) pairs)
def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [tag for _, tag in sent]

# Apply the feature extraction function to the dataset
X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]
X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]
# Initialize and train the CRF model
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True
)
crf.fit(X_train, y_train)
# Predict and evaluate the model
y_pred = crf.predict(X_test)
print(metrics.flat_classification_report(y_test, y_pred))
Results: The CRF model performed decently, but it had limitations in handling long-range dependencies and complex sentence structures. Here's a snapshot of its performance (sample tags only):
Enhanced Model: BERT-Based NER
To overcome the limitations of the CRF model, I implemented an enhanced model using BERT (Bidirectional Encoder Representations from Transformers). BERT's contextualized embeddings provide a more nuanced understanding of each word in its context, significantly boosting the model's ability to tag entities accurately.
Here’s a snippet of the code used to fine-tune the BERT model:
from transformers import BertForTokenClassification, AdamW
# Load the pre-trained BERT NER checkpoint
# (if len(tag2idx) differs from the checkpoint's own label set,
# from_pretrained also needs ignore_mismatched_sizes=True)
model = BertForTokenClassification.from_pretrained(
    "dslim/bert-base-NER",
    num_labels=len(tag2idx),
    output_attentions=False,
    output_hidden_states=False
)
# Optimizer
optimizer = AdamW(model.parameters(), lr=2e-5, eps=1e-8)
# Training loop
epochs = 3  # example value; adjust for your dataset

model.train()
for epoch in range(epochs):
    for step, batch in enumerate(train_dataloader):
        b_input_ids, b_labels = batch
        optimizer.zero_grad()
        outputs = model(b_input_ids, labels=b_labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
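Before comparing numbers, the fine-tuned model can be run over the held-out split to collect predicted tags. A minimal sketch of that step, assuming val_dataloader is built the same way as train_dataloader:

import torch

# Invert the label map so predicted IDs can be turned back into IOB2 tags
idx2tag = {idx: tag for tag, idx in tag2idx.items()}

model.eval()
pred_tags, true_tags = [], []
with torch.no_grad():
    for b_input_ids, b_labels in val_dataloader:
        logits = model(b_input_ids).logits
        for preds, golds in zip(logits.argmax(dim=-1).tolist(), b_labels.tolist()):
            pred_tags.append([idx2tag[p] for p in preds])
            true_tags.append([idx2tag[g] for g in golds])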
Results: The BERT model showed a substantial improvement over the CRF baseline. Here's how it performed (sample tags only):
Model Deployment
After fine-tuning the model, I deployed it as a REST API using Flask. The deployment architecture ensured scalability and robustness, with real-time monitoring to track the model's performance over time.
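As an illustration of what such an endpoint can look like, here is a minimal Flask sketch (the model path, route name, and response format below are assumptions, not the production script):

import torch
from flask import Flask, jsonify, request
from transformers import BertForTokenClassification, BertTokenizer

MODEL_DIR = "Models/BERT_Model/Saved_Models"  # illustrative path

app = Flask(__name__)
tokenizer = BertTokenizer.from_pretrained(MODEL_DIR)
model = BertForTokenClassification.from_pretrained(MODEL_DIR)
model.eval()

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON body like {"text": "Some sentence to tag"}
    text = request.get_json()["text"]
    encoding = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**encoding).logits
    pred_ids = logits.argmax(dim=-1)[0].tolist()
    tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"][0])
    tags = [model.config.id2label[i] for i in pred_ids]
    return jsonify(list(zip(tokens, tags)))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)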
Code Architecture:
NER/
│
├── Models/
│   ├── Baseline_Model/
│   │   ├── Saved_Models/
│   │   │   └── crf_model_2024_08_14.pkl
│   │   └── results/
│   └── BERT_Model/
│       ├── Saved_Models/
│       │   ├── config.json
│       │   └── model.safetensors
│       └── results/
│
├── src/
│   ├── data_processing.py
│   └── model_training.py
│
├── Deployment/
│   └── deployment_script.py
│
├── Data/
│   └── Processed/
│       ├── train_data.csv
│       ├── val_data.csv
│       └── test_data.csv
│
└── README.md
Next Steps
While the BERT model performed admirably, there's always room for improvement. Future steps include:
Conclusion
This case study reinforced the value of leveraging advanced models like BERT in NLP tasks. The significant performance boost observed in NER tagging underscores the importance of contextual embeddings and transfer learning. As we continue to develop more sophisticated NLP solutions, models like BERT will undoubtedly play a pivotal role in transforming how we process and understand text.