Optimizing Named Entity Recognition (NER) with BERT: A Case Study
In the ever-evolving field of Natural Language Processing (NLP), Named Entity Recognition (NER) remains a crucial task with applications ranging from information retrieval to automated customer support. Recently, I had the opportunity to work on an interesting NER case study that allowed me to explore both traditional machine learning techniques and cutting-edge deep learning models.
Project Overview
The goal of this case study was to build an NER model capable of tagging entities in text, such as names of persons, organizations, locations, and more. The dataset consisted of sentences with corresponding words labelled using the IOB2 tagging scheme.
Here’s a quick look at the structure of the dataset:
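In the IOB2 scheme, the B- prefix marks the first token of an entity, I- marks tokens that continue it, and O marks tokens outside any entity. The snippet below is a purely illustrative example of that layout (hypothetical tokens, not rows from the actual dataset):

# Hypothetical IOB2-labelled sentence for illustration (not from the dataset)
sentence = [
    ("John",   "B-PER"),  # first token of a person entity
    ("Smith",  "I-PER"),  # continuation of the same entity
    ("joined", "O"),      # outside any entity
    ("Google", "B-ORG"),  # organization entity
    ("in",     "O"),
    ("London", "B-LOC"),  # location entity
    (".",      "O"),
]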
Data Preprocessing
Before diving into model development, the data required significant preprocessing: tokenizing each sentence, converting tokens to their vocabulary IDs, padding every sequence to a fixed length, and encoding the IOB2 tags as integer labels.
Here’s a snippet of the code used for tokenization and padding:
from keras.preprocessing.sequence import pad_sequences
from transformers import BertTokenizer

MAX_LEN = 128  # maximum sequence length (example value)

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize sentences (assumes `sentences` is a list of raw sentence strings)
tokenized_texts = [tokenizer.tokenize(sent) for sent in sentences]

# Convert tokens to their corresponding vocabulary IDs
input_ids = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_texts]

# Pad (or truncate) every sequence to MAX_LEN
input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, dtype="long",
                          truncating="post", padding="post")
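The tag sequences need the same treatment so that the labels stay aligned with the padded inputs. Here is a minimal sketch of that step, assuming labels holds the per-sentence IOB2 tag lists and tag2idx maps each tag to an integer ID:

# Encode the tags as integer IDs and pad to MAX_LEN
# (sketch; the name `labels` and padding with the "O" tag are assumptions)
tag_ids = [[tag2idx[tag] for tag in sent_tags] for sent_tags in labels]
tag_ids = pad_sequences(tag_ids, maxlen=MAX_LEN, value=tag2idx["O"],
                        dtype="long", truncating="post", padding="post")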
Baseline Model: Conditional Random Fields (CRF)
To set a benchmark, I first developed a baseline model using Conditional Random Fields (CRF). CRFs are well-suited for sequence tagging tasks like NER due to their ability to model dependencies between labels.
import sklearn_crfsuite
from sklearn_crfsuite import metrics
# Define the feature extraction function
def word2features(sent, i):
    word = sent[i][0]
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
    }
    if i > 0:
        word1 = sent[i-1][0]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
        })
    else:
        features['BOS'] = True  # Beginning of a sentence
    if i < len(sent)-1:
        word1 = sent[i+1][0]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
        })
    else:
        features['EOS'] = True  # End of a sentence
    return features
# Helpers to turn a whole sentence into features / labels
# (assumes each sentence is a list of (word, tag) pairs)
def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [tag for _, tag in sent]

# Apply the feature extraction function to the dataset
X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]
X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]
# Initialize and train the CRF model
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True
)
crf.fit(X_train, y_train)
# Predict and evaluate the model
y_pred = crf.predict(X_test)
print(metrics.flat_classification_report(y_test, y_pred))
Results: The CRF model performed decently, but it had limitations in handling long-range dependencies and complex sentence structures. Here's a snapshot of its performance (sample tags only):
Enhanced Model: BERT-Based NER
To overcome the limitations of the CRF model, I implemented an enhanced model using BERT (Bidirectional Encoder Representations from Transformers). BERT's contextualized embeddings provide a more nuanced understanding of each word in its context, significantly boosting the model's ability to tag entities accurately.
Here’s a snippet of the code used to fine-tune the BERT model:
from transformers import BertForTokenClassification, AdamW
# Load the pre-trained BERT NER checkpoint
# (if len(tag2idx) differs from the checkpoint's own label set,
# from_pretrained also needs ignore_mismatched_sizes=True)
model = BertForTokenClassification.from_pretrained(
    "dslim/bert-base-NER",
    num_labels=len(tag2idx),
    output_attentions=False,
    output_hidden_states=False
)
# Optimizer
optimizer = AdamW(model.parameters(), lr=2e-5, eps=1e-8)
# Training loop
epochs = 3  # example value; adjust for your dataset

model.train()
for epoch in range(epochs):
    for step, batch in enumerate(train_dataloader):
        b_input_ids, b_labels = batch
        optimizer.zero_grad()
        outputs = model(b_input_ids, labels=b_labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
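Before comparing numbers, the fine-tuned model can be run over the held-out split to collect predicted tags. A minimal sketch of that step, assuming val_dataloader is built the same way as train_dataloader:

import torch

# Invert the label map so predicted IDs can be turned back into IOB2 tags
idx2tag = {idx: tag for tag, idx in tag2idx.items()}

model.eval()
pred_tags, true_tags = [], []
with torch.no_grad():
    for b_input_ids, b_labels in val_dataloader:
        logits = model(b_input_ids).logits
        for preds, golds in zip(logits.argmax(dim=-1).tolist(), b_labels.tolist()):
            pred_tags.append([idx2tag[p] for p in preds])
            true_tags.append([idx2tag[g] for g in golds])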
Results: The BERT model showed a substantial improvement over the CRF baseline. Here's how it performed (sample tags only):
Model Deployment
After fine-tuning the model, I deployed it as a REST API using Flask. The deployment architecture ensured scalability and robustness, with real-time monitoring to track the model's performance over time.
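As an illustration of what such an endpoint can look like, here is a minimal Flask sketch (the model path, route name, and response format below are assumptions, not the production script):

import torch
from flask import Flask, jsonify, request
from transformers import BertForTokenClassification, BertTokenizer

MODEL_DIR = "Models/BERT_Model/Saved_Models"  # illustrative path

app = Flask(__name__)
tokenizer = BertTokenizer.from_pretrained(MODEL_DIR)
model = BertForTokenClassification.from_pretrained(MODEL_DIR)
model.eval()

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON body like {"text": "Some sentence to tag"}
    text = request.get_json()["text"]
    encoding = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**encoding).logits
    pred_ids = logits.argmax(dim=-1)[0].tolist()
    tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"][0])
    tags = [model.config.id2label[i] for i in pred_ids]
    return jsonify(list(zip(tokens, tags)))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)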
Code Architecture:
NER/
│
├── Models/
│   ├── Baseline_Model/
│   │   ├── Saved_Models/
│   │   │   └── crf_model_2024_08_14.pkl
│   │   └── results/
│   └── BERT_Model/
│       ├── Saved_Models/
│       │   ├── config.json
│       │   └── model.safetensors
│       └── results/
│
├── src/
│   ├── data_processing.py
│   └── model_training.py
│
├── Deployment/
│   └── deployment_script.py
│
├── Data/
│   └── Processed/
│       ├── train_data.csv
│       ├── val_data.csv
│       └── test_data.csv
│
└── README.md
Next Steps
While the BERT model performed admirably, there's always room for improvement. Future steps include:
Conclusion
This case study reinforced the value of leveraging advanced models like BERT in NLP tasks. The significant performance boost observed in NER tagging underscores the importance of contextual embeddings and transfer learning. As we continue to develop more sophisticated NLP solutions, models like BERT will undoubtedly play a pivotal role in transforming how we process and understand text.