From the course: AI Sentiment Analysis with PyTorch and Hugging Face Transformers


Tokenization

- [Instructor] Tokenization is the process of breaking down larger pieces of text, like sentences or paragraphs, into smaller units called tokens. For example, the sentence "I love movies" is split into the tokens "I," "love," and "movies." To create a tokenizer, we use the AutoTokenizer class from Hugging Face, initializing it with the DistilBERT model name distilbert-base-uncased. Now, let's define a tokenization function with the tokenizer. The tokenizer here processes each item in the dataset; the parameters for padding, truncation, and max_length make sure all sequences are the same length and no longer than 128 tokens. These steps are necessary because our model expects a fixed-size input. Now, we apply the tokenize function to both the training and test datasets using the map function from the datasets library, which tokenizes the data efficiently in batches. The dataset is now transformed and ready for model training. In the next video, we'll introduce…
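The steps described above can be sketched roughly as follows. The real calls use the transformers and datasets libraries (shown in comments); since those require a model download, the runnable part below is a toy whitespace tokenizer, included only to illustrate why padding and truncation produce the fixed-size input the model expects. The dataset column name "text" and the toy id scheme are assumptions for illustration.

```python
# Real API as described in the video (requires the transformers and
# datasets libraries plus a model download, so shown here as comments):
#
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
#
# def tokenize(batch):
#     # "text" column name is an assumption about the dataset
#     return tokenizer(batch["text"], padding="max_length",
#                      truncation=True, max_length=128)
#
# train_dataset = train_dataset.map(tokenize, batched=True)
# test_dataset = test_dataset.map(tokenize, batched=True)

# Toy illustration of the fixed-size idea (not the real tokenizer):
MAX_LENGTH = 8  # kept small for display; the course uses 128
PAD_ID = 0

def toy_tokenize(text):
    # split on whitespace and map each word to a dummy integer id
    ids = [hash(word) % 1000 + 1 for word in text.lower().split()]
    ids = ids[:MAX_LENGTH]                   # truncation: cap the length
    ids += [PAD_ID] * (MAX_LENGTH - len(ids))  # padding: fill to the cap
    return ids

short = toy_tokenize("I love movies")
long = toy_tokenize("a very long sentence that runs past the maximum length")
print(len(short), len(long))  # both are exactly MAX_LENGTH
```

Whatever the input length, every output sequence comes back exactly MAX_LENGTH tokens long, which is the property the padding, truncation, and max_length parameters guarantee for the model.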
