From the course: AI Sentiment Analysis with PyTorch and Hugging Face Transformers
Tokenization
- [Instructor] Tokenization is the process of breaking down larger pieces of text, such as sentences or paragraphs, into smaller units called tokens. For example, the sentence "I love movies" is split into the tokens I, love, and movies. To create a tokenizer, we use the AutoTokenizer class from Hugging Face, initializing it with the DistilBERT model name distilbert-base-uncased. Next, let's define a tokenization function that uses this tokenizer. The tokenizer processes each item in the dataset; the padding, truncation, and max_length parameters ensure that all sequences are the same length and contain no more than 128 tokens. These steps are necessary because our model expects fixed-size input. Finally, we apply the tokenization function to both the training and test datasets using the map function from the datasets library, which tokenizes the data efficiently in batches. The dataset is now transformed and ready for model training. In the next video, we'll introduce…