Tokenization splits text into smaller units, called tokens, for an LLM to 'digest.' It also converts those tokens into numbers, because LLMs work with numbers, not raw text. Every model has its own tokenizer, so when you use a model, make sure you use its matching tokenizer; otherwise the model's output can be wrong.
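To see why the matching tokenizer matters, here is a minimal sketch (using the small, public gpt2 and bert-base-uncased checkpoints purely for illustration; they are not the models in the demo below). The same text maps to completely different ids under different tokenizers:
from transformers import AutoTokenizer
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "I am in Sydney."
print(gpt2_tok(text)["input_ids"])  # one sequence of ids
print(bert_tok(text)["input_ids"])  # a different sequence, plus special tokens like [CLS] and [SEP]
Feeding one model's ids into the other would produce garbage, because each model was trained on its own vocabulary.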
Let's see how a tokenizer works with a demo. I am using Google Colab. First, we install some prerequisite libraries.
We are using AutoTokenizer from the transformers library, which automatically finds the right tokenizer for your model.
Let's tokenize some text:
!pip install transformers
!pip install datasets
!pip install huggingface_hub
import pandas as pd
import datasets
from pprint import pprint
from transformers import AutoTokenizer
from huggingface_hub import notebook_login

# Log in with your Hugging Face token so gated/private checkpoints can be downloaded.
notebook_login()
tokenizer = AutoTokenizer.from_pretrained("TinyPixel/Llama-2-7B-bf16-sharded")
# Alternatively: tokenizer = AutoTokenizer.from_pretrained("stabilityai/stablecode-instruct-alpha-3b")
text = "I am in Sydney."
tokenized_text = tokenizer(text)["input_ids"]
tokenized_text
untokenized_text = tokenizer.decode(tokenized_text)
untokenized_text
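To see the units the text was actually split into, rather than just their ids, map each id back to its token string. For this Llama-style tokenizer, the pieces are subwords with a leading '▁' marking a word boundary:
tokens = tokenizer.convert_ids_to_tokens(tokenized_text)
tokens  # e.g. a start-of-sequence token followed by pieces like '▁I', '▁am', '▁in', '▁Sydney', '.'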
In the real world there will be a lot of text, so let's see an example of that:
list_text = ["I am in Sydney", "Near Bronte Beach", "Not near Blue mountains", "wow"]
tokenized_text = tokenizer(list_text)
tokenized_text["input_ids"]
As you can see, the lists in this output are not all the same length. A model needs every list of tokens to be the same length, because a batch is stored as a single fixed-shape tensor. So the next step is to make all of these lists the same size: we find the length of the longest list, then expand every shorter list to that length by appending a special padding token. This process is called padding. Let's see an example:
# The Llama tokenizer has no padding token by default, so we reuse the end-of-sequence token.
tokenizer.pad_token = tokenizer.eos_token
tokenized_texts_longest = tokenizer(list_text, padding=True)
tokenized_texts_longest["input_ids"]
Another thing to know is that every model has a maximum context length, which is the limit on the number of tokens it can take in. So we also need to truncate the tokens to that maximum length. This is how you do it (max_length=3 here is just to keep the demo short):
tokenized_texts_final = tokenizer(list_text, max_length=3, truncation=True, padding=True)
tokenized_texts_final["input_ids"]