Base
LLM Auto-regressive Decoding
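Autoregressive Decoding
In natural language processing (NLP), Transformer-based large language models (LLMs) generate text at inference time by autoregressive decoding: each new word the model generates depends on the words it has generated so far.
Autoregressive decoding can be written as the following steps:
Initialize the sequence: let $(x_1, x_2, \ldots, x_{t-1})$ be the sequence of words generated so far.
Compute the probability distribution of the next word: use the language model to compute the distribution of the next word given the context:
$$P(x_{t} \mid x_1, x_2, \ldots, x_{t-1})$$
This step usually ends with a softmax function, which converts the logits over the vocabulary into a probability distribution.
Select the next word: choose the next word $x_t$ according to this distribution. Several strategies are possible, such as:
Greedy decoding: choose the word with the highest probability, $x_t = \arg\max_{x} P(x \mid x_1, x_2, \ldots, x_{t-1})$.
Sampling: draw the word at random according to the probability distribution, which allows more diverse text.
Beam search: maintain a beam of width $k$; at each step keep the $k$ most probable partial sequences as candidates, then choose the final sequence from among these candidates.
Update the sequence: append the selected word $x_t$ to the sequence.
Repeat the steps above until an end-of-sequence token is produced or the text reaches the desired length.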
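The steps above can be sketched in a few lines of Python. This is a toy illustration rather than a real model: VOCAB, TOY_LOGITS and next_word_logits are made-up stand-ins, and the toy model looks only at the last word, whereas a real LLM conditions on the whole prefix. Beam search is left out for brevity.

```python
import math
import random

# Hypothetical toy vocabulary and "model": logits for the next word,
# keyed by the previous word (None means start of sequence).
VOCAB = ["the", "cat", "sat", "<eos>"]
TOY_LOGITS = {
    None: [2.0, 0.5, 0.1, -1.0],
    "the": [-1.0, 2.5, 0.3, -0.5],
    "cat": [-1.0, -1.0, 2.0, 0.5],
    "sat": [0.0, -1.0, -1.0, 3.0],
}


def next_word_logits(prefix):
    """Stand-in for a language model forward pass."""
    last = prefix[-1] if prefix else None
    return TOY_LOGITS[last]


def softmax(logits):
    """Turn logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generate(strategy="greedy", max_len=10, seed=0):
    """Decode until <eos> or max_len words, one word per step."""
    rng = random.Random(seed)
    prefix = []
    while len(prefix) < max_len:
        probs = softmax(next_word_logits(prefix))
        if strategy == "greedy":
            idx = max(range(len(probs)), key=probs.__getitem__)
        else:  # sampling: draw an index in proportion to its probability
            idx = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        word = VOCAB[idx]
        if word == "<eos>":
            break
        prefix.append(word)  # update the sequence and repeat
    return prefix


print(generate("greedy"))
print(generate("sample", seed=1))
```

Greedy decoding always returns the same sequence for a given model, while the sampled output changes with the seed.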
CODE
load in 4/8 bit
When a model is too large to load onto an ordinary GPU, it can be loaded in 4-bit or 8-bit precision for training.
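A minimal sketch of quantized loading, assuming the transformers and bitsandbytes packages and a CUDA GPU. The helper names weight_memory_gb and load_quantized are hypothetical, and the model name in the usage comment is only an example. The first function shows the arithmetic behind the saving: weight memory scales with the number of bits per parameter.

```python
def weight_memory_gb(n_params: float, bits: int) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return n_params * bits / 8 / 1e9


def load_quantized(model_name: str, bits: int = 4):
    """Load a causal LM with 4-bit or 8-bit weights (hypothetical helper)."""
    if bits not in (4, 8):
        raise ValueError("bits must be 4 or 8")
    # Imported here so the estimate above works without these packages.
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_4bit=(bits == 4),
        load_in_8bit=(bits == 8),
    )
    return AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quant_config,
        device_map="auto",  # let accelerate place the layers
    )


# A 7B-parameter model: about 14 GB in 16-bit, about 3.5 GB in 4-bit.
print(weight_memory_gb(7e9, 16), weight_memory_gb(7e9, 4))
# model = load_quantized("lmsys/vicuna-7b-v1.5", bits=4)  # example model name
```

The estimate covers weights only; activations, the KV cache, and optimizer state need additional memory.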
inference in multi-GPU
https://huggingface.co/docs/accelerate/main/en/usage_guides/big_modeling#designing-a-device-map
Using infer_auto_device_map is recommended.
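A sketch of how infer_auto_device_map might be used, following the pattern in the guide linked above. The memory budgets in MAX_MEMORY, the default no_split value (appropriate for LLaMA-style models such as Vicuna), and the helper name load_across_gpus are all assumptions to adapt to your hardware and model.

```python
# Per-device memory budget: keys are GPU indices or "cpu". Values are assumptions.
MAX_MEMORY = {0: "10GiB", 1: "10GiB", "cpu": "30GiB"}


def load_across_gpus(model_name, max_memory=MAX_MEMORY, no_split=("LlamaDecoderLayer",)):
    """Build a device map from an empty model, then load the real weights with it."""
    # Imported here so the module can be read and imported without these packages.
    from accelerate import infer_auto_device_map, init_empty_weights
    from transformers import AutoConfig, AutoModelForCausalLM

    config = AutoConfig.from_pretrained(model_name)
    with init_empty_weights():
        # Creates the module tree without allocating memory for the weights.
        empty_model = AutoModelForCausalLM.from_config(config)

    device_map = infer_auto_device_map(
        empty_model,
        max_memory=max_memory,
        # Keep each block on a single device so residual connections are not split.
        no_split_module_classes=list(no_split),
    )
    model = AutoModelForCausalLM.from_pretrained(model_name, device_map=device_map)
    return model, device_map
```

Setting the GPU budgets somewhat below the physical memory leaves room for activations during generation.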
Errors
piece id is out of range.
OpenCLIP
BERT
In this article, I will demonstrate how to use BERT with the Hugging Face Transformers library for four important tasks. I will also show how to configure BERT for any other task you may want to use it for, beyond the standard tasks it was designed to solve.
Note that this article was written in January 2021, so earlier/future versions of the Hugging Face library may be a little different and the code in this article may not necessarily work.
A quick review of the architecture of BERT
BERT is a bidirectional transformer pre-trained using a combination of masked language modeling and next sentence prediction. The core of BERT is the stack of bidirectional encoders from the transformer model; during pre-training, a masked language modeling head and a next sentence prediction head are added on top. By "head" I mean a few extra layers added onto BERT that produce a task-specific output. The raw output of BERT is the output of the stacked bidirectional encoders. This matters because it lets you do essentially anything with BERT, as the examples later in the article show.
Hugging Face provides BERT models for many tasks, but the ones covered in this article are Masked Language Modeling, Next Sentence Prediction, Language Modeling, and Question Answering. I will also demonstrate how to configure BERT for tasks other than the ones Hugging Face provides.
Before I discuss those tasks, I will describe how to use the BERT Tokenizer.
BERT Tokenizer
The BERT tokenizer is the tokenizer that goes with BERT, and it covers most tokenization needs. You can download it with BertTokenizer.from_pretrained(), passing a checkpoint name such as bert-base-uncased.
Unlike the BERT models, you do not need a different tokenizer for each task-specific model. The same tokenizer works for all of the BERT models that Hugging Face provides, as long as they share a checkpoint vocabulary (for example, bert-base-uncased).
Given a text input, I generally tokenize it in projects with tokenizer.encode_plus() and the following parameters.
BERT accepts at most 512 tokens at a time, so truncation must be set to True. add_special_tokens tells the tokenizer to add BERT's special tokens: [CLS] at the start and [SEP] at the end. return_tensors="pt" makes the tokenizer return PyTorch tensors; if you would rather get plain Python lists, remove this parameter.
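The call described above might look like the following sketch. The wrapper name encode_for_bert is hypothetical, and the checkpoint name in the usage comment is an assumption; the tokenizer is passed in as an argument so the function itself does not depend on a particular checkpoint.

```python
def encode_for_bert(tokenizer, text, max_length=512):
    """Tokenize one text with the parameters discussed above."""
    return tokenizer.encode_plus(
        text,
        add_special_tokens=True,     # add [CLS] at the start and [SEP] at the end
        truncation=True,             # cut inputs longer than max_length
        max_length=max_length,       # BERT's limit is 512 tokens
        padding="max_length",        # pad shorter inputs up to max_length
        return_attention_mask=True,  # 1 for real tokens, 0 for padding
        return_tensors="pt",         # PyTorch tensors; drop this to get lists
    )


# Usage (downloads the vocabulary on first run):
# from transformers import BertTokenizer
# tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# encoding = encode_for_bert(tokenizer, "Hello, BERT!")
# print(encoding["input_ids"].shape)
```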
The code in the rest of this article does not pass all of the parameters listed above, because it is not tokenizing text for a real project. In a real machine learning/NLP project you will want them, especially truncation and padding, which must be applied to every batch in the dataset.
tokenizer.encode_plus() returns a dictionary rather than a plain list, because it can return several kinds of information, such as the attention mask and the token type ids. Each part of the encoding is retrieved by key, for example encoding["input_ids"] or encoding["attention_mask"].
Because the tokenizer returns a dictionary, you do not have to pull out these values and pass them to the model one by one; the whole encoding can be unpacked into the call, as in model(**encoding).
One more important thing to know about the tokenizer is that it exposes its special tokens as attributes. For example, if you are doing masked language modeling and want to insert a mask at some position for the model to fill in, you can get the mask token as tokenizer.mask_token and insert it into your input by concatenating it with your input text.
Other special tokens, such as the [SEP] token (tokenizer.sep_token), can be retrieved in the same way.
I typically tokenize my input with tokenizer.encode_plus(), but tokenizer.encode() can also be called on the input text.
The main difference between the two is that tokenizer.encode_plus() returns more information: the input ids, the attention mask, and the token type ids, all in one dictionary. tokenizer.encode() returns only the input ids, as a list by default or as a tensor when return_tensors="pt" is passed.
Masked Language Modeling
Masked Language Modeling is the task of decoding a masked token in a sentence. In simple terms, it is the task of filling in the blanks.
Instead of taking only the best candidate word for the mask token, this section shows how to take the top 10 replacement words.
Hugging Face is set up so that each task with a pre-trained model has its own model class that you download and import. In this case that class is BertForMaskedLM, whereas the tokenizer is the same for all of the models, as noted in the section above.
Masked language modeling works by inserting a mask token at the position where you want the model to predict a word. You can insert the mask token by concatenating it into your input text at that position. BertForMaskedLM then predicts which token in its vocabulary best replaces the mask.
The logits are the output of the model before a softmax is applied. In versions of the library before 4.0, return_dict=True had to be passed when loading the model so that the output exposes a logits attribute; without it, output.logits raises an AttributeError at runtime. From version 4.0 on, return_dict=True is the default. After passing the encoding to the model, output.logits gives a tensor of scores, and applying a softmax to it yields a probability distribution over BERT's vocabulary at each position. Words with higher probability are better candidates to replace the mask token.
To select the distribution at the mask position, find the index of the mask token with torch.where(). torch.topk() then returns the k largest values in a tensor together with their indices; here k = 10, and a larger k returns more candidates. Finally, iterate over the top-k token ids, decode each one, and substitute it for the mask token in the sentence. Here is the output this procedure produces:
The capital of France, paris, contains the Eiffel Tower.
The capital of France, lyon, contains the Eiffel Tower.
The capital of France, lille, contains the Eiffel Tower.
The capital of France, toulouse, contains the Eiffel Tower.
The capital of France, marseille, contains the Eiffel Tower.
The capital of France, orleans, contains the Eiffel Tower.
The capital of France, strasbourg, contains the Eiffel Tower.
The capital of France, nice, contains the Eiffel Tower.
The capital of France, cannes, contains the Eiffel Tower.
The capital of France, versailles, contains the Eiffel Tower.
and you can see that Paris is indeed the top candidate replacement word for the mask token.
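A minimal sketch of the procedure described in this section, assuming the bert-base-uncased checkpoint. To keep the arithmetic visible it uses plain-Python softmax and top-k helpers in place of torch.softmax() and torch.topk(), and a list lookup in place of torch.where(); the helper names are hypothetical.

```python
import math


def softmax(logits):
    """Turn a list of logits into probabilities."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_indices(values, k):
    """Indices of the k largest values, best first."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]


def fill_mask(text, tokenizer, model, k=10):
    """Return k copies of text with the mask token replaced by a candidate."""
    enc = tokenizer.encode_plus(text, return_tensors="pt")
    ids = enc["input_ids"][0].tolist()
    mask_pos = ids.index(tokenizer.mask_token_id)       # position of [MASK]
    logits = model(**enc).logits[0, mask_pos].tolist()  # one score per vocabulary entry
    probs = softmax(logits)
    return [
        text.replace(tokenizer.mask_token, tokenizer.decode([token_id]))
        for token_id in top_k_indices(probs, k)
    ]


# Usage (downloads the model on first run):
# from transformers import BertForMaskedLM, BertTokenizer
# tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# model = BertForMaskedLM.from_pretrained("bert-base-uncased")
# text = "The capital of France, " + tokenizer.mask_token + ", contains the Eiffel Tower."
# for sentence in fill_mask(text, tokenizer, model):
#     print(sentence)
```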
If you only want the top candidate word, use torch.argmax() in place of torch.topk(): it returns the index of the maximum value in the tensor. The rest of the code stays essentially the same.
Language Modeling
Language Modeling is the task of predicting the best word to follow or continue a sentence given all the words already in the sentence.
Language modeling works much like masked language modeling. First we load BertLMHeadModel, which is a BERT model with a language modeling head on top. When instantiating it we also pass is_decoder=True, which is needed to use the model on its own to predict the next word in a sequence. The rest of the code is nearly the same as for masked language modeling: we take the model's logits, but instead of indexing at the mask token we take the logits at the last position (index -1), apply a softmax, find the token with the largest probability, and decode and print it.
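A short sketch of next-word prediction along these lines. The function name is hypothetical, and passing add_special_tokens=False is an assumption made here so that the last position is the last word of the text rather than [SEP]. Because softmax preserves ordering, the argmax over the logits picks the same token as the argmax over the probabilities.

```python
def predict_next_token(text, tokenizer, model):
    """Return the most likely next token after text."""
    enc = tokenizer.encode_plus(text, add_special_tokens=False, return_tensors="pt")
    last_logits = model(**enc).logits[0, -1].tolist()  # scores at the last position
    best_id = max(range(len(last_logits)), key=last_logits.__getitem__)
    return tokenizer.decode([best_id])


# Usage (downloads the model on first run):
# from transformers import BertLMHeadModel, BertTokenizer
# tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# model = BertLMHeadModel.from_pretrained("bert-base-uncased", is_decoder=True)
# print(predict_next_token("The child came home from", tokenizer, model))
```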
Next Sentence Prediction
Next Sentence Prediction is the task of predicting whether one sentence follows another, that is, how good a sentence is as the continuation of a given sentence.
In this example, "The child came home from school." is the given sentence, and we want to predict whether "He played soccer after school." is the next sentence. The BERT tokenizer automatically inserts a [SEP] token between the two sentences to mark the boundary, and the BertForNextSentencePrediction model returns two values in a tensor: the first scores the second sentence as a continuation of the first, and the second scores it as a random sequence, or not a good continuation. Unlike language modeling, there is no distribution over BERT's vocabulary here; we apply a softmax to just these two values to see which has the higher probability, and that tells us whether the second sentence is a good next sentence for the first. Printing the softmax values gave me:
tensor([[0.9953, 0.0047]])
Because the first value is considerably higher than the second, BERT believes that the second sentence follows the first, which is the correct answer.
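A sketch of the computation described above, assuming the bert-base-uncased checkpoint; the function name is hypothetical, and the softmax over the two scores is written out in plain Python.

```python
import math


def softmax(logits):
    """Turn a list of logits into probabilities."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_sentence_probs(first, second, tokenizer, model):
    """Return (P(second follows first), P(second is unrelated))."""
    # Passing two texts makes the tokenizer insert [SEP] between them.
    enc = tokenizer.encode_plus(first, second, return_tensors="pt")
    scores = model(**enc).logits[0].tolist()  # two raw scores
    p_next, p_random = softmax(scores)
    return p_next, p_random


# Usage (downloads the model on first run):
# from transformers import BertForNextSentencePrediction, BertTokenizer
# tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")
# print(next_sentence_probs("The child came home from school.",
#                           "He played soccer after school.", tokenizer, model))
```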
Extractive Question Answering
Extractive Question Answering is the task of answering a question from a given context by outputting the start and end indexes of the answer within that context.
As with the other three tasks, we begin by downloading the BERT model for question answering (BertForQuestionAnswering), and we tokenize our two inputs together: the question and the context. The process is relatively straightforward for this model, because it outputs a score for every token in the tokenized input.
As mentioned above, extractive question answering works by computing the best start and end indexes of the answer in the context. Each token in the input receives a start score and an end score, representing how good it would be as the first or the last token of the answer. The rest is similar to the other three programs: we compute the softmax of these scores to get probability distributions, take the highest-scoring position in each of the start and end tensors with torch.argmax(), and then decode and print the tokens in the start : end range of the input.
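A sketch of the span selection described above. The function names are hypothetical and the checkpoint in the usage comment is only an example; a checkpoint fine-tuned on a question answering dataset is needed, since the question answering head of a plain pre-trained BERT is untrained. The softmax is omitted because it does not change which position scores highest.

```python
def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


def answer_span(question, context, tokenizer, model):
    """Decode the tokens between the best start and end positions."""
    enc = tokenizer.encode_plus(question, context, return_tensors="pt")
    out = model(**enc)
    start = argmax(out.start_logits[0].tolist())  # best first token of the answer
    end = argmax(out.end_logits[0].tolist())      # best last token of the answer
    ids = enc["input_ids"][0].tolist()
    # If end < start the slice is empty and an empty string is returned.
    return tokenizer.decode(ids[start:end + 1])


# Usage (downloads the model on first run):
# from transformers import BertForQuestionAnswering, BertTokenizer
# name = "bert-large-uncased-whole-word-masking-finetuned-squad"  # example checkpoint
# tokenizer = BertTokenizer.from_pretrained(name)
# model = BertForQuestionAnswering.from_pretrained(name)
# print(answer_span("Where is the Eiffel Tower?",
#                   "The Eiffel Tower is in Paris.", tokenizer, model))
```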
Using BERT for any task you want
Although Text Summarization, Question Answering, and a basic Language Model are especially important, people often want to use BERT for other tasks, especially in research. They do this by taking the raw outputs of BERT's stacked encoders, attaching their own model on top (most commonly a linear layer), and fine-tuning the combined model on their own dataset. When doing this in PyTorch with the Hugging Face Transformers library, it is best to set it up as a PyTorch module, that is, a subclass of torch.nn.Module.
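One way such a module might look, as a sketch. The class name BertClassifier, the factory function around it, and the choice of the pooled [CLS] output as the input to the linear layer are assumptions; the class is built inside a function only so that the heavy imports happen when a model is actually created.

```python
def make_bert_classifier(num_classes, model_name="bert-base-uncased"):
    """Build a BERT encoder with a linear layer on top (hypothetical helper)."""
    from torch import nn
    from transformers import BertModel

    class BertClassifier(nn.Module):
        def __init__(self):
            super().__init__()
            # Raw pre-trained encoder, with no task head attached.
            self.bert = BertModel.from_pretrained(model_name)
            # hidden_size is the width of BERT's raw output.
            self.linear = nn.Linear(self.bert.config.hidden_size, num_classes)

        def forward(self, input_ids, attention_mask=None, token_type_ids=None):
            outputs = self.bert(
                input_ids=input_ids,
                attention_mask=attention_mask,
                token_type_ids=token_type_ids,
            )
            # pooler_output is the transformed [CLS] representation of each input.
            return self.linear(outputs.pooler_output)

    return BertClassifier()


# Usage (downloads the model on first run):
# model = make_bert_classifier(num_classes=3)
# scores = model(**encoding)  # encoding from tokenizer.encode_plus(..., return_tensors="pt")
```

The returned scores are unnormalized; a loss such as cross-entropy is applied to them during fine-tuning.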
As you can see, instead of downloading a specific BERT Model already designed for a specific task like Question Answering, I downloaded the raw pre-trained BertModel, which does not come with any heads attached to it.
To get the size of the raw BERT outputs, use self.bert.config.hidden_size; this is the input size of your linear layer, and the number of classes you want is its output size.
To use the code above for binary sentiment analysis, add a sigmoid activation function after the linear layer and set the number of classes to 1. (The library also provides BertForSequenceClassification, which attaches a similar, initially untrained, classification head.)
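LLaVA
Environment setup
Check that the environment is set up correctly
Training
CUDA Out-of-Memory (OOM)
Add the argument --bits 4 or --bits 8 to the training script. On its own, however, this raises the following error: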
ValueError: You cannot perform fine-tuning on purely quantized models. Please attach trainable adapters on top of the quantized model to correctly perform fine-tuning. Please see: https://huggingface.co/docs/transformers/peft for more details
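This works when combined with LoRA.
If you do not want to use LoRA, swap in smaller components, such as vicuna-7b for the language model or a vit-base vision encoder, to make the model smaller.
LoRA fine-tuning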
merge_lora_weights.py
ValueError: The generation config instance is invalid
This error seems to appear after upgrading the transformers version. I worked around it by manually adding "do_sample": true to vicuna's generation_config.json file.
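The manual edit can also be scripted. The sketch below uses only the standard library; the function name enable_sampling and the path in the usage comment are hypothetical.

```python
import json
from pathlib import Path


def enable_sampling(config_path):
    """Set do_sample to true in a generation_config.json file, keeping other keys."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["do_sample"] = True
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


# Usage (hypothetical path to the local Vicuna checkpoint):
# enable_sampling("path/to/vicuna/generation_config.json")
```

It may be worth keeping a copy of the original file in case the change needs to be reverted.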