
LLM Tokenization and Embeddings

Oct 26, 2025

The tokenizer splits the input text into tokens before that text is fed to the generative model.

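The examples below use transformers to load the tokenizer and the model. Both are published on the Hugging Face website, so passing the corresponding ID is all that is needed. The model path used here is microsoft/Phi-3-mini-4k-instruct, running on the T4 GPU provided by Google Colab (device_map="cuda"); other devices work as well.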
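Since other devices work too, the device string does not have to be hard-coded. The following is a minimal sketch, assuming PyTorch's torch.cuda.is_available() as the check; pick_device is a hypothetical helper name, not part of transformers.

```python
import importlib.util


def pick_device() -> str:
    """Return "cuda" when a CUDA GPU is usable, otherwise "cpu"."""
    # Hypothetical helper for illustration; falls back to CPU when
    # PyTorch is not installed at all.
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)
```

The result could stand in for the literal "cuda" wherever a device is named. On a CPU the model should still load, but generation will be much slower.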

Install dependencies

!pip install --upgrade transformers==4.41.2 sentence-transformers==3.0.1 gensim==4.3.2 scikit-learn==1.5.0 accelerate==0.31.0 peft==0.11.1 scipy==1.10.1 numpy==1.26.4

Download and run the LLM

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=False,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

First, declare the prompt. It is then tokenized, and the resulting tokens are passed to the model, which generates the output.

prompt = "Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened.<|assistant|>"

# Tokenize the input prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")

# Generate the text
generation_output = model.generate(
  input_ids=input_ids,
  max_new_tokens=20
)

# Print the output
print(tokenizer.decode(generation_output[0]))

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened.<|assistant|> Subject: Heartfelt Apologies for the Gardening Mishap


Dear
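The warning at the top of this output concerns the attention mask, which tells the model which positions hold real tokens and which hold padding. Below is a minimal pure-Python sketch of the idea. The helper pad_batch, the pad ID of 0, and the token IDs are all hypothetical, chosen only for illustration, and the sketch pads on the right for simplicity.

```python
from typing import List, Tuple


def pad_batch(
    batch: List[List[int]], pad_id: int = 0
) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad sequences to equal length and build the matching mask."""
    max_len = max(len(seq) for seq in batch)
    padded, mask = [], []
    for seq in batch:
        n_pad = max_len - len(seq)
        padded.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        mask.append([1] * len(seq) + [0] * n_pad)
    return padded, mask


# Arbitrary illustrative token IDs, not real vocabulary entries.
ids, attention_mask = pad_batch([[11, 12, 13], [21]])
print(ids)             # [[11, 12, 13], [21, 0, 0]]
print(attention_mask)  # [[1, 1, 1], [1, 0, 0]]
```

With transformers, calling the tokenizer returns an attention_mask next to input_ids. Keeping the whole tokenizer output and passing both fields to model.generate should make the warning go away.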

The model never processes the prompt text directly. The tokenizer processes the prompt instead and returns what the model needs in the variable input_ids, which the model then takes as its input.

Printing input_ids shows the following:

print(input_ids)

output

tensor([[14350,   385,  4876, 27746,  5281,   304, 19235,   363,   278, 25305,
           293, 16423,   292,   286,   728,   481, 29889, 12027,  7420,   920,
           372,  9559, 29889, 32001]], device='cuda:0')

# Decode each ID on its own to see the token it stands for
for token_id in input_ids[0]:
    print(tokenizer.decode(token_id))

output

Write
an
email
apolog
izing
to
Sarah
for
the
trag
ic
garden
ing
m
ish
ap
.
Exp
lain
how
it
happened
.
<|assistant|>

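Each integer is the unique ID of a specific token (a character, a word, or part of a word). These IDs are indices into a table inside the tokenizer that holds every token it can recognize.

In some tokenizer configurations the sequence begins with ID 1 (<s>), a special token that marks the start of the text; the output shown above does not include it.

Some tokens are complete words (for example Write, an, email).

Some tokens are parts of words (for example apolog, izing, trag, ic).

Punctuation marks are tokens of their own.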
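The relationship between IDs and tokens can be pictured as a plain lookup table. Here is a toy sketch built from a handful of the ID/token pairs printed earlier. The dictionaries and the helper names toy_decode and toy_encode are illustrative stand-ins, not the tokenizer's actual data structures.

```python
# A few ID -> token pairs copied from the output printed earlier.
id_to_token = {
    14350: "Write",
    385: "an",
    4876: "email",
    27746: "apolog",
    5281: "izing",
    29889: ".",
    32001: "<|assistant|>",
}
# The reverse mapping is what encoding relies on.
token_to_id = {token: idx for idx, token in id_to_token.items()}


def toy_decode(ids):
    """Look up each ID in the table, one token per ID."""
    return [id_to_token[i] for i in ids]


def toy_encode(tokens):
    """Map already-split tokens back to their IDs."""
    return [token_to_id[t] for t in tokens]


pieces = toy_decode([14350, 385, 4876, 27746, 5281])
print(pieces)              # ['Write', 'an', 'email', 'apolog', 'izing']
print(toy_encode(pieces))  # [14350, 385, 4876, 27746, 5281]
```

The real tokenizer also has to decide how to split raw text into these pieces, which this sketch leaves out.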

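Note that the space character is not represented by a token of its own. Instead, the vocabulary uses a special marker character, hidden in the decoded output above, to record whether a token starts a new word or continues the previous one. In SentencePiece-style vocabularies such as the Llama-derived one used by Phi-3, the marker ("▁") sits at the start of tokens that begin a word, so pieces without it (such as izing and ic) attach directly to the preceding token.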
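To make the spacing rule concrete, here is a small sketch of how marked pieces could be joined back into text. It assumes the SentencePiece convention, in which "▁" (U+2581) prefixes tokens that begin a word. The piece list is written by hand for illustration rather than produced by the tokenizer, and join_pieces is a hypothetical helper.

```python
WORD_START = "\u2581"  # "▁", the SentencePiece word-boundary marker


def join_pieces(pieces):
    """Concatenate pieces, turning each marker into a space."""
    text = "".join(pieces).replace(WORD_START, " ")
    # The first word's marker would otherwise leave a leading space.
    return text.lstrip(" ")


# Hand-written pieces for illustration (assumed marker placement).
pieces = ["▁Write", "▁an", "▁email", "▁apolog", "izing", "▁to", "▁Sarah"]
print(join_pieces(pieces))  # Write an email apologizing to Sarah
```

With the real tokenizer, tokenizer.convert_ids_to_tokens(input_ids[0].tolist()) should show the stored pieces with any such markers intact, whereas decoding one ID at a time hides them.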