利用LangChain和语言模型交互
从名字上可以看出来,LangChain可以用来构建自然语言处理能力的链条。它是一个库,提供了统一的接口,来和各种语言模型进行交互。更重要的是,它支持使用插件(工具),让语言模型能够获取实时的知识。LangChain基于Python,如果你熟悉JavaScript/TypeScript,可以考虑使用LangChain.js。
具体来说,LangChain包括以下模块:
- Model I/O:提供针对语言模型的统一风格接口
- 数据连接(Data connection):提供应用程序相关数据的接口
- 处理链(Chains):用于构建针对模型、提示词、工具调用的流水线
- 代理(Agents):自动选择适当工具解决问题
- 记忆(Memory):跨越多次处理链调用,来保持程序状态信息
- Callbacks:在处理链的中间步骤中进行回调
```bash
# 最小化依赖
pip install langchain
# 支持主流大语言模型
pip install langchain[llms]
# 支持所有能力,注意,占用很大磁盘空间且较为缓慢
pip install langchain[all]
```
以OpenAI为例,安装依赖:
```bash
pip install openai
```
通过环境变量来设置API Key:
```bash
export OPENAI_API_KEY="..."
```
如果需要通过代理来访问OpenAI的API端点,参考下面的方式设置环境变量:
```bash
HTTP_PROXY=https://user:pswd@proxy.gmem.cc
HTTPS_PROXY=https://user:pswd@proxy.gmem.cc
```
下面的代码,根据你提供的输入来预测输出,也就是进行问答:
```python
from langchain.llms import OpenAI

# 温度用来控制随机性,温度越高输出越随机
llm = OpenAI(temperature=0.9)
print(llm.predict("Hey what's up?"))
# Nothing much, just trying to get some work done. How about you?
```
对话模型(Chat Models)是语言模型的变体。语言模型的输入、输出是一段文本,对话模型的输入则是一系列消息,这些消息构成了对话的上下文。对话模型的输出,则是一个消息。
LangChain将消息分为不同的类型,它们有对应的Python类:
- AIMessage:表示从模型角度发出的消息(a message sent from the perspective of the AI)
- HumanMessage:表示从最终用户角度发出的消息
- SystemMessage:用于提出系统级别的指令,例如指示语言模型该如何格式化输出
- ChatMessage:通用消息类型,可以指定一个role参数,从而等价于上述消息之一
这个消息分类,也是源自OpenAI的ChatGPT,ChatGPT会根据消息类型的不同,使用不同的处理方式。
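例如,ChatMessage可以通过role参数模拟上述任一消息类型,下面的小例子仅用于演示接口:

```python
from langchain.schema import AIMessage, ChatMessage

# 下面两条消息在语义上是等价的
msg1 = AIMessage(content="Hello!")
msg2 = ChatMessage(role="assistant", content="Hello!")
```

下面是一个完整的对话模型调用示例: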
```python
if __name__ == "__main__":
    from langchain.chat_models import ChatOpenAI
    from langchain.schema import (
        SystemMessage,
        HumanMessage
    )

    chat = ChatOpenAI(temperature=0)
    resp = chat.predict_messages([
        SystemMessage(content="Translate whatever I say into Japanese."),
        HumanMessage(content="I love coding"),
    ])
    print(resp.content)  # 私はコーディングが大好きです。
```
语言模型具有很大的灵活性,我们将其集成到自己的应用程序中时,通常需要加以限制,不能让最终用户随意的提供输入。这种情况下,可以使用提示词模板。
具体来说,就是由应用程序作者提供提示词的主体部分,最终用户仅仅填充提示词中的占位符。假设我们开发一个翻译器,那么用户可能仅仅需要指定待翻译的文本、目标语言:
```python
from langchain.chat_models import ChatOpenAI
from langchain.prompts import SystemMessagePromptTemplate, HumanMessagePromptTemplate, ChatPromptTemplate

if __name__ == "__main__":
    system_message_prompt = SystemMessagePromptTemplate.from_template(
        """
        Act as a sophisticated translator between multiple languages.
        You can detect the language of a text, and translate it into a different one that I specify.
        """
    )
    human_message_prompt = HumanMessagePromptTemplate.from_template(
        """
        translate this sentence into {language}:
        {text}
        """
    )
    chat_prompt = ChatPromptTemplate.from_messages(
        [system_message_prompt, human_message_prompt]
    )
    messages = chat_prompt.format_messages(language="Chinese", text="Hey what's up?")

    chat = ChatOpenAI(temperature=0.5, model_name="gpt-3.5-turbo")
    resp = chat.predict_messages(messages)
    print(resp.content)  # 嗨,最近怎么样?
```
处理链可以连接模型、提示词,或者其它处理链。最简单的例子:连接语言模型和提示词:
```python
from langchain import LLMChain

chain = LLMChain(llm=chat, prompt=chat_prompt)
resp: str = chain.run(language="Chinese", text="Hey what's up?")
print(resp)  # 嗨,最近怎么样?
```
上面定义了一个非常简单的处理链:渲染提示模板、调用LLM的API。某些复杂的流程下,需要根据输入动态的选择操作。
代理(Agent)能够完成这种动态选择,它调用语言模型来分析输入,从而确定需要先做什么,再做什么。代理能够反复的选择、运行工具,直到获取最终答案。
加载一个代理需要以下要素:
- LLM或者Chat模型:这个模型提供自然语言理解能力,用于驱动代理
- 工具(集):工具是执行特定任务的函数,例如进行Google搜索、数据库查找、执行Python REPL、调用其它处理链
- 代理的名称:这个名称表示使用的代理类。代理类使用参数化的提示词来调用模型,从而确定下一步该做什么
例如serpapi这个工具,它能够发起搜索引擎检索:
```python
# You should provide your serpapi key with environment variable SERPAPI_API_KEY
from langchain.agents import load_tools, initialize_agent, AgentType
from langchain.chat_models import ChatOpenAI

if __name__ == "__main__":
    llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo")
    tools = load_tools(["serpapi"], llm=llm)
    agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True)
    result = agent.run(
        """
        Who is the president of the United States? What did he do last week?
        """
    )
    print(result)
```
将verbose设置为True,能够输出处理链的整个决策过程:
> Entering new chain...
I need to find out who the current president is and what he did last week. I can use the search tool for this.
Action: Search
Action Input: "current president of the United States"
Observation: Joe Biden
Thought:I now know the current president of the United States. I can use the search tool again to find out what he did last week.
Action: Search
Action Input: "Joe Biden last week"
Observation: The Supreme Court on Friday struck down President Joe Biden's plan to cancel up to $20,000 of student debt for tens of millions of Americans, thwarting a major ...
Thought:I now know the final answer.
Final Answer: Joe Biden is the current president of the United States. Last week, the Supreme Court struck down his plan to cancel up to $20,000 of student debt.
> Finished chain.
Joe Biden is the current president of the United States. Last week, the Supreme Court struck down his plan to cancel up to $20,000 of student debt.
可以看到,代理把问题划分为两个子任务,并且通过搜索网络来解答问题。
默认情况下,处理链和代理是无状态的。但很多场景需要依赖先前和模型的交互(作为上下文)。Memory模块提供了这种记住先前交互的机制,它既支持基于最新的输入/输出来更新状态,也支持读取已存储的状态,为下一次输入构造上下文。
Memory模块有多种实现,最简单的基于内存缓冲 —— 简单的将最近的几次输入/输出,作为当前输入的前缀:
```python
from langchain.chains import ConversationChain
from langchain.chat_models import ChatOpenAI
from langchain.memory import ConversationBufferMemory
from langchain.prompts import SystemMessagePromptTemplate, ChatPromptTemplate, MessagesPlaceholder, \
    HumanMessagePromptTemplate

if __name__ == "__main__":
    llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo")
    prompt = ChatPromptTemplate.from_messages([
        SystemMessagePromptTemplate.from_template(
            """
            The following is a conversation between a human and an AI. The AI is talkative and provides
            lots of specific details from its context. If the AI does not know the answer to a question,
            it truthfully says it doesn't know.
            """
        ),
        MessagesPlaceholder(variable_name="history"),
        HumanMessagePromptTemplate.from_template("{input}")
    ])
    memory = ConversationBufferMemory(return_messages=True)
    conversation = ConversationChain(memory=memory, prompt=prompt, llm=llm)
    print(conversation.predict(input="My name is Alex"))
    print(conversation.predict(input="What is my name?"))
```
这种实现有个重要的缺点:随着对话历史的增长,占用的空间越来越大,会影响模型的表现(甚至超过上下文长度限制)。通常需要对历史进行修剪(pruning)或者摘要(summarization),在尽量保留上下文的同时控制其长度。
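LangChain内置了实现这两种思路的Memory,下面是一个简单的示意(参数仅为示例):

```python
from langchain.chains import ConversationChain
from langchain.chat_models import ChatOpenAI
from langchain.memory import ConversationBufferWindowMemory, ConversationSummaryMemory

llm = ChatOpenAI(temperature=0)

# 修剪:仅保留最近k轮对话
window_memory = ConversationBufferWindowMemory(k=2)
# 摘要:调用LLM把较早的对话压缩为摘要
summary_memory = ConversationSummaryMemory(llm=llm)

conversation = ConversationChain(llm=llm, memory=summary_memory)
print(conversation.predict(input="My name is Alex"))
print(conversation.predict(input="What is my name?"))
```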
语言模型都是将文本作为输入的,这些文本就被称为提示词(Prompt)。
在语言模型应用程序中,提示词通常不是硬编码的,也不是让用户随意输入。一般情况下提示词由模板、样例、用户输入组成,LangChain提供了若干类用来构造提示词。
模板用于可复用的构造提示词,它由静态文本和一系列占位符(输入变量)构成。模板内容可以包括:
- 对语言模型的指令
- 帮助语言模型生成更好结果的若干示例
- 对语言模型的提问
这里是一个最简单的例子:
```python
from langchain import PromptTemplate

template = """\
You are a naming consultant for new companies.
What is a good name for a company that makes {product}?
"""

prompt = PromptTemplate.from_template(template)
prompt.format(product="colorful socks")
```
上面的代码能够根据template中的占位符,自动推断出输入变量。手工声明输入变量的方式如下:
```python
multiple_input_prompt = PromptTemplate(
    input_variables=["adjective", "content"],
    template="Tell me a {adjective} joke about {content}."
)
multiple_input_prompt.format(adjective="funny", content="chickens")
```
除了这种语言模型通用提示模板,还有针对对话模型的模板类:SystemMessagePromptTemplate、AIMessagePromptTemplate、HumanMessagePromptTemplate。它们的区别上文已经介绍过。一个或者多个这些模板,可以构成ChatPromptTemplate:
```python
# 由多个消息模板构造ChatPromptTemplate
chat_prompt = ChatPromptTemplate.from_messages([system_message_prompt, human_message_prompt])
# format_prompt返回PromptValue,to_messages将其渲染为消息列表
chat_prompt.format_prompt(input_language="English", ... ).to_messages()
```
内置的提示词模板类(包括两种类型,一个是基于字符串的,一个是基于Chat的),都是根据输入变量,进行提示内容的格式化。
如果你需要以输入变量为键,从第三方(例如数据库,或者Feast这样的Feature Store)获取信息并生成提示词,可以选择自定义提示词模板。
这里以基于字符串的模板类为例来说明如何自定义提示词模板。我们常常会选择继承StringPromptTemplate,但是实际上,自定义提示词模板只需要满足两点:
- 提供 input_variables属性,定义支持的输入变量的名字
- 提供 format方法,接收关键字参数,用于根据指定的输入变量来生成提示词
下面这个模板的例子,以函数名为输入变量,生成让语言模型来解释函数用途的提示:
```python
import inspect

from langchain.prompts import StringPromptTemplate
from pydantic import BaseModel, validator


def get_source_code(function_name):
    # 获取函数的源代码
    return inspect.getsource(function_name)


class FunctionExplainerPromptTemplate(StringPromptTemplate, BaseModel):

    @validator("input_variables")
    def validate_input_variables(cls, v):
        """Validate that the input variables are correct."""
        if len(v) != 1 or "function_name" not in v:
            raise ValueError("function_name must be the only input_variable.")
        return v

    def format(self, **kwargs) -> str:
        # Get the source code of the function
        source_code = get_source_code(kwargs["function_name"])

        # Generate the prompt to be sent to the language model
        prompt = f"""
        Given the function name and source code, generate an English language explanation of the function.
        Function Name: {kwargs["function_name"].__name__}
        Source Code:
        {source_code}
        Explanation:
        """
        return prompt

    def _prompt_type(self):
        return "function-explainer"


fn_explainer = FunctionExplainerPromptTemplate(input_variables=["function_name"])
prompt = fn_explainer.format(function_name=get_source_code)
print(prompt)
```
所谓少量样本(Few-shot)是指给语言模型提供几个例子(shots),以使其能够更好的理解你的意图并生成更好的响应。
样本可以说明期望的输出格式、得到最终输出的途径等等,其形式是比较自由的,语言模型能够理解其中的意图。
下面是一个样本集的例子:
```python
from langchain.prompts.few_shot import FewShotPromptTemplate
from langchain.prompts.prompt import PromptTemplate

examples = [
    {
        "question": "Who lived longer, Muhammad Ali or Alan Turing?",
        "answer": """
Are follow up questions needed here: Yes.
Follow up: How old was Muhammad Ali when he died?
Intermediate answer: Muhammad Ali was 74 years old when he died.
Follow up: How old was Alan Turing when he died?
Intermediate answer: Alan Turing was 41 years old when he died.
So the final answer is: Muhammad Ali
"""
    },
    {
        "question": "When was the founder of craigslist born?",
        "answer": """
Are follow up questions needed here: Yes.
Follow up: Who was the founder of craigslist?
Intermediate answer: Craigslist was founded by Craig Newmark.
Follow up: When was Craig Newmark born?
Intermediate answer: Craig Newmark was born on December 6, 1952.
So the final answer is: December 6, 1952
"""
    },
    # ...其余样本略
]
```
基于上述样本集生成提示:
```python
# 每个样本的渲染方式
example_prompt = PromptTemplate(input_variables=["question", "answer"], template="Question: {question}\n{answer}")

prompt = FewShotPromptTemplate(
    examples=examples,
    example_prompt=example_prompt,
    suffix="Question: {input}",
    input_variables=["input"]
)

print(prompt.format(input="Who was the father of Mary Ball Washington?"))
```
最终生成的提示是这样的:
Question: Who lived longer, Muhammad Ali or Alan Turing?
Are follow up questions needed here: Yes.
Follow up: How old was Muhammad Ali when he died?
Intermediate answer: Muhammad Ali was 74 years old when he died.
Follow up: How old was Alan Turing when he died?
Intermediate answer: Alan Turing was 41 years old when he died.
So the final answer is: Muhammad Ali
Question: When was the founder of craigslist born?
Are follow up questions needed here: Yes.
Follow up: Who was the founder of craigslist?
Intermediate answer: Craigslist was founded by Craig Newmark.
Follow up: When was Craig Newmark born?
Intermediate answer: Craig Newmark was born on December 6, 1952.
So the final answer is: December 6, 1952
Question: Who was the father of Mary Ball Washington?
我们可能会根据用户输入的语义,选择一个最匹配的样本,而不是把所有样本一股脑的发给语言模型。这时候,可以使用样本选择器。
向量存储用于保存Embeddings,所谓Embeddings,是指语言元素(例如单词、短语)在向量空间的数学表示(也就是多维向量)。向量存储支持进行语义相似度的判断,样本选择器依赖于向量存储的能力,来选取和输入最匹配的样本:
```python
from langchain.prompts.example_selector import SemanticSimilarityExampleSelector
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings

example_selector = SemanticSimilarityExampleSelector.from_examples(
    # 候选样本列表
    examples,
    # 这是一个embedding class,用于生成embeddings
    OpenAIEmbeddings(),
    # 这是一个向量存储,用于存储embeddings并进行相似度搜索
    Chroma,
    # This is the number of examples to produce.
    k=1
)

# 选择相似的样本
question = "Who was the father of Mary Ball Washington?"
selected_examples = example_selector.select_examples({"question": question})
```
FewShotPromptTemplate这个提示模板,支持传入一个样本选择器,生成最终提示词:
```python
prompt = FewShotPromptTemplate(
    example_selector=example_selector,
    example_prompt=example_prompt,
    suffix="Question: {input}",
    input_variables=["input"]
)

print(prompt.format(input="Who was the father of Mary Ball Washington?"))
```
目前LangChain没有为对话模型设计特别的少样本抽象。在对话模型中,你只需要在输入消息中提供交替的(Alternating)人类/AI消息,即可产生少量样本的效果:
```python
template = "You are a helpful assistant that translates english to pirate."
system_message_prompt = SystemMessagePromptTemplate.from_template(template)
example_human = HumanMessagePromptTemplate.from_template("Hi")
example_ai = AIMessagePromptTemplate.from_template("Argh me mateys")
human_template = "{text}"
human_message_prompt = HumanMessagePromptTemplate.from_template(human_template)

chat_prompt = ChatPromptTemplate.from_messages(
    # 交替的消息作为样本
    [system_message_prompt, example_human, example_ai, human_message_prompt]
)
```
对于OpenAI来说,可以使用系统消息,并用可选的name属性来“标记”样本:
```python
template = "You are a helpful assistant that translates english to pirate."
system_message_prompt = SystemMessagePromptTemplate.from_template(template)
example_human = SystemMessagePromptTemplate.from_template(
    "Hi", additional_kwargs={"name": "example_user"}  # 样本的人类消息
)
example_ai = SystemMessagePromptTemplate.from_template(
    "Argh me mateys", additional_kwargs={"name": "example_assistant"}  # 样本的AI消息
)
human_template = "{text}"
human_message_prompt = HumanMessagePromptTemplate.from_template(human_template)
```
默认情况下,PromptTemplate的字符串,被看作是Python的f-string,可以通过template_format指定其它格式,目前仅仅支持jinja2模板:
```python
jinja2_template = "Tell me a {{ adjective }} joke about {{ content }}"
prompt_template = PromptTemplate.from_template(template=jinja2_template, template_format="jinja2")
```
调用提示词模板的format*方法,可以得到字符串、消息列表或者ChatPromptValue:
```python
# 输出为str
output = chat_prompt.format(input_language="English", output_language="French", text="I love programming.")
output = chat_prompt.format_prompt(input_language="English", output_language="French", text="I love programming.").to_string()

# 输出为ChatPromptValue
chat_prompt.format_prompt(input_language="English", output_language="French", text="I love programming.")

# 输出为List[BaseMessage]
chat_prompt.format_prompt(input_language="English", output_language="French", text="I love programming.").to_messages()
```
LangChain允许仅部分渲染提示词模板 —— 传入输入变量的子集,生成一个新模板,新模板仅仅使用那些尚未渲染的变量。
```python
from langchain.prompts import PromptTemplate

prompt = PromptTemplate(template="{foo}{bar}", input_variables=["foo", "bar"])
partial_prompt = prompt.partial(foo="foo")
print(partial_prompt.format(bar="baz"))

prompt = PromptTemplate(template="{foo}{bar}", input_variables=["bar"], partial_variables={"foo": "foo"})
print(prompt.format(bar="baz"))
```
除了将直接量赋予输入变量,还可以传入函数,这样每次渲染提示词时都会调用该函数,从而得到不同的结果:
```python
from datetime import datetime


def _get_datetime():
    now = datetime.now()
    return now.strftime("%m/%d/%Y, %H:%M:%S")


prompt = PromptTemplate(
    template="Tell me a {adjective} joke about the day {date}",
    input_variables=["adjective", "date"]
)
partial_prompt = prompt.partial(date=_get_datetime)

prompt = PromptTemplate(
    template="Tell me a {adjective} joke about the day {date}",
    input_variables=["adjective"],
    partial_variables={"date": _get_datetime}
)
```
LangChain支持从JSON或者YAML文件中加载提示词。你可以将模板、样本放在一个文件中,也可以放在多个文件中并相互引用。
考虑下面这个YAML形式的提示词模板:
```yaml
_type: prompt
input_variables: ["adjective", "content"]
template: Tell me a {adjective} joke about {content}.
```
我们可以这样加载它:
```python
from langchain.prompts import load_prompt

prompt = load_prompt("simple_prompt.yaml")
print(prompt.format(adjective="funny", content="chickens"))
```
注意,不管是YAML还是JSON,都用上述方式加载,没有区别。
考虑下面这个提示词模板,它引用了在外部文件中定义的模板文本:
```json
{
    "_type": "prompt",
    "input_variables": ["adjective", "content"],
    "template_path": "simple_template.txt"
}
```
```
Tell me a {adjective} joke about {content}.
```
LangChain会自动到相同目录下寻找txt文件。加载方式和上文一致。
除了PromptTemplate,我们也可用相同的方式加载FewShotPromptTemplate。下面这个例子,将少量样本放在独立文件中:
```json
[
    {"input": "happy", "output": "sad"},
    {"input": "tall", "output": "short"}
]
```
```yaml
# 标注模板类型
_type: few_shot
input_variables: ["adjective"]
# 提示词前缀
prefix: Write antonyms for the following words.
# 样本模板
example_prompt:
    _type: prompt
    input_variables: ["input", "output"]
    template: "Input: {input}\nOutput: {output}"
# 指定样本文件
examples: examples.json
# 提示词后缀
suffix: "Input: {adjective}\nOutput:"
```
加载模板:
```python
prompt = load_prompt("few_shot_prompt.yaml")
print(prompt.format(adjective="funny"))
```
渲染得到的结果如下:
Write antonyms for the following words.
Input: happy
Output: sad
Input: tall
Output: short
Input: funny
Output:
当有大量样本时,可能需要选择一些最匹配的放入到提示中,这时需要用到样本选择器。样本选择器的基类如下:
```python
class BaseExampleSelector(ABC):
    """Interface for selecting examples to include in prompts."""

    @abstractmethod
    def select_examples(self, input_variables: Dict[str, str]) -> List[dict]:
        """Select which examples to use based on the inputs."""
```
唯一需要实现的方法,入参是输入变量,结果是选中样本的列表。
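作为参考,下面是一个自定义样本选择器的极简示意(随机选取k个样本,仅用于演示接口,不是LangChain自带的实现):

```python
import random
from typing import Dict, List

from langchain.prompts.example_selector.base import BaseExampleSelector


class RandomExampleSelector(BaseExampleSelector):
    """随机选取固定数量的样本,仅用于演示select_examples接口。"""

    def __init__(self, examples: List[dict], k: int = 2):
        self.examples = examples
        self.k = k

    def add_example(self, example: dict) -> None:
        # 基类还声明了add_example,用于追加候选样本
        self.examples.append(example)

    def select_examples(self, input_variables: Dict[str, str]) -> List[dict]:
        # 这里不使用输入变量,简单地随机抽取k个样本
        return random.sample(self.examples, min(self.k, len(self.examples)))
```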
LengthBasedExampleSelector基于长度进行选择(默认的长度函数将文本按空白分词后计数),当你担心提示词超过上下文窗口大小(Token限制)时有用。如果用户输入太长,该选择器会选取更少、更短的样本:
```python
import re

from langchain.prompts import PromptTemplate
from langchain.prompts import FewShotPromptTemplate
from langchain.prompts.example_selector import LengthBasedExampleSelector

examples = [
    {"input": "happy", "output": "sad"},
    {"input": "tall", "output": "short"},
    {"input": "energetic", "output": "lethargic"},
    {"input": "sunny", "output": "gloomy"},
    {"input": "windy", "output": "calm"},
]

example_prompt = PromptTemplate(
    input_variables=["input", "output"],
    template="Input: {input}\nOutput: {output}",
)

example_selector = LengthBasedExampleSelector(
    # 可选择样本集
    examples=examples,
    # 提示模板
    example_prompt=example_prompt,
    # 格式化后样本的最大总长度,长度使用get_text_length函数计算
    max_length=25,
    # 这是默认的长度函数:按换行和空格分词后计数
    get_text_length=lambda x: len(re.split("\n| ", x))
)

dynamic_prompt = FewShotPromptTemplate(
    example_selector=example_selector,
    example_prompt=example_prompt,
    prefix="Give the antonym of every input",
    suffix="Input: {adjective}\nOutput:",
    input_variables=["adjective"],
)
```
所谓最大边际相关性 maximal marginal relevance (MMR)是一种信息检索和文本摘要的技术,用于在保持查询相关性的同时,最大限度地减少结果集中的冗余信息。MMR的目的是在结果集中提供尽可能多的新信息,以增加用户获取有价值信息的机会。
MMR算法通过在查询相关性和结果间的新颖性之间进行权衡来实现这一目标。对于每个候选结果,MMR会计算与查询的相关性以及与已选择结果的相似性。然后,根据一个权衡参数(通常表示为λ),MMR会选择具有最大边际相关性的结果。这种方法有助于在结果集中提供多样性,避免重复内容,同时确保结果与用户的查询相关。
进行相似性判断时,MMR使用最大余弦相似性(greatest cosine similarity)。它计算样本、输入的Embeddings在高维空间的向量的夹角,这个夹角越小(也就是余弦越大)意味着相似度越高。
与此同时,MMR尽量保证这些样本之间的差异性。
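为了直观理解,下面用NumPy给出MMR挑选过程的一个示意实现(lambda_mult即上文的权衡参数λ;这只是演示算法,与LangChain的内部实现无关):

```python
import numpy as np


def cosine_sim(a, b):
    # 余弦相似度:夹角越小,值越接近1
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def mmr_select(query_vec, candidate_vecs, k=2, lambda_mult=0.5):
    """按最大边际相关性依次挑选k个候选向量的下标。"""
    selected = []
    while len(selected) < min(k, len(candidate_vecs)):
        best_i, best_score = None, float("-inf")
        for i, vec in enumerate(candidate_vecs):
            if i in selected:
                continue
            # 与查询的相关性
            relevance = cosine_sim(query_vec, vec)
            # 与已选结果的最大相似度(冗余度)
            redundancy = max((cosine_sim(vec, candidate_vecs[j]) for j in selected), default=0.0)
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_i, best_score = i, score
        selected.append(best_i)
    return selected
```

在LangChain中直接使用MaxMarginalRelevanceExampleSelector即可,无需自己实现: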
```python
from langchain.prompts.example_selector import MaxMarginalRelevanceExampleSelector
from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings

example_selector = MaxMarginalRelevanceExampleSelector.from_examples(
    examples,
    # 产生Embeddings
    OpenAIEmbeddings(),
    # 选择一个向量存储类
    FAISS,
    # 选择的样本数量
    k=2,
)
```
n-gram overlap是另外一种衡量文本相似性的方法,用于比较两个文本序列中的n-gram序列。n-gram是指在给定文本中连续出现的n个元素(通常是字符或单词)。n-gram重叠计算的是两个文本序列中共享的n-gram数量,以评估它们之间的相似性。假设我们比较以下两个句子的单词级别的2-gram(bi-gram):
- 我喜欢吃苹果
- 我喜欢吃香蕉
句子1的bi-grams:(我喜欢),(喜欢吃),(吃苹果);句子2的bi-grams:(我喜欢),(喜欢吃),(吃香蕉)。这两个句子共享了2个bi-grams:(我喜欢)和(喜欢吃),因此它们的bi-gram重叠数量为2。
n-gram重叠通常用于自然语言处理任务,如文本分类、文本摘要、机器翻译评估和文档聚类等。通过计算n-gram重叠,可以评估文本之间的结构和内容相似性。n-gram重叠可能受到n的选择和文本长度的影响。
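上述计算用几行Python即可演示(假设已完成分词,这段代码与LangChain的内部实现无关):

```python
def bigrams(tokens):
    # 返回相邻两个词构成的bi-gram集合
    return {tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1)}


s1 = ["我", "喜欢", "吃", "苹果"]
s2 = ["我", "喜欢", "吃", "香蕉"]

overlap = bigrams(s1) & bigrams(s2)
print(overlap)       # {('我', '喜欢'), ('喜欢', '吃')}
print(len(overlap))  # 2
```

LangChain中对应的选择器是NGramOverlapExampleSelector: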
```python
from langchain.prompts.example_selector.ngram_overlap import NGramOverlapExampleSelector

example_selector = NGramOverlapExampleSelector(
    examples=examples,
    example_prompt=example_prompt,
    # This is the threshold, at which selector stops.
    # It is set to -1.0 by default.
    threshold=-1.0,
    # For negative threshold:
    # Selector sorts examples by ngram overlap score, and excludes none.
    # For threshold greater than 1.0:
    # Selector excludes all examples, and returns an empty list.
    # For threshold equal to 0.0:
    # Selector sorts examples by ngram overlap score,
    # and excludes those with no ngram overlap with input.
)
```
这种方式仅仅用最大余弦相似性(greatest cosine similarity)来计算输入和样本的相似度,不考虑样本之间的差异性。
```python
example_selector = SemanticSimilarityExampleSelector.from_examples(
    examples,
    OpenAIEmbeddings(),
    # 向量存储实现
    Chroma,
    # 选择的样本数量
    k=1
)
```
LangChain提供针对两类模型的接口:
- 大语言模型(LLMs):如上文所述,这类接口以文本为输入,文本为输出
- 对话模型:其是基于语言模型的封装,以消息列表为输入,消息为输出
LangChain本身不提供任何模型,它只是提供了操控各种语言模型的标准接口,支持的LLM提供者包括:OpenAI、Hugging Face、Cohere等。
LangChain基于asyncio库实现了对LLM的异步调用支持:
```python
import asyncio

from langchain.llms import OpenAI


async def async_generate(llm):
    resp = await llm.agenerate(["Hello, how are you?"])
    print(resp.generations[0][0].text)


async def main():
    llm = OpenAI(temperature=0.9)
    tasks = [async_generate(llm) for _ in range(10)]
    await asyncio.gather(*tasks)


asyncio.run(main())
```
要封装一个LangChain不支持的语言模型,你仅仅需要继承抽象类LLM即可,需要实现的方法包括:
- 属性 _llm_type ,给这个模型一个类型名
- _call 调用模型
- _acall 异步调用模型
- 属性 _identifying_params,返回一个字典,用于打印模型实例的关键参数
示例:
```python
from typing import Optional, List, Mapping, Any

from langchain.callbacks.manager import CallbackManagerForLLMRun, AsyncCallbackManagerForLLMRun
from langchain.llms.base import LLM


class CustomLLM(LLM):
    n: int

    @property
    def _llm_type(self) -> str:
        return "custom"

    def _call(
            self,
            prompt: str,
            stop: Optional[List[str]] = None,
            run_manager: Optional[CallbackManagerForLLMRun] = None,
    ) -> str:
        if stop is not None:
            raise ValueError("stop kwargs are not permitted.")
        return prompt[: self.n]

    async def _acall(self, prompt: str, stop: Optional[List[str]] = None,
                     run_manager: Optional[AsyncCallbackManagerForLLMRun] = None, **kwargs: Any) -> str:
        return self._call(prompt, stop, run_manager)

    @property
    def _identifying_params(self) -> Mapping[str, Any]:
        """Get the identifying parameters."""
        return {"n": self.n}


if __name__ == "__main__":
    # 构造函数已经由基类定义
    llm = CustomLLM(n=10)
    print(llm("Hello, world!"))
```
FakeListLLM用于测试,它按照预先给定的静态应答列表来响应请求:
```python
from langchain.llms.fake import FakeListLLM

responses = ["Action: Python REPL\nAction Input: print(2 + 2)", "Final Answer: 4"]
llm = FakeListLLM(responses=responses)
```
HumanInputLLM也用于测试或调试,它让人类来扮演LLM,从而可以观察、测试Agent的行为:
```python
from langchain.llms.human import HumanInputLLM
from langchain.agents import load_tools
from langchain.agents import initialize_agent
from langchain.agents import AgentType

if __name__ == "__main__":
    tools = load_tools(["wikipedia"])
    llm = HumanInputLLM(
        # 这里用来格式化提示词
        # 使用Agent和工具时,喂给语言模型的提示词会自动生成
        # 你要模仿语言模型的行为(理解自然语言的能力、遵从提示词的要求)
        prompt_func=lambda prompt: print(
            f"\n===PROMPT START====\n{prompt}\n=====PROMPT END======"
        )
    )
    agent = initialize_agent(
        tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True
    )
    agent.run("Who is Joe Biden?")
```
运行后,显示提示词:
===PROMPT START====
真正的语言模型会理解这里的提示词,测试时人类要模仿语言模型的行为
Answer the following questions as best you can. You have access to the following tools:
这里告诉语言模型,工具及其用途
Wikipedia: A wrapper around Wikipedia. Useful for when you need to answer general questions about people, places, companies, facts, historical events, or other subjects. Input should be a search query.
这里告诉语言模型应当如何应答。凭借语言理解能力,它会在应答中给出足够的信息,让代理知道用什么输入调用什么工具
Use the following format:
Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [Wikipedia]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question
Begin!
Question: Who is Joe Biden?
Thought:
=====PROMPT END======
现在人类可以输入,这个输入就是作为语言模型的响应:
I need to use a tool 注意这个是 Thought
Action: Wikipedia
Action Input: Joe Biden
Observation: Page: Joe Biden
由于上述响应,代理就知道需要用关键字Joe Biden来搜索维基百科了。工具会生成页面的摘要文本,代理则会基于这些文本,继续和语言模型交互。
人类扮演的模型可以继续进行上述Thought/Action/Action Input/Observation交互,并在最后给出如下输入:
Final Answer: Joe Biden is an old man
当看到 Final Answer后,代理会终止处理链。
LangChain提供了针对LLM的缓存层,用于:
- 减少对LLM API的调用,节省费用
- 针对相同问题,提升响应速度
```python
import langchain
from langchain import OpenAI

if __name__ == "__main__":
    llm = OpenAI(model_name="text-davinci-002", n=2, best_of=2)

    # 内存缓存
    from langchain.cache import InMemoryCache
    langchain.llm_cache = InMemoryCache()

    print(llm.predict("Tell me a joke"))  # 慢
    print(llm.predict("Tell me a joke"))  # 快

    # SQLite缓存
    from langchain.cache import SQLiteCache
    langchain.llm_cache = SQLiteCache(database_path=".langchain.db")
```
在处理链中,可以选择为特定节点关闭缓存:
```python
import requests

from langchain import OpenAI
from langchain.text_splitter import CharacterTextSplitter
from langchain.docstore.document import Document
from langchain.chains.summarize import load_summarize_chain

if __name__ == "__main__":
    url = 'https://raw.githubusercontent.com/hwchase17/chat-your-data/master/state_of_the_union.txt'
    state_of_the_union = requests.get(url).text

    llm = OpenAI(model_name="text-davinci-002")
    # disable caching
    no_cache_llm = OpenAI(model_name="text-davinci-002", cache=False)

    text_splitter = CharacterTextSplitter()
    texts = text_splitter.split_text(state_of_the_union)

    # select the first 3 texts and build a document for each
    docs = [Document(page_content=t) for t in texts[:3]]

    # produce a summarization of the documents with a map_reduce chain, the map step
    # will be cached (it uses the cached llm). The reduce (combine) step won't be cached,
    # because it uses the llm with caching disabled
    chain = load_summarize_chain(llm, chain_type="map_reduce", reduce_llm=no_cache_llm)

    # first run will be slow, but the map step will get cached
    print(chain.run(docs))
    # here we run it again, it will be substantially faster while producing a different answer,
    # because the combine step is not cached
    print(chain.run(docs))
```
LangChain支持从文件实例化LLM所需的配置:
```python
from langchain.llms import OpenAI
from langchain.llms.loading import load_llm

# load from JSON, notice that YAML is also acceptable
llm = load_llm("llm.json")

# persist the llm config to file
llm.save("llm.json")
```
某些LLM模型支持提供流式响应,也就是不需要等待所有响应生成后再返回给客户端,我们使用的ChatGPT之类的聊天应用,都会使用这种响应模式。
LangChain支持OpenAI、ChatOpenAI、ChatAnthropic的流式响应。你需要编写一个 CallbackHandler回调,实现 on_llm_new_token方法。StreamingStdOutCallbackHandler是LangChain自带的一个实现:
```python
class StreamingStdOutCallbackHandler(BaseCallbackHandler):
    """Callback handler for streaming. Only works with LLMs that support streaming."""

    def on_llm_new_token(self, token: str, **kwargs: Any) -> None:
        """Run on new LLM token. Only available when streaming is enabled."""
        sys.stdout.write(token)
        sys.stdout.flush()
```
它简单的将新产生的Token写入标准输出:
```python
if __name__ == "__main__":
    from langchain.llms import OpenAI
    from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

    llm = OpenAI(streaming=True, callbacks=[StreamingStdOutCallbackHandler()], temperature=0)
    resp = llm("Write me a song about sparkling water.")
```
使用一个上下文管理器来包围你的代码,可以统计这些代码的Token用量,从而估算成本。目前仅仅支持OpenAI。
```python
from langchain.llms import OpenAI
from langchain.callbacks import get_openai_callback

llm = OpenAI(model_name="text-davinci-002", n=2, best_of=2)

with get_openai_callback() as cb:
    result = llm("Tell me a joke")
    print(cb)

# agent为前文构造的代理,代理内部的多次LLM调用都会被统计
with get_openai_callback() as cb:
    response = agent.run(
        "Who is Olivia Wilde's boyfriend? What is his current age raised to the 0.23 power?"
    )
    print(f"Total Tokens: {cb.total_tokens}")
    print(f"Prompt Tokens: {cb.prompt_tokens}")
    print(f"Completion Tokens: {cb.completion_tokens}")
    print(f"Total Cost (USD): ${cb.total_cost}")
```
详细的支持列表可以看官方文档。
Hugging Face Hub是模型托管平台,它提供开源的120K模型、20K数据集、50K演示应用。LangChain支持直接使用Hugging Face上的模型。首先需要安装包:
```bash
pip install huggingface_hub
```
设置你的API Token:
```python
import os

os.environ["HUGGINGFACEHUB_API_TOKEN"] = HUGGINGFACEHUB_API_TOKEN
```
使用模型:
```python
from langchain import HuggingFaceHub, LLMChain

repo_id = "google/flan-t5-xxl"
llm = HuggingFaceHub(repo_id=repo_id, model_kwargs={"temperature": 0.5, "max_length": 64})
llm_chain = LLMChain(prompt=prompt, llm=llm)  # prompt为前文构造的提示词模板
```
Hugging Face上的模型,也可以被下载到你本地来运行。你需要先安装transformers包:
```bash
pip install transformers
```
然后使用HuggingFacePipeline类来加载模型:
```python
from langchain import HuggingFacePipeline

llm = HuggingFacePipeline.from_model_id(
    model_id="bigscience/bloom-1b7",
    task="text-generation",
    model_kwargs={"temperature": 0, "max_length": 64},
)
```
注意,取决于你使用的模型的规模,下载可能消耗很长时间。当然,你本地机器也得有模型所需的硬件资源和软件。
Text Generation Inference是一个开源服务软件,利用它,你可以在本地运行HuggingFace模型,并且暴露HTTP端点。LangChain可以连接到这种端点:
```python
from langchain.llms import HuggingFaceTextGenInference

llm = HuggingFaceTextGenInference(
    inference_server_url="http://localhost:8010/",
    max_new_tokens=512,
    top_k=10,
    top_p=0.95,
    typical_p=0.95,
    temperature=0.01,
    repetition_penalty=1.03,
)
llm("What did foo say about bar?")
```
对话模型也支持缓存,用法和LLM一样:
```python
import langchain
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI()

from langchain.cache import InMemoryCache
langchain.llm_cache = InMemoryCache()
```
用法和LLM版本的HumanInput模型类似:
```python
from langchain.chat_models.human import HumanInputChatModel
from langchain.agents import load_tools
from langchain.agents import initialize_agent
from langchain.agents import AgentType

tools = load_tools(["wikipedia"])
llm = HumanInputChatModel()

agent = initialize_agent(
    tools, llm, agent=AgentType.CHAT_ZERO_SHOT_REACT_DESCRIPTION, verbose=True
)
```
对话模型同样支持流式响应,用法类似:

```python
from langchain.chat_models import ChatOpenAI
from langchain.schema import (
    HumanMessage,
)
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

chat = ChatOpenAI(streaming=True, callbacks=[StreamingStdOutCallbackHandler()], temperature=0)
resp = chat([HumanMessage(content="Write me a song about sparkling water.")])
```
默认情况下,语言模型输出自然语言文本。有时候你希望获得结构化的数据,便于在后续业务流程中继续处理。
LangChain提供了一系列的输出解析器(OutputParser),它们实现两类关键方法:
- 获取格式指令:返回一个字符串,包含给语言模型的指令,用于指示语言模型去格式化自己的输出
- 解析:接受一段文本(来自语言模型的输出),解析为目标格式
以及一个可选方法:
- 使用提示词来解析:同时提供一个字符串(相当于模型响应)+ 提示词(相当于生成前述响应的提示词)。这个方法在输出解析器需要通过某种方式重试、修复输出的时候调用
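为了说明前两个方法,下面给出一个自定义输出解析器的极简示意(效果近似内置的CommaSeparatedListOutputParser,仅用于演示接口):

```python
from typing import List

from langchain.schema import BaseOutputParser


class CommaListParser(BaseOutputParser):
    """把模型输出的逗号分隔文本解析为字符串列表。"""

    def get_format_instructions(self) -> str:
        # 拼接到提示词中的格式指令
        return "Your response should be a list of comma separated values, eg: `foo, bar, baz`"

    def parse(self, text: str) -> List[str]:
        # 把模型返回的文本解析为目标格式
        return [item.strip() for item in text.split(",")]
```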
PydanticOutputParser允许用户提供任意的JSON Schema,并要求LLM的输出遵从该Schema。注意:能否输出格式正确的JSON取决于模型,在OpenAI模型家族中,DaVinci能可靠地生成JSON,而Curie则不可靠。
你需要定义一个Pydantic模型,来声明Schema:
```python
from pydantic import BaseModel, Field, validator


# Define your desired data structure.
class Joke(BaseModel):
    setup: str = Field(description="question to set up a joke")
    punchline: str = Field(description="answer to resolve the joke")

    # You can add custom validation logic easily with Pydantic.
    @validator("setup")
    def question_ends_with_question_mark(cls, field):
        if field[-1] != "?":
            raise ValueError("Badly formed question!")
        return field
```
使用输出解析器,来生成提示词的格式指令部分:
```python
from langchain.output_parsers import PydanticOutputParser

parser = PydanticOutputParser(pydantic_object=Joke)
instructions = parser.get_format_instructions()
```
格式指令用自然语言提示LLM,该如何生成符合Schema的响应:
The output should be formatted as a JSON instance that conforms to the JSON schema below.
As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.
Here is the output schema:
```
{"properties": {"setup": {"title": "Setup", "description": "question to set up a joke", "type": "string"}, "punchline": {"title": "Punchline", "description": "answer to resolve the joke", "type": "string"}}, "required": ["setup", "punchline"]}
```
使用上述格式指令,构建一个提示词模板:
```python
prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": instructions},
)

_input = prompt.format_prompt(query="Tell me a joke.")
```
完整的提示词:
Answer the user query.
The output should be formatted as a JSON instance that conforms to the JSON schema below.
As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.
Here is the output schema:
```
{"properties": {"setup": {"title": "Setup", "description": "question to set up a joke", "type": "string"}, "punchline": {"title": "Punchline", "description": "answer to resolve the joke", "type": "string"}}, "required": ["setup", "punchline"]}
```
Tell me a joke.
生成响应:
```python
# model为前文构造的LLM
output = model(_input.to_string())

joke: Joke = parser.parse(output)
print(joke.punchline)
```
output即原始响应,为JSON形式:
{"setup": "Why did the chicken cross the road?", "punchline": "To get to the other side!"}'
parse方法将其反序列化为Joke对象。
这个解析器包装另外一个解析器,当后者解析失败,自动修复解析器会调用另外一个LLM来修复失败。该解析器会将格式错误的原始应答 + 格式指令一同发送给新的LLM:
```python
from typing import List

from pydantic import BaseModel, Field
from langchain.chat_models import ChatOpenAI
from langchain.output_parsers import PydanticOutputParser


class Actor(BaseModel):
    name: str = Field(description="name of an actor")
    film_names: List[str] = Field(description="list of names of films they starred in")


actor_query = "Generate the filmography for a random actor."
parser = PydanticOutputParser(pydantic_object=Actor)

# This is a mock of misformatted output, JSON properties must be double quoted.
misformatted = "{'name': 'Tom Hanks', 'film_names': ['Forrest Gump']}"
# parsing will fail with a JSONDecodeError
parser.parse(misformatted)

from langchain.output_parsers import OutputFixingParser

# here we wrap the original parser with OutputFixingParser, which will attempt to fix the output
new_parser = OutputFixingParser.from_llm(parser=parser, llm=ChatOpenAI())
new_parser.parse(misformatted)
```
某些情况下,模型给出的输出是不完整的,而不是格式错误的。这种情况下可以尝试RetryOutputParser,它会将提示词 + 原始输出一起传递给LLM,以尝试获得更好的应答:
```python
from pydantic import BaseModel, Field
from langchain import OpenAI, PromptTemplate
from langchain.output_parsers import PydanticOutputParser


class Action(BaseModel):
    action: str = Field(description="action to take")
    action_input: str = Field(description="input to the action")


parser = PydanticOutputParser(pydantic_object=Action)

template = """Based on the user question, provide an Action and Action Input for what step should be taken.
{format_instructions}
Question: {query}
Response:"""

prompt = PromptTemplate(
    template=template,
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)
prompt_value = prompt.format_prompt(query="who is leo di caprios gf?")

# this is an incomplete output
bad_response = '{"action": "search"}'

from langchain.output_parsers import RetryWithErrorOutputParser

retry_parser = RetryWithErrorOutputParser.from_llm(
    parser=parser, llm=OpenAI(temperature=0)
)
retry_parser.parse_with_prompt(bad_response, prompt_value)
# Action(action='search', action_input='who is leo di caprios gf?')
```
CommaSeparatedListOutputParser用于生成一个列表,条目以逗号分隔:
```python
from langchain.output_parsers import CommaSeparatedListOutputParser
from langchain.prompts import PromptTemplate

output_parser = CommaSeparatedListOutputParser()
format_instructions = output_parser.get_format_instructions()

prompt = PromptTemplate(
    template="List five {subject}.\n{format_instructions}",
    input_variables=["subject"],
    partial_variables={"format_instructions": format_instructions}
)

_input = prompt.format(subject="ice cream flavors")
```
DatetimeOutputParser用于将输出解析为日期/时间:
```python
from langchain.chains import LLMChain
from langchain.llms import OpenAI
from langchain.output_parsers import DatetimeOutputParser
from langchain.prompts import PromptTemplate

output_parser = DatetimeOutputParser()

template = """Answer the users question:

{question}

{format_instructions}"""

prompt = PromptTemplate.from_template(
    template,
    partial_variables={"format_instructions": output_parser.get_format_instructions()},
)

# 构造处理链(原文未给出,这里补上最简单的LLMChain)
chain = LLMChain(prompt=prompt, llm=OpenAI())

output = chain.run("around when was bitcoin founded?")
output_parser.parse(output)
```
LLM一次完整的训练需要高昂的成本,这导致模型不能学到实时的知识,另外,一些私有领域的知识,也不能出现在训练集中。LLM具有In-Context学习的能力,允许用户在输入中注入知识,然后由LLM组织这些知识,生成符合用户预期的应答。
很多基于LLM的应用,都需要利用这种In-Context学习的能力,典型的是基于知识库的问答机器人。将所有知识,都以提示词的形式注入给LLM是不现实的,目前的模型都有Token数的限制。并且Token越多,成本越高,模型性能也越差。
LangChain的数据连接(Data Connection)给出了解决方案,整体上来说,需要在本地离线的对知识库进行处理、索引。在使用LLM时,在本地检索出和用户问题相关的知识片段,然后将这些知识片段作为提示词的一部分发送。
和数据连接有关的模块包括:
- 文档加载器(Document loaders):从各种源加载文档
- 文档转换器(Document transformers):分割文档、剔除重复文档内容
- 文本嵌入模型(Text embedding models):将非结构化的文本转换为Embeddings,即高维空间的向量(形式上为浮点数的列表)
- 向量存储库(Vector stores):用于对上述Embeddings进行存储、检索
- 检索器(Retrievers):用于检索数据,向量存储库可以作为检索器的一种后端
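下面先用一个极简的端到端示意,把这些模块串起来(文件路径、参数均为假设,各模块的具体用法见后文):

```python
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

# 1. 加载文档
docs = TextLoader("knowledge_base.txt", encoding="utf8").load()
# 2. 分割为较小的块
chunks = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0).split_documents(docs)
# 3. 生成Embeddings并写入向量存储
db = FAISS.from_documents(chunks, OpenAIEmbeddings())
# 4. 把向量存储作为检索器,检索和用户问题相关的知识片段
retriever = db.as_retriever()
related = retriever.get_relevant_documents("LangChain的数据连接是做什么的?")
# 之后即可把related中的文本拼入提示词,交给LLM回答
```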
文档加载器负责将特定来源的数据加载为Document对象。Document是关联了元数据的文本。
```python
from langchain.document_loaders.csv_loader import CSVLoader

# 使用默认参数加载CSV
loader = CSVLoader(file_path='./example_data/mlb_teams_2012.csv')

# 也可以指定CSV解析参数
loader = CSVLoader(file_path='./example_data/mlb_teams_2012.csv', csv_args={
    'delimiter': ',',
    'quotechar': '"',
    'fieldnames': ['MLB Team', 'Payroll in millions', 'Wins']
})

data = loader.load()
```
```python
from langchain.document_loaders import UnstructuredHTMLLoader

loader = UnstructuredHTMLLoader("example_data/fake-content.html")
data = loader.load()
```
可以使用BeautifulSoup4加载器,自动解析HTML中的文本到page_content,将页面的标题解析到metadata.title:
```python
from langchain.document_loaders import BSHTMLLoader

loader = BSHTMLLoader("example_data/fake-content.html")
data = loader.load()
data
```
```python
from langchain.document_loaders import JSONLoader

loader = JSONLoader(
    file_path='./example_data/facebook_chat.json',
    # 使用jq语法指定从JSON的什么字段抽取Document.page_content
    jq_schema='.messages[].content')

data = loader.load()
```
如果源文档是JSON Lines格式(每行均为一个JSON),则需要指定 json_lines 为True:
```python
loader = JSONLoader(
    file_path='./example_data/facebook_chat_messages.jsonl',
    jq_schema='.content',
    json_lines=True)

data = loader.load()
```
除了page_content字段,还可以抽取任意元数据:
```python
def metadata_func(record: dict, metadata: dict) -> dict:
    metadata["sender_name"] = record.get("sender_name")
    metadata["timestamp_ms"] = record.get("timestamp_ms")
    return metadata


loader = JSONLoader(
    # {
    #     'messages': [
    #         {'content': 'Bye!', 'sender_name': 'User 2', 'timestamp_ms': 1675597571851}
    #     ]
    # }
    file_path='./example_data/facebook_chat.json',
    jq_schema='.messages[]',
    content_key="content",
    metadata_func=metadata_func
)

data = loader.load()
```
LangChain支持从大量在线服务读取数据,具体参考官方文档。
加载完文档后,通常需要对其进行转换以适合应用程序需要。其中最常见的例子是,将文档切割为较小的块(chunk)以适合模型的上下文大小窗口。
将长文档分割为小块时,需要处理很多潜在的复杂性。理想情况下,语义相关的片段,应该被分割在同一个块中。大体上来说,文本分割的工作方式如下:
- 将文本分割为语义相关的小块,通常以句子为边界
- 将这些小块合并为更大的块,直到到达特定的大小
- 基于上面的块,外加前后的一些重叠,构建最终的块。这个操作的目的是保留块之间的上下文
默认推荐的文本分割器是RecursiveCharacterTextSplitter。该分割器接受一个分隔符列表(默认 ["\n\n", "\n", " ", ""]),在确保块不过大的情况下,优先使用列表中前面的分隔符来分割块。该分割器提供了一些辅助的配置项:
- length_function:控制如何计算chunk的长度,默认是计算字符的个数,实际更常用的是计算Token的个数
- chunk_size:允许的chunk的最大长度
- chunk_overlap:两个相邻的chunk重叠部分的最大长度
- add_start_index:是否包含chunk所在原始文档中的起始位置
示例:
```python
with open('../../state_of_the_union.txt') as f:
    state_of_the_union = f.read()

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=100,
    chunk_overlap=20,
    length_function=len,
    add_start_index=True,
)

texts = text_splitter.create_documents([state_of_the_union])
print(texts[0])
```
对于源代码等特殊文本,可以按编程语言构造分割器:

```python
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    Language,
)

# 查看特定语言使用的分隔符列表
RecursiveCharacterTextSplitter.get_separators_for_language(Language.PYTHON)

# 按语言构造分割器,PYTHON_CODE为待分割的Python源码字符串
python_splitter = RecursiveCharacterTextSplitter.from_language(language=Language.PYTHON)
python_docs = python_splitter.create_documents([PYTHON_CODE])
```
嵌入处理过程需要同时考虑整体上下文、文本中句子和短语的相互关系,以保证生成质量更高的向量表示,精确的捕获文档的主题和含义。这要求具有相同上下文的文本,应该被分割在相同的块中。
类似于Markdown这样的格式,本身就通过标题级别划分了上下文,因此我们可以考虑基于这些标题级别进行分割,LangChains提供了MarkdownHeaderTextSplitter来处理Markdown:
```python
from langchain.text_splitter import MarkdownHeaderTextSplitter

markdown_document = "# Foo\n\n ## Bar\n\nHi this is Jim\n\nHi this is Joe\n\n ### Boo \n\n Hi this is Lance \n\n ## Baz\n\n Hi this is Molly"

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_header_splits = markdown_splitter.split_text(markdown_document)
# [Document(page_content='Hi this is Jim \nHi this is Joe', metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}),
#  Document(page_content='Hi this is Lance', metadata={'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}),
#  Document(page_content='Hi this is Molly', metadata={'Header 1': 'Foo', 'Header 2': 'Baz'})]
```
在上述基于标题级别分割的基础上,可以进一步分割:
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

chunk_size = 250
chunk_overlap = 30
text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)

splits = text_splitter.split_documents(md_header_splits)
```
LLM通常有Token数限制,我们分割的chunk不应该超过这个限制。 分词器(tokenizer)种类很多,计算chunk中有多少token时,应当使用和LLM一致的分词器。
OpenAI提供了基于BPE的分词器tiktoken,下面是其用法示例:
```python
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(state_of_the_union)

# 或者
from langchain.text_splitter import TokenTextSplitter

text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)
texts = text_splitter.split_text(state_of_the_union)
```
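顺带一提,也可以直接用tiktoken统计一段文本的Token数(模型名仅为示例):

```python
import tiktoken

# 取得与目标模型一致的编码器
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
num_tokens = len(enc.encode("LangChain makes it easy to work with LLMs."))
print(num_tokens)
```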
spaCy是一个用于自然语言处理的库,LangChain也支持基于spaCy的分词器:
```python
# pip install spacy
from langchain.text_splitter import SpacyTextSplitter

text_splitter = SpacyTextSplitter(chunk_size=1000)
```
Hugging Face也提供了很多分词器,示例:
```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer, chunk_size=100, chunk_overlap=0
)
```
LangChain提供了和各种Embedding模型进行交互的标准化接口。上文我们提到过,Embedding是将文本转换为向量表示的过程,这样就可以将其存储在向量空间,并进行两个文本的相似度检索。
标准化接口提供了两个方法,其中之一用于嵌入文档:
```python
from langchain.embeddings import OpenAIEmbeddings

embeddings_model = OpenAIEmbeddings(openai_api_key="...")

embeddings = embeddings_model.embed_documents(
    [
        "Hi there!",
        "Oh, hello!",
        "What's your name?",
        "My friends call me World",
        "Hello World!"
    ]
)
```
另外一个用于嵌入查询:
```python
embedded_query = embeddings_model.embed_query("What was the name mentioned in the conversation?")
```
除了针对OpenAI的OpenAIEmbeddings类,还有针对各种其它模型的封装,具体参考官方文档。
有了针对文档、查询的Embeddings,就可以进行相似度计算了 —— 找出和查询相似度最高的文档。向量存储提供了这种查询能力,同时它也负责存储文档的Embeddings。
向量存储的实现有很多,例如FAISS(Facebook AI Similarity Search)库,下面是初始化基于FAISS的向量存储的方法:
```python
# pip install faiss-cpu
from langchain.document_loaders import TextLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS

raw_documents = TextLoader('...').load()

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
documents = text_splitter.split_documents(raw_documents)

embeddings = OpenAIEmbeddings()
db = FAISS.from_documents(documents, embeddings)
```
下面是执行相似度搜索的方法:
```python
# 搜索文档
query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query)
print(docs[0].page_content)

# 基于Embedding搜索
embedding_vector = embeddings.embed_query(query)
docs = db.similarity_search_by_vector(embedding_vector)
```
除了FAISS,LangChain提供了多种向量存储的封装,具体参考官方文档。
检索器可以根据给定的非结构化查询,返回匹配的文档。检索器可以将向量存储作为后端,但是并非必须。检索器的基类很简单:
```python
from abc import ABC, abstractmethod
from typing import Any, List
from langchain.schema import Document
from langchain.callbacks.manager import Callbacks


class BaseRetriever(ABC):
    ...

    def get_relevant_documents(
            self, query: str, *, callbacks: Callbacks = None, **kwargs: Any
    ) -> List[Document]:
        """Retrieve documents relevant to a query.
        Args:
            query: string to find relevant documents for
            callbacks: Callback manager or list of callbacks
        Returns:
            List of relevant documents
        """
        ...

    async def aget_relevant_documents(
            self, query: str, *, callbacks: Callbacks = None, **kwargs: Any
    ) -> List[Document]:
        """Asynchronously get documents relevant to a query.
        Args:
            query: string to find relevant documents for
            callbacks: Callback manager or list of callbacks
        Returns:
            List of relevant documents
        """
        ...
```
也就是说检索器可以返回和查询相关的文档列表,相关性的定义取决于具体的检索器。
主要类型的检索器,还是基于向量存储的。LangChain默认情况下使用Chroma作为向量存储实现。下面的代码从文档加载器创建索引:
```python
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
from langchain.document_loaders import TextLoader

loader = TextLoader('...', encoding='utf8')

from langchain.indexes import VectorstoreIndexCreator

index = VectorstoreIndexCreator().from_loaders([loader])
```
可以基于此索引来进行检索:
```python
query = "What did the president say about Ketanji Brown Jackson"
index.query(query)
```
下面的代码展示了如何从索引得到检索器接口:
```python
index.vectorstore.as_retriever()
```
VectorstoreIndexCreator内部,做了以下工作:
- 将文档分割为块:
```python
documents = loader.load()
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
```
- 为每个文档创建Embeddings,并将文档和Embeddings存入向量存储:
```python
from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()
from langchain.vectorstores import Chroma
db = Chroma.from_documents(texts, embeddings)
```
- 从向量存储得到检索器接口,并构造处理链:
```python
retriever = db.as_retriever()
qa = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="stuff", retriever=retriever)
```
- 基于索引的查询,其实就是调用处理链:
```python
query = "What did the president say about Ketanji Brown Jackson"
qa.run(query)
```
你可以对VectorstoreIndexCreator进行定制化:
```python
index_creator = VectorstoreIndexCreator(
    vectorstore_cls=Chroma,
    embedding=OpenAIEmbeddings(),
    text_splitter=CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
)
```
如果Embedding没有很好的捕获数据的语义,那么查询文本中的细微变化,可能导致相似度检索结果的很大不同。提示词工程和微调有时候用来解决这种检索不准确的问题,但是处理起来可能比较繁琐。
MultiQueryRetriever可以自动化这个微调过程,它会调用LLM,从不同视角,从用户输入生成多个查询。然后,基于这些查询,分别获取相关文档,取并集作为最终结果。通过这种处理,某些情况下能够克服向量存储的那种基于距离的相似度判断的缺点,获取更加丰富的检索结果。
```python
# Build a sample vectorDB
from langchain.vectorstores import Chroma
from langchain.document_loaders import WebBaseLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load blog post
loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
data = loader.load()

# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
splits = text_splitter.split_documents(data)

# VectorDB
embedding = OpenAIEmbeddings()
vectordb = Chroma.from_documents(documents=splits, embedding=embedding)

from langchain.chat_models import ChatOpenAI
from langchain.retrievers.multi_query import MultiQueryRetriever

question = "What are the approaches to Task Decomposition?"
llm = ChatOpenAI(temperature=0)
retriever_from_llm = MultiQueryRetriever.from_llm(retriever=vectordb.as_retriever(), llm=llm)

# Set logging for the queries
import logging

logging.basicConfig()
logging.getLogger('langchain.retrievers.multi_query').setLevel(logging.INFO)

unique_docs = retriever_from_llm.get_relevant_documents(query=question)
len(unique_docs)
```
你不知道用户会提供什么样的查询,真正和查询意图最相关的信息,可能会淹没在大量无关的文本中。将无关的文本返回给程序,进而对LLM发起调用,可能导致更高的成本或者生成较差的应答。
上下文压缩用于解决这个问题,它的思路很简单:不是把检索得到的文档直接返回,而是基于查询给出的上下文信息,进行压缩(去掉无关信息)。
下面的代码先构造一个普通的向量存储检索器,并打印原始的检索结果:
```python
def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i + 1}:\n\n" + d.page_content for i, d in enumerate(docs)]))


if __name__ == "__main__":
    from langchain.text_splitter import CharacterTextSplitter
    from langchain.embeddings import OpenAIEmbeddings
    from langchain.document_loaders import UnstructuredURLLoader
    from langchain.vectorstores import FAISS

    url = 'https://raw.githubusercontent.com/hwchase17/chat-your-data/master/state_of_the_union.txt'
    documents = UnstructuredURLLoader(urls=[url]).load()
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    texts = text_splitter.split_documents(documents)

    retriever = FAISS.from_documents(texts, OpenAIEmbeddings()).as_retriever()
    docs = retriever.get_relevant_documents("What did the president say about Ketanji Brown Jackson")
    pretty_print_docs(docs)

    from langchain.llms import OpenAI
    from langchain.retrievers import ContextualCompressionRetriever
    from langchain.retrievers.document_compressors import LLMChainExtractor
```
检索到的第一个结果是:
Document 1:
In state after state, new laws have been passed, not only to suppress the vote, but to subvert entire elections.
We cannot let this happen.
Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections.
Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service.
One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.
And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
现在进行上下文压缩:
```python
from langchain.llms import OpenAI
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

llm = OpenAI(temperature=0)
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(base_compressor=compressor, base_retriever=retriever)

compressed_docs = compression_retriever.get_relevant_documents("What did the president say about Ketanji Jackson Brown")
pretty_print_docs(compressed_docs)
```
结果长度变短:
Document 1:
"Tonight, I'd like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service.
One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.
And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation's top legal minds, who will continue Justice Breyer's legacy of excellence."
上面代码中的LLMChainExtractor,会调用LLM从每个检索结果中抽取与查询相关的内容。如果只想整篇地保留或丢弃结果、不修改其内容,可以改用LLMChainFilter,它仅仅决定哪些文档应该丢弃。
针对每个结果,都调用一次LLM,成本较高且速度慢。EmbeddingsFilter提供一个更廉价和迅速的方案,它对文档和查询进行嵌入并仅仅返回和查询相似度足够高的那些文档:
```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers.document_compressors import EmbeddingsFilter

embeddings = OpenAIEmbeddings()
embeddings_filter = EmbeddingsFilter(embeddings=embeddings, similarity_threshold=0.76)
compression_retriever = ContextualCompressionRetriever(base_compressor=embeddings_filter, base_retriever=retriever)

compressed_docs = compression_retriever.get_relevant_documents("What did the president say about Ketanji Jackson Brown")
```
利用DocumentCompressorPipeline类,可以构造一个流水线,编排多个压缩器以及BaseDocumentTransformers。后者不进行上下文压缩,仅仅负责简单的转换,例如可以用TextSplitters将文档分割为更小的片段,利用EmbeddingsRedundantFilter基于Embedding相似度剔除冗余的文档:
```python
from langchain.document_transformers import EmbeddingsRedundantFilter
from langchain.retrievers.document_compressors import DocumentCompressorPipeline
from langchain.text_splitter import CharacterTextSplitter

# 首先分割为更小的块
splitter = CharacterTextSplitter(chunk_size=300, chunk_overlap=0, separator=". ")
# 然后剔除重复文档(块)
redundant_filter = EmbeddingsRedundantFilter(embeddings=embeddings)
# 最后根据相似度进行过滤
relevant_filter = EmbeddingsFilter(embeddings=embeddings, similarity_threshold=0.76)
pipeline_compressor = DocumentCompressorPipeline(
    transformers=[splitter, redundant_filter, relevant_filter]
)

compression_retriever = ContextualCompressionRetriever(base_compressor=pipeline_compressor, base_retriever=retriever)
compressed_docs = compression_retriever.get_relevant_documents("What did the president say about Ketanji Jackson Brown")
```
文档不但具有内容,还可以具有任意附加的元数据。元数据是一个字典,上文提到的文档加载器,可以在加载文档时抽取元数据。
所谓自查询的(Self-querying)检索器,能够对自身(通常是背后的向量存储)进行查询。具体来说,它能够从用户的查询输入中,抽取针对已存储的文档的元数据的过滤器,并应用这些过滤器。
打个比方,对于查询“What did bar say about foo”,如果文档具有元数据author,那么自查询检索器就能匹配author为bar、内容与foo相关(例如“foo is a charming chap”)的文档。
代码示例:
```python
if __name__ == "__main__":
    # pip install lark chromadb
    from langchain.schema import Document
    from langchain.embeddings.openai import OpenAIEmbeddings
    from langchain.vectorstores import Chroma

    embeddings = OpenAIEmbeddings()

    docs = [
        Document(
            page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose",
            metadata={"year": 1993, "rating": 7.7, "genre": "science fiction"},
        ),
        Document(
            page_content="Leo DiCaprio gets lost in a dream within a dream within a dream within a ...",
            metadata={"year": 2010, "director": "Christopher Nolan", "rating": 8.2},
        ),
        Document(
            page_content="A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea",
            metadata={"year": 2006, "director": "Satoshi Kon", "rating": 8.6},
        ),
        Document(
            page_content="A bunch of normal-sized women are supremely wholesome and some men pine after them",
            metadata={"year": 2019, "director": "Greta Gerwig", "rating": 8.3},
        ),
        Document(
            page_content="Toys come alive and have a blast doing so",
            metadata={"year": 1995, "genre": "animated"},
        ),
        Document(
            page_content="Three men walk into the Zone, three men walk out of the Zone",
            metadata={
                "year": 1979,
                "rating": 9.9,
                "director": "Andrei Tarkovsky",
                "genre": "science fiction",
            },
        ),
    ]
    vectorstore = Chroma.from_documents(docs, embeddings)

    # instantiate the self-query retriever
    from langchain.llms import OpenAI
    from langchain.retrievers.self_query.base import SelfQueryRetriever
    from langchain.chains.query_constructor.base import AttributeInfo

    metadata_field_info = [
        AttributeInfo(
            name="genre",
            description="The genre of the movie",
            type="string or list[string]",
        ),
        AttributeInfo(
            name="year",
            description="The year the movie was released",
            type="integer",
        ),
        AttributeInfo(
            name="director",
            description="The name of the movie director",
            type="string",
        ),
        AttributeInfo(
            name="rating", description="A 1-10 rating for the movie", type="float"
        ),
    ]
    document_content_description = "Brief summary of a movie"

    llm = OpenAI(temperature=0)
    retriever = SelfQueryRetriever.from_llm(
        llm, vectorstore, document_content_description, metadata_field_info, verbose=True
    )

    # only specifies a relevant query
    docs = retriever.get_relevant_documents("What are some movies about dreams?")
    # only specifies a filter
    docs = retriever.get_relevant_documents("I want to watch a movie rated higher than 8.5")
    # A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea
    # specifies both a query and a filter
    docs = retriever.get_relevant_documents("Has Greta Gerwig directed any movies about women")

    for doc in docs:
        print(doc.page_content)
```
自查询检索器会自动分析用户输入和元数据的关联关系,并生成过滤器:
query='dreams' filter=None limit=None
query=' ' filter=Comparison(comparator=<Comparator.GT: 'gt'>, attribute='rating', value=8.5) limit=None
query='women' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='director', value='Greta Gerwig') limit=None
自查询检索器仅仅能应对比较简单的查询。比如:I'm looking for Greta Gerwig directed any movies about women, or anyone directed movies about dreams,就无法生成恰当的查询和过滤器。
这种检索器,在相似度检索的基础上,让最近被检索器命中过的文档,获得优先级。算法如下:
semantic_similarity + (1.0 - decay_rate) ^ hours_passed
decay_rate为衰退率,控制时间因子的影响大小。hours_passed表示上一次该文档被检索器访问以来已经过去的时间。衰退率的范围为0-1,可以看到,过低或者过高的衰退率,都会快速、很大程度上削弱时间的影响,从而等价于简单的相似度检索。
```python
import faiss
from datetime import datetime, timedelta
from langchain.docstore import InMemoryDocstore
from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers import TimeWeightedVectorStoreRetriever
from langchain.schema import Document
from langchain.vectorstores import FAISS

# Define your embedding model
embeddings_model = OpenAIEmbeddings()

# Initialize the vectorstore as empty
embedding_size = 1536
index = faiss.IndexFlatL2(embedding_size)
vectorstore = FAISS(embeddings_model.embed_query, index, InMemoryDocstore({}), {})
retriever = TimeWeightedVectorStoreRetriever(vectorstore=vectorstore, decay_rate=.0000000000000000000000001, k=1)
```
从向量存储对象,都可以得到一个检索器:
```python
retriever = db.as_retriever()
```
使用这些检索器时,有一些通用的请求参数。
默认情况下,使用基于相似度的检索,可以改为基于最大边际相关性(MMR)的检索,如前文提到过的一样,MMR有利于去除重复的内容:
```python
retriever = db.as_retriever(search_type="mmr")
```
检索的时候,可以指定相似度分数的阈值:
```python
retriever = db.as_retriever(search_type="similarity_score_threshold", search_kwargs={"score_threshold": .5})
```
上述例子中,只有相似度评分高于0.5的文档被返回。
此外,你可以限定仅仅返回相似度最高的K个文档:
```python
retriever = db.as_retriever(search_kwargs={"k": 1})
```
对于简单的应用程序,仅仅使用一个语言模型(可能配合向量存储做信息检索)就足够了。更复杂的应用,可能需要联合使用多个语言模型或者调用任意的其它工具。
LangChain提供了Chain接口,用来编排对一系列工具的顺序化调用。这些工具可能包括LLM和其它Chain。这个接口非常简单:
```python
class Chain(BaseModel, ABC):
    """Base interface that all chains should implement."""

    memory: BaseMemory
    callbacks: Callbacks

    def __call__(
            self,
            inputs: Any,
            return_only_outputs: bool = False,
            callbacks: Callbacks = None,
    ) -> Dict[str, Any]:
        ...
```
最简单的处理链的例子,我们在快速参考一章中介绍过,它将提示词模板和LLM调用链在一起。