
Semantic-Similarity-Based Chunk Splitting in LangChain

Background

After an RAG system ingests a document, the first step is to split it into chunks, embed them, and store them in a vector store.
The crudest approach is to pick a fixed character length and cut the text into equal-sized chunks, as in the following LangChain example:
from langchain.text_splitter import CharacterTextSplitter

text = "This is the text I would like to chunk up. It is the example text for this exercise"
text_splitter = CharacterTextSplitter(
    chunk_size=35,
    chunk_overlap=0,
    separator='',
    strip_whitespace=False)
text_splitter.create_documents([text])
LangChain code
[Document(page_content='This is the text I would like to ch'),
 Document(page_content='unk up. It is the example text for '),
 Document(page_content='this exercise')]
Output
But this approach has several problems:
  1. It ignores semantic structure such as line breaks, question marks, and periods. Hard-cutting through a complete sentence destroys its structure and meaning, which hurts both retrieval and the LLM's in-context learning downstream.
  2. The chunk size and overlap are magic numbers, and it is hard to pick good values.
  3. It adapts poorly: different content naturally has different lengths rather than one fixed length. Describing a pen and describing a computer clearly call for passages of different sizes.
 
To address these problems, there are a few directions for improvement:

Directions for improvement

Consider punctuation and other separators when splitting

For example, LangChain's RecursiveCharacterTextSplitter: within the chunk-size limit, it recursively retries a list of separators (by default paragraph breaks, then newlines, then spaces, then individual characters) so that splits land on natural boundaries.
from langchain.text_splitter import RecursiveCharacterTextSplitter
text = """
One of the most important things I didn't understand about the world when I was a child is the degree to which the returns for performance are superlinear.

Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true. If your product is only half as good as your competitor's, you don't get half as many customers. You get no customers, and you go out of business.

It's obviously true that the returns for performance are superlinear in business. Some think this is a flaw of capitalism, and that if we changed the rules it would stop being true. But superlinear returns for performance are a feature of the world, not an artifact of rules we've invented. We see the same pattern in fame, power, military victories, knowledge, and even benefit to humanity. In all of these, the rich get richer. [1]
"""
text_splitter = RecursiveCharacterTextSplitter(chunk_size=65, chunk_overlap=0)
text_splitter.create_documents([text])

Deciding split points via semantic similarity

After the text is first split into short sentences on basic punctuation, each sentence is embedded and the semantic similarity of adjacent sentences is compared. If the similarity is high, the two sentences are merged into one chunk before being stored in the vector store; if it is low, a split is made between them.

LangChain's semantic splitting implementation

Cosine similarity measures the distance between embedded vectors: the higher the similarity, the smaller the distance (distance = 1 − similarity).
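As a minimal sketch of that distance computation (the embedding values below are made up; real ones come from an embedding model and have hundreds of dimensions):
import numpy as np

# Hypothetical embeddings, one row per sentence
embeddings = np.array([
    [0.1, 0.3, 0.7],
    [0.2, 0.2, 0.8],
    [0.9, 0.1, 0.1],
])

# Cosine distance between each pair of adjacent sentences:
# 1 - cosine similarity; a small distance means similar meaning
a, b = embeddings[:-1], embeddings[1:]
sims = np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
distances = 1 - sims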

Splitting on a percentile threshold

If the cosine distance between adjacent chunks exceeds the 95th percentile of all distances (the default; configurable), split there:
import numpy as np
# distances: cosine distances between adjacent chunks
np.percentile(distances, self.breakpoint_threshold_amount)
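Putting the pieces together, here is a minimal usage sketch. It assumes the langchain_experimental package and an OpenAI embedding model; "percentile" and 95 are the documented defaults:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

text_splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95,  # split where distance exceeds the 95th percentile
)
docs = text_splitter.create_documents([text])  # reusing the text from above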

Splitting on a standard-deviation threshold

If the cosine distance between adjacent chunks exceeds mean + threshold × standard deviation, split there:
import numpy as np
np.mean(distances) + self.breakpoint_threshold_amount * np.std(distances)
The variance is computed as $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$, where $\mu$ is the sample mean; variance measures how dispersed the data are, and the standard deviation $\sigma$ is its square root.
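A small worked example of this rule on made-up distances (with only six samples, LangChain's default multiplier of 3 would never fire here, so this sketch uses 1.5):
import numpy as np

distances = np.array([0.05, 0.08, 0.06, 0.9, 0.07, 0.09])

threshold = np.mean(distances) + 1.5 * np.std(distances)
breakpoints = np.where(distances > threshold)[0]
print(breakpoints)  # [3]: split at the outlier distance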

Splitting on the interquartile range (IQR)

If the cosine distance between adjacent chunks exceeds mean + threshold × IQR, split there:
import numpy as np

q1, q3 = np.percentile(distances, [25, 75])
iqr = q3 - q1

np.mean(distances) + self.breakpoint_threshold_amount * iqr
The interquartile range is the distance between the 25th and 75th percentiles of the data; a small IQR means the values are tightly clustered, while a large IQR means they are widely spread.
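In LangChain's SemanticChunker, each of these strategies is selected through the breakpoint_threshold_type parameter. A minimal sketch for the IQR variant, assuming the same langchain_experimental setup as above (1.5 is the conventional IQR multiplier):
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

text_splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="interquartile",  # or "standard_deviation", "gradient"
    breakpoint_threshold_amount=1.5,
)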
 

Splitting on the distance gradient

If the gradient of the adjacent-chunk cosine distances exceeds a chosen percentile threshold of the gradient values, split there:
import numpy as np

distance_gradient = np.gradient(distances, range(0, len(distances)))

np.percentile(distance_gradient, self.breakpoint_threshold_amount)
np.gradient computes second-order central differences in the interior, $f'_i \approx \frac{f_{i+1} - f_{i-1}}{2h}$, and one-sided differences such as $f'_0 \approx \frac{f_1 - f_0}{h}$ at the two boundaries, where $h$ is the sample spacing (1 here). For example:
f = np.array([1, 2, 4, 7, 11, 16], dtype=float)
np.gradient(f)
array([1. , 1.5, 2.5, 3.5, 4.5, 5. ])
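A worked example of this rule on the same made-up distances as above; because of the central difference, the gradient peaks just before the large jump:
import numpy as np

distances = np.array([0.05, 0.08, 0.06, 0.9, 0.07, 0.09])

# Rate of change of the distance curve; a sharp rise marks a topic shift
gradient = np.gradient(distances)

# Split where the gradient exceeds its 95th percentile
threshold = np.percentile(gradient, 95)
breakpoints = np.where(gradient > threshold)[0]
print(breakpoints)  # [2]: the steep rise just before the outlier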

Other splitting methods

For different kinds of data

Splitting HTML:
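As a minimal sketch, LangChain ships an HTMLHeaderTextSplitter (located in langchain_text_splitters in recent releases) that splits on heading tags and records each chunk's headings in its metadata; the HTML string below is made up:
from langchain.text_splitter import HTMLHeaderTextSplitter

html = "<h1>Pens</h1><p>A pen writes.</p><h1>Laptops</h1><p>A laptop computes.</p>"
splitter = HTMLHeaderTextSplitter(headers_to_split_on=[("h1", "Header 1")])
docs = splitter.split_text(html)  # each Document keeps its <h1> in metadata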

Splitting by token count
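A minimal sketch with TokenTextSplitter, which measures chunk_size in tokens (via tiktoken) rather than characters, so chunk boundaries line up with what the model actually consumes:
from langchain.text_splitter import TokenTextSplitter

text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)
text_splitter.split_text("This is the text I would like to chunk up. It is the example text for this exercise")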
