<article style="font-size: 16px;">
<p>Text similarity calculation with BERT</p>
<div>
<section>
<div>
<div>
<h2><a href="https://towardsdatascience.com/tagged/getting-started">Getting Started</a></h2>
<p><strong>Introduction</strong></p>
<p>Document similarity is one of the most important problems in NLP. Finding similar documents is useful in several domains, such as recommending similar books and articles, identifying plagiarised documents, and analysing legal documents.</p>
<p>We can call two documents similar if they are semantically similar and describe the same concept, or if they are duplicates.</p>
<p>For a machine to work out the similarity between documents, we need a mathematical measure of similarity, and its values must be comparable so that the machine can tell us which documents are most similar and which are least similar. We also need to represent the text of each document in a quantifiable form (a mathematical object, usually a vector) so that we can perform similarity calculations on it.</p>
<p>So the two main steps are converting a document into a mathematical object and defining a similarity measure. We will look at different ways of doing both.</p>
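<p>The two steps above can be sketched in a few lines of Python. This is only a toy illustration, not a method taken from this article: the word-count representation, the plain dot product used as a stand-in similarity measure, and the names <code>to_vector</code> and <code>dot_similarity</code> are all assumptions made for the example.</p>

```python
from collections import Counter


def to_vector(text, vocabulary):
    """Step 1 (assumed representation): a vector of word counts."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def dot_similarity(a, b):
    """Step 2 (placeholder measure): the plain dot product of two vectors."""
    return sum(x * y for x, y in zip(a, b))


docs = [
    "the cat sat on the mat",
    "a cat sat on a mat",
    "stock prices fell sharply today",
]

# A shared vocabulary gives every document a vector of the same length.
vocabulary = sorted({word for doc in docs for word in doc.lower().split()})
vectors = [to_vector(doc, vocabulary) for doc in docs]

# Compare the first document against the other two.
scores = [dot_similarity(vectors[0], v) for v in vectors[1:]]
print(scores)
```

<p>Here the second document shares four words with the first and scores 4, while the third shares none and scores 0, so the scores are comparable and can be ranked.</p>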
</div>
</div>
</section>
<section>
<div>
<div>
<p><strong>Similarity Function</strong></p>
<p>Some of the most common and effective ways of calculating similarity are:</p>
<p><em>Cosine Distance/Similarity</em> - This is the cosine of the angle between two vectors, so it measures how closely the vectors point in the same direction, regardless of their magnitudes. The formula for the cosine similarity between two vectors A and B is:</p>
<figure style="display:block;text-align:center;">
<div>
<div>
<div>
<div style="text-align: center;">
<img alt="Image for post" height="226" src="https://beijingoptbbs.oss-cn-beijing.aliyuncs.com/cs/5606289-71c2d6e2938d4e55dc93024936c12373.png" style="outline: none;" width="572">
</div>
</div>
</div>
</div>
</figure>
<p>In a two-dimensional space it looks like this:</p>
<figure style="display:block;text-align:center;">
<div>
<div>
<div>
<div style="text-align: center;">
<img alt="Image for post" height="354" src="https://beijingoptbbs.oss-cn-beijing.aliyuncs.com/cs/5606289-d6327fee1d94b613e8dd94c5915315c4.png" style="outline: none;" width="572">
</div>
</div>
</div>
</div>
<figcaption>
angle between two vectors A and B in 2-dimensional space (Image by author)
</figcaption>
</figure>
<p>You can derive this formula from the <a href="https://en.wikipedia.org/wiki/Law_of_cosines">law of cosines</a>.</p>
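<p>One way to carry out that derivation is sketched below. It assumes the usual setup: A and B are drawn from the same origin, theta is the angle between them, and the third side of the triangle is the vector A - B.</p>

```latex
% Law of cosines applied to the triangle with sides A, B and A - B
\|A - B\|^2 = \|A\|^2 + \|B\|^2 - 2\,\|A\|\,\|B\|\cos\theta

% Expanding the left-hand side with the dot product
\|A - B\|^2 = (A - B)\cdot(A - B) = \|A\|^2 + \|B\|^2 - 2\,A\cdot B

% Equating the two right-hand sides and solving for cos(theta)
A\cdot B = \|A\|\,\|B\|\cos\theta
\quad\Longrightarrow\quad
\cos\theta = \frac{A\cdot B}{\|A\|\,\|B\|}
```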
<p>Cosine is 1 at theta = 0° and -1 at theta = 180°, so it is highest for two vectors pointing in the same direction and lowest for two vectors pointing in exactly opposite directions. This is why it is called a similarity. You can treat 1 - cosine as a distance.</p>
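<p>A minimal pure-Python sketch of cosine similarity and the corresponding distance follows. The function names <code>cosine_similarity</code> and <code>cosine_distance</code> are chosen for this example, and raising an error for a zero vector is an assumption about how to handle the undefined case.</p>

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # The angle is undefined when either vector has zero length.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """1 - cosine similarity; ranges from 0 (same direction) to 2 (opposite)."""
    return 1.0 - cosine_similarity(a, b)


same = cosine_similarity([1.0, 2.0], [2.0, 4.0])        # same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])  # at right angles
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])  # opposite direction
print(same, orthogonal, opposite)
```

<p>Up to floating-point rounding, the three values are 1, 0 and -1. Scaling a vector does not change the result, which is why [1, 2] and [2, 4] count as identical in direction.</p>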
<p><em>Euclidean Distance</em> - This is the special case of Minkowski distance with p = 2. It is defined as follows:</p>
<p> <em>欧几里德距离</em>-当p = 2时, |
|