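While building a knowledge graph recently, I needed a method for entity alignment. It turns out that minimum edit distance and Jaccard similarity can be combined into a simple entity alignment algorithm. The original code is in the reference [1], but it is a bit rough, so I have tidied it up here. The minimum edit distance code: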

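import numpy as np

def edit_distance(word1, word2):
    """Return the minimum edit (Levenshtein) distance between two strings."""
    len1 = len(word1)
    len2 = len(word2)
    # dp[i][j] = edit distance between word1[:i] and word2[:j]
    dp = np.zeros((len1 + 1, len2 + 1), dtype=int)
    # Base cases: converting to or from an empty prefix costs one edit per character
    for i in range(len1 + 1):
        dp[i][0] = i
    for j in range(len2 + 1):
        dp[0][j] = j
    for i in range(1, len1 + 1):
        for j in range(1, len2 + 1):
            # Substitution costs 0 when the characters already match
            delta = 0 if word1[i - 1] == word2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + delta, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return int(dp[len1][len2])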
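
As a quick sanity check of the recurrence above, here is a minimal sketch of the same dynamic program in pure Python. It keeps only two rows of the table, so it does not need NumPy. The name `edit_distance_two_rows` is my own and does not come from the referenced code.

```python
def edit_distance_two_rows(word1, word2):
    """Levenshtein distance that keeps two rows instead of the full table."""
    # prev is row i-1 of the DP table; row 0 is the cost of inserting j characters.
    prev = list(range(len(word2) + 1))
    for i, c1 in enumerate(word1, start=1):
        # First column: cost of deleting the first i characters of word1.
        curr = [i]
        for j, c2 in enumerate(word2, start=1):
            delta = 0 if c1 == c2 else 1
            curr.append(min(prev[j - 1] + delta,  # match or substitute
                            prev[j] + 1,          # delete from word1
                            curr[j - 1] + 1))     # insert into word1
        prev = curr
    return prev[-1]


print(edit_distance_two_rows("vipkid", "vipki"))    # 1
print(edit_distance_two_rows("kitten", "sitting"))  # 3
```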

Jaccard code:

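def Jaccrad(terms_model, reference):
    """Return the Jaccard similarity between the character sets of two strings."""
    grams_reference = set(reference)
    grams_model = set(terms_model)
    # Count the elements shared by both sets (size of the intersection)
    temp = 0
    for i in grams_reference:
        if i in grams_model:
            temp = temp + 1
    # Size of the union = |A| + |B| - |A ∩ B|
    fenmu = len(grams_model) + len(grams_reference) - temp
    if fenmu == 0:
        # Both inputs are empty; return 0.0 rather than dividing by zero
        return 0.0
    jaccard_coefficient = temp / fenmu
    return jaccard_coefficient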
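
One thing worth noting: because `set()` is applied directly to the strings, the function compares sets of characters, so character order and repetition are ignored. A minimal sketch of the same computation using Python's set operators makes this easy to see (the name `char_jaccard` is hypothetical, and returning 0.0 for two empty strings is just a convention chosen for this sketch):

```python
def char_jaccard(a, b):
    """Jaccard similarity over the sets of characters in a and b."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Both strings are empty; 0.0 is an assumed convention here.
        return 0.0
    return len(set_a & set_b) / len(union)


print(char_jaccard("vipkid", "vipki"))  # 0.8
print(char_jaccard("abc", "cba"))       # 1.0, character order is ignored
print(char_jaccard("aab", "ab"))        # 1.0, repeated characters are ignored
```

This is one reason the score is averaged with the edit-distance similarity, which is sensitive to order.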

Test code:

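blists = ["vipkid", "vipki", "vip", "福建省委"]
for i in range(len(blists)):
    for j in range(0, i):
        a = blists[i]
        b = blists[j]
        print(a, b)
        # Jaccard similarity over characters
        td = Jaccrad(a, b)
        # Edit distance normalized by the longer string, then turned into a similarity
        std = edit_distance(a, b) / max(len(a), len(b))
        fy = 1 - std
        # Final score: the average of the two similarities
        huizon = (td + fy) / 2
        print('avg_sim: ', huizon)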

Output:

vipki vipkid
avg_sim:  0.8166666666666667
vip vipkid
avg_sim:  0.55
vip vipki
avg_sim:  0.675
福建省委 vipkid
avg_sim:  0.0
福建省委 vipki
avg_sim:  0.0
福建省委 vip
avg_sim:  0.0

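The results look reasonable. Counterexamples are certainly possible, so in practice you would pick a suitable threshold on this score to decide which entities to align. The threshold and the downstream logic are left for you to define.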
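
The thresholding step is left open above. As a rough sketch, it could look like the following. The threshold of 0.6 and the helper names `similarity` and `align_pairs` are my own assumptions for illustration, and the threshold should be tuned on your own data. The two measures are re-implemented compactly so that the block runs on its own.

```python
from itertools import combinations


def edit_distance(word1, word2):
    """Levenshtein distance, keeping two rows of the DP table."""
    prev = list(range(len(word2) + 1))
    for i, c1 in enumerate(word1, start=1):
        curr = [i]
        for j, c2 in enumerate(word2, start=1):
            curr.append(min(prev[j - 1] + (c1 != c2), prev[j] + 1, curr[j - 1] + 1))
        prev = curr
    return prev[-1]


def jaccard(a, b):
    """Jaccard similarity over character sets (0.0 for two empty strings)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def similarity(a, b):
    """Average of character Jaccard and (1 - normalized edit distance)."""
    longest = max(len(a), len(b))
    # Assumed convention: two empty strings get an edit similarity of 0.0.
    edit_sim = 1 - edit_distance(a, b) / longest if longest else 0.0
    return (jaccard(a, b) + edit_sim) / 2


def align_pairs(names, threshold=0.6):
    """Return (a, b, score) for every pair scoring at or above the threshold."""
    pairs = []
    for a, b in combinations(names, 2):
        score = similarity(a, b)
        if score >= threshold:
            pairs.append((a, b, score))
    return pairs


for a, b, score in align_pairs(["vipkid", "vipki", "vip", "福建省委"]):
    print(a, b, round(score, 4))
```

With this threshold, only the `vipkid`/`vipki` and `vipki`/`vip` pairs are kept; the `vipkid`/`vip` pair scores 0.55 and is dropped.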

References

[1] Association alignment (entity alignment) for a knowledge graph based on the Neo4j graph database, Part 1. https://blog.csdn.net/for_yayun/article/details/100292617

 
