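1. fuzzywuzzy
Introduction: JavaWuzzy is a Java port of FuzzyWuzzy, used to compute how closely two strings match. Each method returns a score from 0 to 100.
FuzzySearch.ratio(String s1, String s2)
Full match; sensitive to word order.
FuzzySearch.partialRatio(String s1, String s2)
Search match (partial match): scores the best-matching substring; sensitive to word order.
FuzzySearch.tokenSortRatio(String s1, String s2)
Sorts the tokens first, then does a full match; insensitive to word order (the score stays high even after the words are rearranged).
FuzzySearch.tokenSortPartialRatio(String s1, String s2)
Sorts the tokens first, then does a search match (partial match); insensitive to word order.
FuzzySearch.tokenSetRatio(String s1, String s2)
Builds a token set first (duplicate words removed), then does a full match; insensitive to word order. Scores 100 when every token of the first string also appears in the second.
FuzzySearch.tokenSetPartialRatio(String s1, String s2)
Builds a token set first, then does a search match (partial match); insensitive to word order.
FuzzySearch.weightedRatio(String s1, String s2)
Uses a different algorithm that weights and combines several of the scores above; the result can still change with word order.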
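
To make the difference between a full match and a token-sorted match concrete, here is a minimal plain-Java sketch. It is not the library's implementation: the class `TokenSortSketch` and its helper methods are hypothetical names, and the score is a simplified ratio, 2 * LCS / (combined length), scaled to 0..100, so the numbers may differ from what `FuzzySearch` returns.

```java
import java.util.Arrays;

class TokenSortSketch {
    // Length of the longest common subsequence (two-row dynamic programming).
    static int lcsLength(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                curr[j] = a.charAt(i - 1) == b.charAt(j - 1)
                        ? prev[j - 1] + 1
                        : Math.max(prev[j], curr[j - 1]);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Simplified full-match score in 0..100 (assumption: not the library's exact formula).
    static int ratio(String a, String b) {
        int total = a.length() + b.length();
        if (total == 0) return 100;
        return (int) Math.round(200.0 * lcsLength(a, b) / total);
    }

    // Lowercase, split on whitespace, sort the tokens, and join them again.
    static String sortTokens(String s) {
        String[] tokens = s.trim().toLowerCase().split("\\s+");
        Arrays.sort(tokens);
        return String.join(" ", tokens);
    }

    // Full match after token sorting, so word order no longer matters.
    static int tokenSortRatio(String a, String b) {
        return ratio(sortTokens(a), sortTokens(b));
    }

    public static void main(String[] args) {
        String s1 = "fuzzy wuzzy was a bear";
        String s2 = "wuzzy fuzzy was a bear";
        System.out.println(ratio(s1, s2));          // below 100: word order differs
        System.out.println(tokenSortRatio(s1, s2)); // 100: same words after sorting
    }
}
```

Sorting the tokens before comparing is what makes the "token sort" variants insensitive to word order; the underlying comparison is unchanged.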

Source: https://github.com/xdrop/fuzzywuzzy

Example:

        import me.xdrop.fuzzywuzzy.FuzzySearch;
        import me.xdrop.diffutils.DiffUtils;

        // Each FuzzySearch method returns an int score from 0 to 100.
        System.out.println("1 " + FuzzySearch.ratio("admin", "admin"));
        System.out.println("2 " + FuzzySearch.partialRatio("ADMIN", "admin"));
        System.out.println("3 " + FuzzySearch.tokenSetPartialRatio("test", "test1"));
        System.out.println("4 " + FuzzySearch.weightedRatio("你是", "你是我"));
        System.out.println("5 " + FuzzySearch.tokenSortRatio("你是", "你是W"));
        System.out.println("6 " + FuzzySearch.tokenSetRatio("你是", "你是o"));
        // Lower-level helpers bundled with this library (not the DiffUtils class from java-diff-utils).
        System.out.println(DiffUtils.getRatio("你是", "你是我"));
        System.out.println(DiffUtils.levEditDistance("你是", "你是我", 1));
        System.out.println(DiffUtils.getMatchingBlocks("你是", "你是我"));
        System.out.println(DiffUtils.getEditOps("你是", "你是我"));

maven:

        <dependency>
            <groupId>me.xdrop</groupId>
            <artifactId>fuzzywuzzy</artifactId>
            <version>1.3.1</version>
        </dependency>

2. commons-text
Introduction: Commons Text is a set of practical, reusable components for working with text in Java.

Source: http://commons.apache.org/proper/commons-text/

Example:

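        import java.util.Locale;
        import org.apache.commons.text.similarity.FuzzyScore;

        // The locale controls how both strings are lowercased before matching.
        FuzzyScore englishScore = new FuzzyScore(Locale.ENGLISH);
        System.out.println("1 " + englishScore.fuzzyScore("admin", "admin"));
        FuzzyScore chineseScore = new FuzzyScore(Locale.CHINESE);
        System.out.println("2 " + chineseScore.fuzzyScore("你是", "你是"));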
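
Unlike the 0..100 scores in the previous section, `FuzzyScore` returns an unbounded integer where higher means a better match. Its Javadoc describes the rule as one point per matched character plus two bonus points for each match that directly follows the previous one. The plain-Java sketch below follows that description; the class name `FuzzyScoreSketch` is hypothetical and the code is an illustration, not the library's source.

```java
import java.util.Locale;

class FuzzyScoreSketch {
    // Scores how well `query` matches `term`, scanning `term` left to right.
    static int score(String term, String query, Locale locale) {
        String t = term.toLowerCase(locale);
        String q = query.toLowerCase(locale);
        int score = 0;
        int termIndex = 0;                     // the scan position never moves backwards
        int previousMatch = Integer.MIN_VALUE; // index in `term` of the last matched character
        for (int qi = 0; qi < q.length(); qi++) {
            char qc = q.charAt(qi);
            boolean found = false;
            while (termIndex < t.length() && !found) {
                if (t.charAt(termIndex) == qc) {
                    score++;                   // one point per matched character
                    if (previousMatch + 1 == termIndex) {
                        score += 2;            // bonus for consecutive matches
                    }
                    previousMatch = termIndex;
                    found = true;
                }
                termIndex++;
            }
        }
        return score;
    }

    public static void main(String[] args) {
        System.out.println(score("Workshop", "wo", Locale.ENGLISH)); // 4
        System.out.println(score("admin", "admin", Locale.ENGLISH)); // 13
        System.out.println(score("你是", "你是", Locale.CHINESE));     // 4
    }
}
```

Because the score grows with the length of the match, it is useful for ranking candidates against one query rather than for comparing scores across different queries.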

maven:

        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-text</artifactId>
            <version>1.4</version>
        </dependency>

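3. java-string-similarity
Introduction: A library implementing different string similarity and distance measures. A dozen algorithms are currently implemented (including Levenshtein edit distance and its siblings, Jaro-Winkler, Longest Common Subsequence, cosine similarity, and others).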

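Normalized, metric, similarity and distance
Shingle (n-gram) based similarity and distance
Levenshtein
Normalized Levenshtein
Weighted Levenshtein
Damerau-Levenshtein
Optimal String Alignment
Jaro-Winkler
Longest Common Subsequence
Metric Longest Common Subsequence
N-Gram
Shingle (n-gram) based algorithms
Q-Gram
Cosine similarity
Jaccard index
Sorensen-Dice coefficient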
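
Most of the edit-distance entries in this list build on the same dynamic-programming idea. As a rough illustration, the plain-Java sketch below computes the Levenshtein distance and a normalized variant (distance divided by the length of the longer string). The class `LevenshteinSketch` is a hypothetical name and this is not the library's code.

```java
class LevenshteinSketch {
    // Minimum number of single-character insertions, deletions, and substitutions.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j characters of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i characters of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insertion, deletion
                        prev[j - 1] + cost);                    // match or substitution
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Distance scaled to 0..1 by the length of the longer string.
    static double normalizedDistance(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 0.0 : (double) distance(a, b) / longest;
    }

    public static void main(String[] args) {
        System.out.println(distance("My string", "My $tring"));           // 1
        System.out.println(normalizedDistance("My string", "My $tring")); // 1/9, about 0.111
    }
}
```

The normalized form is easier to threshold (for example, "treat anything under 0.2 as a near match"), at the cost of no longer being a true metric.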


Source: https://github.com/tdebatty/java-string-similarity

Example:

        import info.debatty.java.stringsimilarity.*;

        Levenshtein levenshtein = new Levenshtein();
        // 1 substitution ('s' replaced by '$')
        System.out.println(levenshtein.distance("My string", "My $tring"));
        NormalizedLevenshtein normalizedLevenshtein = new NormalizedLevenshtein();
        // Levenshtein distance divided by the length of the longer string
        System.out.println(normalizedLevenshtein.distance("My string", "My $tring"));
        Damerau damerau = new Damerau();
        // 1 transposition (C and D swapped)
        System.out.println(damerau.distance("ABCDEF", "ABDCEF"));
        // 2 transpositions (A/B and E/F swapped)
        System.out.println(damerau.distance("ABCDEF", "BACDFE"));
        // 1 deletion
        System.out.println(damerau.distance("ABCDEF", "ABCDE"));
        System.out.println(damerau.distance("ABCDEF", "BCDEF"));
        // 1 insertion
        System.out.println(damerau.distance("ABCDEF", "ABCGDEF"));
        // All different
        System.out.println(damerau.distance("ABCDEF", "POIU"));
        OptimalStringAlignment optimalStringAlignment = new OptimalStringAlignment();
        // Will produce 3.0 (no substring may be edited more than once)
        System.out.println(optimalStringAlignment.distance("CA", "ABC"));
        JaroWinkler jaroWinkler = new JaroWinkler();
        // s and t swapped (adjacent characters)
        System.out.println(jaroWinkler.similarity("My string", "My tsring"));
        // s and n swapped (non-adjacent characters)
        System.out.println(jaroWinkler.similarity("My string", "My ntrisg"));
        LongestCommonSubsequence longestCommonSubsequence = new LongestCommonSubsequence();
        // Will produce 4.0
        System.out.println(longestCommonSubsequence.distance("AGCAT", "GAC"));
        // Will produce 1.0
        System.out.println(longestCommonSubsequence.distance("AGCAT", "AGCT"));
        RatcliffObershelp ratcliffObershelp = new RatcliffObershelp();
        // s and t swapped (adjacent characters)
        System.out.println(ratcliffObershelp.similarity("My string", "My tsring"));
        // s and n swapped (non-adjacent characters)
        System.out.println(ratcliffObershelp.similarity("My string", "My ntrisg"));
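
The 4.0 and 1.0 in the Longest Common Subsequence comments above follow from the usual definition of LCS distance, |s1| + |s2| - 2 * |LCS(s1, s2)|. The plain-Java sketch below reproduces both values under that definition; the class `LcsDistanceSketch` is a hypothetical name and the code is an illustration, not the library's implementation.

```java
class LcsDistanceSketch {
    // Length of the longest common subsequence, using a full DP table for clarity.
    static int lcsLength(String a, String b) {
        int[][] table = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    table[i][j] = table[i - 1][j - 1] + 1;
                } else {
                    table[i][j] = Math.max(table[i - 1][j], table[i][j - 1]);
                }
            }
        }
        return table[a.length()][b.length()];
    }

    // Number of insertions and deletions needed to turn a into b (no substitutions).
    static int distance(String a, String b) {
        return a.length() + b.length() - 2 * lcsLength(a, b);
    }

    public static void main(String[] args) {
        System.out.println(distance("AGCAT", "GAC"));  // 4: LCS length is 2, so 5 + 3 - 4
        System.out.println(distance("AGCAT", "AGCT")); // 1: LCS length is 4, so 5 + 4 - 8
    }
}
```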


maven:

        <dependency>
            <groupId>info.debatty</groupId>
            <artifactId>java-string-similarity</artifactId>
            <version>2.0.0</version>
        </dependency>

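4. java-diff-utils
Introduction: Diff Utils is an open-source library for comparing texts: computing diffs, applying patches, generating or parsing unified diffs, producing diff output for later display (such as a side-by-side view), and so on.
The main reason for building the library was the lack of an easy-to-use library covering everything commonly needed when working with diff files. It was originally inspired by the JRCS library and the clean design of its diff module.

Source: https://github.com/java-diff-utils/java-diff-utils

Example: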

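        import java.util.Arrays;
        import com.github.difflib.DiffUtils;

        // Character-level diff of two strings
        System.out.println(DiffUtils.diffInline("admin", "admin"));
        // Line-level diff; the boolean asks the patch to include unchanged parts as well
        System.out.println(DiffUtils.diff(Arrays.asList("admin"), Arrays.asList("admin"), true));
        // Line-level diff with default settings
        System.out.println(DiffUtils.diff(Arrays.asList("admin"), Arrays.asList("admin")));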

maven:

        <dependency>
            <groupId>io.github.java-diff-utils</groupId>
            <artifactId>java-diff-utils</artifactId>
            <version>4.7</version>
        </dependency>
 
