博客
关于我
强烈建议你试试无所不能的chatGPT,快点击我
Levenshtein Distance
阅读量:5287 次
发布时间:2019-06-14

本文共 5141 字,大约阅读时间需要 17 分钟。

 原文地址:

Strings may be different yet very similar. With the Levenshtein distance algorithm, we measure similarity and match approximate strings with fuzzy logic. Many projects need this logic.

Levenshtein distance computationsWords:                ant, auntLevenshtein distance: 1Note:                 Only 1 edit is needed.		      The 'u' must be added at index 2.Words:                Samantha, SamLevenshtein distance: 5Note:                 The final 5 letters must be removed.Words:                Flomax, VolmaxLevenshtein distance: 3Note:                 The first 3 letters must be changed		      Drug names are commonly confused.

  

  

Levenshtein  in  t-sql

SET QUOTED_IDENTIFIER ON GOSET ANSI_NULLS ON GOCREATE FUNCTION edit_distance_within(@s nvarchar(4000), @t nvarchar(4000), @d int)RETURNS intASBEGIN  DECLARE @sl int, @tl int, @i int, @j int, @sc nchar, @c int, @c1 int,    @cv0 nvarchar(4000), @cv1 nvarchar(4000), @cmin int  SELECT @sl = LEN(@s), @tl = LEN(@t), @cv1 = '', @j = 1, @i = 1, @c = 0  WHILE @j <= @tl    SELECT @cv1 = @cv1 + NCHAR(@j), @j = @j + 1  WHILE @i <= @sl  BEGIN    SELECT @sc = SUBSTRING(@s, @i, 1), @c1 = @i, @c = @i, @cv0 = '', @j = 1, @cmin = 4000    WHILE @j <= @tl    BEGIN      SET @c = @c + 1      SET @c1 = @c1 - CASE WHEN @sc = SUBSTRING(@t, @j, 1) THEN 1 ELSE 0 END      IF @c > @c1 SET @c = @c1      SET @c1 = UNICODE(SUBSTRING(@cv1, @j, 1)) + 1      IF @c > @c1 SET @c = @c1      IF @c < @cmin SET @cmin = @c      SELECT @cv0 = @cv0 + NCHAR(@c), @j = @j + 1    END    IF @cmin > @d BREAK    SELECT @cv1 = @cv0, @i = @i + 1  END  RETURN CASE WHEN @cmin <= @d AND @c <= @d THEN @c ELSE -1 ENDENDGO

  

Levenshtein algorithm

 

First, credit at the conceptual level goes to Vladimir Levenshtein, a Russian scientist. This code uses a two-dimensional array instead of a jagged array because the space required will only have one width and one height. The two-dimensional array requires fewer allocations upon the managed heap and may be faster in this context.

Program that implements the algorithm [C#]
using System;/// /// Contains approximate string matching/// static class LevenshteinDistance{    ///     /// Compute the distance between two strings.    ///     public static int Compute(string s, string t)    {	int n = s.Length;	int m = t.Length;	int[,] d = new int[n + 1, m + 1];	// Step 1	if (n == 0)	{	    return m;	}	if (m == 0)	{	    return n;	}	// Step 2	for (int i = 0; i <= n; d[i, 0] = i++)	{	}	for (int j = 0; j <= m; d[0, j] = j++)	{	}	// Step 3	for (int i = 1; i <= n; i++)	{	    //Step 4	    for (int j = 1; j <= m; j++)	    {		// Step 5		int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;		// Step 6		d[i, j] = Math.Min(		    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),		    d[i - 1, j - 1] + cost);	    }	}	// Step 7	return d[n, m];    }}class Program{    static void Main()    {	Console.WriteLine(LevenshteinDistance.Compute("aunt", "ant"));	Console.WriteLine(LevenshteinDistance.Compute("Sam", "Samantha"));	Console.WriteLine(LevenshteinDistance.Compute("flomax", "volmax"));    }}

  

Output 1 5 3

 

The Levenshtein method is static—this Compute method doesn't need to store state or instance data, which means you can declare it as static. This can also improve performance, avoiding callvirt instructions.

Tip:You can verify that the above implementation is the standard version of Levenshtein by looking at one of the textbooks you were supposed to read.

Static classes. This algorithm is stateless, which means it doesn't store instance data and therefore can be put in a static class. Static classes are easier to add to new projects than separate methods.

Usage

Continuing on, we see how you can call the method in your C# programs. You will often want to compare multiple strings with the Levenshtein algorithm. The example here shows how you can compare strings in a loop; we use a List of string[] arrays.

Program that calls Levenshtein in loop [C#]
static void Main(){    List
l = new List
{ new string[]{
"ant", "aunt"}, new string[]{
"Sam", "Samantha"}, new string[]{
"clozapine", "olanzapine"}, new string[]{
"flomax", "volmax"}, new string[]{
"toradol", "tramadol"}, new string[]{
"kitten", "sitting"} }; foreach (string[] a in l) { int cost = Compute(a[0], a[1]); Console.WriteLine("{0} -> {1} = {2}", a[0], a[1], cost); }}

 

Outputant -> aunt = 1Sam -> Samantha = 5clozapine -> olanzapine = 3flomax -> volmax = 3toradol -> tramadol = 3kitten -> sitting = 3

Resource

You can visit an excellent page about the Levenshtein distance and many implementations of it. The page and its links provides a more detailed reference.

Summary

 

We saw the famous Levenshtein Distance algorithm, optimized for the C# language. This code implements approximate string matching. The difference between two strings is not represented as true or false, but as an integer indicating the number of steps needed to get from one to the other.

As a reminder:The brilliance of the algorithm comes from Dr. Levenshtein.

转载于:https://www.cnblogs.com/hun_dan/archive/2012/07/17/2594872.html

你可能感兴趣的文章
【刷题】SPOJ 705 SUBST1 - New Distinct Substrings
查看>>
IEEE 754浮点数表示标准
查看>>
declare 结构用来设定一段代码的执行指令
查看>>
图解算法读书笔记
查看>>
调试学习笔记
查看>>
解开lambda最强作用的神秘面纱
查看>>
Java基础:Object类中的equals与hashCode方法
查看>>
C#拦截Http请求
查看>>
图片下载器
查看>>
找不到docker.socket解决方法
查看>>
Activity生命周期
查看>>
sql server和mysql中分别实现分页功能
查看>>
kafka server管理
查看>>
系统设计与分析(六)
查看>>
Java IO-1 File类
查看>>
HW5.29
查看>>
Linux查看物理CPU个数,核数,逻辑CPU个数;内存信息
查看>>
sqlserver查询效率
查看>>
FoxMail邮件设置
查看>>
percona-toolkit 之 【pt-online-schema-change】说明
查看>>