Python + Sqlite 中的字符串相似度 (Levenshtein 距离/编辑距离)

8
有没有Python+Sqlite中可用的字符串相似度测量方法,例如使用sqlite3模块?
用例示例:
import sqlite3
conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('CREATE TABLE mytable (id integer, description text)')
c.execute('INSERT INTO mytable VALUES (1, "hello world, guys")')
c.execute('INSERT INTO mytable VALUES (2, "hello there everybody")')

此查询应匹配ID为1的行,但不匹配ID为2的行:
c.execute('SELECT * FROM mytable WHERE dist(description, "He lo wrold gyus") < 6')

如何在Sqlite+Python中实现这一功能?

以下是我目前了解到的注意事项:

  • The Levenshtein distance, i.e. the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other, can be useful, but I'm not sure if an official implementation exists in Sqlite (I've seen a few custom implementations, like this one)

  • The Damerau-Levenshtein is the same, except it also allows transposition between 2 adjacent characters; it is also called the Edit distance

  • I know it's possible to define a function myself, but implementing such a distance will be non-trivial (doing natural language processing comparison super efficiently for databases is really non-trivial), that's why I wanted to see if Python / Sqlite already features such a tool

  • Sqlite has FTS (Full Text Seach) features: FTS3, FTS4, FTS5

    CREATE VIRTUAL TABLE enrondata1 USING fts3(content TEXT);     /* FTS3 table */
    CREATE TABLE enrondata2(content TEXT);                        /* Ordinary table */
    SELECT count(*) FROM enrondata1 WHERE content MATCH 'linux';  /* 0.03 seconds */
    SELECT count(*) FROM enrondata2 WHERE content LIKE '%linux%'; /* 22.5 seconds */
    

    but I don't find about string comparison with such a "similarity distance", FTS's features MATCH or NEAR don't seem to have similarity measure with letters changes, etc.

  • Moreover this answer shows that:

    SQLite's FTS engine is based on tokens - keywords that the search engine tries to match.
    A variety of tokenizers are available, but they are relatively simple. The "simple" tokenizer simply splits up each word and lowercases it: for example, in the string "The quick brown fox jumps over the lazy dog", the word "jumps" would match, but not "jump". The "porter" tokenizer is a bit more advanced, stripping the conjugations of words, so that "jumps" and "jumping" would match, but a typo like "jmups" would not.

    The latter (the fact that "jmups" cannot be found as similar to "jumps") makes it unpractical for my use case, sadly.

2个回答

9

这里是一个可直接使用的示例test.py

import sqlite3
db = sqlite3.connect(':memory:')
db.enable_load_extension(True)
db.load_extension('./spellfix')                 # for Linux
#db.load_extension('./spellfix.dll')            # <-- UNCOMMENT HERE FOR WINDOWS
db.enable_load_extension(False)
c = db.cursor()
c.execute('CREATE TABLE mytable (id integer, description text)')
c.execute('INSERT INTO mytable VALUES (1, "hello world, guys")')
c.execute('INSERT INTO mytable VALUES (2, "hello there everybody")')
c.execute('SELECT * FROM mytable WHERE editdist3(description, "hel o wrold guy") < 600')
print c.fetchall()
# Output: [(1, u'hello world, guys')]

重要提示:距离 editdist3 已经归一化处理,其中插入和删除的值为100,替换的值为150。

以下是在Windows上首先要做的事情:

  1. Download https://sqlite.org/2016/sqlite-src-3110100.zip, https://sqlite.org/2016/sqlite-amalgamation-3110100.zip and unzip them

  2. Replace C:\Python27\DLLs\sqlite3.dll by the new sqlite3.dll from here. If skipping this, you'd get a sqlite3.OperationalError: The specified procedure could not be found later

  3. Run:

    call "C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\vcvarsall.bat"  
    

    or

    call "C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\vcvarsall.bat" x64
    cl /I sqlite-amalgamation-3110100/ sqlite-src-3110100/ext/misc/spellfix.c /link /DLL /OUT:spellfix.dll
    python test.py
    

    (With MinGW, it would be: gcc -g -shared spellfix.c -I ~/sqlite-amalgation-3230100/ -o spellfix.dll)

以下是在Linux Debian上的操作步骤:

(基于这个答案)

apt-get -y install unzip build-essential libsqlite3-dev
wget https://sqlite.org/2016/sqlite-src-3110100.zip
unzip sqlite-src-3110100.zip
gcc -shared -fPIC -Wall -Isqlite-src-3110100 sqlite-src-3110100/ext/misc/spellfix.c -o spellfix.so
python test.py

以下是在Linux Debian上使用旧版本Python的方法:

如果您的发行版Python版本比较旧,需要使用另一种方法。由于sqlite3模块已经内置在Python中,因此似乎升级它并不直接pip install --upgrade pysqlite只会升级pysqlite模块,而不是底层的SQLite库)。因此,例如这种方法适用于如下情况:如果import sqlite3; print sqlite3.sqlite_version的版本是3.8.2:

wget https://www.sqlite.org/src/tarball/27392118/SQLite-27392118.tar.gz
tar xvfz SQLite-27392118.tar.gz
cd SQLite-27392118 ; sh configure ; make sqlite3.c ; cd ..
gcc -g -fPIC -shared SQLite-27392118/ext/misc/spellfix.c -I SQLite-27392118/src/ -o spellfix.so
python test.py   # [(1, u'hello world, guys')]

1
总是很愉快,我在路上学了一点msvc命令行! - Jacques Gaudin
1
非常好的写作 :-). 请注意,只需付出一点努力,您就可以修改权重以使其与100和150不同。 - bodo
我尝试了以下命令:G:\downloads>C:\HashiCorp\Vagrant\embedded\mingw64\bin\gcc.exe -g -fPIC -shared G:\downloads\sqlite-src-3110100\sqlite-src-3110100/ext/misc/spellfix.c -I G:\downloads\sqlite-src-3110100\sqlite-src-3110100/src/ -o spellfix.so,但是出现了错误。提示找不到命令 G:\downloads。 - SL5net
1
@SL5net 这可能是文件夹名称的问题,可能需要使用引号等。除此之外,我不知道它可能是什么。 - Basj
@Basj C:\HashiCorp\Vagrant\embedded\mingw64\bin>x86_64-w64-mingw32-gcc-7.3.0.exe -g -fPIC -shared "G:\downloads\sqlite-src-3110100\sqlite-src-3110100/ext/misc/spellfix.c" -I "G:\downloads\sqlite-src-3110100\sqlite-src-3110100/src/" -o spellfix.so 在文件 G:\downloads\sqlite-src-3110100\sqlite-src-3110100/ext/misc/spellfix.c 中包含: G:\downloads\sqlite-src-3110100\sqlite-src-3110100/src/sqlite3ext.h:20:10: 致命错误:sqlite3.h:没有那个文件或目录 #include "sqlite3.h" - SL5net
显示剩余7条评论

5

我将距离相关函数(Damerau-Levenshtein, Jaro-Winkler, 最长公共子字符串和子序列)实现为SQLite运行时可加载的扩展。支持任何UTF-8字符串。

https://github.com/schiffma/distlib


4
欢迎来到StackOverflow,并为这个看起来很棒的Github仓库点赞。为了使这个回答有用(链接式回答不被接受),您可以添加更多上下文信息,提供一些Python代码示例(几行即可)以展示如何实际使用它? - Basj
1
同意Basj的评论。那是一个很好的资源,而且它运行得非常好!您应该在您的答案中添加一些Python代码和解释,以展示如何使用和安装它。您的Github上有很多.cpp文件,可能会让一些Python程序员感到害怕... - Laurent

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接