Python + Sqlite 中的字符串相似度 (Levenshtein 距离/编辑距离)

Question

Python + Sqlite 中的字符串相似度 (Levenshtein 距离/编辑距离)

8

有没有Python+Sqlite中可用的字符串相似度测量方法，例如使用sqlite3模块？

用例示例：

import sqlite3
conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('CREATE TABLE mytable (id integer, description text)')
c.execute('INSERT INTO mytable VALUES (1, "hello world, guys")')
c.execute('INSERT INTO mytable VALUES (2, "hello there everybody")')

此查询应匹配ID为1的行，但不匹配ID为2的行：

c.execute('SELECT * FROM mytable WHERE dist(description, "He lo wrold gyus") < 6')

如何在Sqlite+Python中实现这一功能？

以下是我目前了解到的注意事项：

The Levenshtein distance, i.e. the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other, can be useful, but I'm not sure if an official implementation exists in Sqlite (I've seen a few custom implementations, like this one)
The Damerau-Levenshtein is the same, except it also allows transposition between 2 adjacent characters; it is also called the Edit distance
I know it's possible to define a function myself, but implementing such a distance will be non-trivial (doing natural language processing comparison super efficiently for databases is really non-trivial), that's why I wanted to see if Python / Sqlite already features such a tool

Sqlite has FTS (Full Text Seach) features: FTS3, FTS4, FTS5

CREATE VIRTUAL TABLE enrondata1 USING fts3(content TEXT);     /* FTS3 table */
CREATE TABLE enrondata2(content TEXT);                        /* Ordinary table */
SELECT count(*) FROM enrondata1 WHERE content MATCH 'linux';  /* 0.03 seconds */
SELECT count(*) FROM enrondata2 WHERE content LIKE '%linux%'; /* 22.5 seconds */

but I don't find about string comparison with such a "similarity distance", FTS's features MATCH or NEAR don't seem to have similarity measure with letters changes, etc.

Moreover this answer shows that:

SQLite's FTS engine is based on tokens - keywords that the search engine tries to match.
A variety of tokenizers are available, but they are relatively simple. The "simple" tokenizer simply splits up each word and lowercases it: for example, in the string "The quick brown fox jumps over the lazy dog", the word "jumps" would match, but not "jump". The "porter" tokenizer is a bit more advanced, stripping the conjugations of words, so that "jumps" and "jumping" would match, but a typo like "jmups" would not.

The latter (the fact that "jmups" cannot be found as similar to "jumps") makes it unpractical for my use case, sadly.

- Basj

2个回答

5

我将距离相关函数(Damerau-Levenshtein, Jaro-Winkler, 最长公共子字符串和子序列)实现为SQLite运行时可加载的扩展。支持任何UTF-8字符串。

https://github.com/schiffma/distlib

- schiffma

4

欢迎来到StackOverflow，并为这个看起来很棒的Github仓库点赞。为了使这个回答有用（链接式回答不被接受），您可以添加更多上下文信息，提供一些Python代码示例（几行即可）以展示如何实际使用它？ - Basj

1

同意Basj的评论。那是一个很好的资源，而且它运行得非常好！您应该在您的答案中添加一些Python代码和解释，以展示如何使用和安装它。您的Github上有很多.cpp文件，可能会让一些Python程序员感到害怕... - Laurent

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Basj · Accepted Answer

这里是一个可直接使用的示例test.py：

import sqlite3
db = sqlite3.connect(':memory:')
db.enable_load_extension(True)
db.load_extension('./spellfix')                 # for Linux
#db.load_extension('./spellfix.dll')            # <-- UNCOMMENT HERE FOR WINDOWS
db.enable_load_extension(False)
c = db.cursor()
c.execute('CREATE TABLE mytable (id integer, description text)')
c.execute('INSERT INTO mytable VALUES (1, "hello world, guys")')
c.execute('INSERT INTO mytable VALUES (2, "hello there everybody")')
c.execute('SELECT * FROM mytable WHERE editdist3(description, "hel o wrold guy") < 600')
print c.fetchall()
# Output: [(1, u'hello world, guys')]

重要提示：距离 editdist3 已经归一化处理，其中插入和删除的值为100，替换的值为150。

以下是在Windows上首先要做的事情:

Download https://sqlite.org/2016/sqlite-src-3110100.zip, https://sqlite.org/2016/sqlite-amalgamation-3110100.zip and unzip them
Replace C:\Python27\DLLs\sqlite3.dll by the new sqlite3.dll from here. If skipping this, you'd get a sqlite3.OperationalError: The specified procedure could not be found later

Run:

call "C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\vcvarsall.bat"

or

call "C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\vcvarsall.bat" x64
cl /I sqlite-amalgamation-3110100/ sqlite-src-3110100/ext/misc/spellfix.c /link /DLL /OUT:spellfix.dll
python test.py

(With MinGW, it would be: gcc -g -shared spellfix.c -I ~/sqlite-amalgation-3230100/ -o spellfix.dll)

以下是在Linux Debian上的操作步骤：

(基于这个答案)

apt-get -y install unzip build-essential libsqlite3-dev
wget https://sqlite.org/2016/sqlite-src-3110100.zip
unzip sqlite-src-3110100.zip
gcc -shared -fPIC -Wall -Isqlite-src-3110100 sqlite-src-3110100/ext/misc/spellfix.c -o spellfix.so
python test.py

以下是在Linux Debian上使用旧版本Python的方法：

如果您的发行版Python版本比较旧，需要使用另一种方法。由于sqlite3模块已经内置在Python中，因此似乎升级它并不直接（pip install --upgrade pysqlite只会升级pysqlite模块，而不是底层的SQLite库）。因此，例如这种方法适用于如下情况：如果import sqlite3; print sqlite3.sqlite_version的版本是3.8.2：

wget https://www.sqlite.org/src/tarball/27392118/SQLite-27392118.tar.gz
tar xvfz SQLite-27392118.tar.gz
cd SQLite-27392118 ; sh configure ; make sqlite3.c ; cd ..
gcc -g -fPIC -shared SQLite-27392118/ext/misc/spellfix.c -I SQLite-27392118/src/ -o spellfix.so
python test.py   # [(1, u'hello world, guys')]