In Python, how do I check whether a string contains only certain characters?

Question

python regex search character

90

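I need to check whether a string contains only a..z, 0..9, and "." (period), and no other characters.

I could iterate over each character and check whether it is a..z, 0..9, or "." (period), but that would be slow.

I am not clear on how to do this with a regular expression.

Is this correct? Can you suggest a simpler regular expression or a more efficient approach?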

import re

# Valid chars: . a-z 0-9
def check(test_str):
    # http://docs.python.org/library/re.html
    # re.search returns None if no position in the string matches the pattern
    # pattern to search for any character other than . a-z 0-9
    pattern = r'[^\.a-z0-9]'
    if re.search(pattern, test_str):
        # A character other than . a-z 0-9 was found
        print('Invalid : %r' % (test_str,))
    else:
        # No character other than . a-z 0-9 was found
        print('Valid   : %r' % (test_str,))

check(test_str='abcde.1')
check(test_str='abcde.1#')
check(test_str='ABCDE.12')
check(test_str='_-/>"!@#12345abcde<')

'''
Output:
>>> 
Valid   : 'abcde.1'
Invalid : 'abcde.1#'
Invalid : 'ABCDE.12'
Invalid : '_-/>"!@#12345abcde<'
'''

- X10

1

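Looks fine to me. You don't need the backslash before the . inside a character class, but that only saves one character ;) - Alice Purcell

@Ingenutrix, John did spot a bug in my answer. I think his solution is the best one. - Nadia Alramli

Changed the accepted answer from Nadia's to John Machin's. - X10

See Tim Peters's answer to this question: How to check whether a string contains only characters from a given set in Python. - mattst

If you want to convert a string so that it contains only the specified characters, see https://dev59.com/FGUo5IYBdhLWcg3wzCFr. Other techniques apply in some special cases: for example, https://dev59.com/xHM_5IYBdhLWcg3wRw11 covers the digits-only case. See also https://dev59.com/nnVC5IYBdhLWcg3wcgmd, which is specifically about creating valid filenames. - Karl Knechtel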

9 Answers

56

Final edit

The answer, wrapped up in a function, with an annotated interactive session:

>>> import re
>>> def special_match(strg, search=re.compile(r'[^a-z0-9.]').search):
...     return not bool(search(strg))
...
>>> special_match("")
True
>>> special_match("az09.")
True
>>> special_match("az09.\n")
False
# The above test case is to catch out any attempt to use re.match()
# with a `$` instead of `\Z` -- see point (6) below.
>>> special_match("az09.#")
False
>>> special_match("az09.X")
False
>>>

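Note: there is a comparison with re.match() further down in this answer. Further timings show that match() wins on much longer strings; match() seems to have a much larger overhead than search() when the final answer is True. This is puzzling (perhaps it is the cost of returning a MatchObject instead of None) and may warrant further investigation.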

==== Earlier text ====

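The previously accepted answer could use a few improvements:

(1) The presentation gives the appearance of being the result of an interactive Python session: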

reg=re.compile('^[a-z0-9\.]+$')
>>>reg.match('jsdlfjdsf12324..3432jsdflsdf')
True

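but match() doesn't return True

(2) For use with match(), the ^ at the start of the pattern is redundant, and appears to be slightly slower than the same pattern without the ^.

(3) Raw strings should be used automatically, without having to think about it, for any re pattern.

(4) The backslash in front of the dot/period is redundant.

(5) Slower than the OP's code!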

prompt>rem OP's version -- NOTE: OP used raw string!

prompt>\python26\python -mtimeit -s"t='jsdlfjdsf12324..3432jsdflsdf';import re;reg=re.compile(r'[^a-z0-9\.]')" "not bool(reg.search(t))"
1000000 loops, best of 3: 1.43 usec per loop

prompt>rem OP's version w/o backslash

prompt>\python26\python -mtimeit -s"t='jsdlfjdsf12324..3432jsdflsdf';import re;reg=re.compile(r'[^a-z0-9.]')" "not bool(reg.search(t))"
1000000 loops, best of 3: 1.44 usec per loop

prompt>rem cleaned-up version of accepted answer

prompt>\python26\python -mtimeit -s"t='jsdlfjdsf12324..3432jsdflsdf';import re;reg=re.compile(r'[a-z0-9.]+\Z')" "bool(reg.match(t))"
100000 loops, best of 3: 2.07 usec per loop

prompt>rem accepted answer

prompt>\python26\python -mtimeit -s"t='jsdlfjdsf12324..3432jsdflsdf';import re;reg=re.compile('^[a-z0-9\.]+$')" "bool(reg.match(t))"
100000 loops, best of 3: 2.08 usec per loop

(6) Can produce the wrong answer!!

>>> import re
>>> bool(re.compile('^[a-z0-9\.]+$').match('1234\n'))
True # uh-oh
>>> bool(re.compile('^[a-z0-9\.]+\Z').match('1234\n'))
False
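
As a minimal sketch, the difference in point (6) can be reproduced on Python 3. The helper names below are made up for illustration and are not part of the answer; re.fullmatch, which is shown alongside for comparison, assumes Python 3.4 or later.

```python
import re

# Hypothetical names, for illustration only.
DOLLAR = re.compile(r'[a-z0-9.]+$')        # `$` also matches just before a trailing newline
BACKSLASH_Z = re.compile(r'[a-z0-9.]+\Z')  # `\Z` matches only at the very end of the string
CHARS = re.compile(r'[a-z0-9.]+')          # no anchor; meant to be used with fullmatch()

def check_dollar(s):
    return bool(DOLLAR.match(s))

def check_backslash_z(s):
    return bool(BACKSLASH_Z.match(s))

def check_fullmatch(s):
    return bool(CHARS.fullmatch(s))

# Print the three verdicts for a clean string and one with a trailing newline.
for candidate in ('1234', '1234\n'):
    print(repr(candidate),
          check_dollar(candidate),
          check_backslash_z(candidate),
          check_fullmatch(candidate))
```

Only the `$` variant accepts the string with the trailing newline; the other two reject it.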

- John Machin

4

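Thanks for correcting my answer. I forgot that match only checks for a match at the beginning of the string. Ingenutrix, I think you should select this answer as the accepted one. - Nadia Alramli

Wow, getting another solution after already accepting one. @John Machin, thanks for joining the discussion. Could you put the final, cleaned-up solution at the top of your post? All these different (though excellent) posts might confuse another newcomer searching for the final solution. Please don't change or remove anything in your post; it is great to see your explanation step by step. It is very instructive. Thanks. - X10

@Nadia: That's very gracious of you. Thank you! @Ingenutrix: Cleaned up as requested. - John Machin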

51

Is there a simpler approach? Something a little more Pythonic?

>>> ok = "0123456789abcdef"
>>> all(c in ok for c in "123456abc")
True
>>> all(c in ok for c in "hello world")
False

It certainly isn't the most efficient, but it is very readable.
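
Applied to the character set from the question, the same idea might look like the sketch below. The names ALLOWED and only_allowed are illustrative assumptions, and a frozenset is used so that each membership test is a hash lookup rather than a scan of a string.

```python
# Hypothetical names; the allowed set mirrors the question (a-z, 0-9 and ".").
ALLOWED = frozenset("abcdefghijklmnopqrstuvwxyz0123456789.")

def only_allowed(text, allowed=ALLOWED):
    """Return True if every character of text is in allowed (True for '')."""
    # all() stops at the first character that is not allowed.
    return all(ch in allowed for ch in text)

print(only_allowed("abcde.1"))
print(only_allowed("ABCDE.12"))
```

Note that an empty string passes this check, because all() of an empty iterable is True.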

- Mark Rushakoff

3

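ok = dict.fromkeys("0123456789abcdef") might improve speed without hurting readability. - jfs

@J.F.Sebastian: On my system, the dict.fromkeys trick is only 1 to 3% faster, with both a long string and a short string. (Using Python 3.3) - erik

1

@erik: Using bytes.translate can improve the speed. See the discussion in the comments and the performance comparison in the answers. - jfs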
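
The bytes.translate idea mentioned in the last comment can be sketched roughly as follows. This is an assumption about how the technique would be applied to the question's character set, not code taken from the linked discussion; the function names are made up.

```python
# Hypothetical names; the allowed bytes mirror the question (a-z, 0-9 and ".").
ALLOWED_BYTES = b"abcdefghijklmnopqrstuvwxyz0123456789."

def only_allowed_bytes(data):
    # translate(None, delete) returns a copy with every byte in `delete` removed;
    # if nothing is left over, the input contained only allowed bytes.
    return not data.translate(None, ALLOWED_BYTES)

def only_allowed_text(text):
    try:
        data = text.encode('ascii')
    except UnicodeEncodeError:
        # Any non-ASCII character is outside the allowed set.
        return False
    return only_allowed_bytes(data)

print(only_allowed_text("az09."))
print(only_allowed_text("az09.X"))
```

The work is done by a single C-level pass over the bytes, which is presumably where the speed-up comes from.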

17

EDIT: Changed the regular expression to exclude A-Z

The regular expression solution is the fastest pure-Python solution so far.

reg=re.compile('^[a-z0-9\.]+$')
>>>reg.match('jsdlfjdsf12324..3432jsdflsdf')
True
>>> timeit.Timer("reg.match('jsdlfjdsf12324..3432jsdflsdf')", "import re; reg=re.compile('^[a-z0-9\.]+$')").timeit()
0.70509696006774902

Compared to other solutions:

>>> timeit.Timer("set('jsdlfjdsf12324..3432jsdflsdf') <= allowed", "import string; allowed = set(string.ascii_lowercase + string.digits + '.')").timeit()
3.2119350433349609
>>> timeit.Timer("all(c in allowed for c in 'jsdlfjdsf12324..3432jsdflsdf')", "import string; allowed = set(string.ascii_lowercase + string.digits + '.')").timeit()
6.7066690921783447

If you want to allow empty strings, change it to:

reg=re.compile('^[a-z0-9\.]*$')
>>>bool(reg.match(''))
True

As requested, I am putting back the other part of the answer. But please note that the following accepts the A-Z range.

You can use isalnum

test_str.replace('.', '').isalnum()

>>> 'test123.3'.replace('.', '').isalnum()
True
>>> 'test123-3'.replace('.', '').isalnum()
False

EDIT: Using isalnum is more efficient than the set solution.

>>> timeit.Timer("'jsdlfjdsf12324..3432jsdflsdf'.replace('.', '').isalnum()").timeit()
0.63245487213134766

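EDIT 2: John gave an example where the above doesn't work. I changed the solution to handle this special case by using encode.

test_str.replace('.', '').encode('ascii', 'replace').isalnum()

It is still almost three times faster than the set solution.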

timeit.Timer("u'ABC\u0131\u0661'.encode('ascii', 'replace').replace('.','').isalnum()", "import string; allowed = set(string.ascii_lowercase + string.digits + '.')").timeit()
1.5719811916351318

In my opinion, using regular expressions is the best way to solve this problem.
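
To make the isalnum caveats concrete, here is a minimal Python 3 sketch. The function name is hypothetical, and str.isascii assumes Python 3.7 or later; it replaces the encode step above with an explicit ASCII check and also rejects uppercase letters.

```python
# Hypothetical helper, not part of the answer above.
def check_with_isalnum(test_str):
    stripped = test_str.replace('.', '')
    if not stripped:
        # ''.isalnum() is False, so strings made only of periods
        # (or the empty string) are handled separately.
        return True
    # isascii() rejects non-ASCII letters and digits that isalnum() would accept;
    # comparing with lower() rejects A-Z.
    return stripped.isascii() and stripped.isalnum() and stripped == stripped.lower()

# A bare isalnum() accepts uppercase and non-ASCII alphanumerics.
print('ABC\u0131\u0661'.isalnum())
print(check_with_isalnum('ABC\u0131\u0661'))
print(check_with_isalnum('test123.3'))
```

Whether the empty string should count as valid is a design choice; this sketch accepts it, in line with the `*` variant of the regular expression.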

- Nadia Alramli

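Very interesting! Thanks for the speed details. By the way, the check should fail for uppercase letters, but that is a minor issue.

'A.a'.lower().replace('.', '').isalnum() True

Could you update your non-encoded, encoded, and regex solutions to exclude A-Z? (It is a small issue, but you seem far more expert at this than I am, and I don't want to put .lower() in the wrong place and mess up the answer.) My main concern was making sure my solution is correct, but I'm glad I posted the question here, because speed matters a lot. This check will be performed a few million times, and having seen the speed results, it really does matter! - X10

I think I got 'A.a'.lower().replace('.', '').isalnum() wrong; this is best left to the experts. - X10

Nadia, your earlier, more detailed post was richer and more instructive (even if it strayed a bit from the question). If you can restore it, please do. Just reading it helps newcomers like me. - X10

If you decide to go this route, another performance note: compile the regular expression once and reuse the compiled version, rather than recompiling it on every function call. Compiling a regular expression is a very time-consuming process. - Brent Writes Code

@Ingenutrix, I have put back the rest of the answer as requested. As Brent said, you only need to compile the regular expression once. - Nadia Alramli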

5

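This question has already been answered satisfactorily, but for anyone who comes across it later, I have done some profiling of several different ways to accomplish this. In my case I needed uppercase hex digits, so modify as necessary to suit your needs.

Here are my test implementations: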

import re

hex_digits = set("ABCDEF1234567890")
hex_match = re.compile(r'^[A-F0-9]+\Z')
hex_search = re.compile(r'[^A-F0-9]')

def test_set(input):
    return set(input) <= hex_digits

def test_not_any(input):
    return not any(c not in hex_digits for c in input)

def test_re_match1(input):
    return bool(re.compile(r'^[A-F0-9]+\Z').match(input))

def test_re_match2(input):
    return bool(hex_match.match(input))

def test_re_match3(input):
    return bool(re.match(r'^[A-F0-9]+\Z', input))

def test_re_search1(input):
    return not bool(re.compile(r'[^A-F0-9]').search(input))

def test_re_search2(input):
    return not bool(hex_search.search(input))

def test_re_search3(input):
    return not bool(re.search(r'[^A-F0-9]', input))

The tests were run on Python 3.4.0 on Mac OS X:

import cProfile
import pstats
import random

# generate a list of 10000 random hex strings between 10 and 10009 characters long
# this takes a little time; be patient
tests = [ ''.join(random.choice("ABCDEF1234567890") for _ in range(l)) for l in range(10, 10010) ]

# set up profiling, then start collecting stats
test_pr = cProfile.Profile(timeunit=0.000001)
test_pr.enable()

# run the test functions against each item in tests. 
# this takes a little time; be patient
for t in tests:
    for tf in [test_set, test_not_any, 
               test_re_match1, test_re_match2, test_re_match3,
               test_re_search1, test_re_search2, test_re_search3]:
        _ = tf(t)

# stop collecting stats
test_pr.disable()

# we create our own pstats.Stats object to filter 
# out some stuff we don't care about seeing
test_stats = pstats.Stats(test_pr)

# normally, stats are printed with the format %8.3f, 
# but I want more significant digits
# so this monkey patch handles that
def _f8(x):
    return "%11.6f" % x

def _print_title(self):
    print('   ncalls     tottime     percall     cumtime     percall', end=' ', file=self.stream)
    print('filename:lineno(function)', file=self.stream)

pstats.f8 = _f8
pstats.Stats.print_title = _print_title

# sort by cumulative time (then secondary sort by name), ascending
# then print only our test implementation function calls:
test_stats.sort_stats('cumtime', 'name').reverse_order().print_stats("test_*")

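The results were as follows:

         50335004 function calls in 13.428 seconds

   Ordered by: cumulative time, function name
   List reduced from 20 to 8 due to restriction

   ncalls     tottime     percall     cumtime     percall filename:lineno(function)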
    10000    0.005233    0.000001    0.367360    0.000037 :1(test_re_match2)
    10000    0.006248    0.000001    0.378853    0.000038 :1(test_re_match3)
    10000    0.010710    0.000001    0.395770    0.000040 :1(test_re_match1)
    10000    0.004578    0.000000    0.467386    0.000047 :1(test_re_search2)
    10000    0.005994    0.000001    0.475329    0.000048 :1(test_re_search3)
    10000    0.008100    0.000001    0.482209    0.000048 :1(test_re_search1)
    10000    0.863139    0.000086    0.863139    0.000086 :1(test_set)
    10000    0.007414    0.000001    9.962580    0.000996 :1(test_not_any)

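where:

ncalls: the number of times the function was called
tottime: the total time spent in the given function, excluding time spent in sub-functions
percall: the quotient of tottime divided by ncalls
cumtime: the cumulative time spent in this function and all sub-functions
percall: the quotient of cumtime divided by primitive calls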

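The columns we actually care about are cumtime and the percall next to it, since they show the actual time from function entry to exit. As we can see, regex match and search do not differ greatly.

If you would otherwise compile the regex on every call, it is faster not to bother compiling it at all. Compiling once is about 7.5% faster than compiling every time, but only about 2.5% faster than not compiling.

test_set was roughly twice as slow as re_search and three times as slow as re_match.

test_not_any was a full order of magnitude slower than test_set.

TL;DR: use re.match or re.search.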
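
For a quick check on your own machine, a lighter-weight comparison can be sketched with timeit instead of cProfile. The function names and the sample string below are assumptions made for illustration, and the timings will vary with the Python version, the hardware, and the input length, so no expected numbers are shown.

```python
import re
import timeit

# Hypothetical names; the patterns mirror the precompiled variants above.
HEX_DIGITS = frozenset("ABCDEF0123456789")
HEX_MATCH = re.compile(r'[A-F0-9]+\Z')
HEX_SEARCH = re.compile(r'[^A-F0-9]')

def by_set(s):
    return set(s) <= HEX_DIGITS

def by_match(s):
    # Unlike the other two, this rejects the empty string because of `+`.
    return bool(HEX_MATCH.match(s))

def by_search(s):
    return not HEX_SEARCH.search(s)

sample = 'DEADBEEF' * 100  # an 800-character valid input, chosen arbitrarily

for func in (by_set, by_match, by_search):
    seconds = timeit.timeit(lambda: func(sample), number=2000)
    print('%-10s %.4f s' % (func.__name__, seconds))
```

Valid inputs are the worst case for all three approaches, since none of them can stop early.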

- KingRadical

hex_match = re.compile(r'^[A-F0-9]+$') matches "F00BAA\n" ... use \Z instead of $ - John Machin

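$ matches *before* the \n: >>> re.match(r'^[A-F0-9]+$', 'F00BAA\n').group(0) gives 'F00BAA'. \Z is only preferable if you explicitly want the match to fail when there is a newline at the end. - KingRadical

Read the second line of the OP's question: "no other characters". That calls for \Z. - John Machin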

3

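Use Python sets when you need to compare sets of data. A string can be turned into a set of characters very quickly. Here I test whether strings are allowed phone numbers. The first string is allowed, the second is not. It runs fast and is simple.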

In [17]: timeit.Timer("allowed = set('0123456789+-() ');p = set('+7(898) 64-901-63 ');p.issubset(allowed)").timeit()

Out[17]: 0.8106249139964348

In [18]: timeit.Timer("allowed = set('0123456789+-() ');p = set('+7(950) 64-901-63 фыв');p.issubset(allowed)").timeit()

Out[18]: 0.9240323599951807

Never use regular expressions if you can avoid them.

- remort

1

allowed_characters = 'hsjwnbs#'
def isValidName(string,allowed_chars):
  allowed_chars = set((allowed_chars))
  validation = set((string))
  return validation.issubset(allowed_chars)
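
A possible variation on the function above, sketched under the assumption that the allowed set is fixed: build it once as a frozenset and call issuperset, which accepts any iterable, so no temporary set is created for the input string. The names below are illustrative.

```python
# Hypothetical names; the allowed characters are taken from the answer above.
ALLOWED_CHARACTERS = frozenset('hsjwnbs#')

def is_valid_name(string, allowed=ALLOWED_CHARACTERS):
    # issuperset() accepts any iterable, so the string is scanned directly.
    return allowed.issuperset(string)

print(is_valid_name('hsj#'))
print(is_valid_name('hsjx'))
```

As with the other set-based checks, the empty string is accepted.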

- Yaver Javid

0

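A different approach, because in my case I also needed to check whether the string contained certain words (such as 'test' in this example), not just single characters: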

input_string = 'abc test'
input_string_test = input_string
allowed_list = ['a', 'b', 'c', 'test', ' ']

for allowed_list_item in allowed_list:
    input_string_test = input_string_test.replace(allowed_list_item, '')

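if not input_string_test:
    # test passed
    print('test passed')

So the allowed strings (characters or words) are cut out of the input string. If the input string contained only allowed strings, an empty string should be left over, and it should therefore pass if not input_string_test.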

- kasimir

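This code scans the full text once for each allowed string, which is O(n*k). If you are processing large texts, consider changing the code to loop over the characters only once, which brings the complexity down to O(n). - Bob Bobster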
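
One way to avoid the repeated replace passes is to compile the allowed characters and words into a single regular expression and match the whole input against it. This is only a sketch of that idea, with a made-up build_checker helper; Python's backtracking engine gives no strict linear-time guarantee, although a simple alternation like this is normally handled in one left-to-right scan.

```python
import re

# Hypothetical helper, not part of the answer above.
def build_checker(allowed_items):
    # Try longer items first so that words win over single characters.
    ordered = sorted(allowed_items, key=len, reverse=True)
    alternation = '|'.join(re.escape(item) for item in ordered)
    pattern = re.compile(r'(?:%s)*' % alternation)
    # fullmatch() succeeds only if the entire string is built from allowed items.
    return lambda s: pattern.fullmatch(s) is not None

is_allowed = build_checker(['a', 'b', 'c', 'test', ' '])
print(is_allowed('abc test'))
print(is_allowed('teast'))
```

The two approaches do not always agree: with the replace loop, 'teast' becomes 'test' once the 'a' is removed and then passes, whereas the pattern above rejects it because 'te' is not an allowed item.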

-1

Since Python 3.4 the re module has made this easier: you can use the fullmatch function.

import re

def check(string):
    # fullmatch succeeds only if the whole string matches the pattern
    pattern = r'[.a-z0-9]*'
    return bool(re.fullmatch(pattern, string))

- Nim

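I have updated the regular expression in this answer, because it did not work correctly before. I removed the caret (^, meaning start of string or match) because it is not needed with re.fullmatch(). Although '.' normally stands for any character and should be escaped when you search for it explicitly, inside a set in a raw string (r'raw string') you do not need (or want) to escape it. The asterisk (*) is also required in order to match all the characters. - Jeremy Davis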


- John Millikin · Accepted Answer

Here is a simple, pure-Python implementation. It should be used when performance is not critical (included for future Googlers).

import string
allowed = set(string.ascii_lowercase + string.digits + '.')

def check(test_str):
    return set(test_str) <= allowed

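Regarding performance, iteration will probably be the fastest method. Regexes have to iterate through a state machine, and the set equality solution has to build a temporary set. However, the difference is unlikely to matter much. If the performance of this function is very important, write it as a C extension module with a switch statement (which will be compiled to a jump table).

Here is a C implementation, which uses if statements due to space constraints. If you absolutely need the tiny bit of extra speed, write out the switch-case. In my tests it performs very well (2 seconds vs 9 seconds in benchmarks against the regex).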

#define PY_SSIZE_T_CLEAN
#include <Python.h>

static PyObject *check(PyObject *self, PyObject *args)
{
        const char *s;
        Py_ssize_t count, ii;
        char c;
        if (0 == PyArg_ParseTuple (args, "s#", &s, &count)) {
                return NULL;
        }
        for (ii = 0; ii < count; ii++) {
                c = s[ii];
                if ((c < '0' && c != '.') || c > 'z') {
                        Py_RETURN_FALSE;
                }
                if (c > '9' && c < 'a') {
                        Py_RETURN_FALSE;
                }
        }

        Py_RETURN_TRUE;
}

PyDoc_STRVAR (DOC, "Fast stringcheck");
static PyMethodDef PROCEDURES[] = {
        {"check", (PyCFunction) (check), METH_VARARGS, NULL},
        {NULL, NULL}
};
PyMODINIT_FUNC
initstringcheck (void) {
        Py_InitModule3 ("stringcheck", PROCEDURES, DOC);
}

Include it in your setup.py:

from distutils.core import setup, Extension
setup(
    name='stringcheck',
    ext_modules=[
        Extension('stringcheck', ['stringcheck.c']),
    ],
)

Usage:

>>> from stringcheck import check
>>> check("abc")
True
>>> check("ABC")
False