Empty vocabulary for single-letter tokens when using CountVectorizer

Question


python nlp vectorization feature-extraction countvectorizer


I am trying to convert a string into a numeric vector:

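import re

from sklearn.feature_extraction.text import CountVectorizer


### Clean the string
def names_to_words(names):
    print('a')  # debug trace
    # Replace every non-letter with a space, lowercase, then split on whitespace
    words = re.sub("[^a-zA-Z]", " ", names).lower().split()
    print('b')  # debug trace

    return words


### Vectorization
def Vectorizer():
    Vectorizer = CountVectorizer(
        analyzer="word",
        tokenizer=None,
        preprocessor=None,
        stop_words=None,
        max_features=5000)
    return Vectorizer


### Test a string
s = 'abc...'
r = names_to_words(s)
# Each element of r is treated as a separate document
feature = Vectorizer().fit_transform(r).toarray()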

But when the cleaned input looks like this:

 ['g', 'o', 'm', 'd']

I get an error:

ValueError: empty vocabulary; perhaps the documents only contain stop words

There seems to be a problem with single-character strings. What should I do? Thanks.

- LookIntoEast

So what do you want to do? Do you want these single-letter words to be included in your vocabulary? - Vivek Kumar

1 Answer


- Vivek Kumar · Accepted Answer

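The default token_pattern regular expression in CountVectorizer only selects words that have at least 2 characters, as stated in the documentation:

token_pattern : string

Regular expression denoting what constitutes a "token", only used if analyzer == 'word'. The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).

From the source code of CountVectorizer, the default token_pattern is r"(?u)\b\w\w+\b". The \w\w+ part requires one word character followed by at least one more, so a single-letter string such as 'g' never matches and every document ends up with no tokens.

Change it to r"(?u)\b\w+\b" to include 1-letter words.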
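The difference between the two patterns can be checked with Python's re module alone, without scikit-learn. This is only a minimal sketch: the constant and variable names (DEFAULT_PATTERN, SINGLE_CHAR_PATTERN, docs) are illustrative and do not come from the library.

```python
import re

# Default pattern quoted above: one word character followed by one or more
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Relaxed pattern: one or more word characters, so single letters match too
SINGLE_CHAR_PATTERN = r"(?u)\b\w+\b"

docs = ["g", "o", "m", "d", "abc"]

# Tokens each pattern extracts from every document
default_tokens = [re.findall(DEFAULT_PATTERN, doc) for doc in docs]
relaxed_tokens = [re.findall(SINGLE_CHAR_PATTERN, doc) for doc in docs]

for doc, default, relaxed in zip(docs, default_tokens, relaxed_tokens):
    print(f"{doc!r}: default={default} relaxed={relaxed}")
```

With the default pattern the four single-letter documents yield no tokens at all, which is why the vocabulary ends up empty; only 'abc' produces a token.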

Modify your code as suggested above by adding the token_pattern parameter:

Vectorizer= CountVectorizer(
                analyzer = "word",  
                tokenizer = None,  
                preprocessor = None, 
                stop_words = None,  
                max_features = 5000,
                token_pattern = r"(?u)\b\w+\b")
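As a quick check, the sketch below runs the input from the question through both the default settings and the modified pattern. It assumes scikit-learn is installed; the names docs, vectorizer and default_error are illustrative and not part of the original post. Note that CountVectorizer treats each element of the list as a separate document.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["g", "o", "m", "d"]  # the input from the question

# With the default token_pattern, no token is found and fitting fails
default_error = None
try:
    CountVectorizer(analyzer="word").fit_transform(docs)
except ValueError as exc:
    default_error = str(exc)
print(default_error)

# With the relaxed pattern, single letters become vocabulary entries
vectorizer = CountVectorizer(analyzer="word", token_pattern=r"(?u)\b\w+\b")
features = vectorizer.fit_transform(docs).toarray()

vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)  # the four letters, in alphabetical order
print(features)    # one row per document, one column per letter
```

Each row of features contains a single 1 in the column of the letter that document consists of.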