Pandas Series.apply doesn't work on string data

Question


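This seems to be a problem related to the Japanese language, so I have also asked the question on Japanese StackOverflow.

It works fine when I pass a plain string.

I tried changing the encoding, but I could not find the cause of this error. Could you give me some advice?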

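MeCab is an open-source text segmentation library for text written in Japanese, originally developed at the Nara Institute of Science and Technology and maintained by Taku Kudou (工藤拓) as part of his work on the Google Japanese Input project. https://en.wikipedia.org/wiki/MeCab

sample.csv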

0,今日も夜まで働きました。
1,オフィスには誰もいませんが、エラーと格闘中
2,デバッグばかりしていますが、どうにもなりません。

This is the Pandas Python 3 code:

import pandas as pd
import MeCab  
# https://en.wikipedia.org/wiki/MeCab
from tqdm import tqdm_notebook as tqdm
# This is working...
df = pd.read_csv('sample.csv', encoding='utf-8')

m = MeCab.Tagger ("-Ochasen")

text = "りんごを食べました、そして、みかんも食べました"
a = m.parse(text)

print(a)# working! 

# But I want to use Pandas's Series



def extractKeyword(text):
    """Morphological analysis of text and returning a list of only nouns"""
    tagger = MeCab.Tagger('-Ochasen')
    node = tagger.parseToNode(text)
    keywords = []
    while node:
        if node.feature.split(",")[0] == u"名詞": # this means noun
            keywords.append(node.surface)
        node = node.next
    return keywords



aa = extractKeyword(text) #working!!

me = df.apply(lambda x: extractKeyword(x))

#TypeError: ("in method 'Tagger_parseToNode', argument 2 of type 'char const *'", 'occurred at index 0')

This is the output and the traceback.

りんご リンゴ りんご 名詞-一般       
を   ヲ   を   助詞-格助詞-一般       
食べ  タベ  食べる 動詞-自立   一段  連用形
まし  マシ  ます  助動詞 特殊・マス   連用形
た   タ   た   助動詞 特殊・タ    基本形
、   、   、   記号-読点       
そして ソシテ そして 接続詞     
、   、   、   記号-読点       
みかん ミカン みかん 名詞-一般       
も   モ   も   助詞-係助詞      
食べ  タベ  食べる 動詞-自立   一段  連用形
まし  マシ  ます  助動詞 特殊・マス   連用形
た   タ   た   助動詞 特殊・タ    基本形
EOS

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-174-81a0d5d62dc4> in <module>()
    32 aa = extractKeyword(text) #working!!
    33 
---> 34 me = df.apply(lambda x: extractKeyword(x))

~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in apply(self, func, axis, broadcast, raw, reduce, args, **kwds)
4260                         f, axis,
4261                         reduce=reduce,
-> 4262                         ignore_failures=ignore_failures)
4263             else:
4264                 return self._apply_broadcast(f, axis)

~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in _apply_standard(self, func, axis, ignore_failures, reduce)
4356             try:
4357                 for i, v in enumerate(series_gen):
-> 4358                     results[i] = func(v)
4359                     keys.append(v.name)
4360             except Exception as e:

<ipython-input-174-81a0d5d62dc4> in <lambda>(x)
    32 aa = extractKeyword(text) #working!!
    33 
---> 34 me = df.apply(lambda x: extractKeyword(x))

<ipython-input-174-81a0d5d62dc4> in extractKeyword(text)
    20     """Morphological analysis of text and returning a list of only nouns"""
    21     tagger = MeCab.Tagger('-Ochasen')
---> 22     node = tagger.parseToNode(text)
    23     keywords = []
    24     while node:

~/anaconda3/lib/python3.6/site-packages/MeCab.py in parseToNode(self, *args)
    280     __repr__ = _swig_repr
    281     def parse(self, *args): return _MeCab.Tagger_parse(self, *args)
--> 282     def parseToNode(self, *args): return _MeCab.Tagger_parseToNode(self, *args)
    283     def parseNBest(self, *args): return _MeCab.Tagger_parseNBest(self, *args)
    284     def parseNBestInit(self, *args): return _MeCab.Tagger_parseNBestInit(self, *args)

TypeError: ("in method 'Tagger_parseToNode', argument 2 of type 'char const *'", 'occurred at index 0')

- YOSUKE

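What is title? Can you give some sample input from title? - Tai

@Tai Sorry, I misspoke; it isn't title, so I've fixed it. - YOSUKE

Can you show the full stack trace? - dkato

@YOSUKE, can you add a line to extractKeyword that prints text, to see which row of the CSV file triggers the error? - Tai

Sure, no problem. If there are more updates, I'll see whether I can help. Good luck. - Tai

2 Answers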

parseToNode failed on every call, so I had to put this line

 tagger.parseToNode('dummy')

before

 node = tagger.parseToNode(text)

and then it worked!

But I don't know why; maybe there is a problem with the parseToNode method...

def extractKeyword(text):
    """Morphological analysis of text and returning a list of only nouns"""
    tagger = MeCab.Tagger('-Ochasen')
    tagger.parseToNode('ダミー')  # dummy call: workaround so the next call succeeds
    node = tagger.parseToNode(text)
    keywords = []
    while node:
        if node.feature.split(",")[0] == u"名詞": # this means noun
            keywords.append(node.surface)
        node = node.next
    return keywords

- YOSUKE


- Ahmed Fasih · Accepted Answer

I see you got some help on Japanese StackOverflow, but here is an answer in English:

The first problem to fix is that read_csv treats the first line of sample.csv as a header. To fix this, pass the names argument to read_csv.
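As a small illustration of the header issue, the sketch below inlines the three rows of sample.csv with io.StringIO (an assumption made only so it runs without the file) and compares the default behaviour with names. The column names Number and String are the ones used later in this answer.

```python
import io
import pandas as pd

# The three rows from sample.csv, inlined so the sketch runs without the file.
csv_text = (
    "0,今日も夜まで働きました。\n"
    "1,オフィスには誰もいませんが、エラーと格闘中\n"
    "2,デバッグばかりしていますが、どうにもなりません。\n"
)

# Default: the first line is consumed as the header, so only two data rows remain.
df_default = pd.read_csv(io.StringIO(csv_text))
print(df_default.shape)
print(list(df_default.columns))

# With names=..., every line is read as data.
df_named = pd.read_csv(io.StringIO(csv_text), names=["Number", "String"])
print(df_named.shape)
```

With the default call, the first sentence ends up as a column label rather than as data, which is easy to miss when the file is small.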

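Next, df.apply by default applies the function to each column of the DataFrame, so extractKeyword receives a whole column (a Series) rather than a string. You would need something like df.apply(lambda x: extractKeyword(x['String']), axis=1), but that doesn't work because each sentence can have a different number of nouns, and Pandas complains that it can't stack a 1x5 array on top of a 1x2 array. The simplest approach is to call apply on the String Series.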
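To see what each apply variant actually passes to the function, here is a minimal sketch. type_name is a hypothetical stand-in for extractKeyword that only reports the type of its argument, and the DataFrame is built inline, so MeCab is not needed.

```python
import pandas as pd

df = pd.DataFrame({
    "Number": [0, 1, 2],
    "String": ["今日も夜まで働きました。", "オフィスには誰もいません", "デバッグばかり"],
})

def type_name(x):
    """Hypothetical stand-in for extractKeyword: report the argument's type."""
    return type(x).__name__

# DataFrame.apply (default axis=0) calls the function once per column,
# passing the whole column as a Series.
per_column = df.apply(type_name)
print(per_column["String"])

# Series.apply calls the function once per element, here with a plain str.
per_value = df["String"].apply(type_name)
print(per_value.tolist())
```

This is why the SWIG wrapper complains about 'char const *' in the question: it is handed a Series where it expects a string.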

The last problem is that the MeCab Python 3 bindings have a bug: see https://github.com/SamuraiT/mecab-python3/issues/3. You found a workaround by running parseToNode twice; calling parse before parseToNode also works.
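One possible way to package the workaround is sketched below. The function name extract_nouns, the explicit str check, and passing the tagger in as an argument are assumptions added for illustration, not part of the MeCab API. The tagger only needs to behave like MeCab.Tagger (parse, parseToNode, and nodes with feature, surface and next).

```python
def extract_nouns(tagger, text):
    """Return the surfaces of nodes whose first feature field is 名詞 (noun).

    `tagger` is assumed to behave like MeCab.Tagger.
    """
    if not isinstance(text, str):
        # Fail early with a clearer message than the SWIG 'char const *' error.
        raise TypeError(f"expected str, got {type(text).__name__}")
    tagger.parse(text)  # warm-up call: the workaround for the binding bug
    node = tagger.parseToNode(text)
    nouns = []
    while node:
        if node.feature.split(",")[0] == "名詞":
            nouns.append(node.surface)
        node = node.next
    return nouns

# Usage, assuming mecab-python3 and a dictionary are installed:
#   import MeCab
#   tagger = MeCab.Tagger("-Ochasen")   # built once, reused for every row
#   me = df["String"].apply(lambda s: extract_nouns(tagger, s))
```

Passing the tagger in avoids constructing a new MeCab.Tagger for every row, and the type check turns the mistake from the question into a more readable error.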

Putting these three fixes together:

import pandas as pd
import MeCab  
df = pd.read_csv('sample.csv', encoding='utf-8', names=['Number', 'String'])

def extractKeyword(text):
    """Morphological analysis of text and returning a list of only nouns"""
    tagger = MeCab.Tagger('-Ochasen')
    tagger.parse(text)
    node = tagger.parseToNode(text)
    keywords = []
    while node:
        if node.feature.split(",")[0] == u"名詞": # this means noun
            keywords.append(node.surface)
        node = node.next
    return keywords

me = df['String'].apply(extractKeyword)
print(me)

When you run this script with the sample.csv file you provided:

➜  python3 demo.py
0                  [今日, 夜]
1    [オフィス, 誰, エラー, 格闘, 中]
2                   [デバッグ]
Name: String, dtype: object