LookupError: Resource 'corpora/stopwords' not found

Question

4

I am trying to run a web app on Heroku using Flask. The web app is written in Python and uses NLTK (the Natural Language Toolkit library).

One of the files has the following header:

import nltk, json, operator
from nltk.corpus import stopwords 
from nltk.tokenize import RegexpTokenizer

When the web page that runs the stopword code is requested, it produces the following error:

LookupError: 
**********************************************************************
  Resource 'corpora/stopwords' not found.  Please use the NLTK  
  Downloader to obtain the resource:  >>> nltk.download()  
  Searched in:  
    - '/app/nltk_data'  
    - '/usr/share/nltk_data'  
    - '/usr/local/share/nltk_data'  
    - '/usr/lib/nltk_data'  
    - '/usr/local/lib/nltk_data'  
**********************************************************************

The exact code used:

#remove punctuation  
toker = RegexpTokenizer(r'((?<=[^\w\s])\w(?=[^\w\s])|(\W))+', gaps=True) 
data = toker.tokenize(data)  

#remove stop words and digits 
stopword = stopwords.words('english')  
data = [w for w in data if w not in stopword and not w.isdigit()]

When stopword = stopwords.words('english') is commented out, the web app on Heroku does not raise the LookupError.
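This behaviour is consistent with NLTK loading its corpora lazily: importing stopwords only creates a placeholder object, and the data is looked up on first use. The sketch below imitates that with a simplified stand-in. The LazyCorpus class, its available parameter, and the placeholder word list are hypothetical and are not NLTK's actual implementation.

```python
class LazyCorpus:
    """Simplified stand-in for a lazily loaded corpus (not NLTK's real class)."""

    def __init__(self, name, search_paths, available):
        self._name = name
        self._search_paths = search_paths
        # `available` is a hypothetical set of resources that exist "on disk".
        self._available = available

    def words(self, language):
        # The lookup happens here, on first use, not when the object is created.
        resource = "corpora/" + self._name
        if resource not in self._available:
            searched = "\n".join("    - " + repr(p) for p in self._search_paths)
            raise LookupError(
                "Resource %r not found.\n  Searched in:\n%s" % (resource, searched)
            )
        return ["the", "a", "an"]  # placeholder data for this sketch


# Creating the object (analogous to the import) succeeds even with no data.
stopwords = LazyCorpus("stopwords", ["/app/nltk_data"], available=set())

try:
    stopwords.words("english")  # analogous to the line that fails on Heroku
    failed = False
    message = ""
except LookupError as err:
    failed = True
    message = str(err)

print(message)
```

This is why the import at the top of the file succeeds on Heroku while the first call to stopwords.words fails.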

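The code runs without problems on my local computer. I installed the required libraries on my computer with:

pip install -r requirements.txt

The virtual environment that Heroku provides was active when I tested the code on my computer.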

I have also tried NLTK from two different sources, but the LookupError is still there. The two sources I used are:
http://pypi.python.org/packages/source/n/nltk/nltk-2.0.1rc4.zip
https://github.com/nltk/nltk.git

- user3534472

Try this link: https://github.com/heroku/heroku-buildpack-python/issues/444#issuecomment-850093747 - Harsh Gupta

2 Answers

5

Update

As Kenneth Reitz pointed out, a much simpler solution has been added to the Heroku Python buildpack. Add a file named nltk.txt to your root directory and list your corpora in it. See https://devcenter.heroku.com/articles/python-nltk for details.
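As a rough illustration of the nltk.txt approach, the sketch below writes such a file with one corpus identifier per line, which appears to be the format the linked article describes. write_nltk_txt is a hypothetical helper, and the demo writes into a temporary directory rather than a real project root.

```python
import tempfile
from pathlib import Path


def write_nltk_txt(project_root, corpora):
    """Write one corpus identifier per line to <project_root>/nltk.txt."""
    target = Path(project_root) / "nltk.txt"
    target.write_text("\n".join(corpora) + "\n", encoding="utf-8")
    return target


# Demo in a throwaway directory; this question only needs the stopwords corpus.
with tempfile.TemporaryDirectory() as demo_root:
    written = write_nltk_txt(demo_root, ["stopwords"])
    content = written.read_text(encoding="utf-8")

print(content)
```

In a real project the same one-line file can simply be created by hand and committed alongside requirements.txt.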

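Original answer

Here is a cleaner solution that lets you install the NLTK data directly on Heroku without adding it to your git repo.

I used similar steps to install Textblob on Heroku, which uses NLTK as a dependency. I made some minor adjustments to steps 3 and 4 so that they work for an NLTK-only installation.

The default Heroku buildpack includes a post_compile step that runs after all of the default build steps have completed: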

# post_compile
#!/usr/bin/env bash

if [ -f bin/post_compile ]; then
    echo "-----> Running post-compile hook"
    chmod +x bin/post_compile
    sub-env bin/post_compile
fi

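As you can see, it looks in your project directory for your own post_compile file in the bin directory and runs it if it exists. You can use this hook to install the nltk data.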

1. Create the bin directory in the root of your local project.

2. Add your own post_compile file to the bin directory.

# bin/post_compile
#!/usr/bin/env bash

if [ -f bin/install_nltk_data ]; then
    echo "-----> Running install_nltk_data"
    chmod +x bin/install_nltk_data
    bin/install_nltk_data
fi

echo "-----> Post-compile done"

3. Add your own install_nltk_data file to the bin directory.

# bin/install_nltk_data
#!/usr/bin/env bash

source $BIN_DIR/utils

echo "-----> Starting nltk data installation"

# Assumes NLTK_DATA environment variable is already set
# $ heroku config:set NLTK_DATA='/app/nltk_data'

# Install the nltk data
# NOTE: The following command installs the stopwords corpus,
# so you may want to change it for your specific needs.
# See http://www.nltk.org/data.html
python -m nltk.downloader stopwords

# If using Textblob, use this instead:
# python -m textblob.download_corpora lite

# Open the NLTK_DATA directory (abort if the variable is unset or empty,
# so the delete below never runs in the wrong directory)
cd "${NLTK_DATA:?NLTK_DATA is not set}"

# Delete all of the zip files
find . -name "*.zip" -type f -delete

echo "-----> Finished nltk data installation"

4. Add nltk to your requirements.txt file (or textblob if you are using Textblob).

5. Commit all of these changes to your repo.

6. Set the NLTK_DATA environment variable on your Heroku app.
```
$ heroku config:set NLTK_DATA='/app/nltk_data'
```
7. Deploy to Heroku. You will see the post_compile step trigger at the end of the deployment, followed by the nltk download.
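After deploying, one way to confirm that the hook worked is to check for the unpacked corpus directory under NLTK_DATA, for example from a heroku run python session. This is a minimal sketch that assumes the downloader leaves an unpacked corpora/stopwords directory behind (the script above deletes only the zip archives). corpus_installed is a hypothetical helper, not part of NLTK.

```python
import os


def corpus_installed(data_dir, corpus="stopwords"):
    """Return True if <data_dir>/corpora/<corpus> exists as a directory."""
    return os.path.isdir(os.path.join(data_dir, "corpora", corpus))


# NLTK_DATA is the variable set with `heroku config:set`; the fallback is the
# same path used in the answer above.
data_dir = os.environ.get("NLTK_DATA", "/app/nltk_data")
installed = corpus_installed(data_dir)
print("stopwords installed under %s: %s" % (data_dir, installed))
```

If this prints False on the dyno, the data was probably downloaded somewhere that is not on NLTK's search path.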

I hope you find this helpful! Enjoy!

- Michael Godshall

Important: v97 of the Heroku Python buildpack changed behaviour so that the nltk_data directory is omitted. See https://github.com/heroku/heroku-buildpack-python/issues/356 for the fix. - Dan Grigsby

- Gaurang · Accepted Answer

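The problem is that the corpus ("stopwords" in this case) does not get uploaded to Heroku. Your code works on your local machine because it already has the NLTK corpus. Follow these steps to solve the issue:

1. Create a new directory in your project (let's call it 'nltk_data').
2. Download the NLTK corpus into that directory. You will need to set it as the target directory during the download.
3. Tell nltk to look in this particular path. Just add nltk.data.path.append('path_to_nltk_data') to the Python file that actually uses nltk.
4. Now push the app to Heroku.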
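The steps above could be wired up roughly as follows. This is a sketch under the assumption that the nltk_data directory sits next to the Python file that uses NLTK. The helper names nltk_data_dir, add_search_path and configure_nltk are hypothetical; nltk.data.path and the download_dir argument of nltk.download are the NLTK pieces being relied on.

```python
import os


def nltk_data_dir(app_file):
    """Directory named 'nltk_data' next to the given Python file."""
    return os.path.join(os.path.dirname(os.path.abspath(app_file)), "nltk_data")


def add_search_path(search_path, data_dir):
    """Append data_dir to a search-path list unless it is already present."""
    if data_dir not in search_path:
        search_path.append(data_dir)
    return search_path


def configure_nltk(app_file):
    """Point NLTK at the project-local data directory (requires nltk)."""
    import nltk  # imported here so the helpers above work without nltk

    data_dir = nltk_data_dir(app_file)
    add_search_path(nltk.data.path, data_dir)
    return data_dir


# One-off, on the local machine, to populate the directory before committing:
#   import nltk
#   nltk.download("stopwords", download_dir=nltk_data_dir(__file__))
#
# Then, near the top of the module that uses the stopwords:
#   configure_nltk(__file__)
```

Building the path from the file's own location avoids hard-coding /app, so the same code should work both locally and on Heroku.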

Hope that solves the problem. It worked for me!