Python: extracting data from a file

Question



I have a text file that simply contains:

text1 text2 text text
text text text text

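I want to first count the number of strings in the file (all strings are separated by spaces), and then output the first two (text1 text2).

Any ideas?

Thanks in advance for the help.

Edit: here is what I have so far: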

>>> f=open('test.txt')
>>> for line in f:
    print line
ï»¿text1 text2 text text text text hello
>>> words=line.split()
>>> words
['\xef\xbb\xbftext1', 'text2', 'text', 'text', 'text', 'text', 'hello']
>>> len(words)
7
>>> if len(words) > 2:
...     print "there are more than 2 words"

The first problem I ran into is that my text file contains: text1 text2 text text text

But when I print the extracted words, I get: ['\xef\xbb\xbftext1', 'text2', 'text', 'text', 'text', 'text', 'hello']

Where does this '\xef\xbb\xbf' come from?

- scrayon


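What have you tried so far? What problems did you run into? This is fairly basic Python, but if you have a specific problem with your code, we can help. - Martijn Pieters

Updated in the original post. - scrayon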

1 Answer


- Martijn Pieters · Accepted Answer

To read a file line by line, just loop over the open file object in a for loop:

for line in open(filename):
    # do something with line

To split a line on whitespace into a list of separate words, use str.split():

words = line.split()
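
As a small illustration (written for Python 3, with a made-up sample line), calling str.split() with no arguments splits on any run of whitespace and drops the trailing newline, whereas split(" ") splits on single spaces only:

```python
# Hypothetical sample line with a double space, a tab and a trailing newline.
line = "text1  text2\ttext text\n"

# No argument: split on runs of any whitespace, no empty strings in the result.
words = line.split()

# Explicit single space: empty strings, tabs and the newline are kept.
by_space = line.split(" ")
```

Here words is ['text1', 'text2', 'text', 'text'], while by_space is ['text1', '', 'text2\ttext', 'text\n'].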

To count the number of items in a Python list, use len(yourlist):

count = len(words)

To select the first two items from a Python list, use slicing:

firsttwo = words[:2]

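I'll leave putting together the complete program to you, but you won't need more than the above, plus an if statement to check whether you already have your two words.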
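
For readers who want to see the pieces combined, here is one possible sketch in Python 3. The function name summarize_words and the sample data are hypothetical and not part of the original question; the function accepts any iterable of lines, so an open file object works as well as a list:

```python
def summarize_words(lines):
    """Count whitespace-separated words and collect the first two.

    `lines` is any iterable of strings, such as an open text file.
    """
    count = 0
    first_two = []
    for line in lines:
        words = line.split()
        count += len(words)
        if len(first_two) < 2:
            # Take only as many words as are still missing.
            first_two.extend(words[:2 - len(first_two)])
    return count, first_two


# Sample data mirroring the two lines shown in the question.
sample = ["text1 text2 text text\n", "text text text text\n"]
count, first_two = summarize_words(sample)
```

With a real file this would be called as summarize_words(f) inside a with open(filename) as f: block.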

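The three extra bytes you see at the start of your file are the UTF-8 BOM (Byte Order Mark); it marks your file as UTF-8 encoded, but it is redundant and only really used on Windows.

You can remove it with: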

import codecs
if line.startswith(codecs.BOM_UTF8):
    line = line[3:]
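
To connect this with the output in the question, the following Python 3 sketch checks what the BOM bytes are and strips them from a line read as bytes. The helper name strip_utf8_bom is hypothetical, and the cp1252 code page is only an assumption about the console that displayed "ï»¿":

```python
import codecs

# The UTF-8 BOM is exactly the three bytes seen in the question's output.
assert codecs.BOM_UTF8 == b"\xef\xbb\xbf"

# Decoded as UTF-8 it is the single character U+FEFF; decoded with a
# single-byte Windows code page (assumed: cp1252) it becomes three characters.
bom_as_utf8 = codecs.BOM_UTF8.decode("utf-8")
bom_as_cp1252 = codecs.BOM_UTF8.decode("cp1252")


def strip_utf8_bom(raw):
    """Return raw (bytes) without a leading UTF-8 BOM, if there is one."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw


first_line = strip_utf8_bom(b"\xef\xbb\xbftext1 text2 text text\n")
```

Slicing by len(codecs.BOM_UTF8) rather than a literal 3 keeps the slice length tied to the constant being tested.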

You probably want to decode your strings to Unicode using that encoding:

line = line.decode('utf-8')

You could also open the file with codecs.open() instead:

file = codecs.open(filename, encoding='utf-8')

Note that codecs.open() will not strip the BOM from the start of the text file for you; the easiest way to do that is to use the .lstrip() method:

import codecs
BOM = codecs.BOM_UTF8.decode('utf8')
with codecs.open(filename, encoding='utf-8') as f:
    for line in f:
        line = line.lstrip(BOM)
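
As an alternative that is not part of the answer above: Python's standard 'utf-8-sig' codec decodes UTF-8 and skips a leading BOM if one is present, and in Python 3 the built-in open() accepts an encoding argument. A minimal sketch, using a temporary file with made-up contents:

```python
import codecs
import os
import tempfile

# Write a small sample file that starts with a UTF-8 BOM (hypothetical contents).
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(codecs.BOM_UTF8 + b"text1 text2 text text\n")

# 'utf-8-sig' decodes UTF-8 and drops a leading BOM if there is one.
with open(path, encoding="utf-8-sig") as f:
    words = f.readline().split()

# Clean up the temporary sample file.
os.remove(path)
```

After this runs, words is ['text1', 'text2', 'text', 'text'], with no BOM attached to the first word.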