How do I parse a file listing in Python to get just the file names?

Question

python parsing scripting ftp ftplib

6

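Suppose I'm using Python's ftplib to retrieve a list of log files from an FTP server. How would I parse that file listing to get just the file names (the last column) and put them in a list? See the link above for example output.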

- Lawrence Johnston

8 Answers

8

Best Answer

You probably want to use ftp.nlst() instead of ftp.retrlines(). It will give you exactly what you want.
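As a rough illustration of the nlst() suggestion, the sketch below wraps the call in a small helper. The helper name `list_names` is made up for this example, `ftp` is assumed to be a connected and logged-in ftplib.FTP instance, and treating a 550 reply as "empty directory" is an assumption about server behavior rather than something every server does.

```python
from ftplib import error_perm


def list_names(ftp, *args):
    """Return the names reported by NLST as a list of strings.

    `ftp` is assumed to be a connected, logged-in ftplib.FTP instance.
    Any extra arguments (such as a directory) are passed on to nlst().
    """
    try:
        return ftp.nlst(*args)
    except error_perm as exc:
        # Assumption: some servers reply "550 ..." for an empty directory
        # instead of returning an empty listing.
        if str(exc).startswith("550"):
            return []
        raise
```

With a live connection, `list_names(ftp)` would list the current directory and `list_names(ftp, "logs")` a hypothetical `logs` directory.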

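If you can't, read the following:

Generators for sysadmin processes

In his now-famous tutorial, "Generator Tricks for Systems Programmers: An Introduction", David M. Beazley gives many recipes for this kind of data problem using quick, reusable code.

For example: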

# empty list that will receive all the log entries
log = []
# we pass a callback to replace the default print_line that retrlines would call
# we do this only because we have nothing better than retrlines to work with
ftp.retrlines('LIST', callback=log.append)
# we use rsplit because it is more efficient in our case if we have a big file
files = (line.rsplit(None, 1)[1] for line in log)
# get your file list
files_list = list(files)

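Why don't we generate the list right away?

Because doing it this way gives us more flexibility: before turning the result into files_list, you can apply any intermediate generator to filter the files, just like a pipeline. Adding a line adds a processing step without extra overhead (since it's a generator). And if you get rid of retrlines, it still works, but it's even better because you never store the list at all.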
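The pipeline idea can be sketched as follows. The function names and the sample listing lines are invented for this illustration; the lines only imitate the usual Unix `ls -l` layout and stand in for what retrlines would collect.

```python
def file_names(lines):
    """Yield the last whitespace-separated field of each listing line."""
    return (line.rsplit(None, 1)[1] for line in lines)


def with_suffix(names, suffix):
    """Yield only the names that end with `suffix`."""
    return (name for name in names if name.endswith(suffix))


# Made-up sample lines standing in for the collected LIST output.
sample = [
    "drwxrwsr-x 5 ftp-usr pdmaint 1536 Mar 20 09:48 .",
    "-rw-r--r-- 1 ftp-usr pdmaint 5305 Mar 20 09:48 access.log",
    "-rw-r--r-- 1 ftp-usr pdmaint 4711 Mar 20 09:48 README",
    "-rw-r--r-- 1 ftp-usr pdmaint 9120 Mar 21 10:02 error.log",
]

# Nothing is processed until list() pulls values through the pipeline.
log_files = list(with_suffix(file_names(sample), ".log"))
print(log_files)  # ['access.log', 'error.log']
```

Each extra filter is one more generator wrapped around the previous one, so a step can be added or removed without touching the others.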

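EDIT: Well, I read the comment on another answer, and it says this doesn't work if there are any spaces in the name.

Cool, this illustrates why this method is handy. If you want to change something in the process, you just change one line. Swap:

files = (line.rsplit(None, 1)[1] for line in log)

with

# split the line, take every field from index 8 onward, then join them with single spaces
files = (' '.join(line.split()[8:]) for line in log)

OK, this may not be obvious here, but for huge batch-processing scripts, it's nice :-)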
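One caveat with the join-based variant is that a run of several spaces inside a name collapses to a single space. A possible alternative, assuming the listing has exactly eight fields before the name (the usual Unix `ls -l` layout, which FTP servers are not required to use), is to cap the number of splits. The helper name and the sample lines are made up for this sketch.

```python
def name_from_line(line):
    """Return everything after the eighth field, keeping inner spacing."""
    fields = line.split(None, 8)
    # Lines with fewer than nine fields (for example a "total 12" header)
    # carry no file name in this layout.
    return fields[8] if len(fields) == 9 else None


# Made-up sample lines; the last name contains double spaces.
sample = [
    "total 12",
    "-rw-r--r-- 1 ftp-usr pdmaint 5305 Mar 20 09:48 access.log",
    "-rw-r--r-- 1 ftp-usr pdmaint 4711 Mar 20  2008 my  old  notes.txt",
]

names = [n for n in map(name_from_line, sample) if n is not None]
print(names)  # ['access.log', 'my  old  notes.txt']
```

Leading spaces in a name are still lost with this approach, because whitespace splitting cannot tell them apart from the column separator.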

- e-satis

This seems more robust than nlst, which in my case hung on an empty directory. - Lindlof

1

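If the FTP server supports the MLSD command, see the "single directory case" section of that answer.

Take an instance of the FTPDirectory class (say ftpd), call its .getdata method with a connected ftplib.FTP instance that is in the correct folder, and then you can do:

directory_filenames = [ftpfile.name for ftpfile in ftpd.files]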
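For servers that do support MLSD, Python 3.3 and later expose the command directly as ftplib.FTP.mlsd(), which yields (name, facts) pairs. The following is only a sketch of that route, not the FTPDirectory class referred to above; the helper name is hypothetical and `ftp` is assumed to be a connected ftplib.FTP instance.

```python
def names_via_mlsd(ftp, path=""):
    """Return the names of regular files that MLSD reports for `path`."""
    return [
        name
        for name, facts in ftp.mlsd(path, facts=["type"])
        # The "type" fact distinguishes files from directories; its value
        # is compared case-insensitively here.
        if facts.get("type", "").lower() == "file"
    ]
```

Because the server sends the name as a separate field, names containing spaces need no special handling here.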

- tzot

1

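If you're forced to use retrlines(), there's also a slightly less optimal approach: pass a function as the second argument to retrlines(), and it will be called for each item in the listing. So something like this would also work (assuming you have an FTP object named "ftp"):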

filenames = []
ftp.retrlines('LIST', lambda line: filenames.append(line.split()[-1]))

The 'filenames' list will then be a list of the file names.

- James Bennett

This won't work if a file name contains spaces (Mohit Ranka's answer may have the same problem, but I can't quite make sense of his code...) - Paige Ruten

1

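Since every file name in the output starts at the same column, all you have to do is get the position of the dot on the first line:

drwxrwsr-x 5 ftp-usr pdmaint 1536 Mar 20 09:48 .

Then slice the file name out of the other lines using that position as the starting index.

Since the dot is the last character on the line, you can use the length of the line minus 1 as the index. So the final code is something like this:

lines = []
# retrlines() returns only the server's response string, so collect the lines with a callback
ftp.retrlines('LIST', lines.append)

filename_index = len(lines[0]) - 1
files = []

for line in lines:
    files.append(line[filename_index:])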
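The column technique depends on the "." entry being the first line. A slightly more defensive sketch, assuming only that the "." entry appears somewhere in the listing and that the server aligns its columns, could search for that line instead. The function name and sample lines are made up for illustration.

```python
def names_by_column(lines):
    """Slice names out of `lines` using the column of the "." entry."""
    dot_line = next((line for line in lines if line.endswith(" .")), None)
    if dot_line is None:
        raise ValueError('no "." entry in the listing')
    column = len(dot_line) - 1
    # Lines shorter than the column (such as a "total 12" header) are skipped.
    return [line[column:] for line in lines if len(line) > column]


# Made-up, column-aligned sample lines.
sample = [
    "drwxrwsr-x 5 ftp-usr pdmaint 1536 Mar 20 09:48 .",
    "dr-xr-srwt 4 ftp-usr pdmaint  512 Mar 20 09:40 ..",
    "-rw-r--r-- 1 ftp-usr pdmaint 5305 Mar 20 09:48 my log.txt",
]

print(names_by_column(sample))  # ['.', '..', 'my log.txt']
```

Since the slice runs to the end of the line, names with spaces survive intact, but the result is only as reliable as the column alignment.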

- Jeremy Ruten

I think this is a fairly creative technique, but if you are listing the top-level directory, there may not be any dot entries in the listing. - Harmon

1

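Is there any reason you can't use ftplib.FTP.nlst()? I just checked, and it returns only the names of the files in the given directory.

- ayaz

Oops, OK. I hadn't noticed that James already suggested nlst(). - ayaz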

0

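I believe this should work for you.

file_name_list = [' '.join(each_file_detail.split()).split()[-1] for each_file_detail in file_list_from_log]

Notes -

Here I am assuming that you want the data in the program (as a list), not on the console.
each_file_detail is each line that the program produces.
' '.join(each_file_detail.split())

replaces multiple spaces with a single space.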

- Mohit Ranka

0

This gets a list of all file names along with their sizes. It also walks subdirectories.

def ftp_login():
    """ Future FTP stuff """

    from ftplib import FTP
    ftp = FTP()
    ftp.connect('phone', 2221)
    ftp.login('android', 'android')
    print("ftp.getwelcome():", ftp.getwelcome())
    all_files = []

    def walk(suffix, results):
        """ walk the path """
        files = []
        ftp.dir(suffix, files.append)  # callback collects each listing line
        # Filename could be any position on line so can't use line[52:] below
        # dr-x------   3 user group            0 Aug 27 16:32 Compilations
        for f in files:
            # Date format is either: MMM DD hh:mm or MMM DD  YYYY or MMM DD YYYY
            # Split off the first 8 fields only, so the name keeps its own
            # spaces (including double spaces) exactly as listed
            parts = f.split(None, 8)
            if len(parts) < 9:  # e.g. a "total N" header line
                continue
            size = parts[4]
            name = parts[8]
            if name in (".", ".."):  # never recurse into self or parent
                continue
            if f.startswith("d"):  # directory?
                new_suffix = suffix + name + "/"  # FTP paths use "/", not os.sep
                walk(new_suffix, results)  # back down the rabbit hole
            else:
                # /path/to/filename.ext <SIZE>
                results.append(suffix + name + " <" + size + ">")

    walk("/", all_files)  # 41 seconds
    print("len(all_files):", len(all_files))  # 4,074 files incl 163 + 289 subdirs

Output:

ftp.getwelcome(): 220 Service ready for new user.
len(all_files): 4074
/Compilations/Greatest Hits of the 80’s [Disc #3 of 3]/3-12 Poison.wav <47480228>
/Compilations/Greatest Hits of the 80’s [Disc #3 of 3]/3-12 Poison.mp3 <7343013>
/Compilations/Greatest Hits of the 80’s [Disc #3 of 3]/3-12 Poison.flac <31112653>
/Compilations/Greatest Hits of the 80’s [Disc #3 of 3]/3-12 Poison.oga <8075357>
/Compilations/Greatest Hits of the 80’s [Disc #3 of 3]/3-12 Poison.m4a <7662899>
/Compilations/Don't Let Me Be Misunderstood/07 House Of The Rising Sun (Quasimot.m4a <8015709>
/Compilations/Don't Let Me Be Misunderstood/01 Don't Let Me Be Misunderstood.m4a <33668167>
/Compilations/Don't Let Me Be Misunderstood/03 You're My Everything.m4a <12505304>
/Compilations/Don't Let Me Be Misunderstood/02 Gloria.m4a <8115224>
/Compilations/Don't Let Me Be Misunderstood/04 Black Pot.m4a <14617541>

- WinEunuuchs2Unix


- James Bennett · Accepted Answer

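In this case, using retrlines() probably isn't the best idea: by default it just prints to the console, so you'd have to do something tricky to even get at that output. A better option is likely the nlst() method, which returns exactly what you want: a list of the file names.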