Pandas iterrows: row number and percentage

7

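I'm iterating over a DataFrame with several thousand rows. Ideally, I'd like to know how far along the loop is: how many rows have been processed, what percentage of the total that is, and so on.

Is there a way to print the row number, or even better, the percentage of rows iterated so far?

My current code is below. Right now it prints some kind of tuple/list, but I only need the row number. This is probably simple.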

for row in testDF.iterrows():
    print("Currently on row: "+str(row))

Desired output:

Currently on row 1; Currently iterated 1% of rows
Currently on row 2; Currently iterated 2% of rows
Currently on row 3; Currently iterated 3% of rows
Currently on row 4; Currently iterated 4% of rows
Currently on row 5; Currently iterated 5% of rows

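Why are you using a loop at all? There is most likely a better way. If you must loop, you can use enumerate to track progress easily: it returns the position of the current item (along with the item itself), which you can divide by the total number of rows. for index, row in enumerate(testDF.iterrows()): ... progress = index / len(testDF) - DeepSpace
I'm looping with iterrows because I'm creating a new column of geocoded data. Most geocoding services have rate limits, so I also added a 0.1-second delay inside the loop. - christaylor
3 Answers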

11

First, iterrows yields tuples of (index, row). So the correct code is:

for index, row in testDF.iterrows():

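In general, the index is not the row number but some identifier (this is one of pandas' strengths, but it causes some confusion because it behaves differently from a plain Python list, where the index is the position). That is why we need to count rows separately. We could introduce line_number = 0 and increment it on every iteration with line_number += 1, but Python provides a ready-made tool: enumerate, which yields tuples (line_number, value) instead of just value. So we get the following code: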

for line_number, (index, row) in enumerate(testDF.iterrows()):
    print("Currently on row: {}; Currently iterated {}% of rows".format(
          line_number, 100*(line_number + 1)/len(testDF)))

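By the way, in Python 2 dividing two integers returns an integer, which is why 999/1000 == 0, probably not what you expect. So you can either cast to float or put the 100* first to get a percentage. In Python 3, / always performs true division.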
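The division remark can be checked with a few lines. This is only a sketch with made-up numbers (999 of 1000 rows done); `//` is used to reproduce what Python 2's `/` did for two integers, so the snippet behaves the same on Python 2 and 3.

```python
done = 999    # hypothetical number of rows processed so far
total = 1000  # hypothetical total number of rows

# Floor division: the result Python 2 gave for `done / total` with two ints
floor_ratio = done // total

# Casting one operand to float forces true division on any Python version
true_ratio = float(done) / total

# Multiplying by 100 first keeps a useful whole-number percentage
# even when the division itself floors
percent_floor = 100 * done // total

print(floor_ratio, true_ratio, percent_floor)
```

Multiplying first is the cheaper fix when a whole-number percentage is enough; the float cast is the one to use when decimals matter.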


5

One possible solution using format, if the index is unique and monotonically increasing (0,1,2,...):

for i, row in testDF.iterrows():
    print("Currently on row: {}; Currently iterated {}% of rows".format(i, (i + 1)/len(testDF.index) * 100))

Example:

import numpy as np
import pandas as pd

np.random.seed(1332)
testDF = pd.DataFrame(np.random.randint(10, size=(10, 3)))
print(testDF)
   0  1  2
0  8  1  9
1  4  3  5
2  0  1  3
3  1  8  6
4  7  4  7
5  7  5  3
6  7  9  9
7  0  1  2
8  1  3  4
9  0  0  3

for i, row in testDF.iterrows():
    print("Currently on row: {}; Currently iterated {}% of rows".format(i, (i + 1)/len(testDF.index) * 100))
Currently on row: 0; Currently iterated 10.0% of rows
Currently on row: 1; Currently iterated 20.0% of rows
Currently on row: 2; Currently iterated 30.0% of rows
Currently on row: 3; Currently iterated 40.0% of rows
Currently on row: 4; Currently iterated 50.0% of rows
Currently on row: 5; Currently iterated 60.0% of rows
Currently on row: 6; Currently iterated 70.0% of rows
Currently on row: 7; Currently iterated 80.0% of rows
Currently on row: 8; Currently iterated 90.0% of rows
Currently on row: 9; Currently iterated 100.0% of rows

Edit:

If the index holds custom values, one solution is to zip the rows with numpy.arange, using a range of the same length as the DataFrame.

np.random.seed(1332)
testDF = pd.DataFrame(np.random.randint(10, size=(10, 3)), index=[2,4,5,6,7,8,2,1,3,5])
print (testDF)
   0  1  2
2  8  1  9
4  4  3  5
5  0  1  3
6  1  8  6
7  7  4  7
8  7  5  3
2  7  9  9
1  0  1  2
3  1  3  4
5  0  0  3

for i, (idx, row) in zip(np.arange(len(testDF.index)), testDF.iterrows()):
    print("Currently on row: {}; Currently iterated {}% of rows".format(idx, (i + 1)/len(testDF.index) * 100))

Currently on row: 2; Currently iterated 10.0% of rows
Currently on row: 4; Currently iterated 20.0% of rows
Currently on row: 5; Currently iterated 30.0% of rows
Currently on row: 6; Currently iterated 40.0% of rows
Currently on row: 7; Currently iterated 50.0% of rows
Currently on row: 8; Currently iterated 60.0% of rows
Currently on row: 2; Currently iterated 70.0% of rows
Currently on row: 1; Currently iterated 80.0% of rows
Currently on row: 3; Currently iterated 90.0% of rows
Currently on row: 5; Currently iterated 100.0% of rows
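To see why the first loop only works with a default index, it may help to compute the percentage both ways on a small frame. This is a sketch, not part of the answer: the frame `df` and its labels 10, 20, 30, 40 are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose index labels are not row positions
df = pd.DataFrame(np.zeros((4, 2)), index=[10, 20, 30, 40])
total = len(df.index)

# Percentage computed from the index label, as in the first loop
by_label = [(idx + 1) / total * 100 for idx, _ in df.iterrows()]

# Percentage computed from a separate position counter, as in the zip version
by_position = [(pos + 1) / total * 100 for pos, _ in enumerate(df.iterrows())]

print(by_label)
print(by_position)
```

The label-based values run far past 100 because they depend on whatever the labels happen to be, while the position-based values end at exactly 100.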

Do you think it's better to print your way or like this? print('currently at row',i,'. iterated through ',100 * i / testDF.shape[0],'%') Why? Thanks for your answer. - Rayhane Mama
1
@RayhaneMama - I think there are many possible approaches, and yours works too. I prefer len(df.index) because it is the fastest way. - jezrael
1
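Note that i here is each row's index label. This works when the index holds the integers 0 to len(df)-1, but not when testDF uses custom index values. - Adrien Matissart
@AdrienMatissart - You're right, that makes it more complicated; I've added a solution. - jezrael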

2

For large DataFrames it is better to limit how often you print, since printing is time-consuming. Here is one way to do that:

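import numpy as np
import pandas as pd

dftest = pd.DataFrame(np.random.rand(10**5, 5))

percent = 0
n = len(dftest) // 100  # rows per 1% step; assumes at least 100 rows

# i doubles as the row position only because dftest has the default 0..N-1 index
for i, row in dftest.iterrows():
    if (i + 1) // n > percent:
        percent += 1
        print(percent, "% realized")
    dftest.iloc[i] = 2 * row  # a job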
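The loop above depends on the default 0..N-1 index and on the frame having at least 100 rows (otherwise n is 0 and the division fails). A variant that avoids both assumptions could look like the sketch below. The helper name `iter_with_progress` and its `report` parameter are made up for this example and are not part of pandas.

```python
import numpy as np
import pandas as pd


def iter_with_progress(df, report=print):
    """Yield (label, row) pairs from df.iterrows(), calling report(percent)
    each time the whole-number percentage of rows visited increases."""
    total = len(df)
    last_reported = 0
    # enumerate supplies the position, so the index labels can be anything
    for position, (label, row) in enumerate(df.iterrows(), start=1):
        percent = 100 * position // total
        if percent > last_reported:
            last_reported = percent
            report(percent)
        yield label, row


# Usage: 250 rows with non-consecutive labels; progress is collected in a list
df = pd.DataFrame(np.random.rand(250, 3), index=np.arange(250) * 7)
reported = []
row_sums = [row.sum() for _, row in iter_with_progress(df, report=reported.append)]
print(len(reported), "progress updates for", len(row_sums), "rows")
```

With the default report=print this prints at most 100 lines no matter how many rows there are; on frames shorter than 100 rows it reports once per row.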

Content provided by Stack Overflow.