Pandas iterrows: row number and percentage

Question

7

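I am iterating over a DataFrame with thousands of rows. Ideally I would like to know how far along the loop is, i.e. how many rows have been completed, what percentage of the total rows is done, and so on.

Is there a way to print the row number or, even better, the percentage of rows iterated so far?

My current code is below. Right now the print shows some tuple/list, but I only need the row number. This is probably simple.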

for row in testDF.iterrows():
    print("Currently on row: " + str(row))

Ideal printed output:

Currently on row 1; Currently iterated 1% of rows
Currently on row 2; Currently iterated 2% of rows
Currently on row 3; Currently iterated 3% of rows
Currently on row 4; Currently iterated 4% of rows
Currently on row 5; Currently iterated 5% of rows

- christaylor

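Why are you using a loop at all? There is most likely a better way. If you must loop, enumerate makes the progress easy to compute: it returns the position of the current item (along with the item itself), which you can divide by the total number of rows. for index, row in enumerate(testDF.iterrows()): ... progress = index / len(testDF) - DeepSpace

I am looping with iterrows because I am creating a new column with geocoded data. Most geocoding services have rate limits, so I also add a 0.1 second delay inside the loop. - christaylor

3 Answers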

5

One possible solution using format, if the index is unique and monotonically increasing (0, 1, 2, ...):

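for i, row in testDF.iterrows():
    # Multiply by 100 before dividing so the percentage stays exact (e.g. 30.0, not 30.000000000000004)
    print("Currently on row: {}; Currently iterated {}% of rows".format(i, 100 * (i + 1) / len(testDF.index)))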

Example:

import numpy as np
import pandas as pd

np.random.seed(1332)
testDF = pd.DataFrame(np.random.randint(10, size=(10, 3)))
print(testDF)
   0  1  2
0  8  1  9
1  4  3  5
2  0  1  3
3  1  8  6
4  7  4  7
5  7  5  3
6  7  9  9
7  0  1  2
8  1  3  4
9  0  0  3

for i, row in testDF.iterrows():
    print("Currently on row: {}; Currently iterated {}% of rows".format(i, 100 * (i + 1) / len(testDF.index)))
Currently on row: 0; Currently iterated 10.0% of rows
Currently on row: 1; Currently iterated 20.0% of rows
Currently on row: 2; Currently iterated 30.0% of rows
Currently on row: 3; Currently iterated 40.0% of rows
Currently on row: 4; Currently iterated 50.0% of rows
Currently on row: 5; Currently iterated 60.0% of rows
Currently on row: 6; Currently iterated 70.0% of rows
Currently on row: 7; Currently iterated 80.0% of rows
Currently on row: 8; Currently iterated 90.0% of rows
Currently on row: 9; Currently iterated 100.0% of rows

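Edit:

If the index holds custom values, one solution is to zip the rows with numpy.arange, where the range has the same length as the DataFrame, so that the counter tracks position rather than label.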

np.random.seed(1332)
testDF = pd.DataFrame(np.random.randint(10, size=(10, 3)), index=[2,4,5,6,7,8,2,1,3,5])
print(testDF)
   0  1  2
2  8  1  9
4  4  3  5
5  0  1  3
6  1  8  6
7  7  4  7
8  7  5  3
2  7  9  9
1  0  1  2
3  1  3  4
5  0  0  3

for i, (idx, row) in zip(np.arange(len(testDF.index)), testDF.iterrows()):
    print("Currently on row: {}; Currently iterated {}% of rows".format(idx, 100 * (i + 1) / len(testDF.index)))

Currently on row: 2; Currently iterated 10.0% of rows
Currently on row: 4; Currently iterated 20.0% of rows
Currently on row: 5; Currently iterated 30.0% of rows
Currently on row: 6; Currently iterated 40.0% of rows
Currently on row: 7; Currently iterated 50.0% of rows
Currently on row: 8; Currently iterated 60.0% of rows
Currently on row: 2; Currently iterated 70.0% of rows
Currently on row: 1; Currently iterated 80.0% of rows
Currently on row: 3; Currently iterated 90.0% of rows
Currently on row: 5; Currently iterated 100.0% of rows
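
To make the difference between label and position concrete, here is a minimal sketch. The frame `demo` and the helper `label_vs_position` are hypothetical names used only for illustration. The helper computes the percentage once from the index label and once from the position.

```python
import pandas as pd

def label_vs_position(df):
    """Return (label, label-based percent, position-based percent) per row."""
    total = len(df.index)
    rows = []
    for pos, (label, _row) in enumerate(df.iterrows()):
        # The label is whatever the index holds; pos is the true row number.
        rows.append((label, 100 * (label + 1) / total, 100 * (pos + 1) / total))
    return rows

# Hypothetical frame whose labels are not 0..n-1.
demo = pd.DataFrame({"a": [1, 2, 3, 4]}, index=[10, 20, 30, 40])
for label, by_label, by_pos in label_vs_position(demo):
    print(label, by_label, by_pos)
```

With labels such as 10, 20, 30, 40 the label-based figure runs well past 100%, while the position-based figure ends at exactly 100.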

- jezrael

Do you think it is better to print it your way, or like this? print('currently at row',i,'. iterated through ',100 * i / testDF.shape[0],'%') Why? Thanks for your answer. - Rayhane Mama

1

@RayhaneMama - I think there are many possible ways, and yours works too. I prefer len(df.index) because it is the fastest way. - jezrael

1

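Note that i here is the index label of each row. This works when the index contains the integers from 0 to len(df)-1, but not when testDF uses custom index values. - Adrien Matissart

@AdrienMatissart - You are right, that makes it more complicated; I have added a solution. - jezrael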

2

For large DataFrames it is better to limit the printed output, since printing is itself time-consuming. Here is one approach:

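import numpy as np
import pandas as pd

dftest = pd.DataFrame(np.random.rand(10**5, 5))

percent = 0
n = len(dftest) // 100  # rows per one percent; assumes at least 100 rows

for i, row in dftest.iterrows():
    # Print only when a new whole percent is reached (assumes the default 0..n-1 index)
    if (i + 1) // n > percent:
        percent += 1
        print(percent, "% realized")
    dftest.iloc[i] = 2 * row  # a job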
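
A variant sketch of the same throttling idea, not part of the original answer: it counts by position with enumerate, so it does not depend on the index labels or on the frame having at least 100 rows. The generator name `iterrows_with_progress` and its `step` and `report` parameters are hypothetical.

```python
import numpy as np
import pandas as pd

def iterrows_with_progress(df, step=10, report=print):
    """Yield (label, row) like iterrows, reporting every `step` percent."""
    total = len(df.index)
    reported = 0  # highest percent threshold already reported
    for pos, (label, row) in enumerate(df.iterrows(), start=1):
        percent = 100 * pos // total  # integer percent by position, not by label
        if percent >= reported + step:
            reported = percent - percent % step
            report("{}% done".format(reported))
        yield label, row

# Hypothetical frame with 20 rows and string labels.
demo = pd.DataFrame(np.arange(40).reshape(20, 2), index=list("abcdefghijklmnopqrst"))
for label, row in iterrows_with_progress(demo, step=25):
    pass  # the per-row job would go here
```

Passing a different `report` callable (for example a logger method) would redirect the messages without touching the loop body.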

- B. M.

- Leonid Mednikov · Accepted Answer

First, iterrows yields tuples of (index, row). So the correct code is:

for index, row in testDF.iterrows():

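In general the index is not the row number but some identifier (this is a strength of pandas, but it causes some confusion because it behaves differently from a plain Python list, where the index is the position). That is why we need to count rows independently. We could introduce line_number = 0 and increment it on each iteration with line_number += 1. But Python gives us a ready-made tool: enumerate, which yields tuples (line_number, value) instead of just value. So we end up with the following code: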

for line_number, (index, row) in enumerate(testDF.iterrows()):
    print("Currently on row: {}; Currently iterated {}% of rows".format(
          line_number, 100*(line_number + 1)/len(testDF)))
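
If one-based row numbers are wanted, as in the question's ideal output, enumerate also accepts a start argument. A small sketch, assuming a hypothetical helper `progress_lines` and a made-up frame `demo`:

```python
import pandas as pd

def progress_lines(df):
    """Build one progress message per row, counting rows from 1."""
    total = len(df)
    return [
        "Currently on row {}; Currently iterated {}% of rows".format(n, 100 * n / total)
        for n, (_index, _row) in enumerate(df.iterrows(), start=1)
    ]

# Hypothetical frame with non-numeric labels; the labels play no role in the count.
demo = pd.DataFrame({"city": ["Oslo", "Lima", "Pune", "Kyiv"]}, index=["w", "x", "y", "z"])
for line in progress_lines(demo):
    print(line)
```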

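By the way, in Python 2 dividing two integers returns an integer, which is why 999/1000 == 0, probably not what you expect. So you can either cast to float or put 100* at the front to get the percentage.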
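
The integer-division pitfall can be reproduced in Python 3 with the floor-division operator //, which behaves like Python 2's / on integers. A short sketch; the helper name `percent_done` is hypothetical:

```python
def percent_done(done, total):
    """Percentage as a float; the 100.0 forces float division in Python 2 as well."""
    return 100.0 * done / total

# Floor division mimics Python 2 integer division.
print(999 // 1000)              # truncates to 0
print(100 * 999 // 1000)        # multiplying first keeps the whole percent: 99
print(percent_done(999, 1000))  # 99.9
```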