What are the Python pandas equivalents of the R functions str(), summary(), and head()?

Question

python r pandas

81

I only know about the describe() function. Are there other functions similar to str(), summary(), and head()?

- megashigger

Maybe this link helps: http://pandas.pydata.org/pandas-docs/stable/basics.html - akrun

8 Answers

43

This gives output similar to R's str(), except that it shows each column's unique values rather than its first few values.

def rstr(df): return df.shape, df.apply(lambda x: [x.unique()])

print(rstr(iris))

((150, 5), sepal_length    [[5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.4, 4.8, 4.3,...
sepal_width     [[3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 2.9, 3.7,...
petal_length    [[1.4, 1.3, 1.5, 1.7, 1.6, 1.1, 1.2, 1.0, 1.9,...
petal_width     [[0.2, 0.4, 0.3, 0.1, 0.5, 0.6, 1.4, 1.5, 1.3,...
class            [[Iris-setosa, Iris-versicolor, Iris-virginica]]
dtype: object)

- jjurach

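This assumes df is a DataFrame or an object that has these methods. By contrast, str() works on any object, including tuples, matrices, and so on. For numpy arrays, np.info() is useful. - PatrickT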

32

summary() ~ describe()
head() ~ head()

I'm not sure about an equivalent for str().

- omer sagi

4

df.dtypes is a rough equivalent of str(). - yosemite_k

2

head()? Do you mean the .head() method, which only works on a few data types? - Hack-R

pandas' describe() reports count, mean, std, min, 25%, and so on, whereas R's summary() reports the minimum, first quartile, ..., maximum. So the two functions are not equivalent, only similar. - Suat Atan PhD

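The result of summary() depends on each column's data type, whereas describe() just tries to compute the mean of dates. I think they are similar only in very limited use cases (numeric-only DataFrames). - Olsgaard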
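
To illustrate the point about data types raised in the comments above, here is a minimal sketch of how describe() treats a frame with both numeric and text columns. The frame, its column names, and its values are made up for this example.

```python
import pandas as pd

# Small mixed-type frame; the column names and values are invented for illustration.
df = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, None],
    "sex": ["male", "female", "female", "male"],
})

# By default, describe() summarises only the numeric columns.
numeric_summary = df.describe()

# include="all" also covers the text column, reported as count/unique/top/freq.
full_summary = df.describe(include="all")

print(numeric_summary)
print(full_summary)
```

With include="all", statistics that do not apply to a column (for example the mean of a text column) are shown as NaN, so the result is closer to R's per-column summary() but still not identical.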

25

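Pandas provides an extensive comparison with R / R libraries. The most obvious difference is that R favors functional programming, while Pandas is object-oriented, with the DataFrame as its main object. Another difference between R and Python is that Python indexes arrays from 0, while R starts at 1.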

R               | Pandas
-------------------------------
summary(df)     | df.describe()
head(df)        | df.head()
dim(df)         | df.shape
slice(df, 1:10) | df.iloc[:10]
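
A short sketch of the mappings in the table, using a hypothetical 25-row frame invented for this example. It also shows the indexing difference: R's rows 1 to 10 are positions 0 to 9 in pandas.

```python
import pandas as pd

# Hypothetical 25-row frame used only to demonstrate the calls.
df = pd.DataFrame({"x": range(1, 26), "y": [v * 0.5 for v in range(25)]})

n_rows, n_cols = df.shape   # like dim(df) in R
first_rows = df.head()      # like head(df); pandas returns 5 rows by default
first_ten = df.iloc[:10]    # positions 0..9, i.e. R's rows 1..10

print(n_rows, n_cols)
print(first_rows)
print(first_ten)
```

Note that head() defaults differ: R shows 6 rows, pandas shows 5. Pass the count explicitly, as in df.head(6), if the two outputs need to match.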

- Martin Thoma

11

To get something equivalent to R's str() in Python, I use the dtypes attribute. It gives the data type of each column.

In [22]: df2.dtypes
Out[22]: 
Survived      int64
Pclass        int64
Sex          object
Age         float64
SibSp         int64
Parch         int64
Ticket       object
Fare        float64
Cabin        object
Embarked     object
dtype: object

- fubar2021

7

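I still prefer str() because it lists some example values. One confusing aspect of info is that its behavior depends on some environment settings, such as pandas.options.display.max_info_columns.

I think the best alternative is to call info with some extra arguments to force a fixed behavior: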

df.info(show_counts=True, verbose=True)  # on pandas older than 1.2, use null_counts=True instead
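
As a minimal sketch of forcing a fixed info layout, the example below assumes pandas 1.2 or later, where the counts option is named show_counts. The frame is made up for illustration. Passing buf captures the report as a string instead of printing it, which is handy for logging.

```python
import io

import pandas as pd

# Made-up frame with one missing value so the non-null counts differ per column.
df = pd.DataFrame({"a": [1, 2, None], "b": ["x", "y", "z"]})

# verbose=True always lists every column; show_counts=True always shows
# non-null counts, regardless of the pandas display options.
buffer = io.StringIO()
df.info(verbose=True, show_counts=True, buf=buffer)
info_text = buffer.getvalue()

print(info_text)
```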

Also, for your other functions:

summary(df)     | df.describe()
head(df)        | df.head()
dim(df)         | df.shape

- neves

3

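I don't think Pandas has a direct equivalent of str() (or glimpse() from dplyr) that provides the same information. I think an equivalent function should show the following:

The number of rows and columns in the data frame
The names of all columns
The data type stored in each column
A quick look at the first few values of each column

Building on @jjurach's answer, I wrote a helper function as a stand-in for R's str or glimpse, to get a quick overview of my data frames. Here is the code, with an example: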

import pandas as pd
import random

# an example dataframe to test the helper function
example_df = pd.DataFrame({
    "var_a": [random.choice(["foo","bar"]) for i in range(20)],
    "var_b": [random.randint(0, 1) for i in range(20)],
    "var_c": [random.random() for i in range(20)]
})

# helper function for viewing pandas dataframes
def glimpse_pd(df, max_width=76):

    # find the max string lengths of the column names and dtypes for formatting
    _max_len = max([len(col) for col in df])
    _max_dtype_label_len = max([len(str(df[col].dtype)) for col in df])

    # print the dimensions of the dataframe
    print(f"{type(df)}:  {df.shape[0]} rows of {df.shape[1]} columns")

    # print the name, dtype and first few values of each column
    for _column in df:

        _col_vals = df[_column].head(max_width).to_list()
        _col_type = str(df[_column].dtype)

        output_col = f"{_column}:".ljust(_max_len+1, ' ')
        output_dtype = f" {_col_type}".ljust(_max_dtype_label_len+3, ' ')

        output_combined = f"{output_col} {output_dtype} {_col_vals}"

        # trim the output if too long
        if len(output_combined) > max_width:
            output_combined = output_combined[0:(max_width-4)] + " ..."

        print(output_combined)

Running the function produces the following output:

glimpse_pd(example_df)
<class 'pandas.core.frame.DataFrame'>:  20 rows of 3 columns
var_a:  object    ['foo', 'bar', 'foo', 'foo', 'bar', 'bar', 'foo', 'bar ...
var_b:  int64     [0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, ...
var_c:  float64   [0.7346545694885085, 0.7776711488732364, 0.49558114902 ...

- Cameron Raynor

1

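I don't know R very well, but here are some pointers:

str =>

This one is harder... If you want to see which functions are available, you can use dir(). Running dir() on a dataset lists all of its methods, so maybe that's not what you want...

summary => describe.

Check its parameters to customize the result.

head => you can use head(), or use slices.

As you are doing now, use the head method, or slice the first 10 rows of a dataset ds with ds[:10]. Likewise, use the tail method or ds[-10:] for the last 10 rows.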
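
A quick sketch of the slice forms next to head() and tail(), using a made-up 100-row frame named ds. It also shows why the position of the minus sign matters: ds[:-10] drops the last 10 rows rather than selecting them.

```python
import pandas as pd

# Made-up 100-row frame; only the row positions matter here.
ds = pd.DataFrame({"value": range(100)})

first_ten = ds[:10]           # same rows as ds.head(10)
last_ten = ds[-10:]           # same rows as ds.tail(10)
all_but_last_ten = ds[:-10]   # not the tail: everything except the last 10 rows

print(first_ten)
print(last_ten)
print(len(all_but_last_ten))
```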

- Wakaru44

- reedcourty · Accepted Answer

In pandas, the info() method produces output very similar to R's str():

> str(train)
'data.frame':   891 obs. of  13 variables:
 $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
 $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
 $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
 $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
 $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
 $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
 $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
 $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
 $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
 $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Cabin      : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
 $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
 $ Child      : num  0 0 0 0 0 NA 0 1 0 1 ...


train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB