Python - What are the main improvements of Pandas over Numpy/Scipy?

Question

python numpy pandas scipy data-analysis

7

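I have been using numpy / scipy for data analysis. Recently I started learning Pandas.

I have gone through a few tutorials and am trying to understand what the major improvements of Pandas over Numpy / Scipy are.

It seems to me that the key idea of Pandas is to wrap up different numpy arrays in a data frame, with some utility functions around it.

So, is there something revolutionary about Pandas? Am I just being dense and missing something important?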

- CuriousMind

3 Answers

8

I feel like characterising Pandas as "improving on" Numpy/SciPy misses much of the point. Numpy/Scipy are quite focussed on efficient numeric calculation and solving numeric problems of the sort that scientists and engineers often solve. If your problem starts out with formulae and involves numerical solution from there, you're probably good with those two.

Pandas is much more aligned with problems that start with data stored in files or databases and which contain strings as well as numbers. Consider the problem of reading data from a database query. In Pandas, you can call read_sql_query directly and have a usable version of the data in one line. There is no equivalent functionality in Numpy/SciPy.
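To make the database point concrete, here is a minimal sketch. The in-memory SQLite database, the readings table and its contents are made up for illustration; only pd.read_sql_query itself comes from the answer.

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory database standing in for a real data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("a", 1.5), ("b", 2.5), ("a", 3.0)],
)
conn.commit()

# A single call returns a labelled table with a string column and a float column.
df = pd.read_sql_query("SELECT sensor, value FROM readings", conn)
conn.close()

print(df)
```

The resulting DataFrame keeps the column names from the query, so the string and numeric columns can be used by name straight away.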

For data featuring strings, or discrete rather than continuous values, Numpy/SciPy has no equivalent to the groupby capability or to the database-like joining of tables on matching values.
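As a rough illustration of groupby and database-like joins, the following sketch uses two small, invented tables (orders and customers); the column names are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical tables keyed by a string column.
orders = pd.DataFrame({
    "customer": ["ann", "bob", "ann", "cy"],
    "amount": [10.0, 20.0, 5.0, 7.5],
})
customers = pd.DataFrame({
    "customer": ["ann", "bob", "cy"],
    "city": ["Oslo", "Lima", "Oslo"],
})

# Group rows by a string key and aggregate each group.
per_customer = orders.groupby("customer")["amount"].sum()

# Database-style left join on matching values, then group by the joined column.
joined = orders.merge(customers, on="customer", how="left")
per_city = joined.groupby("city")["amount"].sum()

print(per_customer)
print(per_city)
```

Doing the same with plain numpy arrays would typically mean sorting, finding unique keys and matching positions by hand.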

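For time series, a datetime index makes the data much more convenient to handle: you can smoothly resample to different intervals, fill in missing values, and plot the series very easily.

Since many of my problems start their lives in spreadsheets, I am also very grateful for the relatively transparent handling of Excel files in both .xls and .xlsx formats through a uniform interface.

There is also a greater ecosystem, with packages like seaborn enabling more fluent statistical analysis and model fitting than is possible with the base numpy/scipy tools.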
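A short sketch of the time-series workflow, assuming a made-up series sampled every 12 hours; the dates and values are illustrative only.

```python
import pandas as pd

# Hypothetical series: six readings taken every 12 hours.
idx = pd.date_range("2024-01-01", periods=6, freq="12h")
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=idx)

# Downsample to daily means.
daily = s.resample("D").mean()

# Upsample to 6-hour steps, forward-filling the new gaps.
six_hourly = s.resample("6h").ffill()

print(daily)
print(six_hourly)
```

With matplotlib installed, calling s.plot() would draw the series with dates on the x-axis.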

- chthonicdaemon

1

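One main point is that it introduces new data structures such as data frames and panels, and has good interfaces to other structures and libraries. So in general it is more of a great extension to the Python ecosystem than an improvement over other libraries. For me it is a great tool among others such as numpy and bcolz. I often use it to reshape my data and get an overview before starting on data mining and the like.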

- PlagTag

Content provided by Stack Overflow.

- Travis Oliphant · Accepted Answer

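Pandas is not particularly revolutionary, and it does use the NumPy and SciPy ecosystem, along with some key Cython code, to accomplish its goals. It can be seen as a simpler API to that functionality, with the addition of key utilities like joins and simpler group-by capability that are particularly useful for people with table-like data or time series. But, while not revolutionary, Pandas does have key benefits.

For a while I had also perceived Pandas as just utilities on top of NumPy for those who liked the DataFrame interface. However, I now see Pandas as providing these key features (this list is not comprehensive):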

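1. Independent, per-column storage of disparate types (instead of the contiguous storage of structured arrays in NumPy); this allows faster processing in many cases.
2. Simpler interfaces to common operations (file loading, plotting, selection, and joining / aligning data) make it easy to do a lot of work in little code.
3. Index arrays, which mean that operations are always aligned instead of you having to keep track of alignment yourself.
4. Split-apply-combine is a powerful way of thinking about and implementing data processing.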
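The point about index alignment can be sketched as follows. The two series and their labels are invented for illustration; the contrast is between label-based arithmetic in Pandas and purely positional arithmetic on the underlying NumPy arrays.

```python
import numpy as np
import pandas as pd

# Two hypothetical series whose labels overlap only partly and are in a different order.
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0, 30.0], index=["z", "x", "w"])

# Pandas aligns on labels: "x" pairs with "x", "z" with "z";
# labels present on only one side ("y", "w") become NaN.
total = a + b

# Plain arrays add by position, regardless of what each element represents.
positional = a.to_numpy() + b.to_numpy()

print(total)
print(positional)
```

Here total["x"] is 1.0 + 20.0 and total["z"] is 3.0 + 10.0, whereas the positional result simply adds first to first, second to second, and so on.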

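However, there are downsides to Pandas:

1. Pandas is basically a user-interface library and is not particularly suited to writing library code. The "automatic" features can lull you into using them repeatedly even when you don't need them, slowing down code that gets called over and over again.
2. Pandas typically takes up more memory, as it is generous with the creation of object arrays to solve otherwise sticky problems such as string handling.
3. If your use case is outside the realm of what Pandas was designed to do, it gets clunky quickly. But within the realm of what it was designed to do, Pandas is powerful and easy to use for quick data analysis.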
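The memory point about object arrays can be checked with a rough sketch like the one below. The word list is made up, and dtype=object is requested explicitly here as an assumption, since recent Pandas versions also offer dedicated string dtypes that behave differently.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 3000 short ASCII strings of 5 characters each.
words = ["alpha", "bravo", "delta"] * 1000

# NumPy fixed-width byte strings: exactly 5 bytes per element.
fixed = np.array(words, dtype="S5")
fixed_bytes = fixed.nbytes

# Pandas object column: one pointer per element plus a Python str object each.
obj = pd.Series(words, dtype=object)
object_bytes = obj.memory_usage(index=False, deep=True)

print(fixed_bytes, object_bytes)
```

The exact object-column figure depends on the Python and Pandas versions, so it is better treated as an order-of-magnitude comparison than as a benchmark.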