Parsing a Twitter feed into a table with Python

3

I have a set of tweets saved to a .txt file.

I want to load certain attributes into a SQLite table from Python. I have managed to create the table.

import pandas
import sqlite3

conn = sqlite3.connect('twitter.db')
c = conn.cursor()

# Column names follow the keys used in the tweet JSON
c.execute("""CREATE TABLE Tweet
(
   created_at VARCHAR2(25),
   id VARCHAR2(25),
   text VARCHAR2(25),
   source VARCHAR2(25),
   in_reply_to_user_id VARCHAR2(25),
   retweet_count VARCHAR2(25)
)""")

Before adding the parsed data to the database, I tried to build a DataFrame so I could look at it.

tweets = pandas.read_table('file.txt', sep=',')

I get the error:

CParserError: Error tokenizing data. C error: Expected 63 fields in line 3, saw 69

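My guess is that the fields are not only separated by commas; there are also commas inside the strings.

Also, the Twitter data is in a format I have not worked with before. Each field starts with the variable name in quotes, followed by a colon and then the data, which is delimited by more quotes. For example: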

"created_at":"Fri Oct 11 00:00:03 +0000 2013",

So how do I convert this into a standard table format, with the variable names at the top?

A complete example of a tweet:

{"created_at":"Fri Oct 11 00:00:03 +0000 2013","id":388453908911095800,"id_str":"388453908911095809","text":"LAGI PUN VISITORS DATANG PUKUL 9 AH","source":"<a href=\"http://www.tweetdeck.com\" rel=\"nofollow\">TweetDeck</a>","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":447800506,"id_str":"447800506","name":"§yazwina·","screen_name":"_SAireen","location":"SSP","url":"http://flavors.me/syazwinaaireen#","description":"Absence makes the heart grow fonder. Stay us x @_DFitri's","protected":false,"followers_count":806,"friends_count":702,"listed_count":2,"created_at":"Tue Dec 27 08:29:53 +0000 2011","favourites_count":7478,"utc_offset":28800,"time_zone":"Beijing","geo_enabled":true,"verified":false,"statuses_count":32558,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"DBE9ED","profile_background_image_url":"http://a0.twimg.com/profile_background_images/378800000056283804/65d84665fbb81deba13427e8078a3eff.png","profile_background_image_url_https":"https://si0.twimg.com/profile_background_images/378800000056283804/65d84665fbb81deba13427e8078a3eff.png","profile_background_tile":true,"profile_image_url":"http://a0.twimg.com/profile_images/378800000264138431/fd9d57bd1b1609f36fd7159499a94b6e_normal.jpeg","profile_image_url_https":"https://si0.twimg.com/profile_images/378800000264138431/fd9d57bd1b1609f36fd7159499a94b6e_normal.jpeg","profile_banner_url":"https://pbs.twimg.com/profile_banners/447800506/1369969522","profile_link_color":"FA0096","profile_sidebar_border_color":"FFFFFF","profile_sidebar_fill_color":"E6F6F9","profile_text_color":"333333","profile_use_background_image":true,"default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"symbols":[],"urls":[],"user_mentions":[]},"favorited":false,"retweeted":false,"filter_level":"medium","lang":"it"}

2
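Unfortunately, you can't convert nested JSON directly into a flat table or pandas DataFrame, because they are fundamentally different structures. Take a look at Python's json library and pandas' read_json method. You will need to do some processing on the Twitter data to get it into tabular form. - Greg Reda
1 Answer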
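A minimal sketch of what that comment suggests, using only the standard library. It assumes the file holds one tweet JSON object per line (the question does not confirm this), that only top-level keys are wanted, and it uses an in-memory database plus two shortened sample lines in place of file.txt. The second sample tweet is made up to show a comma inside the text; the helper name tweet_rows is hypothetical.

```python
import json
import sqlite3

# Top-level tweet keys to keep; these mirror the columns in the question.
COLUMNS = ["created_at", "id", "text", "source",
           "in_reply_to_user_id", "retweet_count"]


def tweet_rows(lines):
    """Yield one tuple per tweet, assuming one JSON object per non-blank line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        tweet = json.loads(line)  # commas inside strings are handled by the parser
        yield tuple(tweet.get(col) for col in COLUMNS)


# Stand-in for: open('file.txt', encoding='utf-8')
sample_lines = [
    '{"created_at":"Fri Oct 11 00:00:03 +0000 2013","id":388453908911095800,'
    '"text":"LAGI PUN VISITORS DATANG PUKUL 9 AH","source":"TweetDeck",'
    '"in_reply_to_user_id":null,"retweet_count":0}',
    "",
    # Made-up tweet, only here to show an embedded comma surviving the parse.
    '{"created_at":"Fri Oct 11 00:00:04 +0000 2013","id":1,'
    '"text":"hello, world","source":"web",'
    '"in_reply_to_user_id":null,"retweet_count":2}',
]

conn = sqlite3.connect(":memory:")  # the question uses 'twitter.db'
conn.execute(
    "CREATE TABLE Tweet (created_at TEXT, id TEXT, text TEXT, source TEXT, "
    "in_reply_to_user_id TEXT, retweet_count INTEGER)"
)
conn.executemany("INSERT INTO Tweet VALUES (?, ?, ?, ?, ?, ?)",
                 tweet_rows(sample_lines))
conn.commit()

rows = conn.execute(
    "SELECT created_at, text, retweet_count FROM Tweet").fetchall()
print(rows)
```

Nested objects such as "user" are skipped here; pulling fields out of them would need an extra lookup like tweet["user"]["screen_name"].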

0

I imagine there is already a Python library that does this, but I was able to parse your tweet string into a dictionary once I replaced these unquoted terms:

 false to False 
 true to True
 null to None

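I simply assigned the whole braced expression to a variable, which created a dictionary. You can then iterate over it, printing the keys as headers and each value as an entry.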
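A rough sketch of that replace-then-evaluate idea, in case it helps. It assumes the three terms only need swapping when they directly follow a colon; a tweet whose text happens to contain something like ":true" would still be altered, so treat this as an illustration and not a robust parser. The function name tweet_to_dict and the shortened sample string are made up for the example.

```python
import ast
import re

# JSON literal -> Python literal, as listed above.
PY_LITERALS = {"true": "True", "false": "False", "null": "None"}


def tweet_to_dict(raw):
    """Swap the unquoted JSON terms, then evaluate the text as a Python dict."""
    converted = re.sub(
        r"(?<=:)(true|false|null)\b",      # only a bare term right after a colon
        lambda m: PY_LITERALS[m.group(1)],
        raw,
    )
    return ast.literal_eval(converted)     # safe evaluation of literals only


# Shortened version of the tweet from the question.
sample = ('{"created_at":"Fri Oct 11 00:00:03 +0000 2013","truncated":false,'
          '"geo":null,"retweet_count":0,"user":{"geo_enabled":true}}')

tweet = tweet_to_dict(sample)

header = list(tweet)                       # keys become the column headers
row = [tweet[key] for key in header]       # values become the entries
print(header)
print(row)
```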

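Fixing or quoting those three values might also make the pandas parser happier, although I think a csv reader would probably cope better with all the embedded commas and the single and double quotes. I suspect a JSON parser would still choke on the URLs that contain colons. If you want to try JSON, you could try escaping them.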
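On the concern about colons: a quick check with the standard json module suggests they are not a problem as long as they sit inside quoted strings, and that false and null are mapped to Python values without any manual replacement. The fragment below is cut down from the tweet in the question; the variable names are only for illustration.

```python
import json

# Raw string so the \" escapes reach the JSON parser unchanged.
fragment = (r'{"url":"http://flavors.me/syazwinaaireen#",'
            r'"source":"<a href=\"http://www.tweetdeck.com\" rel=\"nofollow\">TweetDeck</a>",'
            r'"verified":false,"geo":null}')

parsed = json.loads(fragment)

print(parsed["url"])        # colon inside the string value is kept as-is
print(parsed["source"])     # escaped quotes are unescaped by the parser
print(parsed["verified"], parsed["geo"])
```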

