How to read PGN data into a data frame

4

I have a single .pgn (Portable Game Notation) file that contains a large number of chess games. The games in the file look like this:

[Event "FIDE World Cup 2017"]
[Site "Tbilisi GEO"]
[Date "2017.09.05"]
[Round "1.1"]
[White "Carlsen, Magnus"]
[Black "Balogun, Oluwafemi"]
[Result "1-0"]
[WhiteTitle "GM"]
[BlackTitle "FM"]
[WhiteElo "2822"]
[BlackElo "2255"]
[ECO "B00"]
[Opening "King's pawn opening"]
[WhiteFideId "1503014"]
[BlackFideId "8501246"]
[EventDate "2017.09.03"]

1. e4 d6 2. d4 g6 3. Bc4 Nf6 4. Qe2 Nc6 5. Nf3 Bg7 6. O-O Bg4 7. c3 O-O         
8. h3 Bxf3 9. Qxf3 e5 10. Rd1 Qe8 11. d5 Ne7 12. Qe2 Nh5 13. Bb5 Qc8 
14. Na3 a6 15. Ba4 f5 16. Bc2 f4 17. Qg4 Qxg4 18. hxg4 Nf6 19. g5 Nd7 
20. Nc4 b6 21. b4 h6 22. gxh6 Bxh6 23. g4 Nf6 24. f3 Bg5 25. Kg2 Kg7 
26. a4 Bh4 27. Bd2 g5 28. Rh1 Ng6 29. Kf1 Rh8 30. Ke2 Bg3 31. a5 b5 32. 
Na3 Ne7 33. c4 c6 34. dxc6 Nxc6 35. Bc3 Rxh1 36. Rxh1 bxc4 37. Nxc4 Rb8 
38. Nxd6 Kg6 39. Nf5 1-0

[Event "FIDE World Cup 2017"]    
etc...

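I want to create a data frame from this data, where the column names are the tag names on the left of each quoted string (Event, Site, Date, and so on) and the values are the quoted strings. There should also be a separate column for the PGN move text.
I tried the approach from "R: How to read in a PGN as a Data Frame" and came up with the following: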
pgn <- read.table("~/Desktop/GitHub/Chess/test.pgn", quote="", 
stringsAsFactors=FALSE)

# get column names
column_names <- sub("\\[(\\w+).+", "\\1", pgn[1:17,1])
column_names[17] <- "PGN"
#create DF
pgn.df <- data.frame(matrix(sub("\\[\\w+ \\\"(.+)\\\"\\]", "\\1", 
                     pgn[,1]),byrow=TRUE, ncol=17))

names(pgn.df) <- column_names

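The problem is that my PGN move text spans multiple lines, so the games no longer line up with 17 rows each. Is there a way to account for that in the regex? Or is there a way to automatically rewrite the file so that the move text of each game is on a single line?
Thanks!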
3 Answers

1

I suggest using an (updated) substitution regex in a preprocessing step to remove the unnecessary line breaks, like this:

/(?:[^\[\]\n\S])\s*\n/ /g

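You can test it online here (with the PGN as the input text). However, I ran into problems escaping some of the special characters in R.
So I decided to use Perl instead.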
use strict;
use File::Slurp;
my $text = read_file($ARGV[0]);
$text =~ s/(?:[^\[\]\n\S])\s*\n/ /g;
write_file($ARGV[0], $text);

This can be called from R like this:

system("perl ~/Desktop/regex.pl ~/Desktop/test.pgn")
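
For readers who would rather stay in R, the same substitution can probably be done with gsub(). The sketch below assumes perl = TRUE (PCRE) and doubles every backslash, because an R string literal treats a single backslash as the start of an escape sequence. The helper name join_movetext and the sample text are made up for illustration. Note that the pattern only joins lines that end in trailing whitespace, as the move lines in the question's sample do.

```r
# Hypothetical helper: join move-text lines that end in trailing whitespace.
# Same pattern as the Perl one-liner, with each backslash doubled for R.
join_movetext <- function(text) {
  gsub("(?:[^\\[\\]\\n\\S])\\s*\\n", " ", text, perl = TRUE)
}

# Made-up sample: the tag line ends in "]", the move lines end in a space.
sample_text <- "[Result \"1-0\"]\n\n1. e4 d6 \n2. d4 g6 \n3. Bc4 Nf6 1-0\n"
joined <- join_movetext(sample_text)
cat(joined)
```

For a real file, the text could be read with paste(readLines(path), collapse = "\n") before calling the helper, where path is whatever location your .pgn file lives at.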

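The regex seems to work on the website, but it fails when I put it into my code. I get the following error message: `pgn <- sub("/?:[^\[\]\n])[\n]/g", pgn)` `Error: '\[' is an unrecognized escape in character string starting ""/?:[^\["` Do you know why? Could you show me how to do it in code? Thanks. - Griffin Kennedy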

1
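I have not tested this on Windows or Linux, but the C code that the package is based on claims to be very portable. You will need an R setup that can compile from source (for example, Rtools if you are on Windows).
Installation: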
devtools::install_github("hrbrmstr/pigeon")

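Usage (the tidyverse is not required for the package to work, but in my opinion it prints data frames more cleanly than the built-in base R print functions):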

library(pigeon)
library(tidyverse)

Here is a small test with a built-in dataset, which may be the one you are working with:
fide <- read_pgn(system.file("extdata", "r7.pgn", package="pigeon"))

fide
## # A tibble: 2 x 12
##            Event    Site       Date Round               White               Black  Result WhiteElo BlackElo   ECO
## *          <chr>   <chr>      <chr> <chr>               <chr>               <chr>   <chr>    <chr>    <chr> <chr>
## 1 World Cup 2017 Tbilisi 2017.09.23  44.1 Aronian Levon (ARM)    Ding Liren (CHN) 1/2-1/2     2799     2777   A18
## 2 World Cup 2017 Tbilisi 2017.09.24  45.1    Ding Liren (CHN) Aronian Levon (ARM) 1/2-1/2     2777     2799   E06
## # ... with 2 more variables: LiveChessVersion <chr>, Moves <list>

glimpse(fide)
## Observations: 2
## Variables: 12
## $ Event            <chr> "World Cup 2017", "World Cup 2017"
## $ Site             <chr> "Tbilisi", "Tbilisi"
## $ Date             <chr> "2017.09.23", "2017.09.24"
## $ Round            <chr> "44.1", "45.1"
## $ White            <chr> "Aronian Levon (ARM)", "Ding Liren (CHN)"
## $ Black            <chr> "Ding Liren (CHN)", "Aronian Levon (ARM)"
## $ Result           <chr> "1/2-1/2", "1/2-1/2"
## $ WhiteElo         <chr> "2799", "2777"
## $ BlackElo         <chr> "2777", "2799"
## $ ECO              <chr> "A18", "E06"
## $ LiveChessVersion <chr> "1.4.8", "1.4.8"
## $ Moves            <list> [c("c4", "Nf6", "Nc3", "e6", "e4", "d5", "cxd5", "exd5", "e5", "Ne4", "Nf3", "Bf5", "Be2"...

Here is a bigger test:
tf <- tempfile(fileext = ".zip")
td <- tempdir()
download.file("https://www.pgnmentor.com/players/Adams.zip",  tf)
fil <- unzip(tf, exdir = td)

adams <- read_pgn(fil)

adams
## # A tibble: 2,982 x 11
##             Event      Site       Date Round              White              Black  Result WhiteElo BlackElo   ECO
##  *          <chr>     <chr>      <chr> <chr>              <chr>              <chr>   <chr>    <chr>    <chr> <chr>
##  1 Lloyds Bank op    London 1984.??.??     1     Adams, Michael    Sedgwick, David     1-0                     C05
##  2 Lloyds Bank op    London 1984.??.??     3     Adams, Michael  Dickenson, Neil F     1-0              2230   C07
##  3 Lloyds Bank op    London 1984.??.??     4       Hebden, Mark     Adams, Michael     1-0     2480            B10
##  4 Lloyds Bank op    London 1984.??.??     5    Pasman, Michael     Adams, Michael     0-1     2310            D42
##  5 Lloyds Bank op    London 1984.??.??     6     Adams, Michael   Levitt, Jonathan 1/2-1/2              2370   B99
##  6 Lloyds Bank op    London 1984.??.??     9     Adams, Michael Saeed, Saeed Ahmed     1-0              2430   B56
##  7         BCF-ch Edinburgh 1985.??.??     1     Adams, Michael   Singh, Sukh Dave 1/2-1/2     2360     2080   B70
##  8         BCF-ch Edinburgh 1985.??.??     2 Abayasekera, Roger     Adams, Michael     1-0     2200     2360   B13
##  9         BCF-ch Edinburgh 1985.??.??     3     Adams, Michael    Jackson, Sheila 1/2-1/2     2360     2225   C85
## 10         BCF-ch Edinburgh 1985.??.??     4     Muir, Andrew J     Adams, Michael 1/2-1/2     2285     2360   E45
## # ... with 2,972 more rows, and 1 more variables: Moves <list>

glimpse(adams)
## Observations: 2,982
## Variables: 11
## $ Event    <chr> "Lloyds Bank op", "Lloyds Bank op", "Lloyds Bank op", "Lloyds Bank op", "Lloyds Bank op", "Lloyds ...
## $ Site     <chr> "London", "London", "London", "London", "London", "London", "Edinburgh", "Edinburgh", "Edinburgh",...
## $ Date     <chr> "1984.??.??", "1984.??.??", "1984.??.??", "1984.??.??", "1984.??.??", "1984.??.??", "1985.??.??", ...
## $ Round    <chr> "1", "3", "4", "5", "6", "9", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "?", "1", "...
## $ White    <chr> "Adams, Michael", "Adams, Michael", "Hebden, Mark", "Pasman, Michael", "Adams, Michael", "Adams, M...
## $ Black    <chr> "Sedgwick, David", "Dickenson, Neil F", "Adams, Michael", "Adams, Michael", "Levitt, Jonathan", "S...
## $ Result   <chr> "1-0", "1-0", "1-0", "0-1", "1/2-1/2", "1-0", "1/2-1/2", "1-0", "1/2-1/2", "1/2-1/2", "1-0", "1/2-...
## $ WhiteElo <chr> "", "", "2480", "2310", "", "", "2360", "2200", "2360", "2285", "2360", "2250", "2360", "2225", "2...
## $ BlackElo <chr> "", "2230", "", "", "2370", "2430", "2080", "2360", "2225", "2360", "2245", "2360", "2260", "2360"...
## $ ECO      <chr> "C05", "C07", "B10", "D42", "B99", "B56", "B70", "B13", "C85", "E45", "C84", "B10", "C85", "A22", ...
## $ Moves    <list> [c("e4", "e6", "d4", "d5", "Nd2", "Nf6", "e5", "Nfd7", "f4", "c5", "c3", "Nc6", "Ndf3", "cxd4", "...

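One benefit of using an established C "library" (strictly speaking it is not a library, but I shoehorned it in) is that it does more than pattern matching. If a game file is malformed, it will not parse correctly (as it should not). I still need to run it through ASAN/UBSAN/Valgrind to make sure there are no memory leaks, but let me know if this is useful to you and I will flesh out the details of the pkg.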
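
For contrast with the parser-based approach, here is a minimal pattern-matching sketch in base R that handles move text spread over several lines. It is not part of pigeon. The function name parse_pgn_lines and the sample games are made up, and the sketch assumes that every game starts with an [Event "..."] tag and that no move-text line looks like a tag pair. It does no validation, so a malformed file would be read without complaint.

```r
# Hypothetical helper: parse PGN text lines into a data frame with base R only.
# Assumes every game starts with an [Event "..."] tag.
parse_pgn_lines <- function(lines) {
  lines <- trimws(lines)
  lines <- lines[nzchar(lines)]                  # drop blank lines
  game_id <- cumsum(grepl("^\\[Event ", lines))  # new game at each Event tag
  games <- lapply(split(lines, game_id), function(g) {
    is_tag <- grepl("^\\[\\w+ \".*\"\\]$", g)
    tags <- g[is_tag]
    keys <- sub("^\\[(\\w+) .*$", "\\1", tags)
    vals <- sub("^\\[\\w+ \"(.*)\"\\]$", "\\1", tags)
    row <- as.list(setNames(vals, keys))
    row$PGN <- paste(g[!is_tag], collapse = " ")  # multi-line moves -> one string
    row
  })
  all_cols <- unique(unlist(lapply(games, names)))
  rows <- lapply(games, function(r) {
    r[setdiff(all_cols, names(r))] <- NA_character_  # fill tags a game lacks
    as.data.frame(r[all_cols], stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Made-up sample with two short games; the first has moves on two lines.
pgn_lines <- c(
  "[Event \"Test Open\"]",
  "[White \"Player A\"]",
  "[Result \"1-0\"]",
  "",
  "1. e4 d6 2. d4 g6 ",
  "3. Bc4 Nf6 1-0",
  "",
  "[Event \"Test Open\"]",
  "[White \"Player B\"]",
  "[Result \"1/2-1/2\"]",
  "",
  "1. d4 d5 1/2-1/2"
)
pgn_df <- parse_pgn_lines(pgn_lines)
print(pgn_df)
```

On a real file the input would come from readLines() on the .pgn path. Binding one-row data frames with rbind is slow for thousands of games, so this is meant as an illustration of the idea rather than a replacement for a real parser.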

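Wow, what a useful "library"! Thank you very much. Can I use your library on my own data, or do I have to use the data that already ships with pigeon? - Griffin Kennedy
For some datasets, e.g. KingBase, I get an error message like `Error: lexical error: invalid character inside string. "Bxh6"}],{"Event":"Mnster Open","Site":"Mnste (right here) ------^`. This happens even after I run the Perl code from @wp78de's answer. What should I do? - Parseltongue
@Parseltongue can you put a link or two in a GH issue? I may be able to come up with a workaround for malformed files. - hrbrmstr
Awesome. I have been racking my brain over this for two hours. I will create a GitHub issue now. - Parseltongue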

0

As an alternative, you can convert the .pgn file to .csv, which is the easiest file format for pandas to parse.

https://pypi.org/project/pgn2data/

from converter.pgn_data import PGNData as pgnd
import pandas as pd

# This creates two output files, one for game info 
# (white_elo, black_elo, rating_diff, time_control... etc), 
# and one for moves.
 
filename = 'path to .pgn file'
pgn_data = pgnd(filename)
result = pgn_data.export()
result.print_summary()

# Then read the csv with pandas
# Change path to where your files output

path = 'Documents/github/project/folder/'
df_info = pd.read_csv(path + '_game_info.csv')
df_moves = pd.read_csv(path + '_moves.csv')

How does this Python answer relate to this R question? - Martin Gal
