Convert a character string into a data frame

3

Here is a small sample of a larger string I have (it contains no spaces), holding the details of fictional individuals.

Individuals are separated by a `.`. Each individual has 10 attributes.

txt = "EREKSON(Andrew,Hélène),female10/06/2011@Geneva(Switzerland),PPF,2000X007707,dist.093,Dt.043/996.BOUKAR(Mohamed,El-Hadi),male04/12/1956@London(England),PPF,2001X005729,dist.097,Dt.043/997.HARIMA(Olak,N’nassik,Gerad,Elisa,Jeremie),female25/06/2013@Paris(France),PPF,2009X005729,dist.088,Dt.043/998.THOMAS(Hajil,Pau,Joëli),female03/03/1980@Berlin(Germany),VAT,2010X006016,dist.078,Dt.043/999."

I would like to parse this into a data frame with 10 columns, one per attribute, and one row per individual.

I tried regular expressions and looked at other text-extraction solutions on Stack Overflow, but could not get the desired output.

Based on the character input above, this is the final data frame I have in mind:

result = data.frame(first_names = c('Andrew Hélène','Mohamed El-Hadi','Olak N’nassik Gerad Elisa Jeremie','Hajil Pau Joëli'),
                    family_name = c('EREKSON','BOUKAR','HARIMA','THOMAS'),
                    gender = c('female','male','female','female'),
                    birthday = c('10/06/2011','04/12/1956','25/06/2013','03/03/1980'),
                    birth_city = c('Geneva','London','Paris','Berlin'),
                    birth_country = c('Switzerland','England','France','Germany'),
                    acc_type = c('PPF','PPF','PPF','VAT'),
                    acc_num = c('2000X007707','2001X005729','2009X005729','2010X006016'),
                    district = c('dist.093','dist.097','dist.088','dist.078'),
                    code = c('Dt.043/996','Dt.043/997','Dt.043/998','Dt.043/999'))

Thank you very much for your help.


I guess you can start with the following code: `library(tidyverse); txt %>% str_split("(?<=\\d)\\.(?=[A-Z])") %>% enframe %>% unnest(everything()) %>% mutate(value = str_split(value, "\\),")) %>% unnest_wider(value)`... - PaulS
It doesn't seem to split in the right way. - Varun
That is only a start: you need to do some more work on it. - PaulS
2 Answers

3
Here is a tidy solution using the tidyr functions separate_rows and extract:
library(tidyr)
data.frame(txt) %>%
  # separate `txt` into rows using the dot `.` *if* 
  # preceded by `Dt\\.\\d{3}/\\d{3}` as splitting pattern:
  separate_rows(txt, sep = "(?<=Dt\\.\\d{3}/\\d{3})\\.(?!$)") %>%
  extract(
          # select column from which to extract:
          txt,
          # define column names into which to extract:
          into = c("family_name","first_names","gender",
                   "birthday","birth_city","birth_country",
                   "acc_type","acc_num","district","code"),
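          # describe the string exhaustively using capturing groups
          # `(...)` to delimit what's to be extracted; `[^)]+` takes
          # everything up to the closing bracket, so first names may
          # contain accents, hyphens and apostrophes:
          regex = "([A-Z]+)\\(([^)]+)\\),([a-z]+)([\\d/]+)@(\\w+)\\((\\w+)\\),([A-Z]+),(\\w+),(dist\\.\\d+),(Dt\\.[\\d/]+)")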
Output (from an earlier version of the question's sample, before the names were changed):
# A tibble: 4 × 10
  family_name first_names    gender birthday   birth_city birth_country acc_type acc_num  
  <chr>       <chr>          <chr>  <chr>      <chr>      <chr>         <chr>    <chr>    
1 EREKSON     Andrew,Peter   male   10/06/2011 Geneva     Switzerland   PPF      2000X007…
2 OBAMA       Barack,Hussian male   04/12/1956 London     England       PPF      2001X005…
3 CLINTON     Hillary        female 25/06/2013 Paris      France        PPF      2009X005…
4 GATES       Melinda        female 03/03/1980 Berlin     Germany       VAT      2010X006…
# … with 2 more variables: district <chr>, code <chr>
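
For readers without tidyr, the same idea (split the string into records, then pull the fields out with capturing groups) can be sketched in base R. This is only a sketch under stated assumptions: the helper name parse_records is hypothetical, the sample is shortened to two records, and the pattern is adapted so that first names may contain any character except a closing bracket and so that district and code keep their dist. and Dt. prefixes, as in the question's desired result.

```r
# Two records in the same format as the question's `txt`
# ("\u00eb" is the letter e with diaeresis, written as an escape)
txt <- paste0(
  "BOUKAR(Mohamed,El-Hadi),male04/12/1956@London(England),",
  "PPF,2001X005729,dist.097,Dt.043/997.",
  "THOMAS(Hajil,Pau,Jo\u00ebli),female03/03/1980@Berlin(Germany),",
  "VAT,2010X006016,dist.078,Dt.043/999."
)

# Hypothetical helper: one row per record, one column per capturing group
parse_records <- function(x) {
  # Split on a "." that directly follows the record-ending code "Dt.ddd/ddd"
  records <- strsplit(x, "(?<=Dt\\.\\d{3}/\\d{3})\\.", perl = TRUE)[[1]]

  pattern <- paste0(
    "([A-Z]+)\\(([^)]+)\\),",          # family name, first names in (...)
    "([a-z]+)(\\d{2}/\\d{2}/\\d{4})",  # gender fused with the birthday
    "@(\\w+)\\((\\w+)\\),",            # birth city, birth country in (...)
    "([A-Z]+),(\\w+),",                # account type, account number
    "(dist\\.\\d+),(Dt\\.[\\d/]+)"     # district, code
  )

  # Each match is c(full_match, group 1, ..., group 10); drop the full match
  m <- regmatches(records, regexec(pattern, records, perl = TRUE))
  fields <- do.call(rbind, lapply(m, function(g) g[-1]))

  out <- as.data.frame(fields, stringsAsFactors = FALSE)
  names(out) <- c("family_name", "first_names", "gender", "birthday",
                  "birth_city", "birth_country", "acc_type", "acc_num",
                  "district", "code")

  # The question asks for first names separated by spaces, not commas
  out$first_names <- gsub(",", " ", out$first_names, fixed = TRUE)
  out
}

res <- parse_records(txt)
print(res)
```

A record that does not match the pattern would yield an empty match and break the rbind step, so real data would need a check for unmatched records first.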

Anyone who helps others with regular expressions deserves encouragement! Congratulations, Chris~ - LDT

1

Here is a tidyverse solution that chains several stringr functions to clean up the string and then lets readr read it, essentially as if it were a CSV:

library(dplyr, warn.conflicts = FALSE) # for pipes

df <- 
  txt %>% 
  
  # Replace "." sep with newline
  stringr::str_replace_all(
    "\\.[A-Z]", 
    function(x) stringr::str_replace(x, "\\.", "\n")
  ) %>% 
  
  # Replace all commas in (First[,Middle1,Middle2,...]) with space
  stringr::str_replace_all(
    # Match anything inside brackets, but as few times as possible, so we don't
    # match multiple brackets
    "\\(.*?\\)", 
    # Inside the regex that was matched, replace comma with space
    function(x) stringr::str_replace_all(x, ",", " ")
  ) %>% 
  
  # Replace ( with ,
  stringr::str_replace_all("\\(", ",") %>%
  
  # Remove )
  stringr::str_remove_all("\\)") %>%
  
  # Replace @ with ,
  stringr::str_replace_all("@", ",") %>%
  
  # Replace the trailing "." with a newline
  stringr::str_replace_all("\\.$", "\n") %>% 
  
  # Add , after female/male
  stringr::str_replace_all("male", "male,") %>% 
  
  # Read as comma delimited file (works since string contains \n)
  readr::read_delim(
    file = .,
    delim = ",",
    col_names = FALSE,
    show_col_types = FALSE
  )

# Add names (could also be done directly in read_delim with col_names argument)
names(df) <- c(
  "family_name",
  "first_names",
  "gender",
  "birthday",
  "birth_city",
  "birth_country",
  "acc_type",
  "acc_num",
  "district",
  "code"
)

df
#> # A tibble: 4 × 10
#>   family_name first_names      gender birthday birth_city birth_country acc_type
#>   <chr>       <chr>            <chr>  <chr>    <chr>      <chr>         <chr>   
#> 1 EREKSON     Andrew Hélène    female 10/06/2… Geneva     Switzerland   PPF     
#> 2 BOUKAR      Mohamed El-Hadi  male   04/12/1… London     England       PPF     
#> 3 HARIMA      Olak N’nassik G… female 25/06/2… Paris      France        PPF     
#> 4 THOMAS      Hajil Pau Joëli  female 03/03/1… Berlin     Germany       VAT     
#> # … with 3 more variables: acc_num <chr>, district <chr>, code <chr>

Created on 2022-03-20 by the reprex package (v2.0.1)

Note that more efficient regular expressions may well exist, but I think this approach is simpler and easier to change later.
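
Two of the steps above, replacing commas only inside brackets and separating the gender from the birthday, can be tried in isolation with base R. This is a minimal sketch on a shortened, ASCII-only record. As an assumption that differs from the answer, the gender step is anchored to the date that follows it instead of replacing every occurrence of "male".

```r
# One record in the question's format (shortened sample)
x <- "HARIMA(Olak,Gerad,Elisa),female25/06/2013@Paris(France),PPF,2009X005729,dist.088,Dt.043/998"

# Find every "(...)" group; [^)]* cannot run past the first closing bracket
m <- gregexpr("\\([^)]*\\)", x)

# Replace commas with spaces inside the matched groups only
regmatches(x, m) <- lapply(
  regmatches(x, m),
  function(s) gsub(",", " ", s, fixed = TRUE)
)

# Insert a comma between the gender and the birthday that follows it;
# the lookahead keeps a name that happens to contain "male" untouched
x <- sub("(female|male)(?=\\d{2}/)", "\\1,", x, perl = TRUE)

print(x)
```

Anchoring the gender match to the following date is a design choice for robustness. On the sample in the question, the plain "male" replacement in the answer gives the same result.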


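Thanks! I noticed that when a name contains special characters such as 'è','’','-','ï',''' etc., the names are not split correctly. - Varun
I slightly modified the input "txt" to cover individuals with 3 names (in some cases the original data frame even has 4 names). How can the code be adapted to handle multiple names? - Varun
This looks great, multiple names with special characters now seem to work well! I noticed that it still fails in only a few cases, and I have updated the question to reflect them. Do you have any idea why these fail while the others work? - Varun
Thanks. Does this general regex also recognize the character à? - Varun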
Got it, I need to brush up on regular expressions. Thank you very much! Accepting your answer. - Varun
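
On the question about à and other special characters: one way to check whether a character class covers them is to test it against sample names. The sketch below is an assumption, not the regex from either answer. It uses base R with perl = TRUE and relies on the PCRE build supporting the Unicode property \p{L}; the name name_class is hypothetical.

```r
# Sample first-name strings; non-ASCII letters are written as \u escapes
# so the example does not depend on the file's encoding
first_names <- c(
  "Andrew,H\u00e9l\u00e8ne",  # e with acute and grave accents
  "Mohamed,El-Hadi",          # hyphen
  "Olak,N\u2019nassik",       # typographic apostrophe
  "Hajil,Pau,Jo\u00ebli",     # e with diaeresis
  "\u00e0"                    # a with grave accent, on its own
)

# \p{L} is the Unicode "letter" property; the class also allows a comma,
# both apostrophe forms and a hyphen (placed last so it is literal)
name_class <- "^[\\p{L},'\u2019-]+$"

# TRUE for every string made up only of characters in the class
ok <- grepl(name_class, first_names, perl = TRUE)
print(ok)
```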