Converting text to sentence case in R

I am trying to clean the following data in R. I have a character vector that looks like this:
    /organization/-fame
    /ORGANIZATION/-QOUNTER
    /organization/-qounter
    /ORGANIZATION/-THE-ONE-OF-THEM-INC-
    /organization/0-6-com
    /ORGANIZATION/004-TECHNOLOGIES
    /organization/01games-technology
    /ORGANIZATION/0NDINE-BIOMEDICAL-INC
    /organization/0ndine-biomedical-inc
    /ORGANIZATION/0XDATA
    /organization/0xdata
    /ORGANIZATION/0XDATA
    /organization/0xdata
    /ORGANIZATION/1
    /organization/1
    /ORGANIZATION/1
    /organization/1-2-3-listo
    /ORGANIZATION/1-4-ALL
    /organization/1-618-technology
    /ORGANIZATION/1-800-DENTIST
    /organization/1-800-doctors
    /ORGANIZATION/1-800-PUBLICRELATIONS-INC-
    /organization/1-mainstream
    /ORGANIZATION/1-OF-99
    /organization/10-20-media
    /ORGANIZATION/10-20-MEDIA

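I want to change every word in each string to sentence case (first letter uppercase, the rest lowercase). After the change, the values should look like this: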
    /Organization/-Fame
    /Organization/-Qounter
    /Organization/-The-One-Of-Them-Inc-
    /Organization/0-6-Com
    /Organization/004-Technologies
    /Organization/01Games-Technology
    /Organization/0Ndine-Biomedical-Inc
    /Organization/0Xdata
    /Organization/1
    /Organization/1-2-3-Listo
    /Organization/1-4-All
    /Organization/1-618-Technology
    /Organization/1-800-Dentist
    /Organization/1-800-Doctors
    /Organization/1-800-Publicrelations-Inc-
    /Organization/1-Mainstream
    /Organization/1-Of-99
    /Organization/10-20-Media
1 Answer


You can use a regular expression. With the sample input you provided:

    x <- c("/organization/-fame", "/ORGANIZATION/-QOUNTER", "/organization/-qounter",
    "/ORGANIZATION/-THE-ONE-OF-THEM-INC-", "/organization/0-6-com",
    "/ORGANIZATION/004-TECHNOLOGIES", "/organization/01games-technology",
    "/ORGANIZATION/0NDINE-BIOMEDICAL-INC", "/organization/0ndine-biomedical-inc",
    "/ORGANIZATION/0XDATA", "/organization/0xdata", "/ORGANIZATION/0XDATA",
    "/organization/0xdata", "/ORGANIZATION/1", "/organization/1",
    "/ORGANIZATION/1", "/organization/1-2-3-listo", "/ORGANIZATION/1-4-ALL",
    "/organization/1-618-technology", "/ORGANIZATION/1-800-DENTIST",
    "/organization/1-800-doctors", "/ORGANIZATION/1-800-PUBLICRELATIONS-INC-",
    "/organization/1-mainstream", "/ORGANIZATION/1-OF-99", "/organization/10-20-media",
    "/ORGANIZATION/10-20-MEDIA")

you can run

    gsub("([[:alpha:]])([[:alpha:]]+)", "\\U\\1\\L\\2", x, perl=TRUE)

to get

     [1] "/Organization/-Fame"
     [2] "/Organization/-Qounter"
     [3] "/Organization/-Qounter"
     [4] "/Organization/-The-One-Of-Them-Inc-"
     [5] "/Organization/0-6-Com"
     [6] "/Organization/004-Technologies"
     [7] "/Organization/01Games-Technology"
     [8] "/Organization/0Ndine-Biomedical-Inc"
     [9] "/Organization/0Ndine-Biomedical-Inc"
    [10] "/Organization/0Xdata"
    [11] "/Organization/0Xdata"
    [12] "/Organization/0Xdata"
    [13] "/Organization/0Xdata"
    [14] "/Organization/1"
    [15] "/Organization/1"
    [16] "/Organization/1"
    [17] "/Organization/1-2-3-Listo"
    [18] "/Organization/1-4-All"
    [19] "/Organization/1-618-Technology"
    [20] "/Organization/1-800-Dentist"
    [21] "/Organization/1-800-Doctors"
    [22] "/Organization/1-800-Publicrelations-Inc-"
    [23] "/Organization/1-Mainstream"
    [24] "/Organization/1-Of-99"
    [25] "/Organization/10-20-Media"
    [26] "/Organization/10-20-Media"
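
A note on how the pattern works, plus a small sketch. The first group `([[:alpha:]])` captures one letter and the second group `([[:alpha:]]+)` captures the letters that follow it. In the replacement, `\\U` uppercases the first group and `\\L` lowercases the second, which requires `perl=TRUE`. Digits, slashes, and hyphens are not matched, so a new word starts after each of them. Because the pattern needs at least two letters, an isolated single letter is left untouched. The desired output in the question lists each name only once, so the sketch below assumes deduplication is also wanted (the question does not say so explicitly). The helper name `to_title_case` and the shortened input vector are made up for illustration.

```r
# Hypothetical helper wrapping the gsub() call from the answer.
to_title_case <- function(s) {
  # Group 1: first letter of a run of letters -> uppercase (\\U)
  # Group 2: remaining letters of the run     -> lowercase (\\L)
  gsub("([[:alpha:]])([[:alpha:]]+)", "\\U\\1\\L\\2", s, perl = TRUE)
}

# Shortened sample taken from the question's data.
x_small <- c("/organization/0xdata",
             "/ORGANIZATION/0XDATA",
             "/ORGANIZATION/1-800-DENTIST")

# Assumption: duplicates should be dropped after normalising the case.
result <- unique(to_title_case(x_small))
print(result)
```

Dropping `unique()` keeps one output element per input element, as in the answer.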

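The code above did not give me the answer. It gave me a warning message: Warning message: In `[<-.factor`(`*tmp*`, 1, value = c(NA, 3L, 2L, 4L, 5L, 6L, 7L, : invalid factor level, NA generated, and the strings were converted to NA. Many other strings were converted to NA as well. - snk
Did you try it with the data I provided? The error sounds like you have a factor vector rather than a character vector, and you are trying to reassign into that vector. That information was not included in your original post. - MrFlick
So would your solution work if I convert the factor vector to a character vector? Let me try. - snk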
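
For the factor case raised in the comments, here is a minimal sketch, assuming the data is stored as a factor (the variable names `f`, `chr`, and `res` are made up for illustration). Assigning the new strings back into the factor triggers the "invalid factor level" warning because they are not existing levels. Converting to character first and keeping the result as a character vector avoids that.

```r
# Hypothetical factor input, as the warning in the comments suggests.
f <- factor(c("/organization/-fame",
              "/ORGANIZATION/-QOUNTER",
              "/organization/-qounter"))

# Convert to a plain character vector before transforming.
chr <- as.character(f)

# Same gsub() call as in the answer, applied to the character vector.
res <- gsub("([[:alpha:]])([[:alpha:]]+)", "\\U\\1\\L\\2", chr, perl = TRUE)

# Store the result in a new variable instead of assigning element-wise
# into f, which is what produces NA values in the factor.
print(res)
```

If a factor is needed afterwards, `factor(res)` builds a new one from the cleaned strings.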
