在OLS线性回归中找到p值和z统计量

4
我能够从线性回归中找到系数和截距,但无法找到适合获取各自变量趋势的p值和z值的方法。此外,我也无法找到一种将输出结果保存为Excel格式的方法。数据在这里。共有24个变量与时间相关。我没有得到z统计量和p值,同时第一种方法的估计值也不正确。我错在哪里了?
library("trend")

# read ozone data (I converted to a text file first)
otm <- read.table("D:/data.txt",header=T)


#  make a data frame version
otm_df <- data.frame(otm)
markers <- sample(0:1, replace = T, size = 11)

# calculate OLS slope for all columns
# the -1 at end removes the intercepts
ols <- sapply(otm_df, function(x) coef(lm(markers ~ x))[-1])

我尝试了这个方法,但是我没有得到z统计量并且无法将其保存为Excel格式。
library(reshape2)
DF <- reshape2::melt(otm, id.var = "Year")
library(broom); library(tidyverse)
ols <- DF %>% nest(data = -variable) %>% 
  mutate(model = map(data, ~lm(value ~ Year, data = .)), 
         tidied = map(model, tidy)) %>% 
  unnest(tidied)

#to save the results in excel format (not working here for me)
capture.output(summary(ols), file = "ols.csv" )
write.csv(ols, file.path('E:/',filename = "ols2.csv"), row.names = TRUE) 

# A tibble: 48 x 8
   variable data              model  term         estimate std.error statistic p.value
   <fct>    <list>            <list> <chr>           <dbl>     <dbl>     <dbl>   <dbl>
 1 BanTES   <tibble [11 x 2]> <lm>   (Intercept) -236.       488.       -0.483   0.641
 2 BanTES   <tibble [11 x 2]> <lm>   Year           0.139      0.242     0.572   0.582
 3 SriTES   <tibble [11 x 2]> <lm>   (Intercept)  220.       351.        0.627   0.546
 4 SriTES   <tibble [11 x 2]> <lm>   Year          -0.0935     0.174    -0.536   0.605
 5 AfgTES   <tibble [11 x 2]> <lm>   (Intercept)  364.       444.        0.820   0.434
 6 AfgTES   <tibble [11 x 2]> <lm>   Year          -0.161      0.221    -0.730   0.484
 7 BhuTES   <tibble [11 x 2]> <lm>   (Intercept)  373.       831.        0.449   0.664
 8 BhuTES   <tibble [11 x 2]> <lm>   Year          -0.170      0.413    -0.412   0.690
 9 IndTES   <tibble [11 x 2]> <lm>   (Intercept) -342.       213.       -1.60    0.143
10 IndTES   <tibble [11 x 2]> <lm>   Year           0.190      0.106     1.80    0.106 

summary(ols)

    variable  data.Length  data.Class  data.Mode model.Length  model.Class  model.Mode     term          
 BanTES : 2   2       tbl_df  list               12    lm    list                      Length:48         
 SriTES : 2   2       tbl_df  list               12    lm    list                      Class :character  
 AfgTES : 2   2       tbl_df  list               12    lm    list                      Mode  :character  
 BhuTES : 2   2       tbl_df  list               12    lm    list                                        
 IndTES : 2   2       tbl_df  list               12    lm    list                                        
 NepTES : 2   2       tbl_df  list               12    lm    list                                        
 (Other):36   2       tbl_df  list               12    lm    list  

任何帮助都将非常有用。提前感谢您!

help("summary.lm") - Roland
2
当然,这是一个t统计量而不是z统计量。 - Roland
你确定在Excel中使用的是相同的DV和IV吗?初学者有时会混淆这些。 - Roland
3
请提供一个最小可复现示例。 - Roland
你是不是想要使用 map(model, tidy)?你现在使用的是 map(models, tidy) - Robin Gertenbach
显示剩余4条评论
3个回答

3

如@Roland所言,线性回归不提供Z统计量,而是提供t统计量。您可以使用下面的代码以Excel格式保存系数、t统计量和p值。

library(tidyverse)
library(broom)
library(openxlsx)

# read ozone data (I converted to a text file first)
otm <- read.table("data.txt", header=T)
head(otm, 2)

otm %>% 
  pivot_longer(-c("Year")) %>%
  group_by(name) %>% 
  do(fitlm = tidy(lm(value ~ Year, data = ., na.action=na.omit))) %>% 
  unnest(fitlm) %>% 
  write.xlsx("Linear_trend_1.xlsx")

在 "Linear_trend_1.xlsx" 文件中,每个位置的 Year 行提供了斜率,而统计量是 t 统计量。
由于第一种方法使用标记作为因变量,并将 otm_df 中的所有变量用作自变量,所以第一种方法的值与第二种方法不匹配。在第二种方法中,年份被用作自变量,而所有地点的数值则为因变量。
另外,您正在使用样本函数创建标记,每次运行都会给出不同的输出结果。因此,您必须使用 set.seed 使其对每次运行保持恒定。
set.seed(123)
otm %>% 
  mutate(markers = sample(0:1, replace = T, size = 11)) %>% 
  pivot_longer(-c("Year", "markers")) %>%
  group_by(name) %>% 
  do(fitlm = tidy(lm(markers ~ value, data = ., na.action=na.omit))) %>% 
  unnest(fitlm) %>% 
  write.xlsx("Linear_trend_2.xlsx")

# A tibble: 48 x 6
   name   term        estimate std.error statistic p.value
   <chr>  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
 1 AfgERA (Intercept)  -8.58      7.79      -1.10    0.299
 2 AfgERA value         0.577     0.498      1.16    0.276
 3 AfgTES (Intercept)  -1.62      2.99      -0.542   0.601
 4 AfgTES value         0.0521    0.0750     0.695   0.505
 5 AfgTOM (Intercept) -25.3      19.5       -1.30    0.225
 6 AfgTOM value         1.94      1.46       1.33    0.218
 7 BanERA (Intercept)  -2.28      5.03      -0.453   0.662
 8 BanERA value         0.201     0.370      0.543   0.600
 9 BanTES (Intercept)  -3.42      2.80      -1.22    0.253
10 BanTES value         0.0892    0.0644     1.39    0.199
# ... with 38 more rows

如果我按照以下方式修改你的第一种方法,你会发现上述代码和你的代码输出基本相同。
#  make a data frame version
otm_df <- data.frame(otm)
set.seed(123) #To have same output from sample function for every run
markers <- sample(0:1, replace = T, size = 11)

# calculate OLS slope for all columns
# the -1 at end removes the intercepts
ols <- sapply(otm_df, function(x) coef(lm(markers ~ x))[-1])

Year.x     BanTES.x     SriTES.x     AfgTES.x     BhuTES.x     IndTES.x 
 0.054545455  0.089176876  0.092629314  0.052132553  0.001602003  0.107446434 
    NepTES.x     PakTES.x      SATES.x     BanERA.x     SriERA.x     AfgERA.x 
 0.065125607 -0.115108438  0.315928753  0.200712285 -0.519996739  0.577317012 
    BhuERA.x     IndERA.x     NepERA.x     PakERA.x      SAERA.x     BanTOM.x 
-0.140990921  0.110578204 -0.030546686  0.265909319  0.176797510  1.897234338 
    SriTOM.x     AfgTOM.x     BhuTOM.x     IndTOm.x     NepTOM.x     PakTOM.x 
 5.380610477  1.935170281  1.761172711  2.248531891  2.107380452  2.011356580 
     SATOM.x 
 2.214848959 

这是dput格式的数据

dput(otm)
structure(list(Year = 2008:2018, BanTES = c(44.06247376, 43.81239107, 
40.68010622, 46.97760506, 37.49591135, 43.81239107, 43.81239107, 
43.81239107, 43.81239107, 45.27803189, 44.06247376), SriTES = c(32.07268265, 
35.01918477, 29.91018035, 34.1291577, 28.5258431, 32.07268265, 
32.07268265, 32.07268265, 32.07268265, 30.96753552, 32.07268265
), AfgTES = c(39.19328867, 42.06325898, 42.31015918, 40.54543762, 
34.28696385, 40.54543762, 40.54543762, 40.54543762, 40.54543762, 
37.38974643, 39.19328867), BhuTES = c(34.08241824, 36.95440954, 
30.41561338, 30.37004394, 19.8861367, 30.41561338, 30.41561338, 
30.41561338, 30.41561338, 32.09933763, 32.09933763), IndTES = c(40.05913352, 
41.54741392, 38.88957844, 42.47544504, 43.24350644, 41.54741392, 
41.54741392, 41.54741392, 41.54741392, 42.65820983, 42.47544504
), NepTES = c(38.12979871, 37.62785275, 34.40488247, 37.7995467, 
39.64286364, 37.7995467, 37.7995467, 37.7995467, 37.7995467, 
30.63632105, 37.7995467), PakTES = c(41.38388734, 41.99865359, 
42.16236093, 42.51838941, 43.4444952, 42.16236093, 42.16236093, 
42.16236093, 42.16236093, 44.96627251, 42.51838941), SATES = c(40.03077, 
41.52302, 39.6327, 41.9098, 41.11191, 41.11191, 41.11191, 41.11191, 
41.11191, 41.57009, 41.52302), BanERA = c(13.76686693, 13.904453, 
13.40584856, 13.45199721, 13.47657436, 12.8992102, 13.3586098, 
14.23223365, 13.4228729, 13.21487616, 14.50830571), SriERA = c(11.81852768, 
11.79187354, 11.51484349, 11.50552588, 11.489789, 11.23384852, 
10.61182708, 11.33951759, 11.6357584, 11.74685028, 12.14987906
), AfgERA = c(15.44115983, 15.425995, 15.6161623, 15.47751927, 
15.81748069, 15.47498417, 15.41748855, 16.06462541, 15.61143062, 
15.32810621, 16.39162424), BhuERA = c(14.34493453, 14.28085419, 
14.24728543, 14.03202106, 14.04152053, 13.42977221, 13.22665229, 
14.344052, 13.58792484, 13.28851619, 14.28029524), IndERA = c(14.08262362, 
14.11485037, 13.80713493, 13.86114379, 13.92607879, 13.37996473, 
13.45767152, 14.49365275, 13.88142768, 13.73986257, 14.77032404
), NepERA = c(14.93883379, 14.896056, 14.78607828, 14.50880606, 
14.69793299, 13.96309811, 14.18825383, 15.32530354, 14.38700954, 
13.98545482, 14.9828434), PakERA = c(14.89773191, 14.87337075, 
14.89837223, 14.76236826, 15.13918051, 14.61385609, 14.54589641, 
15.40150813, 14.8588883, 14.62185208, 15.6575491), SAERA = c(14.38877, 
14.40468, 14.22069, 14.20561, 14.35855, 13.8399, 13.88027, 14.83054, 
14.24554, 14.06201, 15.09615), BanTOM = c(9.317937851, 9.308046341, 
9.327401161, 9.319338799, 9.311285019, 9.300317764, 9.292790413, 
9.540946007, 9.04840374, 9.300317764, 9.317937851), SriTOM = c(5.437336445, 
5.436554909, 5.435492039, 5.440690994, 5.436693192, 5.440601349, 
5.427892685, 5.54946661, 5.427827358, 5.440601349, 5.437336445
), AfgTOM = c(13.31581736, 13.30339324, 13.30090284, 13.29781604, 
13.33800817, 13.31919873, 13.30073023, 13.62503445, 13.16488469, 
13.30073023, 13.31581736), BhuTOM = c(11.69911337, 11.67375898, 
11.71142626, 11.69099903, 11.68556881, 11.68714046, 11.65387106, 
11.97064924, 11.44872904, 11.67375898, 11.69099903), IndTOm = c(9.709311898, 
9.704142364, 9.703938368, 9.72520479, 9.709638531, 9.708799697, 
9.690851817, 9.952517961, 9.499369441, 9.704142364, 9.709638531
), NepTOM = c(12.45835066, 12.43677187, 12.48850822, 12.49002218, 
12.46283817, 12.50376368, 12.44072294, 12.78685617, 12.27891684, 
12.44072294, 12.46283817), PakTOM = c(12.37911913, 12.38028261, 
12.37067625, 12.38315158, 12.38352468, 12.36856567, 12.37349086, 
12.67422019, 12.18962786, 12.37349086, 12.38315158), SATOM = c(10.63543, 
10.62967, 10.62981, 10.64489, 10.63941, 10.63525, 10.6195, 10.89613, 
10.44028, 10.62967, 10.63941)), class = "data.frame", row.names = c(NA, 
-11L))

谢谢你们所有人。 - Lalantra
我只是在检查结果,似乎第一种方法给出了OLS的正确结果而不是第二种方法。 BanTES的斜率值从第一种方法得到0.138595178(正确),而第二种方法给出了一个错误的值0.089176876。 - Lalantra
实际上,第一种方法是正确的。但为了与您的问题保持相似性,我展示了第二种方法。您可以根据需要获取输出。 - UseR10085

2
您可以使用glance代替tidy来计算p值:
ols <- DF %>% 
  nest(data = -variable) %>% 
  mutate(model  = map(data, ~lm(value ~ Year, data = .)), 
         tidied = map(model, glance)) %>% 
  unnest(tidied)

> ols
# A tibble: 24 x 15
   variable data              model  r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC deviance df.residual  nobs
   <fct>    <list>            <list>     <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
 1 BanTES   <tibble [11 x 2]> <lm>     0.0350        -0.0722 2.54     0.327    0.582     1 -24.8   55.5  56.7    58.2            9    11
 2 SriTES   <tibble [11 x 2]> <lm>     0.0309        -0.0767 1.83     0.287    0.605     1 -21.2   48.3  49.5    30.1            9    11
 3 AfgTES   <tibble [11 x 2]> <lm>     0.0559        -0.0490 2.32     0.533    0.484     1 -23.7   53.5  54.7    48.2            9    11
 4 BhuTES   <tibble [11 x 2]> <lm>     0.0185        -0.0905 4.33     0.170    0.690     1 -30.6   67.3  68.4   169.             9    11
 5 IndTES   <tibble [11 x 2]> <lm>     0.264          0.183  1.11     3.23     0.106     1 -15.7   37.3  38.5    11.1            9    11
 6 NepTES   <tibble [11 x 2]> <lm>     0.0689        -0.0345 2.49     0.666    0.435     1 -24.5   55.0  56.2    55.6            9    11
 7 PakTES   <tibble [11 x 2]> <lm>     0.243          0.159  0.872    2.89     0.123     1 -13.0   32.0  33.2     6.84           9    11
 8 SATES    <tibble [11 x 2]> <lm>     0.221          0.135  0.625    2.56     0.144     1  -9.34  24.7  25.9     3.52           9    11
 9 BanERA   <tibble [11 x 2]> <lm>     0.0252        -0.0831 0.482    0.233    0.641     1  -6.49  19.0  20.2     2.09           9    11
10 SriERA   <tibble [11 x 2]> <lm>     0.00230       -0.109  0.416    0.0208   0.889     1  -4.87  15.7  16.9     1.56           9    11

谢谢,但斜率值在哪里?它是什么统计值?是Z统计还是t统计?如果是t统计,那么为什么你估计的值与我推导的值不匹配?如何以Excel格式保存结果,包括系数、Z统计和p值? - Lalantra
1
@Bappa Das和@Roland回答了Z-stat和t-stat问题。对于斜率和截距,您可以返回到“tidy”而不是“glance”。为了保存结果,我通常更喜欢使用write.csv2函数。 - Mata

2

关于将数据保存到Excel,如果您的数据在tibble或data.frame中,则可以使用以下函数:

  • readr::write_csv 将数据保存为CSV文件,可以轻松地在Excel中打开;
  • writexl::write_xlsx 将数据保存为本地Excel文件。

任何一个函数都会将所有行和列的数据保存到文件中。


网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接