How to fit known values to a known curve by changing the x-axis


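This is a continuation of a Cross Validated question, where I asked about possible approaches to this problem. This question is more programming oriented, so I am posting it on Stack Overflow.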

Background

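I have a curve with known dates that spans one year. The y-values of the curve are predictions of d18O values calculated from daily temperature and salinity records. I also have measured d18O values from a shell composed of calcium carbonate. These values are measured along a distance axis, where the first and the last measurement were taken approximately (but not exactly) at the times where the curve begins and ends.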

It is known that d18O values match with the predicted values in the curve within some unknown random error. I want to get the best fit for the measured values to the curve by changing the x-axis for the measured values (or at least by matching the index with the index in the curve). In this way I can get estimates for the dates of the measured values and can further estimate the growth rate for the shell over the year. The growth rate is expected to be variable and there might be a growth hiatus (i.e. the growth stops). However, the growth between the measured values has to be > 0 (a constraint).

Example data

Here are the example datasets (curve and measured):

meas <- structure(list(index = 1:10, distance = c(0.1, 1, 3, 5, 7, 8, 
13, 20, 22, 25), value = c(3.5, 4.2, 4.5, 4.4, 4.7, 4.8, 5.1, 
4.9, 4.1, 3.7)), .Names = c("index", "distance", "value"), class = "data.frame",
row.names = c(NA, -10L))   

curve <- structure(list(date = structure(c(15218, 15219, 15220, 15221, 
15222, 15223, 15224, 15225, 15226, 15227, 15228, 15229, 15230, 
15231, 15232, 15233, 15234, 15235, 15236, 15237, 15238, 15239, 
15240, 15241, 15242, 15243, 15244, 15245, 15246, 15247, 15248, 
15249, 15250, 15251, 15252, 15253, 15254, 15255, 15256, 15257, 
15258, 15259, 15260, 15261, 15262, 15263, 15264, 15265, 15266, 
15267, 15268, 15269, 15270, 15271, 15272, 15273, 15274, 15275, 
15276, 15277, 15278, 15279, 15280, 15281, 15282, 15283, 15284, 
15285, 15286, 15287, 15288, 15289, 15290, 15291, 15292, 15293, 
15294, 15295, 15296, 15297, 15298, 15299, 15300, 15301, 15302, 
15303, 15304, 15305, 15306, 15307, 15308, 15309, 15310, 15311, 
15312, 15313, 15314, 15315, 15316, 15317, 15318, 15319, 15320, 
15321, 15322, 15323, 15324, 15325, 15326, 15327, 15328, 15329, 
15330, 15331, 15332, 15333, 15334, 15335, 15336, 15337, 15338, 
15339, 15340, 15341, 15342, 15343, 15344, 15345, 15346, 15347, 
15348, 15349, 15350, 15351, 15352, 15353, 15354, 15355, 15356, 
15357, 15358, 15359, 15360, 15361, 15362, 15363, 15364, 15365, 
15366, 15367, 15368, 15369, 15370, 15371, 15372, 15373, 15374, 
15375, 15376, 15377, 15378, 15379, 15380, 15381, 15382, 15383, 
15384, 15385, 15386, 15387, 15388, 15389, 15390, 15391, 15392, 
15393, 15394, 15395, 15396, 15397, 15398, 15399, 15400, 15401, 
15402, 15403, 15404, 15405, 15406, 15407, 15408, 15409, 15410, 
15411, 15412, 15413, 15414, 15415, 15416, 15417, 15418, 15419, 
15420, 15421, 15422, 15423, 15424, 15425, 15426, 15427, 15428, 
15429, 15430, 15431, 15432, 15433, 15434, 15435, 15436, 15437, 
15438, 15439, 15440, 15441, 15442, 15443, 15444, 15445, 15446, 
15447, 15448, 15449, 15450, 15451, 15452, 15453, 15454, 15455, 
15456, 15457, 15458, 15459, 15460, 15461, 15462, 15463, 15464, 
15465, 15466, 15467, 15468, 15469, 15470, 15471, 15472, 15473, 
15474, 15475, 15476, 15477, 15478, 15479, 15480, 15481, 15482, 
15483, 15484, 15485, 15486, 15487, 15488, 15489, 15490, 15491, 
15492, 15493, 15494, 15495, 15496, 15497, 15498, 15499, 15500, 
15501, 15502, 15503, 15504, 15505, 15506, 15507, 15508, 15509, 
15510, 15511, 15512, 15513, 15514, 15515, 15516, 15517, 15518, 
15519, 15520, 15521, 15522, 15523, 15524, 15525, 15526, 15527, 
15528, 15529, 15530, 15531, 15532, 15533, 15534, 15535, 15536, 
15537, 15538, 15539, 15540, 15541, 15542, 15543, 15544, 15545, 
15546, 15547, 15548, 15549, 15550, 15551, 15552, 15553, 15554, 
15555, 15556, 15557, 15558, 15559, 15560, 15561, 15562, 15563, 
15564, 15565, 15566, 15567, 15568, 15569, 15570, 15571, 15572, 
15573, 15574, 15575, 15576, 15577, 15578, 15579, 15580, 15581, 
15582, 15583, 15584), class = "Date"), index = 1:367, value = c(3.33, 
3.35, 3.36, 3.38, 3.4, 3.42, 3.43, 3.45, 3.47, 3.48, 3.5, 3.52, 
3.53, 3.55, 3.56, 3.58, 3.6, 3.61, 3.63, 3.64, 3.66, 3.67, 3.69, 
3.7, 3.72, 3.73, 3.75, 3.76, 3.78, 3.79, 3.81, 3.82, 3.83, 3.85, 
3.86, 3.88, 3.89, 3.9, 3.92, 3.93, 3.94, 3.96, 3.97, 3.98, 3.99, 
4.01, 4.02, 4.03, 4.04, 4.06, 4.07, 4.08, 4.09, 4.1, 4.11, 4.13, 
4.14, 4.15, 4.16, 4.17, 4.18, 4.19, 4.2, 4.21, 4.22, 4.23, 4.24, 
4.25, 4.26, 4.27, 4.28, 4.28, 4.29, 4.3, 4.31, 4.32, 4.33, 4.33, 
4.34, 4.35, 4.36, 4.36, 4.37, 4.38, 4.38, 4.39, 4.4, 4.41, 4.41, 
4.42, 4.42, 4.43, 4.44, 4.44, 4.45, 4.45, 4.46, 4.46, 4.47, 4.47, 
4.47, 4.48, 4.48, 4.49, 4.49, 4.49, 4.5, 4.5, 4.5, 4.51, 4.51, 
4.51, 4.52, 4.52, 4.53, 4.53, 4.53, 4.54, 4.54, 4.54, 4.55, 4.55, 
4.56, 4.57, 4.57, 4.58, 4.58, 4.59, 4.6, 4.61, 4.61, 4.62, 4.63, 
4.64, 4.64, 4.65, 4.66, 4.67, 4.67, 4.68, 4.69, 4.7, 4.7, 4.71, 
4.72, 4.72, 4.73, 4.74, 4.74, 4.75, 4.75, 4.75, 4.76, 4.76, 4.76, 
4.76, 4.76, 4.76, 4.76, 4.76, 4.76, 4.75, 4.75, 4.75, 4.75, 4.74, 
4.74, 4.73, 4.73, 4.73, 4.72, 4.72, 4.72, 4.71, 4.71, 4.71, 4.71, 
4.7, 4.7, 4.7, 4.71, 4.71, 4.71, 4.71, 4.72, 4.72, 4.73, 4.74, 
4.75, 4.75, 4.76, 4.78, 4.79, 4.8, 4.81, 4.82, 4.83, 4.84, 4.85, 
4.86, 4.88, 4.89, 4.9, 4.91, 4.92, 4.92, 4.93, 4.94, 4.95, 4.95, 
4.95, 4.96, 4.96, 4.96, 4.96, 4.96, 4.95, 4.95, 4.95, 4.94, 4.93, 
4.92, 4.92, 4.91, 4.9, 4.89, 4.88, 4.87, 4.86, 4.85, 4.84, 4.83, 
4.82, 4.8, 4.79, 4.78, 4.77, 4.76, 4.75, 4.75, 4.74, 4.73, 4.72, 
4.72, 4.71, 4.71, 4.71, 4.7, 4.7, 4.7, 4.7, 4.7, 4.7, 4.7, 4.7, 
4.7, 4.7, 4.7, 4.7, 4.7, 4.69, 4.69, 4.69, 4.69, 4.69, 4.69, 
4.69, 4.69, 4.68, 4.68, 4.68, 4.67, 4.67, 4.67, 4.66, 4.65, 4.65, 
4.64, 4.63, 4.62, 4.61, 4.6, 4.59, 4.58, 4.57, 4.56, 4.55, 4.54, 
4.53, 4.51, 4.5, 4.49, 4.48, 4.47, 4.46, 4.45, 4.43, 4.42, 4.41, 
4.4, 4.39, 4.38, 4.37, 4.36, 4.35, 4.34, 4.33, 4.32, 4.32, 4.31, 
4.3, 4.29, 4.28, 4.28, 4.27, 4.26, 4.25, 4.24, 4.24, 4.23, 4.22, 
4.21, 4.21, 4.2, 4.19, 4.18, 4.17, 4.17, 4.16, 4.15, 4.14, 4.14, 
4.13, 4.12, 4.12, 4.11, 4.1, 4.09, 4.08, 4.08, 4.07, 4.06, 4.05, 
4.05, 4.04, 4.03, 4.02, 4.02, 4.01, 4, 4, 3.99, 3.98, 3.97, 3.97, 
3.96, 3.95, 3.94, 3.94, 3.93, 3.92, 3.92, 3.91, 3.9, 3.9, 3.89, 
3.88)), .Names = c("date", "index", "value"), row.names = c(NA, 
-367L), class = "data.frame")

...and this is how they look:

library(ggplot2)
library(scales)
library(gridExtra)

p.curve <- ggplot() + geom_line(data = curve, aes(x = date, y = value)) + scale_x_date(name = "Month", breaks = date_breaks("months"), labels = date_format("%b")) + labs(title = "curve")
p.meas <- ggplot(meas, aes(x = distance, y = value)) + geom_point(color = "red") + labs(title = "measured", x = "Distance (mm)")

grid.arrange(p.curve, p.meas, ncol = 1)


The practical problem

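I want to find a mathematical/statistical method, using R, for fitting meas to curve by changing the x-axis of meas. I would also like some kind of goodness-of-fit statistic for comparing the fitted "x-axes" (if I run several models with different constraints). I call the "x-axis models" growth models, because that is essentially what they are. I want to constrain the fit by specifying that the distance between meas values has to be > 0, i.e. the meas value with index == 2 has to occur after the value with index == 1. I also want to be able to limit the growth rate (i.e. the maximum distance between two adjacent index points). To demonstrate this, I will do it manually: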

ggplot() + geom_line(data = curve, aes(x = index, y = value)) + geom_line(data = meas, aes(x = index, y = value), color = "red", linetype = 2) + scale_x_continuous(breaks = seq(0,370,10)) + scale_y_continuous(breaks = seq(3,5,0.1))


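First, some of the indices in meas (red dashed line) have to be anchored to indices in curve (black line). I chose to anchor the first and the last point, as well as the point with the highest value.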

anchor <- data.frame(meas.index = c(1,7,10), curve.index = c(11,215,367))

example.fit <- merge(meas, anchor, by.x = "index", by.y = "meas.index", all = T, sort = F)
example.fit <- example.fit[with(example.fit, order(distance)),]

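Then I assume linear growth between these anchor points. The growth proceeds along the curve index. The curve has only one value per day, so growth happens on a daily scale.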
library(zoo)
example.fit$curve.index <- round(na.approx(example.fit$curve.index),0)

After this, I replace the indices with dates and plot the results.
library(plyr)

example.fit$date <- as.Date(mapvalues(example.fit$curve.index, from = curve$index, to = as.character(curve$date)))

a <- ggplot() + geom_line(data = curve, aes(x = date, y = value)) + geom_point(data = example.fit, aes(x = date, y = value), color = "red") + scale_x_date(limits = range(curve$date), name = "Month", breaks = date_breaks("months"), labels = date_format("%b"))

b <- ggplot(example.fit, aes(x = date, y = distance)) + geom_line() + scale_x_date(limits = range(curve$date), name = "Month", breaks = date_breaks("months"), labels = date_format("%b"))

grid.arrange(a,b)


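The upper panel shows the fit based on the three anchor points. The lower panel shows the modeled growth at a daily time step. The bend in the growth curve at the beginning of March is caused by some interesting arithmetic in the na.approx function from the zoo package, which I do not understand.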
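A possible explanation for the bend, offered as an assumption rather than a verified diagnosis: when na.approx() is given a plain vector, it interpolates against element position (1, 2, 3, ...) and not against distance. The interpolated curve.index is then linear in row position, so the growth rate changes wherever the distances are unevenly spaced (for example between 8 mm and 13 mm). The sketch below uses base R approx() with distance as the interpolation axis instead; the variable names by.position and by.distance are illustrative and not part of the original code.

```r
# Distances and anchor points copied from the example above.
distance <- c(0.1, 1, 3, 5, 7, 8, 13, 20, 22, 25)
curve.index <- c(11, NA, NA, NA, NA, NA, 215, NA, NA, 367)
known <- !is.na(curve.index)

# Interpolate against row position (what a plain-vector call amounts to).
by.position <- approx(x = which(known), y = curve.index[known],
                      xout = seq_along(curve.index))$y

# Interpolate against distance, so growth per day is constant between anchors.
by.distance <- approx(x = distance[known], y = curve.index[known],
                      xout = distance)$y

# Growth rate (mm per day) for each interval under both schemes.
rate.position <- diff(distance) / diff(by.position)
rate.distance <- diff(distance) / diff(by.distance)
print(round(rate.position, 3))
print(round(rate.distance, 3))
```

With position-based interpolation the rate jumps between consecutive intervals, while distance-based interpolation gives one constant rate per anchor-to-anchor segment. The result is left unrounded here; round() can be applied afterwards if whole-day indices are needed.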

What I have tried

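From my previous question I learned that dynamic time warping might be a solution. I also found an R package with a dtw function. Good. Dynamic time warping actually worked on the example dataset in that question (except for setting the constraints), but I could not apply it to this dataset, because curve has many more data points than meas (called points in the previous question). To save some space, I will not copy the code and figures here; you can see what I tried in my answer to that question. The problem seems to be that none of the step patterns, except the simplest one, can handle this kind of data. The simplest step pattern matches measured values to the curve several times, which I want to avoid, because I need a defined date for each measured point. Also, setting the constraint that the growth rate has to be > 0 between measured points seems difficult.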
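The two requirements above (exactly one curve index per measured point, and strictly increasing indices) can be written as a small dynamic programme. The following is only a sketch of that idea in base R, not dtw and not a tested solution for the real data: fit_monotone, min.step and max.step are hypothetical names, and squared error is an assumed cost.

```r
# Assign each measurement to exactly one reference index, with indices
# increasing by at least min.step and at most max.step, minimising the
# residual sum of squares (RSS). Hypothetical helper for illustration.
fit_monotone <- function(meas, ref, min.step = 1, max.step = Inf) {
  n <- length(meas)
  m <- length(ref)
  cost <- matrix(Inf, n, m)          # best RSS with measurement i at index j
  from <- matrix(NA_integer_, n, m)  # predecessor index, for backtracking
  cost[1, ] <- (meas[1] - ref)^2
  for (i in seq_len(n)[-1]) {
    for (j in seq_len(m)) {
      lo <- max(1, j - max.step)
      hi <- j - min.step
      if (hi < lo) next              # no admissible predecessor
      k <- lo:hi
      best <- k[which.min(cost[i - 1, k])]
      if (is.finite(cost[i - 1, best])) {
        cost[i, j] <- cost[i - 1, best] + (meas[i] - ref[j])^2
        from[i, j] <- best
      }
    }
  }
  stopifnot(any(is.finite(cost[n, ])))  # constraints must be satisfiable
  idx <- integer(n)
  idx[n] <- which.min(cost[n, ])
  if (n > 1) for (i in n:2) idx[i - 1] <- from[i, idx[i]]
  list(index = idx, rss = cost[n, idx[n]])
}

# Toy check: observations taken exactly from a strictly increasing reference.
ref <- sqrt(1:60)
obs <- ref[c(3, 10, 25, 58)]
fit <- fit_monotone(obs, ref)
print(fit$index)

# Same data, but with at most 5 reference steps between neighbours.
capped <- fit_monotone(obs, ref, max.step = 5)
print(diff(capped$index))
```

With the example data the call would be fit_monotone(meas$value, curve$value), and the dates would follow from curve$date[fit$index]. The returned RSS can serve as a crude goodness-of-fit value, but note that tightening the constraints can never lower it, so RSS alone will always favour the loosest model.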

Question

My question is twofold: first, is there a better approach to this problem than dynamic time warping? Second, how can it be done in practice in R?

EDIT 9 Dec 2013: I tried to make the question clearer.


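I can't fully understand the question (e.g., what growth are you writing about?). I have been wondering whether that is my fault or whether the question should be improved. I think it is a mix of both. As it stands, your question can probably only be answered by an expert on the topic, and it is not very likely that one of them drops by. - Roland
As @Roland pointed out, the question might not have been easy to understand. I don't know whether it is clear now, but I tried to make it clearer. - Mikko
@Largh, it is dynamic time warping, not wrapping. - nograpes
@nograpes OK, fixed. - Mikko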
1 Answer

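I'm not sure I fully understand what the goal is, but if you want to fit measured points to a reference curve, then using dtw seems a sensible choice. Fitting 10 measured points to 370+ curve points does give slightly odd results (that is simply the outcome of the symmetric step pattern optimisation). I think this is largely due to the small number of points.
One option that might help is to use ggplot() (or another function) to smooth the measured curve and provide some extra points to match. But obviously it can only do so much, limited by the measured points. With so few points you may lose information in the process of fitting the data.
It would also help if you could trim curve so that it is exactly synchronous with the first and last observations of meas, because you are matching with open.begin and open.end set to FALSE, but I am not sure whether the exact dates are available.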
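If such dates were available, trimming could be as simple as a date filter. The sketch below uses a toy stand-in for curve, and first.date and last.date are invented placeholders, not values from the question.

```r
# Toy daily reference curve with the same columns as `curve` (date, value).
curve.toy <- data.frame(date = as.Date("2011-09-01") + 0:29,
                        value = seq(3.3, 3.9, length.out = 30))

# Hypothetical dates of the first and last measurement (assumptions).
first.date <- as.Date("2011-09-05")
last.date  <- as.Date("2011-09-25")

# Keep only the part of the curve between the two dates, inclusive.
curve.trim <- curve.toy[curve.toy$date >= first.date &
                        curve.toy$date <= last.date, ]
print(range(curve.trim$date))
```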
This shows meas smoothed to 80 points, and both the original 10 points and the smoothed 80 points mapped to the reference curve curve.
require(ggplot2)
require(scales)
require(gridExtra)
require(dtw)
require(plyr)

# use ggplot default to smooth the 10 point curve
meas.plot.smooth<-ggplot(meas, aes(x = distance, y = value)) + geom_line() + labs(title = "ggplot smoothed (blue curve)")+geom_smooth()
# use ggplot_build() to get the smoothed points
meas.curve.smooth<-ggplot_build(meas.plot.smooth)$data[[2]]

orig.align<-dtw(meas$value,curve$value,keep=T,step.pattern=symmetric1)
orig.freqs<-count(orig.align$index1)
# reference the matching points (which are effectively dates)
orig.freqs$cumsum<-cumsum(orig.freqs$freq)  

g.10<-ggplot() + geom_line(data = curve, aes(x = date, y = value)) +
  geom_line(aes(x = curve[orig.freqs$cumsum,"date"], y = meas$value),color="red") +
  geom_text(aes(x = curve[orig.freqs$cumsum,"date"], y = meas$value, label=orig.freqs$x),color="red",size=5) + 
  scale_x_date(name = "Month", breaks = date_breaks("months"), labels = date_format("%b")) + 
  labs(title = "Native 10 pt curve - dtw mapped")


smooth.align<-dtw(meas.curve.smooth$y,curve$value,keep=T,step.pattern=symmetric1)
smooth.freqs<-count(smooth.align$index1)
smooth.freqs$cumsum<-cumsum(smooth.freqs$freq)

g.80<-ggplot() + geom_line(data = curve, aes(x = date, y = value)) +
  geom_line(aes(x = curve[smooth.freqs$cumsum,"date"], y = meas.curve.smooth$y),color="red") +
  scale_x_date(name = "Month", breaks = date_breaks("months"), labels = date_format("%b")) + 
  geom_text(aes(x = curve[smooth.freqs$cumsum,"date"], y = meas.curve.smooth$y, label=smooth.freqs$x),color="red",size=3.5,position="jitter") + 
  labs(title = "80 point loess curve - dtw mapped")

grid.arrange(meas.plot.smooth,g.10,g.80,ncol=1)
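A note on the index bookkeeping used above, offered as a sketch and not checked against the actual alignments: cumsum() of the index1 counts gives positions along the warping path, which equal curve positions only if every step of the path advances the curve index. The symmetric1 pattern also allows steps that advance the measurement index alone. The toy vectors below stand in for the index1 and index2 components of a dtw result and show how the two readings can differ.

```r
# Toy warping path: index1 = measurement, index2 = matched curve position.
# The step from path position 3 to 4 advances index1 only.
index1 <- c(1, 1, 1, 2, 3, 3)
index2 <- c(1, 2, 3, 3, 4, 5)

# Bookkeeping as above: cumulative number of path steps per measurement.
path.pos <- cumsum(as.vector(table(index1)))

# Alternative: read the last matched curve position directly from index2.
last.match <- as.vector(tapply(index2, index1, max))

print(path.pos)    # positions along the warping path
print(last.match)  # positions on the curve
```

With a real alignment, tapply(orig.align$index2, orig.align$index1, max) would give the last curve index matched to each measured point.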


EDIT

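Clearly, part of the issue lies in the confidence interval. Here I have put an example that builds a random curve within the standard error level. As you can see, it differs a lot from using the predicted curve itself. I think the problem is that when you try to map 10 measured values onto a 370-point reference curve, it is hard to get precise predictions unless they track extremely closely.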

rand.align<-dtw(meas.curve.smooth$ymin+(meas.curve.smooth$ymax-meas.curve.smooth$ymin)*runif(length(meas.curve.smooth$ymin)),curve$value,keep=T,step.pattern=symmetric1)
rand.freqs<-count(rand.align$index1)
rand.freqs$cumsum<-cumsum(rand.freqs$freq)

g.rand<-ggplot() + geom_line(data = curve, aes(x = date, y = value)) +
  geom_line(aes(x = curve[rand.freqs$cumsum,"date"], y = meas.curve.smooth$y),color="red") +
  scale_x_date(name = "Month", breaks = date_breaks("months"), labels = date_format("%b")) + 
  geom_text(aes(x = curve[rand.freqs$cumsum,"date"], y = meas.curve.smooth$y, label=rand.freqs$x),color="red",size=3.5,position="jitter") + 
  labs(title = "Random curve within standard CI - dtw mapped")

grid.arrange(meas.plot.smooth,g.10,g.80,g.rand,ncol=1)


EDIT: updated to include a simulation.

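OK, this update runs 1000 simulations. It creates curves for mapping that are chosen at random from within the 95% CI. I changed n in the geom_smooth() function from 80 to 10, to preserve as much as possible of the information obtained from the measured curve.

It simulates cumulative growth (assuming linear growth between the days that were not measured).

I'm not sure it is entirely useful, but it provides a nice way of visualising the uncertainty.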

get_scenario<-function(i){
  set.seed(i)
  # create random curve within the CI
  rand.align<-dtw(meas.curve.smooth$ymin+(meas.curve.smooth$ymax-meas.curve.smooth$ymin)*runif(length(meas.curve.smooth$ymin)),curve$value,keep=T,step.pattern=symmetric1)
  rand.freqs<-count(rand.align$index1)
  rand.freqs$cumsum<-cumsum(rand.freqs$freq)
  growth.index<-data.frame(cumsum=curve$index,val=curve$value)
  merged<-merge(growth.index,rand.freqs,by="cumsum")
  return(data.frame(x=merged$cumsum,growth=cumsum(merged$val*merged$freq),scenario=i))  
}

scenario.set <- ldply(lapply(1:1000,function(l)get_scenario(l)), data.frame)

g.s<-ggplot(scenario.set,aes(x,growth)) + 
      geom_line(aes(,group=scenario,color=scenario),alpha=0.25) + 
      scale_colour_gradient(low = "yellow", high = "orangered") +
      xlab("Days from start") + ylab("Cumulative Growth")
g.xmax<-max(scenario.set$x)  # get the final day (or set to another day)
g.xmin<-g.xmax-30            # thirty day window from end
b<-ggplot_build(g.s)
build.data<-b$data[[1]]
ylims<-build.data[build.data$x<=g.xmax & build.data$x>=g.xmin,]$y

g.subplot<-g.s+geom_point(aes(x,growth,color=scenario),alpha=0.25,size=5,position="jitter")+coord_cartesian(xlim=c(g.xmin,g.xmax),ylim=c(min(ylims),max(ylims)))

grid.arrange(meas.plot.smooth,g.s,g.subplot,ncol=1)    
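The spread of the scenarios can also be reduced to a numeric band per day. The sketch below is only an illustration in base R: toy is a made-up stand-in for scenario.set with the same three columns, and the 2.5%, 50% and 97.5% quantiles are an assumed choice of summary.

```r
set.seed(1)
# Toy stand-in for scenario.set: 200 scenarios of cumulative growth, 30 days.
toy <- do.call(rbind, lapply(1:200, function(i) {
  data.frame(x = 1:30, growth = cumsum(runif(30, 0, 2)), scenario = i)
}))

# Quantiles of cumulative growth across scenarios, one row per day.
q <- t(sapply(split(toy$growth, toy$x), quantile,
              probs = c(0.025, 0.5, 0.975)))
band <- data.frame(x = as.integer(rownames(q)),
                   lower = q[, 1], median = q[, 2], upper = q[, 3])
print(head(band))
```

The resulting band data frame could be passed to geom_ribbon() in the same way as build.data.sum further down.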


Here are some other ways of looking at the tail:

g.s<-ggplot(scenario.set,aes(x,growth)) + 
      geom_line(aes(,group=scenario,color=scenario),alpha=0.25) + 
      scale_colour_gradient(low = "yellow", high = "orangered") +
      xlab("Days from start") + ylab("Cumulative Growth")
g.xmax<-max(scenario.set$x)  # get the final day (or set to another day)
g.xmin<-g.xmax-50            # fifty day window from end
b<-ggplot_build(g.s)
build.data<-b$data[[1]]
ylims<-build.data[build.data$x<=g.xmax & build.data$x>=g.xmin,]$y

g.subplot<-g.s+geom_point(aes(x,growth,color=scenario),alpha=0.25,size=5,position="jitter")+coord_cartesian(xlim=c(g.xmin,g.xmax),ylim=c(min(ylims),max(ylims)))

grid.arrange(meas.plot.smooth,g.s,g.subplot,ncol=1)    

g.box<-ggplot(build.data)+
  geom_boxplot(aes(x,y,group=cut(x,max(x)/7),fill=cut(x,max(x)/7)),alpha=0.5)+ # bucket by group
  theme(legend.position="none")+
  coord_cartesian(xlim=c(g.xmin,g.xmax),ylim=c(min(ylims)-50,max(ylims)+50))

build.data.sum<-ddply(build.data,.(x),summarise,ymax=max(y),ymin=min(y),mean=mean(y))

g.spots<-ggplot(build.data)+
  geom_point(aes(x,y,color=group),size=10,alpha=0.25,position="jitter")+
  theme(legend.position="none")+scale_colour_gradient(low = "yellow", high = "orangered")+
  geom_ribbon(data=build.data.sum,aes(x,ymax=ymax,ymin=ymin),alpha=0.25)+
  coord_cartesian(xlim=c(g.xmax-50,g.xmax+1),ylim=c(min(ylims)-50,max(ylims)+50))+geom_smooth(aes(x,y),n=max(build.data$x))

grid.arrange(g.box,g.spots,ncol=1)    



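I'm very sorry I didn't reply earlier. I needed to think about this. The solution kind of works, but the native points have to be recovered from the 80-point loess curve. The native points need to be fitted, and their positions have to be known. I'll think about this more carefully when I have time. - Mikko
@Largh - updated to show a way of mapping random smooths of the measurements, within the standard error range, to the reference curve. As you can see, it can vary a lot, but perhaps you could use something like this in a large simulation? I think the problem is that unless your growth trajectory tracks d18O almost perfectly, it will be very hard to make accurate predictions from just 10 points, won't it? - Troy
Thanks for the update! That's right, it may not be possible to make precise predictions; instead, we can come up with different scenarios. I have been looking for ways to express the uncertainty, and your solution is a good start. I'll wait a day or so, and unless someone comes up with a better way of solving the problem, I'll accept your answer. - Mikko
@Largh I think scenario simulation might help - I'll think of an example and edit the answer. - Troy
Wow, that is an impressively detailed answer =) Some parts could probably be simplified, but this definitely helps me get started, and you deserve the points for this answer (and more!). Thanks! Perhaps one day, after a session with my real data, I will post another answer here, but that will take weeks or even months. - Mikko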
@Troy This is a very detailed (and patiently edited!) answer. Well done, +1. - Simon O'Hanlon
