Aligning sequences with missing values

The language I use is R, but you don't have to know about R to answer the question.

Question: I have a sequence that can be considered the main truth, and the other is a shifted version of the first, with some missing values. I would like to know how to align two.

Customization

I have a sequence ground.truththat basically is a set of times:

ground.truth <- rep( seq(1,by=4,length.out=10), 5 ) +
                rep( seq(0,length.out=5,by=4*10+30), each=10 )

Think about ground.truthhow the times when I do the following:

{take a sample every 4 seconds for 10 times, then wait 30 seconds} x 5

I have a second sequence observationsthat is ground.truthshifted from 20% of the missing values:

nSamples <- length(ground.truth)
idx_to_keep <- sort(sample( 1:nSamples, .8*nSamples ))
theLag <- runif(1)*100
observations <- ground.truth[idx_to_keep] + theLag
nObs     <- length(observations)

If I draw these vectors, this is what it looks like (remember, think of it as time):

enter image description here

, . :

  • (theLag )
  • idx , ground.truth[idx] == observations - theLag

, theLag. , ground.truth[1] observations[1]-theLag. , lagI ground.truth[1] == observations[1+lagI]-theLag.

, , - (ccf).

, , . - 0, ground.truth[1] == observations[1] - theLag. , , observations[1] - theLag ground.truth[1] (.. idx_to_keep, , 1).

theLag - ( ccf(x,y) == ccf(x,y-constant)?), .

, , observations , ground.truth? , theLag==0, - , , .

- , R-/, ?

.

+3
2

() :

diffs <- outer(observations, ground.truth, '-')

, length(observations) :

which(table(diffs) == length(observations))
# 55.715382960625 
#              86 

:

theLag
# [1] 55.71538

, theLag:

idx <- which(ground.truth %in% (observations - theLag))
+6

, .

, - , , .

# Sample data
n <- 10
x <- cumsum(rexp(n,.1))
theLag <- rnorm(1)
y <- theLag + x[sort(sample(1:n, floor(.8*n)))]

, , , , "".

# Loss function
library(sqldf)
f <- function(u) {
  # Put all the values in a data.frame
  d1 <- data.frame(g="truth",    value=x)
  d2 <- data.frame(g="observed", value=y+u)
  d <- rbind(d1,d2)
  # For each observed value, find the next truth value
  # (we could take the nearest, on either side, 
  # but it would be more complicated)
  d <- sqldf("
    SELECT A.g, A.value, 
           ( SELECT MIN(B.value) 
             FROM   d AS B 
             WHERE  B.g='truth' 
             AND    B.value >= A.value
           ) AS next
    FROM   d AS A
    WHERE  A.g = 'observed'
  ")
  # If u is greater than the lag, there are missing values.
  # If u is smaller, the differences decrease 
  # as we approach the lag.
  if(any(is.na(d))) {
    return(Inf)
  } else {
    return( sum(d$`next` - d$value, na.rm=TRUE) )
  }
}

.

# Look at the loss function
sapply( seq(-2,2,by=.1), f )

# Minimize the loss function.
# Change the interval if it does not converge, 
# i.e., if it seems in contradiction with the values above
# or if the minimum is Inf
(r <- optimize(f, c(-3,3)))
-r$minimum
theLag # Same value, most of the time
+2

All Articles