This follows on from a question I asked the other day (in hindsight, I should have asked both at the same time).
Data
token.dt is a list of data tables, one for each n in n-grams. Each table holds the n-grams (i.e. sequences of words) and their scores.
> head(token.dt[[2]])
V1 V2 mi2
1: 0 0 6.494179
2: 0 001 13.249067
3: 0 002 13.249067
4: 0 005 13.249067
5: 0 025 13.249067
6: 0 039 13.249067
> head(token.dt[[5]])
V1 V2 V3 V4 V5 mi5
1: 0 0 1 0 1 10.353265
2: 0 001 in apart for 6.807743
3: 0 001 in thick and 5.190449
4: 0 002 on each side 11.688710
5: 0 005 m in f 9.940322
6: 0 025 in aluminum which 8.249075
Task
The task is to select n-grams (i.e. rows of the tables in token.dt) that satisfy the following condition. An n-gram is stored only if its score is higher than those of the corresponding n-1 grams and n+1 grams, which are identified as follows:
- the n-1 gram that corresponds to the first n-1 words of the n-gram, and
- the n+1 grams whose first n words correspond to the n-gram.
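To make the rule concrete, here is a minimal sketch for n = 3 on made-up data. The tables bi, tri and four and their scores are hypothetical and not taken from token.dt; it also assumes that every trigram has at least one 4-gram continuation, since the inner merge drops those that do not.

```r
library(data.table)

# Hypothetical toy scores (not from the real data)
bi   <- data.table(V1 = "a", V2 = "b", mi2 = 5)
tri  <- data.table(V1 = "a", V2 = "b", V3 = c("c", "d"), mi3 = c(7, 9))
four <- data.table(V1 = "a", V2 = "b", V3 = c("c", "d", "d"),
                   V4 = c("x", "y", "z"), mi4 = c(8, 3, 6))

# Attach the bigram score to each trigram (match on the first two words)
m <- merge(tri, bi, by = c("V1", "V2"))
# Attach the 4-gram scores (match on the first three words);
# a trigram with several 4-grams now spans several rows
m <- merge(m, four, by = c("V1", "V2", "V3"))

# Keep a trigram only if it beats its bigram and every one of its 4-grams
kept <- m[, .(keep = all(mi3 > mi2) && all(mi3 > mi4), mi3 = mi3[1]),
          by = .(V1, V2, V3)][(keep)]
print(kept)
```

Here "a b c" is dropped because its 4-gram "a b c x" scores higher (8 > 7), while "a b d" is kept because 9 exceeds both the bigram score and the scores of its two 4-grams.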
As an example, consider the following.
> for (i in 2:n) setkeyv(token.dt[[i]], paste0("V", 1:i))
> token.dt[[2]][J("0", "1")]
V1 V2 mi2
1: 0 1 7.135725
> token.dt[[3]][J("0", "1")]
V1 V2 V3 mi3
1: 0 1 0 9.803035
2: 0 1 2 6.809646
3: 0 1 f 6.142258
4: 0 1 m 7.315181
5: 0 1 milligram 13.517241
6: 0 1 mv 13.517241
7: 0 1 of 1.151899
8: 0 1 the 0.214648
9: 0 1 to 3.633922
> token.dt[[4]][J("0", "1")]
V1 V2 V3 V4 mi4
1: 0 1 0 1 10.507784
2: 0 1 2 3 11.541023
3: 0 1 f the 3.927859
4: 0 1 m neutral 13.621798
5: 0 1 milligram of 3.852570
6: 0 1 milligram per 10.638304
7: 0 1 mv m 11.260860
8: 0 1 of making 12.235372
9: 0 1 the number 9.707556
10: 0 1 to 0 12.669723
11: 0 1 to 5 11.158356
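For example, the trigram 0 1 0 is not stored: although its score is higher than that of its bigram (0 1) (9.803035 > 7.135725), its 4-gram (0 1 0 1) has a higher score than the trigram (10.507784 > 9.803035).
The trigram 0 1 milligram, on the other hand, is stored, because its score is higher than that of its bigram (13.517241 > 7.135725) and those of its 4-grams (13.517241 > 3.852570, 13.517241 > 10.638304).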
With the column names hardcoded, the following works for trigrams.
> z <- token.dt[[4]][token.dt[[3]][token.dt[[2]], allow.cartesian = TRUE], list(k = all(mi3 > max(mi2, mi4)), mi3), allow.cartesian = TRUE][(k)]
> head(z)
V1 V2 V3 k mi3
1: 0 1 milligram TRUE 13.51724
2: 0 1 mv TRUE 13.51724
3: 0 15 g TRUE 12.24260
4: 0 2 gram TRUE 13.52079
5: 0 2 mrads TRUE 13.34449
6: 0 3 wt TRUE 13.28771
However, I would like to do this for any n, i.e. without hardcoding the column names (mi3, mi4, etc.).
I tried paste0 with with = FALSE.
> i <- 3
> z <- token.dt[[i + 1]][token.dt[[i]][token.dt[[i - 1]], allow.cartesian = TRUE], list(k = all(paste0("mi", i) > max(paste0("mi", i - 1), paste0("mi", i + 1))), paste0("mi", i)), with = FALSE, allow.cartesian = TRUE][(k)]
Error in abs(j) : non-numeric argument to mathematical function
That does not work. I also tried eval (and envir = .SD), without success.
> z <- token.dt[[i + 1]][token.dt[[i]][token.dt[[i - 1]], allow.cartesian = TRUE], list(k = all(eval(parse(text = paste0("mi", i))) > max(eval(parse(text = paste0("mi", i - 1))), eval(parse(text = paste0("mi", i + 1))))), eval(parse(text = paste0("mi", i)))), allow.cartesian = TRUE][(k)]
Error in eval(expr, envir, enclos) : object 'mi3' not found
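Both attempts above build the column name as a string. As a side note, here is a minimal sketch on hypothetical toy data (dt and the grouping column g are made up) of what happens when such strings are compared directly, and of looking a column up by name with get() inside j. It is untested on the real joins above, so treat it as an assumption rather than a fix.

```r
library(data.table)

# Hypothetical toy table with two score columns
dt <- data.table(g = c("x", "x", "y"), mi2 = c(1, 2, 5), mi3 = c(3, 4, 4))

i <- 3
cur  <- paste0("mi", i)      # "mi3"
prev <- paste0("mi", i - 1)  # "mi2"

# Comparing the names themselves is a string comparison, not a column comparison
name_cmp <- cur > prev       # compares "mi3" with "mi2"

# get() resolves the name to the column when it is called inside j
res <- dt[, .(k = all(get(cur) > get(prev))), by = g]
print(res)
```

In this toy case group x passes (3 > 1 and 4 > 2) and group y does not (4 < 5), whereas name_cmp is a single TRUE regardless of the data.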
After merging the tables first, eval with envir = .SD and by does work.
> for (j in 2:4) {
+ if (j == 2) {
+ all <- copy(token.dt[[j]])
+ } else {
+ all <- token.dt[[j]][all, allow.cartesian = TRUE]
+ }
+ }
> head(all)
V1 V2 V3 V4 mi4 mi3 mi2
1: 0 0 1 0 13.292479 9.766820 6.494179
2: 0 001 in apart 13.233742 5.624795 13.249067
3: 0 001 in thick 13.005608 5.624795 13.249067
4: 0 002 on each 10.416711 7.301489 13.249067
5: 0 005 m in 5.625874 11.205271 13.249067
6: 0 025 in aluminum 13.443647 5.624795 13.249067
> z <- all[1:1000 , list(k = all(eval(parse(text = paste0("mi", i)), envir = .SD) > max(eval(parse(text = paste0("mi", i - 1)), envir = .SD), eval(parse(text = paste0("mi", i + 1)), envir = .SD))), mi = eval(parse(text = paste0("mi", i)), envir = .SD)), by = c(paste0("V", 1:i))][(k)]
> z <- unique(z)
> head(z)
V1 V2 V3 k mi
1: 0 1 milligram TRUE 13.51724
2: 0 1 mv TRUE 13.51724
3: 0 15 g TRUE 12.24260
4: 0 2 gram TRUE 13.52079
5: 0 2 mrads TRUE 13.34449
6: 0 3 wt TRUE 13.28771
This reproduces the result above, but here it is applied to only the first 1000 rows (of 970 696).
Sample data for reproducing the setup:
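library(data.table)

token.dt <- list()
# Random three-letter "words" arranged into 4-grams
types <- combn(LETTERS, 3, paste, collapse = "")
set.seed(1)
data <- data.table(matrix(sample(types, 4 * 1E6, replace = TRUE), ncol = 4))
setkey(data, V1, V2, V3, V4)
set.seed(1)
for (n in 2:4) {
  # Unique n-grams built from the first n columns
  token.dt[[n]] <- unique(cbind(data[ , 1:n, with = FALSE]))
  # Random score between 0 and 10, stored in column mi<n>
  token.dt[[n]][ , paste0("mi", n) := runif(nrow(token.dt[[n]])) * 10]
}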