I have a huge dataset (1000 × 100000), similar to the small example below, whose values I need to transcode to numerical codes.
# stringsAsFactors = TRUE keeps the columns as factors (as in the output below) on R >= 4.0
myd <- data.frame(v1 = sample(c("AA", "AB", "BB", NA), 10, replace = TRUE),
                  v2 = sample(c("CC", "CG", "GG", NA), 10, replace = TRUE),
                  v3 = sample(c("AA", "AT", "TT", NA), 10, replace = TRUE),
                  v4 = sample(c("AA", "AT", "TT", NA), 10, replace = TRUE),
                  v5 = sample(c("CC", "CA", "AA", NA), 10, replace = TRUE),
                  stringsAsFactors = TRUE
)
myd
     v1   v2   v3   v4   v5
1    AB   CC <NA> <NA>   AA
2    AB   CG   TT   TT   AA
3    AA   GG   AT   AT   CA
4  <NA> <NA> <NA>   AT <NA>
5    AA <NA>   AA <NA>   CA
6    BB <NA>   TT   TT   CC
7    AA   GG   AA   AT   CA
8  <NA>   GG <NA>   AT   CA
9    AA <NA>   AT <NA>   CC
10   AA   GG   TT   AA   CC
Each variable has potentially four unique values.
unique(myd$v1)
[1] AB AA <NA> BB
Levels: AA AB BB
unique(myd$v2)
[1] CC CG GG <NA>
Levels: CC CG GG
The unique values differ from variable to variable, but every value (except NA) consists of two letters. For example, "A" and "B" in the first variable give the combinations "AA", "AB", and "BB", whose numerical codes should be 1, 0, and -1, respectively. Similarly, in the second variable "C" and "G" give "CC", "CG", and "GG", so the numeric codes will again be 1, 0, and -1, respectively. Thus, myd above needs to be transcoded to:
myd
     v1   v2   v3   v4   v5
1     0    1 <NA> <NA>    1
2     0    0   -1   -1    1
3     1   -1    0    0    0
4  <NA> <NA> <NA>    0 <NA>
5     1 <NA>    1 <NA>    0
6    -1 <NA>   -1   -1   -1
7     1   -1    1    0    0
8  <NA>   -1 <NA>    0    0
9     1 <NA>    0 <NA>   -1
10    1   -1   -1    1   -1
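To make the coding rule concrete, here is a minimal sketch of one way it could be written in R. The helper name recode_genotype is hypothetical, and the rule it applies is an assumption read off the example: within each column, the homozygote whose letter sorts first becomes 1, a value with two different letters becomes 0, the other homozygote becomes -1, and NA stays NA. The data frame is typed in by hand from the printed example so the result can be compared with the table above.

```r
# Hypothetical helper (not from the original post): recode one column.
# Assumed rule: first-sorting homozygote -> 1, heterozygote -> 0,
# other homozygote -> -1, NA stays NA.
recode_genotype <- function(x) {
  x  <- as.character(x)          # works for factor or character columns
  a1 <- substr(x, 1, 1)          # first letter of each value
  a2 <- substr(x, 2, 2)          # second letter of each value
  alleles <- sort(unique(na.omit(c(a1, a2))))
  if (length(alleles) > 2) stop("more than two letters in one column")
  out <- rep(NA_integer_, length(x))
  out[!is.na(x) & a1 != a2] <- 0L                       # e.g. "AB", "CA"
  out[!is.na(x) & a1 == a2 & a1 == alleles[1]] <- 1L    # e.g. "AA"
  out[!is.na(x) & a1 == a2 & a1 != alleles[1]] <- -1L   # e.g. "BB"
  out
}

# The ten example rows, entered by hand from the printed output
myd <- data.frame(
  v1 = c("AB", "AB", "AA", NA, "AA", "BB", "AA", NA, "AA", "AA"),
  v2 = c("CC", "CG", "GG", NA, NA, NA, "GG", "GG", NA, "GG"),
  v3 = c(NA, "TT", "AT", NA, "AA", "TT", "AA", NA, "AT", "TT"),
  v4 = c(NA, "TT", "AT", "AT", NA, "TT", "AT", "AT", NA, "AA"),
  v5 = c("AA", "AA", "CA", NA, "CA", "CC", "CA", "CA", "CC", "CC")
)

# Apply the helper column by column and rebuild a data frame
coded <- as.data.frame(lapply(myd, recode_genotype))
coded
```

Note that a column containing only one letter (say, only "BB" and NA) would be coded as 1 under this assumed rule, which may or may not be what is wanted. For the full 1000 × 100000 data, the same lapply call should work unchanged, though a matrix-based approach may be faster.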