How to separate POS from words

Question

How to separate POS from words

You must create a text sparse matrix (DTM) for classification. To prepare the text, first I need to remove (split) the POS text tags. My guess was to do this as shown below. I am new to R and don't know how to undo REGEX (see NOT! Below).

text <- c("wenn/KOUS ausläuft/VVFIN ./$.", "Kommt/VVFIN vor/PTKVZ ;/$.", "-RRB-/TRUNC Durch/APPR und/KON", "man/PIS zügig/ADJD ./$.", "empfehlung/NN !!!/NE")

My guess is how this might work:

(POSs <- regmatches(text, gregexpr('[[:punct:]]*/[[:alpha:][:punct:]]*', text)))
[[1]]
[1] "/KOUS"  "/VVFIN" "./$."  

[[2]]
[1] "/VVFIN" "/PTKVZ" ";/$."  

[[3]]
[1] "-/TRUNC" "/APPR"   "/KON"   

[[4]]
[1] "/PIS"  "/ADJD" "./$." 

[[5]]
[1] "/NN"    "!!!/NE"

But do not specify how to cancel an expression like:

#                          VVV
(texts <- regmatches(text, NOT!(gregexpr('[[:punct:]]*/[[:alpha:][:punct:]]*', text))))
[[1]]
[1] "wenn"  "ausläuft"  

[[2]]
[1] "Kommt" "vor"  

[[3]]
[1] "Durch"   "und"   

[[4]]
[1] "man"  "zügig"

[[5]]
[1] "empfehlung"

+3

regex r pos-tagging

alex Feb 22 '14 at 3:49

source share

1 answer

alex · Answer 1 · 2014-02-22T04:24:53+0000

One of the possibilities is to exclude tags, search for POS tags and replace them with ''(i.e. empty text):

text <- c("wenn/KOUS ausläuft/VVFIN ./$.", "Kommt/VVFIN vor/PTKVZ ;/$.", "-RRB-/TRUNC Durch/APPR und/KON", "man/PIS zügig/ADJD ./$.", "empfehlung/NN !!!/NE")

(textlist <- strsplit(paste(gsub('[[:punct:]]*/[[:alpha:][:punct:]]*','', text), sep=' '), " "))
[[1]]
[1] "wenn"     "ausläuft"

[[2]]
[1] "Kommt" "vor"  

[[3]]
[1] "-RRB"  "Durch" "und"  

[[4]]
[1] "man"   "zügig"

[[5]]
[1] "empfehlung"

Friendly help rawr

How to separate POS from words

More articles: