Regular expression to extract unique fields from a .sdf file in R

I am looking for a regular expression in R to extract the field names from a chemical .sdf file. Each field name is enclosed in <> and follows a ">" at the beginning of a line. For instance, given

string=">  <FIELD1>\nfield text1\n\n>  <FIELD2>\nfield text2\n\n>  <FIELD3>field text3"

it should return

fields=c("FIELD1","FIELD2","FIELD3")

(The fields can occur several times, so I only need the unique() ones.) Any thoughts?

Cheers, Tom

2 answers

Try this. strapplyc extracts the part of each match that corresponds to the parenthesized portion of the regular expression, and simplify = unique then reduces the result to its unique values:

library(gsubfn)
strapplyc(string, "<([^>]*)>", simplify = unique)

giving:

[1] "FIELD1" "FIELD2" "FIELD3"

REVISED: slight simplification.


In base R, you can combine gregexpr, regmatches and unique:

unique(regmatches(string, gregexpr("(?<=<)\\w+(?=>)", string, perl = TRUE))[[1]])
# [1] "FIELD1" "FIELD2" "FIELD3"
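
Both patterns above match any <...> in the string, including one that happens to appear inside a field's text. Since the question says the field names follow a ">" at the beginning of a line, a line-anchored variant may be safer. The following is only a sketch in base R: the function name sdf_field_names and the sample string are made up for illustration, and it assumes that only whitespace sits between the leading ">" and the "<".

```r
# Hypothetical helper: return the unique field names found on SDF data
# header lines, i.e. lines that start with ">" followed by "<NAME>".
# Assumes only whitespace separates the leading ">" from the "<".
sdf_field_names <- function(sdf_text) {
  # Split the text into lines (handles both \n and \r\n line endings)
  lines <- unlist(strsplit(sdf_text, "\r?\n"))
  # Keep only lines that begin with ">" and then a <...> tag
  headers <- grep("^>\\s*<[^>]+>", lines, value = TRUE)
  # Pull out the text between < and >, then drop duplicates
  unique(sub("^>\\s*<([^>]+)>.*$", "\\1", headers))
}

# Made-up sample with a repeated field and a <...> inside the field text
sample_text <- ">  <FIELD1>\nvalue <1\n\n>  <FIELD2>\na <b> c\n\n>  <FIELD1>\nfield text3"
fields <- sdf_field_names(sample_text)
fields
# [1] "FIELD1" "FIELD2"
```

Here the "<b>" inside the field text is ignored because its line does not start with ">", and the repeated FIELD1 is reported only once.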