Regular expression to extract unique fields from a .sdf file in R

Question


I am looking for a regular expression in R to extract the field names from a chemical .sdf file. The field names are delimited by <> and follow a ">" at the beginning of the line. For instance, given

string=">  <FIELD1>\nfield text1\n\n>  <FIELD2>\nfield text2\n\n>  <FIELD3>field text3"

it should return

fields=c("FIELD1","FIELD2","FIELD3")

(Field names can occur several times, so I only need the unique() ones.) Any thoughts?

Cheers, Tom


string regex grep r chemistry

Tom Wenseleers Feb 22 '14 at 20:15

2 answers

You can use gregexpr, regmatches, and unique.

unique(regmatches(string, gregexpr("(?<=<)\\w+(?=>)", string, perl = TRUE))[[1]])
# [1] "FIELD1" "FIELD2" "FIELD3"
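The question says the names sit on lines that begin with ">". If the data lines might themselves contain angle brackets, one possible variant is to restrict the match to those header lines. This is only a sketch in base R: the pattern `^>[^<]*<([^>]+)>` and the variable names `lines`, `header_pattern`, and `headers` are assumptions made for illustration, not part of the original post.

```r
string <- ">  <FIELD1>\nfield text1\n\n>  <FIELD2>\nfield text2\n\n>  <FIELD3>field text3"

# Split the text into lines (readLines() on a file would give the same shape).
lines <- strsplit(string, "\n", fixed = TRUE)[[1]]

# Assumed header shape: ">" at line start, anything up to "<", then the name, then ">".
header_pattern <- "^>[^<]*<([^>]+)>"

# Keep only the header lines.
headers <- grep(header_pattern, lines, value = TRUE)

# Replace each header line with its captured name and drop duplicates.
fields <- unique(sub(paste0(header_pattern, ".*$"), "\\1", headers))
fields
# [1] "FIELD1" "FIELD2" "FIELD3"
```

Working line by line keeps the pattern simple, since `^` then refers to the start of each line without needing a multiline flag.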

Sven Hohenstein Feb 22 '14 at 20:25

G. Grothendieck · Accepted Answer · 2014-02-22T20:27:58+0000

Try this. It extracts the portion of each match that corresponds to the parenthesized part of the regular expression, and then simplifies the result with unique:

library(gsubfn)
strapplyc(string, "<([^>]*)>", simplify = unique)

giving:

[1] "FIELD1" "FIELD2" "FIELD3"

Revised: slight simplification.
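The patterns in the two answers differ in which names they accept: `\\w+` matches only letters, digits, and underscores, whereas `[^>]*` accepts anything up to the closing bracket. A minimal sketch of the difference in base R, assuming a hypothetical field name `MELTING.POINT` that does not appear in the question:

```r
# Hypothetical input: one field name contains a dot.
s <- ">  <MELTING.POINT>\n120\n\n>  <FIELD1>\nx"

# \\w+ cannot cross the dot, so the dotted name is skipped entirely.
word_only <- unique(regmatches(s, gregexpr("(?<=<)\\w+(?=>)", s, perl = TRUE))[[1]])
word_only
# [1] "FIELD1"

# [^>]+ takes every character up to the closing ">".
any_chars <- unique(regmatches(s, gregexpr("(?<=<)[^>]+(?=>)", s, perl = TRUE))[[1]])
any_chars
# [1] "MELTING.POINT" "FIELD1"
```

Which behaviour is preferable depends on the field names present in the file at hand.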
