Remove the end of the URL string in R

I am trying to clean up a list of URLs in R by removing the directories (paths).

What I have:

http://domain.com/123
http://www.sub.domain1.com/222
http://www.domain2.com/1233/abc

What I want:

domain.com
sub.domain1.com
domain2.com

I have a somewhat long-winded way to strip the start of the URL:

url <- c("http://domain.com/123", "http://www.sub.domain1.com/222", "http://www.domain2.com/1233/abc")

cleanurl <- gsub("http://","",url)
cleanurl2 <- gsub("www.","",cleanurl)

(Please let me know if there is an easier way to remove http:// and www.)

Now I am having trouble with the regex for deleting everything after the / at the end. I tried this:

cleanurl3 <- gsub("/*","",cleanurl2)

But this only removes the / itself, not what comes after it.

Thanks in advance for your help!

3 answers

I'd go with a strsplit/gsub combo (rather than gsub alone, because strsplit is sometimes quicker to reason about, as it is very intuitive):

x <- readLines(n=3)
http://domain.com/123
http://www.sub.domain1.com/222
http://www.domain2.com/1233/abc

gsub("^www[.]", "", sapply(strsplit(x, "//|/"), "[", 2))

## [1] "domain.com"      "sub.domain1.com" "domain2.com"
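
To see why the second element is the one to keep, it can help to look at what strsplit returns before the sapply step. A minimal sketch, using a literal vector x in place of the readLines input above; the names parts and hosts are only illustrative:

```r
# Same three URLs as above, written as a literal vector for reproducibility.
x <- c("http://domain.com/123",
       "http://www.sub.domain1.com/222",
       "http://www.domain2.com/1233/abc")

# Splitting on "//" or "/" yields the scheme first, then the host,
# then one element per path component.
parts <- strsplit(x, "//|/")
parts[[3]]

# The host is always element 2; vapply is a type-checked alternative to sapply.
hosts <- vapply(parts, `[`, character(1), 2)
hosts
```

At this point the hosts still carry any leading www., which is what the gsub call takes care of.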

Edit: or with strsplit alone:

sapply(strsplit(x, "(//|/)(www[.])?"), "[", 2)

To remove the leading http:// and the optional www. in one step:

cleanurl <- sub("^http://(?:www[.])?(.*)$", "\\1", url)
cleanurl
## [1] "domain.com/123"       "sub.domain1.com/222"  "domain2.com/1233/abc"

To also drop everything from the first / onward:

cleanurl <- sub("^http://(?:www[.])?([^/]*).*$", "\\1", url)
cleanurl
## [1] "domain.com"      "sub.domain1.com" "domain2.com" 
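
If the list may also contain https links, the same pattern extends naturally. A sketch under that assumption; the function name strip_to_host, the vector name urls and the https example are illustrative and not part of the question:

```r
# Hypothetical helper: keep only the host, dropping the scheme,
# an optional "www." prefix, and the path.
strip_to_host <- function(u) {
  # perl = TRUE so the (?:...) non-capturing group is handled by PCRE.
  sub("^https?://(?:www[.])?([^/]*).*$", "\\1", u, perl = TRUE)
}

urls <- c("http://domain.com/123",
          "http://www.sub.domain1.com/222",
          "https://www.domain2.com/1233/abc")  # https variant is an assumed input
strip_to_host(urls)
```

Strings that do not start with http:// or https:// are returned unchanged, since sub leaves non-matching input alone.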

This should work:

cleanurl <- gsub("http://","",url)
cleanurl2 <- gsub("www.","",cleanurl)

sapply(strsplit(cleanurl2,"/"),"[",1)
[1] "domain.com"      "sub.domain1.com"
[3] "domain2.com" 
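
The last step can also be done with sub, which shows what the pattern in the question was missing: "/*" matches zero or more slashes, whereas "/.*" matches the first slash and everything after it. A small sketch, starting from the already-cleaned strings (slashes_only is just an illustrative name):

```r
# Strings as they look after removing "http://" and "www."
# (cleanurl2 in the question).
cleanurl2 <- c("domain.com/123", "sub.domain1.com/222", "domain2.com/1233/abc")

# "/*" only matches runs of slashes, so only the slashes disappear.
slashes_only <- gsub("/*", "", cleanurl2)

# "/.*" matches the first slash plus the rest of the string.
cleanurl3 <- sub("/.*", "", cleanurl2)
cleanurl3
```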