- === Extract domains only from a list of URLs ===
- I am at best a slightly knowledgeable novice in regex, and I need some help with URLs.
- I have a list of URLs in a LibreOffice Calc sheet that look like this:
- http://www.localpetcare.com
- https://app.clickfunnels.com/users/sign_in
- https://www.timetopet.com/login#
- https://www.bankatfirst.com/content/first-financial-bank/home/
- https://app.truecoach.co/login
- https://shield.mycoseva.com/qcoseva/ordernow.dhtml
- https://d.comenity.net/as/authorization.oauth2?client_id=ngac
- (These are part of a larger CSV dataset; thus the use of LibreOffice Calc.)
- I need to strip out the protocol, any subdomains, and anything that occurs after the domain proper, so that the list will look like this:
- localpetcare.com
- clickfunnels.com
- timetopet.com
- bankatfirst.com
- truecoach.co
- mycoseva.com
- comenity.net
- Finding an expression that strips the protocol was easy. However, after a couple of days of online research, I can find nothing that will eliminate any arbitrary subdomain(s) and any text following the domain.
- Any help gratefully accepted.
Comments
-
- This problem gets thorny. Some people, who no doubt know more than I do, recommend using a library or, better yet, a browser's URL logic. You will encounter edge cases in the real world; for example, your data set includes neither invalid URLs nor the full set of characters that can appear in a domain name.
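- For comparison, here is a minimal sketch of the library route in Python, using the standard library's urllib.parse to isolate the hostname. The helper name naive_domain is made up for illustration, and the "keep the last two labels" rule is an assumption that happens to fit the sample list; it gives the wrong answer for suffixes such as co.uk, where a public-suffix-aware library would be needed.

```python
from urllib.parse import urlsplit


def naive_domain(url: str) -> str:
    """Return the last two dot-separated labels of the URL's hostname.

    Hypothetical helper. Assumption: the domain proper is always the
    last two labels, which is not true for suffixes like "co.uk".
    """
    # urlsplit handles the scheme, port, path, query, and fragment;
    # .hostname is lowercased and is None if no host could be parsed.
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    return ".".join(labels[-2:])


urls = [
    "http://www.localpetcare.com",
    "https://app.clickfunnels.com/users/sign_in",
    "https://www.timetopet.com/login#",
    "https://d.comenity.net/as/authorization.oauth2?client_id=ngac",
]

for url in urls:
    print(naive_domain(url))
```

- Letting the URL parser find the hostname means the only hand-written logic left is the label-trimming step, which is also the part most likely to need replacing for real-world data.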
- But given that it's fun to bang your head against problems and learn in the process, here's a regex that will extract the domain. There's only one capture group, $1, which holds the domain. The parts of the URL that you don't care about are matched by non-capturing groups, each opened with (?: so any of them could be turned into a capturing group if needed.
- /^(?:https?:\/\/)(?:(?:www|ww\d)\.)?(?:(?:[^:/]+)\.)*([^:/]+\.[a-z0-9-]+)(?:\/.+)?$/
- Mess around with it on regexr! Hover over parts of the regex to see how they function:
- regexr.com/7mkbc
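- If you want to try the same pattern outside regexr, the sketch below runs it in Python over the URLs from the question. The function name extract_domain is hypothetical, and the escaped slashes from the /.../ literal are dropped because Python patterns do not need them; otherwise the expression is unchanged.

```python
import re

# Same expression as above, written as a Python raw string.
DOMAIN_RE = re.compile(
    r"^(?:https?://)"            # protocol (required)
    r"(?:(?:www|ww\d)\.)?"       # optional www. / ww2. style prefix
    r"(?:(?:[^:/]+)\.)*"         # any further subdomains
    r"([^:/]+\.[a-z0-9-]+)"      # capture group 1: the last two labels
    r"(?:/.+)?$"                 # optional path, query, fragment
)


def extract_domain(url: str):
    """Return capture group 1, or None if the URL does not match."""
    match = DOMAIN_RE.match(url)
    return match.group(1) if match else None


urls = [
    "http://www.localpetcare.com",
    "https://app.clickfunnels.com/users/sign_in",
    "https://www.timetopet.com/login#",
    "https://www.bankatfirst.com/content/first-financial-bank/home/",
    "https://app.truecoach.co/login",
    "https://shield.mycoseva.com/qcoseva/ordernow.dhtml",
    "https://d.comenity.net/as/authorization.oauth2?client_id=ngac",
]

for url in urls:
    print(extract_domain(url))
```

- Two edge cases worth knowing about: a URL that ends in a bare slash right after the host (for example http://example.com/) does not match, because the trailing group requires at least one character after the slash, and a host such as www.example.co.uk comes out as co.uk because only the last two labels are captured.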
-
- @TimRenner, thanks so much for this. I'll definitely check it out.
- JGB