jgb2185

Extract domains only from a list of URLs

Apr 16th, 2022 (edited)
  1. === Extract domains only from a list of URLs ===
  2.  
  3. I am at best a slightly knowledgeable novice in regex, and I need some help with URLs.
  4.  
  5. I have a list of URLs in a LibreOffice Calc sheet that look like this:
  6.  
  7. http://www.localpetcare.com
  8. https://app.clickfunnels.com/users/sign_in
  9. https://www.timetopet.com/login#
  10. https://www.bankatfirst.com/content/first-financial-bank/home/
  11. https://app.truecoach.co/login
  12. https://shield.mycoseva.com/qcoseva/ordernow.dhtml
  13. https://d.comenity.net/as/authorization.oauth2?client_id=ngac
  14.  
  15.  
  16. (These are part of a larger CSV dataset; thus the use of LibreOffice Calc.)
  17.  
  18. I need to strip out the protocol, any subdomains, and anything that occurs after the domain proper, so that the list will look like this:
  19.  
  20. localpetcare.com
  21. clickfunnels.com
  22. timetopet.com
  23. bankatfirst.com
  24. truecoach.co
  25. mycoseva.com
  26. comenity.net
  27.  
  28. Finding an expression that strips the protocol was easy. However, after a couple of days of online research, I can find nothing that will eliminate any arbitrary subdomain(s) and any text following the domain.
  29.  
  30. Any help gratefully accepted.
Comments
  • TimRenner
    234 days
    1. This problem gets thorny. Some people, who no doubt know more than I do, recommend using a library or, better yet, a browser's URL logic. You will encounter edge cases in the real world; for example, your data set includes neither invalid URLs nor the full set of characters that can appear in a domain name.
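
For the library route, a minimal Python sketch might look like the following. It uses the standard library's urllib.parse.urlsplit to isolate the host, then keeps the last two dot-separated labels. The function name naive_domain is made up for illustration, and the two-label rule is an assumption that fits the sample list but would return co.uk for a host like www.example.co.uk; handling that case properly needs a Public Suffix List lookup.

```python
from urllib.parse import urlsplit


def naive_domain(url: str) -> str:
    """Return the last two labels of the URL's host (hypothetical helper).

    Assumption: the registrable domain is always the final two labels,
    which is true for the sample data but not for suffixes such as co.uk.
    """
    # hostname is None when the URL has no scheme/netloc; fall back to "".
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    return ".".join(labels[-2:])


SAMPLE_URLS = [
    "http://www.localpetcare.com",
    "https://app.clickfunnels.com/users/sign_in",
    "https://d.comenity.net/as/authorization.oauth2?client_id=ngac",
]

for url in SAMPLE_URLS:
    print(naive_domain(url))
```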
    2.  
    3. But given that it's fun to bang your head against problems and learn in the process, here's a regex that will extract the domain. There's only one capture group, $1. The parts of the URI that you don't care about are matched by non-capturing groups, each opened with (?:. Any of those could be turned into capturing groups if you need them.
    4.  
    5. /^(?:https?:\/\/)(?:(?:www|ww\d)\.)?(?:(?:[^:/]+)\.)*([^:/]+\.[a-z0-9-]+)(?:\/.+)?$/
    6.  
    7. Mess around with it on regexr! Hover over parts of the regex to see how they function:
    8. regexr.com/7mkbc
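
To try the regex outside regexr, here is a rough Python sketch that applies it to the URLs from the question. The pattern is the one above with the /.../ delimiters dropped and the slashes unescaped, since Python needs neither; the names DOMAIN_RE and extract_domain are hypothetical.

```python
import re
from typing import Optional

# Same pattern as in the comment, minus the /.../ delimiters and the
# escaped slashes. Group 1 is the only capturing group.
DOMAIN_RE = re.compile(
    r"^(?:https?://)(?:(?:www|ww\d)\.)?(?:(?:[^:/]+)\.)*"
    r"([^:/]+\.[a-z0-9-]+)(?:/.+)?$"
)


def extract_domain(url: str) -> Optional[str]:
    """Return capture group 1 for a matching URL, or None if no match."""
    match = DOMAIN_RE.match(url)
    return match.group(1) if match else None


URLS = [
    "http://www.localpetcare.com",
    "https://app.clickfunnels.com/users/sign_in",
    "https://www.timetopet.com/login#",
    "https://www.bankatfirst.com/content/first-financial-bank/home/",
    "https://app.truecoach.co/login",
    "https://shield.mycoseva.com/qcoseva/ordernow.dhtml",
    "https://d.comenity.net/as/authorization.oauth2?client_id=ngac",
]

for url in URLS:
    print(extract_domain(url))
```

With the seven sample URLs this yields the list the question asks for. Two limits worth knowing: a URL that ends in a bare slash, such as http://www.localpetcare.com/, does not match because (?:/.+)? needs at least one character after the slash, and a host followed by a port number does not match either.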
  • jgb2185
    234 days
    1. @TimRenner, thanks so much for this. I'll definitely check it out.
    2.  
    3. JGB
    4.  