Coder Perfect

Extract hostname name from string

Problem

I’d like to match only the root of a URL from a text string, not the entire URL. Given:

http://www.youtube.com/watch?v=ClkQA2Lb_iE

http://www.example.com/12xy45
http://example.com/random

I’d like the last two instances to resolve to www.example.com or example.com.

I’ve heard regex is slow, and this would be the second regex on the page, so please let me know if there’s a way to do this without one.

I’m looking for a variation of this solution that uses JS/jQuery.

Asked by Chamilyan

Solution #1

Without using regular expressions, here’s a nice trick:

var tmp = document.createElement('a');
tmp.href = "http://www.example.com/12xy45";

// tmp.hostname will now contain 'www.example.com'
// tmp.host will also contain 'www.example.com'; it includes the port only when
// the URL specifies a non-default one (e.g. 'www.example.com:8080')

Wrap the above in a function like the one below, and you have a handy way to extract the hostname part of a URL.

function url_domain(data) {
  // Let the browser parse the URL by assigning it to an anchor element
  var a = document.createElement('a');
  a.href = data;
  return a.hostname;
}

Answered by Filip Roséen – refp

Solution #2

I recommend installing the psl npm package (Public Suffix List). The “Public Suffix List” is a list of all valid domain suffixes and rules. It covers not just country-code top-level domains but also Unicode names that would be considered the root domain (e.g. www.食狮.公司.cn, b.c.kobe.jp, etc.).

Try:

npm install --save psl

Then call it together with my “extractHostname” implementation:

let psl = require('psl');
let url = 'http://www.youtube.com/watch?v=ClkQA2Lb_iE';
psl.get(extractHostname(url)); // returns youtube.com

Because I can’t use an npm package here, my tests are limited to extractHostname.

You can extract the hostname whether or not the URL includes the protocol or a port number. This is a fairly simple, non-regex solution, and I believe it will be enough.
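The extractHostname implementation itself is not reproduced in this post. As a stand-in, here is a minimal non-regex sketch of how such a function could work, using only string splitting. It is an assumption about the approach, not the answer author's code, and it does not handle bracketed IPv6 hosts such as [::1].

```javascript
// Hypothetical sketch: not the answer author's original code.
function extractHostname(url) {
  // Discard the fragment and query first so that a URL embedded in them
  // (e.g. "?next=http://other.example") cannot be mistaken for the scheme.
  var rest = url.split('#')[0].split('?')[0];

  // Strip a leading scheme ("http://", "ftp://", ...) or a protocol-relative "//".
  var schemeEnd = rest.indexOf('://');
  if (schemeEnd > 0 && schemeEnd + 1 === rest.indexOf('/')) {
    rest = rest.slice(schemeEnd + 3);
  } else if (rest.indexOf('//') === 0) {
    rest = rest.slice(2);
  }

  // The authority ends at the first "/"; drop userinfo ("user:pass@") and port (":8080").
  var authority = rest.split('/')[0];
  var hostAndPort = authority.slice(authority.lastIndexOf('@') + 1);
  return hostAndPort.split(':')[0];
}

console.log(extractHostname('http://www.youtube.com/watch?v=ClkQA2Lb_iE')); // www.youtube.com
console.log(extractHostname('example.com:8080/random'));                    // example.com
```

The result can then be passed to psl.get() as shown earlier to reduce www.youtube.com to youtube.com.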

Although URL(url).hostname is a viable answer, it does not handle some of the edge cases I’ve addressed. As my tests show, it rejects certain URLs. However, you can certainly combine it with my solution to make everything work.

Thank you for your suggestions, @Timmerz, @renoirb, @rineez, @BigDong, @ra00l, @ILikeBeansTacos, and @CharlesRobertson! Thank you for reporting the bug, @ross-allen!

Answered by lewdev

Solution #3

There’s no need to parse the string; simply provide it to the URL constructor as an argument:

const url = 'http://www.youtube.com/watch?v=ClkQA2Lb_iE';
const { hostname } = new URL(url);

console.assert(hostname === 'www.youtube.com');
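One caveat worth illustrating: the URL constructor throws a TypeError when it cannot parse its input, for example a string without a scheme. A small wrapper can turn that into a null result. The name safeHostname is hypothetical and used only for this sketch.

```javascript
// Hypothetical helper: returns the hostname, or null if the input cannot be parsed.
function safeHostname(input) {
  try {
    return new URL(input).hostname;
  } catch (err) {
    // new URL() throws a TypeError for strings it cannot parse,
    // such as "example.com/random" (no scheme).
    return null;
  }
}

console.log(safeHostname('http://example.com/random')); // example.com
console.log(safeHostname('example.com/random'));        // null
```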

Answered by Pavlo

Solution #4

Try this:

var matches = url.match(/^https?\:\/\/([^\/?#]+)(?:[\/?#]|$)/i);
var domain = matches && matches[1];  // domain will be null if no match is found

Use this expression instead if you wish to omit the port from your result:

/^https?\:\/\/([^\/:?#]+)(?:[\/:?#]|$)/i

To prevent specific domains from matching, use a negative lookahead such as (?!youtube.com):

/^https?\:\/\/(?!(?:www\.)?(?:youtube\.com|youtu\.be))([^\/:?#]+)(?:[\/:?#]|$)/i
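To see how the three expressions differ, here is a short sketch that runs each one against sample URLs. The variable names and the firstGroup helper are made up for this illustration; the patterns are copied from above.

```javascript
// The three patterns from above, named here only for illustration.
var withPort = /^https?\:\/\/([^\/?#]+)(?:[\/?#]|$)/i;
var withoutPort = /^https?\:\/\/([^\/:?#]+)(?:[\/:?#]|$)/i;
var notYouTube = /^https?\:\/\/(?!(?:www\.)?(?:youtube\.com|youtu\.be))([^\/:?#]+)(?:[\/:?#]|$)/i;

// Hypothetical helper: returns capture group 1, or null when there is no match.
function firstGroup(pattern, url) {
  var matches = url.match(pattern);
  return matches && matches[1];
}

console.log(firstGroup(withPort, 'http://example.com:8080/random'));    // example.com:8080
console.log(firstGroup(withoutPort, 'http://example.com:8080/random')); // example.com
console.log(firstGroup(notYouTube, 'http://www.youtube.com/watch?v=ClkQA2Lb_iE')); // null
console.log(firstGroup(notYouTube, 'http://www.example.com/12xy45'));   // www.example.com
```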

Answered by gilly3

Solution #5

There are two viable options, neither of which needs extra dependencies. Which one to pick depends on whether you need to optimize for performance:

Using URL.hostname is the simplest and most straightforward method.

Except for Internet Explorer, all major browsers provide the hostname property as part of the URL API (see caniuse). If you need to support older browsers, use a URL polyfill.

You’ll also have access to other URL properties and methods if you use the URL constructor!
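As a quick illustration of those other properties, the sketch below parses a made-up URL (the address, port, and query string are assumptions chosen for the example) and prints the parts that the URL object exposes.

```javascript
const parsed = new URL('https://www.example.com:8443/12xy45?ref=home#top');

console.log(parsed.protocol); // https:
console.log(parsed.hostname); // www.example.com
console.log(parsed.port);     // 8443
console.log(parsed.host);     // www.example.com:8443
console.log(parsed.pathname); // /12xy45
console.log(parsed.searchParams.get('ref')); // home
console.log(parsed.hash);     // #top
```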

For the most part, URL.hostname should be your first choice. However, it’s still a lot slower than matching with a regex (see jsPerf).

In short, URL.hostname is usually the best option. Consider a regex only if you need to process a very large number of URLs and performance is a concern.
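If you want to check the performance difference on your own machine, a rough timing sketch like the one below may help. The helper names, the sample URL, and the iteration count are assumptions, and the regex is the port-stripping pattern from Solution #4, not necessarily the one this answer benchmarked. Timings vary by engine and hardware, so treat the numbers as indicative only.

```javascript
// Hypothetical helpers for a rough comparison; names are illustrative.
const viaUrl = (url) => new URL(url).hostname;
const viaRegex = (url) => {
  const matches = url.match(/^https?:\/\/([^\/:?#]+)(?:[\/:?#]|$)/i);
  return matches && matches[1];
};

// Runs fn repeatedly and returns the elapsed time in milliseconds.
function timeIt(fn, url, iterations) {
  const start = Date.now();
  for (let i = 0; i < iterations; i++) fn(url);
  return Date.now() - start;
}

const sample = 'http://www.youtube.com/watch?v=ClkQA2Lb_iE';
console.log('URL:  ', timeIt(viaUrl, sample, 100000), 'ms');
console.log('regex:', timeIt(viaRegex, sample, 100000), 'ms');
```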

Answered by Robin Métral

Post is based on https://stackoverflow.com/questions/8498592/extract-hostname-name-from-string