Coder Perfect

Node.js HTML parser [closed]

Problem

Is there a Node.js equivalent of Ruby’s Nokogiri? I’m looking for a user-friendly HTML parser.

I looked through the parsers listed on the Node.js modules page, but I couldn’t find one that is both pleasant to use and actively maintained.

Asked by asci

Solution #1

If you want to build a DOM from HTML, you can use jsdom.

There’s also cheerio, which offers a jQuery-like interface. It was much faster than older versions of jsdom, although the two now perform comparably.

You might want to look at htmlparser2, a streaming parser that, according to its own benchmark, is faster than the alternatives and builds no DOM by default. It can still produce a DOM, because it ships with a handler that creates one. This is the parser that cheerio uses.
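To make the difference between a streaming parser and a DOM builder concrete, here is a minimal sketch in plain JavaScript. It is a toy and not htmlparser2's implementation or API: the names parseStream, countTags, and createTreeHandler are made up for illustration, and the tokenizer only copes with simple, well-formed tags without attributes. It only shows the general idea that the parser emits events, and that a DOM is just one possible handler for those events.

```javascript
// Toy illustration only: NOT htmlparser2's code or API.
// Assumes simple, well-formed markup: no attributes, comments, or void tags.

// Streaming tokenizer: walks the input once and reports events to a handler.
function parseStream(html, handler) {
  const tokenPattern = /<\/([a-zA-Z][a-zA-Z0-9]*)>|<([a-zA-Z][a-zA-Z0-9]*)>|([^<]+)/g;
  let match;
  while ((match = tokenPattern.exec(html)) !== null) {
    if (match[1] !== undefined) handler.onCloseTag(match[1]);
    else if (match[2] !== undefined) handler.onOpenTag(match[2]);
    else handler.onText(match[3]);
  }
}

// Handler 1: counts opening tags and never builds a tree.
function countTags(html) {
  const counts = {};
  parseStream(html, {
    onOpenTag(name) { counts[name] = (counts[name] || 0) + 1; },
    onText() {},
    onCloseTag() {},
  });
  return counts;
}

// Handler 2: builds a small DOM-like tree from the same events.
function createTreeHandler() {
  const root = { name: '#root', children: [] };
  const stack = [root]; // the last entry is the currently open element
  return {
    root,
    onOpenTag(name) {
      const node = { name, children: [] };
      stack[stack.length - 1].children.push(node);
      stack.push(node);
    },
    onText(text) {
      stack[stack.length - 1].children.push({ text });
    },
    onCloseTag() {
      if (stack.length > 1) stack.pop();
    },
  };
}

const sample = '<ul><li>one</li><li>two</li></ul>';

const tagCounts = countTags(sample);
const treeHandler = createTreeHandler();
parseStream(sample, treeHandler);
const tree = treeHandler.root;

console.log(tagCounts);
console.log(JSON.stringify(tree, null, 2));
```

The counting handler keeps no nodes in memory, which is why an event-based parser can be cheap when you only need a few facts from a page. The tree handler costs more but gives you something you can traverse afterwards.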

parse5 also looks like a good option. It’s fairly active (its last commit was 11 days before this update), WHATWG-compliant, and used in jsdom, Angular, and Polymer.

If you want to parse HTML for web scraping, you can also use YQL[1], which has its own node module. If your HTML comes from a static website, I think YQL is the best approach, because you’re relying on a service rather than your own code and processing power. Note that YQL will not work if the page is disallowed by the website’s robots.txt.

If the website you’re scraping is dynamic, you should use a headless browser such as phantomjs. If you’re considering phantomjs, also look into casperjs. SpookyJS lets you drive CasperJS from Node.

Besides phantomjs, there’s zombiejs. Unlike phantomjs, which cannot be embedded in Node.js, zombiejs is an ordinary Node module.

For the latter options, there is a Nettuts+ tutorial.

[1] The YUI library, which is required for YQL, has not been actively developed since August 2014, according to the source code.

Answered by Farid Nouri Neshat

Solution #2

Try https://github.com/tmpvar/jsdom, which takes some HTML and generates a DOM for you.

Answered by thejh

Solution #3

You can also take a look at x-ray: https://github.com/lapwinglabs/x-ray.

Answered by png

Post is based on https://stackoverflow.com/questions/7977945/html-parser-on-node-js