A note about HTML (4 or 5) vs. XHTML. My suggestion to use XHTML is purely pragmatic: XHTML is XML, so one can use any XML library to parse the document. I know it "seems" that there are many HTML parsers out there; unfortunately, that is not really true. There are a few, at least for the most popular programming languages, but in my experience they are "bloated" and full of issues. I say this because I once tried to use such tools, including "Beautiful Soup" for Python, and failed. In the end I settled on https://github.com/ericchiang/pup, exported the whole document as JSON, and moved on from there.

Ciprian.

BTW, the tool I mentioned, `pup`, can also be used for HTML meta-data extraction.
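
To make the "XHTML is XML" point concrete, here is a minimal sketch that reads the title and `<meta>` tags of an XHTML page using only Python's standard `xml.etree.ElementTree`. The sample document and the `extract_metadata` helper are made up for illustration; this is not how `pup` works, just one way an ordinary XML library could do the job, assuming the input really is well-formed XHTML.

```python
import xml.etree.ElementTree as ET

# The standard XHTML namespace; elements must be looked up through it.
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical sample document, used only for illustration.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Example page</title>
    <meta name="author" content="Jane Doe" />
    <meta name="description" content="A made-up sample document" />
  </head>
  <body><p>Hello</p></body>
</html>
"""


def extract_metadata(xhtml_text):
    """Return the title and the named <meta> entries of an XHTML document."""
    root = ET.fromstring(xhtml_text)  # raises ParseError if not well-formed
    ns = {"x": XHTML_NS}
    title = root.findtext("x:head/x:title", default="", namespaces=ns)
    meta = {}
    for node in root.iterfind("x:head/x:meta", ns):
        name = node.get("name")
        if name:  # skip entries such as <meta charset="..."/>
            meta[name] = node.get("content", "")
    return {"title": title, "meta": meta}


if __name__ == "__main__":
    print(extract_metadata(SAMPLE))
```

The catch is that this only works when the document is well-formed XML; ordinary "tag soup" HTML makes the parser raise an error instead of guessing.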