Develop Analyzer - HtmlDocument Class

The content variable has a property named AsHtml. This property provides the parsed HTML of the content, based on the Data property of the content variable. AsHtml uses the HTML Agility Pack to parse the HTML and stores the resulting Agility HTML Document in its Document property. If the content cannot be parsed as HTML, AsHtml returns null (None).
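
Because AsHtml can be None, an analyzer normally checks it before touching the document. The following is a minimal sketch of that guard. FakeContent, FakeHtmlDocument and analyze are hypothetical stand-ins, written only so the snippet runs outside the application; only the AsHtml, Data and Document names come from the description above.

```python
class FakeHtmlDocument(object):
    """Hypothetical stand-in for the object returned by AsHtml."""

    def __init__(self, document):
        # In the application this would hold the Agility HTML Document
        self.Document = document


class FakeContent(object):
    """Hypothetical stand-in for the content variable."""

    def __init__(self, data, parsable=True):
        self.Data = data
        self._parsable = parsable  # assumption: simulates a parse failure

    @property
    def AsHtml(self):
        # Mirrors the documented contract: None when the data is not parsable
        if not self._parsable:
            return None
        return FakeHtmlDocument(self.Data)


def analyze(content):
    """Return True if the HTML checks could run, False if parsing failed."""
    html = content.AsHtml  # read the property once, then test the local
    if html is None:
        return False
    # ... HTML checks that use html.Document would go here ...
    return True


ok = analyze(FakeContent("<html><title>t</title></html>"))
skipped = analyze(FakeContent("not html", parsable=False))
```

In a real analyzer the same guard wraps code that uses content.AsHtml directly, as the samples in this page do.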

Methods and Properties

Type | Name | Data Type | Description
Content Class Properties
Property | Document | Agility HTML Document | The HTML Agility Pack is an open source C# library for parsing HTML documents. It is a very powerful library, and it is used here to parse HTML documents.
Method | GetElementByTag | List<HTMLNode> | Gets all nodes in the HTML document that have the given tag name.
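
As an illustration of how GetElementByTag can be used, the sketch below counts the nodes found for several tag names. StubNode, StubHtmlDocument and count_tags are hypothetical names introduced for this example; the stub only imitates the documented behavior (a list of nodes with the given tag name) so that the code runs outside the application.

```python
class StubNode(object):
    """Hypothetical stand-in for an HTML node; only the tag name is modeled."""

    def __init__(self, name):
        self.Name = name


class StubHtmlDocument(object):
    """Hypothetical stand-in for the HtmlDocument class described above."""

    def __init__(self, nodes):
        self._nodes = nodes

    def GetElementByTag(self, tag):
        # Imitates the documented behavior: all nodes with the given name
        return [n for n in self._nodes if n.Name == tag]


def count_tags(html, tag_names):
    """Map each requested tag name to the number of matching nodes."""
    counts = {}
    for name in tag_names:
        counts[name] = len(html.GetElementByTag(name))
    return counts


doc = StubHtmlDocument([StubNode("title"), StubNode("img"), StubNode("img")])
result = count_tags(doc, ["title", "img", "a"])
```

Inside an analyzer, content.AsHtml would take the place of the stub document once it has been checked against None.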

Samples

This sample checks that an HTML page has exactly one title tag. According to the HTML specification, every page must contain one title tag, so the sample adds a message to the content when the rule is broken.

if content.AsHtml is not None:
    # Collect every title tag in the page
    TitleTags = content.AsHtml.GetElementByTag("title")
    # The page must contain exactly one title tag
    if len(TitleTags) != 1:
        content.AddMessage("HTML page should contain exactly one title tag", "e")

This sample counts the number of different images in an HTML page and adds a warning message to the content if there are more than 20.

if content.AsHtml is not None:
    # Two img tags count as the same image when their src values match
    Sources = []
    for t in content.AsHtml.GetElementByTag("img"):
        src = t.Attributes["src"]
        if src is not None and src.Value not in Sources:
            Sources.append(src.Value)
    if len(Sources) > 20:
        content.AddMessage("There are " + str(len(Sources)) + " different images in this page. Try to keep the number of images in a page to 20 or fewer.", "w")

The code below finds all hyperlinks in a page and adds them to the page's sub URLs so that the application tries to follow them. Anchor tags without an href attribute are reported with a warning.

if content.AsHtml is not None:
    for i in content.AsHtml.GetElementByTag("a"):
        href = i.Attributes["href"]
        if href is not None:
            # Add the link target so the application follows it
            content.AddUrl(href.Value)
        else:
            # Report the anchor together with its position in the source
            content.AddMessage("An 'anchor' tag without href found", "w", i.Line, i.LinePosition)

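A page often links to the same address more than once. The sketch below shows one way to collect each href value only once before adding it. StubAttribute, StubAttributes, StubNode and unique_hrefs are hypothetical names for this example; the stubs assume, as the hyperlink sample does, that indexing a missing attribute gives None.

```python
class StubAttribute(object):
    """Hypothetical stand-in for a single attribute with a Value."""

    def __init__(self, value):
        self.Value = value


class StubAttributes(object):
    """Hypothetical attribute collection: a missing name gives None."""

    def __init__(self, values):
        self._values = values

    def __getitem__(self, name):
        if name in self._values:
            return StubAttribute(self._values[name])
        return None


class StubNode(object):
    """Hypothetical stand-in for an anchor node."""

    def __init__(self, attributes):
        self.Attributes = StubAttributes(attributes)


def unique_hrefs(anchors):
    """Return href values in first-seen order, skipping repeats and missing hrefs."""
    seen = set()
    urls = []
    for node in anchors:
        href = node.Attributes["href"]
        if href is None or href.Value in seen:
            continue
        seen.add(href.Value)
        urls.append(href.Value)
    return urls


anchors = [
    StubNode({"href": "/a"}),
    StubNode({"href": "/b"}),
    StubNode({"href": "/a"}),  # repeated link
    StubNode({}),              # anchor without href
]
links = unique_hrefs(anchors)
```

In an analyzer, content.AsHtml.GetElementByTag("a") would be the input and each returned value could be passed to content.AddUrl. Whether the application already ignores repeated URLs is not stated here, so this step may be unnecessary.
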
©Copyright Hamed J.I 2010. Hosted on SourceForge