EFTIDYCOM
Before going into detail, I want you to know what EfTidy actually is. EfTidy is a wrapper component for the Tidy library. If you don't know what Tidy is, here is a short description.
“TidyLib is an open source utility for tidying up HTML. Tidy is composed from an HTML parser and an HTML pretty printer. The parser goes to considerable lengths to correct common markup errors. It also provides advice on how to make your pages more accessible to people with disabilities, and can be used to convert HTML content into XML as XHTML. Tidy is W3C open source and available free. It has been successfully compiled on a large number of platforms, and is being integrated into many HTML authoring tools.” --By Mr. Dave Raggett
What I am doing with this library
Recently one of my company's clients asked us to build a TidyAtl class for the new Tidy library, because the last ATL component (ActiveX wrapper) for Tidy was built in 2002. So my company assigned me the task of creating an ATL library for it. After the component was complete, my boss told me, "Alok, this is an open source component, and other programmers deserve to use it." So here I am, presenting this component with its supporting source code and a brief overview of each function.
CharEncodingType
typedef [public] enum tagCharEncodingType { ASCII, LATIN1, RAW, UTF8, ISO2022, MAC, WIN1252, UTF16LE, UTF16BE, UTF16, BIG5, SHIFTJIS } CharEncodingType;
OutputType
typedef [public] enum tagOutputType
IndentScheme
typedef [public] enum IndentScheme { NOINDENT=0, INDENTBLOCKS, AUTOINDENT } IndentScheme;
DoctypeModes
typedef [public] enum {
    DoctypeOmit,   /**< Omit DOCTYPE altogether */
    DoctypeAuto,   /**< Keep DOCTYPE in input. Set version to content */
    DoctypeStrict, /**< Convert document to HTML 4 strict content model */
    DoctypeLoose,  /**< Convert document to HTML 4 transitional content model */
    DoctypeUser    /**< Set DOCTYPE FPI explicitly */
} DoctypeModes;
EfTidyMainNode
typedef [public] enum { TIDY_ROOT, //Return Tidy ROOT Node
Now let's take each interface one by one.
First, here is every method and property of the main interface and what it does.
Property/Method Name | Parameters | Get/Put | Description |
TidyFiletoMem (method) | [in] BSTR sourceFile, [out, retval] BSTR* result | n/a | Takes a file as input and writes the output to memory |
TidyFileToFile (method) | [in] BSTR sourceFile, [in] BSTR destFile | n/a | Takes a file as input and writes the output to a file |
TidyMemToMem (method) | [in] BSTR sourceStr, [out, retval] BSTR* result | n/a | Takes a buffer as input and writes the output to memory |
TidyMemtoFile (method) | [in] BSTR buffer, [in] BSTR destFile | n/a | Takes a buffer as input and writes the output to a file |
TotalWarnings (property) | [out, retval] long *pVal | Get | Returns the total number of warnings after any of the four operations above |
TotalErrors (property) | [out, retval] long *pVal | Get | Returns the total number of errors after any of the four operations above |
ErrorWarning (property) | [out, retval] BSTR *pVal | Get | Returns a buffer containing the human-readable errors and warnings |
Option (property) | [out, retval] ItidyOption **pVal | Get | Returns the option object used to set options for the Tidy library |
EfTidyNode (method) | [in] EfTidyMainNode Type, [out, retval] IEfTidyNode **ppNewEfTidyNode | n/a | An HTML page has a tree structure. This method returns a Tidy node that helps you read every tag and its attributes. This is the latest addition to the Tidy library |
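As a rough illustration of how the members in this table fit together, here is a minimal Visual Basic sketch that tidies a string in memory and then prints the counters and the report. It assumes the project references the EFTIDYLib type library used in the listings later in this article; the function name TidyString is hypothetical and is not part of the component.

```vb
' Sketch only: TidyString is a hypothetical helper, not part of EfTidyCom.
' Assumes a project reference to the EFTIDYLib type library.
Private Function TidyString(ByVal html As String) As String
    Dim tidy As EFTIDYLib.tidyCom
    Set tidy = New EFTIDYLib.tidyCom

    ' Buffer in, cleaned buffer out
    TidyString = tidy.TidyMemToMem(html)

    ' The counters and the report describe the operation just performed
    Debug.Print "Errors: " & tidy.TotalErrors
    Debug.Print "Warnings: " & tidy.TotalWarnings
    Debug.Print tidy.ErrorWarning
End Function
```

It could be called as, for example, txtOutput = TidyString(txtInput), where both text boxes are hypothetical controls on the form.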
Here is the list of properties and methods of the ItidyOption interface:
Property/Method Name | Parameter | Get/Put | Description |
LoadConfigFile (method) | BSTR | n/a | Load option settings from a configuration file |
ResetToDefaultValue | Void | n/a | Reset options to default settings |
Doctype | BSTR | BOTH | Doctype declaration generated by Tidy |
TidyMark | VARIANT_BOOL | BOTH | For meta element indicating tidied doc |
HideEndTag | VARIANT_BOOL | BOTH | Suppress optional end tags |
EncloseText | VARIANT_BOOL | BOTH | If yes text at body is wrapped in <p> |
EncloseBlockText | VARIANT_BOOL | BOTH | If yes text in blocks is wrapped in <p> |
LogicalEmphasis | VARIANT_BOOL | BOTH | Replace i by em and b by strong |
DefaultAltText | BSTR | BOTH | Default text for alt attribute |
Clean | VARIANT_BOOL | BOTH | Replace presentational clutter by style rules |
DropFontTags | VARIANT_BOOL | BOTH | Discard presentation tags |
DropEmptyParas | VARIANT_BOOL | BOTH | Discard empty p elements |
Word2000 | VARIANT_BOOL | BOTH | Draconian cleaning for Word2000 |
FixBadComment | VARIANT_BOOL | BOTH | Fix comments with adjacent hyphens |
FixBackslash | VARIANT_BOOL | BOTH | Fix URLs by replacing \ with / |
NewEmptyTags | BSTR | BOTH | Declared empty tags |
NewInlineTags | BSTR | BOTH | Declared inline tags |
NewBlockLevelTags | BSTR | BOTH | Declared block tags |
NewPreTags | BSTR | BOTH | Declared pre tags |
OutputType | OutputType *pVal | BOTH | Sets the output type: XML, XHTML, or plain HTML |
InputAsXML | VARIANT_BOOL | BOTH | Treat input as XML |
ADDXmlDecl | VARIANT_BOOL | BOTH | Add <?xml ?> for XML docs |
AddXmlSpace | VARIANT_BOOL | BOTH | If set to yes adds xml:space attr as needed |
Bare | VARIANT_BOOL | BOTH | Make bare HTML |
AssumeXmlProcins | VARIANT_BOOL | BOTH | If set to yes PIs must end with ?> |
CharEncoding | CharEncodingType | BOTH | Set/GET In/out character encoding |
InCharEncoding | CharEncodingType | BOTH | Input character encoding (if different) |
OutCharEncoding | CharEncodingType | BOTH | Output character encoding (if different) |
NumericsEntities | VARIANT_BOOL | BOTH | Use numeric entities for symbols |
QuoteMarks | VARIANT_BOOL | BOTH | Output " marks as &quot; |
QuoteNBSP | VARIANT_BOOL | BOTH | Output non-breaking space as entity |
QuoteAmpersand | VARIANT_BOOL | BOTH | Output naked ampersand as &amp; |
OutputTagInUpperCase | VARIANT_BOOL | BOTH | Output tags in upper not lower case |
OutputAttrInUpperCase | VARIANT_BOOL | BOTH | Output attributes in upper not lower case |
WrapScriptlets | VARIANT_BOOL | BOTH | Wrap within JavaScript string literals |
WrapAttVals | VARIANT_BOOL | BOTH | Wrap within attribute values |
WrapSection | VARIANT_BOOL | BOTH | Wrap within section tags |
WrapAsp | VARIANT_BOOL | BOTH | Wrap within ASP pseudo elements |
WrapJste | VARIANT_BOOL | BOTH | Wrap within JSTE pseudo elements |
WrapPhp | VARIANT_BOOL | BOTH | Wrap within PHP pseudo elements |
Indent | IndentScheme | BOTH | Indent content of appropriate tags |
IndentSpace | long | BOTH | Indentation n spaces |
WrapLen | long | BOTH | Set wrap margin for output |
TabSize | long | BOTH | Expand tabs to n spaces |
IndentAttributes | long | BOTH | Newline+indent before each attribute |
BreakBeforeBR | VARIANT_BOOL | BOTH | Output newline before <br> or not |
LiteralAttribs | VARIANT_BOOL | BOTH | If true attributes may use newlines |
MarkUp | VARIANT_BOOL | BOTH | |
ShowWarnings | VARIANT_BOOL | BOTH | On/Off |
Quiet | VARIANT_BOOL | BOTH | No 'Parsing X', guessed DTD or summary |
KeepTime | VARIANT_BOOL | BOTH | If yes last modified time is preserved |
ErrorFile | BSTR | BOTH | File name to write errors to |
GnuEmacs | VARIANT_BOOL | BOTH | If true format error output for GNU Emacs |
FixUrl | VARIANT_BOOL | BOTH | Applies URI encoding if necessary |
BodyOnly | VARIANT_BOOL | BOTH | Output BODY content only |
HideComments | VARIANT_BOOL | BOTH | Hides all (real) comments in output |
DoctypeMode | DoctypeModes | BOTH | Set the doctype mode for output |
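To make the options table more concrete, here is a small Visual Basic sketch that sets a few of these properties before tidying a file. The property names and the enum values AUTOINDENT and UTF8 come from the tables and typedefs above; the particular values chosen (for example a wrap margin of 80), the sub name, and the output file name are only illustrative assumptions.

```vb
' Sketch only: ConfigureAndTidy and "test_indented.htm" are hypothetical.
' Assumes a project reference to the EFTIDYLib type library.
Private Sub ConfigureAndTidy()
    Dim tidy As EFTIDYLib.tidyCom
    Set tidy = New EFTIDYLib.tidyCom

    ' Layout of the output
    tidy.Option.Indent = AUTOINDENT      ' IndentScheme value
    tidy.Option.WrapLen = 80             ' wrap margin (illustrative value)

    ' Encoding and clean-up behaviour
    tidy.Option.CharEncoding = UTF8      ' CharEncodingType value
    tidy.Option.DropEmptyParas = True    ' discard empty <p> elements
    tidy.Option.TidyMark = False         ' no generator <meta> element

    ' Options are set before the tidy call that should use them
    tidy.TidyFileToFile "test.htm", "test_indented.htm"
End Sub
```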
Here is the list of properties and methods of the IEfTidyNode interface:
Property/Method Name | Parameter | Get/Put | Description |
Name | BSTR *pVal | Get | Returns the name of the current tag |
GetFirstChildNode | IEfTidyNode | n/a | Returns the first child node |
GetNextChildNode | IEfTidyNode | n/a | Use this to enumerate the remaining child tags |
GetFirstAttribute | IEfTidyAttr | n/a | Returns the first attribute of the current tag |
GetNextAttribute | IEfTidyAttr | n/a | Returns the remaining attributes one by one |
Here is the list of properties of the IEfTidyAttr interface:
Property/Method Name | Parameter | Get/Put | Description |
Name | BSTR *pVal | Get | Name of attribute |
Value | BSTR *pVal | Get | Value of attribute |
Almost every COM component is developed for use with Visual Basic and other COM-friendly languages, so all the code shown here is in Visual Basic. I am going to use some test cases to explain how the component works.
I used Test.htm (included with the project) to test EfTidy's responses.
Here is what Test.htm contains:
<html>
<head>
<title>tidy Library</title>
</head>
<body>
<blockquote>
<p> </p> --(1)
<p><font size="5" color=
"#FF00FF">TidyLibrary</font></p></blockquote><P><p><font size="5" color="#FF00FF"></font></p>
<table border="1" cellpadding="0" cellspacing="0"
style="border-collapse: collapse" bordercolor="#111111" width="100%"
id="AutoNumber1">
<tr>
<td width="50%" style="border-left-style:
solid; border-left-width: 1; border-right-style: none; border-right-width:
medium; border-top-style: solid; border-top-width: 1; border-bottom-style:
none; border-bottom-width: medium"> --(2)
</td>
<td width="50%" style="border-left-style: none; border-left-width: medium;
border-right-style:solid; border-right-width: 1; border-top-style: solid;
border-top-width: 1;border-bottom-style: none; border-bottom-width: medium">
</td>
</tr>
</table>
<b>Tidy --- (3)
</h1> <tidy> ---(4)
</body>
</html>
I deliberately added the following mistakes to Test.htm:
a dummy <tidy> tag at (4)
a closing </h1> tag with no matching <h1> at (4)
an empty paragraph <p> tag at (1)
an unclosed <b> tag at (3)
First, create an instance of the component. Here is the listing:
' Declared at form level so the click handlers below can use it
Dim TidyCOMObj As EFTIDYLib.tidyCom
Private Sub Form_Load()
    Set TidyCOMObj = New EFTIDYLib.tidyCom
End Sub
Now clean the test.htm file using this object. The code listing for that is:
Private Sub cmdMemtoMem_Click()
    ' Clean test.htm and write the result to test1.htm
    TidyCOMObj.TidyFileToFile "test.htm", "test1.htm"
    ' Check the number of errors in the HTML
    txtError = TidyCOMObj.TotalErrors
    ' Check the number of warnings in the HTML
    txtWarning = TidyCOMObj.TotalWarnings
End Sub
And here is the result produced by Tidy. The listing shows what test1.htm (created by EfTidyCom) contains:
<html>
<head>
<meta name="generator"
content= "HTML Tidy for Windows (vers 1st September 2004), see www.w3.org">
<title>tidy Library</title>
</head>
<body>
<blockquote>
<p> </p>
<p><font size="5" color="#FF00FF">Tidy Library</font>
</p>
</blockquote>
<p><font size="5" color= "#FF00FF"> </font></p>
<table border="1" cellpadding="0" cellspacing="0" style= "border-collapse:
collapse" bordercolor="#111111" width="100%" id= "AutoNumber1">
<tr>
<td width="50%" style= "border-left-style: solid; border-left-width: 1;
border-right-style: none; border-right-width: medium;
border-top-style: solid; border-top-width: 1; border-bottom-style: none;
border-bottom-width: medium">
</td>
<td width="50%" style= "border-left-style: none;border-left-width: medium;
border-right-style: solid; border-right-width: 1;border-top-style: solid;
border-top-width: 1; border-bottom-style: none;border-bottom-width: medium">
</td>
</tr>
</table>
<b>Tidy</b> --(1)
</body>
</html>
As you can see in the cleaned HTML page above, the dummy <tidy> tag and the stray </h1> have been removed near (1), and </b> has been added after Tidy at (1).
Here is the summary of errors and warnings produced by EfTidyCom, showing the details of each action it performed:
line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 22 column 10 - Warning: discarding unexpected </h1>
line 23 column 1 - Error: <tidy> is not recognized!
line 23 column 1 - Warning: discarding unexpected <tidy>
line 15 column 1 - Warning: <table> proprietary attribute "bordercolor"
line 15 column 1 - Warning: <table> lacks "summary" attribute
Info: Document content looks like HTML Proprietary
5 warnings, 1 error were found!
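A report like the one above is returned as a single string by the ErrorWarning property, so it can be filtered with ordinary string functions. The following Visual Basic sketch shows only the lines marked "Error:" in a list box. It assumes the report lines are separated by CR/LF or LF, which this article does not state, and lstErrors is a hypothetical ListBox on the form.

```vb
' Sketch only: ShowErrorsOnly and lstErrors are hypothetical names.
' Assumption: report lines are separated by vbCrLf or vbLf.
Private Sub ShowErrorsOnly(ByVal report As String)
    Dim reportLines() As String
    Dim i As Long

    ' Normalize line endings, then split the report into lines
    reportLines = Split(Replace(report, vbCrLf, vbLf), vbLf)

    lstErrors.Clear
    For i = LBound(reportLines) To UBound(reportLines)
        ' Keep only lines such as "line 23 column 1 - Error: ..."
        If InStr(1, reportLines(i), "Error:", vbTextCompare) > 0 Then
            lstErrors.AddItem Trim$(reportLines(i))
        End If
    Next i
End Sub
```

It would be called after any of the four tidy methods, for example: ShowErrorsOnly TidyCOMObj.ErrorWarning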
Now let's apply some options to Test.htm to get customized output. The code listing below sets the options I am using:
Private Sub cmdMemtoMem_Click()
    TidyCOMObj.Option.Clean = True
    TidyCOMObj.Option.NewInlineTags = "tidy"
    TidyCOMObj.Option.OutputType = XhtmlOut
    ' Our Doctype string is shown in the cleaned HTML
    ' only if the doctype mode is User
    TidyCOMObj.Option.DoctypeMode = DoctypeUser
    TidyCOMObj.Option.Doctype = "Ef Tidy library"
    TidyCOMObj.TidyFileToFile "test.htm", "test1.htm"
    txtError = TidyCOMObj.TotalErrors
    txtWarning = TidyCOMObj.TotalWarnings
End Sub
And here is the result produced by Tidy. The listing shows what test1.htm (created by EfTidyCom) contains after applying our options:
<!DOCTYPE html PUBLIC "Ef Tidy library" ""> --(1)
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator" content="HTML Tidy for Windows (vers 1st September 2004), see www.w3.org" />
<title>tidy Library</title>
<style type="text/css"> --(2)
/*<![CDATA[*/
table.c4 {border-collapse: collapse}
td.c3 {border-left-style: none; border-left-width: medium; border-right-style: solid; border-right-width: 1; border-top-style: solid; border-top-width: 1; border-bottom-style: none; border-bottom-width: medium}
td.c2 {border-left-style: solid; border-left-width: 1; border-right-style: none; border-right-width: medium; border-top-style: solid; border-top-width: 1; border-bottom-style: none; border-bottom-width: medium}
h2.c1 {color: #FF00FF}
/*]]>*/
</style>
</head>
<body>
<blockquote>
<p> </p>
<h2 class="c1">Tidy Library</h2>
</blockquote>
<h2 class="c1"> </h2>
<table border="1" cellpadding="0" cellspacing="0" class="c4" bordercolor="#111111" width="100%" id="AutoNumber1">
<tr>
<td width="50%" class="c2"> </td> ----(3)
<td width="50%" class="c3"> </td>
</tr>
</table>
<b>Tidy <tidy></tidy></b> ----(4)
</body>
</html>
Now let's see what Tidy cleaned for us.
Here is the summary of all the errors and warnings:
line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 22 column 10 - Warning: discarding unexpected </h1>
line 23 column 1 - Warning: <tidy> is not approved by W3C
line 23 column 1 - Warning: missing </tidy> before </body>
line 22 column 2 - Warning: missing </b> before </body>
line 15 column 1 - Warning: <table> proprietary attribute "bordercolor"
line 15 column 1 - Warning: <table> lacks "summary" attribute
Info: Document content looks like HTML Proprietary
7 warnings, 0 errors were found!
The IEfTidyNode and IEfTidyAttr interfaces let you gather node-by-node and attribute-by-attribute information from the tree structure of HTML cleaned by the Tidy library. Here is the code listing for finding the <head> tag and enumerating all of its attributes.
Note: always use these two interfaces on HTML that has already been cleaned by Tidy.
Private Sub cmdGetNode_Click()
    ' Assumes TidyDoc contains cleaned HTML, that is,
    ' one of the four ITidyCom tidy methods has been called.
    ' Here TidyDoc is an ITidyCom object.
    ' Get the <head> node
    Set tidyNode = TidyDoc.EfTidyNode(TIDY_HEAD)
    If Not tidyNode Is Nothing Then
        ' Display the node name
        txtNodeName = tidyNode.Name
        ' Enumerate all attributes of <head>, if any
        Set atr = tidyNode.GetFirstAttribute
        Do Until atr Is Nothing
            lstAttr.AddItem atr.Name & " " & atr.Value
            Set atr = tidyNode.GetNextAttribute
        Loop
    End If
End Sub
Next, here is how to enumerate the children of the head node and get the attributes of each. This listing finds the first child:
Private Sub cmdGetFirstChildNode_Click()
    Dim localnode As EfTidyNode
    Set localnode = tidyNode.GetFirstChildNode
    ' Only read the node after checking that it exists
    If Not localnode Is Nothing Then
        txtNodeName = localnode.Name
        Set atr = localnode.GetFirstAttribute
        Do Until atr Is Nothing
            lstAttr.AddItem atr.Name & " " & atr.Value
            Set atr = localnode.GetNextAttribute
        Loop
    End If
End Sub
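The listing above reads only the first child. As a sketch of how the remaining children might be visited, the following Visual Basic loop lists the names of all direct children of <head>. It assumes that GetNextChildNode is called on the parent node and returns Nothing after the last child, by analogy with the GetNextAttribute loop shown earlier; the article does not confirm this. TidyDoc is the already-tidied ITidyCom object from the earlier listing, and lstChildren is a hypothetical ListBox.

```vb
' Sketch only: ListHeadChildren and lstChildren are hypothetical names.
' Assumption: GetNextChildNode is called on the parent and returns
' Nothing once there are no more children.
Private Sub ListHeadChildren()
    Dim headNode As EfTidyNode
    Dim child As EfTidyNode

    Set headNode = TidyDoc.EfTidyNode(TIDY_HEAD)
    If headNode Is Nothing Then Exit Sub

    lstChildren.Clear
    Set child = headNode.GetFirstChildNode
    Do Until child Is Nothing
        ' For the test page this should include tags such as title
        lstChildren.AddItem child.Name
        Set child = headNode.GetNextChildNode
    Loop
End Sub
```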
Here is a snapshot of the form taken after clicking the button that runs the code above.
Here I have given only a small overview of the Tidy library and EfTidyCom. For more information about the Tidy library, visit the Tidy home page: http://tidy.sourceforge.com
I know there is much scope for improvement in this component, especially in the IEfTidyNode and IEfTidyAttr interfaces. I promise these improvements will be in the next version/update of the library.
Source File Contain -
The author works as a software developer at
EfExtra ESolution Pvt Ltd., Noida, India. You can contact
the author by sending mail to alok@efextra.com
or by visiting the author's personal web page at http://www.thisisalok.tk