Class HxmlTokeniser

java.lang.Object
   |
   +----HxmlTokeniser

public class HxmlTokeniser
extends Object
A StringTokenizer-like XML/HTML parser.

I made use of some Aelfred ideas, especially the nice way the parsing is carried out, along with some method naming and small pieces of code and logic. Otherwise this parser is different: it is based more on my original HtmlStreamTokenizer, which is implemented along the lines of java.util.StringTokenizer.


Note: Tag, Entity, ProcessingInstruction and Function names must be of the following form:
First char:	('_'|':'|[a-zA-Z])
The rest:	('_'|'.'|[a-zA-Z0-9])
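The naming rule above can be checked with a regular expression. The following is only a sketch: the class name NameCheck and the method isValidName are hypothetical and not part of HxmlTokeniser, and it assumes that "the rest" may be empty, so a one-character name is accepted.

```java
import java.util.regex.Pattern;

public class NameCheck {
    // First char: '_', ':' or a letter; remaining chars: '_', '.', a letter or a digit.
    // Assumption: "the rest" may be empty, so a single-character name is valid.
    private static final Pattern NAME = Pattern.compile("[_:a-zA-Z][_.a-zA-Z0-9]*");

    // Hypothetical helper; returns true when the name follows the documented form.
    public static boolean isValidName(String name) {
        return name != null && NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidName("body"));       // true
        System.out.println(isValidName(":ns.item_1")); // true
        System.out.println(isValidName("1st"));        // false, digit as first char
        System.out.println(isValidName("my-tag"));     // false, '-' is not allowed
    }
}
```

Note that under this rule ':' is accepted only as the first character, so a name such as "xsl:if" would be rejected.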

Although this parser is written to correctly recognise and return COMMENT, CDATA and PI tokens, by default it parses them and then puts them into the dataBuffer as normal text. This is controlled by three private boolean variables (ignoreComments, ignoreCData, ignorePI).

Recognises
&entity;
<Start Tag>
<Empty Tag/>
</End Tag>
$function(...)

If you define tags for the parser to look for, it will not find any &entity; or $function(...) tokens inside those tags. The same goes if you specify that the parser should look for Processing Instructions or CDATA sections: nothing will be returned from inside them.

Copyright (C)2001 Jason Pell.

This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
Email: jasonpell@hotmail.com Url: http://www.geocities.com/SiliconValley/Haven/9778


Variable Index

 o CDATA
XML CDATA (Character Data)
 o COMMENT
Comment
 o EMPTY_TAG
XML Empty tag.
 o END_TAG
HTML/XML End tag.
 o ENTITY
HTML/XML &entity;
 o FUNCTION
A special new token, specific to my purpose: $function(...)
 o PI
XML Processing Instruction <?application ... ?>
 o START_TAG
HTML/XML start tag.

Constructor Index

 o HxmlTokeniser(String[], String[], String[])
 o HxmlTokeniser(String[], String[], String[], boolean, boolean, boolean)
Constructor which specifies whether we are ignoring comments, CDATA or Processing Instruction sections.

Method Index

 o getArguments()
If tokenType==FUNCTION, returns an array of arguments for the token encountered via nextToken().
 o getAttribute(String)
Returns the named attribute if getTokenType()==START_TAG or EMPTY_TAG, otherwise returns null.
 o getAttributes()
Will return attributes if getTokenType()==START_TAG or EMPTY_TAG, otherwise return null.
 o getLineNumber()
Returns the current line number.
 o getText()
Returns null if no text is available.
 o getTokenContent()
Get the text of the last token encountered.
 o getTokenName()
Return name of last token found with nextToken.
 o getTokenType()
Return type of last token found with nextToken.
 o getTypeAsString()
Debug method.
 o ignoreCData(boolean)
Set ignoreCData indicator
 o ignoreComments(boolean)
Set ignoreComments indicator
 o ignorePI(boolean)
Set ignorePI indicator
 o isIgnoreCData()
Ignore CDATA
 o isIgnoreComments()
Ignore comments
 o isIgnorePI()
Ignore Processing Instructions
 o nextToken()
Finds the next token. Throws an IOException if a parse exception occurs.
 o parse(Reader)
 o parse(String)
 o reset(String[], String[], String[])
Will reset the parser with the new definitions of tags and functions.

Variables

 o START_TAG
 public static final int START_TAG
HTML/XML start tag.

 o END_TAG
 public static final int END_TAG
HTML/XML End tag.

 o EMPTY_TAG
 public static final int EMPTY_TAG
XML Empty tag.

 o ENTITY
 public static final int ENTITY
HTML/XML &entity;

 o FUNCTION
 public static final int FUNCTION
A special new token, specific to my purpose: $function(...)

 o CDATA
 public static final int CDATA
XML CDATA (Character Data)

 o PI
 public static final int PI
XML Processing Instruction <?application ... ?>

 o COMMENT
 public static final int COMMENT
Comment

Constructors

 o HxmlTokeniser
 public HxmlTokeniser(String tags[],
                      String entities[],
                      String functions[],
                      boolean ignoreComments,
                      boolean ignoreCData,
                      boolean ignorePI)
Constructor which specifies whether we are ignoring comments, CDATA or Processing Instruction sections.

 o HxmlTokeniser
 public HxmlTokeniser(String tags[],
                      String entities[],
                      String functions[])
Parameters:
tags - Specify any tag names that are recognised.
entities - Specify any entity names that are recognised.
functions - Specify any function names that are to be recognised.
Tags, entities and functions are all case insensitive.
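How the parser implements case-insensitive matching of the tags, entities and functions arrays is not documented here. One common way to get that behaviour is to normalise every name to lower case before storing and before looking it up. The sketch below illustrates the idea only; the class NameSet and its methods are hypothetical and not part of HxmlTokeniser.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class NameSet {
    // Names are stored in lower case so that lookups ignore case.
    private final Set<String> names = new HashSet<>();

    // Hypothetical constructor taking the same kind of String[] the parser accepts.
    public NameSet(String[] recognised) {
        if (recognised != null) {
            for (String n : recognised) {
                names.add(n.toLowerCase(Locale.ROOT));
            }
        }
    }

    // Returns true when the candidate matches a stored name, ignoring case.
    public boolean contains(String candidate) {
        return candidate != null && names.contains(candidate.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        NameSet tags = new NameSet(new String[] {"Table", "TR", "td"});
        System.out.println(tags.contains("TABLE")); // true
        System.out.println(tags.contains("tr"));    // true
        System.out.println(tags.contains("th"));    // false
    }
}
```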

Methods

 o reset
 public void reset(String tags[],
                   String entities[],
                   String functions[])
Will reset the parser with the new definitions of tags and functions. The current Reader will be set to null, as will all other state variables. This is the same as if you had just created the parser.

Parameters:
tags - Specify any tag names that are recognised.
entities - Specify any entity names that are recognised.
functions - Specify any function names that are to be recognised.
You will need to call parse(...) again before parsing can be carried out again.
 o parse
 public void parse(Reader reader)
 o parse
 public void parse(String s)
 o getTokenType
 public int getTokenType()
Return type of last token found with nextToken.

 o getTypeAsString
 public String getTypeAsString()
Debug method.

 o getTokenName
 public String getTokenName()
Return name of last token found with nextToken. If tokenType == START_TAG, END_TAG or EMPTY_TAG, the tokenName will be the tag name. If tokenType == ENTITY, the tokenName will be the actual entity. If tokenType == FUNCTION, the tokenName will be the function name (minus arguments). If tokenType == PI, the tokenName is the Application. Otherwise it will be null.

 o getAttribute
 public String getAttribute(String name)
Returns the named attribute if getTokenType()==START_TAG or EMPTY_TAG, otherwise returns null. Even for a START_TAG or EMPTY_TAG the attribute may not be found, in which case this method also returns null.

 o getAttributes
 public Enumeration getAttributes()
Will return attributes if getTokenType()==START_TAG or EMPTY_TAG, otherwise return null. The Enumeration may be empty.

 o getArguments
 public String[] getArguments()
If tokenType==FUNCTION, returns an array of arguments for the token encountered via nextToken().

 o nextToken
 public boolean nextToken() throws IOException
Finds the next token. Throws an IOException if a parse exception occurs.

Throws: IOException
Bubbles up from unread(...), read(...) or readInt(...).
 o getTokenContent
 public String getTokenContent()
Get the text of the last token encountered. May not return the complete contents of START_TAG, EMPTY_TAG, END_TAG, ENTITY or FUNCTION. Really only designed for use by COMMENT, CDATA and PI.

 o getText
 public String getText()
Returns null if no text is available.

 o getLineNumber
 public int getLineNumber()
Returns the current line number. The line number obtained from the reader is increased by 1 (one) before it is returned, to take into account the first line, which is not counted until its end.
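The adjustment described above matches how java.io.LineNumberReader counts: its line number starts at 0 and is incremented only when a line terminator is read. The sketch below shows the effect of adding 1, assuming the parser's underlying reader counts lines the same way (the documentation does not say which reader is used). LineNumberDemo and lineAfter are hypothetical names used only for illustration.

```java
import java.io.IOException;
import java.io.LineNumberReader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class LineNumberDemo {
    // Reads up to `count` characters, then returns the reader's raw line number plus one.
    public static int lineAfter(String input, int count) {
        try (LineNumberReader reader = new LineNumberReader(new StringReader(input))) {
            for (int i = 0; i < count; i++) {
                if (reader.read() == -1) {
                    break; // end of input
                }
            }
            // The raw count is 0 while still on the first line, hence the + 1.
            return reader.getLineNumber() + 1;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(lineAfter("ab\ncd", 1)); // 1, still on the first line
        System.out.println(lineAfter("ab\ncd", 3)); // 2, the line terminator has been read
        System.out.println(lineAfter("ab\ncd", 5)); // 2
    }
}
```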

 o isIgnoreCData
 public boolean isIgnoreCData()
Ignore CDATA

 o isIgnorePI
 public boolean isIgnorePI()
Ignore Processing Instructions

 o isIgnoreComments
 public boolean isIgnoreComments()
Ignore comments

 o ignoreComments
 public void ignoreComments(boolean b)
Set ignoreComments indicator

 o ignoreCData
 public void ignoreCData(boolean b)
Set ignoreCData indicator

 o ignorePI
 public void ignorePI(boolean b)
Set ignorePI indicator
