Class HxmlTokeniser

java.lang.Object
   |
   +----HxmlTokeniser

public class HxmlTokeniser
extends Object

A StringTokenizer-like XML/HTML parser.

I made use of some Aelfred ideas, especially the way its parsing is carried out. Apart from some method naming and small pieces of code and logic, however, this parser is different: it is based on my original HtmlStreamTokenizer, which is implemented along the lines of java.util.StringTokenizer.

Note: Tag, Entity, ProcessingInstruction and Function names must be of the following form:
First char:	('_'|':'|[a-zA-Z])
The rest:	('_'|'.'|[a-zA-Z0-9])
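
The naming rule above can be checked with a regular expression. This is only a sketch: the class HxmlNameSketch and its method are hypothetical and not part of HxmlTokeniser, and it assumes that "the rest" means zero or more further characters.

```java
import java.util.regex.Pattern;

class HxmlNameSketch {
    // First char: '_' | ':' | letter; remaining chars: '_' | '.' | letter | digit.
    // ASSUMPTION: "the rest" means zero or more further characters.
    private static final Pattern NAME =
        Pattern.compile("[_:a-zA-Z][_.a-zA-Z0-9]*");

    /** Returns true if the name satisfies the documented naming rule. */
    static boolean isValidName(String name) {
        return name != null && NAME.matcher(name).matches();
    }
}
```

Under this reading a colon is accepted only as the first character, and a hyphen is never accepted.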

Although this parser is written to correctly recognise and return COMMENT, CDATA and PI tokens, the default is to parse them and then put them into the dataBuffer as normal text. This is controlled by three private boolean variables (ignoreComments, ignoreCData and ignorePI).

Recognises: &entity; <Start Tag> <Empty Tag/> </End Tag> $function(...)

If you define tags for the parser to look for, it will not find any &entity; or $function(...) tokens inside those tags. The same applies if you specify that the parser should look for Processing Instructions or CDATA sections: nothing will be returned from inside them.

Copyright (C)2001 Jason Pell.

This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.


This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.


You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA  02111-1307, USA.


Email: 	jasonpell@hotmail.com
Url:	http://www.geocities.com/SiliconValley/Haven/9778

CDATA: XML CDATA (Character Data)
COMMENT: Comment
EMPTY_TAG: XML Empty tag.
END_TAG: HTML/XML End tag.
ENTITY: HTML/XML &entity;
FUNCTION: A special token specific to my purpose: $function(...)
PI: XML Processing Instruction <?application ... ?>
START_TAG: HTML/XML start tag.

HxmlTokeniser(String[], String[], String[])
HxmlTokeniser(String[], String[], String[], boolean, boolean, boolean): Constructor which specifies whether we are ignoring comments, CDATA or Processing Instruction sections.

getArguments(): If tokenType==FUNCTION, will return an array of arguments for the token encountered via nextToken()
getAttribute(String): Will return the named attribute if getTokenType()==START_TAG or EMPTY_TAG, otherwise return null.
getAttributes(): Will return attributes if getTokenType()==START_TAG or EMPTY_TAG, otherwise return null.
getLineNumber(): Returns the current line number.
getText(): Will return null if no text is available.
getTokenContent(): Get the text of the last token encountered.
getTokenName(): Return name of last token found with nextToken.
getTokenType(): Return type of last token found with nextToken.
getTypeAsString(): Debug method.
ignoreCData(boolean): Set ignoreCData indicator
ignoreComments(boolean): Set ignoreComments indicator
ignorePI(boolean): Set ignorePI indicator
isIgnoreCData(): Return ignoreCData indicator
isIgnoreComments(): Return ignoreComments indicator
isIgnorePI(): Return ignorePI indicator
nextToken(): Advance to the next token. Throws IOException if a parse exception occurs.
parse(Reader)
parse(String)
reset(String[], String[], String[]): Will reset the parser with the new definitions of tags and functions.

START_TAG

 public static final int START_TAG

HTML/XML start tag.

END_TAG

 public static final int END_TAG

HTML/XML End tag.

EMPTY_TAG

 public static final int EMPTY_TAG

XML Empty tag.

ENTITY

 public static final int ENTITY

HTML/XML &entity;

FUNCTION

 public static final int FUNCTION

A special token specific to my purpose: $function(...)

CDATA

 public static final int CDATA

XML CDATA (Character Data)

PI

 public static final int PI

XML Processing Instruction <?application ... ?>

COMMENT

 public static final int COMMENT

Comment

HxmlTokeniser

 public HxmlTokeniser(String tags[],
                      String entities[],
                      String functions[],
                      boolean ignoreComments,
                      boolean ignoreCData,
                      boolean ignorePI)

Constructor which specifies whether we are ignoring comments, CDATA or Processing Instruction sections.

HxmlTokeniser

 public HxmlTokeniser(String tags[],
                      String entities[],
                      String functions[])

Parameters:
    tags - Specify any tag names that are recognised.
    entities - Specify any entity names that are recognised.
    functions - Specify any function names that are to be recognised.

tags, entities and functions are all case insensitive.
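
One way to get the case-insensitive matching described for tags, entities and functions is to normalise names to a single case before storing and before looking them up. The sketch below is illustrative only: NameSetSketch is a hypothetical helper, and nothing here is taken from the actual HxmlTokeniser source.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class NameSetSketch {
    private final Set<String> names = new HashSet<>();

    /** Stores every recognised name in lower case; a null array means "none". */
    NameSetSketch(String[] recognised) {
        if (recognised != null) {
            for (String name : recognised) {
                names.add(name.toLowerCase(Locale.ROOT));
            }
        }
    }

    /** Looks the candidate up after applying the same lower-casing. */
    boolean contains(String candidate) {
        return candidate != null
            && names.contains(candidate.toLowerCase(Locale.ROOT));
    }
}
```

Locale.ROOT is used so that the lower-casing does not depend on the default locale of the machine running the code.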

reset

 public void reset(String tags[],
                   String entities[],
                   String functions[])

Will reset the parser with the new definitions of tags and functions. The current Reader will be set to null, as will all other state variables. This is the same as if you had just created the parser.

Parameters:
    tags - Specify any tag names that are recognised.
    entities - Specify any entity names that are recognised.
    functions - Specify any function names that are to be recognised.

You will need to call parse(...) again to be able to carry out parsing again.

parse

 public void parse(Reader reader)

parse

 public void parse(String s)

getTokenType

 public int getTokenType()

Return type of last token found with nextToken.

getTypeAsString

 public String getTypeAsString()

Debug method.

getTokenName

 public String getTokenName()

Return name of last token found with nextToken. If tokenType == START_TAG, END_TAG or EMPTY_TAG, the tokenName will be the tag name. If tokenType == ENTITY, the tokenName will be the actual entity. If tokenType == FUNCTION, the tokenName will be the function name (minus arguments). If tokenType == PI, the tokenName is the Application. Otherwise it will be null.

getAttribute

 public String getAttribute(String name)

Will return the named attribute if getTokenType()==START_TAG or EMPTY_TAG, otherwise return null. Even for a START_TAG or EMPTY_TAG the attribute may not be found, in which case this method also returns null.

getAttributes

 public Enumeration getAttributes()

Will return attributes if getTokenType()==START_TAG or EMPTY_TAG, otherwise return null. The Enumeration may be empty.
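
Because the result may be null or an empty Enumeration, callers may find it convenient to normalise it into a List before iterating. The helper below is a hypothetical sketch, not part of this class. It makes no assumption about what the enumerated elements are, so it treats them as plain Objects.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

class AttributeListSketch {
    /** Converts a possibly-null Enumeration into a never-null List. */
    static List<Object> toList(Enumeration<?> attributes) {
        if (attributes == null) {
            return new ArrayList<>();   // not a START_TAG or EMPTY_TAG
        }
        return new ArrayList<Object>(Collections.list(attributes));
    }
}
```

A caller could then write a plain for-each loop over toList(tokeniser.getAttributes()) without a separate null check.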

getArguments

 public String[] getArguments()

If tokenType==FUNCTION, will return an array of arguments for the token encountered via nextToken().
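
To illustrate how a $function(...) token relates to a name plus an argument array, here is a naive decomposition. It is a sketch only: FunctionTokenSketch is hypothetical, and it assumes comma-separated arguments with no quoting or nested parentheses, which the documentation above does not confirm.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FunctionTokenSketch {
    // '$', then a name following the documented naming rule, then "(...)".
    private static final Pattern FUNCTION =
        Pattern.compile("\\$([_:a-zA-Z][_.a-zA-Z0-9]*)\\(([^)]*)\\)");

    /** Returns the function name without arguments, or null if no match. */
    static String nameOf(String token) {
        Matcher m = FUNCTION.matcher(token);
        return m.matches() ? m.group(1) : null;
    }

    /** Returns the trimmed arguments, or null if the token is not a function. */
    static String[] argumentsOf(String token) {
        Matcher m = FUNCTION.matcher(token);
        if (!m.matches()) {
            return null;
        }
        String body = m.group(2).trim();
        if (body.isEmpty()) {
            return new String[0];
        }
        // ASSUMPTION: arguments are separated by plain commas.
        String[] parts = body.split(",", -1);
        for (int i = 0; i < parts.length; i++) {
            parts[i] = parts[i].trim();
        }
        return parts;
    }
}
```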

nextToken

 public boolean nextToken() throws IOException

Advance to the next token. An IOException is thrown if a parse exception occurs.

Throws: IOException: Bubbles up from unread(...), read(...) or readInt(...).
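
A typical StringTokenizer-style loop over the tokens might look like the sketch below. Since the real class is not reproduced here, the code is written against a hypothetical TokenSource interface that copies three of the documented method signatures. It assumes that nextToken() returns false once the input is exhausted, which the documentation does not state. The scripted() fake exists only to exercise the loop and parses nothing.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class TokenLoopSketch {
    /** Hypothetical stand-in copying three documented signatures. */
    interface TokenSource {
        boolean nextToken() throws IOException;
        int getTokenType();
        String getTokenName();
    }

    /** Collects, in order, the names of all tokens of the wanted type. */
    static List<String> namesOfType(TokenSource source, int wantedType) {
        List<String> names = new ArrayList<>();
        try {
            // ASSUMPTION: nextToken() returns false at end of input.
            while (source.nextToken()) {
                if (source.getTokenType() == wantedType) {
                    names.add(source.getTokenName());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    /** Scripted fake: replays the given types and names one by one. */
    static TokenSource scripted(final int[] types, final String[] names) {
        return new TokenSource() {
            private int index = -1;
            public boolean nextToken() { return ++index < types.length; }
            public int getTokenType() { return types[index]; }
            public String getTokenName() { return names[index]; }
        };
    }
}
```

With the real class, the wanted type would be one of the documented constants such as START_TAG rather than an arbitrary integer.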

getTokenContent

 public String getTokenContent()

Get the text of the last token encountered. May not return the complete contents of START_TAG, EMPTY_TAG, END_TAG, ENTITY or FUNCTION. Really only designed for use by COMMENT, CDATA and PI.

getText

 public String getText()

Will return null if no text is available.

getLineNumber

 public int getLineNumber()

Returns the current line number. The line number obtained from the reader is increased by 1 (one) before it is returned, to take into account the first line, which is not counted until its end.
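
The adjustment can be seen with java.io.LineNumberReader, whose count starts at 0 and increases only when a line terminator is read. Whether HxmlTokeniser actually uses LineNumberReader internally is an assumption. LineNumberSketch is a hypothetical helper written for illustration.

```java
import java.io.IOException;
import java.io.LineNumberReader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class LineNumberSketch {
    /**
     * Reads up to charsToRead characters from input and returns the
     * one-based number of the line the reader is positioned on.
     */
    static int oneBasedLineAfter(String input, int charsToRead) {
        LineNumberReader reader = new LineNumberReader(new StringReader(input));
        try {
            for (int i = 0; i < charsToRead; i++) {
                if (reader.read() == -1) {
                    break;              // input exhausted early
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The reader counts terminators seen so far, starting from 0,
        // so add 1 to name the line currently being read.
        return reader.getLineNumber() + 1;
    }
}
```

For the input "ab\ncd", reading one character leaves the position on line 1, and reading three characters (past the newline) leaves it on line 2.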

isIgnoreCData

 public boolean isIgnoreCData()

Return ignoreCData indicator

isIgnorePI

 public boolean isIgnorePI()

Return ignorePI indicator

isIgnoreComments

 public boolean isIgnoreComments()

Return ignoreComments indicator

ignoreComments

 public void ignoreComments(boolean b)

Set ignoreComments indicator

ignoreCData

 public void ignoreCData(boolean b)

Set ignoreCData indicator

ignorePI

 public void ignorePI(boolean b)

Set ignorePI indicator