Class HxmlTokeniser
java.lang.Object
|
+----HxmlTokeniser
- public class HxmlTokeniser
- extends Object
A StringTokenizer-like XML/HTML parser.
I made use of some Aelfred ideas, especially the neat way its parsing is carried out.
Apart from some method names and small pieces of code and logic, however, this parser
is different: it is based more on my original HtmlStreamTokenizer, which
is implemented along the lines of java.util.StringTokenizer.
Note: Tag, Entity, ProcessingInstruction and Function names must be of the following form:
First char: ('_'|':'|[a-zA-Z])
The rest: ('_'|'.'|[a-zA-Z0-9])
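The naming rule above can be written as a single regular expression. The following is a minimal sketch, not the parser's actual code: the class name NameCheck and the method isValidName are hypothetical and exist only for illustration. It reads the rule literally, so ':' is accepted only as the first character and '-' is never accepted.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of HxmlTokeniser.
class NameCheck {
    // First char: '_', ':' or a letter. The rest: '_', '.', a letter or a digit.
    private static final Pattern NAME =
            Pattern.compile("[_:a-zA-Z][_.a-zA-Z0-9]*");

    // Returns true only if the whole string matches the naming rule.
    static boolean isValidName(String name) {
        return name != null && NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        String[] samples = { "_foo.bar1", ":ns", "1abc", "a-b", "" };
        for (String s : samples) {
            System.out.println("\"" + s + "\" valid: " + isValidName(s));
        }
    }
}
```

Under this reading, a name such as xsl:template would be rejected, because ':' appears after the first character.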
At this point, although this parser is written to correctly recognise and return COMMENT, CDATA and PI tokens,
the default is to parse them and then put them into the dataBuffer as normal text. This is controlled
by three private boolean variables (ignoreComments, ignoreCData, ignorePI).
- Recognises
- &entity;
- <Start Tag>
- <Empty Tag/>
- </End Tag>
- $function(...)
If you define tags for the parser to look for, it will not find any &entity; or $function(...) tokens
inside those tags. The same applies if you specify that the parser should look for Processing Instructions
or CDATA sections: nothing will be returned from inside them.
Copyright (C)2001 Jason Pell.
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
Email: jasonpell@hotmail.com
Url: http://www.geocities.com/SiliconValley/Haven/9778
-
CDATA
- XML CDATA (Character Data)
-
COMMENT
- Comment
-
EMPTY_TAG
- XML Empty tag.
-
END_TAG
- HTML/XML End tag.
-
ENTITY
- HTML/XML &entity;
-
FUNCTION
- A special new token, $function(...), specific to my purpose.
-
PI
- XML Processing Instruction <?application ... ?>
-
START_TAG
- HTML/XML start tag.
-
HxmlTokeniser(String[], String[], String[])
-
-
HxmlTokeniser(String[], String[], String[], boolean, boolean, boolean)
- Constructor which specifies whether we are ignoring comments,
CDATA or Processing Instruction sections.
-
getArguments()
- If tokenType==FUNCTION, will return an array of the arguments of the token
encountered via nextToken()
-
getAttribute(String)
- Will return the named attribute if getTokenType()==START_TAG or EMPTY_TAG,
otherwise return null.
-
getAttributes()
- Will return attributes if getTokenType()==START_TAG or EMPTY_TAG,
otherwise return null.
-
getLineNumber()
- Returns the current line number.
-
getText()
- Will return null, if no text available.
-
getTokenContent()
- Get the text of the last token encountered.
-
getTokenName()
- Return name of last token found with nextToken.
-
getTokenType()
- Return type of last token found with nextToken.
-
getTypeAsString()
- Debug method.
-
ignoreCData(boolean)
- Set ignoreCData indicator
-
ignoreComments(boolean)
- Set ignoreComments indicator
-
ignorePI(boolean)
- Set ignorePI indicator
-
isIgnoreCData()
- Ignore CDATA
-
isIgnoreComments()
- Ignore comments
-
isIgnorePI()
- Ignore Processing Instructions
-
nextToken()
- Reads the next token; fails if a parse exception occurs.
-
parse(Reader)
-
-
parse(String)
-
-
reset(String[], String[], String[])
- Will reset the parser with the new definitions of tags and functions.
START_TAG
public static final int START_TAG
- HTML/XML start tag.
END_TAG
public static final int END_TAG
- HTML/XML End tag.
EMPTY_TAG
public static final int EMPTY_TAG
- XML Empty tag.
ENTITY
public static final int ENTITY
- HTML/XML &entity;
FUNCTION
public static final int FUNCTION
- A special new token, $function(...), specific to my purpose.
CDATA
public static final int CDATA
- XML CDATA (Character Data)
PI
public static final int PI
- XML Processing Instruction <?application ... ?>
COMMENT
public static final int COMMENT
- Comment
HxmlTokeniser
public HxmlTokeniser(String tags[],
String entities[],
String functions[],
boolean ignoreComments,
boolean ignoreCData,
boolean ignorePI)
- Constructor which specifies whether we are ignoring comments,
CDATA or Processing Instruction sections.
HxmlTokeniser
public HxmlTokeniser(String tags[],
String entities[],
String functions[])
- Parameters:
- tags - Specify any tag names that are recognised.
- entities - Specify any entity names that are recognised.
- functions - Specify any function names that are to be recognised.
tags, entities and functions are all case insensitive.
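One common way to obtain case-insensitive matching of the kind described above is to normalise every name to lower case, both when the recognised names are stored and when a candidate is looked up. The sketch below illustrates that idea only; it is an assumption, not the parser's actual implementation, and the class NameTable and its method contains are hypothetical.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper showing case-insensitive name lookup.
class NameTable {
    private final Set<String> names = new HashSet<>();

    // Stores each recognised name in lower case; a null array means "no names".
    NameTable(String[] recognised) {
        if (recognised != null) {
            for (String n : recognised) {
                names.add(n.toLowerCase(Locale.ROOT));
            }
        }
    }

    // Lower-cases the candidate before the lookup, so case is ignored.
    boolean contains(String name) {
        return name != null && names.contains(name.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        NameTable tags = new NameTable(new String[] { "Body", "IMG" });
        System.out.println(tags.contains("body"));
        System.out.println(tags.contains("table"));
    }
}
```

Locale.ROOT is used so that the lower-casing does not depend on the default locale of the machine running the code.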
reset
public void reset(String tags[],
String entities[],
String functions[])
- Will reset the parser with the new definitions of tags and functions. The
current Reader will be set to null, as will all other state variables. This is
the same as if you had just created the parser.
- Parameters:
- tags - Specify any tag names that are recognised.
- entities - Specify any entity names that are recognised.
- functions - Specify any function names that are to be recognised.
You will need to call parse(...) again to be able to carry out parsing again.
parse
public void parse(Reader reader)
parse
public void parse(String s)
getTokenType
public int getTokenType()
- Return type of last token found with nextToken.
getTypeAsString
public String getTypeAsString()
- Debug method.
getTokenName
public String getTokenName()
- Return name of last token found with nextToken.
If tokenType == START_TAG, END_TAG or EMPTY_TAG, the
tokenName will be the tag name.
If tokenType == ENTITY, the tokenName will be the actual entity.
If tokenType == FUNCTION, the tokenName will be the function name (minus arguments).
If tokenType == PI, the tokenName will be the application.
Otherwise it will be null.
getAttribute
public String getAttribute(String name)
- Will return the named attribute if getTokenType()==START_TAG or EMPTY_TAG,
otherwise return null. Even if the token is a START_TAG or EMPTY_TAG, the attribute
may not be found, in which case this method will also return null.
getAttributes
public Enumeration getAttributes()
- Will return attributes if getTokenType()==START_TAG or EMPTY_TAG,
otherwise return null. The Enumeration may be empty.
getArguments
public String[] getArguments()
- If tokenType==FUNCTION, will return an array of the arguments of the token
encountered via nextToken()
nextToken
public boolean nextToken() throws IOException
- Reads the next token; fails if a parse exception occurs.
- Throws: IOException
- Bubbles up from unread(...), read(...) or readInt(...).
getTokenContent
public String getTokenContent()
- Get the text of the last token encountered.
May not return the complete contents of START_TAG, EMPTY_TAG, END_TAG, ENTITY or FUNCTION.
Really only designed for use by COMMENT, CDATA and PI.
getText
public String getText()
- Will return null, if no text available.
getLineNumber
public int getLineNumber()
- Returns the current line number. The line number
obtained from the reader is increased by
1 (one) before it is returned, to account for the first
line, which the reader does not count until its end.
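The adjustment described above matches the behaviour of java.io.LineNumberReader, whose counter starts at 0 and is incremented only when a line terminator is read. The sketch below assumes the tokeniser's underlying reader behaves like LineNumberReader, which the original text does not confirm; the class LineNumberDemo and the method currentLine are hypothetical.

```java
import java.io.IOException;
import java.io.LineNumberReader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical demo of the "add one" line-number adjustment.
class LineNumberDemo {
    // Reads up to charsToRead characters, then reports the 1-based line
    // number of the position reached.
    static int currentLine(String input, int charsToRead) {
        LineNumberReader reader = new LineNumberReader(new StringReader(input));
        try {
            for (int i = 0; i < charsToRead; i++) {
                if (reader.read() == -1) {
                    break; // end of input
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The reader's counter is 0 while still on the first line, so add 1.
        return reader.getLineNumber() + 1;
    }

    public static void main(String[] args) {
        String text = "ab\ncd";
        System.out.println(currentLine(text, 1)); // still on the first line
        System.out.println(currentLine(text, 4)); // past the '\n', on the second line
    }
}
```

Without the added 1, a token on the first line of the input would be reported as being on line 0.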
isIgnoreCData
public boolean isIgnoreCData()
- Ignore CDATA
isIgnorePI
public boolean isIgnorePI()
- Ignore Processing Instructions
isIgnoreComments
public boolean isIgnoreComments()
- Ignore comments
ignoreComments
public void ignoreComments(boolean b)
- Set ignoreComments indicator
ignoreCData
public void ignoreCData(boolean b)
- Set ignoreCData indicator
ignorePI
public void ignorePI(boolean b)
- Set ignorePI indicator