NAME
HTML::Object::Element - HTML Element Object
SYNOPSIS
use HTML::Object::Element;
my $this = HTML::Object::Element->new ||
    die( HTML::Object::Element->error, "\n" );
VERSION
v0.2.9
DESCRIPTION
This interface implements a core element for the HTML::Object parser. An element can be one or more spaces, a text, a tag, a comment, or a document; all of these inherit from this core interface.
For a more elaborate interface and a close implementation of the Web Document Object Model (a.k.a. DOM), see HTML::Object::DOM::Element and the DOM parser.
METHODS
address
This method is purely for compatibility with "address" in HTML::Element. Please refer to its documentation for its use.
all_attr
Returns a hash (not a hash reference) of the element's attributes as key-value pairs.
This is provided for compatibility with HTML::Element.
my %attributes = $e->all_attr;
all_attr_names
Returns a list of all the element's attributes in no particular order.
my @attributes = $e->all_attr_names;
as_html
This is an alias for "as_string"
as_string
Returns a string representation of the current element and its underlying descendants.
If a cached version of that string exists, it is returned instead.
as_text
Returns a string representation of the text content of the current element and its descendants.
If a cached version of that string exists, it is returned instead.
It takes an optional hash or hash reference of parameters:
callback
This is a callback, given either as a subroutine reference or as an anonymous subroutine. It is called for each textual element found and is passed the element object as its sole argument.
unescape
Boolean. If true, the value of textual elements found will be unescaped before being returned. This means that &lt; will be converted back to <, &gt; to >, and <br > followed by a new line will be removed to only leave the new line.
See also "innerText" in HTML::Object::DOM::Element, "textContent" in HTML::Object::Node and "text" in HTML::Object::XQuery
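The effect of the unescape option can be pictured with a small standalone sketch. This is not the module's code: unescape_sketch is a hypothetical helper written with core Perl only, and both the order of the substitutions and the exact form of the br tag it accepts are assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of HTML::Object: a rough model of the
# transformations listed for the "unescape" option.
sub unescape_sketch
{
    my $text = shift;
    # Drop a br tag (with or without a slash) when a new line follows it
    $text =~ s/<br\s*\/?>(?=\n)//g;
    # Convert the two escaped angle brackets back to literal characters
    $text =~ s/&lt;/</g;
    $text =~ s/&gt;/>/g;
    return( $text );
}

my $plain = unescape_sketch( "1 &lt; 2<br />\nnext line" );
```

Here $plain keeps the new line but loses the br tag, and the escaped bracket becomes a literal one.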
as_trimmed_text
Returns the value returned by "as_text", with any leading and trailing spaces trimmed.
as_xml
This is merely an alias for as_string
attr
Provided with an attribute name, this will return its value. If an attribute value is also provided, it will set or replace the attribute value accordingly. If the attribute value provided is undef, this will remove the attribute altogether.
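The difference between "no value given" and "undef given" is the subtle part of this interface. The following sketch models it with a plain hash; attr_sketch and %attributes are hypothetical names, and the return values chosen for the set and remove cases are assumptions, not documented behaviour of the module.

```perl
use strict;
use warnings;

# Hypothetical stand-in for an element's attributes
my %attributes = ( id => 'main', class => 'active' );

# Hypothetical helper mimicking the three behaviours described above
sub attr_sketch
{
    my $name = shift;
    # No value argument at all: act as a plain getter
    return( $attributes{ $name } ) unless( @_ );
    my $value = shift;
    # An explicit undef: remove the attribute altogether
    return( delete( $attributes{ $name } ) ) if( !defined( $value ) );
    # Anything else: set or replace the value
    return( $attributes{ $name } = $value );
}

my $id = attr_sketch( 'id' );      # get
attr_sketch( class => 'idle' );    # set or replace
attr_sketch( id => undef );        # remove
```

Counting the remaining arguments, rather than testing the value for definedness alone, is what lets a getter call and a removal call be told apart.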
attributes
Returns a hash object of all the attribute key-value pairs.
Be careful: this is a 'live' object, and if you make changes to it directly, you could damage the hierarchy or introduce errors.
attributes_sequence
Returns an array object containing the attribute names in their order of appearance.
checksum
Returns the element checksum, used to determine if any change was made.
children
Returns an array object containing all the element's children.
class
Returns this element's class, e.g. HTML::Object::Element or HTML::Object::Document
clone
Returns a copy of the current element and, recursively, of all its descendants.
The cloned element that is returned has no parent.
clone_list
Clone all the element children and return a new array object of the cloned children.
This is quite different from the HTML::Element equivalent, which is accessed as a class method and takes an arbitrary list of elements.
close
Closes the current tag, if necessary. It returns the current object upon success, or undef upon error and sets an error.
close_tag
Set or get a closing element object that is used to close the current element.
column
Returns the column at which this element was found in the original HTML text string, by the parser.
content
This is an alias for "children". It returns an array object of the current element's children objects.
content_array_ref
This is an alias for "children". It returns an array object of the current element's children objects.
This is provided for compatibility with HTML::Element.
content_list
In list context, this returns the list of the current element's children, if any, and in scalar context, this returns the number of child elements it contains.
This is provided for compatibility with HTML::Element.
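For readers less familiar with context-sensitive returns, here is a minimal core-Perl sketch of the pattern. content_list_sketch and @children are hypothetical stand-ins and do not reflect how the module stores its children.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for child element objects
my @children = ( 'head', 'body' );

sub content_list_sketch
{
    # wantarray() is true in list context and false in scalar context
    return( wantarray() ? @children : scalar( @children ) );
}

my @list  = content_list_sketch();   # list context: the children themselves
my $count = content_list_sketch();   # scalar context: how many there are
```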
delete
Remove all of its content by calling "delete_content", detach the current object, and destroy the object.
delete_content
Remove the content, i.e. all the children, of the current element, effectively calling "delete" on each one of them.
It returns the current element.
delete_ignorable_whitespace
Does not do anything, by design. There is not much value in this method under HTML::Object in the first place.
depth
Returns an integer representing the depth level of the current element in the hierarchy.
descendants
Returns an array object of all the element's descendants throughout its hierarchy.
destroy
An alias for "delete"
destroy_content
An alias for "delete_content"
detach
This method takes no parameter. It removes the current element from its parent's list of child elements and unsets its parent object value.
It returns the element's parent object.
detach_content
This method takes no argument and will remove the parent value for each of its children, set the children list for the current element to an empty list and return the list of those children elements thus removed.
my @removed = $e->detach_content;
This is provided for compatibility with HTML::Element.
dump
Prints to STDOUT a representation of the hierarchy of element objects.
eid
Returns the element unique id, which is automatically generated for any element. This is actually a uuid. For example:
my $eid = $e->eid; # e.g.: 971ef725-e99b-4869-b6ac-b245794e84e2
end
Returns the current object.
Actually, I am not sure this should be here, and rather it should be in HTML::Object::XQuery since it simulates jQuery.
extract_links
Returns links found by traversing the element and all of its children and looking for attributes (like href in an <a> element, or src in an <img> element) whose values represent links.
You may specify that you want to extract links from just some kinds of elements (instead of the default, which is to extract links from all the kinds of elements known to have attributes whose values represent links). For instance, if you want to extract links from only <a> and <img> elements, you could code it like this:
my $links = $elem->extract_links( qw( a img ) ) ||
    die( $elem->error );
foreach( @$links )
{
    say "Hey, there is a ", $_->{tag}, " that links to ", $_->{value}, ", in its ", $_->{attribute}, " attribute, at ", $_->{element}->address;
}
The dictionary definition hash reference of all tags and their attributes containing potential links is available as $HTML::Object::LINK_ELEMENTS
This method returns an array object containing hash objects, for each attribute of an element containing a link, with the following properties:
attribute
The attribute containing the link
element
The element object
tag
The element tag name.
value
The attribute value, which would typically contain the link value.
Nota bene: this method has been implemented to provide an API similar to that of HTML::Element, and the first two paragraphs of this method description are taken from that module.
find_by_attribute
Returns an array object of all the elements (including potentially the current element itself) in the element's hierarchy that have an attribute matching the given attribute name.
my $list = $e->find_by_attribute( 'data-dob' );
find_by_tag_name
Returns an array object of all the elements (including potentially the current element itself) in the element's hierarchy that match any of the specified tag names. Tag names can be provided in any case; matching is case insensitive.
my $list = $e->find_by_tag_name( qw( div p span ) );
has_children
Returns true if the current element has children, i.e. it contains other elements within itself.
id
Set or get the id HTML attribute of the element.
insert_element
Provided with an element object and this will add it to the current element's children.
It returns the current element object.
internal
Returns the internal hash of key-value pairs used internally by this package. This is primarily used to handle the data-* special attributes.
is_closed
Returns true if the current element has a closing tag that is accessible with "close_tag"
is_empty
Returns true if this is an element that, by HTML standard, does not contain any other elements, and false otherwise.
To check if the element has children, use "has_children"
is_inside
Provided with a list of tag names or element objects, and this will check if the current element is contained in any of the element objects, or elements whose tag name is provided. It returns true if it is contained, or false otherwise.
Example:
say $e->is_inside( qw( span div ), $elem1, 'p', $elem2 ) ? 'yes' : 'no';
is_valid_attribute
Provided with an attribute name and this returns true if it is valid, or false otherwise.
is_void
Returns true if, by standard, this tag is void, meaning it does not contain any children. For example: <br />, <link />, or <input />
left
Returns an array object of all the sibling objects before the current element.
line
Returns the line at which this element was found in the original HTML text string, by the parser.
lineage
Returns an array object of the current element's parent and parent's parent up to the root of the hierarchy
lineage_tag_names
Returns an array object of the current element's parent tag name and parent's parent tag name up to the root of the hierarchy
This is equivalent to:
my $list = $self->lineage->map(sub{ $_->tag });
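The walk from parent to root can be pictured with plain hashes carrying a parent link. This is only a sketch of the idea: lineage_sketch and the hash layout are hypothetical, and the real method returns an array object rather than a plain array reference.

```perl
use strict;
use warnings;

# Hypothetical three-level hierarchy: html > body > p
my $root = { tag => 'html', parent => undef };
my $body = { tag => 'body', parent => $root };
my $para = { tag => 'p',    parent => $body };

# Collect the ancestors, nearest first, by following the parent links
sub lineage_sketch
{
    my $node = shift;
    my @ancestors;
    while( my $parent = $node->{parent} )
    {
        push( @ancestors, $parent );
        $node = $parent;
    }
    return( \@ancestors );
}

# Same mapping as above, written with the built-in map
my @names = map { $_->{tag} } @{ lineage_sketch( $para ) };
```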
look
This is the method that does the heavy work for "look_down" and "look_up"
look_down
Provided with some criteria, and an optional hash reference of options, and this will crawl down the current element hierarchy to find any matching element.
my $list = $e->look_down( _tag => 'div' ); # returns an Module::Generic::Array object
my $list = $e->look_down( class => qr/\bclass_name\b/, { max_level => 3, max_match => 1 });
The options you can specify are:
- max_level
Takes an integer that sets the maximum lower or upper level beyond which this will stop searching.
- max_match
Takes an integer that sets the maximum number of matches after which this will stop recurring and return the result.
There are three kinds of criteria you can specify:
- 1. attr_name, attr_value
This is used when you are looking for an element with a particular attribute name and value. For example:
my $list = $e->look_down( id => 'hello' );
This will look for any element whose attribute id has a value of hello
If you want to search for an attribute that does not exist, set the attribute value being searched to undef
To search for a tag, use the special attribute _tag. For example:
my $list = $e->look_down( _tag => 'div' );
This will return an array object of all the div elements.
- 2. attr_name, qr//
Same as above, except the attribute value of the element being checked will be evaluated against this regular expression and, if it matches, the element will be added to the resulting array object.
For example:
my $list = $e->look_down( 'data-dob' => qr/^\d{4}-\d{2}-\d{2}$/ );
This will search for all elements that have an attribute data-dob whose value looks like a date.
- 3. \&my_check or sub{ # some code here }
Provided with a code reference (i.e. a reference to an existing subroutine, or an anonymous one), and it will be evaluated for each element found. If it returns undef, look_down will interrupt its crawling, and if it returns true, it will signal the need to add the element to the resulting array object of elements.
For example:
my $list = $e->look_down( _tag => 'img', class => qr/\bactive\b/, sub { return( $_->attr( 'width' ) > 350 ? 1 : 0 ); } );
When executing the code, the current element being evaluated will be made available via $_
Those criteria are called and evaluated in the order they are provided. Thus, if you specify, for example:
my $list = $e->look_down(
    _tag => 'img',
    class => qr/\bactive\b/,
    sub
    {
        return( $_->attr( 'width' ) > 350 ? 1 : 0 );
    }
);
Each element will be evaluated first to see if its tag is img and discarded if it is not. Then, its class attribute is checked against the regular expression provided, and the element is discarded if it has no such attribute or if the value does not match. Finally, the code will be evaluated.
Thus, the order of the criteria is important.
It returns an array object of all the elements found.
This is provided for compatibility with HTML::Element.
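To make the ordered evaluation concrete, here is a standalone sketch that applies the three kinds of criteria to a tree of plain hashes. It is not the module's implementation: matches_sketch, look_down_sketch and the hash layout are hypothetical, and the sketch leaves out the options as well as the rule that an undef return from a code reference interrupts the crawl.

```perl
use strict;
use warnings;

# Hypothetical helper: test one node against the criteria, in the order given
sub matches_sketch
{
    my( $node, @criteria ) = @_;
    while( @criteria )
    {
        my $this = shift( @criteria );
        # Kind 3: a code reference, called with $_ set to the node
        if( ref( $this ) eq 'CODE' )
        {
            local $_ = $node;
            return(0) unless( $this->( $node ) );
            next;
        }
        # Kinds 1 and 2: an attribute name followed by a value or a regular expression
        my $want = shift( @criteria );
        my $have = $this eq '_tag' ? $node->{tag} : $node->{attr}->{ $this };
        if( !defined( $want ) )
        {
            # An undef value means the attribute must not exist
            return(0) if( defined( $have ) );
        }
        elsif( ref( $want ) eq 'Regexp' )
        {
            return(0) unless( defined( $have ) && $have =~ $want );
        }
        else
        {
            return(0) unless( defined( $have ) && $have eq $want );
        }
    }
    return(1);
}

# Hypothetical helper: check the node itself, then recurse into its children
sub look_down_sketch
{
    my( $node, @criteria ) = @_;
    my @found;
    push( @found, $node ) if( matches_sketch( $node, @criteria ) );
    foreach my $child ( @{ $node->{children} || [] } )
    {
        push( @found, look_down_sketch( $child, @criteria ) );
    }
    return( @found );
}

my $tree =
{
    tag => 'div', attr => {}, children =>
    [
        { tag => 'img', attr => { class => 'active', width => 400 }, children => [] },
        { tag => 'img', attr => { class => 'active', width => 100 }, children => [] },
        { tag => 'img', attr => { class => 'hidden', width => 500 }, children => [] },
    ],
};

my @found = look_down_sketch(
    $tree,
    _tag  => 'img',
    class => qr/\bactive\b/,
    sub { $_->{attr}->{width} > 350 },
);
```

Because the _tag test comes first, the code reference never runs against the div, which has no width; reversing the order would trigger an uninitialized value warning there. That is one practical reason to put cheap, selective criteria first.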
look_up
Provided with some criteria, and an optional hash reference of options, and this will crawl up the current element's ascendants, starting with its parent, to find any matching element.
The options that can be used are the same as for "look_down", i.e. max_level and max_match
It returns an array object of all the elements found.
This is provided for compatibility with HTML::Element.
looks_like_html
Provided with a string and this returns true if the string starts with an HTML tag, or false otherwise.
looks_like_it_has_html
Provided with a string and this returns true if the string contains HTML tags, or false otherwise.
modified
Set or get a boolean of whether the element was modified. Actually this is not used.
This returns a DateTime object.
new_attribute
This creates a new HTML::Object::Attribute object passing it any arguments provided, and returns the object thus created, or undef if an error occurred.
new_closing
This creates a new HTML::Object::Closing object passing it any arguments provided, and returns the object thus created, or undef if an error occurred.
new_document
Instantiate a new HTML document, passing it whatever argument was provided, and return the resulting object.
new_element
Instantiate a new element, passing it whatever argument was provided, and return the resulting object.
new_from_lol
This is a legacy from HTML::Element, but is not actually used.
This recursively constructs a tree of nodes.
It returns an array object of elements.
new_parser
Instantiate a new parser object, passing it whatever argument was provided, and return the resulting object.
new_text
Instantiate a new text object, passing it whatever argument was provided, and return the resulting object.
normalize_content
Checks each of the current element's child elements and concatenates any adjacent text or space elements.
It returns the current object.
offset
Returns the offset value, i.e. the byte position, at which the tag was found in the original HTML data.
original
Returns the original raw string data as it was captured initially by the parser.
This is an important feature of HTML::Object: if nothing was changed, HTML::Object will return the element objects in their original text version.
Other HTML parsers, by contrast, decode all the HTML elements parsed and rebuild them, often badly and even though they have not been changed, which of course incurs a heavy speed penalty.
parent
Returns the current element's parent element, if any. The value returned could very well be empty if, for example, it is the top element or if the element was created independently of any parsing.
pindex
This is an alias for "pos"
pos
Read-only.
Returns the position integer of the current element among its parent's children elements.
It returns a smart undef if the element has no parent.
If the current element, somehow, could not be found among its parent's children, this would return undef.
Contrary to the HTML::Element equivalent, you cannot manually change this value.
postinsert
Provided with a list of elements and this will add them right after the current element in its parent's children.
It returns the current element object for chaining upon success, and upon error, it returns undef and sets an error.
preinsert
Provided with a list of elements and this will add them right before the current element in its parent's children.
It returns the current element object for chaining upon success, and upon error, it returns undef and sets an error.
push_content
Provided with a list of elements and this will add them as children to the current element.
Contrary to the HTML::Element equivalent, this requires that only objects be provided, which is easy to do anyhow.
If consecutive text or space objects are provided they are automatically merged with their immediate text or space objects, if any.
For example:
$e->push_content( $elem1, HTML::Object::Element->new( value => q{some text} ), $elem2 );
And if two consecutive text objects were provided the second one would have its value merged with the previous one.
It returns the current element object for chaining.
replace_with
Provided with a list of element objects and this will replace the current element in its parent's children with the element objects provided.
This will return an error if the current element has no parent, or if the current element cannot be found among its parent's children elements.
Also, this method will filter out any duplicate objects, and return an error if the element being replaced is also among the objects provided for replacement or if the current element's parent is among the replacement objects.
Each replacement object is detached from its previous parent and re-attached to the current element's parent before being added to its children.
It returns the current element object.
replace_with_content
Replaces the current element in its parent's children with its own child elements. In other words, the current element's children are moved up and take the place of the current element itself.
It returns the current element object, which will then no longer have a parent.
reset
Enable the reset flag for this element, which has the effect of instructing "as_string" to not use its cache.
right
Returns an array object of all the sibling objects after the current element.
root
Returns the top most element in the hierarchy, which usually is HTML::Object::Document
same_as
This method will check that 2 element objects are similar, in the sense that they can have different "eid", but have identical structure.
If you want to check whether 2 element objects are actually the same, by comparing their eid, you can use the comparison operators that have been overloaded. For example:
say $a eq $b ? 'same' : 'nope';
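For readers unfamiliar with operator overloading, the sketch below shows how an eq comparison based on an identifier can be wired up with the core overload pragma. Sketch::Node is a hypothetical class invented for this illustration; how HTML::Object::Element actually sets up its overloading is not shown in this document.

```perl
package Sketch::Node;
use strict;
use warnings;
use Scalar::Util qw( blessed );
use overload (
    # Two nodes are "eq" only when they carry the same eid
    'eq' => sub
    {
        my( $left, $right ) = @_;
        return(0) unless( blessed( $right ) && $right->isa( __PACKAGE__ ) );
        return( $left->{eid} eq $right->{eid} ? 1 : 0 );
    },
    fallback => 1,
);

sub new
{
    my( $class, %args ) = @_;
    return( bless( { %args } => $class ) );
}

package main;

# Same tag, hence similar structure, but different identifiers
my $x = Sketch::Node->new( eid => 'aaa', tag => 'p' );
my $y = Sketch::Node->new( eid => 'bbb', tag => 'p' );

my $same_object = ( $x eq $x ) ? 1 : 0;
my $diff_object = ( $x eq $y ) ? 1 : 0;
```

Two nodes with the same tag but different identifiers compare as different here, which is the distinction drawn above between identity and structural similarity.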
set_checksum
Calculates and returns the md5 checksum of the current element based on all its attributes.
splice_content
Provided with an offset, a length, and a list of element objects, this will replace length children of the element, starting at position offset, with the list of objects supplied.
If consecutive text element or space element are provided they will be merged with their immediate previous sibling of the same type.
For example:
$e->splice_content( 3, 2, $elem1, $elem2, HTML::Object::Text->new( value => 'Hello world' ) );
It returns an error if the offset or length provided is not a valid integer.
Upon success, it returns the current object for chaining.
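The offset and length arguments read like those of Perl's built-in splice. Assuming the same semantics apply (an assumption based on the method name, not something confirmed here), the effect on the list of children can be previewed on a plain array:

```perl
use strict;
use warnings;

# Hypothetical stand-ins for six child elements
my @children = qw( a b c d e f );

# Replace 2 items starting at offset 3 with three new items
my @removed = splice( @children, 3, 2, 'X', 'Y', 'Z' );

# @children is now: a b c X Y Z f
# @removed is:      d e
```

Note that the built-in returns the removed items, whereas splice_content is documented above as returning the current object.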
tag
Returns the tag name of the current element as a scalar object. Be careful with any change you make to it, as it would directly change the element tag name.
Non-element tags, such as text or space, have a pseudo tag starting with an underscore ("_"), such as _text and _space
traverse
Provided with a reference to an existing subroutine, or an anonymous one, and this will crawl through every element of the descending hierarchy and call the callback code, passing it the element object being evaluated. The local variable $_ is also made available and set to the element being evaluated.
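The callback convention (element passed as argument and also available in $_) can be sketched with plain hashes as follows. traverse_sketch is a hypothetical helper; visiting descendants only, and in document order, are assumptions made for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: visit every descendant, depth first
sub traverse_sketch
{
    my( $node, $code ) = @_;
    foreach my $child ( @{ $node->{children} || [] } )
    {
        # Make the element available both as $_ and as the first argument
        local $_ = $child;
        $code->( $child );
        traverse_sketch( $child, $code );
    }
}

my $tree =
{
    tag => 'div', children =>
    [
        { tag => 'p',   children => [ { tag => '_text', children => [] } ] },
        { tag => 'img', children => [] },
    ],
};

# Collect the tag name of each element visited
my @seen;
traverse_sketch( $tree, sub { push( @seen, $_->{tag} ) } );
```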
unshift_content
This acts like "push_content", except that instead of appending the elements, this prepends the given elements to the element's children.
It returns the current element.
AUTHOR
Jacques Deguest <jack@deguest.jp>
SEE ALSO
HTML::Object, HTML::Object::Attribute, HTML::Object::Boolean, HTML::Object::Closing, HTML::Object::Collection, HTML::Object::Comment, HTML::Object::Declaration, HTML::Object::Document, HTML::Object::Element, HTML::Object::Exception, HTML::Object::Literal, HTML::Object::Number, HTML::Object::Root, HTML::Object::Space, HTML::Object::Text, HTML::Object::XQuery
COPYRIGHT & LICENSE
Copyright (c) 2021 DEGUEST Pte. Ltd.
All rights reserved
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.