<substring></substring>
Provided by module: Tags: RXML tags
Extract a part of a string. The part to extract can
be specified using character positions, substring occurrences, and/or
fields separated by a set of characters. Some examples:
To pick out substrings based on character positions:
<substring index="2">abcdef</substring> |
b |
<substring index="-2">abcdef</substring> |
e |
<substring from="2" to="-2">abcdef</substring> |
bcde |
To pick out substrings based on string occurrences:
<substring after="the">
From the past to the future via the present.
</substring> |
past to the future via the present.
|
<substring after="the" from="2">
From the past to the future via the present.
</substring> |
future via the present.
|
<substring after="the" from="-1">
From the past to the future via the present.
</substring> |
present.
|
<substring after="to" before="the" to="3">
From the past to the future via the present.
</substring> |
the future via |
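The occurrence-based examples above can be modelled outside RXML. The following is a minimal Python sketch, not the tag's implementation: the function names (`find_all`, `nth_hit`, `substring_occurrences`) are hypothetical, occurrences are assumed to be non-overlapping and counted over the whole input string, and out-of-range counts are handled by the capping rule described later in this section. Note that the raw results keep the surrounding spaces, which the rendered examples above do not show.

```python
def find_all(text, needle):
    """Return the start offsets of all non-overlapping occurrences of needle."""
    if not needle:
        raise ValueError("empty search strings are not modelled in this sketch")
    offsets = []
    i = text.find(needle)
    while i != -1:
        offsets.append(i)
        i = text.find(needle, i + len(needle))
    return offsets


def nth_hit(hits, n):
    """Pick the n-th hit (1-based, negative counts from the end), or None if out of range."""
    if 0 < n <= len(hits):
        return hits[n - 1]
    if n < 0 and -n <= len(hits):
        return hits[n]
    return None


def substring_occurrences(text, after=None, before=None, from_=1, to=1):
    """Model of after/before with from/to counting occurrences (assumed semantics)."""
    start, end = 0, len(text)
    if after is not None:
        hit = nth_hit(find_all(text, after), from_)
        if hit is not None:
            start = hit + len(after)
        else:
            # Assumed capping: past the end gives an empty result,
            # past the beginning starts at the beginning.
            start = len(text) if from_ > 0 else 0
    if before is not None:
        hit = nth_hit(find_all(text, before), to)
        if hit is not None:
            end = hit
        else:
            end = len(text) if to > 0 else 0
    return text[start:end] if start < end else ""


sentence = "From the past to the future via the present."
print(repr(substring_occurrences(sentence, after="the", from_=2)))
print(repr(substring_occurrences(sentence, after="to", before="the", to=3)))
```

In this model the "before" count runs over the whole input rather than only the part after the start position, which is the reading that reproduces the last example above.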
To pick out substrings based on separated fields:
<substring separator-chars=",:" index="4">a, , b:c, d::e, : f</substring> |
c |
[<substring separator-whites="" from="3" to="4">
These are some
words separated by different amounts of whitespace.
</substring>] |
[some words] |
[<substring separator-chars=",:" trimwhites="" from="3">
a, , b:c, d::e, : f
</substring>] |
[b,c,d,,e,,f] |
[<substring separator="," trimwhites="" from="3" before="::">
a, , b: c, d::e, : f
</substring>] |
[b: c,d] |
To use just the separator/join attributes to replace sets of
characters:
<substring separator-chars=",|:;" join=", ">a,b:c|f</substring> |
a, b, c, f |
[<substring separator-whites="" join="">
Remove all whitespace,
please.
</substring>] |
[Removeallwhitespace,please.] |
[<substring separator-whites="">
Normalize all whitespace,
please.
</substring>] |
[Normalize all whitespace, please.] |
<substring separator-chars="^0-9" join=" ">:bva2de 44:3</substring> |
2 44 3 |
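For readers who want to check the replacement behaviour outside RXML, the split-then-join idea can be sketched in Python. This is only an illustration under stated assumptions: the helper names `replace_chars` and `normalize_whites` are hypothetical, the separator set is treated as a plain list of characters (ranges and complemented sets are not modelled), and whitespace splitting is assumed to drop empty fields, as "separator-whites" is documented to do.

```python
import re


def replace_chars(text, separator_chars, join):
    """Split on any single character in separator_chars, then join the fields."""
    # Escape each character so that it is taken literally inside the class.
    pattern = "[" + "".join(re.escape(c) for c in separator_chars) + "]"
    return join.join(re.split(pattern, text))


def normalize_whites(text, join=" "):
    """Split on runs of whitespace, dropping empty fields, then join."""
    return join.join(text.split())


print(replace_chars("a,b:c|f", ",|:;", ", "))
print(normalize_whites("\nRemove all   whitespace,\nplease.\n", join=""))
print(normalize_whites("\nNormalize all   whitespace,\nplease.\n"))
```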
The "from", "to" and "index" attributes specify positions
in the input string. What is considered a position depends on other
attributes:
-
If the "after" attribute is given then "from" counts the
occurrences of that string.
-
Similarly, if the "before" attribute is given then "to"
counts the occurrences of that string.
-
Otherwise, if the "separator", "separator-chars", or
"separator-whites" attribute is given then the input string is
split into fields according to the separator, and the position
counts the fields. "ignore-empty" can be used to not count empty
fields.
-
If none of the above applies then positions are counted by
characters.
Positive positions count from the start of the input string,
beginning with 1. Negative positions count from the end, so -1 is
the last position.
It is not an error if a position count goes past the string limit
(in either direction). The position gets capped by the start or end
if that happens. E.g. if a "from" position counts from the
beginning and goes past the end then the result is an empty string,
and if it counts from the end and goes past the beginning then the
result starts at the beginning of the input string.
It is also not an error if the start position ends up after the
end position. In that case the result is simply the empty string.
If neither "from", "index", nor "after" is specified then
the returned substring starts at the beginning of the input string.
If neither "to", "index", nor "before" is specified then the
returned substring ends at the end of the input string.
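The position rules above (1-based counting, negative positions from the end, capping, and the defaults) can be summarised for the plain character case in a short Python sketch. It is a model written for illustration, not the tag's code: `substring_chars` is a hypothetical name, and position 0 is rejected because the text above does not say what it means.

```python
def substring_chars(text, from_=None, to=None, index=None):
    """Model of character-based positions: 1-based, negative from the end, capped."""
    if index is not None:
        # "index" is shorthand for "from" and "to" with the same value.
        from_ = to = index
    n = len(text)

    def start_offset(pos):
        if pos == 0:
            raise ValueError("position 0 is not covered by this sketch")
        # Past the end gives an empty result; past the beginning is capped to 0.
        return min(pos - 1, n) if pos > 0 else max(n + pos, 0)

    def end_offset(pos):
        if pos == 0:
            raise ValueError("position 0 is not covered by this sketch")
        # The end position is inclusive, so the exclusive slice end is one higher.
        return min(pos, n) if pos > 0 else max(n + pos + 1, 0)

    start = 0 if from_ is None else start_offset(from_)
    end = n if to is None else end_offset(to)
    # A start position after the end position yields the empty string.
    return text[start:end] if start < end else ""


print(substring_chars("abcdef", index=2))
print(substring_chars("abcdef", from_=2, to=-2))
```

The same mapping would apply to field positions once the input has been split into fields, with the list of fields taking the place of the string.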
If <substring> is used in an array context with
"separator", "separator-chars", or "separator-whites" then the
fields are returned as an array of strings instead of a single
string. An example:
<set variable="var.list" type="array">
<substring separator-chars=",:" trimwhites="">
a, , b:c, d::e: f
</substring>
</set>
&var.list; |
Array result: ({"a", "", "b", "c", "d", "", "e", "f"}) |
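The field list in the example above can be reproduced with a small Python model of the splitting step. This is a sketch under assumptions, not the module's code: `split_fields` is a hypothetical helper, the separator set is treated as literal characters only (no ranges or complemented sets), and trimming is assumed to happen before empty fields are dropped, as described for "ignore-empty".

```python
import re


def split_fields(text, separator_chars, trimwhites=False, ignore_empty=False):
    """Split text into fields on any character in separator_chars."""
    pattern = "[" + "".join(re.escape(c) for c in separator_chars) + "]"
    fields = re.split(pattern, text)
    if trimwhites:
        # Trim whitespace from each field individually.
        fields = [f.strip() for f in fields]
    if ignore_empty:
        # Drop fields that are empty after trimming.
        fields = [f for f in fields if f]
    return fields


print(split_fields("\na, , b:c, d::e: f\n", ",:", trimwhites=True))
print(split_fields("\na, , b:c, d::e: f\n", ",:", trimwhites=True, ignore_empty=True))
```

Joining the fields from position 3 onward with the first separator character gives the string form shown in the earlier "separator-chars" example.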
Performance notes: Character indexing is efficient on arbitrarily
large input. The special case with a large positive
"from"/"to"/"index" position in combination with
"before"/"after"/"separator" is also handled reasonably
efficiently.
Attributes
- from="integer"
-
The position of the start of the substring to return.
- to="integer"
-
The position of the end of the substring to return.
- index="integer"
-
The single position to return. This is simply a shorthand for
writing "from" and "to" attributes with the same value. This
attribute is not allowed together with "after" or "before".
- after="string"
-
The substring to return begins after the first occurrence of this
string. If the "from" attribute is also given, the substring begins
after the nth occurrence instead, where n is the value of "from".
- before="string"
-
The substring to return ends before the first occurrence of this
string. If the "to" attribute is also given, the substring ends
before the nth occurrence instead, where n is the value of "to".
- separator="string"
-
The input string is read as an array of fields separated by this
string, and the "from", "to", and "index" attributes count
those fields.
If the separator string is empty (i.e. "") then the input string
is treated as an array of single character fields. Besides being
significantly slower, the only difference from indexing directly by
characters (i.e. by leaving out the separator attributes altogether)
is that "trim-chars", "trimwhites" and "ignore-empty" can be
used.
If the "join" attribute isn't given then this separator string
is also used to join together several fields in a string result.
- separator-chars="string"
-
The input string is read as an array of fields separated by any
character in this string, and the "from", "to", and "index"
attributes count those fields.
The syntax of this string is the same as in a "%[...]" format to
Pike's sscanf() function. That means:
-
Ranges of characters can be defined by using a '-' between the
first and the last character to be included in the range. Example:
"0-9H" means any digit or 'H'.
-
If the first character is '^', and this character does not
begin a range, it means that the set is complemented, which is to
say that any character except those in the set is matched.
-
To include the character '-', you must have it first (not
possible in complemented sets, see below) or last to avoid having a
range defined. To include the character ']', it must be first too.
If both '-' and ']' should be included then put ']' first and '-'
last.
-
It is not possible to make a range that ends with ']'; make the
range end with '\' instead and put ']' at the beginning. Likewise
it is generally not possible to have a range start with '-'; make
the range start with '.' instead and put '-' at the end of the
set.
-
To include '-' in a complemented set, it must be put last, not
first. To include '^' in a non-complemented set, it can be put
anywhere but first, or be specified as a range ("^-^").
If "separator-chars" is an empty string (i.e. "") then the
input string is treated as an array of single character fields.
Besides being significantly slower, the only difference from indexing
directly by characters (i.e. by leaving out the separator attributes
altogether) is that "trim-chars", "trimwhites" and
"ignore-empty" can be used.
If a string containing several fields is returned, the first
character in "separator-chars" is used by default to join the
fields. However, if the set is complemented then the fields are
joined without anything in between. In any case, you can use the
"join" attribute to override the join string.
Performance note: The "separator" attribute is much more
efficient than this one, so use "separator" if you have a single
separator character.
- separator-whites
-
The input string is read as an array of fields separated by
arbitrary amounts of whitespace, and the "from", "to", and
"index" attributes count those fields.
In other words, this is a shorthand for specifying
ignore-empty="" together with a "separator-chars" set
consisting of the whitespace characters. It can
be combined with more characters in another "separator-chars"
attribute.
- ignore-empty
-
Only used together with "separator", "separator-chars", or
"separator-whites". Ignore all fields that are empty (after
trimming if "trim-chars" or "trimwhites" is given). In other
words, fields are considered to be separated by a sequence of the
given separator (and trim characters), instead of a single
separator.
- join="string"
-
Only used together with "separator", "separator-chars", or
"separator-whites". If several fields are joined together into a
result string, then this string is used as the delimiter between the
fields.
- case-insensitive
-
Be case insensitive when matching the "after", "before",
"separator", "separator-chars" and "trim-chars" strings. Case
is still preserved in the returned result.
- trim-chars="string"
-
Trim any sequence of the characters in this string from the start
and end of the result before returning it. If "separator",
"separator-chars", or "separator-whites" is specified then the
trimming is done on each field.
The format of this attribute is the same as in a "%[...]" format to
Pike's sscanf() function. See the "separator-chars" attribute for a
description.
- trimwhites
-
Shorthand for specifying "trim-chars" with all whitespace
characters. It is also slightly faster than the equivalent
"trim-chars" setting.