<substring></substring>

Provided by module: Tags: RXML tags

Extract a part of a string. The part to extract can be specified using character positions, substring occurrences, and/or fields separated by a set of characters. Some examples:

To pick out substrings based on character positions:

<substring index="2">abcdef</substring>

b

<substring index="-2">abcdef</substring>

e

<substring from="2" to="-2">abcdef</substring>

bcde

To pick out substrings based on string occurrences:

<substring after="the">
  From the past to the future via the present.
</substring>

past to the future via the present.

<substring after="the" from="2">
  From the past to the future via the present.
</substring>

future via the present.

<substring after="the" from="-1">
  From the past to the future via the present.
</substring>

present.

<substring after="to" before="the" to="3">
  From the past to the future via the present.
</substring>

the future via

To pick out substrings based on separated fields:

<substring separator-chars=",:" index="4">a, , b:c, d::e, : f</substring>

c

[<substring separator-whites="" from="3" to="4">
   These   are  some
 words  separated by   different amounts of  whitespace.
</substring>]

[some words]

[<substring separator-chars=",:" trimwhites="" from="3">
  a, , b:c, d::e, : f
</substring>]

[b,c,d,,e,,f]

[<substring separator="," trimwhites="" from="3" before="::">
   a, , b: c, d::e, : f
</substring>]

[b: c,d]

To use just the separator/join attributes to replace sets of characters:

<substring separator-chars=",|:;" join=", ">a,b:c|f</substring>

a, b, c, f

[<substring separator-whites="" join="">
Remove   all whitespace,
	please.
</substring>]

[Removeallwhitespace,please.]

[<substring separator-whites="">
Normalize   all whitespace,
	please.
</substring>]

[Normalize all whitespace, please.]

<substring separator-chars="^0-9" join=" ">:bva2de 44:3</substring>

2 44 3

The "from", "to" and "index" attributes specify positions in the input string. What is considered a position depends on other attributes:

If the "after" attribute is given then "from" counts the occurrences of that string.

Similarly, if the "before" attribute is given then "to" counts the occurrences of that string.

Otherwise, if the "separator", "separator-chars", or "separator-whites" attribute is given then the input string is split into fields according to the separator, and the position counts the fields. "ignore-empty" can be used to not count empty fields.

If none of the above applies then positions are counted by characters.

Positive positions count from the start of the input string, beginning with 1. Negative positions count from the end, so -1 is the last position.

It is not an error if a position count goes past the string limit (in either direction). The position gets capped by the start or end if that happens. E.g. if a "from" position counts from the beginning and goes past the end then the result is an empty string, and if it counts from the end and goes past the beginning then the result starts at the beginning of the input string.

It is also not an error if the start position ends up after the end position. In that case the result is simply the empty string.
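The counting and capping rules above can be modelled in a few lines of Python. This is only an illustrative sketch of character positions, not Roxen's implementation; the function name char_substring and its start/end parameters (standing in for "from" and "to") are made up for the example.

```python
def char_substring(s, start=None, end=None):
    """Model of 1-based, inclusive character positions with capping.

    Hypothetical helper: `start` and `end` stand in for the "from" and
    "to" attributes; negative values count from the end of the string.
    """
    n = len(s)

    def to_offset(pos):
        # Convert a 1-based position to a 0-based offset.
        # Negative positions count from the end: -1 is the last character.
        return pos - 1 if pos > 0 else n + pos

    lo = 0 if start is None else to_offset(start)
    hi = n - 1 if end is None else to_offset(end)
    # Cap both ends to the string limits.
    lo = max(lo, 0)
    hi = min(hi, n - 1)
    # A start that ends up after the end gives the empty string.
    return s[lo:hi + 1] if lo <= hi else ""


print(char_substring("abcdef", 2, -2))      # bcde
print(repr(char_substring("abcdef", 10)))   # ''
print(char_substring("abcdef", -10))        # abcdef
```

Passing the same value for both parameters corresponds to "index".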

If neither "from", "index", nor "after" is specified then the returned substring starts at the beginning of the input string. If neither "to", "index", nor "before" is specified then the returned substring ends at the end of the input string.

If <substring> is used in an array context with "separator", "separator-chars", or "separator-whites" then the fields are returned as an array of strings instead of a single string. An example:

<set variable="var.list" type="array">
  <substring separator-chars=",:" trimwhites="">
    a, , b:c, d::e: f
  </substring>
</set>
&var.list;

Array result: ({"a", "", "b", "c", "d", "", "e", "f"})
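The field list in the array example can be reproduced with a small Python model. It is a sketch under simplifying assumptions: split_fields is a hypothetical name, and the separator set is treated as a plain list of characters (the range and '^' syntax is not handled).

```python
def split_fields(s, separator_chars, trim_whites=False):
    """Split `s` at every single character found in `separator_chars`.

    Simplification: `separator_chars` is a literal list of characters.
    """
    fields, current = [], []
    for ch in s:
        if ch in separator_chars:
            # Each separator character ends the current field,
            # so adjacent separators produce empty fields.
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    if trim_whites:
        fields = [f.strip() for f in fields]
    return fields


print(split_fields("a, , b:c, d::e: f", ",:", trim_whites=True))
# ['a', '', 'b', 'c', 'd', '', 'e', 'f']
```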

Performance notes: Character indexing is efficient on arbitrarily large input. The special case with a large positive "from"/"to"/"index" position in combination with "before"/"after"/"separator" is also handled reasonably efficiently.

Attributes

from="integer"

The position of the start of the substring to return.

to="integer"

The position of the end of the substring to return.

index="integer"

The single position to return. This is simply a shorthand for writing "from" and "to" attributes with the same value. This attribute is not allowed together with "after" or "before".

after="string"

The substring to return begins after the first occurrence of this string. If the "from" attribute is also given, it selects which occurrence to use (e.g. from="2" means the second occurrence and from="-1" the last one).

before="string"

The substring to return ends before the first occurrence of this string. If the "to" attribute is also given, it selects which occurrence to use (e.g. to="3" means the third occurrence).
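How "after" and "before" combine with occurrence counts can be sketched in Python as follows. This is an illustrative model only: the names nth_occurrence and between, and the parameter frm (standing in for "from", which is a Python keyword), are assumptions. Counts that run past the available occurrences are handled according to the general capping rule for positions, which is this sketch's reading of the text rather than confirmed behaviour.

```python
def nth_occurrence(s, needle, n):
    """Offset of the nth occurrence of `needle`, or None if there are fewer.

    Positive n counts from the start, negative n from the end.
    """
    if not needle or n == 0:
        raise ValueError("needle must be non-empty and n non-zero")
    offsets = []
    pos = s.find(needle)
    while pos != -1:
        offsets.append(pos)
        pos = s.find(needle, pos + len(needle))
    index = n - 1 if n > 0 else n
    if -len(offsets) <= index < len(offsets):
        return offsets[index]
    return None


def between(s, after=None, frm=1, before=None, to=1):
    """Model of the "after"/"from" and "before"/"to" attribute pairs."""
    start, end = 0, len(s)
    if after is not None:
        hit = nth_occurrence(s, after, frm)
        if hit is not None:
            start = hit + len(after)
        elif frm > 0:
            start = len(s)   # counted past the end: empty result
    if before is not None:
        hit = nth_occurrence(s, before, to)
        if hit is not None:
            end = hit
        elif to < 0:
            end = 0          # counted past the beginning: empty result
    return s[start:end]      # empty if start is after end


text = "From the past to the future via the present."
print(repr(between(text, after="the", frm=2)))
# ' future via the present.'
print(repr(between(text, after="to", before="the", to=3)))
# ' the future via '
```

Note that both counts are taken over the whole input string, which is why to="3" selects the "the" in front of "present" even though the result starts after "to".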

separator="string"

The input string is read as an array of fields separated by this string, and the "from", "to", and "index" attributes count those fields.

If the separator string is empty (i.e. "") then the input string is treated as an array of single character fields. Besides being significantly slower, the only difference from indexing directly by characters (i.e. by leaving out the separator attributes altogether) is that "trim-chars", "trimwhites" and "ignore-empty" can be used.

If the "join" attribute isn't given then this separator string is also used to join together several fields in a string result.

separator-chars="string"

The input string is read as an array of fields separated by any character in this string, and the "from", "to", and "index" attributes count those fields.

The syntax of this string is the same as in a "%[...]" format to Pike's sscanf() function. That means:

Ranges of characters can be defined by using a '-' between the first and the last character to be included in the range. Example: "0-9H" means any digit or 'H'.

If the first character is '^', and this character does not begin a range, it means that the set is complemented, which is to say that any character except those in the set is matched.

To include the character '-', you must have it first (not possible in complemented sets, see below) or last to avoid having a range defined. To include the character ']', it must be first too. If both '-' and ']' should be included then put ']' first and '-' last.

It is not possible to make a range that ends with ']'; make the range end with '\' instead and put ']' at the beginning. Likewise it is generally not possible to have a range start with '-'; make the range start with '.' instead and put '-' at the end of the set.

To include '-' in a complemented set, it must be put last, not first. To include '^' in a non-complemented set, it can be put anywhere but first, or be specified as a range ("^-^").
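To make the set rules more concrete, here is a small Python model that turns such a set description into a matching function. It is a simplified sketch based only on the rules as stated above, not a port of Pike's sscanf() parser; the name char_set_matcher is hypothetical, and the special placement rules for ']' are not reproduced.

```python
def char_set_matcher(spec):
    """Build a predicate for a set such as "0-9H" or "^0-9".

    Simplified model; edge cases of the real parser are not covered.
    """
    negate = False
    i = 0
    # A leading '^' complements the set, unless it starts a range ("^-^").
    starts_range = len(spec) >= 3 and spec[1] == "-"
    if spec.startswith("^") and not starts_range:
        negate = True
        i = 1
    ranges = []
    while i < len(spec):
        if i + 2 < len(spec) and spec[i + 1] == "-":
            # '-' with a character on both sides defines a range.
            ranges.append((spec[i], spec[i + 2]))
            i += 3
        else:
            # Anything else, including a first or last '-', is literal.
            ranges.append((spec[i], spec[i]))
            i += 1

    def matches(ch):
        inside = any(lo <= ch <= hi for lo, hi in ranges)
        return inside != negate

    return matches


is_sep = char_set_matcher("^0-9")
print(is_sep(":"), is_sep("4"))   # True False
```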

If "separator-chars" is an empty string (i.e. "") then the input string is treated as an array of single character fields. Besides being significantly slower, the only difference from indexing directly by characters (i.e. by leaving out the separator attributes altogether) is that "trim-chars", "trimwhites" and "ignore-empty" can be used.

If a string containing several fields is returned, the first character in "separator-chars" is used by default to join the fields. However, if the set is complemented then the fields are joined without anything in between. In any case, you can use the "join" attribute to override the join string.
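The choice of join string described above can be summarised in a short Python sketch. The function name default_join is hypothetical, and the sketch simplifies by treating any leading '^' as a complement marker (so a set like "^-^" is not handled).

```python
def default_join(separator_chars, join=None):
    """Pick the string used to join several fields in a string result."""
    if join is not None:
        return join                  # an explicit "join" always wins
    if separator_chars.startswith("^"):
        return ""                    # complemented set: nothing in between
    return separator_chars[:1]       # otherwise the first character of the set


fields = ["b", "c", "d", "", "e", "", "f"]
print(default_join(",:").join(fields))   # b,c,d,,e,,f
```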

Performance note: The "separator" attribute is much more efficient than this one, so use "separator" if you have a single separator character.

separator-whites

The input string is read as an array of fields separated by arbitrary amounts of whitespace, and the "from", "to", and "index" attributes count those fields.

In other words, this is a shorthand for specifying ignore-empty="" together with a "separator-chars" value that lists the whitespace characters (space, tab and newline). It can be combined with more characters in another "separator-chars" attribute.

ignore-empty

Only used together with "separator", "separator-chars", or "separator-whites". Ignore all fields that are empty (after trimming if "trim-chars" or "trimwhites" is given). In other words, fields are considered to be separated by a sequence of the given separator (and trim characters), instead of a single separator.
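The effect of ignoring empty fields can be illustrated with a Python sketch that splits on single whitespace characters and then drops the empty fields, which is how the text describes "separator-whites". The function name whitespace_fields is made up for the example, and only space, tab and newline are treated as whitespace here.

```python
import re


def whitespace_fields(s):
    """Split on whitespace and ignore empty fields."""
    # Step 1: split at every single whitespace character. Runs of
    # whitespace therefore produce empty fields between them.
    raw = re.split(r"[ \t\n]", s)
    # Step 2: dropping the empty fields makes a whole run of
    # separators behave like one separator.
    return [f for f in raw if f]


text = "\n   These   are  some\n words  separated by   different amounts of  whitespace.\n"
fields = whitespace_fields(text)
print(" ".join(fields[2:4]))   # some words
```

The slice fields[2:4] corresponds to from="3" to="4", since field positions start at 1 and the end position is inclusive.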

join="string"

Only used together with "separator", "separator-chars", or "separator-whites". If several fields are joined together to a result string, then this string is used as delimiter between the fields.

case-insensitive

Be case insensitive when matching the "after", "before", "separator", "separator-chars" and "trim-chars" strings. Case is still preserved in the returned result.
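A minimal Python sketch of what case-insensitive matching with preserved case means for "after". The helper name after_ci is hypothetical, and returning the empty string when nothing matches is an assumption based on the capping rule for positions.

```python
import re


def after_ci(s, needle):
    """Text after the first case-insensitive occurrence of `needle`."""
    m = re.search(re.escape(needle), s, re.IGNORECASE)
    # The slice is taken from the original string, so the case of the
    # returned text is untouched.
    # Assumption: no match is treated like a position past the end.
    return s[m.end():] if m else ""


print(repr(after_ci("Hello WORLD and World", "world")))   # ' and World'
```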

trim-chars="string"

Trim any sequence of the characters in this string from the start and end of the result before returning it. If "separator", "separator-chars", or "separator-whites" is specified then the trimming is done on each field.

The format of this attribute is the same as in a "%[...]" format to Pike's sscanf() function. See the "separator-chars" attribute for a description.
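The difference between trimming the whole result and trimming each field can be shown with a short Python sketch. The name trim_result is hypothetical; the trim set is treated as a plain list of characters (no range or '^' syntax), and the separator is assumed to be a non-empty string.

```python
def trim_result(s, trim_chars, separator=None):
    """Trim `trim_chars` from the result, or from each field if split."""
    if separator is None:
        # No separator: trim only the start and end of the whole result.
        return s.strip(trim_chars)
    # With a separator: trim every field, then join the fields again.
    fields = [f.strip(trim_chars) for f in s.split(separator)]
    return separator.join(fields)


print(repr(trim_result(" a , b ,c ", " ")))        # 'a , b ,c'
print(repr(trim_result(" a , b ,c ", " ", ",")))   # 'a,b,c'
```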

trimwhites

Shorthand for specifying "trim-chars" with all whitespace characters, and also slightly faster.