Lua strings — Extensions

Introduction

The lulu.string module adds extra methods to the Lua string class.

If you have imported the module as

require 'lulu.string'

You then can access dozens of extra methods for any string s.

In practice, each method belongs to a functional group, and we first list them by those groupings. The method names within any group follow a consistent scheme.

The complete methods list is also available in alphabetic order.

Queries

Queries to check that the contents of a string s are in some particular form:

String Method Description
s:contains_word(word) Does s contain a word; a string/pattern surrounded by whitespace?
s:contains(pattern) Does s contain the string or lpeg pattern pattern?
s:ends_with(suffix) Returns true if s ends with the specified suffix.
s:is_alpha() Returns true if s only contains alphabetic characters.
s:is_alphanumeric() Returns true if s only contains alphanumeric characters.
s:is_bin() Returns true if s can be interpreted as a positive binary number.
s:is_blank() Returns true if s is all white space characters.
s:is_dec() Returns true if s can be interpreted as a positive integer.
s:is_digit() Returns true if s only contains digit characters.
s:is_empty() Returns true if s is empty (length 0).
s:is_escaped(esc) Returns true if s starts with an escape character esc that defaults to the backslash character.
s:is_float() Returns true if s can be interpreted as an arbitrary floating point number.
s:is_hex() Returns true if s can be interpreted as a positive hexadecimal number.
s:is_int() Returns true if s can be interpreted as an arbitrary integer.
s:is_lower() Returns true if s only contains lowercase characters.
s:is_non_empty() Returns true if s is not the empty string.
s:is_number() Returns true if s can be interpreted as a number or some sort.
s:is_oct() Returns true if s can be interpreted as a positive octal number.
s:is_upper() Returns true if s only contains uppercase characters.
s:starts_with(prefix) Returns true if s starts with the specified prefix.

A few other queries deal with delimited text. We include them here for completeness. See a later section for all the details.

String Method Description
s:is_delimited(l, r, line) Returns true if s is marked on the left by l and the right by r.
s:contains_delimited(l,r,line) Returns true if s contains delimited text anywhere inside it.
s:is_quoted() Returns true if s is surrounded by single or double quotes.
s:contains_quoted() Returns true if s contains quoted text anywhere inside it.

End Changes

We often need to check or perhaps change just the start or end of a string:

Strings are never changed in place by these methods.
Instead, a string copy with the relevant changes is returned.
String Method Description
s:append(suffix) Returns a new string that is s with the suffix string appended to it.
s:ellipsis(n) Returns a copy of s truncated after n characters with a following ” …”
s:ensure_end(suffix) Return s if it already ends with a specified suffix. Otherwise, it returns a new padded version that does indeed end with the required prefix.
s:ensure_start(prefix) Return s if it already starts with a specified prefix. Otherwise, it returns a new padded version that does indeed start with the required prefix.
s:kill_from(p) Returns a copy of s where each line has from p to the end of that line removed. The p argument can be a number, a string, or a lpeg pattern.
s:kill_to(p) Returns a copy of s where each line is killed from its start up to and including p. The p argument can be a number, a string, or a lpeg pattern.
s:pad_end(len,fill) Return a copy of s with a specified minimum length by padding s on the right. The default fill is a space.
s:pad_start(len,fill) Return a copy of s with a specified minimum length by padding s on the left. The default fill is a space.
s:prepend(prefix) Returns a new string that is s with the prefix string prepended to it.
s:strip_comments(char) Returns a copy of s with any end-of-line comments erased from each line.
The default comment start char is the hash symbol #.
s:trim_end(chars) Returns a copy of s that has chars trimmed from the right side.
The default chars is the Lua pattern for whitespace.
s:trim_start(chars) Returns a copy of s that has chars trimmed from the left side.
The default chars is the Lua pattern for whitespace.
s:trim_ws() Returns a copy of s without any leading or trailing whitespace.
s:trim(chars) Returns a copy of s that has chars trimmed from both the left & right sides.
The default chars is the Lua pattern for whitespace.
s:truncate(n) Returns a copy of a string where each line is truncated at n characters.

General Changes

Other methods change strings anywhere as opposed to those above that only effect the start or end.

Strings are never changed in place by these methods.
Instead, a copy with the relevant changes is returned.
String Method Description
s:add_escapes(esc,specials) Returns a copy of s with an escape character added before characters in the string that need “escaping”. The default esc character is a backslash.
We add the esc character before all quotation marks and the backslash by default.
s:change(from, to) Returns a new string that is a copy of s with all occurrences of from replaced with to. The from argument can be a string or a lpeg pattern.
s:collapse_ws(careful) Returns a copy of s with contiguous whitespaces converted to a single space. By default, quoted substrings in s are unaltered (careful = true).
s:dedent_lines(indent) Returns a copy of s with each line’s indentation removed.
The indent argument is the string to remove from the start of each line.
s:delete_ws() Returns a copy of s with all spaces removed.
s:delete(p) Returns a copy of s where all matches to a string or pattern p are deleted.
s:from_hex() Returns a decoded version of a hex-encoded s so e.g., “68656c6c6f” -> “hello”
s:from_lua_pattern() Returns a copy of s with all the ‘%’ characters from a Lua pattern removed.
s:indent_lines(indent, ignore_first) Returns a copy of s with each line’s indentation removed.
The indent argument is the string to add to the start of each line.
The ignore_first argument is a boolean that defaults to false.
If true, the first line is not indented.
s:remove_escapes(esc) Returns a copy of s with any esc characters removed (default is a backslash).
s:to_bool() Tries to interpret s as a boolean value where we recognise yes/no, on/off, etc.
The case of s does not matter.
s:to_hex() Returns a hex-encoded version of s so e.g., “hello” -> “68656c6c6f”
s:to_lua_pattern() Returns a new version s string that can be safely used in Lua’s pattern matching functions.
s:unescape(chars,esc) Returns a copy of s with specific add_escapes characters turned into remove_escapes ones. The default character to unescape is the double quotation mark, and the default escape character is the backslash.

Example:

local str = [[joe \{said\} \"no way\".]]
print("str:                      ",  str)
print("str:remove_escapes():     ",  str:remove_escapes())
print("str:unescape_chars('{}'): ",  str:unescape_chars('{}'))

Output

str:                        joe \{said\} \"no way\".
1str:remove_escapes():       joe {said} "no way".
2str:unescape_chars('{}'):   joe {said} \"no way\".
1
All backslashes are removed from str.
2
Only the backslashed braces \{ and \} are altered.

Tokenising

String parsing generally involves splitting text into tokens/words that are delimited by separators such as whitespace. We have several methods to help with that task:

String Method Description
s:after(sep) Returns the text after the separator sep.
The separator can be a string or a lpeg pattern.
s:before(sep) Returns the text after the separator sep.
The separator can be a string or a lpeg pattern.
s:split(sep) Returns an array of “tokens” in s split by a separator sep.
The default separator is the whitespace pattern.
s:to_blocks() Returns a table of all the text blocks/paragraphs in s.
Blocks/paragraphs are pieces of text delimited by one or more blank lines.
s:to_lines() Returns an array of all the “lines” in s by splitting on a newline character.
s:to_tokens(sep) Returns an array of all the “tokens” in s by splitting it on a separator/pattern sep.
The default separator is the whitespace pattern.

We also have a couple of methods to access the individual characters in a string:

String Method Description
s:iter() Returns an iterator for the characters in s
This can be used as for c in s:iter() do ... end
This iterator is Unicode-aware.
s:to_array() Returns an array that contains the individual characters in s.
This is Unicode-aware.

Delimited Text

Given a text block, you often need to extract substrings surrounded by markers. For example, given a block of C/C++ code, you might want to delete any multiline comment. C/C++ multiline comments begin with the left delimiter /* and end with the right delimiter */.

We have methods that work on text containing delimited strings. Some specialisations work on strings delimited by quotation marks, a typical case of interest.

String Method Description
s:all_delimited(l,r,line) Returns an array of all delimited substrings in s.
s:all_quoted() Returns an array of all quoted substrings in s
s:contains_delimited(l,r,line) Returns true if s contains delimited text anywhere inside it.
s:contains_quoted() Returns true if s contains quoted text anywhere inside it.
s:contains_word(word) Does s contain a particular word which is a string or any lpeg pattern surrounded by whitespace? This is the simplest form of delimited text.
s:first_delimited(l,r,line) Returns the first delimited substring in s.
s:first_quoted() Returns the first quoted substring in s.
s:is_delimited(l,r,line) Returns true if s is marked on the left by l and the right by r.
s:is_quoted() Returns true if s is surrounded by single or double quotes.

The various *_delimited(l, r, line) methods take three optional arguments: l and r are the left and right delimiters to look for, and both default to the double quotation mark. The line argument is a boolean that defaults to false. If set to true, matches stop at the first newline character in the string.

The full rules for delimiters are:

  1. If neither is given, we set them to the default double quote character ".
  2. If only one is given and it’s a single character (e.g. ') then we use that for both delimiters.
  3. If only one is given and it has an even number of characters, then we split that into half. For example, if we are passed the single delimiter “{}”, then we split it in half, so in effect l = '{' and r = '}'.
  4. Finally, you can specify l and r separately.

Example

local s = "Here is a string with {text in braces}"
print("is_delimited:       ", s:is_delimited("{}"))
print("contains_delimited: ", s:contains_delimited("{}"))
print("first_delimited:    ", s:first_delimited("{}"))
print("all_delimited:      ", table.concat(s:all_delimited("{}"), ", "))

Output

is_delimited:           false
contains_delimited:     true
first_delimited:        text in braces
all_delimited:          text in braces

Another Example

local s = "{Here} {is a string} {with} {text in braces}"
print("is_delimited:       ", s:is_delimited("{}"))
print("contains_delimited: ", s:contains_delimited("{}"))
print("first_delimited:    ", s:first_delimited("{}"))
print("all_delimited:      ", table.concat(s:all_delimited("{}"), ", "))

Output

is_delimited:           true
contains_delimited:     true
first_delimited:        Here
all_delimited:          Here, is a string, with, text in braces

Note that in this case, s starts with a "{" and ends with a "}" so the entire string is indeed delimited. However, the first_delimited and all_delimited methods work in a non-greedy manner, which is almost certainly what you would want.

All Methods

Here is an alphabetic list of all the methods the lulu.string module adds to the Lua string class for any string s.

Click on a method to jump to the documentation section for the corresponding functional group.

See Also

lulu.table
lulu.xpeg

Back to top