Lua String Manipulation

This guide discusses how to manipulate and match strings in Lua.

Overview

The Corona string library provides generic functions for string manipulation, such as pattern matching and finding/extracting substrings. When indexing a string in Lua, the first character is at position 1, not at position 0 as in C. Indices are allowed to be negative and are interpreted as indexing backwards from the end of the string. Thus, the last character is at position -1, and so on.
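As a minimal sketch of the indexing rules above, the following uses string.sub() with positive and negative positions. The sample string and variable names are illustrative only.

```lua
local s = "Corona"

-- Positive indices count from the start, beginning at 1.
local first = string.sub(s, 1, 1)      -- "C"

-- Negative indices count backwards from the end of the string.
local last = string.sub(s, -1)         -- "a"
local lastThree = string.sub(s, -3)    -- "ona"

-- Positive and negative indices can be mixed.
local middle = string.sub(s, 2, -2)    -- "oron"

print(first, last, lastThree, middle)
```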

The string library provides all of its functions inside the table string. It also sets a metatable for strings where the __index field points to the string table. Therefore, you can use the string functions in object-oriented style. For instance, string.byte(s,i) can be written as s:byte(i).
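The two calling styles can be compared side by side. This is a small sketch that assumes an ASCII-compatible encoding, where the code for "L" is 76.

```lua
local s = "Lua"

-- Library-style call through the string table.
local a = string.byte(s, 1)

-- Equivalent method-style call, resolved through the string metatable.
local b = s:byte(1)

print(a, b)                      -- 76  76

-- Any string function can be called this way.
local shout = s:upper()          -- "LUA"

-- A string literal must be wrapped in parentheses before a method call.
local doubled = ("abc"):rep(2)   -- "abcabc"

print(shout, doubled)
```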

The string library assumes one-byte character encodings.
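One practical consequence is that multi-byte text, such as UTF-8, is counted and processed byte by byte. The sketch below writes the UTF-8 encoding of "é" with decimal escapes so that it does not depend on the encoding of the source file.

```lua
-- "é" in UTF-8 is the two bytes 195 and 169.
local e = "\195\169"

-- The length is counted in bytes, so this is 2 rather than 1.
local length = string.len(e)

-- Each byte is reported separately.
local b1, b2 = string.byte(e, 1, 2)   -- 195, 169

-- Reversing works on bytes as well, so the two-byte sequence is broken apart.
local reversed = string.reverse("a" .. e)

print(length, b1, b2)
```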

String Functions

Function          Description
string.byte()     Returns the internal numerical codes of the characters in a string.
string.char()     Returns a string in which each character has the internal numerical code equal to its corresponding argument.
string.find()     Looks for the first match of a pattern in a string. If found, it returns the indices where the occurrence starts and ends; otherwise, returns nil.
string.format()   Returns a formatted string following the description given in its arguments.
string.gmatch()   Returns a pattern-finding iterator.
string.gsub()     Replaces all occurrences of a pattern in a string.
string.len()      Returns the length of a string in bytes (equal to the number of characters in a one-byte encoding).
string.lower()    Changes uppercase characters in a string to lowercase.
string.match()    Extracts substrings by matching patterns in a string.
string.rep()      Replicates a string by returning a string that is the concatenation of n copies of a specified string.
string.reverse()  Reverses a string.
string.sub()      Returns a substring (a specified portion of an existing string).
string.upper()    Changes lowercase characters in a string to uppercase.
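The functions that do not involve patterns can be tried with a short sketch like the one below. The sample values are illustrative, and the character codes assume an ASCII-compatible encoding.

```lua
local s = "Hello"

local code     = string.byte(s, 1)       -- 72, the code for "H"
local fromCode = string.char(72, 105)    -- "Hi"
local message  = string.format("%s has %d characters", s, string.len(s))
local lower    = string.lower(s)         -- "hello"
local upper    = string.upper(s)         -- "HELLO"
local repeated = string.rep("ab", 3)     -- "ababab"
local reversed = string.reverse(s)       -- "olleH"
local part     = string.sub(s, 2, 4)     -- "ell"

print(message)   -- Hello has 5 characters
```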

String Patterns

String functions such as string.find(), string.match(), string.gmatch(), and string.gsub() take a string pattern that describes what to search for or replace. A pattern must be constructed carefully: a pattern that fails to match makes string.find() and string.match() return nil, and a pattern that matches more than intended makes string.gsub() alter the wrong text.
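The following sketch shows the four pattern functions on one sample string, followed by a case where an unescaped magic character changes more than intended. The sample text and variable names are illustrative only.

```lua
local text = "width=640, height=480"

-- string.find returns the start and end indices of the first match, or nil.
local startPos, endPos = string.find(text, "%d+")   -- 7, 9
local missing = string.find(text, "depth")          -- nil

-- string.match returns the matched text itself.
local firstNumber = string.match(text, "%d+")       -- "640"

-- string.gmatch iterates over every match in turn.
local numbers = {}
for n in string.gmatch(text, "%d+") do
    numbers[#numbers + 1] = tonumber(n)
end
-- numbers now holds 640 and 480

-- string.gsub returns the new string and the number of substitutions made.
local replaced, count = string.gsub(text, "%d+", "?")   -- "width=?, height=?", 2

-- An unescaped "." matches any character, so every character is replaced.
local wrong = string.gsub("3.14", ".", ",")    -- ",,,,"
-- Escaping it with % matches only the literal dot.
local right = string.gsub("3.14", "%.", ",")   -- "3,14"

print(startPos, endPos, firstNumber, replaced, count, wrong, right)
```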

Character Classes

A character class represents a set of characters. The following combinations are allowed in describing a character class. Any single character that is not one of the magic characters ^, $, (, ), %, ., [, ], *, +, -, or ? represents itself.

Class     Description
.         Represents all characters.
%a        Represents all letters.
%c        Represents all control characters.
%d        Represents all digits.
%l        Represents all lowercase letters.
%p        Represents all punctuation characters.
%s        Represents all space characters.
%u        Represents all uppercase letters.
%w        Represents all alphanumeric characters.
%x        Represents all hexadecimal digits.
%z        Represents the character with representation 0.
%x        Here x stands for any non-alphanumeric character (not the letter x), and the class represents that character itself. This is the standard way to escape the magic characters. Any punctuation character (even a non-magic one) can be preceded by a % when used to represent itself in a pattern.
[set]     Represents the union of all characters in set. A range of characters can be specified by separating the end characters of the range with a - (hyphen). All classes described above can also be used as components in the set. All other characters in set represent themselves. For example, [%w_] or [_%w] represents all alphanumeric characters plus the underscore. [0-7] represents the octal digits. [0-7%l%-] represents the octal digits plus lowercase letters plus the - character. The interaction between ranges and classes is not defined, so patterns like [%a-z] or [a-%%] have no meaning.
[^set]    Represents the complement of set as interpreted above.
Notes
  • For all classes represented by single letters — %a, %c, etc. — the corresponding uppercase letter represents the complement of the class. For instance, %S represents all non-space characters.

  • The definitions of letter, space, and other character groups depend on the current locale. In particular, the class [a-z] may not be equivalent to %l.
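The classes, sets, and complements described above can be exercised with a few calls to string.match() and string.gsub(). This is a sketch with made-up sample strings, and it assumes the default "C" locale.

```lua
local s = "Order #42: 3 apples, $7.50"

local digits = string.match(s, "%d+")          -- "42"
local word   = string.match(s, "%a+")          -- "Order"

-- %$ and %. escape magic characters; the set [%d%.] is digits plus the dot.
local price  = string.match(s, "%$([%d%.]+)")  -- "7.50"

-- A set combining a class with a literal underscore.
local identifier = string.match("my_var1 = 10", "[%w_]+")   -- "my_var1"

-- A range inside a set.
local octal = string.match("0755 rest", "[0-7]+")           -- "0755"

-- An uppercase class letter is the complement: %S is any non-space character.
local trimmed = string.match("  padded  ", "%S+")           -- "padded"

-- [^set] is the complement of a set.
local notDigits = string.match("abc123", "[^%d]+")          -- "abc"

-- Removing every character that belongs to a set.
local noVowels = string.gsub("pattern", "[aeiou]", "")      -- "pttrn"

print(digits, word, price, identifier, octal, trimmed, notDigits, noVowels)
```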

Patterns and Pattern Items

A pattern is a sequence of pattern items (see below). A ^ at the beginning of a pattern anchors the match at the beginning of the subject string. A $ at the end of a pattern anchors the match at the end of the subject string. At other positions, ^ and $ have no special meaning and simply represent themselves. A pattern cannot contain embedded zeros; use %z instead.
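The effect of the two anchors can be seen in a short sketch such as the following. The sample strings are illustrative only.

```lua
local s = "hello world"

-- ^ anchors the match at the beginning of the subject string.
local atStart    = string.find(s, "^hello")   -- 1
local notAtStart = string.find(s, "^world")   -- nil

-- $ anchors the match at the end of the subject string.
local atEnd = string.find(s, "world$")        -- 7

-- Using both anchors requires the whole string to match.
local allDigits   = string.match("12345", "^%d+$")   -- "12345"
local mixedDigits = string.match("123a5", "^%d+$")   -- nil

-- At other positions, ^ is an ordinary character.
local caretStart, caretEnd = string.find("a^b", "a^b")   -- 1, 3

print(atStart, notAtStart, atEnd, allDigits, mixedDigits, caretStart, caretEnd)
```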

A pattern item is defined by any of the following:

  • A single character class, which matches any single character within the class.

  • A single character class followed by *, which matches 0 or more repetitions of characters in the class. These repetition items will always match the longest possible sequence.

  • A single character class followed by +, which matches 1 or more repetitions of characters in the class. These repetition items will always match the longest possible sequence.

  • A single character class followed by -, which also matches 0 or more repetitions of characters in the class. Unlike *, these repetition items will always match the shortest possible sequence.

  • A single character class followed by ?, which matches 0 or 1 occurrences of a character in the class.

  • %n, for n between 1 and 9; this pattern item matches a substring equal to the n-th captured string (see Captures below).

  • %bxy, where x and y are two distinct characters; this pattern item matches strings that start with x, end with y, and where the x and y are balanced. To clarify, if you read the string from left to right, counting +1 for an x and -1 for a y, the ending y is the first y where the count reaches 0. For example, the item %b() matches expressions with balanced parentheses.
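The pattern items listed above can be compared in a small sketch. The sample strings are made up for illustration; note in particular the difference between the greedy * and the lazy - on the same input.

```lua
local html = "<b>bold</b> and <i>italic</i>"

-- * is greedy: it runs to the last ">" in the string.
local greedy = string.match(html, "<.*>")   -- "<b>bold</b> and <i>italic</i>"

-- - is lazy: it stops at the first ">".
local lazy = string.match(html, "<.->")     -- "<b>"

-- + requires at least one repetition; * also accepts none.
local plus = string.match("aaab", "a+")     -- "aaa"
local star = string.match("b", "a*")        -- "" (an empty match)

-- ? makes the preceding class optional.
local american = string.match("color", "colou?r")    -- "color"
local british  = string.match("colour", "colou?r")   -- "colour"

-- %b() matches a balanced pair of parentheses, including nested ones.
local balanced = string.match("f(a(b)c) + g(d)", "%b()")   -- "(a(b)c)"

-- %1 matches the same text as the first capture, here a doubled letter.
local doubledLetter = string.match("hello", "(%a)%1")      -- "l"

print(greedy, lazy, plus, star, american, british, balanced, doubledLetter)
```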

Captures

A pattern can contain sub-patterns enclosed in parentheses. These describe captures. When a match succeeds, the substrings of the main string that match captures are stored (captured) for future use. Captures are numbered according to their left parentheses. For instance, in the pattern (a*(.)%w(%s*)), the part of the string matching a*(.)%w(%s*) is stored as the first capture and therefore has number 1. The character matching . is captured with number 2, and the part matching %s* has number 3.
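The numbering of captures can be checked with a sketch like the one below. The first call uses a made-up date string, the second applies the nested pattern from the paragraph above to an illustrative input, and the third shows captures being reused in a string.gsub() replacement through %1 and %2.

```lua
-- Each pair of parentheses returns one captured value, in order.
local year, month, day = string.match("2024-03-15", "(%d+)%-(%d+)%-(%d+)")
-- "2024", "03", "15"

-- Nested captures are numbered by the position of their left parenthesis.
local outer, inner, spaces = string.match("aaXb  ", "(a*(.)%w(%s*))")
-- outer  = "aaXb  " (capture 1, the whole match)
-- inner  = "X"      (capture 2, the character matching .)
-- spaces = "  "     (capture 3, the part matching %s*)

-- In a gsub replacement string, %1 and %2 refer to the captures.
local swapped = string.gsub("hello world", "(%w+) (%w+)", "%2 %1")
-- "world hello"

print(year, month, day, swapped)
```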

As a special case, the empty capture () captures the current string position as a number. For instance, if we apply the pattern ()aa() on the string flaaap, there will be two captures: 3 and 5.
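The position capture can be tried directly, as in this sketch. The second part is an illustrative use that collects the starting position of every word in a made-up sample string.

```lua
-- The example from the text: positions before and after the matched "aa".
local before, after = string.match("flaaap", "()aa()")   -- 3, 5

-- Collect the starting position of every run of letters.
local positions = {}
for pos in string.gmatch("one two three", "()%a+") do
    positions[#positions + 1] = pos
end
-- positions now holds 1, 5, and 9

print(before, after, positions[1], positions[2], positions[3])
```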