Module nre

What is NRE?

A regular expression library for Nim using PCRE to do the hard work.

Licencing

PCRE has some additional terms that you must agree to in order to use this module.

Example

import nre

let vowels = re"[aeoui]"

for match in "moigagoo".findIter(vowels):
  echo match.matchBounds
# (a: 1, b: 1)
# (a: 2, b: 2)
# (a: 4, b: 4)
# (a: 6, b: 6)
# (a: 7, b: 7)

import options  # critical to use isSome() and get()
let firstVowel = "foo".find(vowels)
let hasVowel = firstVowel.isSome()
if hasVowel:
  let matchBounds = firstVowel.get().captureBounds[-1]
  echo "first vowel @", matchBounds.get().a
  # first vowel @1

Types

Regex = ref object
  pattern*: string             ## not nil
  pcreObj: ptr pcre.Pcre        ## not nil
  pcreExtra: ptr pcre.ExtraData ## nil
  captureNameToId: Table[string, int]
Represents the pattern that things are matched against, constructed with re(string). Examples: re"foo", re(r"(*ANYCRLF)(?x)foo # comment").
pattern: string
the string that was used to create the pattern.
captureCount: int
the number of captures that the pattern has.
captureNameId: Table[string, int]
a table from the capture names to their numeric id.

Options

The following options may appear anywhere in the pattern, and they affect the rest of it.

  • (?i) - case insensitive
  • (?m) - multi-line: ^ and $ match the beginning and end of lines, not of the subject string
  • (?s) - . also matches newline (dotall)
  • (?U) - expressions are not greedy by default. ? can be added to a quantifier to make it greedy
  • (?x) - whitespace and comments (#) are ignored (extended)
  • (?X) - character escapes without special meaning (\w vs. \a) are errors (extra)

One or a combination of these options may appear only at the beginning of the pattern:

  • (*UTF8) - treat both the pattern and subject as UTF-8
  • (*UCP) - Unicode character properties; \w matches я
  • (*U) - a combination of the two options above
  • (*FIRSTLINE) - fails if there is not a match on the first line
  • (*NO_AUTO_CAPTURE) - turn off auto-capture for groups; (?<name>...) can be used to capture
  • (*CR) - newlines are separated by \r
  • (*LF) - newlines are separated by \n (UNIX default)
  • (*CRLF) - newlines are separated by \r\n (Windows default)
  • (*ANYCRLF) - newlines are separated by any of the above
  • (*ANY) - newlines are separated by any of the above and Unicode newlines:

    single characters VT (vertical tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS (paragraph separator, U+2029). For the 8-bit library, the last two are recognized only in UTF-8 mode. — man pcre

  • (*JAVASCRIPT_COMPAT) - JavaScript compatibility
  • (*NO_STUDY) - turn off studying; study is enabled by default

For more details on the leading option groups, see the Option Setting and the Newline Convention sections of the PCRE syntax manual.
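As a brief illustration of the inline options listed above, the following sketch checks a few of them with contains. The subject strings are made up for the example, and it assumes the default PCRE build in which . does not match a newline unless (?s) is given.

```nim
import nre

# (?i) makes the rest of the pattern case-insensitive.
doAssert "FOO".contains(re"(?i)foo")
doAssert not "FOO".contains(re"foo")

# (?s) lets . match a newline as well.
doAssert "a\nb".contains(re"(?s)a.b")
doAssert not "a\nb".contains(re"a.b")

# (?x) ignores whitespace and # comments inside the pattern.
doAssert "foo".contains(re"(?x) f o o  # spelled out")
```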

RegexMatch = object
  pattern*: Regex              ## The regex doing the matching.
                ## Not nil.
  str*: string                 ## The string that was matched against.
             ## Not nil.
  pcreMatchBounds: seq[Slice[cint]] ## First item is the bounds of the match
                                  ## Other items are the captures
                                  ## `a` is inclusive start, `b` is exclusive end
  
Usually seen as Option[RegexMatch], it represents the result of an execution. On failure, it is none; on success, it is some.
pattern: Regex
the pattern that is being matched
str: string
the string that was matched against
captures[]: string
the string value of whatever was captured at that id. If the id is invalid, then behavior is undefined. If the id is -1, then the whole match is returned. If the given capture was not matched, nil is returned.
  • "abc".match(re"(\w)").captures[0] == "a"
  • "abc".match(re"(?<letter>\w)").captures["letter"] == "a"
  • "abc".match(re"(\w)\w").captures[-1] == "ab"
captureBounds[]: Option[Slice[int]]
gets the bounds of the given capture according to the same rules as the above. If the capture is not filled, then None is returned. The bounds are both inclusive.
  • "abc".match(re"(\w)").captureBounds[0] == 0 .. 0
  • "abc".match(re"").captureBounds[-1] == 0 .. -1
  • "abc".match(re"abc").captureBounds[-1] == 0 .. 2
match: string
the full text of the match.
matchBounds: Slice[int]
the bounds of the match, as in captureBounds[]
(captureBounds|captures).toTable
returns a table with each named capture as a key.
(captureBounds|captures).toSeq
returns all the captures by their number.
$: string
same as match
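The bullet examples above leave out the Option unwrapping for brevity. A minimal sketch of reading these fields from a real result might look like the following; the names m and found are chosen for the example only.

```nim
import nre, options

let m = "abc".match(re"(?<letter>\w)\w")   # Option[RegexMatch]
doAssert m.isSome
let found = m.get()                        # unwrap before reading fields

doAssert found.match == "ab"               # full text of the match
doAssert found.matchBounds == (0 .. 1)     # both bounds are inclusive
doAssert found.captures[0] == "a"          # first capture, by number
doAssert found.captures["letter"] == "a"   # the same capture, by name
doAssert $found == "ab"                    # $ is the same as match
```

Checking isSome (or isNone) before calling get avoids an UnpackError when the pattern did not match.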
Captures = distinct RegexMatch
CaptureBounds = distinct RegexMatch
RegexError = ref object of Exception
RegexInternalError = ref object of RegexError
  
Internal error in the module; this probably means that there is a bug.
InvalidUnicodeError = ref object of RegexError
  pos*: int                    ## the location of the invalid unicode in bytes
  
Thrown when matching fails due to invalid Unicode in strings.
SyntaxError = ref object of RegexError
  pos*: int                    ## the location of the syntax error in bytes
  pattern*: string             ## the pattern that caused the problem
  
Thrown when there is a syntax error in the regular expression string passed in.
StudyError = ref object of RegexError
  
Thrown when studying the regular expression fails for whatever reason. The message contains the error code.
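A minimal sketch of handling a malformed pattern, assuming that an unclosed group is reported by PCRE as a syntax error. The pattern text and the caught flag are illustrative only.

```nim
import nre

var caught = false
try:
  # The group is never closed, so compiling the pattern should fail.
  discard re("(unclosed")
except SyntaxError:
  caught = true

doAssert caught
```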

Procs

proc captureCount(pattern: Regex): int {.raises: [FieldError, ValueError], tags: [].}
proc captureNameId(pattern: Regex): Table[string, int] {.raises: [], tags: [].}
proc captureBounds(pattern: RegexMatch): CaptureBounds {.raises: [], tags: [].}
proc captures(pattern: RegexMatch): Captures {.raises: [], tags: [].}
proc `[]`(pattern: CaptureBounds; i: int): Option[Slice[int]] {.raises: [], tags: [].}
proc `[]`(pattern: Captures; i: int): string {.raises: [UnpackError], tags: [].}
proc match(pattern: RegexMatch): string {.raises: [UnpackError], tags: [].}
proc matchBounds(pattern: RegexMatch): Slice[int] {.raises: [UnpackError], tags: [].}
proc `[]`(pattern: CaptureBounds; name: string): Option[Slice[int]] {.
    raises: [KeyError], tags: [].}
proc `[]`(pattern: Captures; name: string): string {.raises: [UnpackError, KeyError],
    tags: [].}
proc toTable(pattern: Captures; default: string = nil): Table[string, string] {.
    raises: [UnpackError, KeyError], tags: [].}
proc toTable(pattern: CaptureBounds; default = none(Slice[int])): Table[string,
    Option[Slice[int]]] {.raises: [KeyError], tags: [].}
proc toSeq(pattern: CaptureBounds; default = none(Slice[int])): seq[Option[Slice[int]]] {.
    raises: [FieldError, ValueError], tags: [].}
proc toSeq(pattern: Captures; default: string = nil): seq[string] {.
    raises: [FieldError, ValueError, UnpackError], tags: [].}
proc `$`(pattern: RegexMatch): string {.raises: [UnpackError], tags: [].}
proc `==`(a, b: Regex): bool {.raises: [], tags: [].}
proc `==`(a, b: RegexMatch): bool {.raises: [], tags: [].}
proc re(pattern: string): Regex {.raises: [KeyError, SyntaxError, StudyError,
                                      FieldError, ValueError], tags: [].}
proc match(str: string; pattern: Regex; start = 0; endpos = int.high): Option[RegexMatch] {.raises: [
    FieldError, ValueError, AssertionError, AccessViolationError,
    RegexInternalError, InvalidUnicodeError], tags: [].}
Like find(...), but anchored to the start of the string. This means that "foo".match(re"f") succeeds (isSome), while "foo".match(re"o") fails (isNone).
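The difference between match and find can be sketched as follows. Both return an Option[RegexMatch], so the results are checked with isSome and isNone from the options module.

```nim
import nre, options

doAssert "foo".match(re"f").isSome   # the pattern matches at index 0
doAssert "foo".match(re"o").isNone   # "o" does not occur at index 0
doAssert "foo".find(re"o").isSome    # find scans forward for a match
doAssert "foo".find(re"o").get().matchBounds == (1 .. 1)
```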
proc find(str: string; pattern: Regex; start = 0; endpos = int.high): Option[RegexMatch] {.raises: [
    FieldError, ValueError, AssertionError, AccessViolationError,
    RegexInternalError, InvalidUnicodeError], tags: [].}
Finds the given pattern in the string between the start and end positions.
start
The start point at which to start matching. |abc is 0; a|bc is 1
endpos
The maximum index for a match; int.high means the end of the string, otherwise it’s an inclusive upper bound.
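A small sketch of how start and endpos narrow the search, assuming endpos is treated as an inclusive upper bound as described above. The subject string s is made up for the example.

```nim
import nre, options

let s = "abcabc"

# With start = 1 the "a" at index 0 is skipped, so the next one is found.
doAssert s.find(re"a", start = 1).get().matchBounds == (3 .. 3)

# endpos = 1 limits the search to indices 0..1, where there is no "c".
doAssert s.find(re"c", endpos = 1).isNone

# endpos = 2 includes index 2, so the first "c" is found.
doAssert s.find(re"c", endpos = 2).get().matchBounds == (2 .. 2)
```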
proc findAll(str: string; pattern: Regex; start = 0; endpos = int.high): seq[string] {.raises: [
    FieldError, ValueError, UnpackError, AssertionError, AccessViolationError,
    RegexInternalError, InvalidUnicodeError], tags: [].}
proc contains(str: string; pattern: Regex; start = 0; endpos = int.high): bool {.raises: [
    FieldError, ValueError, AssertionError, AccessViolationError,
    RegexInternalError, InvalidUnicodeError], tags: [].}
Determine if the string contains the given pattern between the start and end positions:
  • "abc".contains(re"bc") == true
  • "abc".contains(re"cd") == false
  • "abc".contains(re"a", start = 1) == false

Same as isSome(str.find(pattern, start, endpos)).

proc split(str: string; pattern: Regex; maxSplit = - 1; start = 0): seq[string] {.raises: [
    FieldError, ValueError, UnpackError, AssertionError, AccessViolationError,
    RegexInternalError, InvalidUnicodeError], tags: [].}
Splits the string with the given regex. This works according to the rules that Perl and JavaScript use:
  • If the match is zero-width, then the string is still split: "123".split(re"") == @["1", "2", "3"].
  • If the pattern has a capture in it, it is added after the string split: "12".split(re"(\d)") == @["", "1", "", "2", ""].
  • If maxSplit != -1, then the string will only be split maxSplit - 1 times. This means that there will be maxSplit strings in the output seq. "1.2.3".split(re"\.", maxSplit = 2) == @["1", "2.3"]

start behaves the same as in find(...).
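The splitting rules above can be checked directly. The first three assertions restate the documented cases; the last one is an extra illustrative case with a made-up subject string.

```nim
import nre

# A zero-width match still splits the string.
doAssert "123".split(re"") == @["1", "2", "3"]

# Captures in the pattern are added to the output after each piece.
doAssert "12".split(re"(\d)") == @["", "1", "", "2", ""]

# maxSplit caps the number of strings in the result.
doAssert "1.2.3".split(re"\.", maxSplit = 2) == @["1", "2.3"]

# Illustrative: split on commas with optional surrounding whitespace.
doAssert "a, b ,c".split(re"\s*,\s*") == @["a", "b", "c"]
```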

proc replace(str: string; pattern: Regex; subproc: proc (match: RegexMatch): string): string {.raises: [
    FieldError, ValueError, UnpackError, AssertionError, AccessViolationError,
    RegexInternalError, InvalidUnicodeError], tags: [].}

Replaces each match of Regex in the string with sub, which should never be or return nil.

If sub is a proc (RegexMatch): string, then it is executed with each match and the return value is the replacement value.

If sub is a proc (string): string, then it is executed with the full text of the match and the return value is the replacement value.

If sub is a string, the syntax is as follows:

  • $$ - literal $
  • $123 - capture number 123
  • $foo - named capture foo
  • ${foo} - same as above
  • $1$# - first and second captures
  • $# - first capture
  • $0 - full match

If a given capture is missing, a ValueError exception is thrown.
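The three forms of sub can be sketched as follows. The subject strings and the anonymous procs are made up for the example.

```nim
import nre

# String form: $1 and $2 refer to numbered captures.
doAssert "2024-05".replace(re"(\d+)-(\d+)", "$2/$1") == "05/2024"

# String form: ${name} refers to a named capture.
doAssert "key=value".replace(re"(?<k>\w+)=(?<v>\w+)", "${v}=${k}") == "value=key"

# String form: $$ is a literal dollar sign and $0 is the full match.
doAssert "5".replace(re"\d", "$$$0") == "$5"

# proc (string): string receives the full text of each match.
doAssert "abc".replace(re"b", proc (m: string): string = m & m) == "abbc"

# proc (RegexMatch): string receives the match object itself.
doAssert "a1b22".replace(re"\d+",
  proc (m: RegexMatch): string = "<" & m.match & ">") == "a<1>b<22>"
```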

proc replace(str: string; pattern: Regex; subproc: proc (match: string): string): string {.raises: [
    FieldError, ValueError, UnpackError, AssertionError, AccessViolationError,
    RegexInternalError, InvalidUnicodeError], tags: [].}
proc replace(str: string; pattern: Regex; sub: string): string {.raises: [FieldError,
    ValueError, UnpackError, AssertionError, AccessViolationError,
    RegexInternalError, InvalidUnicodeError, KeyError, Exception], tags: [].}
proc escapeRe(str: string): string {.raises: [FieldError, ValueError, UnpackError,
    AssertionError, AccessViolationError, RegexInternalError, InvalidUnicodeError,
    KeyError, Exception], tags: [].}
Escapes the string so it doesn’t match any special characters. Incompatible with the Extra flag (X).
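A short sketch of why escaping matters when user-supplied text is used as a pattern. The string literal is made up for the example and only contains the metacharacters +, = and ?.

```nim
import nre

let literal = "1+1=2?"

# Unescaped, + and ? act as quantifiers, so the pattern does not
# match its own source text.
doAssert not literal.contains(re(literal))

# Escaped, the same text is matched character for character.
doAssert "is 1+1=2? yes".contains(re(escapeRe(literal)))
```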

Iterators

iterator items(pattern: CaptureBounds; default = none(Slice[int])): Option[Slice[int]] {.
    raises: [FieldError, ValueError], tags: [].}
iterator items(pattern: Captures; default: string = nil): string {.
    raises: [FieldError, ValueError, UnpackError], tags: [].}
iterator findIter(str: string; pattern: Regex; start = 0; endpos = int.high): RegexMatch {.raises: [
    FieldError, ValueError, UnpackError, AssertionError, AccessViolationError,
    RegexInternalError, InvalidUnicodeError], tags: [].}

Works the same as find(...), but finds every non-overlapping match. "2222".findIter(re"22") yields "22", "22", not "22", "22", "22".

Arguments are the same as find(...).

Variants:

  • proc findAll(...) returns a seq[string]
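The non-overlapping behaviour can be sketched with both the iterator and its seq-returning variant. The starts variable is a name chosen for the example.

```nim
import nre

# findAll collects the text of every non-overlapping match.
doAssert "2222".findAll(re"22") == @["22", "22"]

# findIter yields RegexMatch objects, so positions are available too.
var starts: seq[int] = @[]
for m in "2222".findIter(re"22"):
  starts.add(m.matchBounds.a)
doAssert starts == @[0, 2]
```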