Generic Lexer
A generic pattern-based Lexer/tokenizer tool.
The minimum Python version is 3.6.
- Version: 1.1.1
- Maintainer: Leandro Benedet Garcia
- Author: Eli Bendersky
- License: The Unlicense
- Example
  If we try to execute the following code:

    from generic_lexer import Lexer

    rules = {
        "VARIABLE": r"(?P<var_name>[a-z_]+): (?P<var_type>[A-Z]\w+)",
        "EQUALS": r"=",
        "SPACE": r" ",
        "STRING": r"\".*\"",
    }

    data = "first_word: String = \"Hello\""

    for curr_token in Lexer(rules, False, data):
        print(curr_token)
  it will give us the following output:

    VARIABLE({'var_name': 'first_word', 'var_type': 'String'}) at 0
    SPACE( ) at 18
    EQUALS(=) at 19
    SPACE( ) at 20
    STRING("Hello") at 21
class generic_lexer.Lexer(rules, skip_whitespace=False, text_buffer='')

    A simple pattern-based lexer/tokenizer.

    All the regexes are concatenated into a single one with named groups. The group names must be valid Python identifiers. Patterns without named groups have one generated automatically. Groups are then mapped to token names; a sketch of this mechanism follows the parameter list below.
    - Parameters
        rules (Union[Dict[str, Pattern], Iterable[Tuple[str, Pattern]]]) – A list of rules. Each rule is a (str, re.Pattern) pair, where str is the type of the token to return when it is recognized and re.Pattern is the regular expression used to recognize the token.
        skip_whitespace (bool) – If True, whitespace (\s+) will be skipped and not reported by the lexer. Otherwise, you have to specify your rules for whitespace, or it will be flagged as an error.
        text_buffer (str) – the string to generate the tokens from.
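    To make the concatenation mechanism concrete, here is a minimal sketch of the idea using only the standard re module; it is not the library's actual implementation:

        import re

        rules = {
            "VARIABLE": r"(?P<var_name>[a-z_]+): (?P<var_type>[A-Z]\w+)",
            "EQUALS": r"=",
        }

        parts = []
        for token_name, pattern in rules.items():
            # Patterns without a named group get one generated from the token
            # name, which is why group names must be valid Python identifiers.
            if "(?P<" not in pattern:
                pattern = f"(?P<{token_name}>{pattern})"
            parts.append(pattern)

        # All rules concatenated into a single regex, joined by alternation;
        # the named group(s) that matched identify the recognized token.
        master_pattern = re.compile("|".join(parts))

        match = master_pattern.match("first_word: String")
        print(match.groupdict())
        # {'var_name': 'first_word', 'var_type': 'String', 'EQUALS': None}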
clear_text_buffer()

    Set the text buffer to a blank string and set the text pointer back to 0.
set_text_buffer(value)

    Set the text to be parsed by the lexer and set the pointer back to 0.
property text_buffer

    Set, get, or clear the text buffer; you may use del with this property to clear the text buffer.

    - Return type
        str
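    A short sketch of driving the buffer through this property, assuming (per the descriptions above) that assignment behaves like set_text_buffer and del like clear_text_buffer:

        from generic_lexer import Lexer

        rules = {"WORD": r"[a-z_]+", "SPACE": r" "}
        lexer = Lexer(rules)

        # Assigning loads new text and resets the pointer, like set_text_buffer.
        lexer.text_buffer = "hello world"
        print(lexer.text_buffer)  # hello world

        # Deleting clears the buffer, like clear_text_buffer.
        del lexer.text_buffer
        print(lexer.text_buffer)  # prints an empty line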
tokens(skip_whitespace=False)

    - Parameters
        skip_whitespace (bool) – like the skip_whitespace passed to Lexer, but applied only to the current method call.
    - Raises
        LexerError – raised with the position and character of the error in case of a lexing error (if the current chunk of the buffer matches no rule).
    - Yields
        the next token (a Token object) found in Lexer.text_buffer.
    - Return type
        Iterator[Token]
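    For example, the per-call override lets a lexer constructed with skip_whitespace=False skip whitespace for a single pass; a sketch using only the documented signatures:

        from generic_lexer import Lexer

        rules = {
            "WORD": r"[a-z_]+",
            "EQUALS": r"=",
        }

        # No whitespace rule and skip_whitespace=False at construction time...
        lexer = Lexer(rules, False, "first = second")

        # ...but the override skips whitespace for this call only, so the
        # spaces are neither reported nor flagged as errors.
        for token in lexer.tokens(skip_whitespace=True):
            print(token)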
exception generic_lexer.LexerError(char, text_buffer_pointer)

    Lexer error exception.
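    A character that matches no rule should trigger this exception; a small sketch of catching it:

        from generic_lexer import Lexer, LexerError

        rules = {"WORD": r"[a-z]+"}

        try:
            for token in Lexer(rules, True, "hello !"):
                print(token)
        except LexerError as error:
            # "!" matches no rule, so the lexer raises at its position.
            print(f"lexing failed: {error}")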
class generic_lexer.Token(name, position, val)

    A simple Token structure. Contains the token name, value and position.
As you can see, unlike the original gist, this lexer is capable of specifying multiple groups per token.
You may get the values of the tokens this way:
    >>> from generic_lexer import Lexer
    >>> rules = {
    ...     "VARIABLE": r"(?P<var_name>[a-z_]+): (?P<var_type>[A-Z]\w+)",
    ...     "EQUALS": r"=",
    ...     "STRING": r"\".*\"",
    ... }
    >>> data = "first_word: String = \"Hello\""
    >>> variable, equals, string = tuple(Lexer(rules, True, data))
    >>> variable
    VARIABLE({'var_name': 'first_word', 'var_type': 'String'}) at 0
    >>> variable.val
    {'var_name': 'first_word', 'var_type': 'String'}
    >>> variable["var_name"]
    'first_word'
    >>> variable["var_type"]
    'String'
    >>> equals
    EQUALS(=) at 19
    >>> equals.val
    '='
    >>> string
    STRING("Hello") at 21
    >>> string.val
    '"Hello"'
    - Parameters
        name – the name of the token
        position – the position at which the token was found in the text buffer
        val – the token's value
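    A small sketch of inspecting a token; .val and the printed form are confirmed by the doctest above, while the .name and .position attributes are assumed here to mirror the constructor arguments:

        from generic_lexer import Token

        token = Token("EQUALS", 19, "=")

        print(token.val)       # '=' (confirmed by the doctest above)
        print(token.name)      # assumed attribute: 'EQUALS'
        print(token.position)  # assumed attribute: 19
        print(token)           # EQUALS(=) at 19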