Generic Lexer


A generic pattern-based Lexer/tokenizer tool.

The minimum Python version is 3.6.

Version

1.1.1

Maintainer

Leandro Benedet Garcia

Author

Eli Bendersky

License

The Unlicense

Example

If we execute the following code:

from generic_lexer import Lexer

rules = {
    "VARIABLE": r"(?P<var_name>[a-z_]+): (?P<var_type>[A-Z]\w+)",
    "EQUALS": r"=",
    "SPACE": r" ",
    "STRING": r"\".*\"",
}

data = "first_word: String = \"Hello\""

for curr_token in Lexer(rules, False, data):
    print(curr_token)

It will give us the following output:

VARIABLE({'var_name': 'first_word', 'var_type': 'String'}) at 0
SPACE( ) at 18
EQUALS(=) at 19
SPACE( ) at 20
STRING("Hello") at 21
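
If we instead pass True as the second argument (skip_whitespace), the SPACE tokens are dropped and only the remaining three tokens are reported (compare with the doctest in the Token section below):

for curr_token in Lexer(rules, True, data):
    print(curr_token)

VARIABLE({'var_name': 'first_word', 'var_type': 'String'}) at 0
EQUALS(=) at 19
STRING("Hello") at 21
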
class generic_lexer.Lexer(rules, skip_whitespace=False, text_buffer='')

A simple pattern-based lexer/tokenizer.

All the regexes are concatenated into a single one with named groups. The group names must be valid Python identifiers; patterns without named groups have them generated automatically. Groups are then mapped to token names (a minimal sketch of this technique follows the parameter list).

Parameters
  • rules (Union[Dict[str, Pattern], Iterable[Tuple[str, Pattern]]]) – A mapping or iterable of rules. Each rule is a (str, re.Pattern) pair, where str is the name of the token to return when it’s recognized and re.Pattern is the regular expression used to recognize the token.

  • skip_whitespace (bool) – If True, whitespace (\s+) will be skipped and not reported by the lexer. Otherwise, you have to specify your rules for whitespace, or it will be flagged as an error.

  • text_buffer (str) – the string to generate the tokens from
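
To illustrate the technique (a minimal sketch of the idea, not the library's actual internals), two rules could be merged into a single pattern like this:

import re

rules = {
    "EQUALS": r"=",
    "STRING": r"\".*\"",
}

# Wrap every pattern in a named group so a match can be traced back
# to the token name that produced it.
combined = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in rules.items())
)

match = combined.match('= "Hello"')
print(match.lastgroup, match.group())  # EQUALS =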

clear_text_buffer()

Set the text buffer to a blank string and set the text pointer to 0

get_text_buffer()

Get the current text to be parsed by the lexer

Return type

str

set_text_buffer(value)

Set the text to be parsed by the lexer and reset the pointer to 0

property text_buffer

Set, get, or clear the text buffer. You may use del with this property to clear the text buffer (a short usage sketch follows the return type below).

Return type

str
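
A short usage sketch of the property, reusing the rules and data from the example above and assuming only the set/get/del behavior documented here:

>>> lexer = Lexer(rules, True)
>>> lexer.text_buffer = data  # same as set_text_buffer(data)
>>> lexer.text_buffer
'first_word: String = "Hello"'
>>> del lexer.text_buffer  # same as clear_text_buffer()
>>> lexer.text_buffer
''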

tokens(skip_whitespace=False)

Parameters

skip_whitespace (bool) – just like the skip_whitespace parameter passed to Lexer, but applied only to the current method call.

Raises

LexerError – raised with the position and character of the error in case of a lexing error (if the current chunk of the buffer matches no rule).

Yields

the next token (a Token object) found in the Lexer.text_buffer.

Return type

Iterator[Token]
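
For instance, the per-call override lets a lexer created with skip_whitespace=False be consumed without whitespace tokens. This sketch reuses the rules and data from the example above, and assumes Token exposes its name constructor parameter as an attribute, just as val is exposed in the example further below:

lexer = Lexer(rules, False, data)
names = [token.name for token in lexer.tokens(skip_whitespace=True)]
print(names)  # ['VARIABLE', 'EQUALS', 'STRING']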

exception generic_lexer.LexerError(char, text_buffer_pointer)

Lexer error exception.

Parameters
  • text_buffer_pointer (int) – position in the text buffer where the error occurred.

  • char (str) – the character that triggered the error
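
A sketch of the failure path, using a subset of the rules from the first example on an input whose final ? matches no rule (the exact error message is up to the library):

from generic_lexer import Lexer, LexerError

rules = {
    "VARIABLE": r"(?P<var_name>[a-z_]+): (?P<var_type>[A-Z]\w+)",
    "EQUALS": r"=",
}

try:
    for token in Lexer(rules, True, "first_word: String = ?"):
        print(token)
except LexerError as error:
    # The exception is raised with the offending character and its
    # position in the text buffer.
    print(error)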

class generic_lexer.Token(name, position, val)

A simple Token structure. Contains the token name, value and position.

Unlike the original gist this library is based on, it is capable of specifying multiple groups per token.

You may get the values of the tokens this way:

>>> from generic_lexer import Lexer
>>> rules = {
...     "VARIABLE": r"(?P<var_name>[a-z_]+): (?P<var_type>[A-Z]\w+)",
...     "EQUALS": r"=",
...     "STRING": r"\".*\"",
... }
>>> data = "first_word: String = \"Hello\""
>>> variable, equals, string = tuple(Lexer(rules, True, data))

>>> variable
VARIABLE({'var_name': 'first_word', 'var_type': 'String'}) at 0

>>> variable.val
{'var_name': 'first_word', 'var_type': 'String'}
>>> variable["var_name"]
'first_word'
>>> variable["var_type"]
'String'

>>> equals
EQUALS(=) at 19

>>> equals.val
'='

>>> string
STRING("Hello") at 21

>>> string.val
'"Hello"'

Parameters
  • name – the name of the token

  • position – the position the token was found in the text buffer

  • val – token’s value

Changelog

1.1.1

Added

  • Token can have multiple values; they can be set or retrieved as shown in the Token class example.

  • Lexer can have patterns with named groups that can be accessed through Token.