Generic Lexer
A generic pattern-based Lexer/tokenizer tool.
The minimum Python version is 3.6.
- Version: 1.1.1
- Maintainer: Leandro Benedet Garcia
- Author: Eli Bendersky
- License: The Unlicense
- Example
  If we try to execute the following code:

    from generic_lexer import Lexer

    rules = {
        "VARIABLE": r"(?P<var_name>[a-z_]+): (?P<var_type>[A-Z]\w+)",
        "EQUALS": r"=",
        "SPACE": r" ",
        "STRING": r"\".*\"",
    }

    data = "first_word: String = \"Hello\""

    for curr_token in Lexer(rules, False, data):
        print(curr_token)
  it will give us the following output:

    VARIABLE({'var_name': 'first_word', 'var_type': 'String'}) at 0
    SPACE( ) at 18
    EQUALS(=) at 19
    SPACE( ) at 20
    STRING("Hello") at 21
class generic_lexer.Lexer(rules, skip_whitespace=False, text_buffer='')

    A simple pattern-based lexer/tokenizer.

    All the regexes are concatenated into a single one with named groups. The group names must be valid Python identifiers. Patterns without named groups have one generated automatically. Groups are then mapped to token names; a sketch of this mechanism follows the parameter list below.
    - Parameters
        rules (Union[Dict[str, Pattern], Iterable[Tuple[str, Pattern]]]) – A list of rules. Each rule is a (str, re.Pattern) pair, where str is the type of the token to return when it is recognized and re.Pattern is the regular expression used to recognize the token.
        skip_whitespace (bool) – If True, whitespace (\s+) will be skipped and not reported by the lexer. Otherwise, you have to specify your rules for whitespace, or it will be flagged as an error.
        text_buffer (str) – the string to generate the tokens from.
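    To make the concatenation mechanism concrete, here is a minimal sketch of the idea using only the standard re module; it is not the library's actual implementation:

        import re

        rules = {
            "VARIABLE": r"(?P<var_name>[a-z_]+): (?P<var_type>[A-Z]\w+)",
            "EQUALS": r"=",
        }

        parts = []
        for token_name, pattern in rules.items():
            # Patterns without a named group get one generated from the token
            # name, which is why group names must be valid Python identifiers.
            if "(?P<" not in pattern:
                pattern = f"(?P<{token_name}>{pattern})"
            parts.append(pattern)

        # All rules concatenated into a single regex, joined by alternation;
        # the named group(s) that matched identify the recognized token.
        master_pattern = re.compile("|".join(parts))

        match = master_pattern.match("first_word: String")
        print(match.groupdict())
        # {'var_name': 'first_word', 'var_type': 'String', 'EQUALS': None}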
clear_text_buffer()

    Set the text buffer to a blank string and set the text pointer back to 0.
set_text_buffer(value)

    Set the text to be parsed by the lexer and set the pointer back to 0.
property text_buffer

    Set, get, or clear the text buffer; you may use del with this property to clear the text buffer.

    - Return type
        str
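    A short sketch of driving the buffer through this property, assuming (per the descriptions above) that assignment behaves like set_text_buffer and del like clear_text_buffer:

        from generic_lexer import Lexer

        rules = {"WORD": r"[a-z_]+", "SPACE": r" "}
        lexer = Lexer(rules)

        # Assigning loads new text and resets the pointer, like set_text_buffer.
        lexer.text_buffer = "hello world"
        print(lexer.text_buffer)  # hello world

        # Deleting clears the buffer, like clear_text_buffer.
        del lexer.text_buffer
        print(lexer.text_buffer)  # prints an empty line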
tokens(skip_whitespace=False)

    - Parameters
        skip_whitespace (bool) – like the skip_whitespace passed to Lexer, but applied only to the current method call.
    - Raises
        LexerError – raised with the position and character of the error in case of a lexing error (if the current chunk of the buffer matches no rule).
    - Yields
        the next token (a Token object) found in Lexer.text_buffer.
    - Return type
        Iterator[Token]
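    For example, the per-call override lets a lexer constructed with skip_whitespace=False skip whitespace for a single pass; a sketch using only the documented signatures:

        from generic_lexer import Lexer

        rules = {
            "WORD": r"[a-z_]+",
            "EQUALS": r"=",
        }

        # No whitespace rule and skip_whitespace=False at construction time...
        lexer = Lexer(rules, False, "first = second")

        # ...but the override skips whitespace for this call only, so the
        # spaces are neither reported nor flagged as errors.
        for token in lexer.tokens(skip_whitespace=True):
            print(token)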
exception generic_lexer.LexerError(char, text_buffer_pointer)

    Lexer error exception.
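    A character that matches no rule should trigger this exception; a small sketch of catching it:

        from generic_lexer import Lexer, LexerError

        rules = {"WORD": r"[a-z]+"}

        try:
            for token in Lexer(rules, True, "hello !"):
                print(token)
        except LexerError as error:
            # "!" matches no rule, so the lexer raises at its position.
            print(f"lexing failed: {error}")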
class generic_lexer.Token(name, position, val)

    A simple Token structure. Contains the token name, value and position.
As you can see, unlike the original gist, this lexer is capable of specifying multiple groups per token.
You may get the values of the tokens this way:
    >>> from generic_lexer import Lexer
    >>> rules = {
    ...     "VARIABLE": r"(?P<var_name>[a-z_]+): (?P<var_type>[A-Z]\w+)",
    ...     "EQUALS": r"=",
    ...     "STRING": r"\".*\"",
    ... }
    >>> data = "first_word: String = \"Hello\""
    >>> variable, equals, string = tuple(Lexer(rules, True, data))
    >>> variable
    VARIABLE({'var_name': 'first_word', 'var_type': 'String'}) at 0
    >>> variable.val
    {'var_name': 'first_word', 'var_type': 'String'}
    >>> variable["var_name"]
    'first_word'
    >>> variable["var_type"]
    'String'
    >>> equals
    EQUALS(=) at 19
    >>> equals.val
    '='
    >>> string
    STRING("Hello") at 21
    >>> string.val
    '"Hello"'
    - Parameters
        name – the name of the token
        position – the position at which the token was found in the text buffer
        val – the token's value
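    A small sketch of inspecting a token; .val and the printed form are confirmed by the doctest above, while the .name and .position attributes are assumed here to mirror the constructor arguments:

        from generic_lexer import Token

        token = Token("EQUALS", 19, "=")

        print(token.val)       # '=' (confirmed by the doctest above)
        print(token.name)      # assumed attribute: 'EQUALS'
        print(token.position)  # assumed attribute: 19
        print(token)           # EQUALS(=) at 19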