Direct Conversion Method of Regex to DFA
Compiler Design and Construction
Abstract
A regular expression (regex) is a formal way to describe a set of strings (a language) using a concise notation. Regex is recognized by a finite automaton (like a DFA or NFA). Regex to NFA and NFA to DFA is the traditional conversion method whereas the McNaughton-Yamada-Thompson algorithm efficiently converts regex directly to DFA. This article dives into the Algorithm and Syntax tree required for the direct conversion method of Regex to DFA.
1. Introduction
The McNaughton-Yamada-Thompson algorithm (often just called McNaughton-Yamada in some texts) converts a regular expression directly into a deterministic finite automaton (DFA) without explicitly constructing an intermediate nondeterministic finite automaton (NFA) as in Thompson’s construction.
Regex is widely used in computer science for pattern matching, searching, and text processing. Examples
2. Syntax Tree
A syntax tree, also known as a parse tree or an abstract syntax tree (AST), is a tree-like representation of the structure of a sentence or expression, based on its grammatical rules.
.Given Expression (a|b)*abb).
### Construction of Parse/Syntax /Abstract Syntax Tree.
[.] (concatenation)
/ | \
[*] a [.] (concatenation)
| / \
[|] b b
/ \
a b
Root: . (concatenation of (a|b)*, a, bb).
Left child: * (Kleene star).
Child of *: | (union of a and b).
Left child of |: a.
Right child of |: b.
Middle child: a.
Right child: . (concatenation of b and b).
Left child: b.
Right child: b.
3. Firstpos() ,Lastpos() and Followpos()
Grammar:
A grammar is a set of rules to construct an expression or set of strings with the correct symbols placement.
S
/ \
A B
| |
a b
Given expression: a . b
.
/ \
{1}a{1} {2}b{2}
where a = C1 and b = C2
firstpos of C1 or firstpos(C1) = {1}
lastpos of C1 or lastpos(C1) = {1}
firstpos of C2 or firstpos(C2) = {2}
lastpos of C2 or lastpos(C2) = {2}
4. Algorithm
4.1 Augment the given expression
for an example we have expression like this:
(a|b)^*abb.
then the augmented form is
(a | b )^*. a. b . b ) See the dots concated between them
4.2 RE to DFA by Direct Method
Rules for Computing the Function nullable, firstpos, and lastpos
4.3. Computing followpos()
To compute followpos for each position in the syntax tree. The provided code snippet outlines the algorithm:
for (each node n in the tree)
{
// n is a cat-node with left child c1 and right child c2
if (n == c1 . c2 (concatenation-node))
{
for (each i in lastpos(c1))
{
followpos(i) = followpos(i) ∪ firstpos(c2);
}
}
else if (n is a star-node)
{
for (each i in lastpos(n))
{
followpos(i) = followpos(i) ∪ firstpos(n);
}
}
}
5. Handwritten Notes
Here the example solved handwritten examples for the above expression:
Conclusion
The direct method avoids Nondeterminism and ε-Transitions, Reduces Intermediate Complexity, and Efficient Pattern matching in the conversion of regex to DFA. Implementation Complexity, Less Intuitive for Humans, and limited scalability could be the cons of this method. This method serves as the theoretical foundation method to convert the given expression into a DFA. Having DFA means having a machine-computable algorithm.
References
Appendix
A. Key Concepts in Regex to DFA Conversion
For more details, refer to the [Regex to DFA Direct Method ChatGPT Chat.pdf]
B. Google Colab Notebook Implementation of Algorithm
For more details, refer to the Google Colab
Tools