Code archives/Miscellaneous/Parser framework
This code has been declared by its author to be Public Domain code.
Download source code
This is a monstrosity of a library that allows you to describe a parser using a declarative grammar syntax, and use it to parse a token stream generated by my Lexical scanner framework.

EDIT: See the first comment for a utility that makes defining parsers with this library even easier!

"Parsing" is the process of taking a linear input stream and turning it into something meaningful, usually an abstract syntax tree. The tree is a nice way to reorganise an arbitrarily-complex and nested input into nodes, each carrying a defined, fixed amount of easily-processed meaning. e.g. the following input:

1 + 2 * 3 + 4 * 5

...could, after being tokenised, be "parsed" into a syntax tree structured like this:

    +
   / \
  1   +
     / \
    *   *
   / \ / \
  2  3 4  5

This is easy to evaluate recursively (or e.g. output as function-calls), treating each operation as a simple function that doesn't have to worry about nuisances like operator precedence or the like. We can generalise this to most formal languages, which interestingly for us includes programming, scripting and configuration languages (Blitz, C, JSON...).

Using this library is a tad more complicated than its accompanying lexer, mainly because parsing is a much more complicated task. You will need a solid grasp of the concept of a "formal grammar" in order to use this library, and indeed to design your language in the first place (it's close to impossible to design a decent programming language without defining a grammar first - don't try). Grammar is to "sentences" or program forms as regular expressions are to individual words, more or less.

Create a new parser with PAR_CreateParser. This sets a global state variable so that you don't have to keep passing the parser object to the rule functions, which would get boring really quickly (needless to say, this means the library is absolutely not threadable, as though that matters for Blitz Classic; several other global state variables are also used internally by the parser engine). Rules are defined using the five rule operators, and are given names so they can actually be used, with PAR_NameRule. The root rule that the parser will use to match the entire input stream is set with PAR_Root. The operators (cat, alt, opt, rep, err) create anonymous rule objects and return automatic names, so they must be named in order to be useful.

Operators:
-- PAR_Cat - concatenation (things following each other - this is usually implicit in BNF and similar syntaxes)
-- PAR_Alt - alternative: one of the listed options. Bar operator "|" in common syntax
-- PAR_Opt - optional: is allowed to not be present. Corresponds to "?" in common syntax
-- PAR_Rep - repetition: is repeated ZERO or more times. Corresponds to the "*" operator
-- PAR_Err - raise an error and stop parsing, returning null
-- PAR_Plus - a wrapper around Cat/Rep. Corresponds to "+" in common syntax

The components of a rule are separated by spaces. A terminal is represented by a string containing either its token-type or actual value, prefixed with # for type and $ for value. To raise an error on failing to match a terminal, add ! (e.g. #!number). If multipart rules are passed to Opt or Rep, they are treated as though they were packaged by Cat (i.e. the elements are assumed to be concatenated).

The Cat operator (and Opt or Rep, as applied to the implicit Cat) can also take an optional second "filter" argument: a filter can rename ("@name") or drop ("~") elements from the resulting match. Elements that aren't renamed, but kept with a simple "@", are numbered according to their position from the start of the match. A filter can also fold elements into their predecessors ("<"); if the predecessor is a list created with Rep, the element will be stuck on the end and numbered accordingly (this is useful when you need a slightly different rule for whatever marks the end of a list, so it can't be part of the Rep).
So:
-- PAR_Cat("atom comma atom") produces a match with the values named "0:", "1:", "2:"
-- PAR_Cat("atom comma atom", "@left ~ @right") produces a match with the atoms named "left:" and "right:", and no comma remaining in the output - it is filtered out
-- Any values not accounted for beyond the end of the filter are numbered as default

By default, the concatenating operators that form nested elements of named rules are "folded": if the result consists of only one element, it is returned directly to the containing rule in order to keep the result tree flatter (or at least manageably small - otherwise even simple expressions would produce very deeply nested results). Named rules can also be set to "fold" by passing "^" as the last element of the filter, after the rename/drop elements. If the result contains more than one node, the match will never fold, even if specifically requested to do so with "^".

If the exclamation mark "!" appears on its own as an element of a rule, then in the event that the rule fails to return a match after that point, instead of resetting and trying something else, the parser will error out. This is handy for something like a "Type" or "Function" declaration, where you know what the object is from the first word: you don't want it to try a different pattern if the match fails, because you already know there must be an error around here.

Once all the rules have been listed, the parser must be finalised with PAR_Commit. This links all of the renames to the relevant rules, compiles token type checks, etc. etc. Once PAR_Commit has been called, the parser is ready for use or storage (you can, for instance, safely create other parsers after this point if necessary).

The parser accepts input in the form of a list of LEX_Token objects. It does not take ownership of this list, but the resulting parse tree will make reference to the tokens held in the list, so the list shouldn't be freed until you're finished with the parse tree making use of it.

Note that while designed as an API, this is really intended as just a more declarative way of expressing static parsers in Blitz code. As a result, PAR_Commit and the rule definition functions use RuntimeError to signal problems (the parser engine itself uses a safe, dynamic error handling system to report errors in the input stream without crashing). In other words, verify that your grammar and so on is correct before releasing your program! This is not really intended for dynamically loading new language grammars on the fly (that would be badass, but... at that point even I have to admit defeat and say "just go and use Lisp").

As with the lexer, the parser library expects you to supply a file "Parser-Interface.bb" containing the definition of PAR_Error, to suit your program's error handling system, and PAR_GetRuleDesc, to suit your grammar. As with the lexer, the library makes no assumptions about these functions other than that they exist: leave them empty for all we care, or log errors properly. The first function is called when the Err operator matches input, with the state PAR_ERR_MATCH (in p\errorState), and also to output messages generated by other internal error states. The second is called when a rule force-fails because of the ! marker, and is intended to expand concise rule names (e.g. "func-decl") into longer, readable ones ("function body declaration").
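To make the calls above concrete, here is a minimal sketch of a complete (if useless) grammar: a comma-separated list of numbers. This is an illustration only, not part of the library or its example: the file name "Parser.bb" and the token type names "number" and "comma" are assumptions, and must match however you have saved the library and whatever your lexer rules actually emit. The tokens are built by hand here instead of running the lexer, purely to keep the sketch self-contained (a Parser-Interface.bb like the one shown further down is still needed):

;Minimal sketch: parse "7, 8, 9" as a comma-separated list of numbers
Include "Parser.bb" ;this library, under whatever name you saved it; pulls in the lexer and list includes

Local p.PAR_Parser = PAR_CreateParser()
;Zero or more ", number" pairs: the filter "~ @" drops each comma and keeps each number
PAR_NameRule "rest", PAR_Rep("#comma #!number", "~ @")
;The whole list: a first number followed by the repetition
PAR_NameRule "list", PAR_Cat("#number rest", "@first @rest")
PAR_Root "list"
PAR_Commit

;Hand-built token stream standing in for lexer output
Local tokens.LList = CreateList()
Local t.LEX_Token
t = New LEX_Token : t\tType = "number" : t\value = "7" : ListAddLast tokens, Handle t
t = New LEX_Token : t\tType = "comma"  : t\value = "," : ListAddLast tokens, Handle t
t = New LEX_Token : t\tType = "number" : t\value = "8" : ListAddLast tokens, Handle t
t = New LEX_Token : t\tType = "comma"  : t\value = "," : ListAddLast tokens, Handle t
t = New LEX_Token : t\tType = "number" : t\value = "9" : ListAddLast tokens, Handle t

Local tree.PAR_Node = PAR_Parse(p, tokens)
If tree <> Null
    Local first.PAR_Node = PAR_NodeChild(tree, "first")
    DebugLog "first: " + first\term\value
    Local rest.PAR_Node = PAR_NodeChild(tree, "rest")
    If rest\leaves <> Null ;Null leaves means the repetition matched nothing
        Local i.Iterator = GetIterator(rest\leaves) : While EachIn(i)
            Local el.PAR_Node = Object.PAR_Node i\Value
            DebugLog "rest:  " + el\term\value
        Wend
    EndIf
    PAR_FreeNode tree
EndIf
PAR_FreeParser p

Run in Debug and the Debug Log should show "first: 7" followed by "rest: 8" and "rest: 9". PAR_NodeChild and a node's leaves list are how you walk the resulting tree; each terminal node carries the LEX_Token it matched in its \term field.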
Warning: there are two major restrictions upon the rules that can be expressed by this API.

Firstly, the parser engine that interprets the rules uses something akin to recursive descent; this means no left-recursive rules, or you'll get an infinite loop that never matches anything. Secondly, the parser engine uses actual recursion (because looping is a PITA for such things), so make use of the Rep operator rather than using ludicrously deep recursion to describe list-like structures, or there will be a very real risk of stack overflow. (A tiny sketch of rewriting a left-recursive rule with Rep appears at the end of this description.)

Here's a small example, using the parser library to parse the same input we saw before into a syntax tree (the example prints to the Debug Log, so run in Debug to see anything):

...using this Parser-Interface.bb:

; Example parser error handler
;==============================

;Called in the event of a user-defined error thrown by a pattern match
;Examine p\strp to get file/line.column information on the point of failure
Function PAR_Error(p.PAR_Parser, msg$)
    DebugLog "Parser Error: " + msg
End Function

;Called in the event of an error within a specific match that cannot fail
;Replace the short-form rule identifier with a user-readable name
Function PAR_GetRuleDesc$(rule$)
    ;Have a list here of rule names to match against and get a better error description
    Return rule
End Function

This also provides an example grammar, in non-API syntax, and several examples of filters and folds in use. Hopefully this will help if you have difficulty understanding these concepts.

The printed output is a little odd (e.g. the right-hand argument to math operators is nested; argument 2 and beyond to a function are packed into one "el" - for "element list" - slot): this is a product of the structure of the grammar and its nested rules. A normalisation pass, converting this tree into a slightly more intuitive layout, would be handy - but not necessary - before actually evaluating the expression, to make it more regular. At any rate, actually evaluating a little math tree like this is dead easy from here on, so I won't spoil the fun by showing the rest of the calculator (since it's only an example and that has nothing to do with parsing).

This library depends on the accompanying lexer (obviously) and linked list include files, along with any dependencies of their own (e.g. regex).

This is a new, complicated library for a reasonably complicated subject. It is not like parser frameworks for other languages (mainly because they usually use either preprocessors, or macros/operator overloading), and it is probably still full of bugs, so expect to see some stealth-updates occurring. Either way, please feel free to ask questions and to ask for more helpful examples if you have any interest in using this tool, and please do report any bugs you encounter!
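Here is the promised left-recursion sketch. The rule and token names ("sum", "term", "#number", "$+") are placeholders invented for this illustration; they would have to match your own grammar and lexer:

;A grammar book might define addition left-recursively:
;   sum ::= sum "+" term | term
;Expressed directly, "sum" would call itself as its own first element and the
;recursive-descent engine would loop forever. Express the same language with
;repetition instead - this also keeps the result a flat list rather than a
;deeply nested chain, which avoids the stack-depth problem:
PAR_NameRule "term", PAR_Cat("#number")  ;placeholder operand rule
PAR_NameRule "sum",  PAR_Cat("term " + PAR_Rep("$+ term", "~ @"))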
; Generic parser framework
;==========================
;Dependencies
Include "Lexer.bb" ;Get this file at http://www.blitzbasic.com/codearcs/codearcs.php?code=2985
Include "LList.bb" ;Get this file at http://www.blitzbasic.com/codearcs/codearcs.php?code=2873
;User-specialised interface: supplies PAR_Error
Include "Parser-Interface.bb"
Type PAR_Parser
Field isCompiled
Field rules.LList ;List of owned rules
Field alias.LList ;List of rule aliases
Field root.PAR_Rule, rn$
Field tokens.LList, strp.ListNode
Field errorState
End Type
Type PAR_RuleAlias
Field alias$, rName$, r.PAR_Rule ;Does not own r
End Type
Type PAR_Rule
Field rs$, fs$, genName$, alias$
Field action, doFold, isTerminal
Field elems.LList, filters.LList, errOnFailAfter
End Type
Type PAR_Filter
Field name$, action
End Type
Type PAR_Node
Field rName$, eName$ ;Name of rule (as matched), name of element (as filtered)
Field term.LEX_Token, leaves.LList
End Type
Const PAR_ACTION_CAT = 1, PAR_ACTION_ALT = 2, PAR_ACTION_OPT = 3, PAR_ACTION_REP = 4, PAR_ACTION_ERR = 5
Const PAR_ACTION_CT = 6, PAR_ACTION_CV = 7, PAR_ACTION_ET = 8, PAR_ACTION_EV = 9, PAR_ACTION_EOF = 10, PAR_ACTION_PLUS = 11
Const PAR_FILTER_NAME = 1, PAR_FILTER_DROP = 2, PAR_FILTER_FOLDPREV = 3, PAR_FILTER_FOLDWITH = 4
Const PAR_ERR_NONE = 0, PAR_ERR_EX_T = -1, PAR_ERR_EX_V = -2, PAR_ERR_INCOMP = -3, PAR_ERR_UNKNOWN = -4
Const PAR_ERR_MATCH = -5, PAR_ERR_EOF = -6, PAR_ERR_EXRULE = -7
;Private globals maintaining internal state
Global PAR_private_CurrentParser_.PAR_Parser, PAR_private_Nil_.LEX_Token
Global PAR_private_NameCounter_, PAR_private_EOF_.LEX_Token, PAR_private_Exc_$
Dim PAR_private_SplitRes_$(0) : Global PAR_private_SplitCt_
;Create a new parser object. Rules are added after creation
Function PAR_CreateParser.PAR_Parser()
Local p.PAR_Parser = New PAR_Parser
p\rules = CreateList()
p\alias = CreateList()
PAR_private_CurrentParser_ = p
Return p
End Function
;Free a parser object
Function PAR_FreeParser(p.PAR_Parser)
Local i.Iterator
i = GetIterator(p\rules) : While EachIn(i)
PAR_FreeRule Object.PAR_Rule i\Value
Wend
FreeList p\rules
i = GetIterator(p\alias) : While EachIn(i)
Delete Object.PAR_RuleAlias i\Value
Wend
FreeList p\alias
Delete p
End Function
;Apply all rule additions to a parser, aliasing and compiling rules as necessary
Function PAR_Commit()
Local i.Iterator, a.PAR_RuleAlias, r.PAR_Rule, p.PAR_Parser = PAR_private_CurrentParser_
i = GetIterator(p\alias) : While EachIn(i) ;Link aliases
a = Object.PAR_RuleAlias i\Value
a\r = PAR_GetRuleByWord(a\rName)
If a\r\doFold <> 1 Then a\r\doFold = 0
a\r\alias = a\alias
Wend
i = GetIterator(p\rules) : While EachIn(i) ;Find all subelements and build elem lists
r = Object.PAR_Rule i\Value
If r\doFold Then r\doFold = 1 ;Normalise this supposed boolean
If Not r\isTerminal
PAR_Split r\rs
Local e : For e = 0 To PAR_private_SplitCt_ - 1
If PAR_private_SplitRes_(e) = "!"
r\errOnFailAfter = e
Else
ListAddLast r\elems, Handle PAR_GetRuleByWord(PAR_private_SplitRes_(e))
EndIf
Next
EndIf
Wend
p\root = PAR_GetRuleByWord(p\rn) ;Set root
p\isCompiled = True
PAR_private_CurrentParser_ = Null ;No further changes permitted after the parser has been committed
End Function
;Parse a token stream generated by the generic lexer into an abstract syntax tree
Function PAR_Parse.PAR_Node(p.PAR_Parser, stream.LList)
If Not p\isCompiled Then RuntimeError "Generic parser error: unable to run uncompiled parser, please use PAR_Commit"
If PAR_private_Nil_ = Null Then PAR_private_Nil_ = LEX_NewToken("<NIL>", "", "")
PAR_private_EOF_ = New LEX_Token : ListAddLast stream, Handle PAR_private_EOF_
p\tokens = stream : p\strp = ListNodeAtIndex(stream, 0) ;Attach token stream to parser
p\errorState = PAR_ERR_NONE : PAR_private_Exc_ = "" ;Zero error state
Local n.PAR_Node = PAR_DispatchRule(p, p\root) ;Apply root rule to the stream
If p\errorState = PAR_ERR_NONE
If p\strp\Value <> Handle PAR_private_EOF_
p\errorState = PAR_ERR_INCOMP
ElseIf n = Null
p\errorState = PAR_ERR_UNKNOWN
ElseIf n\term <> PAR_private_EOF_ And n\term <> Null
p\errorState = PAR_ERR_INCOMP
EndIf
EndIf
If p\errorState
If n <> Null Then PAR_FreeNode n
Local tok.LEX_Token = Object.LEX_Token p\strp\Value
Local err$ = "at line: " + tok\l + ", col: " + tok\c + ", file:'" + tok\file + "' - "
Select p\errorState
Case PAR_ERR_EX_T
err = err + "expecting token of type '" + PAR_private_Exc_ + "' but found '" + tok\tType + "'"
Case PAR_ERR_EX_V
err = err + "expecting token '" + PAR_private_Exc_ + "' but found '" + tok\value + "'"
Case PAR_ERR_EOF
err = err + "expecting end-of-file"
Case PAR_ERR_INCOMP
err = err + "unable to match remainder of document, parsing aborted"
Case PAR_ERR_EXRULE
err = err + "syntax error while matching " + PAR_GetRuleDesc(PAR_private_Exc_)
Case PAR_ERR_UNKNOWN
err = err + "encountered an unknown error attempting to match document; parsing aborted"
End Select
PAR_Error p, err
EndIf
Delete PAR_private_EOF_ : ListRemoveLast stream
p\tokens = Null : p\strp = Null ;Detach stream
Return n
End Function
;(Internal) Dispatch to the correct application for a given action, redirecting errors
Function PAR_DispatchRule.PAR_Node(p.PAR_Parser, r.PAR_Rule)
Local res.PAR_Node
If Not p\errorState
Select r\action
Case PAR_ACTION_CAT : res = PAR_ApplyCat(p, r)
Case PAR_ACTION_ALT : res = PAR_ApplyAlt(p, r)
Case PAR_ACTION_OPT : res = PAR_ApplyOpt(p, r)
Case PAR_ACTION_REP : res = PAR_ApplyRep(p, r)
Case PAR_ACTION_ERR : res = PAR_ApplyErr(p, r)
Case PAR_ACTION_EOF : res = PAR_ApplyEof(p, r)
Case PAR_ACTION_PLUS : res = PAR_ApplyPlus(p, r)
Default ;Token rules
res = PAR_ApplyTokenRule(p, r)
End Select
EndIf
If p\errorState And (res <> Null) Then PAR_FreeNode res : Else Return res
End Function
;(Internal) Cat operator implementation
Function PAR_ApplyCat.PAR_Node(p.PAR_Parser, r.PAR_Rule, isRep = False)
Local res.PAR_Node = PAR_CreateNode(r\alias, Null), pos.ListNode = p\strp
Local i.Iterator = GetIterator(r\elems) : While EachIn(i)
Local elem.PAR_Rule = Object.PAR_Rule i\Value, sub.PAR_Node = PAR_DispatchRule(p, elem)
If (sub = Null) Or p\errorState
PAR_FreeNode res : If Not p\errorState Then p\strp = pos
If i\cni_ >= r\errOnFailAfter
PAR_private_Exc_ = r\alias
p\errorState = PAR_ERR_EXRULE
EndIf
IteratorBreak(i) : Return Null
EndIf
ListAddLast res\leaves, Handle sub
Wend
If r\filters <> Null Then PAR_ApplyFilter r, res
If r\doFold Or isRep Then res = PAR_ApplyFold(res); : Else res = PAR_ApplyFold(res, True)
Return res
End Function
;(Internal) Alt operator implementation
Function PAR_ApplyAlt.PAR_Node(p.PAR_Parser, r.PAR_Rule)
Local i.Iterator = GetIterator(r\elems) : While EachIn(i)
Local elem.PAR_Rule = Object.PAR_Rule i\Value, sub.PAR_Node = PAR_DispatchRule(p, elem)
If (sub <> Null)
Local res.PAR_Node
If r\doFold ;Alt can fold itself easily, so it simply inherits any fold state
res = sub
Else
res = PAR_CreateNode(r\alias, Null) : ListAddLast res\leaves, Handle sub
EndIf
IteratorBreak(i) : Return res
ElseIf p\errorState
IteratorBreak(i) : Return Null
EndIf
Wend
End Function
;(Internal) Opt operator implementation
Function PAR_ApplyOpt.PAR_Node(p.PAR_Parser, r.PAR_Rule)
Local pos.ListNode = p\strp, res.PAR_Node = PAR_ApplyCat(p, r) ;Don't return Null on failure
If res = Null And p\errorState = PAR_ERR_NONE
res = PAR_CreateNode(r\alias, PAR_private_Nil_) : p\strp = pos
EndIf
Return res
End Function
;(Internal) Rep operator implementation
Function PAR_ApplyRep.PAR_Node(p.PAR_Parser, r.PAR_Rule)
Local res.PAR_Node, pos.ListNode = p\strp, sub.PAR_Node
Repeat
sub = PAR_ApplyCat(p, r, True)
If sub = Null
If p\errorState
If res <> Null Then PAR_FreeNode res
Return Null
EndIf
If res = Null Then res = PAR_CreateNode(r\alias, PAR_private_Nil_)
p\strp = pos
Else
If res = Null Then res = PAR_CreateNode(r\alias, Null)
ListAddLast res\leaves, Handle sub
pos = p\strp
EndIf
Until sub = Null
;If r\doFold And (res\term <> PAR_private_Nil_) Then res = PAR_ApplyFold(res)
If res\leaves <> Null ;More than one, not folded
Local i.Iterator = GetIterator(res\leaves) : While EachIn(i)
Local el.PAR_Node = Object.PAR_Node i\Value : el\eName = i\cni_
Wend
EndIf
Return res
End Function
;(Internal) Plus operator implementation
Function PAR_ApplyPlus.PAR_Node(p.PAR_Parser, r.PAR_Rule)
Local pos.ListNode = p\strp, sub.PAR_Node = PAR_ApplyCat(p, r, True)
If sub = Null Then p\strp = pos : Return Null
Local res.PAR_Node = PAR_CreateNode(r\alias, Null)
ListAddLast res\leaves, Handle sub
pos = p\strp
Repeat
sub = PAR_ApplyCat(p, r, True)
If sub = Null
If p\errorState Then PAR_FreeNode res : Return Null
p\strp = pos
Else
ListAddLast res\leaves, Handle sub
pos = p\strp
EndIf
Until sub = Null
If res\leaves <> Null ;More than one, not folded
Local i.Iterator = GetIterator(res\leaves) : While EachIn(i)
Local el.PAR_Node = Object.PAR_Node i\Value : el\eName = i\cni_
Wend
EndIf
If r\doFold Then res = PAR_ApplyFold(res)
Return res
End Function
;(Internal) Err operator implementation
Function PAR_ApplyErr.PAR_Node(p.PAR_Parser, r.PAR_Rule)
Local sub.PAR_Node = PAR_ApplyCat(p, r)
If sub <> Null Then PAR_Error p, r\fs : p\errorState = PAR_ERR_MATCH : PAR_FreeNode sub
End Function
;(Internal) Eof operator implementation
Function PAR_ApplyEof.PAR_Node(p.PAR_Parser, r.PAR_Rule)
Local tok.LEX_Token = Object.LEX_Token p\strp\Value
If tok <> PAR_private_EOF_ Then p\errorState = PAR_ERR_EOF : Return Null
Return PAR_CreateNode("EOF", tok)
End Function
;(Internal) Token-operator implementation(s)
Function PAR_ApplyTokenRule.PAR_Node(p.PAR_Parser, r.PAR_Rule)
Local tok.LEX_Token = Object.LEX_Token p\strp\Value, res.PAR_Node
If r\action = PAR_ACTION_CT Or r\action = PAR_ACTION_ET
If tok\tType = r\rs Then res = PAR_CreateNode(r\alias, tok)
Else ;EV or CV
If tok\value = r\rs Then res = PAR_CreateNode(r\alias, tok)
EndIf
If res <> Null
p\strp = p\strp\nx_
ElseIf r\action = PAR_ACTION_ET Or r\action = PAR_ACTION_EV
If r\action = PAR_ACTION_ET Then p\errorState = PAR_ERR_EX_T : Else p\errorState = PAR_ERR_EX_V
PAR_private_Exc_ = r\rs
EndIf
Return res
End Function
;(Internal) Run a filter string over a node's result list to strip/rename entries
Function PAR_ApplyFilter(r.PAR_Rule, n.PAR_Node)
Local f.PAR_Filter, e.PAR_Node, fi = 0
Local i.Iterator = GetIterator(n\leaves) : While EachIn(i)
f = Object.PAR_Filter ValueAtIndex(r\filters, fi)
e = Object.PAR_Node i\Value
If f = Null
e\eName = Str fi
Else ;Actually do something
Select f\action
Case PAR_FILTER_DROP
PAR_FreeNode e : IteratorRemove i
Case PAR_FILTER_NAME
e\eName = f\name
Case PAR_FILTER_FOLDPREV
Local pv.PAR_Node = Object.PAR_Node i\cn_\pv_\Value
If pv = Null Then RuntimeError "Cannot fold to previous node from start of result list"
If pv\leaves = Null
If pv\term = PAR_private_Nil_
pv\term = Null : pv\leaves = CreateList()
Else
RuntimeError "Previous node does not have a leaf list"
EndIf
EndIf
Local tail.PAR_Node = Object.PAR_Node ListLast(pv\leaves)
If tail <> Null
If tail\term = PAR_private_Nil_ Then PAR_FreeNode tail : ListRemoveLast pv\leaves
EndIf
e\eName = ListLength(pv\leaves) ;Give it an index, not a name
ListAddLast pv\leaves, Handle e
IteratorRemove i
Case PAR_FILTER_FOLDWITH
pv = Object.PAR_Node i\cn_\pv_\Value
If pv = Null Then RuntimeError "Cannot fold to previous node from start of result list"
If e\leaves = Null
If e\term = PAR_private_Nil_
e\term = Null : e\leaves = CreateList()
Else
RuntimeError "This node does not have a leaf list"
EndIf
EndIf
Local head.PAR_Node = Object.PAR_Node ListFirst(e\leaves)
If head <> Null
If head\term = PAR_private_Nil_ Then PAR_FreeNode head : ListRemoveFirst e\leaves
EndIf
ListAddFirst e\leaves, Handle pv
RemoveListNode i\cn_\pv_
Local j.Iterator = GetIterator(e\leaves) : While EachIn(j)
pv = Object.PAR_Node j\Value : pv\eName = j\cni_
Wend
Default
RuntimeError "Unexpected parse filter action: " + f\action
End Select
EndIf
fi = fi + 1
Wend
End Function
;(Internal) 'Fold' a result node: if it consists of only one element, return that element instead
Function PAR_ApplyFold.PAR_Node(n.PAR_Node)
Local ret.PAR_Node, i.Iterator, elem.PAR_Node
i = GetIterator(n\leaves) : While EachIn(i)
elem = Object.PAR_Node i\Value
If elem\term <> PAR_private_Nil_
If ret <> Null Then Return n : Else ret = elem ;Two non-Nil elements: don't fold
EndIf
Wend
If ret = Null
ret = PAR_CreateNode(n\rName, PAR_private_Nil_) ;All Nil, or none: create a Nil return value
Else
; If preserve
; If Left(ret\rName, 2) = "<@" ;Only fold up internal nodes if they're anonymous
; ret\rName = n\rName : ret\eName = n\eName
; Else
; Return n
; EndIf
; EndIf
ListRemove n\leaves, Handle ret
PAR_FreeNode n
EndIf
Return ret
End Function
;Set the root rule for the whole grammar of the current parser
Function PAR_Root(name$)
PAR_private_CurrentParser_\rn = name
End Function
;Grammar operator: concatenate arguments
Function PAR_Cat$(rule$, filter$ = "")
Return PAR_CreateRule_(rule, filter, PAR_ACTION_CAT, True)
End Function
;Grammar operator: choose first matching argument
Function PAR_Alt$(rule$, filter$ = "")
Return PAR_CreateRule_(rule, filter, PAR_ACTION_ALT, True)
End Function
;Grammar operator: try to match against arguments, but allow failure
Function PAR_Opt$(rule$, filter$ = "")
Return PAR_CreateRule_(rule, filter, PAR_ACTION_OPT, True)
End Function
;Grammar operator: match arguments, with repetition
Function PAR_Rep$(rule$, filter$ = "")
Return PAR_CreateRule_(rule, filter, PAR_ACTION_REP, True)
End Function
;Extended grammar operator: match arguments, with _at least one_ repetition
Function PAR_Plus$(rule$, filter$ = "")
Return PAR_CreateRule_(rule, filter, PAR_ACTION_PLUS, True)
End Function
;Grammar operator: raise a specific error on match, abort
Function PAR_Err$(rule$, filter$ = "")
Return PAR_CreateRule_(rule, filter, PAR_ACTION_ERR, False)
End Function
;Grammar operator: match the end of the input stream, error on anything else
Function PAR_Eof$()
Return PAR_CreateRule_("EOF", "", PAR_ACTION_EOF, False)
End Function
;(Internal) Grammar operator: #
Function PAR_CT$(rule$)
Return PAR_CreateRule_(Mid(rule, 2), "", PAR_ACTION_CT, False)
End Function
;(Internal) Grammar operator: $
Function PAR_CV$(rule$)
Return PAR_CreateRule_(Mid(rule, 2), "", PAR_ACTION_CV, False)
End Function
;(Internal) Grammar operator: #!
Function PAR_ET$(rule$)
Return PAR_CreateRule_(Mid(rule, 3), "", PAR_ACTION_ET, False)
End Function
;(Internal) Grammar operator: $!
Function PAR_EV$(rule$)
Return PAR_CreateRule_(Mid(rule, 3), "", PAR_ACTION_EV, False)
End Function
;(Internal) Create a token rule out of a simple token word
Function PAR_TokenRule$(rule$)
If Left(rule, 1) = "#" ;Type
If Mid(rule, 2, 1) = "!" Return PAR_ET(rule) : Else Return PAR_CT(rule)
Else ;Value
If Mid(rule, 2, 1) = "!" Return PAR_EV(rule) : Else Return PAR_CV(rule)
EndIf
End Function
;(Internal) Construct a new rule object
Function PAR_CreateRule_$(rule$, filter$, action, doFilter)
Local r.PAR_Rule = New PAR_Rule
r\rs = rule : r\fs = filter : r\action = action
r\genName = PAR_GenRuleName()
r\alias = r\genName ;May be updated later
r\elems = CreateList() ;Populated during PAR_Commit; content not owned by the rule
r\doFold = -1 ;Will set to true or false based on filter and whether it gets named
r\isTerminal = (action = PAR_ACTION_CT Or action = PAR_ACTION_CV Or action = PAR_ACTION_ET Or action = PAR_ACTION_EV Or action = PAR_ACTION_EOF)
r\errOnFailAfter = 1000000000 ;Unrealistically high number by default, i.e. "off"
If doFilter
PAR_Split filter
If PAR_private_SplitCt_
If PAR_private_SplitRes_(PAR_private_SplitCt_ - 1) = "^" ;"Fold up" filter option
PAR_private_SplitCt_ = PAR_private_SplitCt_ - 1
r\doFold = 1
ElseIf PAR_private_SplitRes_(PAR_private_SplitCt_ - 1) = "." ;"No fold" filter option
PAR_private_SplitCt_ = PAR_private_SplitCt_ - 1
r\doFold = 0
EndIf
EndIf
r\filters = CreateList()
Local f : For f = 0 To PAR_private_SplitCt_ - 1
ListAddLast r\filters, Handle PAR_FilterAction(PAR_private_SplitRes_(f), f)
Next
EndIf
ListAddLast PAR_private_CurrentParser_\rules, Handle r ;Add to parser's ownership list
Return " " + r\genName + " "
End Function
;(Internal) Generate a unique internal name for rules
Function PAR_GenRuleName$()
PAR_private_NameCounter_ = PAR_private_NameCounter_ + 1
Return "<@" + PAR_private_NameCounter_ + ">"
End Function
;Retrieve a rule by aliased name
Function PAR_GetRuleByName.PAR_Rule(p.PAR_Parser, name$, errFail = False)
Local i.Iterator = GetIterator(p\alias) : While EachIn(i)
Local a.PAR_RuleAlias = Object.PAR_RuleAlias i\Value
If a\alias = name Then IteratorBreak(i) : Return a\r
Wend
If errFail Then RuntimeError "Generic-parser error: unable to find rule with alias '" + name + "'"
End Function
;(Internal) Retrieve a rule by genName
Function PAR_GetRuleByGenName.PAR_Rule(p.PAR_Parser, genName$, errFail = False)
Local i.Iterator = GetIterator(p\rules) : While EachIn(i)
Local r.PAR_Rule = Object.PAR_Rule i\Value
If r\genName = genName Then IteratorBreak(i) : Return r
Wend
If errFail Then RuntimeError "Generic-parser error: unable to find rule with genName '" + genName + "'"
End Function
;(Internal) Retrieve a rule by name, genName, or if it's a token rule, create it anew
Function PAR_GetRuleByWord.PAR_Rule(word$)
If Instr(word, "@") Then Return PAR_GetRuleByGenName(PAR_private_CurrentParser_, word, True)
If Left(word, 1) = "#" Or Left(word, 1) = "$"
PAR_TokenRule(word)
Return Object.PAR_Rule ListLast(PAR_private_CurrentParser_\rules)
EndIf
Return PAR_GetRuleByName(PAR_private_CurrentParser_, word, True)
End Function
;Free a rule object
Function PAR_FreeRule(r.PAR_Rule)
If r\filters <> Null
Local i.Iterator = GetIterator(r\filters) : While EachIn(i) ;Free filters
Delete Object.PAR_Filter i\Value
Wend
FreeList r\filters
EndIf
FreeList r\elems ;Free element list, but not element objects (owned by parser)
Delete r
End Function
;Apply a fixed name to a rule object (this is the only way to refer to them in user code)
Function PAR_NameRule(name$, rName$)
If Instr(name, "#") Or Instr(name, "$") Or Instr(name, "!") Or Instr(name, "@")
RuntimeError "Generic-parser error: rule names may not contain the sigils #, !, $ or @"
ElseIf Instr(name, " ") Or Instr(name, Chr(9))
RuntimeError "Generic-parser error: rule names may not contain whitespace"
EndIf
name = Lower(name)
Local i.Iterator = GetIterator(PAR_private_CurrentParser_\alias) ;Check for duplicate name
While EachIn(i)
Local r.PAR_RuleAlias = Object.PAR_RuleAlias i\Value
If r\alias = name Then RuntimeError "Generic-parser error: duplicate rule named '" + name + "'"
Wend
PAR_CreateRuleAlias name, rName
End Function
;(Internal) Actually construct the above alias
Function PAR_CreateRuleAlias.PAR_RuleAlias(alias$, rName$)
Local nr.PAR_RuleAlias = New PAR_RuleAlias
nr\alias = alias
nr\rName = Trim(rName) ;Remember the trim!
If Instr(rName, "@") ;It's a genName: sort the list on this basis so that "real" rules are available first
ListAddFirst PAR_private_CurrentParser_\alias, Handle nr
Else
ListAddLast PAR_private_CurrentParser_\alias, Handle nr
EndIf
End Function
;(Internal) Create a new result node, named after the given match
Function PAR_CreateNode.PAR_Node(rName$, tok.LEX_Token)
Local n.PAR_Node = New PAR_Node
n\rName = rName
If tok = Null Then n\leaves = CreateList() : Else n\term = tok
Return n
End Function
;Free a result node and all of its subnodes
Function PAR_FreeNode(n.PAR_Node)
If n\leaves <> Null
Local i.Iterator = GetIterator(n\leaves) : While EachIn(i)
PAR_FreeNode Object.PAR_Node i\Value
Wend
FreeList n\leaves
EndIf
Delete n
End Function
;Create a recursive copy of a node
Function PAR_CopyNode.PAR_Node(n.PAR_Node)
Local c.PAR_Node = New PAR_Node
c\rName = n\rName
c\eName = n\eName
c\term = n\term ;Nodes don't own terminals, so copying the reference is correct
If n\leaves <> Null
c\leaves = CreateList()
Local i.Iterator = GetIterator(n\leaves) : While EachIn(i)
ListAddLast c\leaves, Handle PAR_CopyNode(Object.PAR_Node i\Value)
Wend
EndIf
Return c
End Function
;Return the child of a node with the given name
Function PAR_NodeChild.PAR_Node(n.PAR_Node, name$)
If n\leaves <> Null
Local i.Iterator = GetIterator(n\leaves) : While EachIn(i)
Local ch.PAR_Node = Object.PAR_Node i\Value
If ch\eName = name Then Return ch
Wend
EndIf
Return Null
End Function
;(Internal) Construct filter objects based on filter keys
Function PAR_FilterAction.PAR_Filter(name$, num)
Local f.PAR_Filter = New PAR_Filter
Select name
Case "~"
f\action = PAR_FILTER_DROP
Case "<"
f\action = PAR_FILTER_FOLDPREV
Case ">"
f\action = PAR_FILTER_FOLDWITH
Default
If Left(name, 1) = "@"
f\action = PAR_FILTER_NAME
If name = "@" Then f\name = Str num : Else f\name = Mid(name, 2)
Else
RuntimeError "Generic-parser error: illegal filter action '" + name + "'"
EndIf
End Select
Return f
End Function
;(Internal) Split a string by spaces - this is NOT a generic string splitter though
Function PAR_Split(s$)
If Len(s) = 0 Then PAR_private_SplitCt_ = 0 : Return
Local t$ = Replace(Trim(s), Chr(9), " ")
Repeat
s = Replace(t, " ", " ")
If Len(s) = Len(t) Then Exit : Else t = s
Forever
Local c, sCount : For c = 1 To Len(s)
If Mid(s, c, 1) = " " Then sCount = sCount + 1
Next
Dim PAR_private_SplitRes_$(sCount) : PAR_private_SplitCt_ = sCount + 1
If sCount = 0
PAR_private_SplitRes_(0) = s
Else
For c = 0 To sCount
If c < sCount
Local p = Instr(s, " ")
PAR_private_SplitRes_(c) = Left(s, p - 1)
s = Mid(s, p + 1)
Else
PAR_private_SplitRes_(c) = s
EndIf
Next
EndIf
End Function
;~IDEal Editor Parameters:
;~F#D#16#1A#20#24#39#42#50#70#A0#B3#C7#D9#E2#FD#117#11D#124#138#172
;~F#18B#190#195#19A#19F#1A4#1A9#1AE#1B3#1B8#1BD#1C2#1C7#1D0#1F1#1F7#200#209#213#21F
;~F#232#23E#246#251#260#26B#280
;~C#Blitz3D
Comments
While this is waaaaaay more efficient and simple than writing a parser by hand, it's still a lot of typing, and I've found it's incredibly tedious to get all of the punctuation (quotemarks, commas etc.) right when entering the Blitz code. It's also a nuisance to type things like PAR_NameRule repeatedly, and it's a real pain to have to maintain two separate copies of the grammar, one in readable form and one in this API representation, and carry corrections from one into the other...

To that end, I have knocked together a tiny parser-generator-generator program. Enter the grammar once, and the program will emit prepackaged parser API code for the above library, the grammar in BNF style, or both (BNF comments above the API calls). Download it here:

https://sites.google.com/site/nangdongseng/downloads/ParserGenerator.zip?attredirects=0&d=1

The program is written in pure R5RS Scheme, and requires an interpreter (note: it won't work with a compiler, as it dynamically "eval"s the grammar file). I recommend Gambit or Chicken. You need to write your grammar in S-expression style like this:

; S-expression grammar for a simple calculator language, readable by parser generator:
(parser democalc
  (:= expression comp-expr)
  (:= comp-expr (:& sum-expr (:* comp-op sum-expr)))
  (:= sum-expr (:& shift-expr (:* sum-op shift-expr) => "foo"))
  (:= shift-expr (:& mul-expr (:* shift-op mul-expr)))
  (:= mul-expr (:& pow-expr (:* mul-op pow-expr)))
  (:= pow-expr (:& unary-expr (:* pow-op unary-expr)))
  (:= unary-expr (:& (:* unary-op) atom))
  (:= atom (:/ %number func-call (:& %lparen expression %!rparen)))
  (:= func-call (:& %function %lparen expression (:* %comma expression) %!rparen))
  (:= comp-op (:/ %eql %neq %leq %geq))
  (:= sum-op (:/ %add %sub))
  (:= shift-op (:/ %shl %shr))
  (:= mul-op (:/ %mul %div %mod))
  (:= pow-op %pow)
  (:= unary-op (:/ %add %sub))
)

...and it will produce a packaged output parser like this:

Or, put the line "(set! output-mode '(doc))" or "(set! output-mode '(code))" right before the grammar definition, to get just the comments - i.e. a readable BNF-style grammar - or just the code, respectively.

Personally I find the S-expression grammar very easy to read and write, and therefore think this tool helps matters no end. Certainly there's a lot less typing involved.

Run the program with the line "echo (file1.scm file2.scm) | SCHEME parse-gen.scm", where "file1.scm" etc. are the files you want to process, and "SCHEME" is the command to invoke your Scheme interpreter. Output is placed in a file with the name given in the grammar definition block. parse-gen.scm has more information in its comments.

There is no attempt at error-checking in the generator, so... um, don't make mistakes.
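To give a rough idea of how the S-expression operators map onto the API (an illustration of the correspondence, not the tool's literal output, whose exact formatting may differ), a rule such as (:= sum-op (:/ %add %sub)) corresponds to a call along the lines of:

PAR_NameRule "sum-op", PAR_Alt("#add #sub")

with %number becoming the terminal "#number", and %!rparen the error-raising terminal "#!rparen".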
Quite interesting, ...and super complex, with lots of special constants and unusual DATA types. It's true that I know nothing about SCHEME. I'm not sure, but if I were to create a parser, I would stick to using simple stacks, FILO-style (first-in, last-out): http://en.wikipedia.org/wiki/LIFO

I'll be following your topic, and see how this progresses.
You don't need to pay any attention to the Scheme program, it's just a support utility because I'm too lazy to even write out as much as the main API requires.

As for creating parsers: this framework is powered by a simple recursive descent algorithm. It's not efficient, but it should be able to handle any right-recursive CFG (e.g. it should have no problem parsing Blitz, Lua or even JavaScript code, given a grammar).

I'm not too good at parsing algorithms so I could be totally wrong, but I think there's no stack-based solution of comparable power that isn't also far more complicated (if all you want to do is parse simple math expressions, you can do the shunting-yard algorithm in one page or so of Blitz code; but that can't cover arbitrary programming languages: this can). I guess you could substantially simplify a grammar by using a secondary precedence parser for all simple expressions though. Do share if you have interesting ideas!
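For comparison with the shunting-yard remark above, here is a rough sketch of that approach (not part of the library, and with the simplifying assumption that the expression is already space-separated, so no tokeniser is needed):

;Rough sketch of the shunting-yard algorithm: evaluates a space-separated
;integer expression with + - * / and parentheses, using two plain Dim'd
;arrays as the value and operator stacks. No error checking.
Const SY_MAX = 64
Dim SY_Vals(SY_MAX)
Dim SY_Ops$(SY_MAX)
Global SY_VTop = -1, SY_OTop = -1

Function SY_Prec(op$) ;Binding strength of an operator ("(" binds weakest)
    If op = "*" Or op = "/" Then Return 2
    If op = "+" Or op = "-" Then Return 1
    Return 0
End Function

Function SY_Reduce() ;Pop one operator and apply it to the top two values
    Local op$ = SY_Ops(SY_OTop) : SY_OTop = SY_OTop - 1
    Local b = SY_Vals(SY_VTop) : SY_VTop = SY_VTop - 1
    Local a = SY_Vals(SY_VTop)
    Select op
        Case "+" : SY_Vals(SY_VTop) = a + b
        Case "-" : SY_Vals(SY_VTop) = a - b
        Case "*" : SY_Vals(SY_VTop) = a * b
        Case "/" : SY_Vals(SY_VTop) = a / b
    End Select
End Function

Function SY_Eval(expr$)
    SY_VTop = -1 : SY_OTop = -1
    expr = Trim(expr)
    While expr <> ""
        Local sp = Instr(expr, " "), w$
        If sp Then w = Left(expr, sp - 1) : expr = Trim(Mid(expr, sp + 1)) : Else w = expr : expr = ""
        If w = "("
            SY_OTop = SY_OTop + 1 : SY_Ops(SY_OTop) = w
        ElseIf w = ")"
            While SY_Ops(SY_OTop) <> "(" ;Reduce back to the matching bracket
                SY_Reduce()
            Wend
            SY_OTop = SY_OTop - 1 ;Discard the "("
        ElseIf w = "+" Or w = "-" Or w = "*" Or w = "/"
            While SY_OTop >= 0 ;Reduce anything that binds at least as tightly
                If SY_Prec(SY_Ops(SY_OTop)) < SY_Prec(w) Then Exit
                SY_Reduce()
            Wend
            SY_OTop = SY_OTop + 1 : SY_Ops(SY_OTop) = w
        Else ;Anything else is assumed to be an integer
            SY_VTop = SY_VTop + 1 : SY_Vals(SY_VTop) = Int(w)
        EndIf
    Wend
    While SY_OTop >= 0 ;Apply whatever is left over
        SY_Reduce()
    Wend
    Return SY_Vals(0)
End Function

DebugLog SY_Eval("1 + 2 * 3 + 4 * 5") ;prints 27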
You said, "covering arbitrary programming languages", etc...

As for interesting ideas, well, perhaps the ability to include CONSTANTS, trigonometry, set theory, complex numbers, string functions, memories, arrays, ... (as in a[5] = sin(4*pi*sqrt[-1]))

For some inspiration on your parser project, maybe copy the Mathematica style of doing things.
http://functions.wolfram.com/
http://reference.wolfram.com/mathematica/guide/VariablesAndFunctions.html
http://reference.wolfram.com/mathematica/guide/FunctionalProgramming.html
...I think we may be talking at cross-purposes. This code is finished (barring bugs, of which I am sure there are many...). The archive entry at the top is the completed product. I think you may be misunderstanding what it is, though: it's a framework for creating a parser (it's about as close as you can get in B3D to a parser combinator). It's not a programming language by itself. So that means two things:

1) You have to supply the language grammar yourself. The parse engine within it can handle "complicated" structures, but it can't recognise anything without a grammar. It has no built-in knowledge of any language.

2) Parsing is simply the process of breaking down a linear input stream into an organised, structured tree. A parser doesn't understand concepts like constants, functions or even data types. It just "draws" them in memory. You've got to hand the resulting parse tree over to a checker and code generator before you have a working compiler - this is just the first step!

Anything that would involve either a. "running" the input, or b. understanding the "meaning" of the input, is beyond the scope of both this framework and parsing in general. This is just the process of turning your input stream - i.e. messy, arbitrary user files - into a neat structure akin to XML, before the "real work" can begin... whatever that is (note: a parser isn't even directly tied to the concept of programming - it could just as well be used as the first stage of a very inefficient 3D model loader).