Code archives/Algorithms/Lexer generator
This code has been declared by its author to be Public Domain code.
Download source code
| |||||
| Final update: There is no longer any reason to use this entry: it has been completely superseded by a much more convenient generic lexer API. Updated 21/07/2011: Completely rewrote the regular expression engine in an attempt to fix some bugs, and tidied up the output of the scanner to use objects. Updated 18/06/2010: Added the "Type" command; affects both files. Updated 19/04/2010: Fixed a minor error with backslash-escaped backslashes in the regexen; added a couple of major speed enhancements. Updated 01/03/2010: If you downloaded this previously, please do so again as there was a problem with the regular expression engine. A lexical scanner, or tokeniser, is a tool that reads through souce code or some other kind of input, and breaks it up into separate tokens, identified by type. It's an absolutely essential component in a compiler or interpreter, and is useful as the first stage of parsing many other kinds of input as well, such as in a calculator or command-line interface. Writing them can be both difficult and tedious, however; so it's common to automate the process... This generator is very loosely based on Flex, although significantly less powerful and only aimed at producing a specific kind of output. Input is in the form of a definitions file - this is a simple text file in the following format:
Case Insensitive
Constants: {
Digit [0-9]
Quote "
Point \.
Int 1
Float 2
String 3
}
Modes: {
COMMENT Exclusive
DOG Inclusive
}
Rules: {
-?{Digit}+ Store Int
-?{Digit}*{Point}{Digit}+ Store Float
{Quote}[^\n]*{Quote} Store String
\{- Mode <COMMENT>
<COMMENT> -\} Mode <>
<COMMENT,DOG,> doggy {
print "nada"
}
}
Code: {
;Anything in here is copied straight to the main body
;so make sure it's valid BB code
}
Input is arranged in "blocks" - anything not in one of these blocks is currently ignored, with the exception of the "case insensitive" directive which should appear outside the blocks (EDIT: There's now also a "case sensitive" directive, if you need to return it to the default, which also needs to be outside the blocks). There are four kinds of block, as shown above, although none are necessary and each type may appear more than once (a file with no "Rules" block will not generate a functioning lexer, though!) and in any order. Blocks begin with a type specifier ("Rules", "Code", "Constants" or "Modes") followed by a colon and an opening brace; the definitions in the block must then begin on a new line. The block ends when a closing brace is encountered on its own new line. The "Code" block is simplest: anything in this block is simply copied verbatim into the resulting file, in the "main body" of the code. You can specify include lines or helper functions here if you like. The "Constants" block defines simple replacement constants for use in rule definitions and their actions (more on this below). These won't make it into the final BB code though, so don't reference them in any code sections. You can declare these before or after Rules; it makes no difference. The "Modes" block defines different modes of operation that will determine which rules are followed at any given time (supposed to imitate Flex's "start conditions"). For example, if you wanted to scan C or C++, you could set the token "/\*" to trigger "comment" mode, which has only one rule: "\*/", which puts the scanner back in normal mode and allows it to pick up names and commands again (in both cases the asterisk must be escaped). Modes are defined as "inclusive" or "exclusive": exclusive modes, like the comment mode, only allow rules explicitly assigned to them to be followed while active; an inclusive mode will also allow rules without any specific mode to be followed. As with constants, modes can be declared anywhere in the definitions file. Finally, the most important section is the Rules section. A rule begins with an optional list of modes, contained within <> (more than one mode can be assigned to a rule, separated with commas, including as above, the empty mode). Next is a regular expression that defines a pattern to match tokens to, following the rules outlined here with the addition of ^ at the start of a pattern to force start-of-line or -file and $ at the end of a pattern to force end-of-line/file. Note that the braces are escaped with backslash to force their literal value. If you included the "case insensitive" directive somewhere in the file (outside the blocks) then all of the rules will be case-insensitive (if you need only some of the rules to be case-sensitive, you'll need to devise regex rules that reflect this). After the pattern, you can specify an action for the lexer to take: you can store the token and an integer type for it, just store the type (useful for saving memory on things like operators where the token itself is always known), change mode, or execute arbitrary BB code within a braced block, again allowing a new line for the final brace (unlike the code in Code:{} blocks, this is placed into the scanner function scope). If you want to do more than one of these, at the moment the only way is to do so directly in BB code (will probably change this). You don't need to specify an action at all - in a C++ lexer you might have the rule //[^\n]*\n (double slash, then anything up until the next newline) with no action, to comment out the rest of a line. If more than one pattern could match a character string (eg. "End" and "End Function") the lexer will go with the longer match. If the matches are the same length then the first specified in the list will be chosen. The generated scanner builds a list of tokens (as bankstrings) and their integer types in a bank, which it returns, unless you use the BB code blocks to make it do something else, so unlike some tokenising functions that are called repeatedly to cough up the next token, BBLex_ScanFile() only needs to be called once and then the tokens can be obtained by navigating the resulting bank. Only three functions are actually created by this generator - the scanner itself (BBLex_ScanFile) and two other initialisation functions that it calls. The vast majority of the scanner actually consists of a slightly-modified version of my regular expressions library, and so is simply Included as BBLex_Functions.bb: Here's an example program to demonstrate the results:
Local i, tokenBank
tokenBank = BBLex_ScanFile("test.txt")
For i = 0 To BankSize(tokenBank) / 4 - 1
Local tok.BBLex_Token = BBLex_GetToken(tokenBank, i)
Print tok\tType + " : " + tok\val
Next
WaitKey
End
Include "testlex.bb"
...and a really simple test file to tokenise:
12 345 56.45 doggy {- This is a
comment and shouldn't be picked up -}
"String literal 1!"n
n"String literal 2!"
doggy
65 -12.34
boogledoggy 35.8 "doggy as a string literal!"
And finally, the generator itself: | |||||
Write "Generating... "
BBLex_Generate("Scythe lexer.txt","BBLex_Scythe.bb") ;Change these to the desired input and output files
Print "done!"
Print ""
Print "Press any key to exit..."
WaitKey
End
Const SIZEOF_CONST = 9
Function BBLex_Generate(defFile$,lexFile$) ;Generate a .bb lexer from the definitions given in defFile and output it as lexFile
Local dFile,dLine$,lFile,i,caseSen,userCodeOutput
Local ruleBank,constBank,modeBank
dFile=ReadFile(defFile)
lFile=WriteFile(lexFile)
constBank=CreateBank()
modeBank=CreateBank(5)
PokeInt modeBank,0,StrToBank("")
ruleBank=CreateBank()
WriteLine lFile,""
WriteLine lFile,";This file was automatically generated using BBLex: http://www.blitzbasic.com/codearcs/codearcs.php?code=2636"
WriteLine lFile,""
While Not Eof(dFile)
dLine=Replace(Replace(Lower(ReadLine(dFile)),Chr(9),"")," ","")
Select dLine
Case "caseinsensitive","case-insensitive"
caseSen=False
Case "casesensitive","case-sensitive"
caseSen=True
Case "constants:{"
LoadConstants(constBank,dFile)
Case "modes:{"
LoadModes(modeBank,dFile)
Case "rules:{"
LoadRules(ruleBank,dFile)
Case "code:{"
WriteLine lFile,""
dLine=ReadLine(dFile)
While Not Eof(dFile)
If Left(Trim(dLine),1)="}" Then Exit
If userCodeOutput=False
WriteLine lFile,""
WriteLine lFile,Chr(9)+";User code:"
WriteLine lFile,""
userCodeOutput=True
EndIf
WriteLine lFile,dLine
dLine=ReadLine(dFile)
Wend
WriteLine lFile,""
End Select
Wend
CloseFile dFile
ProcessRules(ruleBank,modeBank,constBank)
OutputLexer(constBank,ruleBank,lFile,caseSen,userCodeOutput)
CloseFile lFile
For i=0 To BankSize(constBank)-SIZEOF_CONST Step SIZEOF_CONST
FreeBank PeekInt(constBank,i)
FreeBank PeekInt(constBank,i+4)
Next
FreeBank constBank
For i=0 To BankSize(modeBank)-5 Step 5
FreeBank PeekInt(modeBank,i)
Next
FreeBank modeBank
FreeBank ruleBank
End Function
Function OutputLexer(constBank,ruleBank,lexFile,caseSen,userCodeOutput)
Local newLine$,i,j,action$
newLine=Chr(13)+Chr(10)
If userCodeOutput Then WriteLine lexFile,newLine+newLine+Chr(9)+";Generated code:"
WriteLine lexFile,newLine+"Include "+Chr(34)+"BBLex_Functions.bb"+Chr(34)+newLine
If BankSize(constBank)
For i=0 To BankSize(constBank)-SIZEOF_CONST Step SIZEOF_CONST
If PeekByte(constBank,i+8)=True
WriteLine lexFile,"Const "+BankToStr(PeekInt(constBank,i))+" = "+BankToStr(PeekInt(constBank,i+4))
EndIf
Next
WriteLine lexFile, ""
EndIf
WriteLine lexFile,"Function BBLex_ScanData(sBank)"
WriteLine lexFile,Chr(9)+"Local rBank, mBank, tBank, cPtr"+Chr(9)+newLine+Chr(9)+"Local token$, cMatch$, rID, i, cMode"+newLine
WriteLine lexFile,Chr(9)+"rBank = BBLex_InitRegexen()"+newLine+Chr(9)+"mBank = BBLex_InitModes()"
WriteLine lexFile,Chr(9)+"tBank = CreateBank()"+newLine
WriteLine lexFile,Chr(9)+"While cPtr < BankSize(sBank)"+newLine+Chr(9)+Chr(9)+"token = "+Chr(34)+Chr(34)+newLine
WriteLine lexFile,Chr(9)+Chr(9)+"For i = 0 to "+((BankSize(ruleBank)/12)-1)
WriteLine lexFile,Chr(9)+Chr(9)+Chr(9)+"If BBLex_ModeMatch(i, mBank, cMode)"
WriteLine lexFile,Chr(9)+Chr(9)+Chr(9)+Chr(9)+"cMatch = Regex_Match(Object.RegEx_Node(PeekInt(rBank, i * 4)), sBank, cPtr)"
WriteLine lexFile,Chr(9)+Chr(9)+Chr(9)+Chr(9)+"If Len(cMatch) > Len(token) Then token = cMatch : rID = i"
WriteLine lexFile,Chr(9)+Chr(9)+Chr(9)+"EndIf"+newLine+Chr(9)+Chr(9)+"Next"+newLine
WriteLine lexFile,Chr(9)+Chr(9)+"If token = "+Chr(34)+Chr(34)+newLine+Chr(9)+Chr(9)+Chr(9)+"cPtr = cPtr + 1"
WriteLine lexFile,Chr(9)+Chr(9)+"Else"+newLine+Chr(9)+Chr(9)+Chr(9)+"Select rID"
For i=0 To ((BankSize(ruleBank)/12)-1)
WriteLine lexFile,Chr(9)+Chr(9)+Chr(9)+Chr(9)+"Case "+i
action=BankToStr(PeekInt(ruleBank,i*12+8))
Select Lower(Left(action,1))
Case "s"
WriteLine lexFile,Chr(9)+Chr(9)+Chr(9)+Chr(9)+Chr(9)+"BBLex_StoreToken tBank, "+Trim(Mid(action,6))+", token"
Case "t"
WriteLine lexFile,Chr(9)+Chr(9)+Chr(9)+Chr(9)+Chr(9)+"BBLex_StoreType tBank, "+Trim(Mid(action,6))
Case "m"
WriteLine lexFile,Chr(9)+Chr(9)+Chr(9)+Chr(9)+Chr(9)+"cMode = "+Trim(Mid(action,5))
Case "{"
WriteLine lexFile,Chr(9)+Chr(9)+Chr(9)+Chr(9)+Chr(9)+Mid(action,2)
End Select
Next
WriteLine lexFile,Chr(9)+Chr(9)+Chr(9)+"End Select"
WriteLine lexFile,Chr(9)+Chr(9)+Chr(9)+"cPtr = cPtr + Len(token)"+newLine+Chr(9)+Chr(9)+"EndIf"+newLine+Chr(9)+"Wend"+newLine
WriteLine lexFile,Chr(9)+"BBLex_DeleteRegexen(rBank)"+newLine+Chr(9)+"BBLex_ClearModes(mBank)"
WriteLine lexFile,Chr(9)+"FreeBank sBank"+newLine
WriteLine lexFile,Chr(9)+"Return tBank"+newLine+"End Function"+newLine
WriteLine lexFile,"Function BBLex_InitRegexen()"
WriteLine lexFile,Chr(9)+"Local regexBank"+newLine
WriteLine lexFile,Chr(9)+"regexBank = CreateBank("+(BankSize(ruleBank)/3)+")"+newLine
For i=0 To BankSize(ruleBank)/12-1
WriteLine lexFile,Chr(9)+"PokeInt regexBank, "+(i*4)+", Handle(Regex_Parse("+ExpandQuotes(BankToStr(PeekInt(ruleBank,i*12+4)))+", "+StrFromBool(caseSen)+"))"
Next
WriteLine lexFile,newLine+Chr(9)+"Return regexBank"+newLine+"End Function"+newLine
WriteLine lexFile,"Function BBLex_InitModes()"
WriteLine lexFile,Chr(9)+"Local modeBank"+newLine+Chr(9)+"modeBank = CreateBank("+(BankSize(ruleBank)/3)+")"+newLine
For i = 0 To BankSize(ruleBank) / 12 - 1
If BankSize(PeekInt(ruleBank, i * 12))
WriteLine lexFile, Chr(9)+"PokeInt modeBank, " + i * 4 + ", CreateBank("+BankSize(PeekInt(ruleBank, i * 12)) + ")"
For j = 0 To BankSize(PeekInt(ruleBank, i * 12)) - 4 Step 4
WriteLine lexFile, Chr(9) + "PokeInt PeekInt(modeBank, "+i*4+"), "+j+", " + PeekInt(PeekInt(ruleBank, i * 12), j)
Next
Else
WriteLine lexFile, Chr(9) + "PokeInt modeBank, " + i * 4 + ", 0"
EndIf
Next
WriteLine lexFile,newLine+Chr(9)+"Return modeBank"+newLine+"End Function"+newLine
End Function
Function ExpandQuotes$(s$)
Local i
If s = Chr(34) Then Return "Chr(34)"
Local l$ = Left(s, 1), r$ = Right(s, 1), m$ = Mid(s, 2, Len(s) - 2)
If l = Chr(34) Then l = "Chr(34) + " + Chr(34) : Else l = Chr(34) + l
If r = Chr(34) Then r = Chr(34) + " + Chr(34)" : Else r = r + Chr(34)
m = Replace(m, Chr(34), Chr(34) + " + Chr(34) + " + Chr(34))
Return l + m + r
End Function
Function StrFromBool$(b)
If b Then Return "True" Else Return "False"
End Function
Function LoadConstants(constBank,dFile)
Local dLine$,cName$,cValue$,i, export
While Not Eof(dFile)
dLine=Trim(ReadLine(dFile))
If Left(dLine,1)="}" Then Exit
If dLine<>""
If Left(dLine,1)<>";"
For i=1 To Len(dLine)
If i>1
If Mid(dLine,i-1,1)<>"\"
If Asc(Mid(dLine,i,1))<=32 Then Exit
EndIf
EndIf
cName=cName+Mid(dLine,i,1)
Next
dLine=Trim(Mid(dLine,i+1))
For i=1 To Len(dLine)
If i>1
If Mid(dLine,i-1,1)<>"\"
If Asc(Mid(dLine,i,1))<=32 Then Exit
EndIf
EndIf
cValue=cValue+Mid(dLine,i,1)
Next
dLine=Trim(Mid(dLine,i))
ResizeBank constBank,BankSize(constBank)+SIZEOF_CONST
PokeInt constBank,BankSize(constBank)-SIZEOF_CONST,StrToBank(cName)
PokeInt constBank,BankSize(constBank)-(SIZEOF_CONST-4),StrToBank(cValue)
PokeByte constBank,BankSize(constBank)-(SIZEOF_CONST-8),(Lower(Left(dLine,6))="export")
cName=""
cValue=""
EndIf
EndIf
Wend
End Function
Function LoadModes(modeBank,dFile)
Local dLine$,mName$,i
While Not Eof(dFile)
dLine=Trim(ReadLine(dFile))
If Left(dLine,1)="}" Then Exit
If dLine<>""
If Left(dLine,1)<>";"
For i=1 To Len(dLine)
If i>1
If Mid(dLine,i-1,1)<>"\"
If Asc(Mid(dLine,i,1))<=32 Then Exit
EndIf
EndIf
mName=mName+Mid(dLine,i,1)
Next
dLine=Trim(Mid(dLine,i+1))
ResizeBank modeBank,BankSize(modeBank)+5
PokeInt modeBank,BankSize(modeBank)-5,StrToBank(mName)
If Lower(Left(dLine,2))="in" Then PokeByte modeBank,BankSize(modeBank)-1,1:Else PokeByte modeBank,BankSize(modeBank)-1,0
mName=""
EndIf
EndIf
Wend
End Function
Function LoadRules(ruleBank,dFile)
Local dLine$,cPtr,mode$,rule$,action$
While Not Eof(dFile)
dLine=Trim(ReadLine(dFile))
If Left(dLine,1)="}" Then Exit
If dLine<>""
If Left(dLine,1)<>";"
mode=""
If Left(dLine,1)="<"
cPtr=2
While Mid(dLine,cPtr,1)<>">"
If Asc(Mid(dLine,cPtr,1))>32 Then mode=mode+Mid(dLine,cPtr,1)
cPtr=cPtr+1
Wend
dLine=Trim(Mid(dLine,cPtr+1))
EndIf
rule=""
For cPtr=1 To Len(dLine)
If cPtr>2
If Mid(dLine,cPtr-1,1)<>"\"
If Asc(Mid(dLine,cPtr,1))<=32 Then Exit
ElseIf Mid(dLine,cPtr-2,2)="\\"
If Asc(Mid(dLine,cPtr,1))<=32 Then Exit ;If that backslash was part of the pattern
EndIf
ElseIf cPtr=2
If Left(dLine,1)<>"\"
If Asc(Mid(dLine,cPtr,1))<=32 Then Exit
EndIf
EndIf
rule=rule+Mid(dLine,cPtr,1)
Next
dLine=Trim(Mid(dLine,cPtr))
action=dLine
If Left(dLine,1)="{"
While Not Eof(dFile)
dLine=Trim(ReadLine(dFile))
If Left(dLine,1)="}" Then action=action+Chr(13)+Chr(10):Exit
action=action+Chr(13)+Chr(10)+dLine
Wend
EndIf
ResizeBank ruleBank,BankSize(ruleBank)+12
PokeInt ruleBank,BankSize(ruleBank)-12,StrToBank(mode)
PokeInt ruleBank,BankSize(ruleBank)-8,StrToBank(rule)
PokeInt ruleBank,BankSize(ruleBank)-4,StrToBank(action)
EndIf
EndIf
Wend
End Function
Function ProcessRules(ruleBank,modeBank,constBank)
Local r,c,m,mode$,rule$,action$
For r=0 To BankSize(ruleBank)-12 Step 12
mode=BankToStr(PeekInt(ruleBank,r))
FreeBank PeekInt(ruleBank,r)
PokeInt ruleBank,r,CreateBank()
While mode<>""
If Right(mode,1)=","
ResizeBank PeekInt(ruleBank,r),BankSize(PeekInt(ruleBank,r))+4
PokeInt PeekInt(ruleBank,r),BankSize(PeekInt(ruleBank,r))-4,0
mode=Trim(Left(mode,Len(mode)-1))
EndIf
If Instr(mode,",")>0
For m=0 To BankSize(modeBank)-5 Step 5
If Left(mode,Instr(mode,",")-1)=BankToStr(PeekInt(modeBank,m))
ResizeBank PeekInt(ruleBank,r),BankSize(PeekInt(ruleBank,r))+4
If PeekByte(modeBank,m+4)=1
PokeInt PeekInt(ruleBank,r),BankSize(PeekInt(ruleBank,r))-4,-(m/5)
Else
PokeInt PeekInt(ruleBank,r),BankSize(PeekInt(ruleBank,r))-4,m/5
EndIf
EndIf
Next
mode=Mid(mode,Instr(mode,",")+1)
Else
For m=0 To BankSize(modeBank)-5 Step 5
If mode=BankToStr(PeekInt(modeBank,m))
ResizeBank PeekInt(ruleBank,r),BankSize(PeekInt(ruleBank,r))+4
If PeekByte(modeBank,m+4)=1
PokeInt PeekInt(ruleBank,r),BankSize(PeekInt(ruleBank,r))-4,-(m/5)
Else
PokeInt PeekInt(ruleBank,r),BankSize(PeekInt(ruleBank,r))-4,m/5
EndIf
EndIf
Next
mode=""
EndIf
Wend
rule=BankToStr(PeekInt(ruleBank,r+4))
FreeBank PeekInt(ruleBank,r+4)
For c=0 To BankSize(constBank)-SIZEOF_CONST Step SIZEOF_CONST
rule=Replace(rule,"{"+BankToStr(PeekInt(constBank,c))+"}",BankToStr(PeekInt(constBank,c+4)))
Next
PokeInt ruleBank,r+4,StrToBank(rule)
action=BankToStr(PeekInt(ruleBank,r+8))
If Left(action,1)<>"{"
FreeBank PeekInt(ruleBank,r+8)
If Lower(Left(action,5))="store"
For c=0 To BankSize(constBank)-SIZEOF_CONST Step SIZEOF_CONST
If PeekByte(constBank, c + 8) = False
action="store "+Replace(Mid(action,6),"{"+BankToStr(PeekInt(constBank,c))+"}",BankToStr(PeekInt(constBank,c+4)))
EndIf
Next
ElseIf Lower(Left(action,4))="type"
For c=0 To BankSize(constBank)-SIZEOF_CONST Step SIZEOF_CONST
If PeekByte(constBank, c + 8) = False
action="type "+Replace(Mid(action,5),"{"+BankToStr(PeekInt(constBank,c))+"}",BankToStr(PeekInt(constBank,c+4)))
EndIf
Next
ElseIf Lower(Left(action,4))="mode"
For m=0 To BankSize(modeBank)-5 Step 5
If PeekByte(modeBank,m+4)=1
action=Replace(action,"<"+BankToStr(PeekInt(modeBank,m))+">",-(m/5))
Else
action=Replace(action,"<"+BankToStr(PeekInt(modeBank,m))+">",m/5)
EndIf
Next
EndIf
PokeInt ruleBank,r+8,StrToBank(action)
EndIf
Next
End Function
Function StrToBank(s$) ;Return a bank containing the binary value of the given string
Local i,bank
bank=CreateBank(Len(s))
For i=0 To Len(s)-1
PokeByte bank,i,Asc(Mid(s,i+1,1))
Next
Return bank
End Function
Function BankToStr$(bank) ;Return a string containing the ASCII value of the given bank
Local i,s$
For i=0 To BankSize(bank)-1
s=s+Chr(PeekByte(bank,i))
Next
Return s
End Function
;~IDEal Editor Parameters:
;~F#E#4F#9D#AB#AF#D8#F6#12E#17B#184
;~C#Blitz3D |
Comments
| ||
| Here are a couple of more useful examples. This generates a small lexer for a simple QuakeC-like language: This generates a complete lexer for the C programming language, as described in the reference grammar at the back of K&R (ANSI C89, not including preprocessor): |
| ||
| Nice work! :) Dabz |
Code Archives Forum