Code archives/Algorithms/Simple regular expressions
This code has been declared by its author to be Public Domain code.
Download source code
| |||||
| Updated 21/07/2011: Completely rewritten due to old version being unsalvageable. This is my last attempt: if this turns out to be broken too, I'll abandon it and use PCRE. Regular expressions are an extremely powerful tool, that can be used in all manner of ways but mainly for matching strings against flexible keys. Useful for searching, parsing input, syntax highlighting, making a lexical-scanner-generator (yay!), etc. I have no prior experience with them so I just copied the example syntax and character classes on Wikipedia: you can use *, ?, +, ^ as negation (ie. within character classes only), and . for any character. \ escapes metacharacters where applicable (including itself), as well as n (Chr(13)), r (Chr(10)) and t (Chr(9)). Start- and end-of-line are also supported. You'll need to use some imagination to make this implementation even remotely efficient, or useful with the more fun tasks like syntax highlighting, but I think it works, more or less, even though it's an absolute mess. As given, the main function (RegEx_Process$() ) accepts two strings: the first is the expression, and the second is the string to try to match. It returns the longest matching substring (so a failed attempt returns nothing; if the string to match goes on beyond the end of the expression, only the matching part will be returned. eg. RegEx_Process("H(ä|ae?)ndel","HandelASDFGHB!") would return "Handel"). There is no built-in error checking, so incorrectly formatted expressions will probably crash. My assumption is that this will be used mostly in a context where you check the regexen thoroughly and keep them constant. Expressions that would involve backtracking (e.g. "a*a", "a?+") will always fail, as backtracking simply isn't implemented. In the vast majority of cases this isn't a problem if the assumption above (that you won't be letting users input arbitrary unchecked expressions) holds; at any rate, I don't need backtracking for any use I can think of for this, and nor should you </joke>. | |||||
;===============================================================================
;Regular expressions
;===============================================================================
Include "LList.bb" ;Get this Include from: http://www.blitzbasic.com/codearcs/codearcs.php?code=2873
Type RegEx_Node
Field nType, state
Field value$, valBank
Field optL.RegEx_Node, optR.RegEx_Node
Field branch.RegEx_Node
Field conc.RegEx_Node, root.RegEx_Node
End Type
Const REGEX_NTYPE_SIMPLE = -1, REGEX_NTYPE_ROOT = 1, REGEX_NTYPE_OR = 3, REGEX_NTYPE_STAR = 4
Const REGEX_NTYPE_PLUS = 5, REGEX_NTYPE_OPT = 6, REGEX_NTYPE_VALUE = 10, REGEX_NTYPE_FINAL = 12
Const REGEX_STATE_START = 1, REGEX_STATE_END = 2, REGEX_STATE_CIS = 4
Const REGEX_STATE_READY = 0, REGEX_STATE_OPT2 = 1, REGEX_STATE_DONE = 2;, REGEX_STATE_FAIL = -1
;Try to match a pattern against a string
Function RegEx_Process$(regIn$, inputString$)
Local oNode.RegEx_Node = RegEx_Parse(regIn), bk = RegEx_StrToBank(inputString)
Local char$ = RegEx_Match(oNode, bk, 0)
RegEx_Delete(oNode) : FreeBank bk
Return char
End Function
;Match an NFA against a character bank from the given offset
Function RegEx_Match$(root.RegEx_Node, inputStringBank, startPtr)
If root\nType = REGEX_NTYPE_SIMPLE Then Return RegEx_MatchSimple(root, inputStringBank, startPtr)
If (root\state And REGEX_STATE_START) ;If it's a line start pattern only...
If startPtr > 0
If PeekByte(inputStringBank, startPtr - 1) <> $0A Then Return ""
EndIf
EndIf
Local caseInsensitive = (root\state And REGEX_STATE_CIS) <> 0
Local cPtr = startPtr, endPtr = BankSize(inputStringBank)
Local nStack.LList = CreateList(), pStack.LList = CreateList()
Local node.RegEx_Node = root\branch
RegEx_SetState node, REGEX_STATE_READY
ListAddLast nStack, Handle(root) ;Push the current operator and pointer to their respective stacks
ListAddLast pStack, cPtr
Repeat
If node\nType = REGEX_NTYPE_FINAL Or node\nType = REGEX_NTYPE_ROOT ;We're done here
Exit
ElseIf node\nType = REGEX_NTYPE_VALUE ;Try to match a character
Local char$
If cPtr < endPtr
Local ch = PeekByte(inputStringBank, cPtr)
char = Chr(ch + (32 * caseInsensitive * (ch >= 65 And ch <= 90)))
EndIf
If (Instr(node\value, char) > 0) And (cPtr < endPtr) ;node\value is already adjusted for case sensitivity
cPtr = cPtr + 1
node = node\conc
Else
Repeat
node = Object.RegEx_Node(ListRemoveLast(nStack))
cPtr = ListRemoveLast(pStack)
If node\nType = REGEX_NTYPE_ROOT ;Total failure
Exit
ElseIf node\nType = REGEX_NTYPE_STAR Or node\nType = REGEX_NTYPE_OPT ;Allowed to fail
node = node\conc
Exit
ElseIf node\nType = REGEX_NTYPE_PLUS And node\state = REGEX_STATE_DONE ;Can't fail first time
node = node\conc
Exit
ElseIf node\nType = REGEX_NTYPE_OR And node\state = REGEX_STATE_READY
ListAddLast nStack, Handle(node) ;Stack it up and try again
ListAddLast pStack, cPtr
node\state = REGEX_STATE_OPT2
node = node\optR
Exit
EndIf
Forever
EndIf
Else ;It's an operator: zero state if complicated, then move to next position
ListAddLast nStack, Handle(node)
ListAddLast pStack, cPtr
If node\nType = REGEX_NTYPE_OR Or node\nType = REGEX_NTYPE_PLUS Then node\state = REGEX_STATE_READY
If node\nType = REGEX_NTYPE_OR Then node = node\optL : Else node = node\branch
EndIf
;If this happens, it means we're in a success state but the branch has run out
While node = Null
node = Object.RegEx_Node(ListRemoveLast(nStack)) ;Should always be non-null
ListRemoveLast pStack
If node\nType = REGEX_NTYPE_PLUS Or node\nType = REGEX_NTYPE_STAR
ListAddLast nStack, Handle(node) ;Restack it
ListAddLast pStack, cPtr
node\state = REGEX_STATE_DONE
node = node\branch
Else
node = node\conc
If node <> Null
ListRemoveLast pStack
ListAddLast pStack, cPtr ;Update previous operator with moved cursor
EndIf
EndIf
Wend
Forever
If node\nType <> REGEX_NTYPE_FINAL Then cPtr = startPtr ;Fail if it was too short to reach a mismatch
FreeList nStack ;Doesn't hold anything that needs to be freed here
FreeList pStack
Local c, match$ = ""
For c = startPtr To (cPtr - 1)
match = match + Chr(PeekByte(inputStringBank, c))
Next
If (root\state And REGEX_STATE_END) ;If it's a line end pattern only
If startPtr + Len(match) < BankSize(inputStringBank)
If PeekByte(inputStringBank, startPtr + Len(match)) <> $0A Then match = ""
EndIf
EndIf
Return match
End Function
;Quicker test for simple patterns
Function RegEx_MatchSimple$(node.RegEx_Node, inputStringBank, startPtr)
Local caseInsensitive, i, c, inputString$
;If the remaining string is shorter than the pattern, it can't match
If BankSize(inputStringBank) < startPtr + BankSize(node\valBank) Then Return ""
If (node\state And REGEX_STATE_START) ;If it's a line start pattern only...
If startPtr > 1
If PeekShort(inputStringBank, startPtr - 2) <> $A0D Then Return ""
Else
If startPtr = 1 Then Return ""
EndIf
EndIf
caseInsensitive = node\state And REGEX_STATE_CIS
Local pLimit = BankSize(node\valBank) - 1
For i = 0 To pLimit
c = PeekByte(inputStringBank, startPtr + i)
If caseInsensitive
If (c >= 65 And c <= 90) Then c = c + 32 ;Force lower for case-insensitive comparison
EndIf
If c <> PeekByte(node\valBank, i) Then Return "" : Else inputString = inputString + Chr(c)
Next
If (node\state And REGEX_STATE_END) ;If it's a line end pattern only
i = BankSize(node\valBank)
If startPtr + i < BankSize(inputStringBank)
If startPtr + i = BankSize(inputStringBank) - 1
Return ""
Else
If PeekShort(inputStringBank, startPtr + i) <> $A0D Then Return ""
EndIf
EndIf
EndIf
Return inputString
End Function
;Parse a regular expression into a tree of nodes
Function RegEx_Parse.RegEx_Node(regIn$, caseSensitive = True)
Local lineStart, lineEnd
If Left(regIn, 1) = "^" ;This pattern may only match the start of a line
regIn = Mid(regIn, 2)
lineStart = True
EndIf
If Right(regIn, 1) = "$" ;This pattern may only match the end of a line
If Right(regIn,2) <> "\$"
regIn = Left(regIn, Len(regIn) - 1)
lineEnd = True
EndIf
EndIf
;If the Regex doesn't use any operators or groups, return a "simple Regex" instead for fast matching
If RegEx_IsSimple(regIn) Then Return RegEx_CreateSimple(regIn, lineStart, lineEnd, caseSensitive)
Local n.RegEx_Node = New RegEx_Node
n\nType = REGEX_NTYPE_ROOT
n\value = regIn ;May as well store it in case required for reference
If lineStart Then n\state = n\state Or REGEX_STATE_START
If lineEnd Then n\state = n\state Or REGEX_STATE_END
If (Not caseSensitive) Then n\state = n\state Or REGEX_STATE_CIS
n\branch = RegEx_ParseBranch(regIn, n, caseSensitive)
n\conc = New RegEx_Node
n\conc\nType = REGEX_NTYPE_FINAL
Return n
End Function
;Check if a Regex contains any operators - if not, it's simple
Function RegEx_IsSimple(regIn$)
Local i, char
For i = 1 To Len(regIn)
char = Asc(Mid(regIn, i, 1))
Select char
Case 40, 42, 43, 46, 63, 91, 92, 124 ;(*+.?[\|
Return False
End Select
Next
Return True
End Function
;Create a simple regex object for faster comparison of plain string keys
Function RegEx_CreateSimple.RegEx_Node(regIn$, lineStart, lineEnd, caseSensitive)
Local n.RegEx_Node = New RegEx_Node
n\nType = REGEX_NTYPE_SIMPLE
If lineStart Then n\state = REGEX_STATE_START
If lineEnd Then n\state = n\state Or REGEX_STATE_END
If caseSensitive = False
n\state = n\state Or REGEX_STATE_CIS
regIn = Lower(regIn)
EndIf
n\valBank = RegEx_StrToBank(regIn)
n\value = regIn
Return n
End Function
;Split am input Regex into the parts to the left and to the right of the first Or operator
Function RegEx_Split(regIn$, out$[1])
Local cPtr, char$, esc, inClass, pDepth
While cPtr < Len(regIn)
cPtr = cPtr + 1
char = Mid(regIn, cPtr, 1)
If esc ;Current character is escaped - doesn't affect anything for our purposes
esc = False
Else
If char = "\"
esc = True
ElseIf char = "["
inClass = inClass + 1
ElseIf char = "]" And inClass > 0
inClass = inClass - 1
ElseIf inClass = 0
If char = "("
pDepth = pDepth + 1
ElseIf char = ")" And pDepth > 0
pDepth = pDepth - 1
ElseIf char = "|" And pDepth = 0
Exit
EndIf
EndIf
EndIf
Wend
If char = "|"
out[0] = Left(regIn, cPtr - 1)
out[1] = Mid(regIn, cPtr + 1)
Return True
Else
out[0] = regIn
Return False
EndIf
End Function
;Parse a Regex with choice operators
Function RegEx_ParseBranch.RegEx_Node(regIn$, root.RegEx_Node, caseSensitive)
Local rBranch$[1], node.RegEx_Node
If RegEx_Split(regIn, rBranch)
node = New RegEx_Node
node\nType = REGEX_NTYPE_OR
node\optL = RegEx_ParseLinear(rBranch[0], node, caseSensitive)
node\optR = RegEx_ParseBranch(rBranch[1], node, caseSensitive) ;This recursion may cause problems with long regexen
node\root = root
node\conc = Null ;May have been set by first value
Return node
Else
Return RegEx_ParseLinear(rBranch[0], root, caseSensitive)
EndIf
End Function
;Parse a Regex consiting only of postfix operators and characters/classes
Function RegEx_ParseLinear.RegEx_Node(regIn$, root.RegEx_Node, caseSensitive)
Local cPtr, char$, esc, head.RegEx_Node, tail.RegEx_Node = root
While cPtr < Len(regIn)
cPtr = cPtr + 1
char = Mid(regIn, cPtr, 1)
If esc ;Create a simple value node for escaped characters
Select char
Case "n" : char = Chr(10)
Case "r" : char = Chr(13)
Case "t" : char = Chr(9)
End Select
esc = False
tail = RegEx_ConcatValue(tail, char, caseSensitive) ;Add char as a single character concatenation
Else
Select char ;Note that char should never be "|" here
Case "\"
esc = True
Case "("
Local subExpr$ = RegEx_ScanSubExpression(Mid(regIn, cPtr))
tail\conc = RegEx_ParseBranch(subExpr, tail, caseSensitive) ;Note mutual recusrion - avoid extreme depth
tail = tail\conc
cPtr = cPtr + Len(subExpr) + 1 ;subExpr doesn't include the enclosing parens
Case "["
Local class$ = RegEx_ScanCharacterClass(Mid(regIn, cPtr))
tail = RegEx_ConcatValue(tail, RegEx_GetCharacterClass(class), caseSensitive)
cPtr = cPtr + Len(class) + 1 ;As above
Case "*"
tail = RegEx_AppendPostfix(tail, REGEX_NTYPE_STAR)
Case "+"
tail = RegEx_AppendPostfix(tail, REGEX_NTYPE_PLUS)
Case "?"
tail = RegEx_AppendPostfix(tail, REGEX_NTYPE_OPT)
Case "."
tail = RegEx_ConcatValue(tail, RegEx_GetCharacterClass("..."), caseSensitive)
Default
tail = RegEx_ConcatValue(tail, char, caseSensitive) ;Add char as a single character concatenation
End Select
EndIf
If head = Null
If tail <> root Then head = tail
ElseIf head = tail\branch
head = tail
EndIf
Wend
Return head
End Function
;Attach a value node to another node as following it
Function RegEx_ConcatValue.RegEx_Node(root.RegEx_Node, val$, caseSensitive)
Local v.RegEx_Node = New RegEx_Node
v\nType = REGEX_NTYPE_VALUE
If (Not caseSensitive) Then val = Lower(val)
v\value = val
v\root = root
If root <> Null Then root\conc = v
Return v
End Function
;Append a postfix operator of the given type to a linear node chain
Function RegEx_AppendPostfix.RegEx_Node(tail.RegEx_Node, nType)
Local n.RegEx_Node = New RegEx_Node
n\nType = nType
If tail\root <> Null Then tail\root\conc = n ;Replace the tail node on the line
tail\root = n
n\branch = tail ;And attach it to this node as a branch
Return n
End Function
;Read a subexpression from within parentheses (always starts on a "(")
Function RegEx_ScanSubExpression$(reg$)
Local cPtr, cDepth, pDepth, char$, esc
While cPtr < Len(reg)
cPtr = cPtr + 1
char = Mid(reg, cPtr, 1)
If esc
esc = False
Else
Select char
Case "\"
esc = True
Case "["
cDepth = cDepth + 1
Case "]"
cDepth = cDepth - 1
Case "("
If cDepth = 0 Then pDepth = pDepth + 1
Case ")"
If cDepth = 0 Then pDepth = pDepth - 1
End Select
EndIf
If pDepth = 0 Then Return Mid(reg, 2, cPtr - 2)
Wend
End Function
;Read a character class from a string (always starts on a "[")
Function RegEx_ScanCharacterClass$(reg$)
Local cPtr, cDepth, char$, esc
While cPtr < Len(reg)
cPtr = cPtr + 1
char = Mid(reg, cPtr, 1)
If esc
esc = False
Else
Select char
Case "\"
esc = True
Case "["
cDepth = cDepth + 1
Case "]"
cDepth = cDepth - 1
End Select
EndIf
If cDepth = 0 Then Return Mid(reg, 2, cPtr - 2)
Wend
End Function
;Returns a full character list from a character class
Function RegEx_GetCharacterClass$(cClass$)
Local invert, cList$, char$, cPtr, i
If cClass = "..." ;Not an expected input class - this is shorthand for "any character"
For i = 0 To 127
cList = cList + Chr(i)
Next
Return cList
EndIf
If Left(cClass, 1) = ":" And Right(cClass, 1) = ":" ;A POSIX-style named class
cClass = Lower(Mid(cClass, 2, Len(cClass) - 2))
If Left(cClass, 1) = "^"
invert = True
cClass = Mid(cClass, 2)
EndIf
Select cClass
Case "alnum" : cClass = "A-Za-z0-9"
Case "word" : cClass = "A-Za-z0-9_"
Case "alpha" : cClass = "A-Za-z"
Case "blank" : cClass = " " + Chr(9)
Case "cntrl" : cClass = Chr(0) + "-" + Chr($1F) + Chr($7F)
Case "digit" : cClass = "0-9"
Case "graph" : cClass = Chr($21) + "-" + Chr($7E)
Case "lower" : cClass = "a-z"
Case "print" : cClass = Chr($20) + "-" + Chr($7E)
Case "punct" : cClass = "-!" + Chr(34) + "#$%&'()*+,./:;<=>?@\[\\\]_`{|}~"
Case "space" : cClass = " " + Chr(13) + Chr(12) + Chr(11) + Chr(10) + Chr(9)
Case "upper" : cClass = "A-Z"
Case "xdigit": cClass = "A-Fa-f0-9"
Default : cClass=""
End Select
If invert Then cClass="^"+cClass
Else ;A negative class
If Left(cClass, 1) = "^"
invert = True
cClass = Mid(cClass, 2)
EndIf
EndIf
For cPtr = 1 To Len(cClass)
char = Mid(cClass, cPtr, 1)
If char = "\" ;Add escaped characters to the class directly
cPtr = cPtr + 1
char = Mid(cClass, cPtr, 1)
Select char
Case "n" : char = Chr(10)
Case "r" : char = Chr(13)
Case "t" : char = Chr(9)
End Select
cList = cList + char
ElseIf Mid(cClass, cPtr + 1, 1) = "-" And cPtr < (Len(cClass) - 1) ;Expand any character ranges
For i = Asc(char) To Asc(Mid(cClass, cPtr + 2, 1))
cList = cList + Chr(i)
Next
cPtr = cPtr + 2
Else
If char = "[" ;Expand nested classes
Local subClass$ = "", cDepth = 1
Repeat
cPtr = cPtr + 1
char = Mid(cClass, cPtr, 1)
If char = "\"
cPtr = cPtr + 1
char = Mid(cClass, cPtr, 1)
subClass = subClass + "\"
ElseIf char = "["
cDepth = cDepth + 1
ElseIf char = "]"
cDepth = cDepth - 1
If cDepth = 0 Then Exit
EndIf
subClass = subClass + char
Forever
char = RegEx_GetCharacterClass(subClass)
EndIf
cList = cList + char
EndIf
Next
If invert
cClass = cList
cList = ""
For i = 0 To 127
If Not Instr(cClass, Chr(i)) Then cList = cList + Chr(i)
Next
EndIf
Return cList
End Function
;Set the state of a node and all child nodes to the given value
Function RegEx_SetState(node.RegEx_Node, state)
If node\branch <> Null Then RegEx_SetState node\branch, state
If node\optL <> Null Then RegEx_SetState node\optL, state
If node\optR <> Null Then RegEx_SetState node\optR, state
If node\conc <> Null Then RegEx_SetState node\conc, state
node\state = state
End Function
;Delete a node and all branch/following nodes
Function RegEx_Delete(node.RegEx_Node)
If node\branch <> Null Then RegEx_Delete node\branch
If node\optL <> Null Then RegEx_Delete node\optL
If node\optR <> Null Then RegEx_Delete node\optR
If node\conc <> Null Then RegEx_Delete node\conc
If node\nType = REGEX_NTYPE_SIMPLE Then FreeBank node\valBank
Delete node
End Function
;Utility - return a bank containing the binary value of the given string
Function RegEx_StrToBank(s$)
Local i, bank = CreateBank(Len(s))
For i = 0 To Len(s) - 1
PokeByte bank, i, Asc(Mid(s, i + 1, 1))
Next
Return bank
End Function
;Utility - return a string containing the ASCII value of the given bank
Function RegEx_BankToStr$(bank)
Local i, s$
For i = 0 To BankSize(bank) - 1
s = s + Chr(PeekByte(bank, i))
Next
Return s
End Function
;===============================================================================
;~IDEal Editor Parameters:
;~F#9#19#21#93#BA#DD#EA#F9#126#139#177#182#18C#1A9#1C2#22B#234#23E#247
;~C#Blitz3D |
Comments
None.
Code Archives Forum