Regex In Python

2013-07-17 13:14:00 +0000

Regex is a important tool when dealing text.
In python we have a library named re.
Most of the regular expression operations are available as module-level function and RegexObject methods.

import the lib

import re

re.compile(pattern, flag=0) >Compile a regular expression pattern into a regular expression object, which can be used for matching using its match() and search() methods, described below.

import re
text = 'I love China'
regexs = [re.compile(p)
		for p in ['love', 'll']
	]
for regex in regexs:
	print regex.pattern
	if regex.search(text):
		print "Match"
	else:
		print "Not Match"

re.search(pattern, string) >Scan through string looking for a location where the regular expression pattern produces a match, and return a corresponding MatchObject instance.Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.

#!/usr/bin/python
#encoding:utf-8
import re
text = 'I love China'
regexs = [re.compile(p)
		for p in ['love', 'll']
	 ]
for regex in regexs:
	print regex.pattern
	match = re.search(regex, text):
	if match:
		print "Match"
	else:
		print "Not Match"

RegexObject.search(string)

#!/usr/bin/python
#encoding:utf-8
import re
text = 'I love China'
pattern = 'love'
match = re.search(pattern, text)
s = match.start()
e = match.end()
print 'Found "%s"\nin "%s"\nfrom %d to %d ("%s")' % \
      (match.re.pattern, match.string, s, e, text[s:e])

re.match(pattern, srting) >If zero or more characters at the beginning of string match this regular expression, return a corresponding MatchObject instance. Return None if the string does not match the pattern; note that this is different from a zero-length match.

#!/usr/bin/python
#encoding:utf-8
import re
text = 'His phone 12345 number is 67890'
regex = re.compile(r'.*?(\d+).*?(\d+)')
match = re.match(regex, text)
if match: 
	print match.group(1),match.group(2),

RegexObject.match(string)

#!/usr/bin/python
#encoding:utf-8
import re
text = 'His phone 12345 number is 67890'
regex = re.compile(r'.*?(\d+).*?(\d+)')
match = regex.match(text)
if match: 
	print match.group(1), match.group(2),

Match object has many attributes:

Match more than one

re.findall(pattern, string)

#!/usr/bin/python
#encoding:utf-8
import re
text = 'this is a text'
pattern = 'is'
matchs = re.findall(pattern, text)
for match in matchs:
	print match
lo@ubuntu:~/try/regex$ python searchall.py 
is
is

re.finditer(pattern, string)

#!/usr/bin/python
#encoding:utf-8
import re
text = 'this is a text'
pattern = 'is'
matchs = re.finditer(pattern, text)
for match in matchs:
	s = match.start()
	e = match.end()
	print "Found ", text[s:e], "at: ", s, e
lo@ubuntu:~/try/regex$ python searchall.py 
Found  is at:  2 4
Found  is at:  5 7

Pattern Syntax Repetition {a, b}

   
* equivalent to {0,}
+ equivalent to {1,}
? equivalent to {0,1}
#!/usr/bin/python
#encoding: utf-8
import re
text = '101000111'
patterns = [
		'10',
		'10?',
		'10*',
		'10+',
		'10{3}',
		'10{1,3}'
		]

print 'orginal string: ', text

for pattern in patterns:
	matchs = re.finditer(pattern, text)
	for match in matchs:
		s = match.start()
		e = match.end()
		substr = text[s:e]
		print 'pattern: ', pattern,' Found: ', substr, 'at', s, e
@ubuntu:~/try/regex$ python repetition.py 
orginal string:  101000111
pattern:  10  Found:  10 at 0 2
pattern:  10  Found:  10 at 2 4
pattern:  10?  Found:  10 at 0 2
pattern:  10?  Found:  10 at 2 4
pattern:  10?  Found:  1 at 6 7
pattern:  10?  Found:  1 at 7 8
pattern:  10?  Found:  1 at 8 9
pattern:  10*  Found:  10 at 0 2
pattern:  10*  Found:  1000 at 2 6
pattern:  10*  Found:  1 at 6 7
pattern:  10*  Found:  1 at 7 8
pattern:  10*  Found:  1 at 8 9
pattern:  10+  Found:  10 at 0 2
pattern:  10+  Found:  1000 at 2 6
pattern:  10{3}  Found:  1000 at 2 6
pattern:  10{1,3}  Found:  10 at 0 2
pattern:  10{1,3}  Found:  1000 at 2 6

Character set

[a|b] [a-z] [0-9] [a-zA-Z]

#!/usr/bin/python
#encoding utf-8
import re
text = 'string is not 12324, 234 IS NOT STRING'
patterns =[
		'[a-z]+',
		'[A-Z]+',
		'[0-9]+',
		'[a-zA-Z]+',
		'[a-zA-Z0-9]+'
	]
print "orginal string: ", text

for pattern in patterns:
	print "pattern is: ", pattern
	matchs = re.findall(pattern, text)
	for match in matchs:
		print match
lo@ubuntu:~/try/regex$ python searchset.py 
orginal string:  string is not 12324, 234 IS NOT STRING
pattern is:  [a-z]+
string
is
not
pattern is:  [A-Z]+
IS
NOT
STRING
pattern is:  [0-9]+
12324
234
pattern is:  [a-zA-Z]+
string
is
not
IS
NOT
STRING
pattern is:  [a-zA-Z0-9]+
string
is
not
12324
234
IS
NOT
STRING

Greedy Or Non-greedy(minimal fashion)

Escape sequences #TODO | | | |–|—| |\d|any decimal digit| |\D|any character that is not a decimal digit| |\w|any ‘word’ character| |\W|any ‘non-word’ character| |\s|any whitespace character| |\S|any character that is not a whitespace character| Anchors | | | |–|—| |^ |the current match point is at the start of the subject string| |$ |the current match point is at the end of the subject string| |\b|word boundary| |\B|not a word boundary| |\A|start of subject (independent of multiline mode)| |\Z|end of subject or newline at end (independent of multiline mode)|

Reference:

python.org
regex php