Remove all -es/-s/-e/-x suffixes that follow 4 or more characters
I am trying to remove the suffixes -es, -s, -e, or -x from all words that still have at least 4 characters after the suffix is removed, using regex in Python.
Here are some examples of the desired output (in French):
I tried the implementation shown below, but I do not find it very efficient.
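import re

def _stem_reg(word):
    # Each alternative captures a stem of 4+ word characters followed by one suffix.
    pattern = r"(\w{4,})(es$)|(\w{4,})(s$)|(\w{4,})(e$)|(\w{4,})(x$)"
    found = re.match(pattern, word)
    if found is not None:
        # The first non-None group is the stem of the alternative that matched.
        return next(group for group in found.groups() if group is not None)
    else:
        return word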
Try re.sub(r'\b(\w{4,})(?:e?s|[ex])\b', r'\1', s)
– Wiktor Stribiżew Jun 27 at 16:59
What about the accent sign in sièges?
– Dominique Jun 29 at 9:09
@Dominique - the Unicode pattern [^\W\d_] also matches accented characters such as è.
– Ωmega Jun 29 at 11:48
2 Answers
Assuming
txt = "your input string"
You can use:
re.sub(r"\b([^\W\d_]{4,})(?:(?<=...[^e])s|(?<=^...e)s|es|e|x)\b", r'\1', txt, flags=re.U)
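To illustrate applying a substitution to a whole string rather than a single word, here is a minimal sketch. Note that it uses a simplified, hypothetical pattern (letters only, lazy stem, word boundaries), not the exact pattern from this answer, and the names `SUFFIX_IN_TEXT` and `strip_suffixes` are made up for the example. In Python 3, `str` patterns are Unicode-aware by default, so the `re.U` flag is left out.

```python
import re

# Simplified, hypothetical variant (NOT the exact pattern from the answer):
# [^\W\d_] matches letters only, the stem is lazy, and \b anchors on word
# boundaries so the pattern can be applied to a whole sentence.
SUFFIX_IN_TEXT = re.compile(r"\b([^\W\d_]{4,}?)(?:es|s|e|x)\b")

def strip_suffixes(text):
    """Strip -es/-s/-e/-x from every word whose remaining stem has 4+ letters."""
    return SUFFIX_IN_TEXT.sub(r"\1", text)

print(strip_suffixes("les feuilletées sièges"))  # les feuilleté sièg
```

The short word "les" is left alone because no stem of 4 letters can be found before the suffix.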
Try this: ^(\w{4,}?)(?:es|s|e|x)$
import re
word = "feuilletées"
output = re.sub(r"^(\w{4,}?)(?:es|s|e|x)$", r'\1', word)
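(\w{4,}?) lazily captures at least 4 word characters (the stem).
(?:es|s|e|x) is a non-capturing group that matches one of the suffixes.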
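A small sketch of why the lazy quantifier matters here: with a greedy stem, the regex engine keeps as many characters as possible in the stem, so only the final "s" of an "-es" ending is removed. The names `LAZY` and `GREEDY` are just labels for this comparison; the greedy variant is a made-up contrast, not something proposed in the thread.

```python
import re

LAZY = re.compile(r"^(\w{4,}?)(?:es|s|e|x)$")    # pattern from this answer
GREEDY = re.compile(r"^(\w{4,})(?:es|s|e|x)$")   # same pattern without the lazy "?"

word = "feuilletées"
print(LAZY.sub(r"\1", word))    # feuilleté  (the whole "es" suffix is removed)
print(GREEDY.sub(r"\1", word))  # feuilletée (only the final "s" is removed)
print(LAZY.sub(r"\1", "mes"))   # mes (the stem would be shorter than 4 characters)
```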
The pattern (\w{4,}?) would match not just letters but also digits and underscores, so for example it would match the string X7_q.
– Ωmega Jun 29 at 11:51
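The difference described in this comment can be checked with a short sketch. The second pattern swaps `\w` for the letters-only class `[^\W\d_]` mentioned in the comments above; the names `WORD_CHARS` and `LETTERS_ONLY` and the token `X7_qs` are assumptions made for the example.

```python
import re

WORD_CHARS = re.compile(r"^(\w{4,}?)(?:es|s|e|x)$")          # letters, digits, underscore
LETTERS_ONLY = re.compile(r"^([^\W\d_]{4,}?)(?:es|s|e|x)$")  # letters only, accents included

for token in ["X7_qs", "sièges"]:
    # Print the token, then the result of each pattern applied to it.
    print(token, WORD_CHARS.sub(r"\1", token), LETTERS_ONLY.sub(r"\1", token))
# X7_qs X7_q X7_qs
# sièges sièg sièg
```

With `\w` the non-word token loses its trailing "s", while the letters-only class leaves it untouched and still handles accented French words.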
What exactly do you mean by "not very efficient"?
– Thierry Lathuille Jun 27 at 16:48