Remove all -es/-s/-e/-x suffixes that follow 4 or more characters





I am trying to use a regex in Python to strip the suffix -es, -s, -e or -x from every word, provided that at least 4 characters remain once the suffix is removed.





There are some examples of desired output (in French):



My attempt is shown below, but I do not find it very efficient.


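import re

def _stem_reg(word):
    # Each alternative captures a stem of 4+ word characters before one suffix.
    pattern = r"(\w{4,})(es$)|(\w{4,})(s$)|(\w{4,})(e$)|(\w{4,})(x$)"
    found = re.match(pattern, word)

    if found is not None:
        # The first non-None group is the stem captured by the matching alternative.
        return next(group for group in found.groups() if group is not None)
    else:
        return word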





What exactly do you mean by "not very efficient"?
– Thierry Lathuille
Jun 27 at 16:48





Try re.sub(r'\b(\w{4,})(?:e?s|[ex])\b', r'\1', s)
– Wiktor Stribiżew
Jun 27 at 16:59







What about the accent sign in sièges?
– Dominique
Jun 29 at 9:09







@Dominique - the Unicode-aware pattern [^\W\d_] also matches accented characters such as è.
– Ωmega
Jun 29 at 11:48






2 Answers



Assuming


txt = "your input string"



You can use:


re.sub(r"\b([^\W\d_]{4,})(?:(?<=...[^e])s|(?<=^...e)s|es|e|x)\b", r'\1', txt, flags=re.U)



Test this regex pattern here.
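As a rough illustration of the letters-only idea, here is a minimal sketch. It is a simplified variant, not the exact regex above: it drops the lookbehind alternatives and uses a lazy stem instead. The pattern, the function name strip_suffixes and the sample sentence are assumptions made for this example.

```python
import re

# Assumed, simplified pattern: a lazy stem of at least four Unicode letters
# ([^\W\d_] excludes digits and the underscore), then one suffix, then a
# word boundary so that only word endings are stripped.
SUFFIX_RE = re.compile(r"\b([^\W\d_]{4,}?)(?:es|s|e|x)\b")


def strip_suffixes(text):
    """Strip -es/-s/-e/-x from each word whose remaining stem has 4+ letters."""
    return SUFFIX_RE.sub(r"\1", text)


# Short words such as "les" and "des" are left alone because no
# four-letter stem would remain.
print(strip_suffixes("les sièges des chevaux"))  # les sièg des chevau
```

Compiling the pattern once and reusing it keeps the per-call work small when the function is applied to many strings.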



Try this: ^(\w{4,}?)(?:es|s|e|x)$


import re

word = "feuilletées"
output = re.sub(r"^(\w{4,}?)(?:es|s|e|x)$", r'\1', word)


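(\w{4,}?) lazily captures a stem of at least four word characters, and (?:es|s|e|x) matches one of the suffixes without capturing it.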
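To see why the lazy quantifier matters here, the following sketch compares it with a greedy version of the same anchored pattern. The names GREEDY, LAZY and stem are hypothetical and exist only for this comparison.

```python
import re

# Same anchored pattern twice; only the quantifier on the stem differs.
GREEDY = re.compile(r"^(\w{4,})(?:es|s|e|x)$")
LAZY = re.compile(r"^(\w{4,}?)(?:es|s|e|x)$")


def stem(word, pattern):
    """Return the captured stem, or the word unchanged if nothing matches."""
    return pattern.sub(r"\1", word)


# The greedy stem keeps as much as it can, so only the final "s" is removed.
print(stem("tables", GREEDY))  # table
# The lazy stem stops as early as possible, so the longer "es" is removed.
print(stem("tables", LAZY))    # tabl
# Fewer than four characters would remain, so the word is left unchanged.
print(stem("les", LAZY))       # les
```

With the lazy form, "feuilletées" loses the whole "es", whereas the greedy form would strip only the trailing "s".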





The pattern (\w{4,}?) matches not only letters but also digits and underscores, so it would also match a string such as X7_q.
– Ωmega
Jun 29 at 11:51
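A small sketch of the difference, assuming Python 3, where str patterns are Unicode-aware by default. The names WORD_CHARS, LETTERS_ONLY and is_all_letters are made up for this example.

```python
import re

WORD_CHARS = re.compile(r"\w+")           # letters, digits and underscore
LETTERS_ONLY = re.compile(r"[^\W\d_]+")   # letters only, accents included


def is_all_letters(s):
    """True if the whole string consists of letters only."""
    return LETTERS_ONLY.fullmatch(s) is not None


for sample in ("X7_q", "siège"):
    # Columns: sample, matches \w+ entirely, matches the letters-only class entirely
    print(sample, WORD_CHARS.fullmatch(sample) is not None, is_all_letters(sample))
# X7_q True False
# siège True True
```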







