Remove all -es/-s/-e/-x suffixes that follow 4 or more characters

I am trying to delete all word suffixes -es, -s, -e or -x of all words that have at least 4 characters after removing the suffix, using regex in Python.

Here are some examples of the desired output (in French):

I tried to implement this as shown below, but I do not find it very efficient.

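    import re

    def _stem_reg(word):
        # One alternative per suffix; each keeps a stem of at least 4 word characters.
        pattern = r"(\w{4,})(es$)|(\w{4,})(s$)|(\w{4,})(e$)|(\w{4,})(x$)"
        found = re.match(pattern, word)
        if found is not None:
            # The first non-None group is the captured stem.
            return next(group for group in found.groups() if group is not None)
        else:
            return word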

What exactly do you mean by "not very efficient"?
– Thierry Lathuille
Jun 27 at 16:48

Try re.sub(r'\b(\w{4,})(?:e?s|[ex])\b', r'\1', s)
– Wiktor Stribiżew
Jun 27 at 16:59
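For anyone who wants to try the suggestion in this comment, here is a minimal sketch. The helper name strip_suffixes and the sample words are my own assumptions, not part of the comment. Because the capture group is greedy, it seems to take as many characters as it can, so a word ending in -es only loses its final s.

```python
import re

# Pattern from the comment above, with the backslashes written out.
PATTERN = re.compile(r"\b(\w{4,})(?:e?s|[ex])\b")

def strip_suffixes(text):
    # Hypothetical helper: replace every matching word with its captured stem.
    return PATTERN.sub(r"\1", text)

# "jeux" is left alone because only 3 characters would remain.
print(strip_suffixes("tables maisons chevaux jeux"))  # table maison chevau jeux

# The greedy group keeps the "e" and drops only the trailing "s".
print(strip_suffixes("feuilletées"))
```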

What about the accent sign in sièges?
– Dominique
Jun 29 at 9:09

@Dominique - the Unicode pattern [^\W\d_] also matches accented characters such as è.
– Ωmega
Jun 29 at 11:48

2 Answers

Assuming

txt = "your input string"

You can use:

re.sub(r"\b([^\W\d_]{4,})(?:(?<=...[^e])s|(?<=^...e)s|es|e|x)\b", r'\1', txt, flags=re.U)

Test this regex pattern here.
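Here is a small sketch of how the pattern above might be applied. The helper name stem and the sample words are assumptions for illustration. I call it on one word at a time, as in the question, because the second lookbehind contains ^ and so can only succeed at the start of the string.

```python
import re

# Pattern from the answer above; the stem may contain letters only.
PATTERN = re.compile(
    r"\b([^\W\d_]{4,})(?:(?<=...[^e])s|(?<=^...e)s|es|e|x)\b",
    flags=re.U,
)

def stem(word):
    # Hypothetical helper: strip one suffix from a single word.
    return PATTERN.sub(r"\1", word)

for word in ("robes", "tables", "maisons", "feuilletées", "jeux", "X7_qs"):
    print(word, "->", stem(word))
```

With these inputs, "robes" should lose only its s (removing -es would leave 3 letters), "tables" should lose -es, and "jeux" and "X7_qs" should come back unchanged.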

Try this: ^(\w{4,}?)(?:es|s|e|x)$

    import re

    word = "feuilletées"
    output = re.sub(r"^(\w{4,}?)(?:es|s|e|x)$", r'\1', word)

(\w{4,}?)

(?:es|s|e|x)
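To see why the lazy quantifier in the first group matters, here is a rough comparison with a greedy variant of the same pattern. The GREEDY pattern and the word list are my own additions for illustration and are not part of the answer.

```python
import re

LAZY = re.compile(r"^(\w{4,}?)(?:es|s|e|x)$")   # pattern from the answer
GREEDY = re.compile(r"^(\w{4,})(?:es|s|e|x)$")  # same pattern with a greedy group

for word in ("feuilletées", "tables", "robes", "jeux"):
    # Prints the word, the lazy result, then the greedy result.
    print(word, LAZY.sub(r"\1", word), GREEDY.sub(r"\1", word))
```

The lazy group stops as soon as a suffix fits, so it prefers the longest removable suffix (-es before -s). The greedy group keeps as much as possible and strips only the last character.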

The pattern (\w{4,}?) matches not just letters but also digits and underscores, so it would also match a string such as X7_q.
– Ωmega
Jun 29 at 11:51
