counting number of each substring in array python

I have a string array for example [a_text, b_text, ab_text, a_text]. I would like to get the number of objects that contain each prefix such as ['a_', 'b_', 'ab_'] so the number of 'a_' objects would be 2.

string

[a_text, b_text, ab_text, a_text]

['a_', 'b_', 'ab_']

'a_'

so far I've been counting each by filtering the array e.g num_a = len(filter(lambda x: x.startswith('a_'), array)). I'm not sure if this is slower than looping through all the fields and incrementing each counter since I am filtering the array for each prefix I am counting. Are functions such as filter() faster than a for loop? For this scenario I don't need to build the filtered list if I use a for loop so that may make it faster.

num_a = len(filter(lambda x: x.startswith('a_'), array))

filter()

Also perhaps instead of the filter I could use list comprehension to make it faster?

filter

wait so your array only contains strings with 'a_' and 'b_' prefixes??
– Michael Ilie
2 days ago

It can contain any prefixes but let's say I only need to count the number of a_ and b_ and ab_
– mysticalstick
2 days ago

a_

b_

ab_

to answer your original question as to which is faster, the Unix time command in terminal reveals that as your data gets bigger, the filter() method gets slower, however the delay is not that big (second or two) especially considering that i ran tests with millions of strings.
– Michael Ilie
2 days ago

time

filter()

4 Answers
4

You can use collections.Counter with a regular expression (if all of your strings have prefixes):

collections.Counter

from collections import Counter arr = ['a_text', 'b_text', 'ab_text', 'a_text'] Counter([re.match(r'^.*?_', i).group() for i in arr])

Output:

Counter({'a_': 2, 'b_': 1, 'ab_': 1})

If not all of your strings have prefixes, this will throw an error, since re.match will return None. If this is a possibility, just add an extra step:

re.match

arr = ['a_text', 'b_text', 'ab_text', 'a_text', 'test'] matches = [re.match(r'^.*?_', i) for i in arr] Counter([i.group() for i in matches if i])

Output:

Counter({'a_': 2, 'b_': 1, 'ab_': 1})

what if they don't all have prefixes? would using this method of regex still work
– mysticalstick
2 days ago

I just updated for that possibility
– user3483203
2 days ago

I see! That is a very cool solution. I'm sorry I was probably not very clear, I have a set list of prefixes I am looking for. So in this example, I'm looking for some prefix = ['a_', 'b_']. So not every prefix that occurs. Is this still something I can regex?
– mysticalstick
2 days ago

prefix = ['a_', 'b_']

I would still recommend using this method, because the Counter object is a dictionary with O(1) lookup, so getting your desired values is trivial
– user3483203
2 days ago

Counter

O(1)

I see, thank you!
– mysticalstick
2 days ago

Another way would be to use a defaultdict() object. You just go over the whole list once and count each prefix as you encounter it by splitting at the underscore. You need to check the underscore exists, else the whole word will be taken as a prefix (otherwise it wouldn't distinguish between 'a' and 'a_a').

defaultdict()

'a'

'a_a'

from collections import defaultdict array = ['a_text', 'b_text', 'ab_text', 'a_text'] * 250000 def count_prefixes(arr): counts = defaultdict(int) for item in arr: if '_' in item: counts[item.split('_')[0] + '_'] += 1 return counts

The logic is similar to user3483203's answer, in that all prefixes are calculated in one pass. However, it seems invoking regex methods is a bit slower than simple string operations. But I also have to echo Michael's comment, in that the speed difference is insignificant for even 1 million items.

from timeit import timeit setup = """ from collections import Counter, defaultdict import re array = ['a_text', 'b_text', 'ab_text', 'a_text'] def with_defaultdict(arr): counts = defaultdict(int) for item in arr: if '_' in item: counts[item.split('_')[0] + '_'] += 1 return counts def with_counter(arr): matches = [re.match(r'^.*?_', i) for i in arr] return Counter([i.group() for i in matches if i]) """ for method in ('with_defaultdict', 'with_counter'): print(timeit('{}(array)'.format(method), setup=setup, number=1))

Timing results:

0.4836089063341265 1.3238173544676142

If I'm understanding what you're asking for, it seems like you really want to use Regular Expressions (Regex). They're built for just this sort of pattern-matching use. I don't know Python, but I do see that regular expressions are supported, so it's a matter of using them. I use this tool because it makes it easy to craft and test your regex.

you're right that I can do it with Regex. I'm not too familiar with python either but I'm mostly unsure of which method would be faster. I wonder if using regex is match faster than text.startswith()
– mysticalstick
2 days ago

text.startswith()

the op was asking specifically about python
– aydow
2 days ago

You could also try using str.partition() to extract the string before the separator and the separator, then just concatenate these two to form the prefix. Then you just have to check if this prefix exists in the prefixes set, and count them with collections.Counter():

str.partition()

collections.Counter()

from collections import Counter arr = ['a_text', 'b_text', 'ab_text', 'a_text'] prefixes = {'a_', 'b_', 'ab_'} counter = Counter() for word in arr: before, delim, _ = word.partition('_') prefix = before + delim if prefix in prefixes: counter[prefix] += 1 print(counter)

Which Outputs:

Counter({'a_': 2, 'b_': 1, 'ab_': 1})

By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

Search This Blog

Mgiyuk