counting number of each substring in array python


counting number of each substring in array python



I have a string array for example [a_text, b_text, ab_text, a_text]. I would like to get the number of objects that contain each prefix such as ['a_', 'b_', 'ab_'] so the number of 'a_' objects would be 2.


string


[a_text, b_text, ab_text, a_text]


['a_', 'b_', 'ab_']


'a_'



so far I've been counting each by filtering the array e.g num_a = len(filter(lambda x: x.startswith('a_'), array)). I'm not sure if this is slower than looping through all the fields and incrementing each counter since I am filtering the array for each prefix I am counting. Are functions such as filter() faster than a for loop? For this scenario I don't need to build the filtered list if I use a for loop so that may make it faster.


num_a = len(filter(lambda x: x.startswith('a_'), array))


filter()



Also perhaps instead of the filter I could use list comprehension to make it faster?


filter





wait so your array only contains strings with 'a_' and 'b_' prefixes??
– Michael Ilie
2 days ago





It can contain any prefixes but let's say I only need to count the number of a_ and b_ and ab_
– mysticalstick
2 days ago



a_


b_


ab_





to answer your original question as to which is faster, the Unix time command in terminal reveals that as your data gets bigger, the filter() method gets slower, however the delay is not that big (second or two) especially considering that i ran tests with millions of strings.
– Michael Ilie
2 days ago


time


filter()




4 Answers
4



You can use collections.Counter with a regular expression (if all of your strings have prefixes):


collections.Counter


from collections import Counter

arr = ['a_text', 'b_text', 'ab_text', 'a_text']
Counter([re.match(r'^.*?_', i).group() for i in arr])



Output:


Counter({'a_': 2, 'b_': 1, 'ab_': 1})



If not all of your strings have prefixes, this will throw an error, since re.match will return None. If this is a possibility, just add an extra step:


re.match


arr = ['a_text', 'b_text', 'ab_text', 'a_text', 'test']
matches = [re.match(r'^.*?_', i) for i in arr]
Counter([i.group() for i in matches if i])



Output:


Counter({'a_': 2, 'b_': 1, 'ab_': 1})





what if they don't all have prefixes? would using this method of regex still work
– mysticalstick
2 days ago





I just updated for that possibility
– user3483203
2 days ago





I see! That is a very cool solution. I'm sorry I was probably not very clear, I have a set list of prefixes I am looking for. So in this example, I'm looking for some prefix = ['a_', 'b_']. So not every prefix that occurs. Is this still something I can regex?
– mysticalstick
2 days ago


prefix = ['a_', 'b_']





I would still recommend using this method, because the Counter object is a dictionary with O(1) lookup, so getting your desired values is trivial
– user3483203
2 days ago


Counter


O(1)





I see, thank you!
– mysticalstick
2 days ago



Another way would be to use a defaultdict() object. You just go over the whole list once and count each prefix as you encounter it by splitting at the underscore. You need to check the underscore exists, else the whole word will be taken as a prefix (otherwise it wouldn't distinguish between 'a' and 'a_a').


defaultdict()


'a'


'a_a'


from collections import defaultdict

array = ['a_text', 'b_text', 'ab_text', 'a_text'] * 250000

def count_prefixes(arr):
counts = defaultdict(int)
for item in arr:
if '_' in item:
counts[item.split('_')[0] + '_'] += 1
return counts



The logic is similar to user3483203's answer, in that all prefixes are calculated in one pass. However, it seems invoking regex methods is a bit slower than simple string operations. But I also have to echo Michael's comment, in that the speed difference is insignificant for even 1 million items.


from timeit import timeit

setup = """
from collections import Counter, defaultdict
import re

array = ['a_text', 'b_text', 'ab_text', 'a_text']

def with_defaultdict(arr):
counts = defaultdict(int)
for item in arr:
if '_' in item:
counts[item.split('_')[0] + '_'] += 1
return counts

def with_counter(arr):
matches = [re.match(r'^.*?_', i) for i in arr]
return Counter([i.group() for i in matches if i])
"""

for method in ('with_defaultdict', 'with_counter'):
print(timeit('{}(array)'.format(method), setup=setup, number=1))



Timing results:


0.4836089063341265
1.3238173544676142



If I'm understanding what you're asking for, it seems like you really want to use Regular Expressions (Regex). They're built for just this sort of pattern-matching use. I don't know Python, but I do see that regular expressions are supported, so it's a matter of using them. I use this tool because it makes it easy to craft and test your regex.





you're right that I can do it with Regex. I'm not too familiar with python either but I'm mostly unsure of which method would be faster. I wonder if using regex is match faster than text.startswith()
– mysticalstick
2 days ago


text.startswith()





the op was asking specifically about python
– aydow
2 days ago



You could also try using str.partition() to extract the string before the separator and the separator, then just concatenate these two to form the prefix. Then you just have to check if this prefix exists in the prefixes set, and count them with collections.Counter():


str.partition()


collections.Counter()


from collections import Counter

arr = ['a_text', 'b_text', 'ab_text', 'a_text']

prefixes = {'a_', 'b_', 'ab_'}

counter = Counter()
for word in arr:
before, delim, _ = word.partition('_')
prefix = before + delim
if prefix in prefixes:
counter[prefix] += 1

print(counter)



Which Outputs:


Counter({'a_': 2, 'b_': 1, 'ab_': 1})






By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

Comments

Popular posts from this blog

paramiko-expect timeout is happening after executing the command

Export result set on Dbeaver to CSV

Opening a url is failing in Swift