
Get data between 1st and 2nd pipe in python

Tags:

python

regex

This is my sample data

78|Indonesia|Pamela|Reid|[email protected]|147.3.67.193

I want to get the result as

Indonesia

Currently I am using split on the string and accessing the value. But I want to use regex for it.

Some conditions to be aware of:

- The field may be empty
- The data will NOT contain a pipe (|)

I want to use regex instead of split because I think a regex will be more efficient. The reason I want this to be as fast as possible is that the source file is 70 GB.
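For reference, both approaches can be sketched on the sample line (the regex here is one illustrative way to capture the second field, not the only one):

```python
import re

line = "78|Indonesia|Pamela|Reid|[email protected]|147.3.67.193"

# split with maxsplit=2 stops after the second pipe,
# so the rest of the line is never scanned
country = line.split("|", 2)[1]

# equivalent regex: capture everything between the first two pipes
pattern = re.compile(r"^[^|]*\|([^|]*)\|")
m = pattern.search(line)
country_re = m.group(1) if m else None

print(country, country_re)  # both give "Indonesia"
```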

EDIT:

This is the whole code in which I will be using this

def main(argv):
    mylist = set()
    input_file = open("test.txt", 'r')

    for row in input_file:
        rowsplit = row.split("|")

        if rowsplit[1] != '':
            if rowsplit[1] in mylist:
                filename = "bby_" + rowsplit[1] + ".dat"
                existingFile = open(filename, 'a')
                existingFile.write(row)
                existingFile.close()
            else:
                mylist.add(rowsplit[1])
                filename = "bby_" + rowsplit[1] + ".dat"
                newFile = open(filename, 'a')
                newFile.write(row)
                newFile.close()
        else:
            print("Empty")
    print(mylist)

I'm just confused about which of the answers I should now use :(

I just want this code to be fast. That's it.

asked Dec 05 '25 by v1shnu

2 Answers

Here is the performance of the meaningful answers on Python 3.4.3:

In [4]: timeit.timeit('s.split("|", 2)[1]', 's = "78|Indonesia|Pamela|Reid|[email protected]|147.3.67.193"')
Out[4]: 0.43930888699833304

In [10]: timeit.timeit('re.search(r"^[^a-zA-Z]*([a-zA-Z]+)", s).group(1)', 's = "78|Indonesia|Pamela|Reid|[email protected]|147.3.67.193"; import re')
Out[10]: 1.234878903022036

In [16]: timeit.timeit('re.search(r"^\d*\|(\w+?)?\|", s).group(1)', 's = "78|Indonesia|Pamela|Reid|[email protected]|147.3.67.193"; import re')
Out[16]: 1.8305770770530216

If there are no pipes in the line:

In [24]: timeit.timeit('s.split("|", 2)[1] if "|" in s else None', 's = "78|Indonesia|Pamela|Reid|[email protected]|147.3.67.193"')
Out[24]: 0.494665392965544

In [25]: timeit.timeit('s.split("|", 2)[1] if "|" in s else None', 's =  ""')
Out[25]: 0.04492994397878647
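Note that these timings recompile (or at least re-fetch from `re`'s internal cache) the pattern on every call. Precompiling with `re.compile` is a variation not measured in the original benchmark; a sketch of how you could compare the two yourself:

```python
import re
import timeit

line = "78|Indonesia|Pamela|Reid|[email protected]|147.3.67.193"
pat = re.compile(r"^[^|]*\|([^|]*)\|")

# compiled pattern skips the re-module cache lookup on every call
t_compiled = timeit.timeit(lambda: pat.search(line).group(1), number=100_000)
t_module = timeit.timeit(
    lambda: re.search(r"^[^|]*\|([^|]*)\|", line).group(1), number=100_000
)
print(t_compiled, t_module)
```

Precompiling typically narrows the gap to `str.split`, though split tends to stay ahead for a task this simple.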
answered Dec 06 '25 by Yaroslav Admin

Splitting and checking the length may still be faster than a regex:

for line in f:
    spl = line.split("|",2)
    if len(spl) > 2:
        print(spl[1])
       ....

Some timings on matching and non-matching lines:

In [24]: s = "78|Indonesia|Pamela|Reid|[email protected]|147.3.67.193"

In [25]: %%timeit                                                        
    spl = s.split("|",2)
    if len(spl) > 2:
        pass
   ....: 
1000000 loops, best of 3: 413 ns per loop

In [26]: r = re.compile(r'(?<=\|)[^|]*')

In [27]: timeit r.search(s)                                            
1000000 loops, best of 3: 452 ns per loop

In [28]: s = "78 Indonesia Pamela Reid [email protected] 147.3.67.193"

In [29]: timeit r.search(s)
1000000 loops, best of 3: 1.66 µs per loop

In [30]: %%timeit                       
    spl = s.split("|",2)
    if len(spl) > 2:
        pass
   ....: 
1000000 loops, best of 3: 342 ns per loop

You can shave a bit more off by creating a local reference to str.split:

_spl = str.split
for line in f:
    spl = _spl(line, "|", 2)
    if len(spl) > 2:
      .....

Since there is always the same number of pipes in each line, you could also use csv.reader:

import csv

def main(argv):
    seen = set()  # only needed if you actually want a set of all names
    with open("test.txt", 'r') as infile:
        r = csv.reader(infile, delimiter="|")
        for row in r:
            v = row[1]
            if v:
                filename = "bby_" + v + ".dat"
                existingFile = open(filename, 'a')
                # csv.reader yields a list, so rejoin before writing
                existingFile.write("|".join(row) + "\n")
                existingFile.close()
                seen.add(v)
            else:
                print("Empty")

The if/else seems redundant, as you append to the file either way. If you want to keep a set of the row[1] values for another reason, you can just add to the set each time; unless you actually need a set of all the names, I would remove it from the code.

Applying the same logic to split:

def main(argv):
    seen = set()
    with open("test.txt", 'r') as infile:
        _spl = str.split
        for row in infile:
            v = _spl(row, "|", 2)[1]
            if v:
                filename = "bby_" + v + ".dat"
                existingFile = open(filename, 'a')
                existingFile.write(row)
                existingFile.close()
                seen.add(v)
            else:
                print("Empty")

What will cause a lot of overhead is the constant opening and closing of files, but unless you can hold all the lines in memory there is no simple way to get around it.
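One middle ground (a sketch of my own, not from the original answer, and assuming the number of distinct names is modest) is to cache the open file handles in a dict so each output file is opened only once:

```python
import os

def split_by_country(infile, outdir):
    """Append each row to bby_<field1>.dat, reusing open handles."""
    handles = {}  # field value -> open file handle, reused across rows
    try:
        for row in infile:
            v = row.split("|", 2)[1]
            if not v:
                continue
            f = handles.get(v)
            if f is None:
                f = open(os.path.join(outdir, "bby_" + v + ".dat"), "a")
                handles[v] = f
            f.write(row)
    finally:
        for f in handles.values():
            f.close()
```

With very many distinct values this can hit the OS limit on open file descriptors, in which case you would need to cap the dict (e.g. evict least-recently-used handles).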

As far as reading goes, on a file with ten million lines just splitting twice outperforms the csv reader:

In [15]: with open("in.txt") as f:
   ....:     print(sum(1 for _ in f))
   ....: 
10000000

In [16]: paste
def main(argv):
    with open(argv, 'r') as infile:
        for row in infile:
            v = row.split("|", 2)[1]
            if v:
                pass
## -- End pasted text --

In [17]: paste
def main_r(argv):
    with open(argv, 'r') as infile:
        r = csv.reader(infile, delimiter="|")
        for row in r:
            if row[1]:
                pass

## -- End pasted text --

In [18]: timeit main("in.txt")
1 loops, best of 3: 3.85 s per loop

In [19]: timeit main_r("in.txt")
1 loops, best of 3: 6.62 s per loop
answered Dec 06 '25 by Padraic Cunningham


