This is my sample data
78|Indonesia|Pamela|Reid|[email protected]|147.3.67.193
I want to get the result as
Indonesia
Currently I am splitting the string and accessing the value by index, but I want to use a regex for it.
Some conditions to be aware of:
The data may be empty
The data will NOT contain a pipe (|)
I want to use a regex instead of split because I think regexes are more efficient. I want this to be as efficient as possible because the source file is 70 GB.
EDIT:
This is the whole code in which I will be using this
def main(argv):
    mylist = set()
    input_file = open("test.txt", 'r')
    for row in input_file:
        rowsplit = row.split("|")
        if rowsplit[1] != '':
            if rowsplit[1] in mylist:
                filename = "bby_" + rowsplit[1] + ".dat"
                existingFile = open(filename, 'a')
                existingFile.write(row)
                existingFile.close()
            else:
                mylist.add(rowsplit[1])
                filename = "bby_" + rowsplit[1] + ".dat"
                newFile = open(filename, 'a')
                newFile.write(row)
                newFile.close()
        else:
            print "Empty"
    print mylist
I'm just confused about which of the answers I should use now :(
I just want this code to be fast. That's it.
Here is the performance of the meaningful answers on Python 3.4.3:
In [4]: timeit.timeit('s.split("|", 2)[1]', 's = "78|Indonesia|Pamela|Reid|[email protected]|147.3.67.193"')
Out[4]: 0.43930888699833304
In [10]: timeit.timeit('re.search(r"^[^a-zA-Z]*([a-zA-Z]+)", s).group(1)', 's = "78|Indonesia|Pamela|Reid|[email protected]|147.3.67.193"; import re')
Out[10]: 1.234878903022036
In [16]: timeit.timeit('re.search("^\d*\|(\w+?)?\|", s).group(1)', 's = "78|Indonesia|Pamela|Reid|[email protected]|147.3.67.193"; import re')
Out[16]: 1.8305770770530216
If there are no pipes:
In [24]: timeit.timeit('s.split("|", 2)[1] if "|" in s else None', 's = "78|Indonesia|Pamela|Reid|[email protected]|147.3.67.193"')
Out[24]: 0.494665392965544
In [25]: timeit.timeit('s.split("|", 2)[1] if "|" in s else None', 's = ""')
Out[25]: 0.04492994397878647
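To make the behaviour of the split expression timed above concrete, here is a minimal sketch that wraps it in a helper. The name second_field and the shortened sample rows are assumptions made for illustration; they do not come from the original answers.

```python
def second_field(line):
    """Return the second pipe-delimited field of line, or None if line has no pipe.

    Hypothetical helper wrapping the expression from the timings:
        s.split("|", 2)[1] if "|" in s else None
    """
    if "|" not in line:
        return None
    # maxsplit=2 stops after the second pipe, so the rest of the line is
    # left unsplit in the last element.
    return line.split("|", 2)[1]


if __name__ == "__main__":
    print(second_field("78|Indonesia|Pamela|Reid"))  # second field is present
    print(repr(second_field("79||Pamela|Reid")))     # second field is empty
    print(second_field(""))                          # no pipe at all
```

An empty second field comes back as an empty string, while a line without any pipe gives None, so the caller can tell the two cases apart.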
Splitting and checking the length may still be faster than a regex:
for line in f:
    spl = line.split("|", 2)
    if len(spl) > 2:
        print(spl[1])
    ....
Some timings on matching and non-matching lines:
In [24]: s = "78|Indonesia|Pamela|Reid|[email protected]|147.3.67.193"
In [25]: %%timeit
spl = s.split("|",2)
if len(spl) > 2:
    pass
....:
1000000 loops, best of 3: 413 ns per loop
In [26]: r = re.compile(r'(?<=\|)[^|]*')
In [27]: timeit r.search(s)
1000000 loops, best of 3: 452 ns per loop
In [28]: s = "78 Indonesia Pamela Reid [email protected] 147.3.67.193"
In [29]: timeit r.search(s)
1000000 loops, best of 3: 1.66 µs per loop
In [30]: %%timeit
spl = s.split("|",2)
if len(spl) > 2:
    pass
....:
1000000 loops, best of 3: 342 ns per loop
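The two approaches timed above can be put side by side as a rough sketch. The function names field_by_regex and field_by_split and the constant FIELD_RE are hypothetical; the pattern and the split call are the ones from the timings.

```python
import re

# Same pattern as in the timings: the run of non-pipe characters that
# immediately follows the first pipe in the line.
FIELD_RE = re.compile(r'(?<=\|)[^|]*')


def field_by_regex(line):
    """Return the text after the first pipe, or None if the line has no pipe."""
    m = FIELD_RE.search(line)
    return m.group(0) if m else None


def field_by_split(line):
    """Return the second field, or None if the line has fewer than two pipes."""
    spl = line.split("|", 2)
    return spl[1] if len(spl) > 2 else None


if __name__ == "__main__":
    for sample in ("78|Indonesia|Pamela|Reid", "78||Pamela", "78 Indonesia Pamela"):
        print(repr(field_by_regex(sample)), repr(field_by_split(sample)))
```

The two helpers agree on lines with at least two pipes and on lines with none. They differ on a line with exactly one pipe: the regex still returns the text after it, while the length check rejects the line.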
You can shave a bit more off by creating a local reference to str.split:
_spl = str.split
for line in f:
    spl = _spl(line, "|", 2)
    if len(spl) > 2:
        .....
Since there is always the same number of pipes in each line, you can also use the csv module:
import csv

def main(argv):
    seen = set()  # only use if you actually need a set of all names
    with open("test.txt", 'r') as infile:
        r = csv.reader(infile, delimiter="|")
        for row in r:
            v = row[1]
            if v:
                filename = "bby_" + v + ".dat"
                existingFile = open(filename, 'a')
                # csv.reader yields a list of fields, so rejoin them before writing
                existingFile.write("|".join(row) + "\n")
                existingFile.close()
                seen.add(v)
            else:
                print("Empty")
The if/else seems redundant, as you are appending to the file regardless. If you want to keep a set of the row[1] values for another reason, you can just add to the set each time; unless you actually need a set of all the names, I would remove it from the code.
Applying the same logic to split:
def main(argv):
    seen = set()
    with open("test.txt", 'r') as infile:
        _spl = str.split
        for row in infile:
            v = _spl(row, "|", 2)[1]
            if v:
                filename = "bby_" + v + ".dat"
                existingFile = open(filename, 'a')
                existingFile.write(row)
                existingFile.close()
                seen.add(v)
            else:
                print("Empty")
What will cause a lot of overhead is constantly opening and writing to the output files, but unless you can store all the lines in memory there is no simple way to get around it.
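One partial mitigation, which is not part of the answer above, is to keep each output file open after its first use instead of reopening it for every row. The sketch below assumes the number of distinct second-field values is small enough to stay under the operating system's limit on open files; the function name split_by_second_field and its prefix and suffix parameters are hypothetical.

```python
def split_by_second_field(in_path, prefix="bby_", suffix=".dat"):
    """Append each row of in_path to a file named after its second field.

    Assumption: few enough distinct values that one open handle per value
    is acceptable. Returns the set of values seen.
    """
    handles = {}  # second-field value -> open file object
    try:
        with open(in_path, "r") as infile:
            for row in infile:
                parts = row.split("|", 2)
                # Skip rows with fewer than two pipes or an empty second
                # field (the original code prints "Empty" for the latter).
                if len(parts) < 3 or not parts[1]:
                    continue
                v = parts[1]
                out = handles.get(v)
                if out is None:
                    # First time this value is seen: open its file once.
                    out = handles[v] = open(prefix + v + suffix, "a")
                out.write(row)
    finally:
        # Close every handle, even if reading or writing failed part-way.
        for out in handles.values():
            out.close()
    return set(handles)
```

Whether this helps in practice depends on the data; it would need to be timed on the real file before relying on it.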
As far as reading goes, on a file with ten million lines just splitting twice outperforms the csv reader:
In [15]: with open("in.txt") as f:
....: print(sum(1 for _ in f))
....:
10000000
In [16]: paste
def main(argv):
    with open(argv, 'r') as infile:
        for row in infile:
            v = row.split("|", 2)[1]
            if v:
                pass
## -- End pasted text --
In [17]: paste
def main_r(argv):
    with open(argv, 'r') as infile:
        r = csv.reader(infile, delimiter="|")
        for row in r:
            if row[1]:
                pass
## -- End pasted text --
In [18]: timeit main("in.txt")
1 loops, best of 3: 3.85 s per loop
In [19]: timeit main_r("in.txt")
1 loops, best of 3: 6.62 s per loop