I have a huge file (6 GB) with 74,000 articles in this format:
<text id="1">
bla bla bla bla.........
</text>
<text id="2">
bla bla bla bla.........
</text>
<text id="3">
bla bla bla bla.........
</text>
<text id="............ and so on until 74,000
Then I have another file with the title corresponding to each id, like this:
1 title1
2 title2
3 title3
...
74000 title74000
I have to add the corresponding title to each id in the first file, so I transformed the second file into this script:
sed -i 's/<text id="1">/<text id="1" title="title1">/' file1
sed -i 's/<text id="2">/<text id="2" title="title2">/' file1
sed -i 's/<text id="3">/<text id="3" title="title3">/' file1
...
sed -i 's/<text id="74000">/<text id="74000" title="title74000">/' file1
Notice that I didn't put the g flag at the end of the sed commands because the search is not global: at the first match it changes the string and goes on to the next search. The script works, but because the file is so large each change takes 12 minutes, which adds up to about two years for all the changes, and I need them ASAP. Does anybody know how I can perform these changes faster, maybe with some other utility, Python, Perl, or anything else?
With GNU Awk version 4, you could try:
gawk4 -f a.awk file2 RS="^$" file1
where a.awk is:
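# First file (file2): map each opening tag to its title ($2)
NR==FNR {
    b["<text id=\""$1"\">"]=$2
    next
}
# Second file (file1): read as a single record because RS="^$"
{
    # a[] holds the text between tags, s[] the matched opening tags
    n=split($0,a,/<text id=[^>]*>/,s)
    printf "%s%s",s[0],a[1]
    for (i=1; i<n; i++) {
        # insert the title attribute before the closing ">"
        ind=index(s[i],">")
        printf "%s%s", substr(s[i],1,ind-1) " title=\""b[s[i]]"\">", a[i+1]
    }
    printf "%s",s[n]
}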
Output:
<text id="1" title="title1">
bla bla bla bla.........
</text>
<text id="2" title="title2">
bla bla bla bla.........
</text>
<text id="3" title="title3">
bla bla bla bla.........
</text>
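Since the question also asks about Python, here is a minimal sketch of the same single-pass idea: load the id/title pairs into a dictionary, then stream the big file line by line so it is never held in memory all at once. The names load_titles, add_titles and main, the script name add_titles.py and the separate output file are my own choices for illustration, and the sketch assumes both files are UTF-8, that every opening tag looks exactly like <text id="N">, and that titles contain no double quotes.

```python
import re
import sys

# Matches an opening tag such as <text id="42"> and captures the id.
TAG = re.compile(r'<text id="([^"]*)">')


def load_titles(path):
    """Read 'id title' lines into a dict; the title is everything after the first space."""
    titles = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.rstrip("\n")
            if not line:
                continue
            ident, _, title = line.partition(" ")
            titles[ident] = title
    return titles


def add_titles(lines, titles):
    """Yield each line, adding title="..." to opening tags whose id is known."""
    def repl(match):
        title = titles.get(match.group(1))
        if title is None:
            return match.group(0)  # unknown id: leave the tag untouched
        # Assumption: the title needs no escaping (no double quotes inside it).
        return '<text id="{}" title="{}">'.format(match.group(1), title)

    for line in lines:
        yield TAG.sub(repl, line)


def main(title_path, xml_path, out_path):
    titles = load_titles(title_path)
    with open(xml_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        # One pass over the big file, written to a new file rather than in place.
        dst.writelines(add_titles(src, titles))


if __name__ == "__main__" and len(sys.argv) == 4:
    # Hypothetical usage: python add_titles.py file2 file1 file1.new
    main(*sys.argv[1:4])
```

Unlike the sed script, which rereads the whole file once per title, this reads the file a single time and does one dictionary lookup per opening tag. It also keeps the whole title rather than only its first word.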
Update
Just for fun, I tested some of the solutions here on a 3.9 MB XML file (80000 titles) and a 1.3 MB info file (also 80000 titles).
(Scripts for generating the input files can be found here: http://pastebin.com/PpTPt0gk )
Update 2
To get more reliable timing results I took an average over 20 runs: