I have a large file containing nearly 250 million characters. I want to split it into parts of 30 million characters each (so the first 8 parts will contain 30 million characters and the last part will contain 10 million). Additionally, I want to include the last 1000 characters of each part at the beginning of the next part (i.e., part 1's last 1000 characters are prepended to part 2, so part 2 contains 30 million + 1000 characters, and so on). Can anybody help me do this programmatically (in Java) or with Linux commands (in a fast way)?
One way is to use standard Unix commands to split the file and then prepend the last 1000 bytes of each part to the next.
First split the file:
split -b 30000000 inputfile part.
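To see how split names and sizes the pieces, here is a scaled-down sketch: 100 bytes stand in for the 250-million-character file and 30-byte chunks stand in for 30 million, so the sizes and file names here are illustrative assumptions, not the real ones.

```shell
#!/bin/sh
# Scaled-down stand-in: 100 bytes instead of 250 million characters.
head -c 100 /dev/urandom > inputfile

# Split into 30-byte chunks; with the default two-letter suffix this
# produces part.aa part.ab part.ac part.ad
split -b 30 inputfile part.

# Sizes follow the same "8 x 30M + 10M" pattern, scaled down: 30 30 30 10
wc -c part.*
```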
Then, for each part (ignoring the first), make a new file starting with the last 1000 bytes of the previous one:
unset prev
for i in part.*
do
  if [ -n "${prev}" ]
  then
    tail -c 1000 "${prev}" > part.temp
    cat "${i}" >> part.temp
    mv part.temp "${i}"
  fi
  prev=${i}
done
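As a sanity check, the loop can be exercised on a scaled-down example (100 bytes and a 5-byte overlap standing in for the real sizes; these numbers are assumptions for illustration). Note that taking the tail of the already-modified previous part is still correct, because prepending bytes to a file does not change its tail:

```shell
#!/bin/sh
# Scaled-down example: 100-byte input, 30-byte parts, 5-byte overlap.
head -c 100 /dev/urandom > inputfile
split -b 30 inputfile part.

# Prepend the previous part's last 5 bytes to each part except the first.
unset prev
for i in part.*
do
  if [ -n "${prev}" ]
  then
    tail -c 5 "${prev}" > part.temp
    cat "${i}" >> part.temp
    mv part.temp "${i}"
  fi
  prev=${i}
done

# Every part except the first is now 5 bytes longer: 30 35 35 15
wc -c part.*
```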
Before reassembling, we iterate over the files again, ignoring the first, and throw away the first 1000 bytes of each:
unset prev
for i in part.*
do
  if [ -n "${prev}" ]
  then
    tail -c +1001 "${i}" > part.temp
    mv part.temp "${i}"
  fi
  prev=${i}
done
The last step is to reassemble the files (note > rather than >>, so an existing newfile is overwritten instead of appended to):
cat part.* > newfile
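Putting the three steps together, the whole round trip can be verified with cmp on a scaled-down example (a 100-byte file and a 5-byte overlap standing in for the real sizes; these numbers are assumptions for illustration):

```shell
#!/bin/sh
# Scaled-down round trip: 100 bytes and a 5-byte overlap
# stand in for 250M characters and 1000 bytes.
head -c 100 /dev/urandom > inputfile
split -b 30 inputfile part.

# Add the overlap: prepend the previous part's tail to each part but the first.
unset prev
for i in part.*
do
  if [ -n "${prev}" ]
  then
    tail -c 5 "${prev}" > part.temp
    cat "${i}" >> part.temp
    mv part.temp "${i}"
  fi
  prev=${i}
done

# Strip the overlap again: drop the first 5 bytes of each part but the first.
unset prev
for i in part.*
do
  if [ -n "${prev}" ]
  then
    tail -c +6 "${i}" > part.temp
    mv part.temp "${i}"
  fi
  prev=${i}
done

# Reassemble and verify byte-for-byte equality with the original.
cat part.* > newfile
cmp inputfile newfile && echo "round trip OK"
```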
Since there was no explanation of why the overlap was needed, I just created it and then threw it away.