The file names are dynamic and I need to extract the file extension. The file names look like this:
parallels-workstation-parallels-en_US-6.0.13976.769982.run.sh
20090209.02s1.1_sequence.txt
SRR002321.fastq.bz2
hello.tar.gz
ok.txt
For 20090209.02s1.1_sequence.txt I want to extract txt, for SRR002321.fastq.bz2 I want to extract fastq.bz2, and for hello.tar.gz I want to extract tar.gz.
I am using the os module to get the file extension:
import os.path
extension = os.path.splitext('hello.tar.gz')[1][1:]
This gives me only gz. That is fine for a file name like ok.txt, but for this one I want the extension to be tar.gz.
import os
def splitext(path):
    for ext in ['.tar.gz', '.tar.bz2']:
        if path.endswith(ext):
            return path[:-len(ext)], path[-len(ext):]
    return os.path.splitext(path)
assert splitext('20090209.02s1.1_sequence.txt')[1] == '.txt'
assert splitext('SRR002321.fastq.bz2')[1] == '.bz2'
assert splitext('hello.tar.gz')[1] == '.tar.gz'
assert splitext('ok.txt')[1] == '.txt'
To drop the leading dot from the returned extension:
import os
def splitext(path):
    for ext in ['.tar.gz', '.tar.bz2']:
        if path.endswith(ext):
            path, ext = path[:-len(ext)], path[-len(ext):]
            break
    else:
        path, ext = os.path.splitext(path)
    return path, ext[1:]
assert splitext('20090209.02s1.1_sequence.txt')[1] == 'txt'
assert splitext('SRR002321.fastq.bz2')[1] == 'bz2'
assert splitext('hello.tar.gz')[1] == 'tar.gz'
assert splitext('ok.txt')[1] == 'txt'
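As a rough illustration of why the function above checks a fixed list of compound extensions instead of simply taking everything after the first dot, here is a minimal sketch. The helper name after_first_dot is hypothetical and exists only for this comparison.

```python
def after_first_dot(name):
    """Naive approach: treat everything after the first dot as the extension."""
    # str.partition returns (head, sep, tail); tail is '' when there is no dot.
    return name.partition(".")[2]


for name in ["hello.tar.gz", "SRR002321.fastq.bz2", "20090209.02s1.1_sequence.txt"]:
    print(name, "->", after_first_dot(name))
```

The first two names come out as hoped (tar.gz and fastq.bz2), but the third yields 02s1.1_sequence.txt, because a dot in the base name cannot be told apart from a dot in the extension without extra knowledge.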
Your rules are arbitrary: how is the computer supposed to guess when it's OK for the extension to have a . in it?
At best you'll have to keep a set of exceptional extensions, e.g. {'.bz2', '.gz'}, and add some extra logic yourself:
>>> paths = """20090209.02s1.1_sequence.txt
... SRR002321.fastq.bz2
... hello.tar.gz
... ok.txt""".splitlines()
>>> import os
>>> def my_split_ext(path):
...     name, ext = os.path.splitext(path)
...     if ext in {'.bz2', '.gz'}:
...         name, ext2 = os.path.splitext(name)
...         ext = ext2 + ext
...     return name, ext
... 
>>> list(map(my_split_ext, paths))  # list() is needed on Python 3, where map() is lazy
[('20090209.02s1.1_sequence', '.txt'), ('SRR002321', '.fastq.bz2'), ('hello', '.tar.gz'), ('ok', '.txt')]
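The same idea can be sketched with pathlib, returning the extension without its leading dot as the question asks. This is only one possible variant: the function name compound_extension and the constant COMPRESSION_SUFFIXES are hypothetical, and the set contains just the two suffixes mentioned above, so it would need extending for other formats.

```python
from pathlib import PurePath

# Assumption: only these suffixes are treated as "wrappers" around another extension.
COMPRESSION_SUFFIXES = {".bz2", ".gz"}


def compound_extension(filename):
    """Return the extension without a leading dot, e.g. 'tar.gz' or 'txt'."""
    suffixes = PurePath(filename).suffixes  # e.g. ['.tar', '.gz']
    if not suffixes:
        return ""
    ext = suffixes[-1]
    # Keep one more suffix only when the last one is a known compression suffix.
    if ext in COMPRESSION_SUFFIXES and len(suffixes) > 1:
        ext = suffixes[-2] + ext
    return ext[1:]


for name in ["20090209.02s1.1_sequence.txt", "SRR002321.fastq.bz2", "hello.tar.gz", "ok.txt"]:
    print(name, "->", compound_extension(name))
```

Because only the last suffix is inspected, the extra dots inside 20090209.02s1.1_sequence.txt do not leak into the result.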