Extract header from the first commented line in NumPy via numpy.genfromtxt

My environment:

OS: Windows 11
Python version: 3.13.2
NumPy version: 2.1.3

According to the NumPy Fundamentals guide describing how to use the numpy.genfromtxt function:

The optional argument comments is used to define a character string that marks the beginning of a comment. By default, genfromtxt assumes comments='#'. The comment marker may occur anywhere on the line. Any character present after the comment marker(s) is simply ignored.

Note: There is one notable exception to this behavior: if the optional argument names=True, the first commented line will be examined for names.

To test the behavior described in the note above, I created the following data file, with the header line written as a commented line:

C:\tmp\data.txt

#firstName|LastName
Anthony|Quinn
Harry|POTTER
George|WASHINGTON

I then wrote the following program to read and print the content of the file:

import numpy as np

with open("C:/tmp/data.txt", "r", encoding="UTF-8") as fd:
    result = np.genfromtxt(fd,
                           comments="#",
                           delimiter="|",
                           dtype=str,
                           names=True,
                           skip_header=0)
    print(f"result = {result}")

But the result is not what I expected:

result = [('', '') ('', '') ('', '')]

I cannot figure out where the error in my code is, and I don't understand why the content of my data file, in particular its header line after the comment indicator #, is not interpreted correctly.

I'd appreciate it if you could clarify this.

user17911 asked Sep 14 '25 12:09


1 Answer

The magic happens in this line in genfromtxt:

rows = np.array(data, dtype=[('', _) for _ in dtype_flat])

The inputs are

data = [('Anthony', 'Quinn'), ('Harry', 'POTTER'), ('George', 'WASHINGTON')]
dtype_flat = [dtype('<U'), dtype('<U')]

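Note that dtype('<U') is a string type with no length, so every value gets truncated to an empty string. This is not too surprising, since you have variable-length strings and NumPy is designed for homogeneous data types. You would expect a couple of workarounds to be available, but only one seems to work.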
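To see the truncation in isolation, here is a minimal sketch that repeats the quoted construction outside of genfromtxt. It assumes that dtype=str reaches that line as np.dtype(str), which is what the dtype_flat value above suggests; the variable names are only for illustration.

```python
import numpy as np

# Rows as genfromtxt has parsed them from the file in the question
data = [("Anthony", "Quinn"), ("Harry", "POTTER"), ("George", "WASHINGTON")]

# Assumption: dtype=str arrives here as a Unicode dtype without a length
zero_width = np.dtype(str)
dtype_flat = [zero_width, zero_width]

# Same construction as the line quoted from genfromtxt
rows = np.array(data, dtype=[("", dt) for dt in dtype_flat])

print(dtype_flat)
print(rows)
```

If the assumption holds, every field in rows should come out as an empty string, matching the output in the question.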

If you set dtype=object, you get

result = [(b'Anthony', b'Quinn') (b'Harry', b'POTTER') (b'George', b'WASHINGTON')]

You would also expect that specifying a string size explicitly, instead of dtype=str, should work. However, using something like dtype='<U10' does not work and produces the same empty result as before.

There appears to be an issue open for this, or at least a similar issue: https://github.com/numpy/numpy/issues/9644
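Another option that may be worth trying is dtype=None, which asks genfromtxt to infer a type for each column instead of applying one fixed string type. The sketch below is not taken from the linked issue; it uses io.StringIO as a stand-in for C:/tmp/data.txt so that it runs without the file, and passes encoding explicitly so the behavior does not depend on the NumPy version's default.

```python
import io

import numpy as np

# In-memory stand-in for the data file from the question
text = (
    "#firstName|LastName\n"
    "Anthony|Quinn\n"
    "Harry|POTTER\n"
    "George|WASHINGTON\n"
)

result = np.genfromtxt(
    io.StringIO(text),
    comments="#",
    delimiter="|",
    dtype=None,        # infer each column's type, including string width
    names=True,        # take field names from the first (commented) line
    encoding="utf-8",
)

print(result.dtype.names)
print(result["firstName"])
print(result["LastName"])
```

With this variant the field names should be taken from the commented header line, and each column can be accessed by name.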

Mad Physicist answered Sep 17 '25 03:09
