What to do when the pandas error position overflows?

Question

I'm experimenting with pandas on the IMDB data files, in particular title.basics.tsv. When I try to parse the runtimeMinutes column as "Int64", I get an error:

ValueError: Unable to parse string "Reality-TV" at position 47993

However, neither line 47994 nor the lines directly around it contain the string Reality-TV. So I started deleting entries from the beginning of the data file, and the reported position did go down, until I had deleted exactly 47994 entries, at which point the error became

ValueError: Unable to parse string "Reality-TV" at position 65535

This made me suspect that the position variable is a uint16 that overflows. Is there a way to deal with this kind of problem and find the actual line that is causing trouble?

Here is the command I used:

titles = pd.read_csv("title.basics.tsv",
                     sep="\t",
                     dtype={
                         "runtimeMinutes": "Int64",
                     },
                     na_values={
                         "runtimeMinutes": ["\\N"],
                     })

Sindik · Accepted Answer

I looked at your data and, while analysing the "runtimeMinutes" column, found that it contains str values that cannot be converted to integers. These are what cause the error. The picture shows a list of these str values.

(Image: "Incorrect error")

Code to search for the incorrect values:

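import pandas as pd

# Load without forcing a dtype so the bad values can be inspected
titles = pd.read_csv("title.basics.tsv",
                     sep="\t",
                     na_values={
                         "runtimeMinutes": ["\\N"],
                     })

def search_error_values(df, column):
    """Print and return the unique values of `column` that int() cannot parse."""
    error_value = []

    print(f"{'Type':20} | {'Value'}")
    print('-'*53)
    # dropna() skips values already recognised as missing
    for val in df[column].dropna().unique():
        try:
            int(val)
        except (ValueError, TypeError):
            print(f"{str(type(val)):20} | {val}")
            error_value.append(val)

    print("\nIncorrect values:", error_value)
    return error_value

values_error = search_error_values(titles, "runtimeMinutes")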
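
The function above lists which values fail, but not where they are. As a rough sketch of how the offending rows could be located, assuming the column was loaded without a forced dtype: pd.to_numeric with errors="coerce" turns unparseable entries into NaN, so rows that had a value before and are NaN afterwards are the problem rows. The helper name find_unparseable_rows and the small sample frame are made up for illustration and stand in for the real file.

```python
import pandas as pd

def find_unparseable_rows(df, column):
    """Return the rows of df whose value in `column` is present but not numeric."""
    # errors="coerce" replaces every unparseable value with NaN instead of raising
    as_numbers = pd.to_numeric(df[column], errors="coerce")
    # A row is an offender if it had a value that became NaN after coercion
    bad_mask = df[column].notna() & as_numbers.isna()
    return df[bad_mask]

# Hypothetical stand-in for title.basics.tsv
sample = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002", "tt0000003", "tt0000004"],
    "runtimeMinutes": ["1", None, "Reality-TV", "45"],
})

bad_rows = find_unparseable_rows(sample, "runtimeMinutes")
print(bad_rows)
```

The index of the returned rows is the 0-based row position in the DataFrame. For a file with a single header line and no skipped rows or multi-line fields, that would correspond to file line index + 2.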

I suggest the following solution. It takes longer to load the data, because the file is read twice, but you only pay that cost once if you then save the properly processed DataFrame and work from the saved copy.

Code of the solution:

# Treat the original missing-value marker as NA too
values_error.append("\\N")

titles = pd.read_csv("title.basics.tsv",
                     sep="\t",
                     dtype={
                         "runtimeMinutes": "Int64",
                     },
                     na_values={
                         "runtimeMinutes": values_error,
                     })
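
Saving the processed DataFrame could look roughly like the sketch below. It uses to_pickle and read_pickle, which keep the "Int64" dtype on reload. The file name titles_clean.pkl, the temporary directory, and the small sample frame are assumptions for illustration; in practice you would save the titles DataFrame loaded above to a path of your choice.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the cleaned `titles` DataFrame
titles_clean = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002", "tt0000003"],
    "runtimeMinutes": pd.array([1, None, 45], dtype="Int64"),
})

# Write the cleaned data once (a temporary directory is used here for the demo)
cache_path = os.path.join(tempfile.mkdtemp(), "titles_clean.pkl")
titles_clean.to_pickle(cache_path)

# Later sessions can load the cached copy instead of re-parsing the TSV
reloaded = pd.read_pickle(cache_path)
print(reloaded.dtypes)
```

Pickle files are only safe to load when they come from a source you trust, so this suits a local cache rather than data exchange.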

Tags:

python

pandas

integer-overflow

red_trumpet
