
What to do when the pandas error position overflows?

I'm experimenting with pandas on the IMDB files, in particular title.basics.tsv. When I try to parse the runtimeMinutes column as "Int64", I get an error:

ValueError: Unable to parse string "Reality-TV" at position 47993

However, neither line 47994 nor the lines directly around it contain the string Reality-TV. So I started deleting entries from the beginning of the data file, and indeed the reported position went down, until I had deleted exactly 47994 entries, at which point the error became

ValueError: Unable to parse string "Reality-TV" at position 65535

This made me suspect that the position variable is a uint16 that overflows. Is there a way to deal with this kind of problem and find the line that is actually causing the trouble?


Here is the command I used:

import pandas as pd

titles = pd.read_csv("title.basics.tsv",
                     sep="\t",
                     dtype={
                         "runtimeMinutes": "Int64",
                     },
                     na_values={
                         "runtimeMinutes": ["\\N"],
                     })
red_trumpet asked Aug 31 '25 03:08


1 Answer

I looked at your data, and while analyzing the "runtimeMinutes" column I found that it contains str values, which are what cause the error. The picture shows a list of these str values.

[Image: Incorrect error]

Code to search for the erroneous values:

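import pandas as pd

# Load without forcing a dtype so that unparsable values survive as strings.
titles = pd.read_csv("title.basics.tsv",
                     sep="\t",
                     na_values={
                         "runtimeMinutes": ["\\N"],
                     })

def search_error_values(df, column):
    """Print and return the unique values in `column` that int() cannot parse."""
    error_value = []

    print(f"{'Type':20} | {'Value'}")
    print('-'*53)
    # dropna() skips the NaN entries produced by na_values; int(NaN) would raise too.
    for val in df[column].dropna().unique():
        try:
            int(val)
        except (ValueError, TypeError):
            print(f"{str(type(val)):20} | {val}")
            error_value.append(val)

    print("\nIncorrect values:", error_value)
    return error_value

values_error = search_error_values(titles, "runtimeMinutes")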
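
If you also want to know which rows hold those values, and not only the values themselves, a vectorized check may help. This is only a sketch: the tiny in-memory sample stands in for title.basics.tsv, and find_bad_rows is a helper name I made up for illustration. It reads the column as str and keeps the rows that are not missing but that pd.to_numeric cannot convert.

```python
import io
import pandas as pd

# Hypothetical four-row stand-in for title.basics.tsv.
sample = "tconst\truntimeMinutes\ntt1\t45\ntt2\t\\N\ntt3\tReality-TV\ntt4\t90\n"

def find_bad_rows(df, column):
    """Return the rows whose value in `column` is present but not numeric."""
    raw = df[column]
    parsed = pd.to_numeric(raw, errors="coerce")  # unparsable values become NaN
    return df[raw.notna() & parsed.isna()]

titles_raw = pd.read_csv(io.StringIO(sample),
                         sep="\t",
                         dtype={"runtimeMinutes": str},
                         na_values={"runtimeMinutes": ["\\N"]})

bad = find_bad_rows(titles_raw, "runtimeMinutes")
print(bad)
```

The index of bad is the row position in the DataFrame. Mapping it back to a line in the file (index + 2, counting the header) only works if no field spans more than one line, which I have not verified for the IMDB data.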

I suggest the following solution. Loading the data will take longer, but you only pay that cost once if you then save the properly processed DataFrame and reuse it.

Solution code:

values_error.append("\\N")

titles = pd.read_csv("title.basics.tsv",
                     sep="\t",
                     dtype={
                         "runtimeMinutes": "Int64",
                     },
                     na_values={
                         "runtimeMinutes": values_error,
                     })
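
To pay the slow load only once, you can cache the cleaned DataFrame. A minimal sketch, assuming pickle is an acceptable format for you: the sample data, the bad_values list (standing in for values_error) and the cache path are all made up for illustration.

```python
import io
import os
import tempfile
import pandas as pd

# Hypothetical three-row stand-in for title.basics.tsv.
sample = "tconst\truntimeMinutes\ntt1\t45\ntt2\t\\N\ntt3\tReality-TV\n"
bad_values = ["\\N", "Reality-TV"]  # plays the role of values_error

titles = pd.read_csv(io.StringIO(sample),
                     sep="\t",
                     dtype={"runtimeMinutes": "Int64"},
                     na_values={"runtimeMinutes": bad_values})

# Save once, then reload from the cache in later sessions.
cache_path = os.path.join(tempfile.mkdtemp(), "titles.pkl")
titles.to_pickle(cache_path)
restored = pd.read_pickle(cache_path)

print(restored.dtypes)
```

Pickle keeps the nullable Int64 dtype, so the reloaded frame needs no further parsing. Only load pickle files that you created yourself.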
Sindik answered Sep 02 '25 18:09