I'm experimenting with pandas on the IMDb files, specifically title.basics.tsv. When I try to parse the runtimeMinutes column as "Int64", I get an error:
ValueError: Unable to parse string "Reality-TV" at position 47993
However, neither line 47994 nor the lines directly around it contain the string Reality-TV. So I started deleting entries from the beginning of the data file, and indeed the reported position went down, until I had deleted exactly 47994 entries, at which point the error became:
ValueError: Unable to parse string "Reality-TV" at position 65535
This made me suspect that the position variable is a uint16 that overflows. Is there a way to deal with this kind of problem and find the actual line that is causing trouble?
Here is the command I used:
titles = pd.read_csv("title.basics.tsv",
                     sep="\t",
                     dtype={
                         "runtimeMinutes": "Int64",
                     },
                     na_values={
                         "runtimeMinutes": ["\\N"],
                     })
I looked at your data, and while analysing the "runtimeMinutes" column I found that it contains str values, which are causing the error. The picture shows a list of these str values.
Code to search for the error values:
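import pandas as pd

# Load without forcing a dtype so that parsing cannot fail.
titles = pd.read_csv("title.basics.tsv",
                     sep="\t",
                     na_values={
                         "runtimeMinutes": ["\\N"],
                     })


def search_error_values(df, column):
    """Print and return the unique values of `column` that int() cannot convert."""
    error_value = []
    print(f"{'Type':20} | {'Value'}")
    print('-' * 53)
    for val in df[column].unique():
        if pd.isna(val):
            continue  # missing values (the "\N" markers) are not errors
        try:
            int(val)
        except (ValueError, TypeError):
            print(f"{str(type(val)):20} | {val}")
            error_value.append(val)
    print("\nIncorrect values:", error_value)
    return error_value


values_error = search_error_values(titles, "runtimeMinutes")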
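Since the question also asks how to find the actual line that is causing trouble, here is a minimal sketch of one way to do that. It is not part of the original answer: it reads the column as plain strings and uses pd.to_numeric with errors="coerce" to flag the rows that fail to convert. The sample rows and the names sample_tsv, bad_rows and bad_line_numbers are made up for illustration, and the line-number arithmetic assumes one record per physical line with a single header line.

```python
import io

import pandas as pd

# Tiny stand-in for title.basics.tsv (made-up rows, not real IMDb data).
sample_tsv = (
    "tconst\truntimeMinutes\n"
    "tt0000001\t45\n"
    "tt0000002\t\\N\n"
    "tt0000003\tReality-TV\n"
    "tt0000004\t90\n"
)

# Read the column as plain strings so nothing fails during parsing.
df = pd.read_csv(io.StringIO(sample_tsv),
                 sep="\t",
                 dtype={"runtimeMinutes": str},
                 na_values={"runtimeMinutes": ["\\N"]})

# Coerce to numbers; anything that cannot be parsed becomes NaN.
as_number = pd.to_numeric(df["runtimeMinutes"], errors="coerce")

# Rows that had a value but could not be converted are the troublemakers.
bad_rows = df[as_number.isna() & df["runtimeMinutes"].notna()]
print(bad_rows)

# Assumption: one record per line and one header line, so the 0-based
# row index plus 2 is the 1-based line number in the file.
bad_line_numbers = (bad_rows.index + 2).tolist()
print(bad_line_numbers)
```

On the real file you would pass "title.basics.tsv" instead of the io.StringIO object; the rest should stay the same.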
I suggest the following solution. It takes longer to load the data, but the slow load happens only once if you then save the properly processed DataFrame and reuse it.
Solution code:
values_error.append("\\N")
titles = pd.read_csv("title.basics.tsv",
                     sep="\t",
                     dtype={
                         "runtimeMinutes": "Int64",
                     },
                     na_values={
                         "runtimeMinutes": values_error,
                     })
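To illustrate the "save it once and reuse it" idea, here is a small sketch that uses pickle as the storage format. The answer does not say which format to use, so pickle is only an assumption, and the DataFrame contents and the cache path below are hypothetical stand-ins for the real cleaned data.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the cleaned DataFrame (made-up rows).
titles_clean = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002"],
    "runtimeMinutes": pd.array([45, None], dtype="Int64"),
})

# Hypothetical cache location; any writable path works.
cache_path = os.path.join(tempfile.mkdtemp(), "titles_clean.pkl")

# Save once after the slow load ...
titles_clean.to_pickle(cache_path)

# ... and reload quickly in later sessions; the Int64 dtype is kept.
reloaded = pd.read_pickle(cache_path)
print(reloaded.dtypes)
```

Pickle files should only be loaded from sources you trust, so this is best suited to a local cache that you created yourself.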