I'm new to Python and Pandas - please be gentle!
I'm using SQLAlchemy with pymssql to execute a SQL query against a SQL Server database and then convert the result set into a dataframe. I'm then attempting to write that dataframe out as a Parquet file:
engine = sal.create_engine(connectionString)
conn = engine.connect()
df = pd.read_sql(query, con=conn)
df.to_parquet(outputFile)
The data I'm retrieving in the SQL query includes a uniqueidentifier column (i.e. a UUID) named rowguid. Because of this, I'm getting the following error on the last line above:
pyarrow.lib.ArrowInvalid: ("Could not convert UUID('92c4279f-1207-48a3-8448-4636514eb7e2') with type UUID: did not recognize Python value type when inferring an Arrow data type", 'Conversion failed for column rowguid with type object')
Is there any way I can force all UUIDs to strings at any point in the above chain of events?
A few extra notes:
- I can do df['rowguid'] = df['rowguid'].astype(str), but that relies on me knowing which columns have uniqueidentifier types. By the time it's a dataframe, everything is an object, and each query will be different.
- I could cast the column to char(36) in the SQL query itself; however, I was hoping to do something more "automatic" so the person writing the query doesn't trip over this problem all the time / doesn't have to remember to always convert the datatype (both workarounds are sketched below for reference).
Any ideas?
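For reference, the two manual workarounds above look roughly like this (rowguid is just the column from my example; every query will involve different columns):
# Workaround 1: cast the known UUID column after loading the dataframe
df['rowguid'] = df['rowguid'].astype(str)
df.to_parquet(outputFile)

# Workaround 2: cast in the query itself so pandas only ever sees text, e.g.
# SELECT CAST(rowguid AS char(36)) AS rowguid, ... FROM ...
Both rely on whoever writes the query remembering to do the conversion, which is exactly what I'd like to avoid.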
Try DuckDB
Despite the 'DB' in its name, DuckDB is an in-process analytical database you can use as an ordinary Python package; it's designed for data analysis tasks rather than as a full client/server database replacement. It handles data type conversions like this one out of the box, without the custom extensions you might otherwise need with libraries like PyArrow.
import sqlalchemy as sal
import pandas as pd
import duckdb
# Define your SQL variables
connection_string = "your_connection_string_here"
query = "your_query_here"
# Connect to your SQL database using SQLAlchemy
engine = sal.create_engine(connection_string)
conn = engine.connect()
# Run your query and load the results into a DataFrame
df = pd.read_sql(query, con=conn)
# Close the SQL database connection
conn.close()
# Solution: hand the DataFrame to DuckDB and let it write the Parquet file.
# Requires the duckdb package to be installed and imported (as above).
output_file_path = "your_output_file_path_here"
# Connect to DuckDB in-memory
duck_conn = duckdb.connect(':memory:')
# Write the DataFrame to a snappy-compressed Parquet file with DuckDB.
# DuckDB resolves the name "df" inside the SQL text to the in-scope
# pandas DataFrame (its replacement scan feature), UUID column included.
duck_conn.query(f"COPY df TO '{output_file_path}' (FORMAT PARQUET)")
# Close the DuckDB connection
duck_conn.close()
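If you want a quick sanity check, read the file back and inspect the column types (exactly how the former uniqueidentifier column surfaces, as a plain string or a UUID-flavoured type, depends on your duckdb/pyarrow/pandas versions):
# Optional sanity check: read the Parquet file back with pandas
check_df = pd.read_parquet(output_file_path)
print(check_df.dtypes)
print(check_df.head())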
Additional Notes for Consideration:
It's worth mentioning that PyArrow can be extended with a custom UUIDType, as detailed in the Arrow documentation. In my experience, though, this can lead to further data type issues, particularly with complex Postgres schemas, and it creates a significant maintenance burden. Using DuckDB's native type handling for these conversions avoids those complications, which is why I recommend it as the more straightforward and reliable solution.
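For completeness, the extension-type route looks roughly like this. It's a minimal sketch based on the UuidType example in the Arrow documentation; note that you still have to build the 16-byte storage arrays from the UUID values yourself, which is where the maintenance burden tends to creep in:
import uuid
import pyarrow as pa

class UuidType(pa.ExtensionType):
    # UUIDs stored as 16 raw bytes, tagged with a custom extension name
    def __init__(self):
        super().__init__(pa.binary(16), "example.uuid")

    def __arrow_ext_serialize__(self):
        # Nothing to serialize for this parameter-free type
        return b""

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return UuidType()

pa.register_extension_type(UuidType())

# The conversion from Python UUID objects is still manual:
values = [uuid.uuid4() for _ in range(3)]
storage = pa.array([u.bytes for u in values], pa.binary(16))
uuid_array = pa.ExtensionArray.from_storage(UuidType(), storage)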