 

Find corrupt data in xlsx file

Tags: perl, xlsx, corrupt

We generate xlsx files with a Perl script. The files usually contain thousands of records, which makes spotting errors very difficult. The process has worked for years without problems.
This week we were asked to check a file that contains errors. On opening it, Excel reported that the file contains errors and asked whether we wanted to repair them.
We do not actually want to recover the data; we want to know which part of the file is corrupt. The error presumably comes from corrupt data, and we want to identify that data. The recovery log shows the following:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<recoveryLog xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
    <logFileName>error068200_01.xml</logFileName>
    <summary>Errors were detected in file 'D:\Temp\20161020\file_name.xlsx'</summary>
    <repairedRecords summary="Following is a list of repairs:">
        <repairedRecord>Repaired Records: Cell information from /xl/worksheets/sheet1.xml part</repairedRecord>
    </repairedRecords>
</recoveryLog>

The error presumably comes from corrupt data. Is there a tool or method that helps to locate it? I tried renaming the file to .zip, extracting it, and opening the parts in an XML editor, but could not find any errors in the XML.
We also checked that the structure of the individual XML files is fine.

Thank you and best regards

Arthur bauer asked Oct 19 '25 22:10



1 Answer


As expected, the problem came from text cells containing number-like values with an E in the middle.
I used the following steps to identify the erroneous cells:
1. I wrote a small Java class to read the file. The class checked each cell's type and then displayed its value.
The Java program threw an exception at one line, "Cannot get a numeric value from a text cell", even though I was checking the cell type before displaying the content.
2. I checked that line in the file opened in Excel and found that the cell contained only 'inf'.
3. I opened the file in OpenOffice and looked at the same cells. They contained 0.
4. I debugged the program that generates the data and found that these cells hold values like '914E5514'. It seems the E was interpreted by Excel as an exponent.
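The same search can be done directly on the worksheet XML, without a spreadsheet library. The following is a minimal Python sketch, not the Java class described above. It assumes the bad cells are stored as numeric cells (no `t` attribute, or `t="n"`) whose `<v>` content does not parse to a finite number, which is a guess based on the 'inf' seen in step 2. The function name `find_suspect_cells` and its parameters are made up for illustration.

```python
import math
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by SpreadsheetML worksheet parts.
NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}


def find_suspect_cells(xlsx_path, sheet_part="xl/worksheets/sheet1.xml"):
    """Return (cell_ref, raw_value) pairs for numeric cells that are not finite numbers.

    Hypothetical helper: it assumes the corruption shows up as a numeric
    cell whose <v> text is 'inf', overflows, or is not a number at all.
    """
    with zipfile.ZipFile(xlsx_path) as archive:
        root = ET.fromstring(archive.read(sheet_part))

    suspects = []
    for cell in root.iterfind(".//m:c", NS):
        # Cells without a "t" attribute (or with t="n") are stored as numbers.
        if cell.get("t", "n") != "n":
            continue
        value = cell.find("m:v", NS)
        if value is None or value.text is None:
            continue
        try:
            is_ok = math.isfinite(float(value.text))
        except ValueError:
            is_ok = False
        if not is_ok:
            suspects.append((cell.get("r"), value.text))
    return suspects
```

Calling `find_suspect_cells("file_name.xlsx")` returns a list of cell references such as `B17` together with the raw stored text, which can then be matched against the source records.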
We changed the program to use the format '@' (text) for that cell, and this solved the issue.
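To catch such values before they reach the writer, the generator could test each value and force text output for anything that only looks numeric. The sketch below is in Python for illustration, not the Perl that was actually changed. The helper `should_write_as_text` and its regular expression are assumptions about what counts as "number-like with an E", not part of any library.

```python
import math
import re

# Digits, optional fraction, then an exponent: the shape that may be read
# as scientific notation (for example '914E5514').
_SCI_LIKE = re.compile(r"^[+-]?\d+(\.\d+)?[eE][+-]?\d+$")


def should_write_as_text(value: str) -> bool:
    """Hypothetical guard: True if the value should be stored as text, not as a number."""
    s = value.strip()
    if _SCI_LIKE.match(s):
        # Treated as an identifier, since it would otherwise become an exponent.
        return True
    try:
        # 'inf', 'nan' and overflowing literals are not usable as cell numbers.
        return not math.isfinite(float(s))
    except ValueError:
        # Not parseable as a number at all, so it is text anyway.
        return True
```

With this rule, `should_write_as_text("914E5514")` is true while `should_write_as_text("12.5")` is false. It is deliberately strict: a genuine value such as `1e3` would also be written as text, which may or may not suit a given data set.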
Thank you.

Arthur bauer answered Oct 24 '25 12:10



