Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using machine learning to parse excel file and extract table data with no named tables involved? If so, how do I get started?

I have read several articles on machine learning.

An example is this https://medium.com/technology-invention-and-more/how-to-build-a-simple-neural-network-in-9-lines-of-python-code-cc8f23647ca1 and was wondering if it's possible to modify the code to do the use case below.

Let's say the machine is given an excel file (assuming .xlsx) that has only 1 sheet.

Inside the sheet there will be a table (defined as having a header row and at least 2 data rows and will be agreed as a table by common sense humans who read the file. Not a named table or object as defined by MS Excel)

The following is unknown before hand:

  • the position of the table
  • the number of rows and columns
  • the data type in the table

This table in the excel file is also NOT named or identified beforehand in any way. There are no objects in this excel file. No named tables.

If I were to attempt to use VBA to list objects, I will get this error message

enter image description here

There will be at least 1 other cell outside of this table that will contain some text or value.

How or what algorithm using what software library can I train a machine to identify where does the table start and end?

E.g.

The table first cell (identified as the top left most cell in the header row) is A4 and the table last cell (identified as the bottom right most cell) is G12

The machine is able to read the excel file and say that the table is at A4:G12 even if there are other cells outside the table that has data.

My various google search involving "excel and machine learning" or "identify tables in excel" tends to give me articles on how to feed data to machine learning software using excel files or how to find/name data tables that are named in the Excel. I want to emphasize the table data is NOT named.

Also come across answers talking about extracting such table data from PDFs. But my focus is on Excel.

I have read that machine learning can be used to read images. So I imagine trying to "read" an excel file which is a highly structured file just to find where a table is should be possible.

Here are some sample screenshots

enter image description here

enter image description here

Here are the links to the Excel files

https://www.dropbox.com/s/l3vjjsgunp0zu23/A4toG12.xlsx?dl=0

https://www.dropbox.com/s/nwzw0211ruhwvf0/G7toN19.xlsx?dl=0

Please note that these are dummy files to illustrate my point about identifying the table data.

  • The first one should be identified as A4 to G12
  • The second one should be identified as G7 to N19

The comment by JanLauGe about ctrl + a is interesting See Using machine learning to parse excel file and extract table data with no named tables involved? If so, how do I get started?

I have googled for the equivalent in VBA but to no avail.

What I have tried

I wrote a script as suggested by Scott Craner to find all the cells with values I get this.

Thanks to his comments, I got it to work.

Public Sub LookForCells()
    For Each block In ActiveSheet.Columns("A:Z").SpecialCells(xlCellTypeConstants, 23).Areas
       MsgBox block.Count
    Next block
End Sub

And you should see the following

enter image description here

like image 925
Kim Stacks Avatar asked Oct 21 '25 00:10

Kim Stacks


2 Answers

I realize this is an old question, but found it while considering this topic and going through the exercise of extracting tabular data from Excel files. We also concluded that ML wasn't the right answer, at least for now, but that a rules-based approach was better.

The rules that we landed on were to identify table "corners" that had either a blank cell or a sheet boundary above and to the left, that were part of at least a 2x2 matrix of densely or sparsely populated data.

The second thing that helped was to extract greedily, and rely on filtering to narrow down the data we were interested in.

If you are still interested in this topic, I released a Python tool on on GitHub and PyPI (called "eparse") that crawls a filesystem for Excel spreadsheets, parses tabular data from them, and injects the parsed tabular data into a database. I also wrote an article about it.

like image 67
Chris Pappalardo Avatar answered Oct 23 '25 15:10

Chris Pappalardo


I think for your particular use case you will be much better off with more conventional approaches, e.g.:

  • rule-based approaches (largest area of connected cells)
  • VBA script as provided by @Scott Craner in the comment above

However, to answer your question: To apply the most conventional flavours of machine learning (supervised learning) to your problem you will need labelled training data. This would consist of a number of excel spreadsheets for which the table is known and saved, ideally as the index of cells that are part of the table.

An alternative would be to use a pre-trained model if you can find one, or more sophisticated methods such as reinforcement learning (but not sure how this would work in this case)

like image 26
JanLauGe Avatar answered Oct 23 '25 16:10

JanLauGe