
How to extract bullet information from word document?

I want to extract the bullet information from a Word document. For example, suppose the Word document contains the following text:

Steps to Start car :

  • Open door
  • Sit inside
  • Close the door
  • Insert key
  • etc.

Then I want my text file to look like this:

Steps to Start car :

<BULET> Open door </BULET>

<BULET> Sit inside </BULET>

<BULET> Close the door </BULET>

<BULET> Insert key </BULET>

<BULET> etc.</BULET>

I am using C# to do this.

I can extract paragraphs from the Word document and write them directly to a text file with some formatting information, such as whether the text is bold or italic, but I don't know how to extract the bullet information.

Can anyone please tell me how to do this?

Thanks in advance

Shekhar asked Dec 03 '25 17:12


2 Answers

You can do it by iterating over the paragraphs of the document. Each Paragraph exposes a Range object (the same type that doc.Sentences contains), and the ListFormat property of that Range carries the list information.

        foreach (Paragraph para in oDoc.Paragraphs)
        {
            string paraNumber = para.Range.ListFormat.ListLevelNumber.ToString();
            string bulletStr = para.Range.ListFormat.ListString;
            MessageBox.Show(paraNumber + "\t" + bulletStr + "\t" + para.Range.Text);
        }

paraNumber holds the list level of the paragraph, and bulletStr holds the bullet (or number) as a string.
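
To get from that loop to the tagged text the question asks for, something like the following might work. It is only a sketch: BulletTagger and Tag are hypothetical names, not part of the Word object model, and it assumes that a non-empty ListString means the paragraph is a list item. Since numbered lists also produce a non-empty ListString, you may additionally want to check that ListFormat.ListType equals WdListType.wdListBullet.

```csharp
using System;

public static class BulletTagger
{
    // Hypothetical helper (an assumption for illustration, not a Word API).
    // listString: the value of para.Range.ListFormat.ListString
    // rawText:    the value of para.Range.Text
    public static string Tag(string listString, string rawText)
    {
        // Range.Text of a paragraph ends with the paragraph mark ('\r');
        // table cells end with '\a'. Strip those before tagging.
        string text = (rawText ?? string.Empty).TrimEnd('\r', '\n', '\a').Trim();

        // Assumption: an empty ListString means "not a list item".
        bool isListItem = !string.IsNullOrEmpty(listString);

        return isListItem ? "<BULET> " + text + " </BULET>" : text;
    }
}
```

Inside the foreach loop above you would call Tag(para.Range.ListFormat.ListString, para.Range.Text) and write the result to your text file instead of showing a message box.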

Barun answered Dec 06 '25 12:12


I am using the OpenXmlPowerTools library by Eric White. It is free and available as a NuGet package, which you can install from the Visual Studio package manager.

He provides a ready-to-use code snippet, and this tool has saved me many hours. Below is how I customized that snippet for my requirement. In fact, you can use these methods as-is in your project.

 private static WordprocessingDocument _wordDocument;
 private StringBuilder textItemSB = new StringBuilder();
 private List<string> textItemList = new List<string>();


/// <summary>
/// Opens the Word document using the Open XML SDK and reads all content from the document body.
/// The collected text ends up in textItemList.
/// </summary>
private void GetDocumentBodyContents()
{
    string modifiedString = string.Empty;
    List<string> allList = new List<string>();
    List<string> allListText = new List<string>();

    try
    {
        _wordDocument = WordprocessingDocument.Open(wordFileStream, false);
        //RevisionAccepter.AcceptRevisions(_wordDocument);
        XElement root = _wordDocument.MainDocumentPart.GetXDocument().Root;
        XElement body = root.LogicalChildrenContent().First();
        OutputBlockLevelContent(_wordDocument, body);
    }
    catch (Exception ex)
    {
        logger.Error("ERROR in GetDocumentBodyContents:" + ex.Message.ToString());
    }
}


// This method is recursive. For each paragraph it retrieves the list item marker and the text.
// Once you have both, you can manipulate them and build your own collection.
private void OutputBlockLevelContent(WordprocessingDocument wordDoc, XElement blockLevelContentContainer)
{
    try
    {
        string listItem = string.Empty, itemText = string.Empty, numberText = string.Empty;
        foreach (XElement blockLevelContentElement in
            blockLevelContentContainer.LogicalChildrenContent())
        {
            if (blockLevelContentElement.Name == W.p)
            {
                listItem = ListItemRetriever.RetrieveListItem(wordDoc, blockLevelContentElement);
                itemText = blockLevelContentElement
                    .LogicalChildrenContent(W.r)
                    .LogicalChildrenContent(W.t)
                    .Select(t => (string)t)
                    .StringConcatenate();
                if (itemText.Trim().Length > 0)
                {
                    if (null == listItem)
                    {
                        // Add html break tag 
                        textItemSB.Append( itemText + "<br/>");
                    }
                    else
                    {
                        //if listItem == "" bullet character, replace it with equivalent html encoded character                                   
                        textItemSB.Append("&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;" + (listItem == "" ? "&bull;" : listItem) + "&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;" + itemText + "<br/>");
                    }
                }
                else if (null != listItem)
                {
                    //If bullet character is found, replace it with equivalent html encoded character  
                    textItemSB.Append(listItem == "" ? "&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&bull;" : listItem);
                }
                else
                    textItemSB.Append("<blank>");
                continue;
            }
            // If element is not a paragraph, it must be a table.

            foreach (var row in blockLevelContentElement.LogicalChildrenContent())
            {                        
                foreach (var cell in row.LogicalChildrenContent())
                {                            
                    // Cells are a block-level content container, so can call this method recursively.
                    OutputBlockLevelContent(wordDoc, cell);
                }
            }
        }
        if (textItemSB.Length > 0)
        {
            textItemList.Add(textItemSB.ToString());
            textItemSB.Clear();
        }
    }
    catch (Exception ex)
    {
        .....
    }
}
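
The code above emits HTML (&bull; and <br/>). If you want the plain tagged text from the question instead, a sketch along these lines might do. TaggedTextWriter and Build are hypothetical names, and the sketch assumes you collect each paragraph as a pair of list marker and text, where the marker is null for a paragraph that is not a list item (mirroring the null check on listItem above).

```csharp
using System.Collections.Generic;
using System.Text;

public static class TaggedTextWriter
{
    // Hypothetical helper (an assumption for illustration).
    // Key:   the list marker, or null when the paragraph is not a list item
    // Value: the paragraph text
    public static string Build(IEnumerable<KeyValuePair<string, string>> items)
    {
        var sb = new StringBuilder();
        foreach (var item in items)
        {
            string text = (item.Value ?? string.Empty).Trim();
            if (text.Length == 0)
            {
                continue; // skip empty paragraphs
            }

            // Any non-null marker (bullet or number) is treated as a list item.
            sb.AppendLine(item.Key != null ? "<BULET> " + text + " </BULET>" : text);
        }
        return sb.ToString();
    }
}
```

In OutputBlockLevelContent you would add new KeyValuePair<string, string>(listItem, itemText) to a list instead of appending HTML to textItemSB, then pass that list to Build and write the result to your text file.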

kumar chandraketu answered Dec 06 '25 10:12


