How to extract text from MS office documents in C#

Question

I am trying to extract text (as a string) from MS Word (.doc, .docx), Excel and PowerPoint files using C#. Where can I find a free and simple .NET library to read MS Office documents? I tried NPOI, but I couldn't find a sample showing how to use it.

adrianbanks · Accepted Answer · 2009-06-18 08:28:28Z

Using P/Invoke, you can use the IFilter interface (on Windows). The IFilters for many common file types are installed with Windows (you can browse them using this tool). You can just ask the IFilter to return the text from the file. There are several sets of example code (here is one such example).

answered Jun 18 '09 at 8:28

Interesting... a very sneaky solution :) – Skurmedel Jun 18 '09 at 9:05

Not really. It's the mechanism used by the indexing service on Windows and I think the desktop search also uses it. I've used it to index pdfs (by installing the Adobe IFilter - adobe.com/support/downloads/detail.jsp?ftpID=2611), all types of Office documents (the IFilters for these come installed with Windows) and several other file types. When it works, it works well. Occasionally though, you get no text back from the IFilter, and no reason as to why. – adrianbanks Jun 18 '09 at 11:03

I used P/Invoke and found it excellent. To extract text from any document, all we have to do is make sure the appropriate IFilter is installed on the machine (or download and install it). I love this article and sample from CodeProject: codeproject.com/KB/cs/IFilter.aspx. For MS Office 2007, here is the MS Office 2007 filter pack: microsoft.com/downloads/… – Elias Haileselassie Jun 19 '09 at 8:25

Does this solution work on PDF docs as well? – Nick Feb 22 '10 at 16:40

Yes, as long as you install the PDF iFilter. You can do this by installing Acrobat Reader (the iFilter gets installed with it), or by installing the iFilter separately (adobe.com/support/downloads/detail.jsp?ftpID=4025). [Note: other PDF iFilters are available :)] – adrianbanks Feb 22 '10 at 17:15

KyleM · Answer 2 · 2011-12-28 18:27:09Z

For Microsoft Word 2007 and Microsoft Word 2010 (.docx) files you can use the Open XML SDK. This snippet of code will open a document and return its contents as text. It is especially useful for anyone trying to use regular expressions to parse the contents of a Word document. To use this solution you need to reference DocumentFormat.OpenXml.dll, which is part of the Open XML SDK.

See: http://msdn.microsoft.com/en-us/library/bb448854.aspx

    // Needs references to DocumentFormat.OpenXml.dll and, for SPFile, Microsoft.SharePoint.dll.
    using System;
    using System.IO;
    using System.Text;
    using System.Xml;
    using DocumentFormat.OpenXml.Packaging;
    using Microsoft.SharePoint;

    // SPFile is a SharePoint type; outside SharePoint, pass any readable Stream
    // to WordprocessingDocument.Open(Stream, bool) instead.
    public static string TextFromWord(SPFile file)
    {
        const string wordmlNamespace = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

        StringBuilder textBuilder = new StringBuilder();
        using (WordprocessingDocument wdDoc = WordprocessingDocument.Open(file.OpenBinaryStream(), false))
        {
            // Manage namespaces to perform XPath queries.
            NameTable nt = new NameTable();
            XmlNamespaceManager nsManager = new XmlNamespaceManager(nt);
            nsManager.AddNamespace("w", wordmlNamespace);

            // Load the XML of the main document part into an XmlDocument instance.
            XmlDocument xdoc = new XmlDocument(nt);
            using (Stream partStream = wdDoc.MainDocumentPart.GetStream())
            {
                xdoc.Load(partStream);
            }

            // Each w:p is a paragraph; its text lives in the w:t nodes below it.
            XmlNodeList paragraphNodes = xdoc.SelectNodes("//w:p", nsManager);
            foreach (XmlNode paragraphNode in paragraphNodes)
            {
                XmlNodeList textNodes = paragraphNode.SelectNodes(".//w:t", nsManager);
                foreach (XmlNode textNode in textNodes)
                {
                    textBuilder.Append(textNode.InnerText);
                }
                // End every paragraph with a line break.
                textBuilder.Append(Environment.NewLine);
            }
        }
        return textBuilder.ToString();
    }

@adrianbanks I feel that this answer is currently better than the accepted answer because the accepted answer will not work on certain versions of Windows and because IFilter is a deprecated interface. Of course at the time adrian's post was written that was not the case.

joshcomley · Answer 3 · 2009-06-18 07:38:03Z

Simple!

These two steps will get you there:

1) Use the Office Interop library to convert DOC to DOCX
2) Use DOCX2TXT to extract the text from the new DOCX

The link for 1) has a very good explanation of how to do the conversion and even a code sample.

An alternative to 2) is to just unzip the DOCX file in C# and scan for the files you need. You can read about the structure of the ZIP file here.

Edit: Ah yes, I forgot to point out as Skurmedel did below that you must have Office installed on the system on which you want to do the conversion.

Only sad part with the Office interop library is that you need to have Office installed.

Skurmedel · Answer 4 · 2009-06-18 10:24:30Z

I wrote a docx text extractor once, and it was very simple. Basically a docx file (and, I presume, the other new formats) is a zip file with a bunch of XML files inside. The text can be extracted with an XmlReader, using only .NET classes.

I don't have the code anymore, it seems :(, but I found a guy who has a similar solution.

Maybe this isn't viable for you if you need to read .doc and .xls files though, since they are binary formats and probably much harder to parse.

There is also the OpenXML SDK, still in CTP though, released by Microsoft.
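A minimal sketch of the zip-plus-XmlReader approach described above, not the lost original code. It assumes .NET 4.5 or later (for `System.IO.Compression.ZipArchive`) and that the main document part sits at the conventional path `word/document.xml`; the class name `DocxText` and the method `Extract` are made up for illustration.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;
using System.Xml;

public static class DocxText
{
    // WordprocessingML main namespace (the same one used in the Open XML SDK answer).
    private const string W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the text of every w:t run, with a line break after each w:p paragraph.
    public static string Extract(Stream docx)
    {
        var text = new StringBuilder();
        using (var zip = new ZipArchive(docx, ZipArchiveMode.Read, leaveOpen: true))
        {
            // Assumption: the main part uses the conventional name Word writes.
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found; is this a .docx file?");

            using (Stream part = entry.Open())
            using (XmlReader reader = XmlReader.Create(part))
            {
                bool inText = false; // true while the reader is inside a w:t element
                while (reader.Read())
                {
                    bool isWordNode = reader.NamespaceURI == W;
                    switch (reader.NodeType)
                    {
                        case XmlNodeType.Element:
                            if (isWordNode && reader.LocalName == "t" && !reader.IsEmptyElement)
                                inText = true;
                            else if (isWordNode && reader.LocalName == "p" && reader.IsEmptyElement)
                                text.AppendLine(); // empty paragraph: <w:p/>
                            break;
                        case XmlNodeType.Text:
                        case XmlNodeType.Whitespace:
                        case XmlNodeType.SignificantWhitespace:
                            if (inText) text.Append(reader.Value);
                            break;
                        case XmlNodeType.EndElement:
                            if (isWordNode && reader.LocalName == "t") inText = false;
                            else if (isWordNode && reader.LocalName == "p") text.AppendLine();
                            break;
                    }
                }
            }
        }
        return text.ToString();
    }
}
```

Usage would be along the lines of `using (var fs = File.OpenRead(path)) { string text = DocxText.Extract(fs); }`. Headers, footers, footnotes and comments live in other parts of the package and are not read by this sketch.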

edited Jun 18 '09 at 10:24

answered Jun 18 '09 at 7:25

This is really great! I am done with docx, but what about the rest? – Elias Haileselassie Jun 18 '09 at 9:22

You can "connect" to an xlsx file as if it were a database with ODBC, I think. A quite cumbersome solution, though. I have no idea how to read .doc or .xls files, so I can't help you there. Here is a reference for .xls files though: sc.openoffice.org/excelfileformat.pdf – Skurmedel Jun 18 '09 at 10:32

Sadly, I couldn't find anything better on XLSX than the specification itself: ecma-international.org/publications/files/ECMA-ST/… – Skurmedel Jun 18 '09 at 10:37
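For "the rest" of the new formats, the same unzip-and-read-XML idea should carry over. The sketch below is an illustration under assumptions, not something posted in the thread: it assumes .NET 4.5 or later, the conventional part names `xl/sharedStrings.xml` and `ppt/slides/slideN.xml`, and the made-up names `OoxmlText`, `XlsxSharedStrings` and `PptxSlideText`. It does not help with the binary .xls and .ppt formats.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

public static class OoxmlText
{
    private static readonly XNamespace Sheet = "http://schemas.openxmlformats.org/spreadsheetml/2006/main";
    private static readonly XNamespace Drawing = "http://schemas.openxmlformats.org/drawingml/2006/main";

    // One string per entry in the workbook's shared string table.
    // Numbers, dates and inline strings are stored in the sheet parts and are not covered here.
    public static List<string> XlsxSharedStrings(Stream xlsx)
    {
        using (var zip = new ZipArchive(xlsx, ZipArchiveMode.Read, leaveOpen: true))
        {
            ZipArchiveEntry entry = zip.GetEntry("xl/sharedStrings.xml");
            if (entry == null) return new List<string>(); // no shared strings part

            using (Stream part = entry.Open())
            {
                return XDocument.Load(part).Root
                    .Elements(Sheet + "si")
                    // Plain strings keep their text in si/t, rich text in si/r/t.
                    .Select(si => string.Concat(
                        si.Elements(Sheet + "t")
                          .Concat(si.Elements(Sheet + "r").Elements(Sheet + "t"))
                          .Select(t => t.Value)))
                    .ToList();
            }
        }
    }

    // One string per slide part, with paragraphs separated by line breaks.
    // Slides are ordered by part name, which may differ from the presentation order.
    public static List<string> PptxSlideText(Stream pptx)
    {
        var slides = new List<string>();
        using (var zip = new ZipArchive(pptx, ZipArchiveMode.Read, leaveOpen: true))
        {
            var entries = zip.Entries
                .Where(e => e.FullName.StartsWith("ppt/slides/slide", StringComparison.OrdinalIgnoreCase)
                         && e.FullName.EndsWith(".xml", StringComparison.OrdinalIgnoreCase))
                .OrderBy(e => e.FullName.Length) // so slide2 sorts before slide10
                .ThenBy(e => e.FullName, StringComparer.OrdinalIgnoreCase);

            foreach (ZipArchiveEntry entry in entries)
            {
                using (Stream part = entry.Open())
                {
                    // Each a:p is a paragraph; its text is in the a:t elements below it.
                    var paragraphs = XDocument.Load(part)
                        .Descendants(Drawing + "p")
                        .Select(p => string.Concat(p.Descendants(Drawing + "t").Select(t => t.Value)));
                    slides.Add(string.Join(Environment.NewLine, paragraphs));
                }
            }
        }
        return slides;
    }
}
```

Both methods take a stream, so a file can be passed with `File.OpenRead(path)`. A more robust version would follow the package relationships instead of relying on conventional part names.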

imanabidi · Answer 5 · 2012-07-08 15:54:41Z

Read Document Text Directly from Microsoft Word File

http://www.codeproject.com/KB/cs/getwordtext.aspx

edited Jul 8 '12 at 15:54

answered Apr 12 '11 at 9:22
