Article Author: Marcus Peters
Introduction
To understand how the IFilter interface is used in Microsoft’s indexing infrastructure to extract the content of various file formats, I would like to take you back to the time when Windows NT could be extended with IIS via the "Option Pack". This little trip back in history explains why Microsoft created the interface.
Back in those days the IIS team wanted to ship the web server (IIS) with search capabilities, so they developed a search engine called Index Server. A search engine has to extract the content of the files it indexes, a step Microsoft described as "Filtering Documents". See the Related Links section for more information on how a search solution works. Indexing files with a search engine includes gathering the documents, filtering them, applying word breaking and stemming strategies to extract phrases and words, and then storing the information in the index.
The developers at Microsoft designed the process of filtering documents with a pluggable architecture to filter various document formats with the same mechanism. The idea behind this mechanism is that the appropriate authority which knows how to extract the content for a certain file is loaded via a factory. This decouples the filtering process from the filter itself. The caller of such a factory does not need to know anything about the file which is filtered. The caller just calls the factory and gets the suitable filter for the desired file. All the vendor of a file format must do is develop a filter component which matches this interface driven factory pattern.
And that is exactly what the IFilter interface is: a contract which is used by Microsoft’s indexing infrastructure to load the suitable filter for a given file and perform the filtering of that file based on the contract.
This mechanism is still used today in applications that use indexing services, for example Site Server, SQL Server and SharePoint Portal Server. Even on Windows XP you can enable the Indexing Service to improve the "Search for files and folders" functionality. In the future, Windows Vista will keep this infrastructure to implement search capabilities.
This is where you, the application developer, come on stage: The indexing infrastructure is already part of the system and desktop search applications like MSN Desktop Search and LookOut are using it, so why don’t you? This article will show you how to use the IFilter interface to extract the content of a file from managed code.
System Requirements
To run the code for this sample you should have the following:
- Windows XP or Windows 2003
- Internet Information Server 5.x or higher
- The .NET Framework version 1.1
- VS.NET 2003
You should be familiar with setting up ASP.NET projects with Visual Studio.
Installing and Compiling the Sample Code
To install the sample code, extract the download into a directory. Then create a virtual directory in the default application of IIS that points to that directory and name it ASPIFilter (Figure 1 shows the created virtual directory). You should now be able to reach this directory via IIS with the URL http://localhost/ASPIFilter. If you change the name, be sure to update the file ASPIFilter.csproj.webinfo accordingly.

Figure 1. Creating a virtual directory in IIS
The last thing you have to do is to install / register the dsofile.dll component which is part of the sample application by running the registerDSOFile.cmd. This component is developed by Microsoft and can be downloaded for free (see Related Links section) to extract OLE Properties of a file. The dsofile.dll will be explained in detail later in this article.
Diggin’ into the IFilter World
As stated previously, the IFilter interface is intended to be used during the indexing process to extract information from the files that were gathered for indexing. Search engines often consider a document as a set of properties or features. With this in mind, you can store the document's content and its metadata, such as author and title, in such properties.
The indexing strategy of Microsoft follows this idea and so an IFilter considers a document as a set of properties (See Properties of Documents in the Related Links section of this article). The guys from Redmond extended this idea and use document properties for plain text data and binary data.
Textual properties contain information which can be used directly, whereas value-typed properties can contain various types of information, such as embedded objects. This could be an MS Excel spreadsheet, for example, or just other metadata.
In the case of Microsoft Office documents, some metadata is also known as OLE Properties and can be accessed via the standard OLE interfaces. For example the properties author and title of a MS Word document are not stored in a text property. They are stored elsewhere in the file and can be accessed by using OLE interfaces. You won’t need to deal with these interfaces because luckily Microsoft provides a component which encapsulates the retrieval of typical OLE properties in Office Documents. I will explain the use of this free component a little bit later, for now just keep in mind that the content of a file can be stored in the file using the good old COM based structured storage mechanism.
Chunks of Information
To get the content out of a given document, the IFilter provides methods to read the document in chunks, which are small pieces of information. A chunk basically consists of a property, which can be either Text type or Value type, some locale information, the position in the document and the length of the chunk. A typical filter session would be:
- Load the appropriate component that implements IFilter for a given document
- Initialize the filter session
- Read a chunk
- Depending on the chunk's type, read either text or a value
- Go to the next chunk
The table below lists the methods of the IFilter interface that extract the data from documents:
| IFilter method | Description |
|---|---|
| Init() | Initializes a filtering session |
| GetChunk() | Positions filter at the beginning of the first or next chunk and returns a reference to the current chunk |
| GetText() | Retrieves text from the current chunk |
| GetValue() | Retrieves values from the current chunk |
| BindRegion() | Retrieves an interface representing the specified portion of object. Currently not being used. |
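The session steps listed above can be sketched as two nested loops. The following is only an illustration under stated assumptions: `IChunkSource`, `FakeChunkSource` and `FilterSession` are hypothetical names that mimic the GetChunk()/GetText() call pattern with an in-memory stand-in, not the real COM interface, and the signatures are deliberately simplified. The HRESULT constants are the filter return codes as I know them from the Windows SDK headers.

```csharp
using System;
using System.Text;

// Filter return codes (success codes are positive, error codes negative).
public static class FilterCodes
{
    public const int S_OK = 0;
    public const int FILTER_S_LAST_TEXT = 0x00041709;
    public const int FILTER_E_END_OF_CHUNKS = unchecked((int)0x80041700);
    public const int FILTER_E_NO_MORE_TEXT = unchecked((int)0x80041701);
}

// Hypothetical, simplified stand-in for IFilter: same call pattern only.
public interface IChunkSource
{
    int GetChunk(out bool isText);
    int GetText(out string text);
}

// In-memory fake: each inner array is one text chunk delivered in pieces;
// a null entry stands for a value chunk.
public class FakeChunkSource : IChunkSource
{
    private readonly string[][] chunks;
    private int chunkIndex = -1;
    private int partIndex;

    public FakeChunkSource(string[][] chunks) { this.chunks = chunks; }

    public int GetChunk(out bool isText)
    {
        chunkIndex++;
        partIndex = 0;
        if (chunkIndex >= chunks.Length)
        {
            isText = false;
            return FilterCodes.FILTER_E_END_OF_CHUNKS;
        }
        isText = chunks[chunkIndex] != null;
        return FilterCodes.S_OK;
    }

    public int GetText(out string text)
    {
        string[] parts = chunks[chunkIndex];
        if (parts == null || partIndex >= parts.Length)
        {
            text = null;
            return FilterCodes.FILTER_E_NO_MORE_TEXT;
        }
        text = parts[partIndex++];
        // The final piece of a chunk is flagged with a success code, not S_OK.
        return partIndex == parts.Length
            ? FilterCodes.FILTER_S_LAST_TEXT
            : FilterCodes.S_OK;
    }
}

public static class FilterSession
{
    public static string ReadAllText(IChunkSource source)
    {
        StringBuilder result = new StringBuilder();
        bool isText;
        // Outer loop: move from chunk to chunk until the source is exhausted.
        while (source.GetChunk(out isText) == FilterCodes.S_OK)
        {
            if (!isText)
                continue; // a real session would read value chunks via GetValue()

            // Inner loop: a single chunk may deliver its text in several pieces.
            int hr;
            do
            {
                string text;
                hr = source.GetText(out text);
                if (hr >= 0 && text != null)
                    result.Append(text);
            } while (hr == FilterCodes.S_OK);
        }
        return result.ToString();
    }
}
```

For a source built from the chunks `{"Hello, ", "world"}`, a value chunk and `{"!"}`, `FilterSession.ReadAllText` returns `Hello, world!`. A production loop would also have to decide how to treat per-chunk error codes other than the end-of-chunks code; this sketch simply stops at the first one.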
Loading an IFilter
As you read in the previous section, the first thing to do in an IFilter session is to load the appropriate filter component for a given document. Loading this component can be done via Windows API functions. These functions are declared for C and C++, which you may not be familiar with, but you can easily call them from C# thanks to the platform invocation mechanism of the .NET Framework. I will show you how to do this later on. For now, we’ll work with the C++ API function definitions.
The following function loads a component that implements IFilter for a file located at a given path:
STDAPI LoadIFilter(WCHAR const * pwcsPath, IUnknown * pUnkOuter, void ** ppIUnk );
This function loads the component for an object from a given stream:
STDAPI BindIFilterFromStream(IStream * pStm, IUnknown * pUnkOuter, void ** ppIUnk);
Lastly, this function loads an IFilter for a structured storage object, such as an embedded object, which can be accessed through the OLE interfaces:
STDAPI BindIFilterFromStorage(IStorage * pStg, IUnknown * pUnkOuter, void ** ppIUnk);
All of these methods return an IFilter reference that will filter the required object. This could be a path to a file, a stream object or a structured storage OLE object.
Where Can You Use IFilters in Your Application?
You have now learnt about the history and origins of IFilters, and you have some idea of how they are used by the Microsoft indexing system to support big server applications such as SharePoint Portal Server and Exchange Full Text Search. Even the MSN developer guide refers to this system of content extraction. So when would you use Microsoft’s indexing infrastructure in custom applications?
The answer is: wherever you need the content and/or metadata of a file.
Imagine a web application which allows files to be uploaded. It may be necessary to scan the uploaded files for certain words or phrases. It could also be necessary to scan the files for special metadata, such as the author of an MS Word document, the number of slides in an MS PowerPoint presentation or the version of an MS Excel spreadsheet.
Content Extraction with the IFilter Interface from Managed Code
Now it is time to start looking at some code. Throughout the rest of the article I will develop a sample ASP.NET application which displays the content of files which are located in a given directory on the server. This will provide you with an example of how you can use the IFilter interface within your own applications.
IFilter Implementation with C#
As you may recall it is possible to load an IFilter for a given file with the following call.
STDAPI LoadIFilter(WCHAR const * pwcsPath, IUnknown * pUnkOuter, void ** ppIUnk );
The API function requires a path to the desired file, a reference to the standard COM interface IUnknown and it returns the corresponding IFilter reference as an output parameter. If you are not that familiar with COM Programming you should regard COM as a binary standard for software components to work together. The IUnknown interface is a base interface which allows a client to get a reference to the desired type.
The following code snippet shows the managed C# declaration of LoadIFilter().
[DllImport("query.dll", SetLastError=true, CharSet=CharSet.Unicode)]
private static extern int LoadIFilter(string pwcsPath,
[MarshalAs(UnmanagedType.IUnknown)] object pUnkOuter,
ref IFilter ppIUnk);
DllImport is the .NET attribute that performs platform invocation of good old Win32 DLLs. Here it targets query.dll, which exports the LoadIFilter() API function. If you do not supply a path to the DLL, the runtime looks for it in the standard DLL search locations, which include the %windir%\system32 directory. The function outputs a reference to the appropriate IFilter. So far so good, but there is no IFilter type in the .NET Framework and no type library from which Visual Studio could import the type information to build an RCW (Runtime Callable Wrapper). You need to write the RCW for the IFilter interface yourself. The next code snippet shows the RCW for the IFilter interface with the methods I introduced above.
[ComImport]
[Guid("89BCB740-6119-101A-BCB7-00DD010655AF")]
[InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
public interface IFilter
{
    void Init(
        [MarshalAs(UnmanagedType.U4)] IFILTER_INIT grfFlags,
        uint cAttributes,
        [MarshalAs(UnmanagedType.LPArray, SizeParamIndex=1)] FULLPROPSPEC[] aAttributes,
        ref uint pdwFlags);

    [PreserveSig]
    int GetChunk(out STAT_CHUNK pStat);

    [PreserveSig]
    int GetText(
        ref uint pcwcBuffer,
        [MarshalAs(UnmanagedType.LPWStr)] StringBuilder buffer);

    void GetValue(ref UIntPtr ppPropValue);

    void BindRegion(
        [MarshalAs(UnmanagedType.Struct)] FILTERREGION origPos,
        ref Guid riid,
        ref UIntPtr ppunk);
}
The first thing you probably noticed when examining the RCW definition of IFilter is the set of attributes used in the interface definition. Basically, these attributes bridge COM and .NET. The table below lists the attributes used and their meaning.
| Attribute | Meaning |
|---|---|
| ComImport | Marks a type as externally defined in COM. As you may know, COM classes are registered in the system registry. In conjunction with the Guid attribute the runtime knows which interface to use. |
| Guid | Unique ID for a class or interface (GUID stands for Globally Unique ID) |
| InterfaceType | Defines whether a COM interface derives from IUnknown or IDispatch. I already provided a brief explanation of what IUnknown does; IDispatch does something similar but is intended to be used in scripting environments. |
| PreserveSig | Controls whether COM return values (HRESULTs) are translated by the runtime. By default the .NET runtime converts failure HRESULTs into exceptions and discards success codes other than S_OK (the standard ok result in the COM world). Using this attribute we suppress that behavior and receive the original HRESULT, including success codes such as FILTER_S_LAST_TEXT. |
| MarshalAs | Indicates how to marshal the data between managed and unmanaged code |
The second thing that stands out is the occurrence of some structures which you have to implement because they are directly or indirectly used by the IFilter interface. I don’t have the space to show the C# implementations of those structures. You can find them well documented in the sample code. But I would like to give you a brief introduction to what they are for:
- The FULLPROPSPEC structure and its containing PROPSPEC structure are used to transport document properties
- The STAT_CHUNK structure transports a chunk
- The FILTERREGION structure is used by the IFilter::BindRegion() method and is intended for use in future versions
Besides some enumerations, these are the main types to declare in order to implement the IFilter interface RCW with C#. Of course you will find all of those structures and enumerations documented in the sample code. If you are interested in going deeper, please refer to the Related Links section where you will find some links to dig into.
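To give an impression of what such declarations look like, here is a minimal sketch of the chunk-related types. It is not taken from the sample code: the field and member names follow the native definitions in the Windows SDK headers as I know them, and modelling the PROPSPEC union as a single `IntPtr` is my own simplifying assumption. The declarations in the download may well differ.

```csharp
using System;
using System.Runtime.InteropServices;

// Type of the current chunk, as reported in STAT_CHUNK.flags.
[Flags]
public enum CHUNKSTATE
{
    CHUNK_TEXT = 0x1,
    CHUNK_VALUE = 0x2,
    CHUNK_FILTER_OWNED_VALUE = 0x4
}

// Kind of break that separates this chunk from the previous one.
public enum CHUNK_BREAKTYPE
{
    CHUNK_NO_BREAK = 0,
    CHUNK_EOW = 1, // end of word
    CHUNK_EOS = 2, // end of sentence
    CHUNK_EOP = 3, // end of paragraph
    CHUNK_EOC = 4  // end of chapter
}

[StructLayout(LayoutKind.Sequential)]
public struct PROPSPEC
{
    public uint ulKind;  // 0 = property named by string, 1 = property id
    public IntPtr data;  // native union: PROPID or pointer to a wide string
}

[StructLayout(LayoutKind.Sequential)]
public struct FULLPROPSPEC
{
    public Guid guidPropSet;    // property set the property belongs to
    public PROPSPEC psProperty; // the property within that set
}

[StructLayout(LayoutKind.Sequential)]
public struct STAT_CHUNK
{
    public uint idChunk;              // chunk id, unique within the session
    public CHUNK_BREAKTYPE breakType;
    public CHUNKSTATE flags;          // text or value chunk
    public uint locale;               // LCID of the chunk's language
    public FULLPROPSPEC attribute;    // property this chunk belongs to
    public uint idChunkSource;
    public uint cwcStartSource;       // offset in the source text
    public uint cwcLenSource;         // length in the source text
}
```

Using `IntPtr` for the union keeps the layout correct on both 32-bit and 64-bit systems, because the field takes the size and alignment of a native pointer.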
Extracting Metadata by Reading OLE Properties
A file can also store more information than just its content. This information is called metadata and can include items such as the author or the title of the document. If you right-click on a file in Windows Explorer, you can access some of those properties (see Figure 2 and Figure 3 for examples). Accessing these properties yourself is not very easy because you have to deal with a lot of interfaces from the OLE specification. Help comes with DSOFILE.DLL, a library developed by Microsoft which you can use to read and write the OLE properties of a given file. Microsoft provides this library for free; see the Related Links section for the download information.

Figure 2. OLE Properties of a MS Office File

Figure 3. OLE Properties of a simple text file
All you have to do to use this library is add a reference to it in Visual Studio. The library is a COM library, but Visual Studio will create an interop assembly for you. The following code snippet demonstrates the use of DSOFILE.DLL. It is taken from the sample application, which uses this code to extract OLE properties.
/// <summary>
/// Gets OLE properties of a given file
/// </summary>
/// <param name="filename">the file to filter</param>
/// <returns>A StringDictionary with the extracted information</returns>
public static StringDictionary GetOLEProperties(string filename)
{
    StringDictionary sd = new StringDictionary();
    OleDocumentPropertiesClass oleDocument = null;
    try
    {
        oleDocument = new OleDocumentPropertiesClass();
        // The second argument opens the file read-only
        oleDocument.Open(filename, true, dsoFileOpenOptions.dsoOptionOpenReadOnlyIfNoWriteAccess);

        // Read every public summary property in one pass via reflection
        SummaryProperties properties = oleDocument.SummaryProperties;
        Type t = typeof(SummaryProperties);
        foreach (PropertyInfo p in t.GetProperties(BindingFlags.Public | BindingFlags.Instance))
        {
            try
            {
                object val = p.GetValue(properties, null);
                if (val != null)
                    sd.Add(p.Name, val.ToString());
            }
            catch (Exception ex)
            {
                // Just write out the error
                System.Diagnostics.Debug.WriteLine(ex.ToString());
                continue;
            }
        }

        // Custom properties whose names clash with a summary property get a suffix
        CustomProperties customProps = oleDocument.CustomProperties;
        foreach (CustomProperty cp in customProps)
        {
            if (!sd.ContainsKey(cp.Name))
                sd.Add(cp.Name, cp.get_Value().ToString());
            else
                sd.Add(string.Format("{0}_customProperty", cp.Name), cp.get_Value().ToString());
        }
    }
    finally
    {
        // oleDocument is still null if the constructor threw
        if (oleDocument != null)
        {
            oleDocument.Close(false);
            Marshal.ReleaseComObject(oleDocument);
        }
    }
    return sd;
}
In this sample, I just create an instance of the OleDocumentPropertiesClass, open the desired file and extract all the properties. The Open() method allows us to open the file in different ways, and since viewing properties is the main purpose of the sample, it is fine to open it read-only. See the Related Links section for more details on the use of this component.
Of course the guys from Redmond were kind enough to provide named properties for all of the well-known ones in Microsoft Office documents, such as author and title. But I was too lazy to type them all, so I decided to use the .NET reflection mechanism to extract them in one step. The last thing I do here is close the file and release the unmanaged resource by calling Marshal.ReleaseComObject(). This method explicitly controls the lifetime of a COM object used from managed code. You should use it to free an underlying COM object that holds references to resources, such as a file in this case. Although COM itself has no garbage collection, the call plays a role similar to the Dispose() method used in .NET.
The Sample Application
So far, you have seen how to declare the IFilter interface as an RCW and how to get the OLE properties of a file. With this, you are equipped to build a simple sample ASP.NET application which filters all files within a given directory and displays the result of the filter process on a web page. Figure 4 shows a screenshot of the sample application.

Figure 4. Sample Application
The facade for all the functionality is the class Loader, which provides methods to extract the content of a file as a string or to extract the OLE properties and store them in a dictionary which is bound to a DataGrid. Figure 4 shows all of the properties found in a file.
| Method | Meaning |
|---|---|
| GetContent() | Concatenates the value of all Text chunks |
| GetOLEProperties() | Extracts the OLE properties |
| Extract() | Extracts the OLE properties and the text content by calling GetContent() and GetOLEProperties() |
I already introduced the main part of the Loader.GetOLEProperties() method in the previous section Extracting Metadata by Reading OLE Properties. The Loader.GetContent() method loads the IFilter for a given file and reads the text chunks. The Loader.Extract() method just calls the previous methods and stores the result in a StringDictionary.
The following code snippet shows the extraction of the content of a file using the IFilter interface. I separated the loading of the IFilter from the content extraction by implementing a factory method which tries to load the IFilter for a given file and returns either a valid reference or null. In the case where no IFilter could be loaded, the GetContent() method returns an empty string.
/// <summary>
/// Gets the text content filtered of a given file.
/// </summary>
/// <param name="filename">The file to filter</param>
/// <returns>The content or an empty string</returns>
public static string GetContent(string filename)
{
    IFilter filter = null;
    try
    {
        StringBuilder plainTextResult = new StringBuilder();
        filter = loadIFilter(filename);
        // If we have no valid IFilter, just quit
        if (filter == null)
            return string.Empty;
        ...
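The loadIFilter() factory method called above is not reproduced in this article. A minimal sketch of how such a factory could look follows. It rests on assumptions: `FilterFactory` and `TryLoadFilter` are hypothetical names, the output parameter is declared as `out object` instead of the `ref IFilter` form shown earlier so that the block compiles on its own, and the up-front file check is my own addition rather than something LoadIFilter() requires.

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;

public static class FilterFactory
{
    // Variant of the LoadIFilter() declaration that hands the result back
    // as a plain object, so this sketch needs no IFilter declaration.
    [DllImport("query.dll", CharSet = CharSet.Unicode)]
    private static extern int LoadIFilter(
        string pwcsPath,
        [MarshalAs(UnmanagedType.IUnknown)] object pUnkOuter,
        [MarshalAs(UnmanagedType.IUnknown)] out object ppIUnk);

    // Returns the filter object for the file, or null if none could be loaded.
    public static object TryLoadFilter(string filename)
    {
        // Assumption of this sketch: skip the native call for missing files.
        if (filename == null || !File.Exists(filename))
            return null;

        try
        {
            object filter;
            int hr = LoadIFilter(filename, null, out filter);
            // Negative HRESULTs are failures, e.g. no filter is registered.
            return hr < 0 ? null : filter;
        }
        catch (DllNotFoundException)
        {
            return null; // query.dll is not available on this system
        }
        catch (EntryPointNotFoundException)
        {
            return null; // query.dll does not export LoadIFilter
        }
    }
}
```

A caller would cast the returned object to the IFilter RCW declared earlier, which makes the runtime query the COM object for that interface. Returning null instead of throwing keeps the calling code simple, because it can treat "no filter available" as "no content".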
Getting the Content
After initialization, IFilter.GetChunk() stores the first chunk in the STAT_CHUNK structure. I don't want to set any initial flags, so I initialize the session with IFILTER_INIT.NONE.
        ...
        STAT_CHUNK ps = new STAT_CHUNK();
        IFILTER_INIT mFlags = IFILTER_INIT.NONE;
        uint i = 0;
        filter.Init(mFlags, 0, null, ref i);
        int resultChunk = (int)IFILTER_RETURN_CODES.S_OK;
        resultChunk = filter.GetChunk(out ps);
        ...
If the chunk is a text chunk, the text is read into a buffer and then appended to the result in a StringBuilder.
        ...
        if (ps.flags == CHUNKSTATE.CHUNK_TEXT)
        {
            uint sizeBuffer = 65000;
            int resultText = 0;
            // Read all text of the chunk, one buffer-sized piece at a time
            while (resultText == (int)IFILTER_RETURN_CODES.FILTER_S_LAST_TEXT
                || resultText == (int)IFILTER_RETURN_CODES.S_OK)
            {
                System.Text.StringBuilder sbBuffer = new System.Text.StringBuilder((int)sizeBuffer);
                // On return, sizeBuffer holds the number of characters written
                resultText = filter.GetText(ref sizeBuffer, sbBuffer);
                if (sizeBuffer > 0 && sbBuffer.Length > 0)
                {
                    string chunk = sbBuffer.ToString(0, (int)sizeBuffer);
                    plainTextResult.Append(chunk);
                }
            }
        }
return plainTextResult.ToString();
…
The last thing to do is to release the unmanaged resource.
…
finally
{
    if (filter != null)
    {
        Marshal.ReleaseComObject(filter);
        filter = null;
    }
}
…
Limitations
You should be aware that some of the IFilters around do not support multithreaded environments, and unfortunately IIS is multithreaded. A good example of this is Adobe's PDF IFilter. If you do some research you’ll find lots of reports of undefined behavior like this. So if you have to implement filtering on your own in a production environment, you may want to run the extraction work in a separate process in order to avoid these kinds of issues.
Conclusion
In this article you have seen how to use the IFilter interface and how to extract OLE properties with managed code. I have shown you how to manually create a Runtime Callable Wrapper for the IFilter interface and how to call the Windows API function LoadIFilter() to load an IFilter for a given file. I demonstrated all these things with a sample ASP.NET application which extracts content information for a given file. With this, you should have a general understanding of what goes on behind the scenes of applications which use the Microsoft indexing infrastructure, such as SharePoint Portal Server or MSN Desktop Search, to extract information from various file types.

