CiDaemon Process
Filter DLLs
Associating File Types with Extensions
Word-Breaker DLLs
Noise Words
CiDaemon Priority Settings
Related Performance Counters
Disk Full Condition
Microsoft Index Server filters documents by inserting data from the document files into content indexes. Content filters break documents into words (keys) and create word lists, which supply raw data for the index. Filtering is a three-step process:
The CiDaemon process is a child process created by the Microsoft Index Server engine. The engine passes a list of documents to the CiDaemon process, which is responsible for filtering them by identifying the correct filter DLL and word-breaker DLL for each document.
Filtering is done as a background activity so as not to interfere with any foreground activity. On local drives, if a document opened by the CiDaemon process for reading is needed by another process for writing, the CiDaemon process closes the document as soon as possible. The document will be retried for filtering at a later time. (This feature is not available on network shares.)
If the CiDaemon process stops, it will be automatically restarted by the Index Server engine.
A filter DLL understands one or more document formats and is capable of extracting text and properties from documents of those types. A filter DLL implements the IFilter ActiveX interface. The CiDaemon process uses the IFilter interface to extract the text from a document. To track down a problem with a filter DLL, an administrator needs to know how to find the filter DLL registered for a particular document type. Editing the registry is also a good way to avoid filtering documents with no useful content.
Caution Editing the registry incorrectly can cause serious problems, including corruption that may make it necessary to reinstall Windows NT or Microsoft Index Server. Using the Registry Editor to edit entries in the registry is equivalent to editing raw sectors on a hard disk. If you make mistakes, your computer's configuration could be damaged. You should edit registry entries only for settings that you cannot adjust through the user interface, and be very careful whenever you edit the registry directly.
Document types and their associated filter DLL entries are specified in the registry under the \HKEY_LOCAL_MACHINE\Software\Classes tree. To find the filter DLL associated with a particular document type, navigate through the entries in that tree.
The four steps for finding the filter DLL for a document follow. The example is for HTML files.
Find the CLSID associated with the document type under the registry key \HKEY_LOCAL_MACHINE\SOFTWARE\Classes. Let this be <Value1>.
\HKEY_LOCAL_MACHINE\SOFTWARE\Classes
    htmlfile = Class for WWW HTML files
        CLSID = {25336920-03F9-11CF-8FD0-00AA00686F13}
Using <Value1> found in Step 1, find the PersistentHandler value for the \HKEY_LOCAL_MACHINE\SOFTWARE\Classes\CLSID\<Value1> key. Let this be <Value2>.
\HKEY_LOCAL_MACHINE\SOFTWARE\Classes\CLSID
    {25336920-03F9-11CF-8FD0-00AA00686F13} = WWW HTML files
        PersistentHandler = {EEC97550-47A9-11CF-B952-00AA0051FE20}
Using <Value2> determined in Step 2, find the IFilter Persistent Handler GUID for the document type. The value under the key \HKEY_LOCAL_MACHINE\SOFTWARE\Classes\CLSID\<Value2>\PersistentAddinsRegistered\89BCB740-6119-101A-BCB7-00DD010655AF yields the IFilter Persistent Handler GUID for this document type. Let this be <Value3>. 89BCB740-6119-101A-BCB7-00DD010655AF is the IFilter interface GUID.
\Registry\Machine\Software\Classes\CLSID
    {EEC97550-47A9-11CF-B952-00AA0051FE20} = REG_SZ HTML File Persistent Handler
        PersistentAddinsRegistered
            {89BCB740-6119-101A-BCB7-00DD010655AF} = REG_SZ {E0CA5340-4534-11CF-B952-00AA0051FE20}
Using <Value3> determined in Step 3, the filter DLL can be found under the entry \HKEY_LOCAL_MACHINE\SOFTWARE\Classes\CLSID\<Value3>\InprocServer32.
\Registry\Machine\Software\Classes\CLSID
    {E0CA5340-4534-11CF-B952-00AA0051FE20} = REG_SZ HTML Filter
        InprocServer32 = REG_SZ htmlfilt.dll
In this example, the filter DLL for HTML documents is Htmlfilt.dll.
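The four-step lookup above can be sketched in Python. This is only an illustration under stated assumptions: the registry is modeled as a nested dictionary that holds the example values from this section (with "" standing for a key's default value), and the helper name find_filter_dll is hypothetical. On a real Windows NT system the same keys would be read from the registry itself.

```python
# Minimal sketch: the registry is modeled as a nested dict (an assumption made
# for illustration only); "" stands for a key's default value.
IFILTER_IID = "{89BCB740-6119-101A-BCB7-00DD010655AF}"  # IFilter interface GUID

CLASSES = {
    "htmlfile": {
        "": "Class for WWW HTML files",
        "CLSID": {"": "{25336920-03F9-11CF-8FD0-00AA00686F13}"},
    },
    "CLSID": {
        "{25336920-03F9-11CF-8FD0-00AA00686F13}": {
            "": "WWW HTML files",
            "PersistentHandler": {"": "{EEC97550-47A9-11CF-B952-00AA0051FE20}"},
        },
        "{EEC97550-47A9-11CF-B952-00AA0051FE20}": {
            "": "HTML File Persistent Handler",
            "PersistentAddinsRegistered": {
                IFILTER_IID: {"": "{E0CA5340-4534-11CF-B952-00AA0051FE20}"},
            },
        },
        "{E0CA5340-4534-11CF-B952-00AA0051FE20}": {
            "": "HTML Filter",
            "InprocServer32": {"": "htmlfilt.dll"},
        },
    },
}


def find_filter_dll(classes, doc_type):
    """Walk the four registry steps and return the filter DLL name."""
    clsid = classes[doc_type]["CLSID"][""]                       # Step 1: <Value1>
    handler = classes["CLSID"][clsid]["PersistentHandler"][""]   # Step 2: <Value2>
    addins = classes["CLSID"][handler]["PersistentAddinsRegistered"]
    filter_clsid = addins[IFILTER_IID][""]                       # Step 3: <Value3>
    return classes["CLSID"][filter_clsid]["InprocServer32"][""]  # Step 4: the DLL


print(find_filter_dll(CLASSES, "htmlfile"))
```

A missing key at any step raises KeyError in this sketch, which corresponds to a document type that has no filter DLL registered along that path.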
File types are associated with file extensions under the \HKEY_LOCAL_MACHINE\SOFTWARE\Classes tree. The following are the associations for the htmlfile document type:
\HKEY_LOCAL_MACHINE\SOFTWARE\Classes
    .htm = REG_SZ htmlfile
    .html = REG_SZ htmlfile
    .htx = REG_SZ htmlfile
    .stm = REG_SZ htmlfile
By default, the extensions listed above are considered to be htmlfile documents. To add another extension to this list, create an entry in the registry that associates the extension with the htmlfile type. For example, to treat .htx files as the htmlfile type, add the following entry:
\HKEY_LOCAL_MACHINE\SOFTWARE\Classes .htx = REG_SZ htmlfile
To add new filter DLLs, please refer to the documentation provided with the filter DLLs.
To remove a filter DLL, the IFilter PersistentHandler entry associated with a document type and the filter DLL entry must be deleted. Refer to the Filter DLLs section to see how to find the IFilter PersistentHandler for a particular document type.
For example, to remove the installed Htmlfilt.dll, the following two entries must be removed:
\Registry\Machine\Software\Classes\CLSID
    {EEC97550-47A9-11CF-B952-00AA0051FE20}
        PersistentAddinsRegistered
            {89BCB740-6119-101A-BCB7-00DD010655AF} = REG_SZ {E0CA5340-4534-11CF-B952-00AA0051FE20}
\Registry\Machine\Software\Classes\CLSID
    {E0CA5340-4534-11CF-B952-00AA0051FE20} = REG_SZ HTML Filter
        InprocServer32 = REG_SZ htmlfilt.dll
When a registered binary file is encountered, the NULL filter is used. The NULL filter retrieves only the system properties. The contents of a binary file are not filtered. Examples of system properties are the FileName, last Write time, file Size, Attributes, and so on.
A file with a certain extension is considered to be a binary file if its type in the registry is set to BinaryFile. For example, to associate the extension .lib with the binary file type, add the following entry to the registry:
\HKEY_LOCAL_MACHINE\Software\Classes \.lib = REG_SZ BinaryFile
The class BinaryFile is a predefined type that uses the NULL filter for its IFilter implementation.
Warning If the extension for which you wish to use the NULL filter already has a file type, do not change it to BinaryFile. Doing so could damage your Windows NT installation. Instead, use the following procedure to set the implementation of the IFilter interface for the file type.
When a file extension already has a file type, use the previous procedure to look up the PersistentAddinsRegistered key and set the IFilter interface implementation. The example below is for files with the extension .dll.
Find the file type associated with the file extension .dll.
\HKEY_LOCAL_MACHINE\Software\Classes \.dll = REG_SZ dllfile
Look up the CLSID associated with the dllfile type in the registry.
\HKEY_LOCAL_MACHINE\Software\Classes
    dllfile = REG_SZ Application Extension
        CLSID = REG_SZ {3cf51a00-84eb-11ce-ac07-00004c752752}
Look up the persistent handler GUID for the CLSID in the registry. If there is no persistent handler, set it to the CLSID for the persistent handler of the NULL filter, {098F2470-BAE0-11CD-B579-08002B30BFEB}. Otherwise, continue with the next step.
\HKEY_LOCAL_MACHINE\Software\Classes
    CLSID
        {3cf51a00-84eb-11ce-ac07-00004c752752}
            PersistentHandler = REG_SZ {098F2470-BAE0-11CD-B579-08002B30BFEB}
Look up the CLSID found in the step above and set the IFilter handler {89BCB740-6119-101A-BCB7-00DD010655AF} to the NULL filter GUID {C3278E90-BEA7-11CD-B579-08002B30BFEB}.
\HKEY_LOCAL_MACHINE\Software\Classes
    CLSID
        {098F2470-BAE0-11CD-B579-08002B30BFEB}
            PersistentAddinsRegistered
                {89BCB740-6119-101A-BCB7-00DD010655AF} = REG_SZ {C3278E90-BEA7-11CD-B579-08002B30BFEB}
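The procedure above for an extension that already has a file type can be sketched as follows. As before, the registry is modeled as a nested dictionary purely for illustration ("" stands for a key's default value), and the function name use_null_filter is hypothetical. The GUIDs are the ones quoted in this section.

```python
IFILTER_IID = "{89BCB740-6119-101A-BCB7-00DD010655AF}"   # IFilter interface GUID
NULL_HANDLER = "{098F2470-BAE0-11CD-B579-08002B30BFEB}"  # NULL filter persistent handler
NULL_FILTER = "{C3278E90-BEA7-11CD-B579-08002B30BFEB}"   # NULL filter GUID


def use_null_filter(classes, extension):
    """Point the IFilter implementation of an existing file type at the NULL filter."""
    file_type = classes[extension][""]          # e.g. ".dll" -> "dllfile"
    clsid = classes[file_type]["CLSID"][""]     # CLSID of the file type
    clsid_key = classes["CLSID"].setdefault(clsid, {})
    # If there is no persistent handler, install the NULL filter's handler;
    # otherwise keep the existing one and continue with the next step.
    handler = clsid_key.setdefault("PersistentHandler", {"": NULL_HANDLER})[""]
    # Set the IFilter handler entry under that persistent handler to the NULL filter.
    handler_key = classes["CLSID"].setdefault(handler, {})
    addins = handler_key.setdefault("PersistentAddinsRegistered", {})
    addins[IFILTER_IID] = {"": NULL_FILTER}
    return handler


# Modeled registry contents for the .dll example (no persistent handler yet).
classes = {
    ".dll": {"": "dllfile"},
    "dllfile": {
        "": "Application Extension",
        "CLSID": {"": "{3cf51a00-84eb-11ce-ac07-00004c752752}"},
    },
    "CLSID": {},
}
handler = use_null_filter(classes, ".dll")
print(handler)
```

The file type association for .dll itself is never changed here, which is the point of the warning above: only the IFilter implementation is redirected.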
Here is a list of default extensions for binary files:
.aif,.avi,.cgm,.com,.dct,.dic,.dll,.exe,.eyb,.fnt,.ghi,.gif,
.hqx,.ico,.inv,.jbf,.jpg,.m14,.mov,.movie,.mv,
.pdf,.pic,.pma,.pmc,.pml,.pmr,.psd,.sc2,
.tar,.tif,.tiff,.ttf,.wav,.wll,.wlt,.wmf,.z,.z96,.zip
In Index Server, a default filter filters both the system properties (such as file name) and the contents of a file. The default filter does not understand any document formats; when filtering the contents of a file, it treats the file as a sequence of characters. Index Server uses the default filter when the extension of a file has no association in the registry and the value of the registry setting FilterFilesWithUnknownExtensions is 1.
Note The default filter filters plain text and files of unknown origin. It assumes all text to be in the default codepage of the server.
If a file is corrupted, the filter DLL may not be able to properly interpret the contents of that file. To get a list of files that could not be filtered, see Unfiltered Files. An event is also written to the event log. Sometimes a file cannot be filtered because of a defective third-party filter DLL. After verifying the contents of a file, an administrator should report the problems to the filter DLL vendor. Files protected by passwords are not filtered.
If a document cannot be filtered, filtering is retried up to a maximum number of times. If the document still cannot be filtered, it is considered an unfiltered file. The registry key FilterRetries controls the maximum number of retries for a document.
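The retry behavior can be pictured with a short sketch. The function names and the exception type are hypothetical, and the sketch assumes one initial attempt followed by at most FilterRetries retries; the text above states only that FilterRetries caps the number of retries.

```python
def filter_with_retries(document, filter_once, filter_retries):
    """Return True if the document was filtered, False if it ends up unfiltered."""
    # Assumption: one initial attempt plus up to `filter_retries` retries.
    for _attempt in range(1 + filter_retries):
        try:
            filter_once(document)
            return True
        except Exception:
            continue  # filtering failed; the document is retried
    return False  # retries exhausted: treated as an unfiltered file


# Hypothetical stand-in for a filter call that fails on its first two attempts.
attempts = []


def flaky_filter(document):
    attempts.append(document)
    if len(attempts) < 3:
        raise OSError("document is busy")


ok = filter_with_retries("report.doc", flaky_filter, filter_retries=4)
print(ok, len(attempts))
```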
To get a list of all the files that could not be filtered, issue the query @unfiltered = TRUE.
A file whose extension has no association in the registry is treated as having an unknown extension. The behavior of Index Server then depends on the registry setting FilterFilesWithUnknownExtensions. If this value is set to 0, the NULL filter is used to filter those files. Otherwise, the default filter is used to filter the contents.
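The filter selection rules described in this section can be summarized in a small decision function. This is a sketch only: the function name, the string labels, and the dictionary of associations are illustrative and do not correspond to an Index Server API.

```python
NULL_FILTER = "NULL filter"
DEFAULT_FILTER = "default filter"


def choose_filter(extension, associations, filter_unknown_extensions):
    """Pick a filter for a file extension following the rules described above."""
    file_type = associations.get(extension.lower())
    if file_type is None:
        # Unknown extension: the outcome depends on the registry setting.
        return DEFAULT_FILTER if filter_unknown_extensions != 0 else NULL_FILTER
    if file_type == "BinaryFile":
        return NULL_FILTER  # only system properties are retrieved
    return "filter registered for " + file_type


# Illustrative associations, as they might appear under Software\Classes.
associations = {".htm": "htmlfile", ".lib": "BinaryFile"}
print(choose_filter(".htm", associations, 1))
print(choose_filter(".lib", associations, 1))
print(choose_filter(".xyz", associations, 0))
```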
By default, directories are not filtered and will not appear in query results. To filter directories, set the registry key FilterDirectories to 1. When directories are filtered, their system properties are filtered.
The CiDaemon process can automatically generate a summary, or characterization (also called an abstract), for each document. If the registry key GenerateCharacterization is set to 1, the characterization is generated automatically. The maximum number of characters in the generated characterization is controlled by the registry key MaxCharacterization.
The list of document types for which filter DLLs are preinstalled is given below:
A word-breaker DLL parses the text and textual properties returned by the filter DLL into words. The word-breaker DLL is language dependent. The following languages are supported by Microsoft Index Server:
Words that are not significant for searching are called noise words or stop words. Noise words are stored in the %systemroot%\system32 directory in various noise word files (Noise.enu, by default). The noise word files are language dependent. The noise word file for a particular language is specified in the registry under the key:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\ContentIndex\Language\<language>\NoiseFile
For example, the noise word file for English_US is listed as the registry key:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\ContentIndex\Language\English_US\NoiseFile = noise.enu
The noise word files can be edited with a text editor to either add new words or remove words that are not considered noise at a particular installation. Note that querying for noise words will not yield any hits.
Removing all noise words from the noise word files can significantly increase the size of indexes.
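The effect of a noise word list can be illustrated with a short sketch. It assumes that a noise word file is plain text with whitespace-separated words, which is not confirmed by this section, and the function names are hypothetical.

```python
def load_noise_words(text):
    """Parse noise word file contents (assumed layout: whitespace-separated words)."""
    return {word.lower() for word in text.split()}


def strip_noise(words, noise_words):
    """Drop noise words so that only significant words reach the index."""
    return [word for word in words if word.lower() not in noise_words]


# Illustrative noise words; a real list would come from a file such as Noise.enu.
noise = load_noise_words("a an and the of\nis to")
significant = strip_noise("the index of the server".split(), noise)
print(significant)
```

Because the noise words never reach the index, a query for one of them has nothing to match, and removing words from the list means more keys are stored.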
The CiDaemon priority is controlled by two settings:
ThreadClassFilter specifies the priority class of the filter daemon. The possible values are:
Priority class | Value |
---|---|
NORMAL_PRIORITY_CLASS | 0x00000020 |
IDLE_PRIORITY_CLASS (default) | 0x00000040 |
HIGH_PRIORITY_CLASS | 0x00000080 |
REALTIME_PRIORITY_CLASS | 0x00000100 |
ThreadPriorityFilter specifies the priority in the specific class. The possible values are:
Thread priority | Value |
---|---|
THREAD_PRIORITY_LOWEST | -2 |
THREAD_PRIORITY_BELOW_NORMAL | -1 |
THREAD_PRIORITY_NORMAL | 0 |
THREAD_PRIORITY_ABOVE_NORMAL (default) | +1 |
THREAD_PRIORITY_HIGHEST | +2 |
By default, the CiDaemon process runs in the idle priority class to prevent interference with normal foreground activity. On a busy server, this might result in files never being filtered. To run the CiDaemon process as a normal process, set ThreadClassFilter to NORMAL_PRIORITY_CLASS and ThreadPriorityFilter to THREAD_PRIORITY_NORMAL. Setting ThreadClassFilter to HIGH_PRIORITY_CLASS or REALTIME_PRIORITY_CLASS is not recommended because it may interfere with normal activity on the system.
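When reviewing these two settings, it can help to decode the numeric values back into their names. The sketch below uses the values listed above; the function name describe_priority and the returned tuple layout are illustrative choices, not part of Index Server.

```python
PRIORITY_CLASSES = {
    0x00000020: "NORMAL_PRIORITY_CLASS",
    0x00000040: "IDLE_PRIORITY_CLASS",
    0x00000080: "HIGH_PRIORITY_CLASS",
    0x00000100: "REALTIME_PRIORITY_CLASS",
}
THREAD_PRIORITIES = {
    -2: "THREAD_PRIORITY_LOWEST",
    -1: "THREAD_PRIORITY_BELOW_NORMAL",
    0: "THREAD_PRIORITY_NORMAL",
    1: "THREAD_PRIORITY_ABOVE_NORMAL",
    2: "THREAD_PRIORITY_HIGHEST",
}
# Classes that the text above advises against for the filter daemon.
NOT_RECOMMENDED = {"HIGH_PRIORITY_CLASS", "REALTIME_PRIORITY_CLASS"}


def describe_priority(thread_class_filter, thread_priority_filter):
    """Return (class name, priority name, not_recommended flag) for the two settings."""
    class_name = PRIORITY_CLASSES[thread_class_filter]
    priority_name = THREAD_PRIORITIES[thread_priority_filter]
    return class_name, priority_name, class_name in NOT_RECOMMENDED


# Settings for running the CiDaemon process as a normal process.
print(describe_priority(0x00000020, 0))
```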
The following counters are present under the performance monitor object Content Index.
Counter Name | Explanation |
---|---|
# documents filtered | The number of documents filtered since the indexing was started in the current process instantiation. Note that this does not include the documents filtered in prior runs of Index Server. |
Files to be filtered | These are the files remaining to be filtered. |
Total # of documents | Total number of documents known to the index. |
The following counters are present under the performance monitor object Content Index Filter.
Counter Name | Explanation |
---|---|
Binding Time | Average time (in milliseconds) to bind to a filter DLL. |
Filter Speed | Speed (in megabytes per hour) at which documents are filtered. |
Total Filter Speed | Speed (in megabytes per hour) at which documents are indexed. This includes the time to filter document contents as well as the time to filter properties and generate abstracts. |
If the free disk space on the index disk starts running low (less than 3 MB), filtering will be temporarily paused. A disk-full event will be written to the event log. The administrator should free up disk space by deleting or moving files from that drive.
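An administrator who wants to catch this condition before filtering pauses could run a small check like the following. It is a sketch under stated assumptions: 3 MB is taken to mean 3 * 1024 * 1024 bytes, the function names are hypothetical, and the path passed in would be the drive that holds the index.

```python
import shutil

# Assumption: "3 MB" is interpreted as 3 * 2**20 bytes.
LOW_DISK_THRESHOLD = 3 * 1024 * 1024


def should_pause_filtering(free_bytes, threshold=LOW_DISK_THRESHOLD):
    """Return True when free space is below the low-disk threshold."""
    return free_bytes < threshold


def index_disk_is_low(path):
    """Check the free space on the drive that contains `path`."""
    return should_pause_filtering(shutil.disk_usage(path).free)


print(index_disk_is_low("."))
```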
© 1996 by Microsoft Corporation. All rights reserved.