Index Server

Microsoft Index Server indexes the contents and properties of documents on an Internet or intranet Web site served by IIS. You can set up Index Server so that clients can search a Web site with any browser by filling in the fields of a query form formatted in HTML. When a client executes a query, the Web server forwards the information typed into the query form to the query engine. The query engine finds the pertinent documents, formats the results as an HTML Web page, and returns the results to the client. In addition to indexing Web pages in HTML format, Index Server indexes documents formatted by applications such as Word and Excel. Thanks to this feature, you can add documents to a Web site without having to convert them into HTML format.

Requirements

Index Server's minimum hardware requirements are the same as those of IIS. You need Windows NT to install Index Server. The program itself requires about 3 to 12 MB of hard disk space. The disk space needed for storing Index Server data depends on the size of the corpus (all of the files searched) and the type of documents. For security reasons, it is recommended that the Index Server data be stored on an NTFS drive.

The following table shows the recommended memory configuration based on the number of documents. The size of the corpus determines the disk space needed for storing Index Server data: in all cases, plan for approximately 40 percent of the size of the corpus. Average usage is less than 30 percent of the corpus, but peak usage can reach 40 percent. For example, if you have a 500 MB corpus, you'll need 200 MB of free disk space. According to the NT Option Pack documentation, the minimum disk space needed is 30 percent of the corpus, and up to 45 percent can be used during indexing.

Number of Documents	Minimum Memory (in megabytes)	Recommended Memory (in megabytes)
Fewer than 10,000	32 for Windows NT Server	Same as minimum
Fewer than 100,000	32	32
100,000-250,000	32	64-128
250,000-500,000	64	128-256
500,000 or more	128	256 or more
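
The sizing guidance above can be turned into a quick estimate. The following is a minimal sketch in Python, not part of Index Server: the function names are hypothetical, the 40 percent figure is the planning value quoted above, and the handling of the table's shared range endpoints is an assumption.

```python
# Sizing sketch based on the figures quoted in this section; all results are estimates.

# Rows of the memory table: (exclusive upper bound on documents, minimum MB, recommended MB).
# The "fewer than 10,000" row is folded into the first entry because both rows list 32 MB.
# The table's ranges share endpoints (250,000 and 500,000); treating each upper
# bound as exclusive is an assumption made here.
MEMORY_TABLE = [
    (100_000, 32, "32"),
    (250_000, 32, "64-128"),
    (500_000, 64, "128-256"),
    (None, 128, "256 or more"),
]


def index_disk_mb(corpus_mb, percent_of_corpus=40):
    """Estimate disk space for Index Server data as a percentage of the corpus."""
    return corpus_mb * percent_of_corpus / 100


def memory_for(document_count):
    """Return (minimum MB, recommended MB) for a given number of documents."""
    for upper_bound, minimum_mb, recommended in MEMORY_TABLE:
        if upper_bound is None or document_count < upper_bound:
            return minimum_mb, recommended


print(index_disk_mb(500))    # 200.0, the 500 MB example above
print(memory_for(300_000))   # (64, '128-256')
```

Passing percent_of_corpus=45 gives the worst-case figure quoted for the indexing phase.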

If the drive where the catalog is stored starts to fill up, indexing is temporarily paused. If very little free disk space is available on the index disk (less than 3 MB), indexing cannot continue until more disk space is freed on the catalog drive. A message is written to the event log when the disk starts filling up. Administrators should monitor the event log for these messages and free disk space by deleting, moving, or compressing files.
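
Besides watching the event log, free space on the catalog drive can be checked directly. The sketch below uses Python's standard shutil.disk_usage. The 3 MB threshold is the figure quoted above; the path and the function names are hypothetical, and the script does not read Index Server's own state.

```python
import shutil

LOW_SPACE_THRESHOLD_MB = 3  # figure quoted above: indexing cannot continue below this


def free_mb(path):
    """Return free space, in megabytes, on the drive that holds `path`."""
    return shutil.disk_usage(path).free / (1024 * 1024)


def is_critically_low(free_space_mb, threshold_mb=LOW_SPACE_THRESHOLD_MB):
    """True when free space is below the threshold at which indexing stops."""
    return free_space_mb < threshold_mb


catalog_path = "."  # hypothetical: replace with a path on the catalog drive
available = free_mb(catalog_path)
if is_critically_low(available):
    print(f"Only {available:.1f} MB free: delete, move, or compress files")
else:
    print(f"{available:.1f} MB free on the catalog drive")
```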

Performance Improvement

Word lists are small, in-memory indexes. Each word list contains data for a small number of documents. As soon as a document is filtered, its data is stored in a word list. Creating a word list is very quick and does not require updating any on-disk data. Word lists serve as a temporary staging area during indexing. Two important Registry settings for word lists are MaxWordLists and MaxWordlistSize. MaxWordlistSize is by default 14 units of 128 KB (1,792 KB). Every time this value is exceeded, a new word list is created. By default, after 20 word lists (as defined by MaxWordLists) are created, the word lists are merged into a shadow index.

If the word lists are taking too much memory, reduce the number of word lists (MaxWordLists) so that shadow indexes are formed sooner. The size of each word list (MaxWordlistSize) also needs to be reduced. With the default settings, word lists can use up to 20 x 1,792 KB = 35 MB. If you reduce MaxWordLists to 10 and MaxWordlistSize to 10, the memory used will be 10 x 10 x 128 KB = 12.5 MB.
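
The arithmetic above is easy to repeat for other settings. This is a small Python sketch with a hypothetical function name; it only multiplies the two Registry values by the 128 KB unit and does not read the Registry.

```python
WORD_LIST_UNIT_KB = 128  # MaxWordlistSize is expressed in units of 128 KB


def word_list_memory_kb(max_word_lists, max_wordlist_size):
    """Upper bound on memory used by word lists, in kilobytes."""
    return max_word_lists * max_wordlist_size * WORD_LIST_UNIT_KB


default_kb = word_list_memory_kb(20, 14)  # default settings
reduced_kb = word_list_memory_kb(10, 10)  # reduced settings from the example

print(default_kb / 1024)  # 35.0 (MB)
print(reduced_kb / 1024)  # 12.5 (MB)
```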

Once the number of word lists exceeds the MaxWordLists (by default 20) parameter, the word lists are merged into a shadow index. This merge process is called the shadow merge. Although the data in word lists is compressed to some extent, the compression is not very high because word lists are temporary structures. Because word lists are in-memory structures, documents in a word list must be refiltered whenever the Content Index service is restarted. The refiltering is automatically detected and performed by the Index Server engine.
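
The staging behavior described above can be pictured with a toy model. The Python sketch below is purely illustrative: the class and method names are hypothetical, each word list holds a single document for simplicity, and nothing here reflects Index Server's real data structures or on-disk format.

```python
class ToyCatalog:
    """Toy model of word lists and shadow merges; not Index Server's real structures."""

    def __init__(self, max_word_lists=20):
        self.max_word_lists = max_word_lists
        self.word_lists = []      # in-memory staging: (document name, set of words)
        self.shadow_indexes = []  # stands in for persistent shadow indexes

    def filter_document(self, name, text):
        """Store a filtered document's words in a new word list."""
        self.word_lists.append((name, set(text.lower().split())))
        if len(self.word_lists) > self.max_word_lists:
            self.shadow_merge()

    def shadow_merge(self):
        """Merge all current word lists into one shadow index."""
        merged = set()
        for _, words in self.word_lists:
            merged |= words
        self.shadow_indexes.append(merged)
        self.word_lists = []

    def restart(self):
        """Simulate a restart: word lists are lost, so return the documents to refilter."""
        to_refilter = [name for name, _ in self.word_lists]
        self.word_lists = []
        return to_refilter


catalog = ToyCatalog(max_word_lists=2)
catalog.filter_document("a.htm", "index server")
catalog.filter_document("b.doc", "word lists")
catalog.filter_document("c.xls", "shadow merge")  # third word list exceeds the limit of 2
print(len(catalog.shadow_indexes), len(catalog.word_lists))  # 1 0
catalog.filter_document("d.htm", "master merge")
print(catalog.restart())  # ['d.htm']
```

In this model, only documents still held in word lists need refiltering after a restart; anything already merged into a shadow index survives.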

If response time for queries begins to slow down, you need to free resources by combining indexes. From time to time, you need to combine smaller indexes into larger ones to free space both in memory and on disk and to resolve queries quickly. To merge smaller indexes, you need to force a merge.

For example, suppose Index Server was tracking 50 MB of data. You add 100 MB more, and users complain that queries are too slow. You need to force a master merge.

Master merge is potentially a long, CPU-intensive activity. Query execution may visibly slow during master merge. Consider displaying an extra confirmation page before running the script itself.

The property cache is an on-disk store optimized to speed up the retrieval of frequently retrieved values such as Path, Abstract, Title, Attributes, Last Write time stamp, File Size, and some values for internal use only. The property cache also stores some properties that are created when documents are filtered.

On large index servers, increasing the size of the property cache tends to yield better performance. But if physical memory is inadequate, performance might suffer. For example, suppose you have 64 MB of RAM and 64 MB of virtual memory. When you increase the cache, performance decreases because the cache is now using virtual memory.

Multi-Domain Indexing

If you add a virtual root pointing to the remote UNC share, it will automatically be indexed by Index Server. Make sure that you specify the user ID (domain\user) and password correctly.

Let's say that IIS is running on the Sales1 server in the Sales domain, and you would like to index something on the Marketing1 server in the Marketing domain. The Marketing domain needs to trust the Sales domain, and a user such as Sales\IUSR_Sales1 needs to have the Log on Locally user right at Marketing1. The folder that needs to be indexed must be shared as \\Marketing1\sharename. A virtual root is then created with a user that has permission to access the share. Before indexing, the Content Index service needs to be stopped and restarted.

If you have had problems with remote virtual roots before starting this procedure, you should delete the contents of your Catalog.wci directory. Removing the contents of this directory starts Index Server with a clean set of data.

Multi-Web Site Indexing

A catalog is the highest-level unit of organization in Microsoft Index Server. Each catalog is a completely self-contained unit, containing an index and cached properties for one or more scopes (virtual directories). In this version of Index Server, queries cannot span multiple catalogs.

If you have three Web sites running on one IIS server and a separate catalog is created for each Web site, any query will be limited to one Web site only.

Index Server can index virtual directories, but with poor performance.

Viewing Unfiltered Files

Files that are corrupted or files that could not be filtered because of problems in a filter DLL are called unfiltered files and can be viewed by issuing the administrative command for unfiltered files. This command issues the query: @Unfiltered=true.

Wednesday, December 03, 2003