Friday, September 10, 2004

Using dtSearch and MCMS

So far, every site that we have been asked to build has required a search page, and search is one component that does not ship with MCMS. That's not a bad thing: it gives us the choice to use something that fits our requirements.

We have looked at some of the solutions that integrate with MCMS. They probably work well. But what we wanted to do was to make use of existing licenses that we have purchased for our other sites. In our case, that is dtSearch.

Our requirements for search were simple:

1. The search engine needed to crawl the entire web site that contains framed pages and add meta tags to its index
2. The results displayed should be filtered according to a user's rights. If he doesn't have rights to a posting, he shouldn't be seeing it in the results
3. The results can be retrieved from an index using simple queries
4. It's got to be able to index a Windows-authenticated site.

Requirements (1), (3) and (4) could be satisfied by the latest version of dtSearch. The only item that required some customization was (2).

The nice thing about the product is that it is amazingly simple to set up and use. Here's how you can do it too.

Installing dtSearch
First, of course, install dtSearch itself. This is fairly simple: run the setup.exe application on the CD and, when it's done installing, apply the latest upgrades downloadable from the web site. If you don't have dtSearch and would like to evaluate the software, you can download a 30-day evaluation copy from the web site.

Indexing Meta Tags
For the search engine to recognize meta tags, we provide dtSearch with a list of meta tags used by the postings.

With dtSearch Desktop opened, select Options : Preferences from the toolbar.

In the Preferences dialog, select Indexing Options : Text fields. Add each meta tag that you need indexed into the list. One important piece of information added at this point is the GUID of each posting. Indexing the GUID allows us to use it to get instances of the posting later when coding the search page.
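For the spider to pick these fields up, each posting's page has to emit them as meta tags in its HTML head. Here is a minimal sketch of that rendering step, written in Python purely for illustration (on a real MCMS site this would live in the template code). The field names "GUID" and "Category" and the GUID value are assumptions made up for the example, not names the site or dtSearch requires.

```python
from html import escape


def render_meta_tags(fields):
    """Return one <meta> tag per (name, value) pair, with HTML escaping."""
    return "\n".join(
        '<meta name="{}" content="{}">'.format(
            escape(name, quote=True), escape(str(value), quote=True)
        )
        for name, value in fields.items()
    )


# Hypothetical posting data: both the field names and the GUID are invented.
posting_fields = {
    "GUID": "{11111111-2222-3333-4444-555555555555}",
    "Category": "News & Events",
}

print(render_meta_tags(posting_fields))
```

Whatever names you choose for the tags must match the names entered in the Text fields list exactly.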

Later, complex queries that filter postings based on meta tags can be written.

Using Windows Authentication
Our MCMS site uses Windows authentication, so we had to specify a user name and password for the spider to use when crawling the site. We chose an account with subscriber access to all the postings that needed to be indexed. The user name and password were entered into the Indexing Options : Spider dialog (part of the Preferences dialog).

One drawback to this particular dialog is that it does not mask the password. Whatever you enter appears as clear text. Fortunately we had a shared account that was used solely for crawling, so this didn't bother us.

Defining the Site Map
Not all postings on the site were linked from an index page. In order to have these postings crawled by the spider, we created a single HTML page that contained links to all postings on the site. This was done by a recursive script coded using the MCMS PAPI.
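The shape of that recursive script is roughly as follows. This is only a sketch in Python, not our actual PAPI code: the Channel class below is a hypothetical stand-in for a channel that holds postings and sub-channels, and the names and URLs are invented.

```python
from html import escape


class Channel:
    """Hypothetical stand-in for a channel: postings plus sub-channels."""

    def __init__(self, name, postings=(), channels=()):
        self.name = name
        self.postings = list(postings)  # (display_name, url) tuples
        self.channels = list(channels)  # child Channel objects


def collect_posting_links(channel):
    """Recursively gather (display_name, url) for every posting below channel."""
    links = list(channel.postings)
    for child in channel.channels:
        links.extend(collect_posting_links(child))
    return links


def render_sitemap(root):
    """Build a single HTML page that links to every posting under root."""
    items = "\n".join(
        '<li><a href="{}">{}</a></li>'.format(escape(url, quote=True), escape(name))
        for name, url in collect_posting_links(root)
    )
    return "<html><body>\n<ul>\n{}\n</ul>\n</body></html>".format(items)


# Invented example tree: one posting at the root, one in a sub-channel.
root = Channel(
    "Root",
    postings=[("Home", "/home.htm")],
    channels=[Channel("News", postings=[("Launch", "/news/launch.htm")])],
)
print(render_sitemap(root))
```

The page only needs to be reachable by the spider; it does not have to be linked from the site's navigation.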

Creating the Search Index
Next, we create a search index. From the dtSearch Desktop, select Index : Create Index. Specify the name of the index and its location on the disk. The empty index will be created.

Crawling the entire web site for the first time
Now that the search engine had been configured, we were ready to build the index.
Because postings don't exist as files in folders, the only way for them to be added to the index is to use a spider to crawl them.

1. From the desktop, select Index : Update Index.
2. Select Add web...
3. Set the Starting URL for Spider to point to the sitemap created earlier or the index page of your site.
4. Set the crawl depth. For our case, a crawl depth of at least 2 was required for the spider to crawl postings in a framed site.
5. Click OK

And that completes the configuration. You can select the other options but these are the required options.

Click the Start Indexing button on the right of the dialog and watch the spider go!
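To see why the crawl depth in step 4 matters on a framed site, here is a small depth-limited crawl over an in-memory link table, sketched in Python. It assumes the starting URL counts as depth 0 and that a frameset "links" to its frames; how dtSearch counts depth internally may differ, and the URLs are invented.

```python
from collections import deque


def crawl(start, links, max_depth):
    """Breadth-first crawl; returns {url: depth} for pages within max_depth."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        depth = seen[url]
        if depth == max_depth:
            continue  # do not follow links out of pages at the depth limit
        for target in links.get(url, []):
            if target not in seen:
                seen[target] = depth + 1
                queue.append(target)
    return seen


# Invented site: the sitemap links to a frameset, which loads two frames.
site = {
    "/sitemap.htm": ["/news/frameset.htm"],
    "/news/frameset.htm": ["/news/menu.htm", "/news/story.htm"],
}

print(sorted(crawl("/sitemap.htm", site, 1)))  # frameset only, no content
print(sorted(crawl("/sitemap.htm", site, 2)))  # frames are reached too
```

With a depth of 1 the spider stops at the frameset and never sees the frame that holds the posting's content; a depth of 2 reaches it.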

The time it takes to index an entire web site depends on a variety of factors. We run ours on a relatively low-end server and the site has roughly 20,000 postings. It takes approximately 6 hours to complete a full job. On a higher-end server, it takes significantly less time to index the same number of pages, about 3 hours.

Building the Search web page
Another good thing about dtSearch is the developer's API. You must download and install dtSearch Developer which contains the libraries and documentation in order to work with it.

You can code with ASP.NET (C# or VB.NET). You can perform just about any kind of query. We have done keyword matches, date comparisons, category filters and so on. I won't go into details here, but samples can be found online.

Here, requirement (2) was satisfied by checking to see if Searches.GetByGuid() returned a null value. If it did, then the user did not have access to the posting and the posting was not added as part of the search result.
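The filtering logic amounts to the following, sketched here in Python rather than the ASP.NET code we actually run. The get_by_guid argument is a hypothetical stand-in for the GUID lookup performed in the current user's context; the shape of a hit (a dictionary with "guid" and "title" keys) is also an assumption and is not the dtSearch result format.

```python
def filter_by_access(hits, get_by_guid):
    """Keep only the hits whose GUID resolves to an item for the current user."""
    visible = []
    for hit in hits:
        item = get_by_guid(hit["guid"])
        if item is None:
            continue  # lookup returned nothing: the user has no access
        visible.append(hit)
    return visible


# Invented data: the current user can only see posting "A".
accessible = {"A": "posting A"}
hits = [
    {"guid": "A", "title": "Visible posting"},
    {"guid": "B", "title": "Restricted posting"},
]

for hit in filter_by_access(hits, accessible.get):
    print(hit["title"])
```

Because the check runs after the search, the number of results shown can be smaller than the number of hits the index returns, which is worth keeping in mind if you page the results.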

All in all, the conclusion is: You can use just about any search engine that uses a spider to crawl MCMS web sites. It's more important to find one that has the features you require and the price tag that fits your budget.

