Quick start guide to crawl the web


Now that you have installed OSS and created your first index (see the previous sections of the Quick Start Guide), this section will show you how to crawl URLs and insert the content into said index.


Crawling URLs

Click on the Crawler tab and enter a URL pattern to crawl (such as http://en.wikipedia.org/*), then press the Add button.
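To get a feel for what a pattern such as http://en.wikipedia.org/* covers, here is a minimal Python sketch of wildcard matching. It assumes that * stands for any run of characters and that everything else must match literally; this is an illustration only, not OSS's actual matching code, and the helper names wildcard_to_regex and matches_pattern are hypothetical.

```python
import re


def wildcard_to_regex(pattern):
    """Turn a pattern where * means 'any run of characters' into a regex.

    Hypothetical helper for illustration; OSS's own matching may differ.
    """
    # Escape each literal piece so that characters like '.' are not
    # treated as regex operators, then join the pieces with '.*'.
    literal_parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile(".*".join(literal_parts))


def matches_pattern(url, pattern):
    """Return True if the whole URL matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(url) is not None


pattern = "http://en.wikipedia.org/*"
print(matches_pattern("http://en.wikipedia.org/wiki/Search_engine", pattern))  # True
print(matches_pattern("http://fr.wikipedia.org/wiki/Accueil", pattern))        # False
```

Under these assumptions, the pattern accepts any page under the English Wikipedia host and rejects other hosts, which is what keeps the crawl confined to a single site.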

QuickStartGuide Cap01.png

Since in this example we are crawling an extremely busy web site, we're going to use conservative values in OSS to make sure that our crawling does not disrupt Wikipedia. Note the low number of simultaneous threads and the high delay between accesses in the screen capture below.

Please note that the crawling parameters cannot be changed while the crawler is running. To change them, stop the crawler, wait for it to finish, then change the values.

Once you have entered these conservative values, make sure that the dry run option is unchecked and that Optimize at the end of the crawl session is checked.

Now press the button labeled Not running - Click to start.

QuickStartGuide Cap02.png

The session statistics should now gradually populate the interface.

QuickStartGuide Cap03.png

To know how many pages were fetched, refer to the Fetched / Count column of the statistics.

QuickStartGuide Cap03 FetchedCountDetail.png

Keep in mind that we asked the crawler to go easy on the target web site, so the crawling will be a little slow.
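As a rough illustration of how slow a polite crawl can be, the following Python sketch estimates an upper bound on the number of pages fetched per hour. It assumes that each thread waits the full delay between two requests and it ignores download time. Whether OSS applies the delay per thread or per host is not stated here, so treat the formula as an assumption. The function name max_pages_per_hour and the example values are hypothetical and are not the values from the screen capture.

```python
def max_pages_per_hour(threads, delay_seconds):
    """Upper bound on pages fetched per hour.

    Assumption: every thread issues one request, then waits
    delay_seconds before the next one; download time is ignored.
    """
    if threads < 1 or delay_seconds <= 0:
        raise ValueError("threads must be >= 1 and delay_seconds must be > 0")
    return threads * 3600.0 / delay_seconds


# Hypothetical values: 1 thread, 10 seconds between accesses.
print(max_pages_per_hour(threads=1, delay_seconds=10))  # 360.0
```

With these example values the crawler would fetch at most a few hundred pages per hour, so a small test index can take a while to build.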

Once you think enough pages have been crawled, click the button again and wait for the 'Aborting...' message to disappear. During this delay, the OSS engine is optimizing your index.


So far, so good? You can now go back to the Main Page to either check the other ways of crawling data, or experiment with ways to query data from the index.
