Configuring Magnify Search Crawler

The Crawler allows you to customize several aspects of its crawling behavior, such as logging, URL filtering, meta tag injection, index naming, and category injection.

This section describes the available configuration options so that you can tailor the Crawler to your requirements.

  1. Configure logging by editing the log4j.properties file, which is required.

    The Crawler uses the Apache log4j logging framework. For more information, see https://logging.apache.org/log4j/1.2/manual.html.

  2. Configure URL filtering by editing the regex-urlfilter.txt file, which is required.

    The Crawler lets you restrict which pages or documents are crawled by matching their URLs against patterns written as Java Regular Expressions. For more information on the syntax used for Java Regular Expressions, see http://docs.oracle.com/javase/tutorial/essential/regex/.

    All comment lines begin with a pound sign (#) character. Each non-comment, non-blank line contains a regular expression prefixed by a plus sign (+) or minus sign (-) character, where '+' means include and '-' means exclude. For a page to be crawled, its URL must match all of the included regular expressions. However, the page is ignored if its URL matches any one of the excluded regular expressions. For example, the configuration shown in the following image will crawl anything within the Information Builders Technical Support Center website.
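
    The include and exclude rule described above can be sketched in a few lines of Python. This is only an illustration of the stated logic, not the Crawler's implementation: the function names, the sample rules, and the example.com URLs are hypothetical, and Python's re module stands in for Java Regular Expressions (the two agree for simple patterns like these). Whether the Crawler matches a pattern anywhere in the URL or against the whole URL is not stated here, so the sketch assumes a match anywhere.

```python
import re


def parse_rules(text):
    """Split filter text into lists of compiled include and exclude patterns."""
    include, exclude = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        sign, pattern = line[0], line[1:]
        if sign == "+":
            include.append(re.compile(pattern))
        elif sign == "-":
            exclude.append(re.compile(pattern))
    return include, exclude


def is_crawled(url, include, exclude):
    """Reject on any exclude match; otherwise require every include to match."""
    if any(p.search(url) for p in exclude):
        return False
    return all(p.search(url) for p in include)


# Hypothetical rules, for illustration only.
SAMPLE_RULES = r"""
# skip images, style sheets, and scripts
-\.(gif|jpg|png|css|js)$
# stay within one site
+^https?://www\.example\.com/
"""

include, exclude = parse_rules(SAMPLE_RULES)
```

    With these sample rules, a page such as http://www.example.com/docs/index.html is accepted, while an image on the same site or any page on another host is rejected.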

  3. Configure meta tag injection by editing the meta-tag-mapping.txt file, which is optional.

    This configuration file is used to inject META names from the crawled pages into the Magnify Search category tree. Most documents (HTML, PDF, Word, Excel, and so on) contain META information. The Crawler extracts that information and saves it to the index files so that the document can be displayed under a designated category tree during search. All comment lines begin with a pound sign (#) character.

    Syntax:

    Meta tag name->Category tree display name

    Consider a scenario where a web page that was crawled contained the following META tags:

    <meta name="description" content="Free Web tutorials">
    <meta name="keywords" content="HTML,CSS,XML,JavaScript">
    <meta name="author" content="Hege Refsnes">
    <meta charset="UTF-8">
    

    Suppose you want to include only the "author" information in the category tree when the search result is displayed, and you want it to be displayed as "Author", with the first letter in uppercase. In this case, you would add the following entry to the meta-tag-mapping.txt file:

    author->Author
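
    As a rough sketch of what this mapping does, the following Python snippet parses entries in the syntax above and applies them to the META values from the sample page. The function names and the dictionary representation of the page are hypothetical, and the snippet assumes that META names are matched exactly as written; the Crawler's actual matching rules may differ.

```python
def parse_meta_mapping(text):
    """Read 'meta tag name->display name' lines into a dictionary."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        name, sep, display = line.partition("->")
        if sep:  # ignore lines without the '->' separator
            mapping[name.strip()] = display.strip()
    return mapping


def categories_for(page_meta, mapping):
    """Keep only the mapped META names, renamed to their display names."""
    return {mapping[name]: value
            for name, value in page_meta.items()
            if name in mapping}


mapping = parse_meta_mapping("# show only the author\nauthor->Author\n")

# META values from the sample page above.
page_meta = {
    "description": "Free Web tutorials",
    "keywords": "HTML,CSS,XML,JavaScript",
    "author": "Hege Refsnes",
}

categories = categories_for(page_meta, mapping)
print(categories)
```

    Only the author value survives, under the display name Author; the description and keywords values are dropped because they have no entry in the mapping.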

    The following is an example of a configured meta-tag-mapping.txt file for the Information Builders Technical Support Center website.

  4. Configure index name customization by editing the index-name-mapping.txt file, which is optional.

    This configuration file is used to map web documents to predefined index names based on their URL patterns. All comment lines begin with a pound sign (#) character.

    Syntax:

    regex pattern->index name

    For example, if you want to save all web pages that begin with http://www.abc.com/news/ to an index folder named abc_news, then you can add the following entry in this file:

    ^http://www.abc.com/news/->abc_news

    If no pattern matches, then the default index name is the domain name, with each dot (.) character replaced by an underscore (_). For more information on the syntax used for Java Regular Expressions, see http://docs.oracle.com/javase/tutorial/essential/regex/.

    Warning: If there are any duplicates or overlaps in the URL pattern, then the first matched rule is used. Therefore, the order of the entries in this configuration file is important.

    For example:

    ^http://www.abc.com/news/->abc_news
    ^http://www.abc.com/news/worldnews/->abc_worldnews

    The page with the URL http://www.abc.com/news/worldnews/wn1.html will be indexed to an index folder named abc_news. If you want this page to go to an index folder named abc_worldnews, then you must reverse the order of the above two entries.
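
    The first-match rule and the default name can be sketched in Python as follows. This is an illustration under stated assumptions rather than the Crawler's own code: the function names are hypothetical, Python's re module stands in for Java Regular Expressions, a pattern is assumed to match anywhere in the URL, and the "domain name" is assumed to be the full host name of the URL.

```python
import re
from urllib.parse import urlparse


def parse_index_mapping(text):
    """Read 'regex pattern->index name' lines into an ordered list of rules."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        pattern, sep, index_name = line.rpartition("->")
        if sep:
            rules.append((re.compile(pattern), index_name.strip()))
    return rules


def index_name_for(url, rules):
    """Return the index name of the first matching rule, else the default."""
    for pattern, index_name in rules:
        if pattern.search(url):
            return index_name
    # Assumed default: host name with dots replaced by underscores.
    host = urlparse(url).hostname or ""
    return host.replace(".", "_")


rules = parse_index_mapping(
    "^http://www.abc.com/news/->abc_news\n"
    "^http://www.abc.com/news/worldnews/->abc_worldnews\n"
)

url = "http://www.abc.com/news/worldnews/wn1.html"
print(index_name_for(url, rules))                  # rules in file order
print(index_name_for(url, list(reversed(rules))))  # more specific rule first
```

    With the rules in file order, the worldnews page falls under abc_news because the broader pattern is tried first. Reversing the list sends it to abc_worldnews, which mirrors the ordering advice in the warning above.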

    The following is an example of a configured index-name-mapping.txt file for the Information Builders Technical Support Center website.

  5. Configure category injection by editing the regex-url-to-category-mapping.txt file, which is optional.

    This configuration file is used to inject category names into crawled web documents whose URLs match a specified pattern.

    Syntax:

    regex pattern->category name->category value

    For example, suppose the URL of a web page matches the pattern .*/f/23445/, and you want this page to be displayed under a category called Forum with the value WebFOCUS/FOCUS Forum on Focal Point. To accomplish this, you must add the following entry in the configuration file:

    .*/f/23445->Forum->WebFOCUS/FOCUS Forum on Focal Point

    For more information on the syntax used for Java Regular Expressions, see http://docs.oracle.com/javase/tutorial/essential/regex/.
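
    A minimal Python sketch of how such entries could be read and applied is shown below. Several details are assumptions that this section does not confirm: that comment lines start with a pound sign as in the other files, that a pattern may match anywhere in the URL, and that every matching entry contributes a category. The function names and the example.com URLs are hypothetical, and Python's re module stands in for Java Regular Expressions.

```python
import re


def parse_category_rules(text):
    """Read 'regex pattern->category name->category value' lines."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # assumed comment syntax, as in the other files
        # Split on the first two separators only, so the value is kept whole.
        parts = line.split("->", 2)
        if len(parts) == 3:
            pattern, name, value = (part.strip() for part in parts)
            rules.append((re.compile(pattern), name, value))
    return rules


def categories_for_url(url, rules):
    """Collect (category name, category value) for every matching rule."""
    return [(name, value)
            for pattern, name, value in rules
            if pattern.search(url)]


rules = parse_category_rules(
    ".*/f/23445->Forum->WebFOCUS/FOCUS Forum on Focal Point"
)

# Hypothetical forum URL containing /f/23445.
print(categories_for_url("http://forums.example.com/eve/f/23445", rules))
```

    Because the line is split on only the first two separators, a category value that contains slashes or spaces, such as the one in the example entry, is preserved intact.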

    The following is an example of a configured regex-url-to-category-mapping.txt file for the Information Builders Focal Point User Forum website.
