FAST Enterprise Search Platform Product Overview Guide

FAST Enterprise Search Platform version: 5.2, Product Overview Guide. Document Number: ESP1000, Document Revision: A, April 3, 2008



Copyright

Copyright © 1997-2008 by Fast Search & Transfer ASA (“FAST”). Some portions may be copyrighted by FAST’s licensors. All rights reserved. The documentation is protected by the copyright laws of Norway, the United States, and other countries and international treaties. No copyright notices may be removed from the documentation. No part of this document may be reproduced, modified, copied, stored in a retrieval system, or transmitted in any form or any means, electronic or mechanical, including photocopying and recording, for any purpose other than the purchaser’s use, without the written permission of FAST. Information in this documentation is subject to change without notice. The software described in this document is furnished under a license agreement and may be used only in accordance with the terms of the agreement.

Trademarks

FAST ESP, the FAST logos, FAST Personal Search, FAST mSearch, FAST InStream, FAST AdVisor, FAST Marketrac, FAST ProPublish, FAST Sentimeter, FAST Scope Search, FAST Live Analytics, FAST Contextual Insight, FAST Dynamic Merchandising, FAST SDA, FAST MetaWeb, FAST InPerspective, GetSmart, NXT, LivePublish, Folio, FAST Unity, and other FAST product names contained herein are either registered trademarks or trademarks of Fast Search & Transfer ASA in Norway, the United States and/or other countries. All rights reserved. This documentation is published in the United States and/or other countries.

Sun, Sun Microsystems, the Sun Logo, all SPARC trademarks, Java, and Solaris are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries.

Netscape is a registered trademark of Netscape Communications Corporation in the United States and other countries.

Microsoft, Windows, Visual Basic, and Internet Explorer are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries.

Red Hat is a registered trademark of Red Hat, Inc.

UNIX is a registered trademark of The Open Group in the United States and other countries.

Linux is the registered trademark of Linus Torvalds in the U.S. and other countries.

AIX and IBM Classes for Unicode are registered trademarks or trademarks of International Business Machines Corporation in the United States, other countries, or both.

HP and the names of HP products referenced herein are either registered trademarks or service marks, or trademarks or service marks, of Hewlett-Packard Company in the United States and/or other countries.

Remedy is a registered trademark, and Magic is a trademark, of BMC Software, Inc. in the United States and/or other countries.

XML Parser is a trademark of The Apache Software Foundation.

All other company, product, and service names are the property of their respective holders and may be registered trademarks or trademarks in the United States and/or other countries.

Restricted Rights Legend

The documentation and accompanying software are provided to the U.S. government in a transaction subject to the Federal Acquisition Regulations with Restricted Rights. Use, duplication, or disclosure of the documentation and software by the government is subject to restrictions as set forth in FAR 52.227-19 Commercial Computer Software-Restricted Rights (June 1987).

Contact Us

Web Site

Please visit us at: http://www.fastsearch.com/

Contacting FAST

Fast Search & Transfer, Inc.
Cutler Lake Corporate Center
117 Kendrick Street, Suite 100
Needham, MA 02492 USA
Tel: +1 (781) 304-2400 (8:30am - 5:30pm EST)
Fax: +1 (781) 304-2410

Technical Support and Licensing Procedures

Technical support for customers with active FAST Maintenance and Support agreements, e-mail: [email protected]

For obtaining FAST licenses or software, contact your FAST Account Manager or e-mail: [email protected]

For evaluations, contact your FAST Sales Representative or FAST Sales Engineer.

Product Training

E-mail: [email protected]

Sales

E-mail: [email protected]

Contents

Preface
    Copyright
    Contact Us

Chapter 1: The FAST ESP Documentation Set
    Standard FAST ESP Product Documentation
        FAST ESP Product Overview
        FAST ESP Installation
        FAST ESP Configuration
        FAST ESP Operations
        FAST ESP Advanced Linguistics
        FAST ESP Troubleshooting Guide
        FAST Home
        FAST Search Business Center
        FAST ESP Deployment Planning
        FAST ESP Migration Guide
        FAST Search Front End (SFE) Users Guide
        File Traverser
        FAST Classifier
        FAST ESP Query Languages and Parameters
        FAST ESP WebAnalyzer
    Additional FAST ESP Components and Documentation
        FAST Enterprise Crawler
        FAST ESP Software Development Kit (SDK)

Chapter 2: FAST ESP at a glance
    Introduction to FAST ESP
        System Architecture
        Data Flow Overview
        Module Overview

Chapter 3: Basic Concepts
    Content and Flow
    Collections
    Search Profiles
    Document and Document Elements
    Index Schema, Index Profile
    Search Rows, Columns, and Clusters

Chapter 4: Integrating FAST ESP with your Content and Query Infrastructure
    Retrieving and Processing Content
    Integrating FAST ESP on the Content Side
        Using the Crawler
        Using the File Traverser
        Pushing Content to the Content API
        Using a FAST Content Connector
    Integrating FAST ESP on the Query Side
        Custom Integration
    Administration and Installation Integration

Chapter 5: Processing Documents
    Document Processing Overview
    Document Processing Engine, Stages, Pipelines
    Custom Document Processing
    Entity Extraction

Chapter 6: Making Documents Searchable
    Indexing Documents
        The FAST Search Engine
        Search Engine Clusters
        Search Columns and Search Rows
    Defining How Documents are Searchable
        Document Processing, Index Profile, and Search Engine Cluster
        Index Profile Structure
        Including Metadata
        Executing Search Queries and Returning Results
        Partial Document Updates
        Query Highlighting in Dynamic Teasers
        Query Highlighting in Source Documentation

Chapter 7: Concepts of Relevancy
    Components of Search Relevancy
    Contextual Insight
    Ranking Concept
        Freshness
        Authority
        Quality
        Proximity and Context
    Freshness Boosting
    Analyzing linked web pages using the WebAnalyzer
    Tools to Modify Rank for Individual Documents
        Search Business Center
        Boost Bulk Tool
    Boosting Mechanisms
    Relevancy Modifications Based on Business Rules
    Proximity Ranking and Matching
        Explicit Proximity
        Implicit Proximity
    Sorting Overview
        Full Text Sorting
        Multi-Level Sorting
        Sorting on Geographical Coordinates
        Field Collapsing
        Controlling Ranking and Sorting of Query Results
    Boundary Matching
    Duplicate Removal
        Dynamic (Result-Side) Duplicate Removal

Chapter 8: Processing Queries and Results
    Query and Results Server Overview
    Query Concepts
        Query and Result Server
        Query and Components
    Query Processing
        Query Modifications
        Query Resubmission
        FAST Query Language
    Result Processing
        Result Views
    The FAST Search Front End (SFE)

Chapter 9: Geo Search
    GEO Search Overview

Chapter 10: Scope Search and Dynamic XML Indexing
    Scope Search Overview
        Definition of a Scope
        Example of Using a Scope Search
        How Scope Search Works and Why It is Used
    Scope Search vs. Fielded Search
    Scope Search Concepts and Capabilities
        Scope Fields
        Scope Data Types
        Query Language in Scope Search
        Return Matching Scopes
        Scope Boosting
        Dynamic Document Summary (Teasers)
        Linguistics and Scope Search
        Partial Updates
    Dynamic XML Indexing

Chapter 11: Taxonomy and Navigation
    Taxonomy and Navigation Overview
    Navigators
        Field Navigators
        Deep and Shallow Navigators
        Contextual Navigators
        Field Navigators for Values in Scope Fields
    Taxonomy
        FAST Taxonomy Explorer
        FAST Classifier
    Unsupervised Clustering
        Creating Taxonomy on the Fly

Chapter 12: Advanced Linguistic Processing
    Linguistics Overview
    Linguistics and Relevancy
        Linguistics Concepts
    Dictionaries
    Automatic Language Detection
    Lemmatization
        What Lemmatization Means
        Advanced Phrase Recognition and Lemmatization
    Synonyms and Spell Variations
        Synonym Overview
        Dictionary Management
    Advanced Phrase Recognition
        Query Transformations
        Advanced Phrase Customization
        Advanced Phrase Recognition and Spell Checking
        Applying Advanced Phrase Recognition
    Spell Checking and Phrase Recognition Framework
        Phrase Recognition and Correction
        Spell Checking on Simple Terms
        Applying Spell Checking
        Required Dictionaries for Spell Checking
    Anti-Phrasing
        Required Dictionaries for Anti-Phrasing
        Supported Language for Anti-Phrasing
    Sub-String Search
        Sub-String Search Overview
        Application Scenarios
        Applying Sub-String Search
    Wildcard Search
    Special Characters and Accents

Chapter 13: Operation and System Administration
    Operation Overview
    ESP Administrator Interface
        Main Views
    FAST Home and Search Business Center
    Licensing
    Fault Tolerance
    Security

Chapter 14: Supported Document Formats
    Supported Formats Overview
    Supported Input File Formats
        Word Processing Formats
        Desktop Publishing Formats
        Database Formats
        Spreadsheet Formats
        Presentation Formats
        Graphics Formats
        Compressed Formats
        Email Formats
        Other Formats

Chapter 15: Glossary
    ESP Term Definitions

Chapter 1: The FAST ESP Documentation Set

This chapter lists the components of the FAST ESP documentation set and explains how and when to use them.

Topics:

• Standard FAST ESP Product Documentation

• Additional FAST ESP Components and Documentation

Standard FAST ESP Product Documentation

The FAST ESP documentation set consists of both reference and task-oriented documentation. It includes the following guides:

Note: The FAST ESP documentation set covers both standard and optional FAST ESP features. Optional features are enabled with individual license keys. If you have not purchased these optional features, they will not be enabled in your installation of FAST ESP.

FAST ESP Product Overview

The FAST ESP Product Overview Guide serves as an introduction to FAST ESP. It explains the basic concepts of FAST ESP, describes its features, and contains a glossary of terms for FAST documentation.

FAST ESP Installation

The FAST ESP Installation Guide describes the procedures needed to install FAST ESP.

FAST ESP Configuration

The FAST ESP Configuration Guide describes the basic procedures for creating a collection, configuring document processing, managing index profiles, configuring advanced linguistics, and other configuration information. In addition, it contains the DTDs used in FAST ESP.

FAST ESP Operations

The FAST ESP Operations Guide describes core operational procedures such as starting and stopping the system and back-up procedures. It is a task-oriented guide that answers the question How, as opposed to the Configuration Guide, which answers the question What. The Operations Guide is for skilled users and does not provide descriptions of concepts or features.

FAST ESP Advanced Linguistics

The Advanced Linguistics Guide provides information about advanced linguistic processing in ESP. It provides descriptions and configuration information for linguistics features, as well as the procedures required to perform linguistics customizations such as creating your own dictionaries, customizing existing dictionaries, or configuring advanced tokenization. Basic conceptual information about linguistic processing is described in the Product Overview Guide. The Advanced Linguistics Guide is for advanced users and system administrators.

FAST ESP Troubleshooting Guide

The FAST ESP Troubleshooting Guide is a task-oriented guide intended to help you out of a bind while working with ESP. It provides scenarios and possible solutions to those scenarios. It also lists log errors and log messages.

FAST Home

The FAST Home Guide describes the FAST Home graphical user interface. FAST Home is the Business Manager's personal portal to the FAST ESP installation. FAST Home is where you create and set up the initial search profiles, and where you manage the users and groups that should have access to work with the search profiles. FAST Home has links to other FAST applications, such as Search Business Center and the Administration GUI.

FAST Search Business Center

The Search Business Center Guide describes the Search Business Center application. Business Managers use Search Business Center to manage ranking, relevancy, synonyms, navigators and more. Search Business Center is where you tune and configure the search experience for the search profile before you publish it to your production environment. In Search Business Center you can monitor the end-users’ query behavior (query logs and reports). You can make changes to the search profile settings and test them out in the internal Preview before publishing the changes to the Published Search Front End.

FAST ESP Deployment Planning

The FAST ESP Deployment Planning Guide describes what to consider before installing the product. For example, it describes the concepts and overall principles for system dimensioning, fault-tolerance setups, and component optimization.

FAST ESP Migration Guide

The FAST ESP Migration Guide describes what you need to take into consideration when migrating FAST ESP 5.0 to FAST ESP 5.1. It provides migration scenarios, reference information about requirements, and procedures for successful migration. The Migration Guide provides information for the following types of users:

• Managers and supervisors, who need to understand what is involved in the migration, and
• System administrators and operators, who need detailed information on how to perform the migration.

FAST Search Front End (SFE) Users Guide

The Search Front End User’s Guide describes how to use the default Search Front End in FAST ESP. The guide describes how to search using the different search types (Simple, Contextual, Similarity and Fielded search), how to understand and navigate in your search results, and what you can do to improve relevancy and search effectiveness to get the right results. It also describes how you can customize the Search Front End using Search Business Center.

File Traverser

The File Traverser Guide describes the File Traverser and how to configure it.

FAST Classifier

The FAST Classifier Guide describes how to use the FAST Classifier. It provides an overview of what the FAST Classifier is and how to use it through either the Command Line or GUI.

FAST ESP Query Languages and Parameters

The Query Language and Parameters Guide describes the parameters available for controlling query submission, transformation and result gathering. Parameter interfaces are presented both for the API and for direct use in the FAST Query Language (FQL).

FAST ESP WebAnalyzerThe WebAnalyzer is a FAST ESP module that uses links between documents to improve search relevancy.The WebAnalyzer Guide describes the feature and provides installation, configuration, operations, andtroubleshooting information.

Additional FAST ESP Components and Documentation

FAST provides the following additional components with separate documentation sets.

FAST Enterprise Crawler

The FAST Enterprise Crawler Guide related to ESP describes the Enterprise Crawler for version 6.6, and includes migration, installation, configuration, and operational information. The guide also includes some deployment planning information that is specific to the Crawler.


FAST ESP Software Development Kit (SDK)

The SDK contains additional integration tools for query, content and document processing integration, and Search Front End development.

Content Integration

The FAST ESP Content Integration Guide describes the available programming interfaces along with the required steps needed to integrate FAST ESP with your content sources.

Query Integration

The FAST Query Integration Guide describes the available programming interfaces along with the required steps needed to set up a customized search interface towards your FAST ESP implementation.

Indexing Database Content and XML

The FAST ESP Indexing Database Content and XML Guide provides an overview of how to set up a FAST ESP installation for structured data.

Document Processor Integration

The FAST ESP Document Processor Integration Guide describes how to create your own document processors.

Document Hit Highlighting

The FAST ESP Document Hit Highlighting Guide describes how to configure and use the document hit highlighting feature in FAST ESP. The feature enables you to view the matching sections of the full document with the matching query terms highlighted.

Search Front End Developers Guide

The SFE Developer's Guide explains how the Search Front End (SFE) and Search Front End API (SFEAPI) interact, and how the SFE and SFEAPI can be customized using the Java Struts framework and Velocity templates. This guide is for people using the SFE and SFEAPI as the basis for building custom search front ends. It can also be used by people who need some information when building a search front end from scratch, as the SFE provides examples of how different parts of the search front end can be implemented and how the Search API can be used.

Content Connector Tool Kit (CCTK)

The FAST ESP Content Connector Toolkit Guide describes the Content Connector Toolkit (CCTK), a framework that makes it easier to develop connectors for ESP and InStream. This guide provides conceptual information about content integration and connector development, related guidelines, explanations of the content connector framework and architecture, and procedures for using the CCTK.

Application Integration

This guide gives a description of the ESP Administration integration architecture, and describes how to integrate applications with the ESP Administration and Application services. The services include component management, collection management, search profile management and query log handling.

Query Reporting Framework

The Query Reporting Framework Guide describes what the query reporting framework is and how to use it.


Chapter 2

FAST ESP at a glance

This chapter gives you an overview of the FAST ESP product, its system architecture, features, and modules.

Topics:

• Introduction to FAST ESP

Introduction to FAST ESP

FAST ESP is an integrated software application that provides a platform for searching and filtering services. It is a distributed system that enables information retrieval from any type of information. ESP combines real-time searching, advanced linguistics, and a variety of content access options into a modular, scalable product suite.

FAST ESP does the following:

1. Retrieves or accepts content from web sites, file servers, application-specific content systems, and direct import via API
2. Transforms all content into an internal document representation
3. Analyzes and processes these documents to allow for enhanced relevancy
4. Indexes the documents and makes them searchable
5. Processes search queries against these documents
6. Applies algorithms or business rule-based ranking to the results
7. Presents the results along with the navigation options

System Architecture

Figure 1: FAST ESP System Architecture

Data Flow Overview

The data flow through the FAST ESP system consists of the following basic steps:

1. Submitting Content
2. Analyzing and Processing Documents
3. Matching Documents and Search Queries
4. Matching Documents and Triggers
5. Managing and Tuning

Submitting Content

Content is submitted using either:

• one of the Content Connectors that are included with FAST ESP
• one of the FAST Content Connectors, available as separate software packages
• the FAST Content API to push content directly to FAST ESP.

For detailed concepts about submitting content, refer to Integrating FAST ESP with your Content and Query Infrastructure.

Analyzing and Processing Documents

Once a content entity has been submitted to the FAST ESP system, it is converted to a document that complies with the FAST ESP internal document format. Each document goes through a set of document processing steps performed by the FAST Document Processing Engine.

The purpose of document processing is to extract additional information, such as the language of the content, and to add additional information to the document to improve the search relevancy.

For detailed concepts about document processing, refer to Processing Documents.

Processed documents are passed on to the FAST Search Engine.

Matching Documents and Search Queries

As new documents arrive at the FAST Search Engine, the Search Engine generates search indices from them.

1. The end-user or external query application submits search queries through a search front end or directly to the FAST Query API.

2. The Query API in turn sends the query to the FAST Query & Result Server, which pre-processes the search queries to improve the relevancy of the results returned. Examples of such pre-processing are spell-checking or proper name recognition (see Advanced Linguistic Processing).

3. After having been pre-processed, the search queries are sent to the FAST Search Engine. The engine matches them against its indices and returns a list of resulting documents along with result set navigation options which let the user further refine the search. The FAST Query & Result Server can then perform post-processing on the result list, such as category result grouping, sorting, or adding navigators for dynamic drill-down.

4. Finally, the result list is returned through the FAST Query API back to the Search Front End (SFE) or the external query application.
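The query path described in the steps above can be sketched as a chain of three functions. This is a toy model and not the FAST Query API: the function names (`preprocess`, `match`, `postprocess`, `run_query`), the spelling table, and the in-memory index are all assumptions made for illustration.

```python
# Toy model of the query path: pre-processing, matching, post-processing.
# All names and data here are hypothetical and only mirror the steps above.

SPELLING = {"serch": "search", "enterprize": "enterprise"}  # assumed corrections

INDEX = {  # term -> set of document ids (stands in for the search index)
    "search": {"doc1", "doc2"},
    "enterprise": {"doc2", "doc3"},
}

CATEGORY = {"doc1": "web", "doc2": "intranet", "doc3": "intranet"}


def preprocess(query: str) -> list[str]:
    """Step 2: lowercase, split, and spell-correct the query terms."""
    return [SPELLING.get(t, t) for t in query.lower().split()]


def match(terms: list[str]) -> list[str]:
    """Step 3: return ids of documents containing every term, sorted."""
    if not terms:
        return []
    hits = set.intersection(*(INDEX.get(t, set()) for t in terms))
    return sorted(hits)


def postprocess(doc_ids: list[str]) -> dict:
    """Step 3 (post-processing): add a simple category navigator."""
    navigator: dict[str, int] = {}
    for doc_id in doc_ids:
        cat = CATEGORY[doc_id]
        navigator[cat] = navigator.get(cat, 0) + 1
    return {"hits": doc_ids, "navigator": navigator}


def run_query(query: str) -> dict:
    """Steps 1-4: what a front end would receive back."""
    return postprocess(match(preprocess(query)))
```

Calling `run_query("Enterprize serch")` corrects both terms before matching, so the misspelled query still reaches the documents that contain "enterprise" and "search".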

For detailed concepts about matching documents and search queries, refer to Making Documents Searchable, section Defining how Documents are Searchable and Processing Queries and Results.

Managing and Tuning

1. The FAST Administrator Interface (also referred to as the Admin GUI) allows you to easily manage and monitor your FAST ESP implementation. It displays status messages from a range of administrative modules such as the FAST Log Server, or the FAST Configuration Server. Monitoring via SNMP enables ESP to be monitored from other systems such as IBM Tivoli and HP OpenView, and supports ESP status reads (component, indexing, document processing, and query statuses).

2. The Taxonomy Explorer allows you to manage categories and categorization rules for grouping the search results. For details about taxonomy management, refer to Taxonomy and Navigation, and to the Taxonomy Explorer Guide.

3. FAST Home and Search Business Center applications are used for setting up and tuning search sites, creating users and user access, and accessing other FAST interfaces. Refer to the Fast Home and Search Business Center Guides.

Module Overview

The FAST ESP system consists of different types of modules that can be categorized according to the purposes of the modules: in other words, what the module does in the system, such as matching and query/result processing.


Category | Module | Description
Data Sources | FAST Crawler | Locates and retrieves files on Web servers.
Data Sources | File Traverser | Traverses and retrieves files from directories on file servers.
Document Processing | FAST Document Processing Engine | Performs document processing tasks for format conversion and document relevancy such as language detection, Asian language tokenization, and lemmatization.
Matching and Query/Result Processing | FAST Search Engine | Performs the indexing and searching tasks within FAST ESP. It indexes new documents coming from the FAST Document Processing Engine, matches them against search queries submitted by the Query and Result Server, and returns a list of resulting documents and result set navigation options to the Query and Result Server.
Matching and Query/Result Processing | FAST Query & Result Server | Processes search queries and search results to enable relevancy-focused searching and result presentation. It provides linguistic query processing features like spell checking, and results processing features like result clustering.
APIs | FAST Content API | Allows the standard data source modules of FAST ESP, as well as custom applications, to push content to the FAST Content Distributor. The API is available in Java, C++, and .NET.
APIs | FAST Search API | Allows external search front end systems to submit their queries and receive result sets in return. The API is available in Java, C++, and .NET.
Administration | FAST ESP Administrator Interface (Admin GUI) | Provides a browser-based graphical user interface that allows the system administrator to monitor and configure FAST ESP.
Administration | License Manager | License server for all components controlled by the licensing scheme.
Administration | AdminServer | Allows system administrators and business users to monitor the end-users’ query behavior and to fine-tune the ranking of individual documents based on the monitoring results.
Relevancy Tuning | Boost Bulk Tool | Can be used to import rank boosting specifications for individual documents into an existing search index. It reads boost records from an XML file and applies the boosts to the specified documents.
Additional Modules (distributed separately) | FAST Content Connectors | Various FAST Content Connectors allow you to submit content from databases such as DB2, Oracle, or SQL Server, and other specific applications.
Additional Modules (distributed separately) | FAST SDK | Provides additional integration tools (in addition to the standard APIs) for query, content, and document processing integration and search front end development.
Additional Modules (distributed separately) | FAST Taxonomy Explorer | Used to create taxonomies for document organization and/or use concept extraction.
Additional Modules (distributed separately) | FAST Security Access Module | Provides application level security when integrating FAST ESP with security environments such as Active Directory.


Chapter 3

Basic Concepts

This chapter introduces you to the basic concepts of FAST ESP.

Topics:

• Content and Flow
• Collections
• Search Profiles
• Document and Document Elements
• Index Schema, Index Profile
• Search Rows, Columns, and Clusters

Content and Flow

Data that has not yet been submitted to the FAST ESP system is called content. Searchable content entities are called documents.

Examples of content are MS Word files, HTML-pages, or database entries.

Content that is submitted to and flows through FAST ESP undergoes different steps of normalization, document processing, and indexing before it is available for searching.

Collections

Content is retrieved, processed, made searchable, and then grouped into collections in ESP. Collections allow you to treat different groups of content differently, specifying for each collection the way in which its documents are to be processed and indexed.

Grouping of content into collections is typically based on criteria like:

• Different views of the content seen from the end-user application, such as product data, Web site pages and news. (See Search Profiles.)
• Content ownership, such as intranet versus extranet content
• Special processing rules, such as metadata handling

Grouping content enables end-users or external query applications to narrow down the scope of a search to specific types of documents.

In addition, the collection concept allows you to specify the order in which different types of content are to be processed during document processing by prioritizing individual collections.


Figure 2: Content Refinement Showing Different Collections

The collection concept does not imply any physical partitioning of the index. FAST ESP can effectively support very large numbers of collections with minor performance impacts. You can also partition based on collection (collection-based routing).

A collection is set up by defining the content source, for example a set of Web domains, and the document processing rules (see Processing Documents) to be applied. For procedural details on how to do this, see Basic Setup, section Creating a Basic Web Collection, in the Configuration Guide.
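As a rough illustration of the collection concept, the sketch below models collections as a label on each document, a priority per collection, and an optional collection filter on a query. The `Document` class, the `PRIORITY` table, and both functions are hypothetical and do not reflect how ESP stores or configures collections.

```python
# Toy model of collections: each document belongs to one collection,
# collections carry a processing priority, and a query can be limited
# to a subset of collections. Names and data are hypothetical.
from dataclasses import dataclass


@dataclass
class Document:
    doc_id: str
    collection: str
    text: str


PRIORITY = {"news": 0, "products": 1, "intranet": 2}  # lower is processed first

DOCS = [
    Document("d1", "intranet", "holiday schedule"),
    Document("d2", "news", "quarterly results schedule"),
    Document("d3", "products", "search appliance"),
]


def processing_order(docs):
    """Order documents so higher-priority collections are processed first."""
    return sorted(docs, key=lambda d: PRIORITY[d.collection])


def search(docs, term, collections=None):
    """Return ids of matching documents, optionally limited to collections."""
    return [
        d.doc_id
        for d in docs
        if term in d.text.split()
        and (collections is None or d.collection in collections)
    ]
```

With this data, `search(DOCS, "schedule")` matches documents in two collections, while `search(DOCS, "schedule", {"news"})` narrows the scope to the news collection only.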

Search Profiles

While collections group the documents and/or other indexed content, search profiles define what to search and how your queries and results should be processed and displayed.

Search profiles are created through Fast Home, and are monitored and tuned through the Search Business Center. Refer to the Fast Home and Search Business Center Guides for more information.


Document and Document Elements

Content is submitted to the FAST ESP system and converted into documents. A document represents the content entity as a set of data elements. These elements contain information extracted from the original content entity, such as the information contained in the title or body section of an HTML-page.

A document represents a searchable entity within the FAST ESP Index. Generally, there is a one-to-one relation between a content entity and a document. The definition of a content entity depends on the way your content is structured. This document representation is used for the processing performed prior to indexing. Refer to the chapter Processing Documents for more information.

In addition to the information included in the original document, information improving search relevancy is added to the document.

Examples of elements are:

• title text
• author
• body text
• an ID that uniquely identifies the document
• the language the document is written in

The conversion preserves the structure of the documents, as well as meta data if embedded in the documents.

By default, text elements are assumed to be encoded according to UTF-8 (Unicode).

The document concept is independent of the type of data being added to the system. For example, if the content source is a database table, each row of information from a table or view may become a document. For both search and filtering, each document is treated as one searchable item and is listed as such in the result list.

Each document has a document identifier that is unique across the entire set of documents handled by the FAST ESP system.

Note that the document identifier is not necessarily a URL. It may be a constructed URI representing, for example, the exact location of a record in a database. There are no restrictions to the format of the document identifier. However, for crawled content, it makes sense to use the URL of the crawled document. For content pushed to the system from a custom application using the Content API, the client pushing the document into FAST ESP needs to supply the URI. In this case, the document identifier may, for example, be the key for storing and loading documents in external storage.
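A constructed identifier for a database record might look like the following sketch. The `db://` scheme, the path layout, and the `record_uri` helper are invented for this example; since there are no restrictions on the identifier format, any scheme that is unique across all documents would serve.

```python
# Building a document identifier for a database row. The "db://" scheme
# and the path layout are invented for this example; ESP places no
# restriction on the identifier format.
from urllib.parse import quote


def record_uri(database: str, table: str, primary_key: object) -> str:
    """Return a URI that pins down one row: db://<database>/<table>/<key>."""
    parts = (database, table, str(primary_key))
    # Percent-encode every part so that "/" or spaces inside a name
    # cannot be confused with the path separators.
    return "db://" + "/".join(quote(p, safe="") for p in parts)
```

Encoding each part separately keeps the identifier unambiguous even when a key itself contains a slash, which matters if the same identifier is later used as a storage key.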

Index Schema, Index Profile

This topic explains the relationship between fields and the index profile.

Prior to indexing a document, the FAST Search Engine maps the document's elements to fields. Fields are defined document elements that are to be searchable.

Defining fields allows the end-user or external query application to specify searches that cover only individual parts of a document such as the title or body part.

You define fields by creating and specifying an Index Profile. FAST ESP supports text, signed and unsigned integer, float, double, and datetime fields. Text fields may contain words or numbers, and queries can be specified for single words, phrases, or a combination of these. Integer, float, and double fields contain numerical values that can be matched against a query by using numerical comparisons such as less than, greater than, and equal to.


Multiple fields may be grouped into composite fields, allowing a query to be executed on several fields at the same time.
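The difference between text fields, numeric fields, and composite fields can be illustrated with a small in-memory model. The field names, the composite field `content`, and the helper functions are assumptions for this sketch; they are not index profile syntax and not the ESP query language.

```python
# Toy field-scoped matching. Field names, the composite field "content",
# and the sample documents are assumptions, not an ESP index profile.
import operator

COMPOSITE = {"content": ("title", "body")}  # composite field -> member fields

DOCS = [
    {"id": "d1", "title": "annual report", "body": "revenue grew", "pages": 40},
    {"id": "d2", "title": "memo", "body": "annual party", "pages": 2},
]

COMPARE = {"<": operator.lt, ">": operator.gt, "=": operator.eq}


def match_text(docs, field, word):
    """Match a word in a text field, or in any member of a composite field."""
    fields = COMPOSITE.get(field, (field,))
    return [d["id"] for d in docs if any(word in d[f].split() for f in fields)]


def match_number(docs, field, op, value):
    """Match a numeric field using <, >, or =."""
    return [d["id"] for d in docs if COMPARE[op](d[field], value)]
```

Searching the `title` field for "annual" finds only the report, whereas searching the composite `content` field finds both documents because the word also appears in the body of the memo.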

Scope Fields are special field types that support dynamic indexing and searching in hierarchical content, such as XML. Refer to Scope Search and Dynamic XML Indexing for details.

For details on the Index Profile structure, refer to the Configuration Guide.

Search Rows, Columns, and Clusters

FAST ESP uses the concepts of clusters, rows, and columns, and the relationships between them, to organize Search Engine nodes.

Search Engine instances are grouped into Search Clusters. Each Search Cluster shares a common Index Profile; that is, it must be possible to represent the content within a common index schema. Within each Search Cluster, any number of Search Engine Nodes may exist, organized in rows and columns. Rows are used for query scaling and columns are used for document volume scaling.
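One way to picture the row and column layout is the sketch below, in which columns partition the document set and rows replicate it. The `SearchCluster` class, the CRC-based column assignment, and the round-robin row selection are assumptions chosen for illustration; the guide does not describe how ESP distributes documents or queries internally.

```python
# Toy layout of a search cluster: columns partition the documents,
# rows replicate every column to share the query load. The hashing and
# round-robin policies are assumptions for illustration only.
import itertools
import zlib


class SearchCluster:
    def __init__(self, rows: int, columns: int):
        self.rows, self.columns = rows, columns
        # One document set per column; every row holds a copy of each column.
        self.partitions = [set() for _ in range(columns)]
        self._next_row = itertools.cycle(range(rows))

    def column_for(self, doc_id: str) -> int:
        """Pick a column with a stable hash of the document id."""
        return zlib.crc32(doc_id.encode("utf-8")) % self.columns

    def add(self, doc_id: str) -> None:
        self.partitions[self.column_for(doc_id)].add(doc_id)

    def route_query(self) -> list[tuple[int, int]]:
        """A query goes to one row and fans out to all columns in that row."""
        row = next(self._next_row)
        return [(row, col) for col in range(self.columns)]
```

In this model, adding a column reduces the number of documents each node holds, while adding a row gives successive queries another full copy of the index to run against.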

Figure 3: Multi Node Search Installation Showing Columns and Rows


Chapter 4

Integrating FAST ESP with your Content and Query Infrastructure

This chapter introduces you to the basics of integrating FAST ESP into your content and query infrastructure.

Topics:

• Retrieving and Processing Content
• Integrating FAST ESP on the Content Side
• Integrating FAST ESP on the Query Side
• Administration and Installation Integration

Retrieving and Processing Content

Content can be retrieved from data sources in two ways: content pull and content push.

The content pull approach leverages content connectors to retrieve the information via standard APIs or interfaces provided by the source content repositories. This is the core technology of most search solutions, and includes retrieval of Internet-based information (Enterprise Crawler), databases and other enterprise applications (FAST Smart Connectors) or file server-based documents (File Traverser). The content connectors do not require integration programming towards the target data repositories.

The content push approach requires that the data repositories, applications or messaging middleware send the data directly to FAST ESP via the ESP Content API. This avoids the latency of crawling but it requires a closer relationship between the content application and the search engine.

Integrating FAST ESP on the Content Side

The FAST ESP system accepts content submitted using one of its standard data source modules or pushed through the Content API.

Table 1: Content Access Options

Type of Content to be Submitted | Data Source Module to Use | Type of Module
Content stored on Internet, Intranet or Extranet Web servers | FAST Enterprise Crawler | Standard FAST ESP data source
Content stored on file servers, including XML data exported from databases | File Traverser | Standard FAST ESP data source
Other content | Pushing content through the FAST Content API | Customer application using the FAST Content API
Content stored in databases, or specific applications | FAST Content Connectors. Content Connectors may also be created using the Content Connector Toolkit available in the FAST ESP SDK. | Optional data source module

The content push approach implies that a custom application or third-party messaging middleware sends data directly to FAST ESP through the Content API.

Using the Crawler

You can access content on Web sites using the FAST Enterprise Crawler.

The crawler scans specified web sites and follows hyperlinks, extracts the desired information and detects duplicates. The document processing converts the HTML into structured data as defined by the Web representation. This means separating heading and body, as well as extracting relevant meta-information from HTML pages.

The Enterprise Crawler usually begins from a single URL or list of URLs and follows every link from this set according to the configuration of the collection. FAST ESP enables specific parameters to be set on the crawler such as: crawling frequency, excluded documents, paths, and domains. Intelligent loop detection keeps the crawler from repeatedly traversing the same page. Loop detection is insensitive to minor changes in URLs and time. During the crawling process, duplicate files are identified and excluded from the index.
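The behavior described above (start URLs, link following, loop avoidance, and duplicate exclusion) can be approximated with a breadth-first traversal over an in-memory link graph. This is a minimal sketch, not the Enterprise Crawler's algorithm: the `WEB` data, the `crawl` function, and the use of a SHA-256 content hash for duplicate detection are assumptions, and the sketch only skips URLs it has already visited rather than tolerating minor URL changes.

```python
# Toy crawler over an in-memory "web". It follows links from the start
# URLs, skips pages it has already visited (loop avoidance), and drops
# pages whose content duplicates an earlier page. The site data is
# invented for this sketch.
import hashlib
from collections import deque

WEB = {  # url -> (content, outgoing links)
    "http://a/": ("home", ["http://a/x", "http://a/y"]),
    "http://a/x": ("page x", ["http://a/", "http://a/x?session=1"]),
    "http://a/x?session=1": ("page x", []),
    "http://a/y": ("page y", ["http://b/"]),
    "http://b/": ("other site", []),
}


def crawl(start_urls, excluded_domains=()):
    """Return the URLs that would be indexed, in crawl order."""
    seen_urls, seen_hashes, indexed = set(), set(), []
    queue = deque(start_urls)
    while queue:
        url = queue.popleft()
        domain = url.split("/")[2]  # host part of "scheme://host/path"
        if url in seen_urls or domain in excluded_domains or url not in WEB:
            continue
        seen_urls.add(url)
        content, links = WEB[url]
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest not in seen_hashes:  # keep only the first copy of a page
            seen_hashes.add(digest)
            indexed.append(url)
        queue.extend(links)
    return indexed
```

In the sample data, the page reached through `?session=1` has the same content as `http://a/x` and is therefore left out of the result, and passing `excluded_domains={"b"}` keeps the crawl on the first site.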

Intranet, Extranet and Internet content from Web servers can easily be submitted using the FAST Crawler. It scans specified web sites by following links for appropriate content and extracting the relevant information.


The FAST Crawler:

• Allows crawling based on an unlimited number of start URLs.
• Scales in a cost-efficient manner with total content size, number of documents, and number of different sites being crawled.
• Allows you to specify sub collections within collections with separate request rates and refreshes. This enables you to crawl individual subdomains of sites differently.
• Enables incremental crawling. The FAST Crawler can be configured to focus on retrieving new content only, or detecting modified or deleted items in previously retrieved content.
• Allows you to specify the types of files to be crawled by adding the MIME type through the FAST ESP Administrator Interface, telling the FAST Crawler to recognize and bring back the desired file types only.
• Detects whether content on a Web server has been deleted. When a document once detected has not been seen for a given period of time, the FAST Crawler regards it as deleted. This document is deleted from the collection(s) it belongs to.
• Enables specific crawling parameters per collection such as crawling frequency, excluded documents, paths, and domains.
• Retrieves both static and dynamically generated web content.
• Allows you to manually activate crawling of specific URIs, sites, or collections.

The crawling process consists of two steps – content retrieval and post processing.

During content retrieval, Web content is retrieved. During post processing, the retrieved content is analyzed to determine new or modified content and the parts of the content on the crawled Web server that have been deleted. In addition, during this step, the FAST Crawler detects duplicates within a collection.
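The deletion rule mentioned earlier (a document not seen for a given period is regarded as deleted) could be expressed as follows. This is a sketch under stated assumptions: the `find_deleted` function, the `last_seen` bookkeeping, and the grace period value are hypothetical and say nothing about how the FAST Crawler stores its state.

```python
# Sketch of the "not seen for a given period" rule. The grace period and
# the timestamps are made-up values; a real crawler keeps this state in
# its own store.
from datetime import datetime, timedelta


def find_deleted(last_seen: dict[str, datetime], now: datetime,
                 grace: timedelta) -> list[str]:
    """Return URLs whose last successful fetch is older than the grace period."""
    return sorted(url for url, seen in last_seen.items() if now - seen > grace)
```

A grace period avoids removing documents from the index when a Web server is only temporarily unreachable.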

The FAST Crawler interfaces directly with the Content API to submit the content.

Note: To retrieve content from locations other than Web servers, you can use the File Traverser for regular file servers. Or you can purchase one of the FAST Connectors, allowing you to retrieve content from specific applications like Microsoft Exchange or Documentum. For purchase information, contact FAST Support.

For details on the features of the FAST Crawler and how it is configured, refer to the Crawler Guide.

Using the File Traverser

You can retrieve files from a file server using the File Traverser.

The File Traverser scans specified file directories on file servers, retrieves content of various formats, and submits it to a collection in your FAST ESP installation.

The File Traverser:

• Works on any reachable file server.
• Allows you to locate individual types of files by specifying individual file extensions like html, htm, pdf, and doc, for example.
• Sends the located files to the Content API in batches. The size of the batches is configurable by two parameters: total file size and number of files.
• Allows you to locate files incrementally by reporting only those files that have changed since the last run (mods_only mode). Typically, file servers contain a lot of static content: there are many documents that do not change frequently. If the File Traverser is run in mods_only mode, it will only submit content that has changed since the last run. This saves your FAST ESP installation from processing documents that it has processed before, and helps to increase system performance while ensuring index freshness.
• Allows you to determine the files that have been deleted between two runs of the File Traverser (dels_only mode) and to delete them from their collection(s).
• Can be run without actually performing any operations (report mode). This allows you to verify your File Traverser configuration.
• Traverses and submits any XML files, including FastXML.
• Can run independently from FAST ESP on a separate node.
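The incremental modes and the batching behavior in the list above can be sketched with a snapshot comparison and a simple batch generator. Everything here is illustrative: `snapshot`, `diff`, and `batches` are hypothetical helpers, and using the file modification time as the change signal is an assumption, since the guide does not say how the File Traverser detects changes.

```python
# Sketch of incremental traversal and batching. Comparing modification
# times against the previous run is an assumed change signal; the guide
# does not say how the File Traverser detects changes.
import os


def snapshot(top: str, extensions: tuple[str, ...]) -> dict[str, float]:
    """Map each matching file path under top to its modification time."""
    files = {}
    for dirpath, _dirs, names in os.walk(top):
        for name in names:
            if name.lower().endswith(extensions):
                path = os.path.join(dirpath, name)
                files[path] = os.path.getmtime(path)
    return files


def diff(previous: dict[str, float], current: dict[str, float]):
    """Return (modified_or_new, deleted) paths between two runs."""
    changed = sorted(p for p, m in current.items() if previous.get(p) != m)
    deleted = sorted(p for p in previous if p not in current)
    return changed, deleted


def batches(files, max_files, max_bytes):
    """Group (path, size) pairs so each batch respects both limits.

    A single file larger than max_bytes still gets a batch of its own.
    """
    batch, total = [], 0
    for path, size in files:
        if batch and (len(batch) >= max_files or total + size > max_bytes):
            yield batch
            batch, total = [], 0
        batch.append(path)
        total += size
    if batch:
        yield batch
```

The first element returned by `diff` corresponds to what a run in mods_only mode would submit, and the second to what a run in dels_only mode would remove.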

Retrieving Macromedia Flash Files

The Enterprise Crawler includes functionality to retrieve Flash files, and they are indexed as separate files within the searchable index.

FAST ESP includes the ability to follow hyperlinks and index textual content from Macromedia Flash files. The following document processing pipelines are used: Generic, SiteSearch and NewsSearch. For more information, refer to the Configuration Guide, and to the Enterprise Crawler Guide.

Internal Process and Data Flow

This topic explains how files are processed with the File Traverser.

The File Traverser is a command line tool. It works on any reachable file server, recursively locating any files associated with the top directory specified in the command line. It processes files that match some specified file extensions like .html, .htm, .pdf or .doc. Furthermore, you can configure the File Traverser to map file names to URLs based on a given URI prefix.
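Mapping a file name to a URL with a URI prefix amounts to replacing the top directory with the prefix. The sketch below shows the idea; the `to_url` helper, the example paths, and the host name are hypothetical, and the File Traverser's own option names are deliberately not shown.

```python
# Sketch of mapping a file path to a URL with a URI prefix. The prefix
# and directory names are examples; the File Traverser's own option names
# are not shown here.
from pathlib import PurePosixPath
from urllib.parse import quote


def to_url(path: str, top: str, uri_prefix: str) -> str:
    """Replace the top directory of path with uri_prefix."""
    relative = PurePosixPath(path).relative_to(PurePosixPath(top))
    # quote() keeps "/" by default and escapes spaces and similar characters.
    return uri_prefix.rstrip("/") + "/" + quote(relative.as_posix())
```

A mapping like this lets search results link to a Web address for a file even though the file was read from a directory on a file server.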

There is also a GUI-based configuration option in FAST ESP. You can configure the File Traverser via the Data Sources Admin GUI tab. This is to be activated through the Connector Controller. Refer to GUI-based operation via Connector Controller for more information.

Interfacing with File Traverser

The File Traverser interfaces directly with the Content API to submit content.

Monitoring and Logging with File Traverser

The File Traverser logs to the FAST Log Server. You can monitor its log messages in the FAST ESP Administrator Interface (also known as the Admin GUI).

In addition, the File Traverser logs output to the shell it is started in.

To retrieve content from locations other than file servers, you can use the FAST Crawler for Web servers, which is included in your FAST ESP distribution. Or you can use one of the FAST Connectors, allowing you to retrieve content from applications like Microsoft Exchange or Documentum (see section Using a FAST Content Connector). Contact your FAST Account Manager or FAST Technical Support for purchase information.

For details about the File Traverser features, refer to the File Traverser Guide.

Optional Data Source Modules

In addition to the FAST Crawler and the File Traverser, the optional FAST Connectors provide support for extracting and submitting content from databases such as DB2 and individual content management systems.

For purchase information, contact your FAST Account Manager or FAST Technical Support.

GUI-based Operation via Connector Controller

The Connector Controller acts as a proxy between the File Traverser and the Administration Interface (Admin GUI), and as a proxy between some connectors and the Admin GUI. It allows for configuration and operation of the File Traverser and connectors.

With the proper configuration, the selected connector or File Traverser appears in the Admin GUI as a Data Source which, if selected, will enable you to work with the File Traverser or connector settings through the user interface.

See the Configuration Guide, Integrating the File Traverser Connector Controller. The process is similar for connectors as for the File Traverser, but there are some variations. Refer to the Connectors Guide for the specific connector you are using for information on how to install and configure the connector controller.

Note: Not all connectors support the Connector Controller in FAST ESP.


Pushing Content to the Content API

If the content you want to submit is not retrievable from a Web server, a file server, or one of the applications covered by the optional FAST Connectors, you may use the Content API directly to push your content to FAST ESP.

The FAST Content API:

• Allows submission of content and attached meta data.
• Packages the raw data and submits it to the Document Processing Engine.
• Allows for passing the content entity as such or passing a URL pointing to the content.
• Allows the standard data sources and the custom application to add, remove, and update content within the FAST Search Engines.
• Is provided for Java, .NET and C++.

The FAST Content API allows the standard data sources of FAST ESP as well as custom applications to push content to the FAST Document Processing Engine. This implies improved freshness, as content may be submitted when published, and allows integration with applications not supported by the standard FAST ESP data source modules or one of the FAST Connectors.

You can use the Content API to submit all types of content formats compliant with FAST ESP.

When content is pushed to FAST ESP through the Content API, the structure of the retrieved content may be preserved and mapped to Document Elements. XML content that is already coded according to the FastXML structure is processed and mapped directly into the index. Other XML dialects are converted during document processing (see the chapter Processing Documents) using a built-in XML Mapper stage.

The Content API uses HTTP as the underlying transport mechanism between the API client and FAST ESP.

For details about the FAST Content API, refer to the Content Integration Guide.

Allowed Content Formats

FAST ESP allows you to submit content in the following formats:

• One of the multiple document formats that the FAST Document Processing Engine is able to handle. For details, refer to Appendix A Supported Document Formats.
• Directly from an application using the Content API. The API enables you to submit structured data that can be mapped to the FAST ESP Document Model.
• A format complying to the FastXML DTD.
• Any XML format. In this case the mapping from XML to the FAST ESP Document Model can be performed using scope search or an XPath-based conversion stage.
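To give a feel for what an XPath-based mapping from arbitrary XML to document elements involves, the sketch below uses Python's standard `xml.etree.ElementTree` module. It is not the ESP conversion stage: the sample XML, the `MAPPING` table, and the `map_record` function are invented, and ElementTree supports only a limited subset of XPath.

```python
# Sketch of mapping arbitrary XML to flat document elements with XPath
# expressions. The element names and the mapping table are invented; the
# ElementTree module supports only a limited XPath subset.
import xml.etree.ElementTree as ET

MAPPING = {  # document element -> XPath relative to the record root
    "title": "./head/title",
    "author": "./head/author",
    "body": "./content",
}

XML = """
<article id="a-17">
  <head><title>Release notes</title><author>J. Doe</author></head>
  <content>Sample body text.</content>
</article>
"""


def map_record(xml_text: str, mapping: dict[str, str]) -> dict[str, str]:
    """Return a flat dict of document elements extracted from one XML record."""
    root = ET.fromstring(xml_text)
    doc = {"id": root.get("id", "")}
    for element, xpath in mapping.items():
        node = root.find(xpath)
        if node is not None and node.text:
            doc[element] = node.text.strip()
    return doc
```

Keeping the mapping in a table separate from the code means a different XML dialect only needs a different set of XPath expressions.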

Using a FAST Content Connector

In addition to crawling internet sites, traversing file servers, and using the content API, FAST ESP allows you to submit content from other specific applications using the respective FAST Content Connectors.

A content connector is a program that extracts content from some source system, maps the content from the source document model to the document model of FAST ESP, and feeds the documents to FAST ESP for indexing.
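The extract, map, and feed responsibilities of a connector can be sketched with an in-memory SQLite table standing in for the source system. All names here are hypothetical: the `products` table, the `db://shop/...` identifiers, and the `feed` callback are assumptions, and a real connector would hand documents to the Content API rather than to a Python list.

```python
# Sketch of the extract / map / feed loop of a connector, using an
# in-memory SQLite table as the source system. The table, the element
# names, and the feed callback are hypothetical; a real connector would
# hand documents to the Content API instead.
import sqlite3


def extract(conn):
    """Read rows from the source system."""
    return conn.execute("SELECT id, name, description FROM products ORDER BY id")


def to_document(row):
    """Map a source row to a flat set of document elements."""
    product_id, name, description = row
    return {
        "id": f"db://shop/products/{product_id}",
        "title": name,
        "body": description,
    }


def run_connector(conn, feed):
    """Feed every mapped document and return how many were sent."""
    count = 0
    for row in extract(conn):
        feed(to_document(row))
        count += 1
    return count


def demo():
    """Build a small source table and collect the fed documents in a list."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE products (id INTEGER, name TEXT, description TEXT)")
    conn.executemany(
        "INSERT INTO products VALUES (?, ?, ?)",
        [(1, "Lamp", "Desk lamp"), (2, "Chair", "Office chair")],
    )
    fed = []
    sent = run_connector(conn, fed.append)
    conn.close()
    return sent, fed
```

Separating `extract`, `to_document`, and `feed` mirrors the three responsibilities named in the paragraph above, so each can be replaced independently for a different source system.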

There are Connectors for databases, content management systems, portal servers, and e-mail applications. FAST Content Connectors are optional modules. For purchase details, contact FAST Technical Support. Refer to the individual connector guides for information related to a particular connector.

The FAST ESP SDK also provides a Content Connector Toolkit which helps you create your own connector application. Refer to the Content Connector Toolkit Guide for information on how to use it.


Integrating FAST ESP on the Query Side

FAST ESP provides some application programming interfaces (APIs) for creating search interfaces and integrating FAST ESP on the query side.

• Search API, available in Java, C++, and .NET

• HTTP-based Query Interface

• FAST Web Service interface

The FAST Search API handles the search query and result traffic between the Search Front End and the FAST QR Server.

The Search API:

• takes search queries sent by the end-user and passes them to the QR Server.

• takes results coming from the QR Server and provides these as query result objects to the API application.

• provides abstraction layer interfaces for handling query result features such as Result Clustering and Dynamic Drill-down.

For detailed deployment information, refer to the Query Integration Guide and the Query Language and Query Parameters Guide.

The Search API uses HTTP as the underlying transport mechanism between the API client and FAST ESP.

For details about how to use the APIs, refer to the Query Integration Guide.

Custom Integration

FAST ESP technology uses a modular approach with well-defined APIs for customer integration. A variety of content types can be retrieved using APIs, specialized connectors, and other tools. Here are some examples.

Content Interface

• The Content API supports integration of applications via C++, Java, and .NET. A Java-based Content Connector Toolkit provides a set of integration tools that simplifies the development of connectors. Refer to the Content Connector Toolkit Guide for more information.

Search Interface

• FAST ESP is typically integrated into an existing Web site through the Query API. You may also use a SOAP/WSDL-based Web Services interface for query integration. Refer to the Query Integration Guide for more information.

Document Processing Interface

• FAST ESP provides an interface for inclusion of customer-defined document processors, e.g. for advanced text analysis.

Query/Result Processing Interface

• FAST ESP provides an interface for dynamic linking of custom query and result processors, for example for custom query analysis/re-write and result parsing. Refer to the Query Integration Guide for more information.

Administration Interface

• FAST ESP supports API integration for system administration and collection configuration. You can use either a Java-based API or a command-line tool. Refer to the ESP Application Integration Guide for more information.

Security Integration


• The Security Access Module provides document-level security capabilities for integration with your content and portal infrastructure. Refer to the separate Security Access Module (SAM) documentation for more information.

SDKs

The ESP Content SDK, Search SDK, and Application SDK provide various interfacing capabilities.

• ESP Content SDK provides integration capabilities for interfacing your content applications with FAST ESP.

• ESP Search SDK provides programmatic API and Web Services integration capabilities for your search application.

• ESP Application SDK provides a Java/Web-based interface to a set of core services in the FAST ESP platform. Examples are reporting and rank tuning.

Refer to the SDK documentation set for details on how these work.

Web Services Interface

Web services are a collection of standards and protocols that allow computers to communicate across the internet using XML and the ubiquitous HTTP protocol.

Web services interfaces are particularly popular because they eliminate typical barriers to technical integration – differences in, for example, hardware platform, operating system, and software language.

For more information on web services, refer to Using the FAST Web Services Query Interface in the Query Integration Guide.

Administration and Installation Integration

Administration and installation integration can be performed in ESP using the View Admin Tool, the FAST ESP Installer, and Application SDKs.

View Admin Tool

• The View Admin Tool can be used if the client administration system is not able to utilize the Java API.

• The tool can be used to perform FAST ESP administrative tasks, including collection management, from a UNIX or Windows command line.

• The tool can be executed on any FAST ESP node. For more information on the View Administration Tool, refer to the Operations Guide.

Installation Integration

• You can configure and invoke the FAST ESP Installer from another application (OEM installation).

Application SDK Integration

• With this SDK it is possible to interface to a set of core services in the FAST ESP platform, including reporting and rank tuning.


Chapter 5

Processing Documents

This chapter introduces you to the basic concepts of processing documents.

Topics:

• Document Processing Overview

• Document Processing Engine, Stages, Pipelines

• Custom Document Processing

• Entity Extraction

Document Processing Overview

After content has been retrieved or submitted via the FAST Content API and converted to documents, these documents are processed within the FAST Document Processing Engine for format conversion and relevancy enhancement.

As explained in Basic Concepts, Documents and Document Elements, a document consists of a set of named elements, which contain values such as text strings or integers. Within the Document Processing Engine, these element values are read, analyzed, and modified when required. New values can be added to empty elements.

How document processing is performed is defined per collection.

Document Processing Engine, Stages, Pipelines

The Document Processing Engine provides linguistic processing of documents through customizable document processing pipelines. These consist of multiple document processing stages.

The Document Processing Engine also:

• allows customers to modify document processing pipelines.

• allows customers to write specific document processors with a minimum of constraints and plug them into arbitrary points in any pipeline.

• provides support for entity extraction.

Document processing pipelines consist of multiple document processing stages. These document processing stages read element values of the document to be processed, compute analyses on them, and modify or add elements to the document.

The Document Processing Engine consists of multiple document processing pipelines. Any incoming document is sent through a specified document processing pipeline.

A document processing stage performs a particular document processing task and can modify, remove, or add elements to a document. It takes one or more document elements as input, and its output is new or modified elements that may be further processed.

With each document processing stage focusing on one particular area of document processing, document processing stages can be reused in a multitude of settings.

When you configure one of the data sources provided with FAST ESP, you specify the collection(s) to which the data source submits documents. Then you assign the collection to a unique document processing pipeline that defines how the collection's documents are processed prior to indexing.

Document processing pipelines are configurable through the FAST ESP Administrator Interface. For details about configuring document processing pipelines, refer to the Configuration Guide.

You can define new document processing pipelines from the interface, as well as specify the document processing stages to be involved and the sequence of execution within each pipeline.

A typical document processing pipeline for web-retrieved information consists of the following stages:

• format detection to detect the MIME type of the document and determine if a format conversion is required.

• format conversion to convert the document's format from one of a whole range of external formats to the internal FAST ESP document structure.

• HTML parsing to extract structure, such as title or body, from HTML documents.

• language and encoding detection to enable language-dependent processing and narrowing the scope of a search.

• unification of character encoding to UTF-8 Unicode representation.

• tokenization.

• special tokenization for Asian languages.

• extraction of document summary.

• lemmatization.
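
The idea of a pipeline as an ordered sequence of stages, each reading some document elements and adding or modifying others, can be sketched as follows. This is a toy Python model under stated assumptions: the stage names, the raw, title, body, tokens, and summary element names, and the regular-expression HTML handling are all invented for illustration. Real FAST ESP stages are configured through the Administrator Interface and are considerably more elaborate.

```python
import re


def parse_html(doc):
    """Extract a title element and a markup-free body element from raw HTML."""
    flags = re.IGNORECASE | re.DOTALL
    match = re.search(r"<title>(.*?)</title>", doc["raw"], flags)
    doc["title"] = match.group(1).strip() if match else ""
    without_title = re.sub(r"<title>.*?</title>", " ", doc["raw"], flags=flags)
    # Replace remaining tags with spaces, then collapse whitespace.
    doc["body"] = " ".join(re.sub(r"<[^>]+>", " ", without_title).split())
    return doc


def tokenize(doc):
    """Add a tokens element computed from the body element."""
    doc["tokens"] = re.findall(r"\w+", doc["body"].lower())
    return doc


def summarize(doc, max_tokens=5):
    """Add a summary element built from the first few tokens."""
    doc["summary"] = " ".join(doc["tokens"][:max_tokens])
    return doc


# The pipeline fixes both the set of stages and their order of execution.
PIPELINE = [parse_html, tokenize, summarize]


def run_pipeline(doc, stages=PIPELINE):
    for stage in stages:
        doc = stage(doc)
    return doc
```

Because each stage only depends on the elements it reads, the same stage function can be reused in several pipelines with different stage lists.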

The Document Processing Engine also includes a Content Distributor, which is responsible for dispatching incoming documents to the right document processing pipelines by controlling processor servers.

The Content Distributor sends the current document to the processor server along with a pipeline request, and the processor server executes the stages in the requested pipeline on the document.

The Document Processing Engine interfaces with data sources or the Content API for input and with the Search Engine for output.

The Document Processing Engine sends its log messages to the Log Server.

The Document Processing Engine can be monitored through the FAST ESP Administrator Interface (Admin GUI).

The FAST Document Processing Engine supports a large variety of document formats.

Custom Document Processing

If you want to apply custom document processing to a set of documents without using and customizing one of the document processors provided with FAST ESP, you can use the ExternalDataFilterTimeout document processor as an interface to which you can output documents and from which you can read them back. It is also possible to develop custom document processing stages using the FAST SDK. Refer to the Document Processor Integration Guide for details.

Entity Extraction

Document processing also includes entity extraction. Entity extraction means detecting, extracting, and normalizing entities, such as names of people or companies, from documents. This makes unstructured data more structured and makes navigation or relevancy enhancements possible on specific entities.

Both pre-defined and customized entities shipped with FAST ESP can be detected and extracted. Extraction of pre-defined entities is supported out-of-the-box for English, German, French, Spanish, Portuguese, Japanese, Italian, and Dutch.

Examples of pre-defined entities are:

• person

• company

• location

• job title

• newspaper

• university

• sentence

• date

• paragraph

• price

• measure

• upper

• acronym

• airline

• car

• e-mail

• file name

• ISBN

• phone

• zip code

• ticker

• time

• quotation

Entity extraction is, by default, part of the NewsSearch processing pipeline for extracting entities on document level and of the Semantic pipeline for extracting entities on scope level. Entity extraction can, however, be used in custom document processing pipelines as well.

Extraction of other entities is possible by:

• using the Admin GUI to specify additional extractors.

• using a regular expression document processor, which supports entity extraction based on regular expressions. The default configuration of this document processor supports extraction of e-mail addresses and US locations. Additional regular expressions can be defined, for example for extracting product names or customer-specific information.
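
To make the regular-expression approach concrete, here is a minimal Python sketch that detects e-mail addresses in text and normalizes them to lowercase. The pattern and the extract_emails helper are assumptions made for this example; the pattern is deliberately simplified and is not the expression shipped with FAST ESP.

```python
import re

# Simplified pattern for illustration; it does not cover every valid address.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text):
    """Return lowercased, de-duplicated addresses in order of first appearance."""
    seen = []
    for match in EMAIL_PATTERN.findall(text):
        address = match.lower()  # normalization step
        if address not in seen:
            seen.append(address)
    return seen
```

The same detect-then-normalize shape would apply to other pattern-based entities, such as product codes, with a different expression and normalization rule.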

Refer to Creating Entity Extractors in the Document Processor Integration Guide for more information.

For support on extending the entity extraction feature, contact FAST Technical Support.


Chapter 6

Making Documents Searchable

This chapter introduces you to the basic concepts of indexing documents to make them searchable.

Topics:

• Indexing Documents

• Defining How Documents are Searchable

Indexing Documents

This topic explains how the FAST Search Engine, Search Clusters, and Search Columns and Rows affect document indexing.

The FAST Search Engine

The FAST Search Engine receives processed documents from the Document Processing Engine and makes them available for searching.

The Search Engine consists of three sub-modules:

• the RTS Indexer: It indexes all documents arriving from the Document Processing Engine and stores the index.

• the RTS Searcher: It runs queries submitted by the end-user or external query application against the index stored by the RTS Indexer.

• the RTS Dispatcher: It distributes queries to different Search Columns, selects Search Rows based on load balancing, and merges search results from different Search Columns and Search Partitions within the Columns.

On the content side, the Search Engine interacts directly with the document processing pipelines. On the query side, the Search Engine interfaces with the Query & Result Server.

Both RTS Indexer and RTS Searcher may be made operative on one or more machines. They may be spread across columns and rows to balance load and network traffic. For details about how to arrange RTS Indexer and RTS Searcher instances, refer to the Deployment Planning Guide.

Search Engine Clusters

Search Engine instances are grouped into search engine clusters. A search engine cluster is a group of Search Engine instances that share the same index schema, which is provided by an index profile.

A search engine cluster has a number of collections (logical groups of content) assigned to it. One collection resides inside one search engine cluster, but may be spread across multiple search columns. Since all Search Engine instances in one cluster share the same index profile, all collections assigned to this cluster are indexed in the same way.

There is a one-to-one relationship between an index profile and a search engine cluster: each search cluster in your system needs one index profile. That means that if you want all content fed to your FAST ESP system to be handled according to one index profile, only one search cluster is required. If the content fed to your FAST ESP system consists of different types of content, where each content type requires a separate index profile, several search clusters are needed. As a rule of thumb, select a single cluster configuration whenever possible, especially if you want to be able to integrate results from Web and other sources for the same query within a common result list sorted by relevance.

Each cluster requires its own instance of the QR Server.

Defining multiple clusters normally requires some support from FAST Solution Services. Consult FAST Technical Support for details.

For details on how to deploy the FAST Search Engine, refer to the Deployment Planning Guide.

Search Columns and Search Rows

Within one search cluster, multiple Search Engine instances can be arranged in search columns and search rows to distribute query traffic and document load.

Sets of indexed documents are stored in all Search Engine instances within a search column to scale data volume. That means that member rows of a search column share the same set of indexed documents.


Queries are shared among all Search Engine instances within a search row to scale query rate. This means that when a query is sent to the search engine cluster, it is sent to all members of one search row (one node within each column) within this cluster to be matched against all sets of indexed documents.
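
The column and row arrangement can be pictured as a grid in which each column holds one partition of the documents and each row is a full replica of all columns. The Python toy model below illustrates this: a query picks one row, visits one node per column, and merges the hits. The SearchCluster class, the round-robin row selection, and the substring matching are assumptions made for this sketch; they do not describe how the RTS Dispatcher actually balances load or matches queries.

```python
import itertools


class SearchCluster:
    """Toy model: grid[row][column] holds the document set of one node."""

    def __init__(self, columns, rows):
        # Every row replicates all columns; each column is one data partition.
        self.grid = [list(columns) for _ in range(rows)]
        self._next_row = itertools.cycle(range(rows))

    def search(self, term):
        """Query one node per column within a single row and merge the hits."""
        row = next(self._next_row)  # assumed round-robin row selection
        hits = []
        for partition in self.grid[row]:
            hits.extend(doc for doc in partition if term in doc.lower())
        return row, sorted(hits)
```

Adding columns lets the grid hold more documents, while adding rows lets more queries run in parallel against the same data.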

Defining How Documents are Searchable

In the process of creating the search index, FAST ESP uses an index profile. An index profile is an XML-based configuration file. It is an index schema that defines the way documents are searchable.

It specifies search properties like:

• which document elements are to become searchable fields

• which document elements are to become fields that are returned as part of a result

• how to calculate values that are used for sorting and ranking

The purpose of an index profile is, to some extent, similar to the process of defining a database schema.

Each document arriving at the FAST Search Engine is parsed and indexed based on the document’s elements. These elements are mapped to the fields given in the index profile. Once the document resides in the index, you can search directly on these fields.
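
The schema-like role of the index profile can be illustrated with a small Python sketch that maps document elements to typed fields and separates searchable fields from fields returned in results. A real index profile is an XML file; the dictionary below, its field names, and the map_to_fields helper are simplified stand-ins invented for this example.

```python
# Hypothetical, simplified schema:
# field name -> (type, searchable, returned in results).
INDEX_PROFILE = {
    "title": (str, True, True),
    "body": (str, True, False),
    "price": (float, False, True),
}


def map_to_fields(elements, profile=INDEX_PROFILE):
    """Return (searchable fields, result fields) for one document."""
    indexed, result = {}, {}
    for name, (field_type, searchable, in_result) in profile.items():
        if name not in elements:
            continue  # the document does not provide this element
        value = field_type(elements[name])
        if searchable:
            indexed[name] = value
        if in_result:
            result[name] = value
    return indexed, result
```

Elements that the schema does not mention are ignored, which mirrors the way a database schema determines which columns exist.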

You can set up and use several index profiles to address different types of content, for example Web pages and product database entries. Setting up several index profiles is done by defining multiple search clusters.

When you install FAST ESP, you can choose between standard index profiles or load a custom index profile.

All default index profile files are located in $FASTSEARCH/index-profiles/ (UNIX) or %FASTSEARCH%\index-profiles\ (Windows), with the $FASTSEARCH and %FASTSEARCH% environment variables set to the directory where FAST ESP is installed.

Document Processing, Index Profile, and Search Engine Cluster

The index profile concept is closely tied to the concepts of document processing and search engine clusters.

During document processing, each document is represented by a set of elements that can be processed and later mapped to searchable fields related to the index profile. Both elements and fields represent content parts and attributes related to the document, for example, body, title, heading, URI, author, and category.

Figure 4: The relationship between Document Processing, Indexing and Search Engine Clusters


The index profile defines the layout of the searchable index, and specifies how fields are to be treated by query and result processing. Each search engine cluster has an associated index profile.

The index profile also includes one or more result views. A result view defines alternative ways for a query front end to view the index with relation to queries.

Index Profile Structure

The structure of the index profile is the composite of fields and attributes. The index profile can be configured to allow different features in ESP.

Fields

The basic entity of an index profile is a field with its attributes. A field is searchable by default, and is also the basic entity in the result presentation. Typical field attributes are name, specifying the field's name, type, specifying the type of content the field holds, or index, specifying whether the field should be searchable.

Scope Fields

The FAST ESP Indexer is based on a field structure that defines the schema of the indexed content. The schema is defined using the Index Profile. The Scope Search feature is facilitated by introducing a new field type in the FAST ESP index, named scope field. Hence, a scope-enabled index may include different types of fields.

A scope-enabled index may include the following types of fields:

• Basic field. A basic field may be of type string (any textual content), int32 (32-bit signed integer), uint32, float, double, or datetime (representing a date/time value as a numeric value in the index).

• Composite field. A composite field includes a set of basic string fields that can be matched using the built-in dynamic ranking mechanisms in FAST ESP.

• Scope field. A scope field contains hierarchical scope content. The individual subscopes of a scope field may be of any data type supported by FAST ESP (string, int32, float, double, or datetime). For textual scopes, a subset of the dynamic ranking mechanisms provided for composite fields will apply. When defining a scope field, there is no need to define the actual scope structure within the scope field in advance.

A FAST ESP index profile may contain a combination of one or more fields, composite fields, and scope fields. Hence, it is possible to combine both schema-based content in fields and dynamic scoped content in one index.

In the query language you may specify individual fields, composite fields, or scopes to limit the scope of a query. For scope queries, the scope specification in the query must include the scope field name (also called the root scope) and sub-scopes within the indexed scope structure.

A scope field may include a hierarchy of scopes in arbitrary depth.

Scope indexing is generic in the sense that it does not require any specific content input format. FAST ESP supports XML input format; other input formats may be supported by creating custom document processors.

Composite Fields

Composite fields allow you to group individual source fields by referencing the source fields through field-ref tags or field-ref-group tags. You can use this feature to apply a common rank score to a group of fields or to make them searchable as a unit.

Features Enabled by the Index Profile

The index profile is configurable to enable different features in ESP.

The following features are enabled by being specified in the index profile:

• ranking

• sorting

• tokenization

• lemmatization

• teaser generation

• navigation (dynamic drill-down)

For procedural details about how to configure an index profile, refer to the Configuration Guide.

Including Metadata

ESP offers different ways to include metadata about content in the indexing process.

Metadata can be included in the following ways:

• You may push metadata along with the content using the Content API. The metadata is treated as any other element of the content entity and transformed into a document element after the content has been submitted. You then need to design the index profile accordingly to catch any document elements you might want to be included in searching or result presentation, regardless of whether they originate from content metadata or not.

• If your content consists of HTML files, the built-in HTML parser extracts all HTML <meta> tags as document elements whose names are prefixed with meta_. For example, with the HTML fragment <meta name="DC.Identifier" content="http://www.fastsearch.com">, the internal document representation of this HTML file includes an attribute called meta_DC.Identifier that has the value http://www.fastsearch.com.

• Meta_xxx attributes are text chunks instead of plain strings. Refer to the Document Processor Integration Guide for information on what text chunks are and how to handle them.

• Meta names are lowercased. For example, <meta name="KEYwordS" content="abc, 123, abc, 123"> becomes meta_keywords = "abc, 123, abc, 123".

• For other content that complies with one of the formats the FAST Document Processing Engine is able to handle (see Processing Documents), the Document Processing Engine is able to detect metadata. For content in MS Word format, for example, it extracts metadata using the MS Word Properties field.
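
The meta_ naming convention for HTML <meta> tags can be reproduced with Python's standard html.parser module, as sketched below. The MetaExtractor class and the extract_meta helper are hypothetical names for this illustration, and the sketch applies the lowercasing rule to every meta name. It returns plain strings rather than text chunks and is not the built-in FAST ESP HTML parser.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect <meta name=... content=...> pairs as meta_<name> elements."""

    def __init__(self):
        super().__init__()
        self.elements = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            # Prefix with meta_ and lowercase the name.
            self.elements["meta_" + name.lower()] = content


def extract_meta(html_text):
    parser = MetaExtractor()
    parser.feed(html_text)
    return parser.elements
```

An index profile would then need fields for whichever of these elements should be searchable or returned in results.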

Executing Search Queries and Returning Results

The FAST Search Engine receives queries from the FAST Query & Result Server, which may have pre-processed the query, for instance to perform spell checking.

When a query is received, the FAST Search Engine matches it against the search index to identify a list of documents that match the query. The ordering of the list is based either on field-based sorting or on ranking (see Concepts of Relevancy, section Sorting Results). The FAST Search Engine returns a set of fields from the most relevant documents – based on ranking or sorting – from this ordered list to the FAST Query & Result Server. The number of documents to return can be specified as part of the query. Which fields to return for each document is defined in the Index Profile. Finally, the FAST Query & Result Server may perform post-processing on the result before it is returned to the end-user.

Partial Document Updates

In some situations, updates to a document can be frequent, but each update consists of a single changed value. Examples can be temperature measurements or bids in an auction.

In this case, performing an update operation on the document in question will result in the new value being unavailable for several seconds, as it has to wait for the smallest index to be re-indexed. This may be an unacceptable delay, in which case a partial update scheme can be used to improve the freshness of small update operations. The update may be performed without a need for re-indexing of the entire document.

Partial updates can be performed via the Content API.

Fields to be updated can (but do not have to) be updated as real-time properties. Real-time property fields can be configured in the Index Profile using the field attribute “latency=low”.

Query Highlighting in Dynamic Teasers

Query Highlighting in Dynamic Teasers extracts a range of the document centered on representative occurrences of the query terms.


The document body is stripped of markup during document processing and stored. A maximum of 64 KB is stored. For each document on the result page, this document extract is retrieved and text segments are generated that include the best matches of the query in that document.
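
The basic idea of a dynamic teaser, an excerpt of the stored body centered on a query-term occurrence with the hits marked, can be sketched in Python as follows. The dynamic_teaser function, the fixed character window, the use of the first hit as the "best" match, and the <b> markers are all assumptions for this example; the actual teaser generation selects representative occurrences and supports operators and linguistics that this sketch ignores.

```python
import re


def dynamic_teaser(body, query_terms, window=40):
    """Return an excerpt around the first query-term hit, with hits marked."""
    if not query_terms:
        return body[: 2 * window]
    pattern = re.compile(
        "|".join(re.escape(term) for term in query_terms), re.IGNORECASE
    )
    hit = pattern.search(body)
    if hit is None:
        # No match in the stored extract: fall back to the document start.
        return body[: 2 * window]
    start = max(0, hit.start() - window)
    end = min(len(body), hit.end() + window)
    excerpt = body[start:end]
    # Mark every query-term occurrence inside the excerpt.
    return pattern.sub(lambda m: "<b>" + m.group(0) + "</b>", excerpt)
```

Because the excerpt is computed per query, the same document can yield different teasers for different searches.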

Query highlighting supports advanced query operators such as proximity (NEAR/ONEAR), and also supports linguistic processing such as lemmatization and spell check.

Query Highlighting in Source Documentation

FAST ESP supports query highlighting in source documentation. The Document Hit Highlighting feature enables you to create a search application where the end-user may browse through the query hits within the full context of a matching document.

When using this feature, FAST ESP keeps an HTML representation of the original document. If the original document is an HTML page, a copy of the HTML is stored in this field. If the original document is in a different format (e.g. MS Office, PDF), a dedicated document processing stage converts the document to a similar HTML representation, which will be stored and used for the hit highlighting on the client side.

As part of the regular search results, the dynamic teaser contains HTML links which will lead you to the HTML representation of the source document.

The link from the dynamic teaser will bring you to the most relevant query match within the document. From there you can browse through the query hits within the document by relevance.

Figure 5: Components and features in Document Hit Highlighting

Refer to the SDK Document Highlighting Guide for more information.


Chapter 7

Concepts of Relevancy

This chapter introduces you to the basic concepts of relevancy features and tuning.

Topics:

• Components of Search Relevancy

• Contextual Insight

• Ranking Concept

• Freshness Boosting

• Analyzing linked web pages using the WebAnalyzer

• Tools to Modify Rank for Individual Documents

• Boosting Mechanisms

• Relevancy Modifications Based on Business Rules

• Proximity Ranking and Matching

• Sorting Overview

• Boundary Matching

• Duplicate Removal

Components of Search Relevancy

Relevancy is the measure of how well a set of results answers or addresses the intent of a given query.

FAST ESP supports search relevancy through the following key steps:

• Data mining

The document processing framework provides support for extensive data mining to perform real-time content relevancy refinement. This includes embedded relevancy tools and integration points for 3rd party modules. For details on document processing, refer to Processing Documents.

• Linguistic processing

Multiple linguistic processing features provide a number of approximate matching techniques to improve query recall. This includes automatic spell check, matching with inflectional variations of terms (lemmatization), thesaurus (synonym) matching, and natural language support (anti-phrasing). The advanced linguistics features are described in further detail in Advanced Linguistic Processing, and in the Advanced Linguistics Guide.

• Sorting

Sorting results based on individual document elements allows for highly relevant result presentation. For details on sorting, refer to section Sorting Results.

• Rank value calculations

The calculation of a rank value based on the FAST ESP ranking model provides a multi-faceted measurement of the quality of the match between the query and a candidate result document. This rank value consists of query-dependent and query-independent parameters.

• Query context analysis

Query context analysis refers to the ability to present the information from the query results in context of the query. FAST ESP supports dynamic document summaries that display the segments of the matching document that provide the most relevant match with the query.

• Navigation

Data-driven navigation provides drill-down into the query result or related areas. Drill-down queries may be based on document similarities, category, entity, or terminology information extracted from the documents, parametric drill-down into multiple dimensions of the query result (dynamic drill-down), and drill-down into content domains (for example, all documents from a given site). This feature is further described in Taxonomy and Navigation.

Contextual Insight

Contextual Insight is the next generation of search relevancy, drastically improving precision without sacrificing recall. Using Contextual Insight you can create fact-finding applications enabling queries like ‘when was NN born’ or ‘where are the next winter Olympic games’. Conventional search engines will return links to documents that include the name NN or the terms ‘winter Olympic games’. With Contextual Insight you can also detect the intent of the query, search for the terms/phrases, and return requested entities that appear in context of the matching text – in this case dates or cities.

Contextual Insight is based on the following ESP features:

• Contextual processing – Detection and markup of text flow such as sentences, paragraphs, and other semantic structures in unstructured content. This is mapped to a hierarchical scope structure which enables search within different contexts of the document. This enables you to limit your search to paragraphs or other semantic elements in the text.


• Context aware entity extraction – Automatically detects entities in text and annotates the detected semantic structures with normalized entities. Entities include such things as people’s names, phone numbers, geographic locations, and in our example, company names.

• Scope Search – Enabling efficient indexing and search in a contextual decomposition of the documents.

• Contextual navigation (also sometimes referred to as scope navigation) – Previously, successful navigation has been limited to global metadata. Contextual Navigation unlocks the semantic meaning of contextual metadata in the form of extracted entities. It is able to extract textual entities from the results of the previous search. Unlike taxonomies and facets, entity extraction draws its navigators directly from the results and is contextually aware, so that you can return navigation entries that appear in the context of the matching sentences or paragraphs.

• Natural language – A natural language query processor is available to enable creation of natural language query rules, enabling semantic query transformation for, e.g., where, when, what, and who types of queries.

Ranking Concept

FAST ESP ranking is based on a multi-faceted measurement of the quality of the match between the query and a candidate result document.

The relevancy of a document with respect to a query is represented by a ranking value.

In the index profile, you specify one or more rank profiles. A rank profile specifies the relative weight of each rank component for a given query. This enables individual relevance tuning of different query applications using a FAST ESP installation.
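
The effect of relative component weights can be illustrated with a small Python sketch in which a rank profile is a set of weights and the rank value is a weighted sum of per-component scores. Both the weighted-sum combination and the two profiles named news and reference are assumptions made for this illustration. The component names follow the ranking parameters of the FAST ranking model, but the actual rank computation is not specified here.

```python
# Hypothetical rank profiles: relative weight per rank component.
RANK_PROFILES = {
    "news": {
        "freshness": 0.5, "authority": 0.1, "quality": 0.1,
        "proximity": 0.2, "context": 0.1,
    },
    "reference": {
        "freshness": 0.1, "authority": 0.5, "quality": 0.1,
        "proximity": 0.2, "context": 0.1,
    },
}


def rank_value(component_scores, profile):
    """Combine per-component scores (0.0 to 1.0) into a single rank value."""
    return sum(
        weight * component_scores.get(name, 0.0)
        for name, weight in profile.items()
    )
```

With these weights, a recent document with few inbound links outranks an older, heavily linked one under the news profile, and the order is reversed under the reference profile. This is the kind of per-application tuning that separate rank profiles allow.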

The FAST ranking model is based on the individual tuning of the ranking parameters of freshness, authority, quality, proximity, and context.

Freshness

Freshness denotes the age of a document compared to the point in time when the query is issued. Refer to section Freshness Boosting for more information.

Authority

Authority denotes the importance of a document as determined by the links from other documents to the document in question.

To determine a document’s authority, FAST ESP detects links from other documents and uses the anchor texts associated with these links to compute an authority rank component.

Refer to section Anchor Text Analysis for more information.

Quality

Quality denotes the assigned importance of a document. Since quality metrics are assigned to individual documents or groups of documents directly, the quality of a document is query independent.

FAST ESP provides a set of business manager tools that allow you to assign quality metrics to individual documents or groups of documents.

It is also possible to apply quality metrics through metadata when submitting content via the Content API or Content Connectors. This is further described in the Content Integration Guide. The source (element name) and weight of the quality component can be specified in the Rank Profile.

The Quality ranking component may also be referred to as Static Rank.

Note: The term “quality” here refers to document quality. This should not be confused with the word quality when talking about Result Quality, which refers to quality criteria in the search result itself.


Proximity and Context

Proximity and context measurements determine how well the content of a document matches the query.

This is based on the following aspects of a query match:

• The number of query terms matching a document within the result set (for an OR type query).

• Query term weighting. Different relevancy weights may be applied to different terms in a query.

• Proximity. When a query contains multiple terms that are not detected as known phrases, the ranking process takes the relative position of the terms and determines the most relevant results based on the proximity the matching terms in the document have to each other. Proximity denotes the distance between, and location of, query terms in the documents.

• Frequency of query terms occurring in a matching document, compared to the global frequency of the terms in the index. More occurrences in the matching document imply a higher ranking value. However, if the term has high frequency over the total index, this will reduce the ranking value.

• Context based Relevance Tuning. Different document fields, for example title, body, description, price, or type, may be assigned different relevance weight. This allows you to specify for example that a match in the title field of a document contributes more to the document's ranking value than a match in the body field of a document.

The proximity and context parameters of the rank profile control these statistical metrics, except for the query term weighting, which is selected at query time.
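The frequency and context aspects listed above can be sketched in a few lines of Python. The TF-IDF-style formula, the function names and the field weights are assumptions chosen for illustration and are not the FAST ESP scoring function; the sketch only shows the direction of each effect.

```python
import math


def term_score(tf_in_doc: int, docs_with_term: int, total_docs: int) -> float:
    """More occurrences in the document raise the score; a term that is
    frequent across the whole index lowers it (assumed TF-IDF-style formula)."""
    idf = math.log((1 + total_docs) / (1 + docs_with_term))
    return tf_in_doc * idf


def context_score(matches: dict, field_weights: dict,
                  docs_with_term: int, total_docs: int) -> float:
    """Sum the per-field term scores, scaled by each field's relevance weight."""
    return sum(field_weights.get(name, 0.0) * term_score(tf, docs_with_term, total_docs)
               for name, tf in matches.items())


field_weights = {"title": 3.0, "body": 1.0}  # hypothetical weights

# One match in the title versus one match in the body.
title_hit = context_score({"title": 1}, field_weights, docs_with_term=10, total_docs=1000)
body_hit = context_score({"body": 1}, field_weights, docs_with_term=10, total_docs=1000)

# The same in-document frequency for a rare term and for a very common term.
rare = term_score(3, docs_with_term=10, total_docs=1000)
common = term_score(3, docs_with_term=900, total_docs=1000)
```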

Proximity and context metrics only apply to composite fields, not to query terms with wildcards.

Freshness Boosting

The freshness rank boosting feature controls to what extent relative age of the documents impacts the rank (relevance score). If enabled, newer documents will appear higher up in the result set.

The date of a document is set when processing a document. The date source may be the content source itself (for example, an application submitting documents via the API), date information from file servers or web servers, or the time of processing within FAST ESP.

The Crawler, File Traverser and Content Connectors will set the time stamp automatically for the document when submitting to FAST ESP. The Content Integration Guide describes how you can apply a custom time/date source for documents submitted via the API.

When performing a query, the document date/time value is converted to a 'freshness' parameter that reflects the age of the document from the time of a query. The age is scaled to reflect the perceived importance of age (the difference between 1 and 5 days age may reflect the same relevance difference as between 1 and 12 months age).
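The age scaling described above can be illustrated with a logarithmic mapping, in which a few days of difference matter a great deal for young documents and very little for old ones. The logarithmic formula, the function name and the `max_age_days` cut-off are assumptions; the guide does not state the scaling function FAST ESP actually uses.

```python
import math
from datetime import datetime, timedelta


def freshness(doc_time: datetime, time_base: datetime, max_age_days: float = 3650.0) -> float:
    """Map document age to a value in [0, 1], where 1.0 means brand new.
    The logarithmic scaling is an assumption made for this sketch."""
    age_days = max((time_base - doc_time).total_seconds() / 86400.0, 0.0)
    scaled = math.log1p(age_days) / math.log1p(max_age_days)
    return max(0.0, 1.0 - scaled)


time_base = datetime(2008, 4, 3)  # stands in for the time of the query
f1, f5, f360, f364 = (freshness(time_base - timedelta(days=d), time_base)
                      for d in (1, 5, 360, 364))
```

In this sketch the four-day gap between `f1` and `f5` changes the freshness value far more than the four-day gap between `f360` and `f364`.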

The freshness boost feature is controlled using the Rank Profile feature of the Index Profile and by query parameters. This feature can be controlled on a per query basis by:

• Selecting rank profile for the query by a query parameter. Multiple rank profiles (defined in the Index Profile) may have different weight on the freshness boost within the total rank.

• Selecting the time base for calculating the freshness boost. The freshness boost is calculated based on the relative age of each document compared to the given time base. Default time base is the current time when performing the query.

Note: Boosting can also be managed through the Search Business Center. Refer to Managing Boosts & Blocks in the Search Business Center Guide.


Analyzing linked web pages using the WebAnalyzer

The hyperlink structure of Web pages may provide valuable information about the importance of a web page.

A web page to which a high number of other web pages refer is assumed to be more important than a web page to which only few or no other web pages refer. In particular, links from what are referred to as good pages - pages that are referenced by many other good pages - indicate that the linked page is important.
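The recursive idea that a page is important when other important pages link to it can be sketched with a simplified PageRank-style iteration. This is a generic link-analysis technique swapped in for illustration; the guide does not state which algorithm the WebAnalyzer uses, and the function name, damping factor and sample graph are all assumptions.

```python
def link_authority(links: dict, damping: float = 0.85, iterations: int = 50) -> dict:
    """Simplified PageRank-style score: a page passes a share of its own
    score to every page it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    score = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        nxt = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * score[page] / len(targets)
                for target in targets:
                    nxt[target] += share
            else:
                # A page without outgoing links spreads its score evenly.
                for p in pages:
                    nxt[p] += damping * score[page] / n
        score = nxt
    return score


# "hub" is referenced by three pages and links to "target";
# "lonely" is referenced only by "d", which nobody links to.
graph = {"a": ["hub"], "b": ["hub"], "c": ["hub"], "hub": ["target"], "d": ["lonely"]}
scores = link_authority(graph)
```

Both `target` and `lonely` have exactly one incoming link, but `target` scores higher because its link comes from a well-referenced page.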

The WebAnalyzer is a FAST ESP module that uses links between documents to improve search relevancy. The WebAnalyzer Guide describes the feature and provides installation, configuration, operations, and troubleshooting information.

Refer to the WebAnalyzer Guide for information about the WebAnalyzer, including procedural information about the tasks you can perform from the WebAnalyzer Overview tab.

Tools to Modify Rank for Individual Documents

There are tools that enable you to perform Absolute Query Boost, Relative Query Boost or Relative Document Boost for given documents in the index. An example could be a product database where it might be desirable to boost products with highest profit margins, boost products related to campaigns, etc.

The following tools exist for this purpose: Search Business Center and Boost Bulk tool.

Search Business Center

Search Business Center (SBC) is a GUI-based administrative tool that enables rank tuning on a document level. The boost value may also be negative, in order to prevent pages from appearing at the top of a result list.

In Search Business Center you can boost or block documents to change their rankings for a specific query.

Note that only query-side boosting and blocking can be handled using the Search Business Center.

Using the SBC you can change the ranking for each query using three different methods:

• Top Ten - to position the document in one of ten reserved places that will be returned at the top of the results list.

• Add boost points - to add a value to a document to increase its relevancy relative to the other documents returned in the search results. You can also add negative boost points to a document.

• Block from query - to prevent the blocked document from appearing in the search results for the query.

Refer to the Search Business Center guide for details.

Boost Bulk Tool

This is a standard FAST ESP tool that enables you to perform the same rank tuning as the SBC, using an XML file as input. The XML file contains a specification of the rank modifications to be performed.

This approach is preferred if you have the ability to extract the rank boost information from other data or other applications.

Refer to Boostbt Tool in the Operations Guide for information on how to use this feature.

Boosting Mechanisms

FAST ESP supports the following types of boosting mechanisms: Absolute Query Boosting, Relative Query Boosting, and Relative Document Boosting.


Absolute Query Boosting

Suppose you want a document to be consistently displayed at a given position in the result set, for example at position one, when a user searches with a specific query. Then you can specify a document-query-combination and assign a fixed absolute ranking position that the specified document is to get within the result list, whenever a user is searching with the specified query.

Absolute Query Boosting also allows you to exclude individual documents from being displayed at all when a user searches with a specific query.

Relative Query Boosting

Suppose you want to ensure that a particular document is always displayed among the first 20 documents in the result list, provided a user searches with a specific query. For all other queries, the ranking position of the document shall not be impacted by any boost.

Thus, you specify a document-query-combination and assign an amount of ranking points with which the document’s overall ranking value is to be increased whenever a user is searching with the specified query.

Relative Document Boosting

Suppose you want to ensure that a particular document is always displayed within the first 20 documents in the result list, no matter which query a user has submitted. At the same time you do not want to assign a fixed result list position to the document.

For this purpose, you can specify that the overall ranking value of the particular document is to be increased by a certain amount of ranking points.
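The three boosting mechanisms, together with query-specific exclusion, can be contrasted in a short Python sketch. The `Boosts` container, the `apply_boosts` function and the sample values are hypothetical and only model the behavior described in this section; they do not reflect the Search Business Center or Boost Bulk Tool data formats.

```python
from dataclasses import dataclass, field


@dataclass
class Boosts:
    """Hypothetical container for the three boost types plus query-side blocks."""
    absolute: dict = field(default_factory=dict)        # (query, doc_id) -> 1-based position
    blocked: set = field(default_factory=set)           # {(query, doc_id)}
    relative_query: dict = field(default_factory=dict)  # (query, doc_id) -> ranking points
    relative_doc: dict = field(default_factory=dict)    # doc_id -> ranking points


def apply_boosts(query: str, rank_values: dict, boosts: Boosts) -> list:
    """Return document ids in result order after applying all boosts."""
    scores = {}
    for doc_id, value in rank_values.items():
        if (query, doc_id) in boosts.blocked:
            continue  # excluded for this query only
        value += boosts.relative_query.get((query, doc_id), 0.0)  # this query only
        value += boosts.relative_doc.get(doc_id, 0.0)             # every query
        scores[doc_id] = value
    pinned = {d: pos for (q, d), pos in boosts.absolute.items() if q == query and d in scores}
    order = sorted((d for d in scores if d not in pinned), key=lambda d: scores[d], reverse=True)
    for doc_id, pos in sorted(pinned.items(), key=lambda item: item[1]):
        order.insert(min(pos - 1, len(order)), doc_id)  # fixed absolute position
    return order


rank_values = {"d1": 10.0, "d2": 8.0, "d3": 5.0, "d4": 3.0, "d5": 1.0}
boosts = Boosts(
    absolute={("laptop", "d4"): 1},
    blocked={("laptop", "d2")},
    relative_query={("laptop", "d3"): 6.0},
    relative_doc={"d5": 3.0},
)
laptop_order = apply_boosts("laptop", rank_values, boosts)
phone_order = apply_boosts("phone", rank_values, boosts)
```

For the query `laptop`, all four rules apply. For any other query, only the relative document boost on `d5` has an effect.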

Relevancy Modifications Based on Business Rules

Analyzing the impact of business rules makes it possible to impact the relevancy model and direct search end-users to business-generating results.

Organizations are governed by business rules and workflows. Business rules should be adjusted in line with market trends, analytic regression patterns, etc. to meet the needs of the business and the market. An example of a business rule could be that a credit check is not necessary for returning customers. FAST ESP lets you apply business rules at various stages of a search.

FAST ESP allows you to impact or override the automatic ranking of documents based on business related rules, such as to direct the end-users to business-generating pages.

The following tools are available for this purpose:

• Search Business Center
• Boost Bulk Tool

Proximity Ranking and Matching

The term proximity denotes the degree to which a query and a document match, based on the distance between the query terms within a document. The calculated proximity value of a query contributes to the overall ranking value of a document within a result set.

In general, a document in which the query terms are located close to each other is expected to be more relevant to the query than a document in which the query terms are located far from each other. A high proximity value boosts the overall ranking value of the respective document.

Proximity only has an effect on queries with multiple words, and is likely to have higher impact on the result set the more terms the query includes. Like the overall ranking value, proximity does not change the total number of resulting documents for a query, but will improve the ranking order of the result presentation for searches that return multiple results.

FAST ESP supports proximity as a selection and ranking criterion in two main ways:

• Explicit Proximity
• Implicit Proximity

Explicit Proximity

Explicit proximity denotes the fact that you can restrict a query by combining query terms with special proximity operators.

These operators are:

NEAR

This operator returns documents that contain the two terms combined by the NEAR-operator with no more than n words separating them. Each term may be a single word or a phrase (enclosed in double quotes). The order of the query terms does not matter for the matching, only the distance.

ONEAR

The ONEAR operator provides ordered near-functionality. This means that the query terms combined by the ONEAR operator must have the same order in the matching document section as in the query. This query expression returns documents that contain the first and second term with no more than n words separating them. Furthermore, the first term must appear before the second term in the matching section of the document.

Adding explicit proximity constraints to a query will improve the precision of the result by eliminating irrelevant results from the result set.
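The matching semantics of the two operators can be sketched as follows. The `near` function is a hypothetical helper, not FAST Query Language syntax; it handles single-word terms only (phrases are left out), and whitespace tokenization with lowercasing is an assumption.

```python
def _positions(tokens: list, term: str) -> list:
    """Word positions at which the term occurs."""
    return [i for i, token in enumerate(tokens) if token == term]


def near(text: str, first: str, second: str, n: int, ordered: bool = False) -> bool:
    """True if the two terms occur with no more than n words separating them.
    With ordered=True (ONEAR-style), `first` must also precede `second`."""
    tokens = text.lower().split()
    for i in _positions(tokens, first.lower()):
        for j in _positions(tokens, second.lower()):
            if i == j or (ordered and j < i):
                continue
            if abs(j - i) - 1 <= n:  # number of words between the two terms
                return True
    return False


text = "FAST Enterprise Search Platform"
unordered_hit = near(text, "search", "fast", 1)              # one word in between
ordered_miss = near(text, "search", "fast", 1, ordered=True)  # wrong order
ordered_hit = near(text, "fast", "search", 1, ordered=True)
too_far = near(text, "fast", "platform", 1)                   # two words in between
```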

For details about applying explicit proximity, refer to Proximity relevance features in the Query Language and Parameters Guide.

Implicit Proximity

Implicit proximity denotes the fact that documents get a higher rank value the closer the query terms they contain are to each other.

Implicit proximity will not change the total result set, but will improve the ranking precision of the result set, as documents that contain the query terms close to each other are ranked higher than documents that contain the query terms less close to each other. As such, implicit proximity provides much of the effect of explicit proximity constraints.

Implicit proximity is only one part of the entire ranking of a matching document. Matching text segments in a document are assessed along the following criteria (in decreasing order of significance):

• Completeness: The higher the number of query terms present in the same element of a matching document, the higher the document’s ranking value gets. In addition, important query terms, that is, words that are not stopwords, add a higher boost to the ranking value of the document than stopwords.

• Distance: Query terms occurring very near to each other add more to a document’s rank value than query terms that are less near to each other.

• Position: The earlier a query term occurs in a document, the higher the document’s rank value gets.
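One way to picture criteria applied "in decreasing order of significance" is a lexicographic sort key, where a later criterion only breaks ties left by an earlier one. The sketch below is an assumption about how such an ordering could be expressed; the function name, the stopword weighting (2 versus 1) and the brute-force distance calculation are illustrative and not taken from FAST ESP.

```python
from itertools import product


def proximity_key(tokens: list, query_terms: list, stopwords: frozenset = frozenset()) -> tuple:
    """Sort key (smaller sorts first): completeness, then distance, then position."""
    positions = {t: [i for i, tok in enumerate(tokens) if tok == t] for t in query_terms}
    present = [t for t in query_terms if positions[t]]
    if not present:
        return (0, 0, len(tokens))
    # Completeness: non-stopwords count more than stopwords (assumed weights).
    completeness = sum(1 if t in stopwords else 2 for t in present)
    # Distance: smallest window that holds one occurrence of every present term.
    span = min(max(combo) - min(combo)
               for combo in product(*(positions[t] for t in present)))
    # Position: earliest occurrence of any query term.
    first = min(positions[t][0] for t in present)
    return (-completeness, span, first)


query = ["cheap", "oslo"]
docs = {
    "d1": "cheap flights to oslo".split(),
    "d2": "oslo cheap hotels".split(),
    "d3": "visit oslo in winter".split(),
    "d4": "book now for cheap oslo tours".split(),
}
ranked = sorted(docs, key=lambda d: proximity_key(docs[d], query))
```

Here `d3` sorts last because it lacks one query term, `d1` follows `d2` and `d4` because its terms are further apart, and `d2` precedes `d4` only because its match occurs earlier.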

A built-in feature provides a rank boost if the query terms occur close to each other within the first 255 words of a field or composite field. This feature is applied on index level, that is, based on indexed proximity information. This feature is not configurable.

In FAST ESP, implicit proximity boosting features are applied as part of the core matching within the FAST Search Engine. Proximity boosting features are applied to AND, OR, NEAR, and ONEAR query expressions. This means for example that a query expression a AND b will give higher relevancy to a document where a appears closer to b than in other documents.


Note: Proximity boosting is not applied to query expressions using the ANY operator.

Sorting Overview

The term sorting denotes the ability to order search results according to a value in one or more index fields. Sorting search results depends only on the fields in a document. The position or frequency which the query term or terms may have in the matching documents do not influence sorting.

Which field to use for sorting is specified as part of the query. For details on how to enable this, refer to the Query Language and Parameters Guide.

FAST ESP supports sorting along the following data types:

• field values (numeric and full text)
• rank (relevancy and score)
• geographical distance (geo field)

Full Text Sorting

FAST ESP supports sorting on full text. This means that you may sort on a configurable number of characters, without any limitations on the text string. Full text sorting includes national text sorting rules.

FAST ESP allows you to sort results in ascending or descending order.

Multi-Level Sorting

Sorting may be defined for either single or multiple fields. Specifying multiple fields allows for multi-level sorting. This way, you may for example sort a result set by product name, then by price, and then by date. Multi-level sorting enables database-type sorting schemes with a list of fields to be used for sorting.

Combining Ranking and Sorting

Multi-level sorting allows you to combine ranking and pure sorting, as the rank field may be one of the sort levels. Multi-level sorting is supported for any field that has been defined for sorting in the index profile. Ascending and descending sort order is available for all fields as part of multi-level sorting. The sort order is specified at query time.
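Multi-level sorting with a per-level sort direction, including rank as one of the levels, can be sketched in Python by applying a stable sort from the last level to the first. The function name, the field names and the sample data are assumptions; the actual sort specification syntax is described in the Query Language and Parameters Guide.

```python
def multi_level_sort(results: list, levels: list) -> list:
    """Sort by several (field, descending) levels, the first level being
    the most significant. Relies on Python's sort being stable."""
    ordered = list(results)
    for field_name, descending in reversed(levels):
        ordered.sort(key=lambda r: r[field_name], reverse=descending)
    return ordered


results = [
    {"id": 1, "name": "mouse", "price": 20, "rank": 0.4},
    {"id": 2, "name": "keyboard", "price": 50, "rank": 0.9},
    {"id": 3, "name": "mouse", "price": 15, "rank": 0.7},
    {"id": 4, "name": "keyboard", "price": 30, "rank": 0.5},
]

# Product name ascending, then rank descending within each name.
by_name_then_rank = [r["id"] for r in
                     multi_level_sort(results, [("name", False), ("rank", True)])]
# Product name ascending, then price ascending.
by_name_then_price = [r["id"] for r in
                      multi_level_sort(results, [("name", False), ("price", False)])]
```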

Sorting on Geographical Coordinates

The GEO Search feature enables sorting based on distance from the end-user location. Refer to Geo Search.

Field Collapsing

A feature related to sorting is field collapsing. Field collapsing allows for folding of results with identical value for a given result field. You can use this feature in order to collapse results with given attributes.

There are two kinds of field collapsing:

• field collapsing which removes collapsed documents
• field collapsing which does not remove collapsed documents (default)

Refer also to Hit and Navigator Count in the Query Language and Parameters Guide for information on field collapsing.

Field Collapsing without Document Removal

You may for example want to collapse all results with the same product name or code in the result set. The result will be re-sorted in such a way that the collapsed results are presented last.


This type of field collapsing can be enabled/disabled at query-time. However, the result specification area of the index profile must be specified to support the feature.

Field Collapsing Including Document Removal, Query-Side Collapsing

This type of field collapsing is controlled at query time and does not require additional index profile support, which makes it possible to select collapse fields on a per query basis. Unlike the default option, this type of field collapsing makes it possible to remove documents from the result set. (Collapsed documents are always removed.)

The following options are possible with field collapsing including document removal:

• simple collapse, which removes collapsed documents from the result set.

• collapse on specified numeric fields, where fields for collapsing can be determined at query time, and thus do not need to be specified in the index profile.

• collapse and keep N number of collapsed documents, where it is possible to keep a specified number of documents for each collapsed group.
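The difference between the two kinds of field collapsing, and the keep-N option, can be sketched as follows. The `collapse` function, its parameters and the `product_code` field are hypothetical; the sketch assumes the input list is already in rank order.

```python
def collapse(results: list, field_name: str, keep: int = 1, remove: bool = True) -> list:
    """Fold results that share the same value in `field_name`.
    remove=True drops collapsed documents; remove=False presents them last."""
    seen = {}
    kept, collapsed = [], []
    for result in results:  # assumed to be in rank order already
        count = seen.get(result[field_name], 0)
        seen[result[field_name]] = count + 1
        (kept if count < keep else collapsed).append(result)
    return kept if remove else kept + collapsed


results = [
    {"id": 1, "product_code": 100},
    {"id": 2, "product_code": 100},
    {"id": 3, "product_code": 200},
    {"id": 4, "product_code": 100},
    {"id": 5, "product_code": 200},
]

simple = [r["id"] for r in collapse(results, "product_code")]
keep_two = [r["id"] for r in collapse(results, "product_code", keep=2)]
without_removal = [r["id"] for r in collapse(results, "product_code", remove=False)]
```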

More information on field collapsing can be found in the Query Language and Parameters Guide.

Controlling Ranking and Sorting of Query Results

ESP lets you control ranking and sorting of query results in several different ways.

Controlling ranking and sorting can be done by:

• Specifying multiple rank profiles in the index profile.

• Specifying sorting attributes for individual fields in the index profile. This will define which sorting attributes are available.

• Controlling result sorting on a per query basis. By default the result is sorted based on the default rank profile. Query parameters enable you to specify an alternative rank profile for the query, or a set of fields that the result set is to be sorted by.

For details on specifying rank profiles and sorting attributes in the index profile, refer to the Configuration Guide.

How to use the result sorting query parameters is described in further detail in the Query Language and Parameters Guide.

Boundary Matching

FAST ESP provides support for boundary-sensitive matching. This means that you may search for words at the start/end of a field, as well as an exact token match with a field. Boundary matching can be applied to fields of type string.

Use case examples may be a product name field where the full name of one product is a substring of another product name, or a field containing a list of string values, for example, a list of names. In this case it may be desirable to be able to match the exact content of each string, and to avoid query match across string boundaries.

Boundary matching is applied on the tokenized text. This means that it is not a true exact match including upper/lower case etc.
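The following sketch illustrates boundary matching on tokenized text for a field that holds a list of strings. The function names and the mode names are hypothetical, and tokenizing by lowercasing and splitting on whitespace is an assumption standing in for the real tokenizer.

```python
def tokenize(value: str) -> list:
    """Assumed tokenization: lowercase and split on whitespace."""
    return value.lower().split()


def boundary_match(values: list, query: str, mode: str) -> bool:
    """Match the query tokens against each string value separately, so a
    match can never span two values. mode is 'starts', 'ends' or 'exact'."""
    q = tokenize(query)
    if not q:
        return False
    for value in values:
        tokens = tokenize(value)
        if mode == "exact" and tokens == q:
            return True
        if mode == "starts" and tokens[:len(q)] == q:
            return True
        if mode == "ends" and tokens[-len(q):] == q:
            return True
    return False


names = ["John Smith", "Anna Lee"]
exact_ignores_case = boundary_match(names, "JOHN SMITH", "exact")
across_boundary = boundary_match(names, "smith anna", "exact")
ends_with = boundary_match(names, "Smith", "ends")
starts_with = boundary_match(names, "anna", "starts")

products = ["FAST ESP Search Business Center"]
substring_only = boundary_match(products, "FAST ESP", "exact")
```

Because the comparison runs on tokens, `JOHN SMITH` matches `John Smith`, while `smith anna` does not match across the boundary between the two names.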

Exact matching is enabled per index profile field.

For more information about Boundary Matching refer to the Query Language and Query Parameters Guide.

Applying Boundary Matching

Applying boundary matching requires you to configure the relevant field in the Index Profile.


Refer to the Configuration Guide for details on how to configure the Index Profile accordingly.

Refer to the Query Language and Query Parameters Guide for details on how to apply boundary matching in queries.

Duplicate Removal

FAST ESP provides different ways of detecting and removing duplicate documents.

Crawler Duplicate Removal - The FAST Crawler is able to detect duplicates within collections. This duplicate removal may be configured to exclude metadata in the HTML document. Refer to the Crawler Guide for more information.

Dynamic (Result-side) Duplicate Removal - A result-side duplicate removal feature may be used to detect and remove duplicates across collections, and also enable a more flexible definition of perceived duplicates.

Field Collapsing - This feature does not remove duplicates, but re-ranks documents based on similar value for a given field.

Dynamic (Result-Side) Duplicate Removal

The result-side duplicate removal feature may be used to detect duplicates across collections, and also to enable a more flexible definition of perceived duplicates. This feature is called dynamic duplicate removal.

The dynamic duplicate removal feature ensures that, within a result set, duplicate documents are represented only by one single document. This document is the one that has the highest relevancy ranking within the set of duplicate documents.

A field to which you typically would apply dynamic duplicate removal is the field of a document containing its URI. When this is specified in the index profile, only documents with different URIs will be included in the result set.

In general, basic duplicate removal based on URI is performed prior to indexing by the data sources. However, in certain cases the same document, that means one specific URI, may appear in different collections within your FAST ESP installation. As queries may be applied to selected collections only, it is not possible to detect or remove such duplicates prior to indexing. In such cases, dynamic duplicate removal allows you to filter out duplicates from the current result set.
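A minimal sketch of the behavior described above: within one result set, documents sharing the same value in the chosen field are represented by the single document with the highest rank. The function name, the `rank` key and the sample data are assumptions made for illustration.

```python
def remove_duplicates(results: list, field_name: str = "uri") -> list:
    """Keep only the highest-ranked document for each distinct field value."""
    best = {}
    for result in results:
        key = result[field_name]
        if key not in best or result["rank"] > best[key]["rank"]:
            best[key] = result
    return sorted(best.values(), key=lambda r: r["rank"], reverse=True)


# The same URI has been indexed in two collections.
results = [
    {"uri": "http://a", "collection": "web", "rank": 0.4},
    {"uri": "http://a", "collection": "intranet", "rank": 0.9},
    {"uri": "http://b", "collection": "web", "rank": 0.6},
]
deduped = [(r["uri"], r["collection"]) for r in remove_duplicates(results)]
```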

There are two ways to use this feature:

• Activating the feature in the Index Profile Result Specification.

For details, refer to the Configuration Guide. The result set to a query will then only list those documents that have different values in this field.

• Activating the feature on a per query basis.

For details, refer to the Query Language and Parameters Guide.

In this case the index profile configuration is optional.


Chapter 8

Processing Queries and Results

This chapter introduces you to the basic concepts of processing queries and results.

Topics:

• Query and Results Server Overview
• Query Concepts
• Query Processing
• Result Processing
• The FAST Search Front End (SFE)

Query and Results Server Overview

The FAST Query & Result Server (QR Server) provides query and result processing prior to submitting the queries to the Search Engines and presenting the result list on the search interface.

It receives search queries from the Search API, analyses them, and, if required, transforms them. It distributes them to the appropriate Search Engine nodes and creates feedback about what the query analysis found and what search results the query returns. Depending on configuration, this feedback is sent back to the end-user, ignored, or used for automatic query re-submission.

Furthermore, the Query & Result Server receives search results from the Search Engine nodes, processes them and forwards them to the Search API. For more details about the Search API, refer to the Query Integration Guide.

Query Concepts

Query and Result Server

The FAST Query & Result Server provides query and result processing prior to submitting the queries to the Search Engines and presenting the result list on the search interface.

The Query & Result Server contains multiple transformers that perform specific query and result processing tasks. There are two types of processors:

• Processors that contribute to query processing. They form the query transformation framework. Their names have the format qtf_*.

• Processors that contribute to result processing. They form the result processing framework. Their names have the format rpf_* or rff_*.

The Query & Result Server provides:

• Linguistic query processing such as spell checking and anti-phrasing
• Result Clustering
• Navigation
• Find Similar
• Dynamic Duplicate Removal

Query and Components

A query submitted to the FAST ESP system consists of two main components: a natural language query component which is subject to linguistic query processing and proximity/context ranking; and a structured filter component which is not modified during query processing.

The search query must comply with one of the supported Query Languages.

Refer to the Query Language and Parameters Guide for a detailed description of the supported query languages and query features.


Query Processing

When an end-user submits a query to the FAST ESP system, the query is subject to query processing for relevancy enhancement in the FAST Query & Result Server, before it is passed to the FAST Search Engine to perform the original or processed query.

Query processing is based on linguistic analyses of the query string. It includes the following linguistics features which are explained further in the Advanced Linguistics Processing chapter of this guide and in the Advanced Linguistics Guide:

• Proper name and phrase recognition
• Spell checking
• Anti-phrasing
• Lemmatization

Query Modifications

Query processing may be configured globally and per query, and there are several different ways to modify a query in FAST ESP.

Query modifications may be applied in the following ways:

• as an automatic rewrite of the query before execution against the index

This is most useful for Anti-Phrasing, when common query parts as in Where do I find information about Japan are removed and the query is reduced to the essential query string information Japan.

• as a suggested rewrite, typically presented as a search tip on the result page.

This is a more conservative approach avoiding any unexpected query rewrites that the end-user did not intend. It is most useful for proper name recognition, when the query string World Cup is detected as a phrase, and a search tip such as Did you mean World Cup? is returned. It is also useful for spell checking.

• a combination of the two above: The query is first executed in its original form. In case of no hits, the query is automatically resubmitted using the automatic rewrite option, and the new result is presented to the user.

This is an approach that is transparent to the end-user. The resubmission parameter is set per query and the result received on the API will also indicate the transformed query.

Query Resubmission

The resubmission parameter is set per query and defines which of these features are to be enabled if the original query returns no hits.

FAST ESP is able to perform a number of automatic or suggested transformations of the user’s query, based on advanced linguistics. This includes spell checking, proper name recognition and anti-phrasing.

There are three types of query transformation:

• Modify – The query term string is automatically modified using the transformation parameters. The modified query is executed and the result set is returned.

• Conditional Modify – The query term string is automatically modified if no hits are returned by the executed original query.

• Suggest – The executed query is not transformed, but a suggested transformed query is returned together with the result set, based on the original query. This flexibility allows the application or the user to decide how to modify the query terms entered by the user.
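The control flow of the three transformation types can be sketched as follows. The `run_query` function, the mode strings, and the toy `search` and `transform` callables are all hypothetical; the actual query parameters are documented in the Query Language and Parameters Guide.

```python
from typing import Callable, Optional


def run_query(query: str, mode: str,
              search: Callable[[str], list],
              transform: Callable[[str], str]) -> tuple:
    """Return (executed_query, hits, suggestion) for one of the modes
    'modify', 'conditional_modify' or 'suggest'."""
    rewritten = transform(query)
    if mode == "modify":
        return rewritten, search(rewritten), None
    hits = search(query)
    if mode == "conditional_modify" and not hits and rewritten != query:
        return rewritten, search(rewritten), None  # resubmitted after zero hits
    suggestion: Optional[str] = rewritten if mode == "suggest" and rewritten != query else None
    return query, hits, suggestion


# Toy stand-ins for the index and for a spell-checking transformation.
index = {"japan": ["doc1", "doc2"]}
search = lambda q: index.get(q, [])
transform = lambda q: {"japn": "japan"}.get(q, q)

modified = run_query("japn", "modify", search, transform)
conditional = run_query("japn", "conditional_modify", search, transform)
suggested = run_query("japn", "suggest", search, transform)
untouched = run_query("japan", "conditional_modify", search, transform)
```

In the suggest case, the misspelled query is executed as entered and the corrected form is returned alongside the (empty) result, leaving the decision to the application or the user.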

For details, refer to the Query Language and Parameters Guide.


FAST Query Language

The FAST Query Language (FQL) is used to express query terms, operators and query modes/options. This is further described in the Query Language and Parameters Guide.

In addition to the FAST Query Language (FQL), FAST ESP provides two alternative query languages, the Simple Query Language and the Advanced Query Language. These query languages are included for backwards compatibility and do not support all features provided in FAST ESP. Refer to Backwards Compatibility in the Query Language and Parameters Guide for information.

You can use the FAST Query Language to perform exact searches and to narrow the scope of your search to values belonging to a specific FAST ESP field, composite field or scope field.

A query language expression may contain a number of nested sub-expressions of one or more of the following types:

• query term: A query term consists of one or more words, strings or numeric values. (A query consists of one or more query terms.)

• scope specification: A scope specification limits the possible matching sections of the documents to a specific field, composite field or a scope structure within the field or composite field.

• operators: Operators may apply boolean operations (AND, OR, etc.), define certain constraints to the operands (for example, filter()), apply explicit proximity constraints (max word distance between matching terms), apply numeric range operations, or specify data types and attributes to the data (such as linguistics operations).

Result Processing

After the query sent by the end-user has been processed, it is passed on to the FAST Search Engine, which matches it against the index and returns the list of results to the FAST Query & Result Server.

Result processing includes the following features:

• Category result grouping
• Find Similar
• Field-based categorization
• Query highlighting through teasers
• Duplicate removal

Result Views

A result view includes the information that is returned with each search result.

In its simplest form the result view is a short teaser summarizing the content of a document. However, the result view in FAST ESP is completely configurable and may contain a smaller or larger set of fields from the initial document. In certain cases, such as database indexing, it is convenient to provide all the indexed fields of each database record in the result view, so that the customer application may present the data in various ways without the need for retrieving the database record once more.

When defining the index profile for a certain collection, you specify which fields are to be returned as part of result views. This configuration impacts the total size of the index, as this information will reside on disk within the index. Based on that, it is possible to define different result views that can be applied to a query. Which of these specified result views to apply when a result set actually is to be presented, is specified by a query parameter.

The definition and selection of field views impacts the amount of data returned from a query. Therefore, more information in the result view implies more bandwidth used between FAST ESP and the customer application, and will also have some minor performance constraints.


Query Result Highlighting through Teasers

The result view may include a teaser field. Teasers allow you to highlight important parts of a query result. A teaser is a summary field that is generated in order to be used as a general result summary of documents in the result set presentation. Two types of teasers are supported.

For details about defining teasers in the index profile, refer to the Configuration Guide.

Type of Teaser: Static teaser

Description: This is a generated summary field that is convenient to use when presenting results from, for example, web pages or text documents. This teaser is created during document processing, and typically analyzes an HTML document, extracting a few lines of text that reflect the most relevant content of the document.

Type of Teaser: Dynamic teaser

Description: This is a generated summary field that enables presentation of a document extract in context with the search query. The text of the document body is used during result processing in order to retrieve the text segments that include the best matches of the query. In most cases the dynamic teaser provides a more relevant text for the result pages than the static teaser. The relevancy of a text segment is determined by (in decreasing order of significance):

1. phrase matching.
2. completeness: The more search terms a text segment contains, the more relevant this text segment is.
3. proximity: Text segments that contain query terms that occur near each other are more relevant than others.
4. position: The earlier a text segment containing one or more search terms occurs in the document, the more relevant it is compared to the others.

Which teasers to use in a result view is specified in the result view sections of the index profile for the respectivesearch engine cluster.You can specify one teaser per result view; in addition, you may specify a field to beused as a fallback teaser field in case the generation of the original teaser field fails.

Query Result Highlighting in Source Document

FAST ESP also enables Query Result Highlighting in the source document. See Making Documents Searchable, section Query Highlighting in Source Documentation for more information. Refer also to the FAST SDK Document Highlighting Guide for details.

The FAST Search Front End (SFE)

The Search View selection in the Administration interface (Admin GUI) allows you to view the default Search Front End (SFE) provided with FAST ESP. This front end lets you search the documents of your implementation for testing purposes.


Figure 6: Example of the Search Front End (SFE)

Refer to the SFE User's Guide for more information about how the SFE works. Refer also to Default Search Front End Features in the Query Integration Guide.


Chapter 9

Geo Search

This chapter provides an overview of the FAST ESP Geo Search feature. The Geo Search feature provides capabilities for sorting and filtering query results based on geographical location.

Topics:

• GEO Search Overview

GEO Search Overview

The Geo Search feature provides capabilities for filtering, sorting and boosting query results based on geographical location.

To enable the geo search feature, you must provide location specific information for each document on the content access side. You can add several sets of coordinates for a single document, to imply that the document provides relevant information for all the specified locations.

The location specific information can be added as metadata from the content source, or added during document processing based on analysis of the content, mapping from URL, etc. In some cases the processing may require some interaction with an external, customer specific application or database. Note that the location information must be in the form of one or more longitude/latitude pairs prior to indexing. The geographical coordinate information is indexed using optimized geo index structures for high performance searching. The fields/elements used for geographical coordinates are configured in the Index Profile.

On the query side it is possible to filter and sort the result set, using the end-user location and the geographical distance between end-user position and the positions associated with each document in the index. The user can also specify an alternative center coordinate along with the end-user location. All sorting and filtering will then be performed using the end-user location, but the distances shown in the result set will be calculated using the alternative center coordinate. The latter approach may, for instance, be useful in cases where the displayed distance shows the distance from the current user location, while the sorting/filtering is based on a displayed map extract where the user is not located in the center of the map.

Filtering of the result set can be based on a radius (a circle) or a square box. The square box is typically used in association with a map presented on the result page. Both the radius and the box size are configurable per query, and are specified using dedicated query parameters (in addition to the query string). When sorting the result set based on the distance from the end-user location to the hits, the hits closest to the end-user position appear on top of the list. When no sorting is specified, the result set is sorted according to the dynamic rank values. Use the geo boosting feature to combine the two: By boosting the dynamic rank values with an offset based on the distance, the most relevant hits (in terms of both dynamic rank and distance) will appear closer to the top.

Important: When using geo boosting in combination with the stopword threshold feature, it may happen that a hit very close in distance still ends up at the end of the result set. The reason is that the stopword threshold defines a limit, a maximum number of matching documents, where the system stops computing a dynamic rank value for the given search term. Instead it sets the dynamic rank value to zero, diminishing the boosting effect. Refer to the Configuration documentation for details.


A typical location application would enable drilling down on distance from the end-user location. Such a drill-down would not be implemented using the FAST ESP Dynamic Drill-Down feature, but instead the query application may provide the end-user with +/- selections or similar which map to a different restriction on distance and/or area.

Examples

In the illustration above, C1 indicates the end-user’s geographical position, and the bullets 1-9 indicate the geographical position associated with 9 different documents in the index. One document, 7, contains two different locations (for example, two different offices for the same company).

Filtering

When filtering using a radius is enabled, the user can specify a center coordinate C1 and a radius r, and only documents within the given radius will be included in the result set. In the illustration, this would include documents 4, 6, 7, and 8.

When filtering using a box is enabled, the user can specify an area defined by coordinates B1 and B2, and only documents with coordinates within the box will be included in the result set. In the illustration, this would include documents 2, 3, 4, and 6.
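The two filter types can be sketched in Python as follows. This is an illustration under stated assumptions, not the FAST ESP geo index: coordinates are (latitude, longitude) tuples in degrees, distances use the haversine great-circle formula, the box test ignores boxes that cross the 180° meridian, and all names and sample documents are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, h)))


def nearest_km(center, coords):
    """A document may carry several coordinates; the closest one counts."""
    return min(haversine_km(center, c) for c in coords)


def within_radius(docs, center, radius_km):
    return [doc_id for doc_id, coords in docs.items()
            if nearest_km(center, coords) <= radius_km]


def within_box(docs, b1, b2):
    """Keep documents with at least one coordinate inside the box B1/B2."""
    lat_lo, lat_hi = sorted((b1[0], b2[0]))
    lon_lo, lon_hi = sorted((b1[1], b2[1]))
    return [doc_id for doc_id, coords in docs.items()
            if any(lat_lo <= lat <= lat_hi and lon_lo <= lon <= lon_hi
                   for lat, lon in coords)]


docs = {
    "oslo_office": [(59.91, 10.75)],
    "bergen_office": [(60.39, 5.32)],
    "two_offices": [(63.43, 10.40), (59.95, 10.76)],  # one document, two locations
}
c1 = (59.91, 10.75)
in_radius = within_radius(docs, c1, 50)                # 50 km around C1
in_box = within_box(docs, (60.0, 5.0), (61.0, 6.0))    # box around Bergen
```

Applying both functions in sequence gives the combined radius-and-box filter.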

Sorting

When sorting on distance is enabled, the user can specify a center coordinate C1 in the query, and the result set will be sorted in either ascending or descending distance from C1. If a document contains more than one coordinate, the coordinate with the shortest distance to C1 will be used. In the illustration, document 7 has the shortest distance to C1 (location 7₁), followed by 6, 4, 8 and so on.

If you want the dynamic rank value to influence the sorting of the result set, as well as the distance, you can use the boosting feature instead of regular sorting. For example, if the documents 6 and 7 in the illustration are given the rank values 0 and 1 respectively, and documents 8 and 9 the value 2, this will be the result depending on the sorting criteria:

Sorting Criteria       Result Set Order

Distance               (7, 6, 8, 9) or (9, 8, 6, 7)

Dynamic rank values    (9, 8, 7, 6) or (8, 9, 7, 6)
                       Note that the order of documents with the same rank value is not defined.

Boosted rank values    (7, 8, 9, 6)
                       Note that the order of the documents depends on the weight of the boosting.
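The dependence on the boost weight can be illustrated with a small Python sketch. The rank values are those from the example; the distances and the boost formula (rank plus weight divided by one plus distance) are assumptions chosen only so that the orderings can be reproduced. The actual FAST ESP boost function is not described in this guide.

```python
# Dynamic rank values from the example above.
ranks = {6: 0, 7: 1, 8: 2, 9: 2}

# Assumed distances from C1 (document 7 closest, then 6, 8, 9).
distance_km = {7: 1.0, 6: 2.0, 8: 3.0, 9: 4.0}


def boosted(doc, weight):
    """Hypothetical boost: the offset shrinks as the distance grows."""
    return ranks[doc] + weight / (1.0 + distance_km[doc])


def order_by_boost(weight):
    return sorted(ranks, key=lambda d: boosted(d, weight), reverse=True)


by_distance = sorted(ranks, key=distance_km.get)   # [7, 6, 8, 9]
moderate = order_by_boost(6.0)                     # [7, 8, 9, 6]
heavy = order_by_boost(100.0)                      # [7, 6, 8, 9]
```

With a moderate weight the result matches the boosted order in the table; with a very large weight the distance dominates and the order converges to plain distance sorting.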

Using Alternative Center Coordinate

When the user specifies an alternative center coordinate C2 along with the coordinate C1, the distance D1 in the illustration (between C1 and document 6) would be used for sorting and filtering the result set, but it is distance D2 that will be displayed in the actual result.

Combining Features

It is possible to combine the two approaches to filtering as well as sorting. In the illustration, filtering using the radius r and the box defined by B1 and B2, includes only documents 4 and 6 in the result. Then you can sort the filtered result set based on distance.


Chapter 10

Scope Search and Dynamic XML Indexing

This chapter explains what Scope Search is and how it works. It also explains what dynamic XML indexing is and describes its core capabilities.

Topics:

• Scope Search Overview
• Scope Search vs. Fielded Search
• Scope Search Concepts and Capabilities
• Dynamic XML Indexing

Scope Search Overview

Scope Search is a feature that enables search in hierarchical content structures without a need to know the index schema in advance. Using Scope Search makes searches more precise than searching using a standard index schema. It allows you to specify hierarchies to be used as the basis for identifying exactly what kind of information you want to extract, and how you want it to be presented. When using scope search, it does not matter how this schema is defined.

Definition of a Scope

A scope is an entity or object that has a name and has content. A scope can have sub-scopes. When a search is performed on a scope, the search is done in the content of the scope and all its sub-scopes.

Fields can support nested scopes or sub-fields. A field is a top-level, root scope. Examples of fields are: content (body), authors, or metadata. Multiple fields can be defined. A nested scope or sub-field is an element within that root scope.
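The definition of a scope can be pictured as a small recursive data structure. The Python sketch below is a conceptual model only; the class name, its attributes, and the whitespace tokenization are assumptions and do not reflect how FAST ESP stores scopes internally.

```python
from dataclasses import dataclass, field


@dataclass
class Scope:
    """A named entity with content and optional sub-scopes."""
    name: str
    text: str = ""
    children: list = field(default_factory=list)

    def contains(self, term):
        """Search this scope's content and, recursively, all sub-scopes."""
        term = term.lower()
        if term in self.text.lower().split():
            return True
        return any(child.contains(term) for child in self.children)


# A root scope (field) named "book" with nested Authors/Author scopes.
authors = Scope("Authors", children=[
    Scope("Author", "Ada Lovelace"),
    Scope("Author", "Alan Turing"),
])
book = Scope("book", children=[Scope("Title", "Search Basics"), authors])

found_in_book = book.contains("turing")        # searches all sub-scopes
found_in_authors = authors.contains("basics")  # the title is outside Authors
```

Searching the root scope reaches every nested Author scope, while searching the Authors scope does not see content that lives elsewhere in the book.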

Figure 7: Scope Example

This example shows a simple scope field within a searchable document. The indicated scope field in the Index Profile is named book, and contains Authors elements, which in turn contain one or more Author elements.

Scope fields, normal (text/numeric) fields, and composite fields can be combined within the same Index Profile.

Example of Using a Scope Search

The following example shows an excerpt from the play “Hamlet” in XML form. The expanded elements have a “-” in the margin and the collapsed elements have a “+”.

FQL syntax to get the famous speech from Hamlet is:

- <PLAY>
    <MAINTITLE>The Tragedy of Hamlet, Prince of Denmark</MAINTITLE>
  + <FM>
  + <PERSONAE>
    <SCNDESCR>SCENE Denmark.</SCNDESCR>
    <PLAYSUBT>HAMLET</PLAYSUBT>
  + <ACT>
  + <ACT>
  - <ACT>
      <TITLE>ACT III</TITLE>
    - <SCENE>
        <TITLE>SCENE I. A room in the castle.</TITLE>
        <STAGEDIR>Enter KING CLAUDIUS, QUEEN GERTRUDE, POLONIUS, OPHELIA, ROSENCRANTZ, and GUILDENSTERN</STAGEDIR>
      + <SPEECH>
      + <SPEECH>
      + <SPEECH>
      + <SPEECH>
      + <SPEECH>
      + <SPEECH>
      + <SPEECH>
      + <SPEECH>
      + <SPEECH>
      + <SPEECH>
      + <SPEECH>
      + <SPEECH>
        <STAGEDIR>Exeunt ROSENCRANTZ and GUILDENSTERN</STAGEDIR>
      + <SPEECH>
      + <SPEECH>
      + <SPEECH>
        <STAGEDIR>Exit QUEEN GERTRUDE</STAGEDIR>
      + <SPEECH>
      + <SPEECH>
      + <SPEECH>
        <STAGEDIR>EXEUNT KING CLAUDIUS and POLONIUS</STAGEDIR>
        <STAGEDIR>Enter HAMLET</STAGEDIR>
      - <SPEECH>
          <SPEAKER>HAMLET</SPEAKER>
          <LINE>To be, or not to be: that is the question:</LINE>
          <LINE>Whether ‘tis nobler in the mind to suffer</LINE>
          <LINE>The slings and arrows of outrageous fortune,</LINE>
          <LINE>Or to take arms against a sea of troubles,</LINE>
          <LINE>And by opposing end to them? To die: to sleep;</LINE>
          <LINE>No more; and by a sleep to say we end</LINE>

How Scope Search Works and Why It is Used

Hierarchical content is represented as a hierarchy of scopes inside the FAST ESP index.

Scope Search in FAST ESP is based on the following:

• Scope Indexing provides a scope-aware indexing of content with hierarchical structure, enabling efficient search in scope structures. Scope Indexing is generic in the sense that it does not require any specific content input format. FAST ESP supports XML input format; other input may be supported by creating custom document processors.
• Dynamic XML Indexing provides a mapping from any XML to the internal FAST Scope structure.
• The FAST Query Language (FQL) is a query language which supports scope search queries.

Scope Search can be used for:

• Indexing customer XML content without any knowledge of the DTD/schema. FAST ESP includes a dynamic XML pipeline that maps submitted XML to one or more scope fields.
• Indexing a more dynamic field structure using the Scope Search framework. In this case it is possible to change the field structure without changing the Index Profile, and XML is used as an intermediate format in order to submit structured data to the system.

Scope Search vs. Fielded Search

Scope Search Strengths

• Entities tagged in their original position (instead of stored in global meta fields)
• Search and navigation precision
• Search within context (sentence, paragraph, etc.)
• Search for entity or existence of entity
• Return specific scopes (for instance sentences) instead of teaser
• Navigation menus generated from local context only
• Full schema flexibility
• Preserve existing hierarchical document structure, for instance XML docs

Normal Search Strengths

• Performance
• Deep navigation (accurate counts based on full index)
• Current relevancy model is really targeted for documents (not sentences or paragraphs)

Scope search and normal search are often used together, with normal global meta fields (such as document date), and the body content as a scope field. You can refer to both scope fields and normal fields in the same FQL query.

Scope Search Concepts and Capabilities

The core concepts and capabilities in Scope Search are described in this topic.

Scope Fields

The FAST ESP Indexer is based on a field structure that defines the schema of the indexed content. The schema is defined using the Index Profile. The Scope Search feature is facilitated by introducing a new field type in the FAST ESP index, named scope field. Hence, a scope-enabled index may include different types of fields.

A scope-enabled index may include the following types of fields:

• Basic field. A basic field may be of type string (any textual content), int32 (32-bit signed integer), uint32, float, double, or datetime (representing a date/time value as a numeric value in the index).
• Composite field. A composite field includes a set of basic string fields that can be matched using the built-in dynamic ranking mechanisms in FAST ESP.
• Scope field. A scope field contains hierarchical scope content. The individual subscopes of a scope field may be of any data type supported by FAST ESP (string, int32, float, double or datetime). For textual scopes, a subset of the dynamic ranking mechanisms as provided for composite fields will apply. When defining a scope field, there is no need to define the actual scope structure within the scope field in advance.

A FAST ESP index profile may contain a combination of one or more fields, composite fields and scope fields. Hence, it is possible to combine in one index both schema-based content in fields and dynamic scoped content.

In the query language you may specify individual fields, composite fields or scopes to limit the scope of a query. For scope queries the scope specification in the query must include the scope field name (also called the root scope) and sub-scopes within the indexed scope structure.

A scope field may include a hierarchy of scopes in arbitrary depth.

The Scope Indexing is generic in the sense that it does not require any specific content input format. FAST ESP supports XML input format; other input may be supported by creating custom document processors.

Navigation

Dynamic Drill-down (Navigation) in FAST ESP provides functionality for drilling down into the query result based on value distribution of one or more individual fields.

Note: Dynamic Drill-down can only be applied to non-scope textual or numeric fields, that is, it is not possible to apply this feature to scope fields as such.


If you want to use a specific element from the scope structure for dynamic drill-down, it is possible to extract this element from the source content (for example, an XML element) during content processing prior to the scope mapping and assign this to an individual field in the index, with an associated Navigator specification. In this case the element may still be searchable within the scope structure, but may also be used for drill-down. Refer to section Mapping XML to One or More Scope Fields for further details.

It is also possible to apply a result-side navigator on an extended document summary. When this feature is enabled, dynamic extraction of entities and concepts is enabled on the n first matching documents of the result set, where ‘n’ is by default 100. The entities are extracted from the sections of the documents where the query matches best, i.e. similar to the dynamic teaser.

Scope Data Types

Scopes may be of any supported data type. The data types are supported in a similar way as for individual fields. This means that string scopes support dynamic ranking mechanisms, linguistics, phrasing and wildcards as for composite fields.

Numeric scopes support exact matching and range matching mechanisms as for individual numeric fields.

Numeric matching requires that the query term and target numeric scope are of the same type. Matching a query term of type float with a scope of type double will not return any results. Therefore, it is required to apply consistent data typing in queries. Such data typing can be applied using explicit type conversion (for example, double(24.5)) or implicit default typing based on the term format.

When querying non-scope numeric fields the system will know the type for the field, and will perform an automatic type detection based on the indicated field.

Refer to Numeric Operators in the Query Language and Parameters Guide for details on literals and explicit type conversion.

Query Language in Scope Search

Scope queries are only supported within the FAST Query Language (FQL).

The Simple Query Language and the Advanced Query Languages do not support scope queries, but may be used for individual fields and composite fields even if the index also includes scope fields. Refer to the Query Language and Query Parameters Guide for more information.

The scope query root:date:foo will search for the term ‘foo’ in all scopes with name date within the scope field named root, including all sub-scopes to such date scopes.
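The semantics of the query root:date:foo can be mimicked on a toy structure. The following Python sketch only simulates the described behavior; the dictionary layout and the function names are assumptions, and nothing here is FQL or part of any FAST ESP API.

```python
def iter_scopes(scope):
    """Yield a scope and all of its sub-scopes, depth first."""
    yield scope
    for child in scope.get("children", []):
        yield from iter_scopes(child)


def subtree_has_term(scope, term):
    """True if the term occurs in the scope or in any of its sub-scopes."""
    term = term.lower()
    return any(term in s.get("text", "").lower().split()
               for s in iter_scopes(scope))


def scope_query(fields, field_name, scope_name, term):
    """Simulate <field_name>:<scope_name>:<term> on the toy structure."""
    root = fields[field_name]
    return [s for s in iter_scopes(root)
            if s["name"] == scope_name and subtree_has_term(s, term)]


fields = {
    "root": {"name": "root", "children": [
        {"name": "date", "text": "foo 2008"},
        {"name": "event", "children": [
            {"name": "date", "children": [
                {"name": "note", "text": "see foo"},
            ]},
        ]},
        {"name": "title", "text": "foo bar"},
    ]},
}
hits = scope_query(fields, "root", "date", "foo")
```

Both date scopes match, the second one through its sub-scope, while the title scope is ignored even though it contains the term.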

Return Matching Scopes

Using Scope Search, any scope in your query can be returned. This is referred to as Matching Scopes, and makes it possible to make retrieval more specific by returning a sub-section of a scope, if desired.

Using the example of the Hamlet play, you can have the entire play returned, or you can specify that you want to see the speech that contains the word “question”, which returns only that speech.

Using Matching Scopes, you can search specifically for the speech instead of the entire play.

Return matching scopes can be used to provide data for Navigation as well. Refer to Taxonomy and Navigation, section Navigators for more information.
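The difference between returning the whole play and returning only the matching speech can be shown with Python's standard xml.etree.ElementTree module. This sketch works directly on a shortened, hypothetical version of the Hamlet XML; it illustrates the idea of matching scopes and is not how FAST ESP evaluates a scope query.

```python
import xml.etree.ElementTree as ET

PLAY_XML = """
<PLAY>
  <MAINTITLE>The Tragedy of Hamlet, Prince of Denmark</MAINTITLE>
  <ACT>
    <TITLE>ACT III</TITLE>
    <SCENE>
      <TITLE>SCENE I. A room in the castle.</TITLE>
      <SPEECH>
        <SPEAKER>OPHELIA</SPEAKER>
        <LINE>Madam, I wish it may.</LINE>
      </SPEECH>
      <SPEECH>
        <SPEAKER>HAMLET</SPEAKER>
        <LINE>To be, or not to be: that is the question:</LINE>
        <LINE>Whether 'tis nobler in the mind to suffer</LINE>
      </SPEECH>
    </SCENE>
  </ACT>
</PLAY>
"""


def matching_scopes(root, scope_name, term):
    """Return every element named scope_name whose text contains the term."""
    term = term.lower()
    return [el for el in root.iter(scope_name)
            if term in " ".join(el.itertext()).lower()]


play = ET.fromstring(PLAY_XML)
speeches = matching_scopes(play, "SPEECH", "question")   # only Hamlet's speech
whole_play = matching_scopes(play, "PLAY", "question")   # the entire play
speaker = speeches[0].findtext("SPEAKER")
```

Choosing SPEECH as the returned scope yields one small element; choosing PLAY returns the complete document for the same term.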

Scope Boosting

Selected scopes are assigned a boost level during document processing. There are eight possible levels. The boost level is inherited in descending scopes unless they are explicitly assigned a new boost level.

At query time, the scope boost level is used when calculating the dynamic relevance score (ranking) for terms in a query (together with the term statistics (tf-idf)). For more information on document scope boosting, contact your FAST Account Manager.

Refer to the Configuration Guide for more information on Scope Boosting.
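The inheritance rule for boost levels can be sketched as a simple tree walk. In the Python sketch below, the numbering of the eight levels as 0 to 7, the default level 0, and the dictionary layout are assumptions made for illustration; how the level enters the relevance score is not modeled.

```python
def effective_boosts(scope, inherited=0, path=""):
    """Resolve the boost level of every scope in a tree.

    A scope without an explicit "boost" entry inherits the level of its
    parent; levels are assumed to be numbered 0..7 (eight levels).
    """
    level = scope.get("boost", inherited)
    if not 0 <= level <= 7:
        raise ValueError(f"boost level out of range: {level}")
    here = f"{path}/{scope['name']}"
    resolved = {here: level}
    for child in scope.get("children", []):
        resolved.update(effective_boosts(child, level, here))
    return resolved


article = {"name": "article", "children": [
    {"name": "title", "boost": 5, "children": [{"name": "subtitle"}]},
    {"name": "body", "children": [{"name": "footnote", "boost": 1}]},
]}
levels = effective_boosts(article)
```

The subtitle inherits level 5 from the title, while the footnote overrides the level it would otherwise inherit from the body.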


Dynamic Document Summary (Teasers)

The dynamic document summary (teaser) is a short abstract of the matching document where the matching terms are highlighted within the context.

FAST ESP supports creating dynamic document summaries for non-scope text fields and scope fields. The dynamic document summary for a scope field only displays matching contexts with query terms that are within the same scope field.

The dynamic document summary will, by default, highlight query matches within all string scopes of the scope field. Full scope search is supported for FAST ESP.

• For scope fields the default is to return the matching scopes as valid XML, inside a <matches><match>...</match></matches> envelope. Up to the 100 most relevant matches are returned by default; this limit is configurable. This means that the document summary may highlight string scopes that do not match the sub-scope specification but match the query terms.
• For normal fields based on scope field input, the default is to generate a normal teaser that is scope aware.
• For normal text fields without markup in the document summary source, teasers are not scope aware.
• The dynamic document summary identifies sub-scope boundaries, so that each unique text segment within a document summary is within one scope.

Metadata within the query (using the filter operator) will be considered during matching, but not highlighted. In other words, matches that do not “pass” the filter will not be presented in the dynamic teaser.

A scope field can be configured to fall back to a “static” (query independent) document summary. In addition, there are several index profile features (source-ref, default-result, dynamic-type) related to the dynamic summary generation. Refer to the Configuration Guide for more information.

Linguistics and Scope Search

Scope fields support the standard FAST ESP linguistic features. In addition, Word Stacking is supported for lemmatization and synonyms. Refer to the Advanced Linguistics Guide for information on the FAST ESP linguistics features.

Word Stacking and Normalization

Word Stacking is supported for scope fields. The concepts of Word Stacking and Normalization are explained briefly here. Refer to the Advanced Linguistics Guide for more details.

The FAST ESP Scope Search Index supports indexing multiple variations of the same word (token) at the same word position within the index. This concept is called Word Stacking. Word Stacking enables you to index multiple normalization variants for the same word, e.g. original form, lemmatized, lowercased, de-accentuated. This is a more flexible alternative to the traditional indexing approach, which applies a uniform normalization to all words in documents and queries. Uniform normalization means that you need to select one level of normalization for all content (e.g. lowercasing and accent removal). This works well in most cases, but removes the ability to select higher precision for the queries (e.g. case sensitive search).

When using Word Stacking it is possible to select the desired level of normalization on a per query basis. This in turn enables the following advanced linguistics features:

• Phrase or proximity (NEAR/ONEAR) queries in association with wildcards, lemmatization and character normalization (accent normalization)
• Per query selection of accent/case sensitive/insensitive matching
• Efficient handling of lemmatization in a multi-lingual environment. If you do not know the language of the query, and the index contains content in multiple languages, then a simple normalization to the base form (based on the default language setting) is not appropriate, as this can introduce ambiguities. Instead the linguistic query processing expands the query term to an OR between the original word and the base form of the word.
• Efficient handling of character normalization in a multi-lingual environment. Example: French publications will include all appropriate accents for French language words whereas English language documents containing French words often omit the accents. Suppose a French document contains the phrase "Côte d'Azur" and another English document contains the phrase "Cote d'Azur" without the accent. In the first document, "Côte" would be indexed as the two variants "côte" and "cote". A user query of "Côte d'Azur" would hit only the first document (if selecting accent sensitive search), but a user query of "Cote d'Azur" would hit both documents.
• Normalization, that is, replacing a character or sequence of characters, is not always sufficient. If we simply normalized both versions to "cote" this would be less acceptable to French-speaking users because the accents differentiate "côte" meaning "coast" from "côté" meaning "side". In this case precision would be lost.
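The Côte d'Azur example can be reproduced with a toy positional index in which each word position holds a set of variants. This Python sketch is only a model of the Word Stacking idea; the function names, the two variants per token (lowercased, and lowercased without accents), and the whitespace tokenization are assumptions.

```python
import unicodedata


def strip_accents(text):
    """Remove combining marks, e.g. turn the accented o in Côte into o."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(token):
    return unicodedata.normalize("NFC", token).lower()


def stack(token):
    """Variants stored at one word position: exact and accent-free form."""
    exact = normalize(token)
    return {exact, strip_accents(exact)}


def index_document(text):
    return [stack(token) for token in text.split()]


def phrase_match(index, phrase, accent_sensitive=True):
    """Match a phrase position by position against the stacked variants."""
    terms = [normalize(t) for t in phrase.split()]
    if not accent_sensitive:
        terms = [strip_accents(t) for t in terms]
    n = len(terms)
    return any(
        all(terms[j] in index[i + j] for j in range(n))
        for i in range(len(index) - n + 1)
    )


french = index_document("La Côte d'Azur")
english = index_document("The Cote d'Azur")
```

An accent sensitive query for "Côte d'Azur" matches only the French document, while "Cote d'Azur" matches both, because the accent-free variant is stacked at the same position in the French document.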

Refer to the Advanced Linguistics Guide and the Configuration Guide for information on how to configure linguistics normalization features.

Partial Updates

FAST ESP supports partial updates on scope fields as a whole, but not sub-trees within the scope field. This means that it is possible to update a scope field and nothing else in the document which results in less content being fed through the document processing pipeline. This can be particularly useful when using Connectors, for example.

Dynamic XML Indexing

Dynamic XML Indexing implies mapping of XML content to the FAST ESP scope Indexing framework.

FAST ESP provides document processors that can be configured to map any XML structure to a FAST ESP scope structure. The document processor can be configured to map one or more input document elements containing XML content to corresponding scope fields.

The document processor does not take into consideration the DTD, but will map all XML elements and attributes to scopes and sub-scopes within the scope field.

By default the scope representation does not differentiate between XML attributes and sub-elements. Both will be represented as sub-scopes. The attribute names are prefixed with a ‘@’ which must be used if using the attribute name in queries. For details refer to Configuring the FAST Document Processing Engine in the Configuration Guide.
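The default mapping of elements and attributes to sub-scopes can be sketched with Python's standard xml.etree.ElementTree module. The dictionary layout and the function name are assumptions; the sketch ignores text that follows a child element and is not the FAST ESP document processor.

```python
import xml.etree.ElementTree as ET


def to_scope(element):
    """Map an XML element to a nested scope dictionary.

    Attributes and sub-elements both become sub-scopes; attribute
    names are prefixed with '@', as described above.
    """
    children = [
        {"name": "@" + name, "text": value, "children": []}
        for name, value in element.attrib.items()
    ]
    children += [to_scope(child) for child in element]
    return {
        "name": element.tag,
        "text": (element.text or "").strip(),
        "children": children,
    }


xml_doc = '<book isbn="123"><title lang="en">Search</title></book>'
scope = to_scope(ET.fromstring(xml_doc))
sub_scope_names = [child["name"] for child in scope["children"]]
title = scope["children"][1]
```

In the resulting structure the isbn attribute and the title element sit side by side as sub-scopes of book, and the lang attribute becomes a sub-scope of title.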

There are two document processing pipeline templates that support XML to scope mapping: one that feeds the XML structure as-is, and one that extracts entities and adds further scopes. They are called “Lightweight XML” and “XML” respectively. By default, the pipeline expects that the XML is included in the data document element. This is also the default option when using the File Traverser.

All XML elements and attributes are indexed as text (type ‘string’) by default. It is, however, possible to specify a data type for elements using a pre-defined attribute. The name of this attribute is configured in the document processor. The values for this attribute and the mapping to FAST data types are also configurable.

This default data type support enables typing of elements, not attributes. All attributes will be treated as ‘string’. Other custom data type handling may be implemented by creating a custom document processing stage.

For information on submitting XML and mapping XML, refer to the Content Integration Guide and the Configuration Guide.


Chapter 11

Taxonomy and Navigation

What appears in the result set of a search, and how it is displayed, depends on a number of factors. How data is structured, and how the system is set up to display results affect what the user sees after submitting a search. Navigation, taxonomy, and unsupervised clustering make it possible for users to have different views on the result sets.

Topics:

• Taxonomy and Navigation Overview
• Navigators
• Taxonomy
• Unsupervised Clustering

Taxonomy and Navigation Overview

What appears in the result set of a search, and how it is displayed, depends on a number of factors. How data is structured, and how the system is set up to display results affect what the user sees after submitting a search. Navigation, taxonomy, and unsupervised clustering make it possible for users to have different views on the result sets.

• Navigators allow users to view a list of values or ranges.
• Taxonomy is an organized classification structure that groups documents by category.
• Unsupervised clustering allows for automatic grouping of similar documents in the result set and suggested naming of these groups or clusters.

Navigators

Navigators provide functionality for drilling down into the query results based on value distribution of one or more individual fields. It is possible to apply navigators to all fields, or just some fields, from, for example, a database or product catalog. FAST ESP supports navigation on scope fields and non-scope document fields, both textual, like product name, and numeric, like price attributes.

Different types of navigators can be applied depending on the field types.

Refer to Navigators in the Query Integration Guide for more information.

Field Navigators

In FAST ESP, it is possible to perform multi-dimensional navigation in structured data based on facets of the content (such as database rows, product catalog descriptions, etc.).

Navigators are used to limit overhead on the search environment for e-commerce, Yellow Pages, Supply Chain, CRM, etc. Relevant results can be found faster using a combination of searching and browsing by parametric value and range.

Navigators can also be used on taxonomy fields to apply deep navigation (meaning the entire result set) into categories that occur within the results. When used with a taxonomy, each taxonomy node that appears within the result set appears as a navigation entry.

Refer to the Query Integration Guide and the Configuration Guide for details on navigators and how to configure them.

The left column displays the results in the usual ranked order, which is based on FAST ESP static and dynamic rank mechanisms, including the relevant parameter fields. The right column displays the drill-down and binning attributes. The feature is dynamic. The range for numeric values per bin is computed on-the-fly, trying to give a mean distribution of values to displayed range categories. It is also possible to manually specify internal boundaries.
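One way to picture on-the-fly binning is equal-frequency binning, where each displayed range holds roughly the same number of hits. The Python sketch below assumes that this is what an even distribution over range categories means; the actual algorithm used by FAST ESP is not specified here, and the function name and sample prices are hypothetical.

```python
def equal_frequency_bins(values, bin_count):
    """Split numeric values into ranges holding about the same number of hits.

    Returns a list of (low, high, hit_count) tuples. Repeated values may
    make neighboring ranges touch or overlap in this simple version.
    """
    ordered = sorted(values)
    if not ordered or bin_count < 1:
        return []
    bin_count = min(bin_count, len(ordered))
    bins = []
    for i in range(bin_count):
        lo = i * len(ordered) // bin_count
        hi = (i + 1) * len(ordered) // bin_count
        chunk = ordered[lo:hi]
        bins.append((chunk[0], chunk[-1], len(chunk)))
    return bins


lease_prices = [12, 15, 18, 22, 30, 31, 33, 38, 40, 95, 120, 400]
price_bins = equal_frequency_bins(lease_prices, 3)
# -> [(12, 22, 4), (30, 38, 4), (40, 400, 4)]
```

Each tuple could be rendered as one drill-down link; the wide last range shows how a few outliers are absorbed without leaving the other ranges nearly empty.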

For each field range, drill-down links are provided in order to navigate within the displayed value range, for example, "Lease Price 30-40".

It is also possible to reverse an applied navigator, i.e. reversing filtering criteria.

The navigation parameters are sometimes denoted faceted metadata, and may apply for applications such as:

• Product databases may have attributes such as price, weight, color, country of origin and product type.
• Music store: songs have attributes such as artist, title, length, genre and date.
• Recipes: cuisine, main ingredients, cooking style and holiday.
• Travel site: articles have authors, dates, places, prices.
• Regulatory documents: product and part codes, machine types and expiration dates.
• Image collection: artist, date, style, type of image, major colors and theme.

An indexed field or attribute can be seen as a dimension in which the query can be refined. The search results are examined on the fly, and data is produced that can be rendered in the form of hyperlinks. This will help the user navigate to find what he or she is looking for by modifying the query.

This is especially relevant in the context of shopping search, where the searchable index is a database or product catalog. The fields indexed for each product may vary according to the type of the product. By supplying a navigational aid on top of the search engine that is adaptive to the search results for the user’s query, relevant results can be found faster.

Deep and Shallow Navigators

Field navigators can be deep or shallow. Shallow navigators are based on values specified in flat fields, and in scope fields. Deep navigators are based on values specified in flat fields only.

Deep navigators reflect the entire result set and usually require re-indexing when a new navigator is added. This type of navigator is recommended for all commonly used navigators associated with individual fields (not scope fields).

Shallow navigators work immediately after being defined. They are based on a smaller number of results than deep navigators. Shallow navigation is used when it is not convenient to keep aggregation data in main memory within the search nodes.

Scope navigators are always provided as shallow navigators, as they are based on matching scopes only (not known at index time).
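The difference in coverage between deep and shallow navigators can be illustrated by counting field values over all hits versus over the first few hits only. This Python sketch shows the effect on the counts, not the mechanism: the function name, the sample_size parameter, and the sample hits are assumptions, and the index-time aggregation data that deep navigators rely on is not modeled.

```python
from collections import Counter


def navigator(hits, field, sample_size=None):
    """Count field values over the ranked hits.

    sample_size=None mimics a deep navigator (entire result set);
    a number mimics a shallow navigator (first hits only).
    """
    considered = hits if sample_size is None else hits[:sample_size]
    return Counter(hit[field] for hit in considered if field in hit)


ranked_hits = [
    {"id": 1, "color": "red"},
    {"id": 2, "color": "red"},
    {"id": 3, "color": "blue"},
    {"id": 4, "color": "green"},
    {"id": 5, "color": "blue"},
    {"id": 6, "color": "blue"},
]
deep = navigator(ranked_hits, "color")
shallow = navigator(ranked_hits, "color", sample_size=3)
```

The deep counts are accurate for the full result set (blue is the most common value), whereas the shallow counts only reflect the top three hits and therefore miss green entirely.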

For more information refer to:

• The Configuration Guide for information on how to configure navigators.
• The Query Integration Guide, Search API Overview chapter for information on navigator interfaces in the Search API.
• The Query Language and Parameters Guide, Query Parameters chapter, for information on navigator parameters in the Search API.

Contextual Navigators

Contextual navigation (also referred to as scope navigation) is applying navigation to scope fields. Scope fields represent the content in a hierarchical structure as opposed to a flat field. It is not necessary to know the index schema in advance.

Applying navigation on scopes lets you limit your search results by narrowing in on a scope such as a paragraph or sentence. The values that are shown in the navigators come from the scope used in the search and not the full document.

Contextual navigators are shallow. It is possible to create navigators over structural elements in the matching scope, as well as on scope attribute values and the content of scopes.

Refer to the Configuration Guide for information on configuring shallow navigators.

Field Navigators for Values in Scope Fields

When you want to use an element from a scope structure for navigation, it is possible to extract the element from the source content (such as an XML element) during content processing prior to the scope mapping.

The extracted element can be assigned to an individual field in the index with an associated Navigator specification. The element is still searchable within the scope structure, and is also used for navigation (drill-down).

If you have, for example, scope fields for product codes, you can put all the product codes in a flat field in order to get navigators on them. Otherwise, the product codes would all have to occur in the same context in order to be seen.

If you want to create a field navigator for values in a scope field, you can extract the values during document processing, put them in a flat field, and associate the navigators with that field.

Taxonomy

A taxonomy is an organized classification structure that groups documents by category. A document could, for example, belong to the category sports, or news. Categorization is the process of mapping documents to specific categories.

FAST ESP lets you configure and maintain taxonomies and the mapping of categories. It is also possible to apply navigation to taxonomies.

When a set of results is returned, a taxonomy tree is created, which lets you browse information by category.

Figure 8: Example of Taxonomy Tree

FAST Taxonomy Explorer

The FAST Taxonomy Explorer, an optional ESP taxonomy management tool, contains categorization based on advanced linguistic technologies which classify documents and organize information into a hierarchical or a flat set of categories.

The categorization process inserts category tags into the documents prior to indexing. This is done in several ways. Refer to the Taxonomy Explorer Guide for more information.

When the documents in an index have been categorized, end users can restrict a query to a specific category in that index.

Figure 9: Example navigation using a taxonomy

Applying a taxonomy gives a category view of the result set.

FAST Classifier

The FAST Classifier provides a framework for training-based classification that can be used when there is a sufficient set of documents pre-tagged with category information. Refer to the FAST Classifier Guide for more information.

Unsupervised Clustering

If there is no taxonomy information associated with documents, it is possible to set up the system to automatically suggest a category for a document in the result set. This is referred to as unsupervised clustering. Unsupervised clustering is a kind of automatically generated taxonomy.

Creating Taxonomy on the Fly

Unsupervised clustering means that documents are clustered (“grouped” or “categorized”) based on how similar they are rather than using static taxonomy information. Similarity is calculated by comparing document vectors, which are lists of prominent words in the document.

Document vectors are representations of the unstructured textual content that is associated with a document. Vectorization is the process of computing document vectors and is performed as part of document processing using the Vectorizer document processor. This is a standard part of all document-related processing pipelines in FAST ESP.

When a set of documents has been put into a cluster, appropriate name(s) or label(s) for the group are calculated based on the terms in the document vectors. Refer to Configuring Similarity Vector Creation in the Configuration Guide for information on vectors.
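As a rough illustration of comparing document vectors and labeling a cluster, the following Python sketch uses cosine similarity over term-weight dictionaries. Cosine similarity is an assumption made for the example; this guide does not state which similarity measure FAST ESP uses. The function names, weights, and sample vectors are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cluster_label(vectors, top=2):
    """Label a cluster with its heaviest terms, summed over all members."""
    totals = {}
    for vector in vectors:
        for term, weight in vector.items():
            totals[term] = totals.get(term, 0.0) + weight
    # Sort by descending total weight, then alphabetically for stable output.
    ranked = sorted(totals, key=lambda term: (-totals[term], term))
    return ranked[:top]

# Hypothetical document vectors: prominent words with weights.
doc_a = {"football": 0.9, "league": 0.6}
doc_b = {"football": 0.8, "goal": 0.5}
doc_c = {"election": 0.9, "vote": 0.7}
```

Here `doc_a` and `doc_b` share a prominent term and score as similar, while `doc_c` shares no terms with either. A cluster containing `doc_a` and `doc_b` would be labeled with the terms that carry the most combined weight.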

Chapter 12

Advanced Linguistic Processing

This chapter introduces you to the basic concepts of advanced linguistic processing.

Note: Refer to the Advanced Linguistics Guide for information on configuration and customization of linguistics features.

Topics:

• Linguistics Overview
• Linguistics and Relevancy
• Dictionaries
• Automatic Language Detection
• Lemmatization
• Synonyms and Spell Variations
• Advanced Phrase Recognition
• Spell Checking and Phrase Recognition Framework
• Anti-Phrasing
• Sub-String Search
• Wildcard Search
• Special Characters and Accents

Linguistics Overview

Here you will find information that introduces you to the basic concepts of advanced linguistic processing. Refer to the Advanced Linguistics Guide for information on configuration and customization of linguistics features.

Linguistics and Relevancy

In search, linguistics is defined as the use of information about the structure and variation of languages so that users can more easily find relevant information. The document’s relevancy with respect to a query is not necessarily decided on the basis of words common to both query and document, but rather on the extent to which its content satisfies the user’s need for information.

Linguistics tools determine the intent behind keywords. For example, a user searching for MP3 player would be interested in a hit that matched iPod. If the site only shows results for the keywords MP3 and Player, a sale could be lost.

In order to achieve relevancy, linguistic processing is performed both at the document level (during document processing) and at the query level (during query and result processing). On the query side, linguistic processing results in a query transformation; on the document side, it results in document enrichment prior to indexing in order to cover grammatical forms and synonyms.

FAST ESP provides a comprehensive set of linguistic features.

Linguistics Concepts

There are a number of basic linguistics concepts that are used throughout the documentation. Understanding these concepts makes it easier to understand how relevancy with respect to linguistics works in FAST ESP. These concepts include: entity extraction, lemmatization, tokenization, normalization, synonym expansion, and spell checking.

Entity extraction is isolating known linguistic constructs, such as proper names or location designators.

Synonyms are words that are related in meaning, such as notebook and laptop. In a search engine, synonym expansion can be performed at query time or index time. Synonym expansion at query time lets the search system administrator modify thesauri when necessary without the need to re-index.

Lemmatization is the aggregation of different word forms to enable search across different forms of the same word (such as products and product). Lemmatization enables searches to match documents with similar meaning, but different word forms in the document or the query. Lemmatization has similar effects to stemming, but is more precise, as it is based on dictionaries.

Tokenization (also called segmentation) is the detection of white space characters and symbols that separate words from each other and are not relevant to the matching process. More complex tokenization is used for CJK languages. For Asian languages, tokenization and lemmatization (by reduction) are combined in one processing step.

Character normalization is the replacement of characters or character sequences with others to enable search across variants of words that differ in accents or other character properties. An example is the mapping of the French é to the unaccented e. Character normalization improves recall, but may have a negative impact on precision. It can be beneficial in languages that have accented characters and other non-ASCII characters that are used inconsistently or in different variations.
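The é to e mapping described above can be sketched in a few lines of Python using Unicode decomposition. This is only an illustration of the concept, assuming that stripping combining marks is an acceptable normalization; it does not represent the normalization rules that FAST ESP applies, and the function name is hypothetical.

```python
import unicodedata

def strip_accents(text):
    """Map accented characters to their unaccented base characters."""
    # NFD splits a character such as é into 'e' plus a combining accent.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks and keep the base characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

With such a mapping, a query for cafe also matches documents containing café, which improves recall. The precision cost is that words distinguished only by accents become indistinguishable.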

Phonetic normalization is normalization using phonetic matching rules and is performed on the query and document side. Terms that are written differently but sound the same can give the same result. For example, if searching for the name Eyvind, a user could type in Eyvind or Oyvind and get the same result. Contact your FAST Account Manager for configuration information.

The Offensive Content Filter is a document analysis tool to filter content regarded as offensive.

The filter is implemented as a separate document processor that can be added to an ESP pipeline.

Refer to the Advanced Linguistics Guide for more information.

Dictionaries

Some linguistic features depend on dictionaries. By default, FAST ESP provides dictionaries for lemmatization, entity extraction, spell checking including proper name and phrase recognition, synonym expansion, and variation expansion.

For details on how to edit dictionaries, refer to Configuring Linguistic Processing in the Configuration Guide, and to the Advanced Linguistics Guide (per feature).

Automatic Language Detection

During document processing, documents can be analyzed to detect the language in which they are written. This functionality is provided by the Automatic Language Detection feature.

Detecting the language of a document is essential to all other linguistic analysis features, as the resulting language information is used to select language-specific dictionaries and algorithms during document processing and query processing.

During language detection, a given document is analyzed for all supported languages. For each language, a score is calculated, based on the occurrences, number, and length of certain test strings. The language(s) that reach the highest score, and for which the score exceeds a preset threshold, are specified as the document languages.

Attention: For queries, the language has to be explicitly set by the end-user or search application, as the query itself generally provides too little context for determining the language it is written in.

Default Language

If the language of the document cannot be determined, a value of "unknown" will be specified for the document element.

The fallback value can be set in the parameter FallbackLanguage in the LanguageAndEncodingDetector.
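The scoring, threshold, and fallback behavior described in this section can be pictured with the following Python sketch. It is a toy model only: the test strings, the scoring formula (occurrences multiplied by string length), the threshold value, and the function name are assumptions made for illustration, and the sketch does not use or represent the LanguageAndEncodingDetector.

```python
# Hypothetical test strings per language; a real detector uses far richer data.
TEST_STRINGS = {
    "en": ["the", "and", "of"],
    "no": ["og", "ikke", "det"],
}

def detect_languages(text, threshold=5, fallback="unknown"):
    """Return the best-scoring language(s), or the fallback value."""
    words = text.lower().split()
    # Score each language by occurrences of its test strings, weighted by length.
    scores = {
        lang: sum(words.count(s) * len(s) for s in strings)
        for lang, strings in TEST_STRINGS.items()
    }
    best = max(scores.values())
    # The best score must exceed the threshold, otherwise fall back.
    if best <= threshold:
        return [fallback]
    return sorted(lang for lang, score in scores.items() if score == best)
```

A text with no recognizable test strings scores zero for every language and receives the fallback value.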

Required Custom Dictionaries

Language detection does not require any custom dictionaries.

Supported Languages

A list of supported languages for automatic language detection can be found in the Advanced Linguistics Guide.

Lemmatization

This section explains the concept of lemmatization. Refer to the Advanced Linguistics Guide for more details on how lemmatization works.

What Lemmatization Means

Generally speaking, lemmatization means the mapping of a word to its base form and/or all its other inflectional forms.

Lemmatization can occur for:

• singular or plural for nouns,
• tense and person for verbs,
• positive, comparative, or superlative forms for adjectives.

Lemmatization makes it possible to submit a query for one form of a word and still get matches that contain a different form of the same word.

This allows a user to search for a term like car and get both documents that contain the word car and documents that contain the word cars.

In contrast to stemming or wildcard search, which would match all documents containing words starting with car, such as cared or career, lemmatization recognizes words as matching terms on the basis of their being inflectional variations of the query word. As a result, lemmatization also takes irregular inflections such as tooth and teeth into account.
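The contrast between dictionary-based lemmatization and prefix matching can be sketched in Python as follows. The lemma table is a tiny hypothetical stand-in for a real lemmatization dictionary, and the function names are illustrative only.

```python
# Hypothetical lemma dictionary: word form -> base form.
LEMMAS = {
    "car": "car", "cars": "car",
    "tooth": "tooth", "teeth": "tooth",
    "cared": "care", "career": "career",
}

def lemma(word):
    """Look up the base form; unknown words map to themselves."""
    return LEMMAS.get(word.lower(), word.lower())

def lemma_match(query, words):
    """Match words that share a base form with the query."""
    return [w for w in words if lemma(w) == lemma(query)]

def prefix_match(query, words):
    """Match words that merely start with the query (wildcard-style)."""
    return [w for w in words if w.lower().startswith(query.lower())]

words = ["car", "cars", "cared", "career", "teeth"]
```

With this table, the lemma-based match for car returns only car and cars, while the prefix match also returns cared and career. The irregular form teeth is found by a lemma-based match for tooth, which no prefix match could achieve.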

Refer to the Advanced Linguistics Guide for more information on lemmatization.

Advanced Phrase Recognition and Lemmatization

Advanced Phrase Recognition is spell checking for phrases. Lemmatization and Advanced Phrase Recognition cannot be applied on the same query term at the same time.

Lemmatization will not be applied to query terms that are recognized as proper names or phrases. These terms are matched only against the usual search index. For example, FAST Search may be included in the list of proper names, which would exclude the inflections fasts and searches in the lemmatized index. Likewise, a search for FAST, recognized as a proper name, will not be expanded.

This means that in a standard FAST ESP configuration and Search Front End, lemmatization is available for the default search index in the any word and all words query modes, but not in the exact phrase mode.

When advanced phrase recognition and lemmatization are applied simultaneously to a query, advanced phrase recognition overrides lemmatization. If your Search Front End offers both as options that can be selected at the same time, it is therefore recommended to present them as mutually exclusive radio buttons.

Synonyms and Spell Variations

Synonyms are words that have the same or similar meaning, for example, live and dwell. In ESP, spelling variations can be viewed as a special case of synonyms.

Synonym Overview

There are two available options for synonym handling: query-side synonym expansion and index-side synonym expansion.

• Query-side synonym expansion. This enables dictionary-based synonym expansion on the query side.
• Index-side synonym expansion. This feature enables synonym expansion similar to applying lemmatization: a document to be indexed is expanded with a defined list of synonyms or spell variations of the words it originally contains.

As with lemmatization by expansion on the document side, the original document is indexed in the original search index, whereas the expanded document is indexed in a separate expanded index. This allows you to control enabling synonym expansion on a per-query basis. You can decide whether a query is to be executed with synonym expansion, in which case the query is sent to the synonym index, or without synonym expansion, in which case the query is sent only to the original index.

Dictionary Management

Dictionaries can be edited with LingStudio or with the dictman tool, both explained here.

Synonym dictionaries in FAST ESP can be edited with LingStudio, an interface for advanced editing of dictionaries and lemmatization strategies. Online help for LingStudio is available through the LingStudio application itself.

You can also edit dictionaries with the Dictionary Management (dictman) Tool. It is a command-line based tool that lets you update, extend, and maintain your dictionaries. The tool can be run interactively or as a batch processor.

It is also possible to edit query-side synonym dictionaries using the Search Business Center.

For procedures on how to use the Dictionary Management Tool, refer to Configuring Linguistic Processing in the Configuration Guide, and to the Advanced Linguistics Guide.

Advanced Phrase Recognition

Advanced Phrase Recognition is based on a mapping of the query terms against a dictionary of names and phrases, whose content you can modify. Advanced Phrase Recognition includes phrase and proper name detection.

Typical proper names are product names, trademarks, product models, part numbers, promotion codes, or stock keeping units. In general, proper names are expressions that are not part of a language's usual vocabulary, such as CJK-400ex. Proper names or phrases can also be words of a language that have a particular semantic value within the content, such as Data Search. In either case, proper names and phrases are protected from lemmatization and anti-phrasing.

Restriction: Note that Advanced Phrase Recognition is not available for Chinese, Japanese, or Korean.

Query Transformations

There is one query transformer that handles the spell checking framework. The didyoumean query transformer (QT) handles the transformation of didyoumean queries, phrases, and words.

Figure 10: The Didyoumean QT is where Advanced Phrase Recognition is handled

Advanced Phrase Recognition applies the following transformations to a query:

• It detects implicit phrases and proper names in the query and phrases them explicitly by adding quotation marks ("the phrase"). This means that the detected phrase is protected from further query transformation. In addition, the query will return phrase matches only. By creating, for instance, a list of product names, you may ensure that queries are directed to the desired pages that match the implicit product name phrase.

• It detects and corrects misspelled phrases and words. Implicit phrases in the query will be spell checked and corrected. If the dictionary contains the phrase "nissan micra", the query "nissan macra" will be detected as a misspelling and corrected to "nissan micra". The spell check will even detect the phrase if both terms are misspelled. The phrase "nisan macra", for instance, would then be corrected to "nissan micra".

• It detects and corrects query terms with alternative spell grouping. If the dictionary contains the term "thinkpad", a query "think pad" will be corrected to "thinkpad". If the dictionary contains the term "alpha server", a query "alphaserver" will be corrected to "alpha server".
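The three transformations listed above can be approximated with a short Python sketch. It is not the didyoumean query transformer: the phrase list is hypothetical, the similarity measure from Python's `difflib` module is a stand-in for whatever matching FAST ESP performs, and the `cutoff` value is an arbitrary assumption.

```python
import difflib

# Hypothetical phrase dictionary.
PHRASES = ["nissan micra", "thinkpad", "alpha server"]

def recognize(query, cutoff=0.8):
    """Return the query with a recognized phrase quoted, or unchanged."""
    q = query.lower().strip()
    # 1. Exact match: phrase the query explicitly.
    if q in PHRASES:
        return f'"{q}"'
    # 2. Alternative spell grouping: compare with all spaces removed.
    squeezed = {phrase.replace(" ", ""): phrase for phrase in PHRASES}
    key = q.replace(" ", "")
    if key in squeezed:
        return f'"{squeezed[key]}"'
    # 3. Misspelling: pick the closest dictionary phrase, if close enough.
    close = difflib.get_close_matches(q, PHRASES, n=1, cutoff=cutoff)
    return f'"{close[0]}"' if close else query
```

The quoted result mirrors the explicit phrasing step: once a phrase is placed in quotation marks, later processing stages can treat it as protected.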

Advanced Phrase Customization

Phrase dictionaries are customizable. In addition, a list of exceptions allows you to fine-tune terms that are close to both proper names and valid words in the supported languages.

Advanced Phrase Recognition may be applied in several sequential steps, which may use increasingly broad dictionaries. This way, you can, for example, apply Advanced Phrase Recognition starting with a narrow list of company-specific product names followed by domain-specific terms such as computing, pharmaceutical, or engineering terms.

For details about customizing Advanced Phrase Recognition, contact your FAST Account Manager or FAST Technical Support.

Advanced Phrase Recognition and Spell Checking

Advanced Phrase Recognition is also used for spell checking (see Spell Checking and Phrase Recognition Framework). A query term that is close to a proper name is replaced with the proper name. The spell checking is also applied to phrases, including word splitting and joining. Thus, "datasearch" or "ffast", for example, are recognized as "data search" and "fast" respectively if the dictionary of proper names includes data search.

Advanced Phrase Recognition also provides a list of exceptions that avoid spell checking of words that are similar to the defined proper names. For example, assuming your content contains the product name eServer, then the English word server should probably not be changed to eServer. In this case, the word server is added to the exception list for proper name and phrase recognition.

Tip: It is recommended that you modify the exception list on the basis of past queries.

Applying Advanced Phrase Recognition

Advanced Phrase Recognition is applied as part of Advanced Spell Checking. The search string the end-user submitted to the system is analyzed. Depending on the configuration of the FAST Query & Result Server, the result of the query analysis is either sent directly to the FAST Search Engine or sent back to the end-user as feedback.

For details about how to apply Advanced Phrase Recognition, refer to the Advanced Linguistics Guide.

Spell Checking and Phrase Recognition Framework

The purpose of spell checking is to improve the quality of the queries by comparing the query terms against dictionaries and identifying misspelled query terms. As a result of the spell checking process, FAST ESP either replaces the query terms automatically with the correct terms, or it suggests modifications to the query terms to the end-user. The latter is referred to as Didyoumean spell checking.

The spell checking algorithm operates on individual query segments. A query segment is a portion of the query that forms a syntactical entity of some kind. For example, if something within the query is put in quotes, that quoted part forms a query segment.

Spell checking a query is executed in two stages: First, an Advanced Spell Check is performed, followed by a Simple Spell Check.

Restriction: Note that spell checking is not available for Chinese, Japanese, or Korean.

Phrase Recognition and Correction

During the Advanced Spell Check stage, the query terms are run through Advanced Phrase Recognition.

FAST ESP supplies a default dictionary containing names of persons, names of places, names of companies, and other common phrases. You can extend this dictionary with your own custom phrases, for instance product names. This stage combines phrase detection with spell check.

The Advanced Spell Check stage enables all query transformation capabilities included in Advanced Phrase Recognition.

Refer to the section Spellchecking Framework in the Advanced Linguistics Guide for more information.

Note: For previous FDS 4.x users: this was previously referred to as Proper Name Recognition.

Spell Checking on Simple Terms

The Simple Spell Check stage supports spell checking of individual terms against language-specific dictionaries. (See Supported Languages for Simple Spell Checking.) This spell check stage will only detect misspelling of single words, not phrases. Simple spell checking does not protect the corrected terms from further processing.

Applying Spell Checking

Spell checking is applied during query processing. Spell checking is controlled by the license file (the feature itself and language support).

You activate the Simple Spell Check dictionaries as part of the installation process.

You enable and configure Advanced Spell Check by:

• adapting the required dictionaries’ source files to your content and end-users’ needs
• compiling the dictionaries
• configuring the appropriate query transformer.

For details, refer to the Advanced Linguistics Guide.

Spell checking can be controlled on a per query basis (on/off/suggest).

Required Dictionaries for Spell Checking

Both Advanced and Simple Spell Checking require a set of dictionaries. Refer to Configuring Linguistic Processing in the Configuration Guide for dictionary file locations.

Advanced Spell Check Dictionaries

The following dictionaries support advanced spell checking:

• phrase dictionaries

FAST ESP supplies a phrase dictionary that contains common phrases such as names of famous persons (for instance "elvis presley"), names of places (for instance "san francisco"), and names of companies (for instance "kraft foods").

You may modify the supplied phrase dictionary by adding or removing terms. Alternatively, you may create separate phrase dictionaries that contain customer-specific phrases only. If you choose to create multiple phrase dictionaries, you can enable selecting a specific phrase dictionary to be used for spell checking at query time.

If a query phrase does not exactly match any entry in the selected or default dictionary, but is close to some dictionary entries, the phrase that is considered the closest match is suggested as a replacement to the original query phrase. If a query phrase matches an entry in the dictionary exactly, the phrase is protected by quotes and Simple Spell Check will not be allowed to change the terms of the phrase. If there are no entries in the dictionary that are close to matching the query phrase, the query phrase remains unchanged and is sent to the FAST Search Engine.

• the phrase exception list

FAST ESP supplies a default phrase exception list that contains words that are not to be considered for spell checking. When a query term matches an entry in the exception list, the term will be protected from spell checking changes.

You can adapt this phrase exception list to suit your content.

Note: All phrase dictionaries are language-independent. Note however that the default phrase dictionaries supplied with FAST ESP are optimized for English.

Simple Spell Check Dictionaries

The following dictionaries support simple spell checking:

• single word dictionaries:

FAST supplies language-specific dictionaries that contain common words for the particular language.

If a query term does not match any entry in the dictionary, but is close to some dictionary entries, the term that is considered the closest match is suggested as a replacement to the original query term. If a query term exactly matches an entry in the dictionary, or there are no entries in the dictionary that are considered close to matching the query term, the query term remains unchanged and is sent to the FAST Search Engine.

You may modify the supplied single word dictionaries by adding or removing terms.

• single word exception lists:

Single word exception lists are dictionaries that contain words that are not to be considered for spell checking. When a query term exactly matches an entry in the exception list, the term will be protected from Simple Spell Check.

Note: In contrast to the phrase exception list, the single word exception lists are language specific.
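The interplay between a single word dictionary and a single word exception list can be sketched in Python as follows. The word lists are hypothetical, and the closeness measure from Python's `difflib` module, along with the `cutoff` value, is an assumption made for illustration rather than the measure used by FAST ESP.

```python
import difflib

# Hypothetical language-specific dictionary and exception list.
DICTIONARY = ["search", "server", "engine", "platform"]
EXCEPTIONS = {"eserver"}

def simple_spell_check(term, cutoff=0.8):
    """Return a suggested replacement, or the term unchanged."""
    t = term.lower()
    # Exception-list entries and exact dictionary matches are left alone.
    if t in EXCEPTIONS or t in DICTIONARY:
        return term
    # Otherwise suggest the closest dictionary entry, if one is close enough.
    close = difflib.get_close_matches(t, DICTIONARY, n=1, cutoff=cutoff)
    return close[0] if close else term
```

Without the exception entry, a term such as eserver would be close enough to server to be replaced. A term that is not close to any dictionary entry passes through unchanged.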

Supported Languages for Simple Spell Checking

Dictionaries for simple spell checking are provided for the following languages:

• Dutch
• English
• French
• German
• Italian
• Norwegian
• Portuguese
• Spanish

During installation, you select which of the supported languages to install. This is mainly in order to save disk space. To change this after installation, contact your FAST Account Manager or FAST Technical Support.

Dictionaries may also be provided for the following languages (contact your FAST Account Manager or FAST Technical Support for details):

• Arabic
• Czech
• Danish
• Estonian
• Finnish
• Hungarian
• Latvian
• Lithuanian
• Polish
• Romanian
• Russian
• Swedish
• Turkish
• Ukrainian
• Hebrew (contact your FAST Account Manager or FAST Technical Support)

Anti-Phrasing

Anti-phrasing removes common phrases from the query strings.

These common phrases are defined in the anti-phrasing dictionary. This way, query strings like "Who is Miles Davis?" are reduced to "Miles Davis", which improves query recall, particularly for AND queries.

Anti-phrasing has less effect on the results for OR queries. It may still enhance precision as it may reduce result rank for documents with many irrelevant occurrences of who is in parts of the document where "Miles Davis" does not appear.

Anti-phrasing is closely related to the concept of stopwords. In contrast to stopwords, however, the anti-phrasing feature does not remove single words, but entire phrases only. Removing single words implies the risk of removing important words that happen to be identical with stopwords. Phrases, in contrast, are less ambiguous and can therefore be removed from the query more safely.
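A minimal Python sketch of phrase removal is shown here. The anti-phrase list and the function name are hypothetical, and the sketch ignores details such as language handling; it only illustrates that whole phrases are removed while single words are left untouched.

```python
import re

# Hypothetical anti-phrasing dictionary.
ANTI_PHRASES = ["who is", "what is", "where can i find"]

def remove_anti_phrases(query):
    """Remove whole anti-phrases from the query; single words are kept."""
    result = query
    # Try longer phrases first so they are not cut short by shorter ones.
    for phrase in sorted(ANTI_PHRASES, key=len, reverse=True):
        pattern = r"\b" + re.escape(phrase) + r"\b"
        result = re.sub(pattern, " ", result, flags=re.IGNORECASE)
    # Collapse the white space left behind and trim trailing punctuation.
    result = re.sub(r"\s+", " ", result).strip()
    return result.strip(" ?")
```

The single word who is not removed on its own, because only the complete phrase who is appears in the list.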

Required Dictionaries for Anti-Phrasing

The following dictionary is involved in anti-phrasing: the default anti-phrasing dictionary. This is a common dictionary for all supported languages.

Supported Languages for Anti-Phrasing

Anti-phrasing is supported for the following languages:

• Dutch
• English
• French
• German
• Italian
• Japanese
• Korean
• Norwegian
• Portuguese
• Spanish

Anti-phrasing may also be provided for the following languages (contact your FAST Account Manager or FAST Technical Support for details):

• Arabic
• Czech
• Danish
• Estonian
• Finnish
• Hungarian
• Latvian
• Lithuanian
• Polish
• Romanian
• Russian
• Swedish
• Turkish
• Ukrainian

Sub-String Search

FAST ESP supports sub-string search, which means searching for parts of a string as with a wildcard search ("*term*"). Sub-string search can also be used to enable n-gram for Chinese, Japanese, and Korean. Refer to the Advanced Linguistics Guide for more information.

Sub-String Search Overview

This section explains what sub-string search is compared to a wildcard search, and how a sub-string search works.

A wildcard search uses a wildcard character to substitute for any other character or characters in a string. Refer to Wildcard Search for more information.

Sub-string search is based on a specific composite field configuration within the index profile. By setting a composite field to be a sub-string field, you enable your end-users to search for sub-strings of arbitrary lengths and at arbitrary positions inside the indexed content.

Wildcard search does not work with phrases; sub-string search does.

Restriction: Sub-string search is not available for scope fields.

When enabled, sub-string search is applied to both document and query. For a field in the index profile that is specified for sub-string search, each word or token (for Asian language documents) in the field is split up into smaller entities, so-called sub-strings, consisting of a defined number of characters. As an example, the word "midsummer" is split up into the sub-strings "mids," "idsu," "dsum," "summ," "umme," and "mmer," provided the specified sub-string length is four characters.

This allows the end user to search, for example, for the query "summer" and to find a document that actually contains the word “midsummer.” The end-user’s query is split into the sub-strings “summ,” “umme,” and “mmer.” During the matching process, the document containing the word “midsummer” with all its sub-strings and the query “summer” will result in a match because both contain common sub-strings.

You may configure the length of the sub-strings into which a word or token is to be split. In addition, you may configure whether white space or other non-word characters functioning as word separators are to be stripped away, so that sub-strings across words are matched as well.
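The splitting and matching described above can be reproduced with a few lines of Python. This is a simplified sketch: the function names are hypothetical, and the match test only checks that every query sub-string occurs among the document sub-strings, without verifying their positions as a real search engine might.

```python
def ngrams(word, n=4):
    """Split a word into overlapping sub-strings of length n."""
    word = word.lower()
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def substring_match(query, document_word, n=4):
    """True if every sub-string of the query occurs in the document word."""
    doc_grams = set(ngrams(document_word, n))
    return all(gram in doc_grams for gram in ngrams(query, n))
```

Applied to the example from the text, `ngrams("midsummer")` yields the six sub-strings listed above and `ngrams("summer")` yields three of them, so the query matches the document word.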

Application Scenarios

Sub-string search is useful for application scenarios like the following:

• For certain database applications it may be desirable to be able to search for sub-strings within individual fields, such as product code fields, or name fields.

• Many languages combine several individual words into new words. German, for instance, uses this mechanism a lot. Sub-string search allows your end-user to find documents containing the word "Staatsanwaltschaft" using the query "anwalt".

• In text written in Chinese, Japanese and Korean, there are commonly no spaces separating individual words. To tokenize documents that are written in these languages, FAST ESP uses a specific, language-sensitive tokenizing document processor in order to find logical places for word boundaries. However, sometimes the process of finding word boundaries is ambiguous, as your end-users may want to search for words going beyond what this tokenizer can output. In these cases, the sub-string search functionality enables queries that are not sensitive to this tokenizer, but go across the word boundaries the tokenizer has come up with, thus matching any sequence of characters.

• Besides Chinese, Japanese and Korean, there are other languages that do not use space between words either. Dedicated tokenizers are not provided for these. In this case, sub-string search still allows the end-user to search for individual words.

• In certain scenarios, documents have insufficient logic and don’t allow for useful word splitting. Examples are DNA-strings and musical midi-descriptions. In these cases, sub-string search allows the end-user to search through these types of documents.

• Sometimes separating characters in a word with spaces is used to emphasize the word, as in the phrase "His name was E L V I S". In this case, sub-string search allows the end-user to find a document containing this phrase by searching for "Elvis".

• Some acronyms may have different spellings, like "D.N.A." and "DNA" or "dna". Sub-string search can be one of the alternatives to allow the end-user to find documents containing "D.N.A." by searching for "dna" and vice-versa.

A side effect of this tokenization is that word boundaries are not detected. This means that a query "*erni*" would also match the text "Midsummer Night". This can sometimes be desired, but may also create undesired matches. This means that sub-string search is not always applicable for usual text documents of reasonable size, as the probability for such undesired matches across word boundaries increases with document size. For structured data and Asian language encoded documents, though, sub-string search is a reasonable solution.

Applying Sub-String Search

Sub-string search is enabled by defining the relevant fields as subject to sub-string search in the index profile.

You may configure the length of the sub-strings into which a word or token is to be split. For Western languages, the recommended length is at least four characters. For Asian languages the recommended length is two to three.

In addition, you may configure whether white space or other non-word characters functioning as word separators are to be stripped away, so that sub-strings across words are matched as well.

Wildcards are implicit, meaning that you get the same results by searching for "summer" as you get when you search for "*summer*".
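The effect of these settings can be illustrated with a minimal Python sketch. It is not FAST ESP code: the names `normalize`, `make_ngrams` and `substring_match`, and the idea of comparing fixed-length character n-grams before verifying the match, are assumptions made purely for illustration. The sketch shows how stripping separators lets a query match across word boundaries.

```python
import re

def normalize(text, strip_separators=True):
    """Lowercase the text and optionally drop separators (hypothetical helper)."""
    text = text.lower()
    if strip_separators:
        # Remove white space and other non-word characters so that
        # sub-strings spanning two words can be matched as well.
        text = re.sub(r"\W+", "", text)
    return text

def make_ngrams(text, n):
    """Return the set of overlapping character n-grams of the text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def substring_match(query, document, n, strip_separators=True):
    """Implicit-wildcard match: does the query occur anywhere in the document?"""
    q = normalize(query, strip_separators)
    d = normalize(document, strip_separators)
    if len(q) < n:
        return False  # assumption: queries shorter than one n-gram cannot match
    # Cheap rejection: every n-gram of the query must occur in the document.
    if not make_ngrams(q, n) <= make_ngrams(d, n):
        return False
    # Verify that the n-grams occur contiguously and in order.
    return q in d

# With separators stripped, "erni" is found inside "Midsummer Night".
print(substring_match("erni", "Midsummer Night", 2))
# With separators kept, the same query does not match.
print(substring_match("erni", "Midsummer Night", 2, strip_separators=False))
```

The same sketch finds "Elvis" in "His name was E L V I S" once the spaces are stripped, which mirrors the emphasis example given earlier.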

Wildcard Search

FAST ESP supports single character, prefix and suffix wildcards.

With full wildcard support it is possible to use '*' and '?' when specifying a query term, where '*' indicates any number of wildcard characters and '?' indicates a single wildcard character. The wildcard characters may be anywhere in the query term.
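The semantics of the two wildcard characters can be sketched in Python by translating a query term into a regular expression. This only illustrates the matching rule and does not show how FAST ESP evaluates wildcards internally. The function names are hypothetical, and the sketch assumes that "any number" includes zero characters.

```python
import re

def wildcard_to_regex(term):
    """Translate a query term containing '*' and '?' into a compiled regex."""
    parts = []
    for ch in term:
        if ch == "*":
            parts.append(".*")           # any number of characters (assumed: zero or more)
        elif ch == "?":
            parts.append(".")            # exactly one character
        else:
            parts.append(re.escape(ch))  # every other character matches literally
    return re.compile("".join(parts))

def wildcard_match(term, word):
    """True if the whole word matches the wildcard term."""
    return wildcard_to_regex(term).fullmatch(word) is not None

print(wildcard_match("sum*", "summer"))    # prefix wildcard
print(wildcard_match("*mer", "summer"))    # suffix wildcard
print(wildcard_match("s?mmer", "summer"))  # single-character wildcard
```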

Sub-string search is a related feature. For details, refer to section Sub-String Search.

Wildcard search is enabled by defining the relevant fields as subject to wildcard search in the index profile. For details refer to the Configuration Guide.

Wildcard support is defined for an individual string field or a composite field in the Index Profile. It must be configured explicitly as it will have impact on disk usage.

Restriction: Proximity and context based ranking does not apply to wildcard terms in queries. This means that if you have a query only containing wildcard terms, there will be no rank value in the result set. If you have a query containing both normal (without wildcards) terms and wildcard terms, the ranking will be based on the non-wildcard terms only.

Special Characters and Accents

By default, special characters, such as characters with accents or language specific characters, are preserved in documents, dictionaries, and queries. This means that words that contain special characters are treated as different words than their normalized variants.

It is possible to configure FAST ESP to normalize words with respect to accents and special character sequences (such as ‘C++’). You can enable character normalization in the tokenizer configuration. For documents, you can enable character normalization by using the Normalizer document processor in the corresponding pipeline.
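As an illustration of what accent normalization means, the following Python sketch strips combining accents using the standard `unicodedata` module. It is not the FAST ESP Normalizer: the function names are hypothetical, and the sketch covers accents only, not special character sequences such as ‘C++’.

```python
import unicodedata

def strip_accents(word):
    """Return the word with combining accent marks removed."""
    # NFD decomposition separates a base letter from its accents.
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def index_forms(word, normalize=False):
    """Forms under which a word could be matched (hypothetical helper).

    Without normalization only the original form is kept, so an accented
    word and its unaccented variant are treated as different words.
    """
    return {word, strip_accents(word)} if normalize else {word}

print(strip_accents("hôtel"))
print(index_forms("hôtel", normalize=True))
```

Keeping both the original and the normalized form is one possible design: it lets a query match with or without the accent.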

Refer to the Advanced Linguistics Guide for details.


Chapter 13

Operation and System Administration

This chapter introduces you to the basic concepts of operating and administrating a FAST ESP installation.

Note: While this chapter gives you a conceptual view of operation and administration within FAST ESP, the Operations Guide and the Deployment Guide give you detailed procedural information about individual operational and administrative tasks.

Topics:

• Operation Overview
• ESP Administrator Interface
• FAST Home and Search Business Center
• Licensing
• Fault Tolerance
• Security

Operation Overview

This section introduces you to the basic concepts of operating and administrating a FAST ESP installation. It gives you a conceptual view of operation and administration within FAST ESP. Refer to the Operations Guide and the Deployment Planning Guide for procedural information about individual operational and administrative tasks.

ESP Administrator Interface

FAST ESP is administrated through the Administrator Interface (also referred to as the Admin GUI). This is a graphical user interface that is accessible through a common Web browser.

Figure 11: FAST ESP Administrator Interface (Admin GUI)

Main Views

The FAST ESP Administrator Interface (Admin GUI) is a graphical user interface with different tabs that let you configure different areas of a search setup.

The main views in the Admin GUI are:

• Collection Overview
• System Overview
• Document Processing
• Logs
• Search View, Search Front End
• Data Sources
• System Management
• Matching Engines
• WebAnalyzer

Collection Overview

The Collection Overview selection in the Admin GUI allows you to monitor, create, configure, and delete collections running on your FAST ESP implementation.

For details on the concept of collections, refer to Collections.

For procedural information about the tasks you can perform within the Collection Overview selection, refer to Basic Setup in the Configuration Guide.


Document Processing

The Document Processing selection in the Admin GUI allows you to configure the document processing pipelines you want to use to process a collection. It displays statistics, such as host, port, status, stages, or pipelines for each stage of a pipeline. You can view, create, add, edit, or remove stages of the pipeline through this selection.

For details on:

• the concepts related to document processing, refer to Processing Documents.
• procedural information about the tasks you can perform within the Document Processing selection, refer to Configuring the FAST Document Processing Engine in the Configuration Guide.

Search View, Search Front End

The Search View selection in the Admin GUI allows you to view the default Search Front End (SFE) provided with FAST ESP. This front end lets you search the documents of your implementation for testing purposes.

Figure 12: Search Front End, showing Contextual Search tab

For information about the Search Front End, refer to Processing Queries and Results, section Query Processing, and to the Query Integration Guide.

Refer also to the Search Front End User's Guide, and the Search Front End Developer's Guide for information on the SFE.

System Management

The System Management selection allows you to view status information for any node controller that is configured in the system. Information includes the node name, date and time of creation, general system information such as the name of the home directory and memory currently being used on the disk, and a list of all installed modules.

This selection allows you to stop or restart any or all of the installed modules as well as add or remove an available processor server or FAST Crawler. You can also stop an entire node from this page.

For procedural information about the tasks you can perform within the System Management selection, refer to the Operations Guide.


Matching Engines

The Matching Engines selection in the Admin GUI allows you to view hostname, port number and type for each Search Engine in your FAST ESP installation, and to add a new Search Engine.

For details on the FAST Search Engine, refer to Processing Queries and Results.

For procedural information about how to configure the Search Engine, refer to Index Profile Management in the Configuration Guide.

Data Sources

The Data Sources selection in the Admin GUI provides a list of available data sources and allows you to view the collections a data source is associated with.

For details on:

• individual data source modules, refer to Processing Queries and Results.
• how to configure individual data source modules, refer to Basic Setup in the Configuration Guide.

Logs

The Logs selection in the Admin GUI allows you to view all system generated log files by file name, category, module, or collection. Log entries include the time the entry was generated, the log level, the module, host and port where the activity occurred, collection, and a text message of the activity. Archived logs can also be accessed from this page.

For logging information refer to:

• Configuring Logging in the Operations Guide, which explains configuring logging, log levels, and log destinations, and
• Troubleshooting Guide, for individual log and error messages

System Overview

The System Overview selection in the Admin GUI gives you a total overview of your FAST ESP installation as well as status information for any modules that are configured in the system. Information includes the module name, host, port number, and status as well as the option to view detailed information for a specific module.

WebAnalyzer Overview

The WebAnalyzer is a module that uses links between documents to improve search relevancy. The WebAnalyzer Overview tab in the Admin GUI allows you to monitor, create, configure, and delete WebAnalyzer views running on your FAST ESP implementation.

Refer to the WebAnalyzer Guide for information about the WebAnalyzer, including procedural information about the tasks you can perform from the WebAnalyzer Overview tab.

FAST Home and Search Business Center

FAST Home and Search Business Center are applications for setting up and tuning search profiles, and for setting up and tuning the search experience for each of the search profiles.

From a single FAST ESP installation, you can manage multiple search profiles and search experiences.

FAST Home is your personal portal to the FAST ESP installation, with links to the other FAST applications, such as Search Business Center and the Administration GUI.

FAST Home is where you create and set up the initial search profiles, and where you manage the users and groups that should have access to work with the search profiles.


Figure 13: Example of the Search Business Center interface

Search Business Center is the central hub for all tuning, monitoring, administering, and reporting of your search environment. You can manage ranking, relevancy, synonyms, navigators and more.

Search Business Center is where you tune and configure the search experience for the search profile before you publish it to your production environment.

In Search Business Center you can monitor the end-users’ query behavior (query logs). You can make changes to the search profile settings and test them out in the internal Preview before publishing the changes to the Published Search Front End.

Once the search profile is up and running in your production environment, updated reporting information from the production system starts flowing back into Search Business Center, and you can see if your changes have had the desired effect.

This way, you can continuously tune and improve the search experience for your search users.

Refer to the FAST Home Guide and the Search Business Center Guide for information on these interfaces.

Licensing

FAST ESP is a system of individually licensed capabilities. These capabilities are either features, modules, or data amount capacities. Some of these capabilities are included in the standard delivery of FAST ESP, while others are additional modules for which you can purchase separate licenses and include them in your FAST ESP solution.

Based on the agreement with the customer, FAST generates a license file, which - together with the FAST ESP license management system - ensures that the purchased capabilities are enabled. The license management system is based on FLEXlm from Macrovision Corporation.

Note: FAST ESP is provided to you with the understanding that only those components that have been purchased will be used.

Time-limited evaluation licenses for FAST ESP are available upon request.

The License Management System


FAST ESP licenses are floating licenses. This means that the capabilities can be shared between several FAST ESP processes running on different machines. The license management system keeps track of both the type and the number of capabilities that are checked out, and registers which processes are using them.

The license manager daemon, named lmgrd, handles the initial contact with the client application programs. It also starts and manages the FASTSRCH daemon. It must be started before any other FAST ESP module. This is handled automatically by FAST ESP upon system start.

Note: To ensure that FAST ESP remains operative even if your license server fails, FAST ESP can keep running for four days without the license server being up and running. It is not possible to restart FAST ESP while the license server is down, though.

The FAST vendor daemon, named FASTSRCH, grants licenses based on requests from the FAST ESP processes, keeps track of the number of licenses that are checked out, and registers the processes that are using them.

The license file fastsearch.lic defines the features and maximum number of capacities that are enabled in your FAST ESP installation. The license file also specifies the machine on which the license manager is running, and the path to the FASTSRCH vendor daemon.

Your FAST ESP application checks out licenses, and checks them in again after use to free the capability for use by other processes.

The processes listed above can run on one machine or on different machines, as defined by the customer during the installation process.
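The check-out and check-in cycle of a floating license can be pictured with a small Python model. This is a conceptual sketch only: it does not use or imitate the FLEXlm API, and the class name, method names and feature names are hypothetical.

```python
class FloatingLicensePool:
    """Toy model of floating-license bookkeeping (not the FLEXlm API)."""

    def __init__(self, capacities):
        # feature name -> maximum number of simultaneous check-outs
        self.capacities = dict(capacities)
        # feature name -> ids of the processes currently holding a license
        self.checked_out = {}

    def check_out(self, feature, process_id):
        """Grant a license if the feature is enabled and capacity remains."""
        holders = self.checked_out.setdefault(feature, [])
        if len(holders) >= self.capacities.get(feature, 0):
            return False
        holders.append(process_id)
        return True

    def check_in(self, feature, process_id):
        """Return a license so that another process can use the capability."""
        holders = self.checked_out.get(feature, [])
        if process_id in holders:
            holders.remove(process_id)


pool = FloatingLicensePool({"search": 1})
print(pool.check_out("search", "proc-a"))  # capacity available
print(pool.check_out("search", "proc-b"))  # capacity exhausted
pool.check_in("search", "proc-a")
print(pool.check_out("search", "proc-b"))  # capacity freed by the check-in
```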

Fault Tolerance

Fault tolerance refers to the system's ability to continue operating when some of its components or services are failing. Within ESP, you can implement fault tolerance in the feeding chain and in search.

Fault tolerance in the feeding chain means that document feeding can continue when any node fails, provided that at least one instance of each service is running, and that indexing can continue when an indexer node fails. Fault tolerant search ensures that search results will still be correctly returned even if a search node or a query and result server fails.

In addition, there are several other aspects that should be considered when installing a fault tolerant system, for example feeder, name services and storage fault tolerance. Refer to the Deployment Planning documentation for more information.
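The general idea behind fault tolerant search, answering a query from another replica when one node fails, can be sketched in a few lines of Python. This is a generic failover pattern and not a description of how FAST ESP routes queries. The function name, the use of plain callables as search nodes, and the use of `ConnectionError` to signal a failed node are all assumptions.

```python
def search_with_failover(query, nodes):
    """Return the result from the first node that answers the query.

    `nodes` is a list of callables standing in for redundant search nodes;
    a node that is down is assumed to raise ConnectionError.
    """
    last_error = None
    for node in nodes:
        try:
            return node(query)
        except ConnectionError as exc:
            last_error = exc  # remember the failure and try the next replica
    raise RuntimeError(f"all {len(nodes)} search nodes failed") from last_error


def failed_node(query):
    raise ConnectionError("node is down")

def healthy_node(query):
    return [f"result for {query}"]

# The query still succeeds although the first node has failed.
print(search_with_failover("elvis", [failed_node, healthy_node]))
```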

Security

FAST ESP can easily be deployed behind a firewall. The firewall must be configured so that a number of FAST ESP interfaces are available from outside the firewall, depending on the requirements of the system.

For more details about securing your FAST ESP installation, refer to the Deployment Planning Guide.


Chapter 14

Supported Document Formats

This chapter lists document formats that FAST ESP can process through its Document Processing Engine. Note however, that although these formats are supported, not all formats allow for extraction of both text and metadata. For more details, contact your FAST Account Manager or FAST Technical Support.

Topics:

• Supported Formats Overview
• Supported Input File Formats

Supported Formats Overview

There are a number of supported input file formats that FAST ESP can process through its Document Processing Engine. These supported formats are described in the topics that follow.

Note: Although these formats are supported, not all formats allow for extraction of both text and metadata. For more details, contact your FAST Account Manager or FAST Technical Support.

Note: This version of ESP supports HTML documents with embedded Microsoft Office document elements, but this feature must be enabled. Refer to Document Processing Stages in the Configuration Guide for information.

Supported Input File Formats

FAST ESP supports different input file formats, as listed in the tables in this topic. Note that there are some restrictions, as indicated.

Restriction: Office 2007: The following features are not supported: password protected documents, embedded fonts.

Restriction: Word 2007: Footnotes, end notes and reference numbers are not supported.

Restriction: Excel 2007: Protected workbooks, which are partially supported in older Excel formats, are not supported at all by the Excel 2007 filter.

Word Processing Formats

Supported word processing formats are listed alphabetically by general type.

Table 2: Word Processing Formats

Format | Versions
ANSI Text | 7 & 8 bit
ASCII Text | 7 & 8 bit
DEC WPS Plus (DX) | Versions through 3.1
DEC WPS Plus (WPL) | Versions through 4.1
DisplayWrite 2 & 3 (TXT) | All versions
DisplayWrite 4 & 5 | Versions through 2.0
EBCDIC | All versions
Enable | Versions 3.0, 4.0 and 4.5
First Choice | Versions through 3.0
Framework | Version 3.0
Hangul | Versions 97, 2002 and 2005
IBM FFT | All versions
IBM Revisable Form Text | All versions
IBM Writing Assistant | Version 1.01
Just System Ichitaro | Versions 4.x through 6.x, 8.x through 13.x and 2004
Just Write | Versions through 3.0
Legacy | Versions through 1.1
Lotus AMI/AMI Professional | Versions through 3.1
Lotus Manuscript | Version 2.0
Lotus Word Pro (non-Windows) | Versions SmartSuite 97, Millennium, and Millennium 9.6 (text only)
Lotus Word Pro (Windows) | Versions SmartSuite 96, 97, Millennium and Millennium 9.6
MacWrite II | Version 1.1
MASS11 | Versions through 8.0
Microsoft Rich Text Format (RTF) | All versions
Microsoft Word (DOS) | Versions through 6.0
Microsoft Word (Mac) | Versions 4.0 - 2004
Microsoft Word (Windows) | Versions through 2007
Microsoft WordPad | All versions
Microsoft Works (DOS) | Versions through 2.0
Microsoft Works (Mac) | Versions through 2.0
Microsoft Works (Windows) | Versions through 4.0
Microsoft Windows Write | Versions through 3.0
MultiMate | Versions through 4.0
Navy DIF | All versions
Nota Bene | Version 3.0
Novell Perfect Works | Version 2.0
Novell/Corel WordPerfect (DOS) | Versions through 6.1
Novell/Corel WordPerfect (Mac) | Versions 1.02 through 3.0
Novell/Corel WordPerfect (Windows) | Versions through 12.0
Office Writer | Versions 4.0 - 6.0
OpenOffice Writer (Windows and UNIX) | OpenOffice versions 1.1 and 2.0
PC-File Letter | Versions through 5.0
PC-File + Letter | Versions through 3.0
PFS:Write | Versions A, B and C
Professional Write (DOS) | Versions through 2.1
Professional Write Plus (Windows) | Version 1.0
Q&A (DOS) | Version 2.0
Q&A Write (Windows) | Version 3.0
Samna Word | Versions through Samna Word IV+
Signature | Version 1.0
SmartWare II | Version 1.02
Sprint | Versions through 1.0
StarOffice Writer | Version 5.2 (text only) and 6.x through 8.x
Total Word | Version 1.2
Unicode Text | All versions
UTF-8 | All versions
Volkswriter 3 & 4 | Versions through 1.0
Wang PC (IWP) | Versions through 2.6
WordMARC | Versions through Composer Plus
WordStar (DOS) | Versions through 7.0
WordStar (Windows) | Version 1.0
WordStar 2000 (DOS) | Versions through 3.0
XyWrite | Versions through III Plus

Desktop Publishing Formats

Format | Versions
Adobe FrameMaker MIF | Versions 3.0, 4.0, 5.0, 5.5 and 6.0 and Japanese 3.0, 4.0, 5.0 and 6.0 (text only)

Database Formats

The table in this topic indicates the supported database formats.

Format | Versions
Access | Versions through 2.0
dBASE | Versions through 5.0
DataEase | Version 4.x
dBXL | Version 1.3
Enable | Versions 3.0, 4.0 and 4.5
First Choice | Versions through 3.0
FoxBase | Version 2.1
Framework | Version 3.0
Microsoft Works (Windows) | Versions through 4.0
Microsoft Works (DOS) | Versions through 2.0
Microsoft Works (Mac) | Versions through 2.0
Paradox (DOS) | Versions through 4.0
Paradox (Windows) | Versions through 1.0
Personal R:BASE | Version 1.0
R:BASE 5000 | Versions through 3.1
R:BASE System V | Version 1.0
Reflex | Version 2.0
Q&A | Versions through 2.0
SmartWare II | Version 1.02

Spreadsheet Formats

The table in this topic lists the supported spreadsheet formats.

Table 3: Spreadsheet Formats

Format | Versions
Enable | Versions 3.0, 4.0 and 4.5
First Choice | Versions through 3.0
Framework | Version 3.0
Lotus 1-2-3 (DOS & Windows) | Versions through 5.0
Lotus 1-2-3 (OS/2) | Versions through 2.0
Lotus 1-2-3 Charts (DOS & Windows) | Versions through 5.0
Lotus 1-2-3 for SmartSuite | Versions 97 - Millennium 9.6
Lotus Symphony | Versions 1.0, 1.1 and 2.0
Mac Works | Version 2.0
Microsoft Excel Charts | Versions 2.x - 7.0
Microsoft Excel (Mac) | Versions 3.0 - 4.0, 98, 2001, 2002, 2004, and v.X
Microsoft Excel (Windows) | Versions 2.2 through 2007
Microsoft Multiplan | Version 4.0
Microsoft Works (Windows) | Versions through 4.0
Microsoft Works (DOS) | Versions through 2.0
Microsoft Works (Mac) | Versions through 2.0
Mosaic Twin | Version 2.5
Novell Perfect Works | Version 2.0
Open Office Calc | Versions 1.1, 2.0 (text only)
PFS: Professional Plan | Version 1.0
Quattro Pro (DOS) | Versions through 5.0
Quattro Pro (Windows) | Versions through 12.0, X3
SmartWare II | Version 1.02
Star Office Calc | Versions 5.2, 6.x, 7.x and 8.0 (text only)
SuperCalc 5 | Version 4.0
VP Planner 3D | Version 1.0

Presentation Formats

The table in this topic lists the supported presentation formats.

Table 4: Presentation Formats

Format | Versions
Corel/Novell Presentations | Versions through 12.0
Harvard Graphics for DOS | Versions 2.x & 3.x
Harvard Graphics (Windows) | Windows versions
Freelance (Windows) | Versions through Millennium 9.6
Freelance for OS/2 | Versions through 2.0
Microsoft PowerPoint (Windows) | Versions 3.0 through 2007
Microsoft PowerPoint (Mac) | Versions 4.0 through v.X
StarOffice/OpenOffice Impress (Windows and UNIX) | StarOffice versions 5.2 (text only) and 6.x through 8.x (full support) and OpenOffice versions 1.1 and 2.0 (text only)

Graphics Formats

The table in this topic indicates which graphics formats are supported.

Format | Versions
Adobe Photoshop (PSD) | Version 4.0
Adobe Illustrator | Versions 7.0 and 9.0
Adobe FrameMaker graphics (FMV) | Vector/raster through 5.0
Adobe Reader | Versions 2.1-8.0
PDF | Versions 1.2-1.7
Ami Draw (SDW) | Ami Draw
AutoCAD Interchange and Native Drawing formats (DXF and DWG) | AutoCAD Drawing Versions 2.5 - 2.6, 9.0 - 14.0, 2000i and 2002
AutoCAD Drawing | Versions 2004, 2005 and 2006 (text only, no document hit highlighting)
AutoShade Rendering (RND) | Version 2.0
Binary Group 3 Fax | All versions
Bitmap (BMP, RLE, CUR, OS/2 DIB & WARP) | All versions
CALS Raster (GP4) | Type I and Type II
Corel Clipart format (CMS) | Versions 5 through 6
Corel Draw (CDR) | Versions 3.x - 8.x
Corel Draw (CDR with TIFF header) | Versions 2.x - 9.x
Computer Graphics Metafile (CGM) | ANSI, CALS NIST version 3.0
Encapsulated PostScript (EPS) | TIFF header only
GEM Paint (IMG) | All versions
Graphics Environment Mgr (GEM) | Bitmap & vector
Graphics Interchange Format (GIF) | All versions
Hewlett Packard Graphics Language (HPGL) | Version 2
IBM Graphics Data Format (GDF) | Version 1.0
IBM Picture Interchange Format (PIF) | Version 1.0
Initial Graphics Exchange Spec (IGES) | Version 5.1
JBIG2 | JBIG2 graphic embeddings in PDF files
JFIF (JPEG not in TIFF format) | All versions
JPEG (including EXIF) | All versions
Kodak Flash Pix (FPX) | All versions
Kodak Photo CD (PCD) | Version 1.0
Lotus PIC | All versions
Lotus Snapshot | All versions
Macintosh PICT1 & PICT2 | Bitmap only
MacPaint (PNTG) | All versions
Micrografx Draw (DRW) | Versions through 4.0
Micrografx Designer (DRW) | Versions through 3.1
Micrografx Designer (DSF) | Windows 95, version 6.0
Microsoft XML Paper Specification (XPS) | Version 1.0
Novell PerfectWorks (Draw) | Version 2.0
OS/2 PM Metafile (MET) | Version 3.0
Paint Shop Pro (PSP) | Windows only, version 5.0 - 6.0
PC Paintbrush (PCX and DCX) | All versions
Portable Bitmap (PBM) | All versions
Portable Graymap (PGM) | No specific version
Portable Network Graphics (PNG) | Version 1.0
Portable Pixmap (PPM) | No specific version
Postscript (PS) | Level II
Progressive JPEG | No specific version
Sun Raster (SRS) | No specific version
StarOffice/Open Office Draw for Windows and UNIX | StarOffice versions 5.2 (text only) through 8.x and OpenOffice versions 1.1 and 2.0
TIFF | Versions through 6
TIFF CCITT Group 3 & 4 | Versions through 6
Truevision TGA (TARGA) | Version 2
Visio (preview) | Version 4
Visio | Versions 5, 2000, 2002 and 2003
WBMP | No specific version
Windows Enhanced Metafile (EMF) | No specific version
Windows Metafile (WMF) | No specific version
WordPerfect Graphics (WPG & WPG2) | Versions through 2.0, 7 and 10
X-Windows Bitmap (XBM) | x10 compatible
X-Windows Dump (XWD) | x10 compatible
X-Windows Pixmap (XPM) | x10 compatible

Restriction: Text embedded in images is not supported.

Restriction: Graphics Interchange Format (GIF) and Portable Network Graphics (PNG) support identification only, not text or metadata extraction.

Restriction: Processing PostScript files created with DVIPS is not supported.


Compressed Formats

The table in this topic lists the supported compressed formats.

Format | Versions
GZIP |
LZA Self Extracting Compress |
LZH Compress |
Microsoft Binder | Versions 7.0-97 (conversion of Binder file is supported only on Windows)
UUEncode |
UNIX Compress |
UNIX TAR |
ZIP | PKWARE versions through 2.04g

Restriction: The entire compressed archive is indexed as one document.

Email Formats

The table in this topic indicates which Email formats are supported.

Format | Versions
Microsoft Outlook Folder (PST) | Microsoft Outlook Folder and Microsoft Outlook Offline Folder files versions 97, 98, 2000, 2002, 2003 and 2007
Microsoft Outlook Message (MSG) | Microsoft Outlook Message and Microsoft Outlook Form Template versions 97, 98, 2000, 2002, 2003, and 2007
MIME | MIME-encoded mail messages. (See "MIME Support Notes" immediately following this table.)

MIME Support Notes

MIME formats including:

• EML
• MHT (Web Archive)
• NWS (Newsgroup single-part and multi-part)
• Simple Text Mail (defined in RFC 2822)
• TNEF Format

MIME encoding, including:

• base64 (defined in RFC 1521)
• binary (defined in RFC 1521)
• binhex (defined in RFC 1741)
• btoa
• quoted-printable (defined in RFC 1521)
• utf-7 (defined in RFC 2152)
• uue
• xxe
• yenc

Also, the body of a message can be encoded several ways. The following encodings are supported:

• Text
• HTML
• RTF
• TNEF
• Text/enriched (defined in RFC 1523)
• Text/rich text (defined in RFC 1341)
• Embedded mail message (defined in RFC 822). This is handled as a link to a new message.

Note: The attachments of a MIME message can be in many formats. Not all attachment types are supported.

Other Formats

The table in this section lists additional supported file formats.

Format | Versions
Adobe Flash | Adobe Flash 6.x, Adobe Flash 7.x, and Adobe Flash Lite (text only); previously Macromedia Flash. Note: For crawled content, separately handled in the ESP Crawler. Refer to Crawler documentation.
Executable (EXE, DLL) |
HTML | Versions through 3.0, with some limitations. Including HTML with embedded Microsoft Office elements.
Microsoft Project | Versions 98 - 2003 (text only)
MP3 | ID3 information
vCard, vCalendar | Version 2.1
Windows Executable |
WML | Version 5.2
XML | ESP provides native indexing support for XML
Yahoo! Instant Messenger | Versions 6.x, 7.x and 8.x

Restriction: Adobe Flash is supported with the following limitation: links and text inside ActionScript code are not extracted.


Chapter 15

Glossary

This glossary lists some specific terms and definitions that are used in connection with FAST ESP.

Topics:

• ESP Term Definitions

ESP Term Definitions

Enables a document to be consistently displayed at a given position in the resultset when a user searches with a specific query. It also prevents individualdocuments from being displayed when a user searches with a specific query.

Absolute boosting

Character normalization can preserve both original and normalized forms foraccented words (for example, hôtel).

Accent normalization

A data set that grants permissions, or access rights, to each user or group for aspecific system object, such as a directory or file. FAST ESP is utilizes ACL

Access control list(ACL)

information from the content repositories so that the same permissions apply tosearch results.

This means that a user is only able to see the query results that he/she is entitledto view, based on his/her permissions towards the source content repository.

See Proximity.Adjacent searching

The textual components of web hyperlinks (text links or ‘alt’ text associated withimage hyperlinks).

Anchor text

Identifying word sequences in queries that do not contribute essentially to thequery’s meaning, such as “Where can I find” or “Where is”.

Anti-phrasing

A programmatic interface that enables software developers to access featuresand functions of a hardware or software platform.

Applicationprogramming interface(API)

Matching a query term and a term within a document based on approximations.Such approximations can be based on spell check (see Spell Checking) orlinguistic normalization (lemmatization, accent normalization).

Approximate matching

Tokenization (word segmentation) for Asian languages requires special treatment.These languages do not allow text to be split into word entities by referring to

Asian languagetokenization

whitespace or other separators. Asian language text needs to be split into tokensthat can be treated as words during document processing and matching.

Dimension of search relevancy. This indicates that the document is consideredto be an authority for this query. That is, the document is being referred to byothers, for example, through web anchor texts.

Authority

The average time it takes for the search engine to respond to a given query.There are typically two times that can be measured: 1) the average response

Average query responsetime

time of the search engine itself, and 2) that of the complete system for anend-to-end query (i.e. including the application and web server times).

A process that allows organizations to evaluate various aspects of their processesin relation to best practice, usually within their own industry sector.

Benchmarking

Used to alter the relevancy value of a document compared to other documentsin a search index. It is the addition or subtraction of a value to a document’s rank(relevancy).

Boosting

The ability to limit a query term/phrase to the start and/or end of an indexedfield/parameter. Combining start and end condition provides an exactfield/parameter match.

Boundary match

FAST ESP supports boundary matching on fields and scopes (see Scope Search).

108

FAST Enterprise Search Platform

Component that extracts links form text from JavaScript and Adobe Flash files.It is used by the Enterprise Crawler (EC) and in the document processing pipeline.

BrowserEngine

Programmatic alerts produced by an API. For a search platform, this is usuallyrelated to the content processing and indexing status of a document in order forthe client application to keep track of the processing/indexing progress.

Callbacks

Organizing pieces of information into topical categories. Usually, these arehierarchical trees, with the most general topics at the top and the most specificat the bottom.

A search engine may apply categorization of the documents in the index basedon similarities (typically based on a training set), matching rules or programmaticrules.

Categorization

Arranging results in result groups according to characteristics that are externalto the result set (supervised clustering) or inherent to the result set (unsupervisedclustering). e results.

Clustering

A group of content types or a logical group of documents based on selectedcriteria, such as semantics, or document processing. Content types can be

Collection

grouped by source and by the processing rules that are to be applied to thecontent.

Any program that collects data and pushes it into FAST ESP.Connectors

Used to push content to a FAST ESP implementation.Content API

A module that retrieves content from external sources.Content connector

Receives content from the Content API and distributes it to the appropriatedocument processing pipelines.

Content distributor

A single piece of content prior to submission to FAST ESP.Content entity

A content connector that retrieves external files from Web servers.

By following links, the FAST Crawler crawls Web content hierarchies based ona single start URL.

Crawler

The File Traverser goes along directory structures, whereas the FAST Crawlercrawls Web servers along URI structures.

The act of accessing Web servers and/or file systems in order to extractinformation to feed into the enterprise search platform. By following links, aCrawler is able to traverse Web content hierarchies based on a single start URL.

Crawling

A type of dynamic drill-down navigator that applies on-the-fly aggregation of resultvalues across the entire result set for a query.

Deep navigators

A piece of content that is normalized with respect to the FAST ESP documentstructure.Within the FAST Real-Time Filter documentation, documents are referredto as Events .

Document

Part of a document. A document is divided into elements in order to enable parametric search and to apply different ranking weights on different parts of a document.

Document element

A module that provides document processing pipelines for analyzing documents.

Document processing engine

Part of a document processing pipeline that performs a particular processing of a document. A document processing stage may modify, remove or add information to a document, such as adding new meta information for linguistic processing, or extracting information about the language of the document.

Document processing stage

Data that is returned to answer a query. A query is answered by a set of document summaries. There is one document summary per document in the result set.

Each document summary contains a set of summary fields, where each summary field corresponds to a field presented in the result set.

Document summary

A set of (keyword, weight) pairs, where keyword is a word or a phrase associated with the document, and weight is a numerical measure of how important keyword is for the document.

Vectors are a kind of document signature (word-weight pairs) representing a document's content in a way that allows comparison between documents.

Document vector
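
As an illustration of how two document vectors might be compared, the following minimal sketch computes cosine similarity over (keyword, weight) pairs. Cosine similarity is an assumption chosen for illustration; this guide does not state which measure FAST ESP uses, and the function name and sample vectors are hypothetical.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two {keyword: weight} document vectors."""
    # Only keywords present in both vectors contribute to the dot product.
    shared = set(vec_a) & set(vec_b)
    dot = sum(vec_a[k] * vec_b[k] for k in shared)
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is not similar to anything
    return dot / (norm_a * norm_b)


# Hypothetical document vectors (keyword -> weight).
doc_a = {"enterprise search": 0.8, "index": 0.6}
doc_b = {"enterprise search": 0.6, "crawler": 0.8}
similarity = cosine_similarity(doc_a, doc_b)
```

A value near 1 indicates that the two documents share heavily weighted keywords; a value of 0 indicates that they share none.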

Search engines may apply different levels of duplicate detection. Exact duplicates are the same document located in different repositories. See also Field collapsing.

Duplicate detection

A navigation tool for structured data; it provides multidimensional drill-down in structured data based on facets of content.

Dynamic drill-down

The process by which rank components are computed during matching related to the level of match between document and query.

Dynamic rank

A short summary of a document, generated based on the actual query, showing the regions of the document matching the query – with the query terms highlighted.

Dynamic teaser

See document element.

Element

A feature provided by the FAST Document Processing Engine. Documents sent through the NewsSearch document processing pipeline are processed to detect geographical, personal, or company names to enhance relevancy.

Entity extraction

Part of an index profile. Fields specify those elements of a document that are to be searchable or presented in the result.

Field

Used to collapse a group of results with similar value for a given field to a single entry in the result set.

See also Duplicate detection.

Field collapsing
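
The collapsing behavior described for this entry can be sketched as follows. This is a minimal illustration, not the FAST ESP implementation: it assumes results arrive in rank order, treats "similar" as exact equality of the field value, and uses hypothetical field names.

```python
def collapse_by_field(ranked_results, field):
    """Keep only the first (highest-ranked) result for each value of `field`."""
    seen = set()
    collapsed = []
    for result in ranked_results:
        key = result.get(field)
        if key in seen:
            continue  # a higher-ranked result with this value was already kept
        seen.add(key)
        collapsed.append(result)
    return collapsed


# Hypothetical result set, already sorted by rank.
results = [
    {"title": "Annual report (PDF)", "site": "example.com"},
    {"title": "Annual report (HTML)", "site": "example.com"},
    {"title": "Press release", "site": "example.org"},
]
collapsed = collapse_by_field(results, "site")
```

Here the two results from the same site collapse into a single entry, so the collapsed list contains one result per site.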

A content connector that traverses file server directories.

The File Traverser works along directory structures, whereas the FAST Crawler crawls Web servers along URI structures.

File Traverser

Matching engine that enables filtering of documents.

Filter engine

An application forming a node that interacts with the Filter Engine through the Alert API and the Alert Query API.

Filter engine client

The main administrative process in the Filter Engine system.

Filter manager

Enhances relevancy by boosting the most recent documents, for example, for time- or age-sensitive documents.

Freshness boosting

Sorting on a configurable number of characters in the field on which you want to sort your results.

Full text sorting

An XML file that defines the way documents are to be searchable. It contains fields to which the elements of a document are mapped.

Index profile

Alternative grammatical forms of words are extracted and added to the document in a separate document section to enable searching for other grammatical forms than that given in the query string.

Lemmatization

A type of lemmatization that expands words into the full set of inflected forms.

Lemmatization by expansion

A type of lemmatization also referred to as “base form reduction”, which normalizes indexed terms and query terms to their grammatical base form. For example, “ate” becomes “eat.”

Lemmatization by reduction
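
The difference between lemmatization by reduction and lemmatization by expansion can be illustrated with a toy dictionary. The sketch below is an assumption-laden example: the `BASE_FORMS` table and both function names are hypothetical, and a real deployment would rely on language-specific dictionaries rather than a hand-written mapping.

```python
# Hypothetical dictionary mapping inflected forms to their base form.
BASE_FORMS = {"ate": "eat", "eaten": "eat", "eats": "eat", "eating": "eat"}


def reduce_term(term):
    """Lemmatization by reduction: normalize a term to its base form."""
    term = term.lower()
    return BASE_FORMS.get(term, term)


def expand_term(term):
    """Lemmatization by expansion: return the full set of known forms."""
    base = reduce_term(term)
    forms = {form for form, b in BASE_FORMS.items() if b == base}
    forms.add(base)
    return forms


reduced = reduce_term("ate")    # base form of "ate"
expanded = expand_term("eat")   # all forms known for "eat"
```

With reduction, both the indexed term and the query term are mapped to the same base form before matching. With expansion, the term is replaced by the whole set of forms so that any of them can match.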

The number of links in a set that refer to a given document. It is best used to determine the relevancy of a Web page by factoring in how many other pages refer to the page under consideration.

Link cardinality
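
Counting the links that refer to each document can be sketched in a few lines. The example assumes a link set represented as (source, target) pairs; that representation and the sample URLs are hypothetical and are not taken from FAST ESP.

```python
from collections import Counter


def link_cardinality(links):
    """Count, for every target document, how many links in the set refer to it."""
    return Counter(target for _source, target in links)


# Hypothetical link set: (referring page, referenced page).
links = [
    ("a.html", "home.html"),
    ("b.html", "home.html"),
    ("c.html", "home.html"),
    ("a.html", "b.html"),
]
cardinality = link_cardinality(links)
```

A page with a higher count is referred to by more links in the set and could be given a higher relevancy contribution.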

Type of log message, such as INFO or VERBOSE.

Log level

Sorting by multiple fields. Both text and integer fields may be sorted upon (ascending or descending). Field sorting may be combined with rank sorting.

Multi-level sorting
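
Multi-level sorting can be illustrated with a stable sort applied once per sort level, least significant level first. The function name, the `levels` parameter, and the sample fields below are hypothetical; this is a sketch of the general technique, not of how FAST ESP sorts internally.

```python
def multi_level_sort(results, levels):
    """Sort by several fields.

    `levels` is a list of (field, descending) tuples, most significant first.
    """
    ordered = list(results)
    # Python's sort is stable, so sorting by the least significant level first
    # and the most significant level last yields a correct multi-level order.
    for field, descending in reversed(levels):
        ordered.sort(key=lambda r: r[field], reverse=descending)
    return ordered


# Hypothetical results with one text field and one integer field.
results = [
    {"category": "b", "price": 10},
    {"category": "a", "price": 5},
    {"category": "a", "price": 20},
]
# Category ascending, then price descending within each category.
ordered = multi_level_sort(results, [("category", False), ("price", True)])
```

A rank value could be added as one more level in the same way, which corresponds to combining field sorting with rank sorting.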

Using linguistic analysis to infer meaning from human-written text that could not be extracted using the individual word meanings.

Natural language processing (NLP)

Information discovery through drill-down into query results. Navigation is possible both on document level attributes/entities and contextual entities within the matching context of the search results.

Navigation

Construct that enables filtering and grouping of search results. On an international site, you may have a navigator that enables you to only display results with content in a given language – for instance, “Display English results only.”

Navigator

Analyzing input to determine its grammatical structure with respect to a formal grammar.

Parsing

Precision for a query means the ability to retrieve the most precise results. Higher precision means better relevance ranking and more precise results.

Precision

Identifying word sequences in text that are defined as proper names or phrases in the appropriate dictionary.

Proper name recognition

Number of queries that the enterprise search platform will process in one second. This is normally a function of hardware (capability) and licensing (what is allowed due to contract terms).

Queries per second (QPS)

Search request sent to the FAST ESP system.

Query

A module that processes and transforms queries and processes results.

Query & Result Server

An API that is used to submit search requests from an end-user or customer front-end application to the FAST ESP system.

Query API

Semantic rules that must be observed when submitting queries to a search engine – for example, the use of parentheses and Boolean operators. Sometimes, a query transformation stage may be used to allow end users to use a different syntax from the one expected by the search engine.

Query syntax

The ability to support different relevance weight for different terms in a query.

Query term weight

An element in the Index Profile controlling how ranking is performed for queries. Within the rank profile you can tune the different components of the ranking such as proximity, freshness and context.

Rank profile

Arranging result documents according to their relevancy value.

Ranking

Recall for a query means the ability to retrieve as many documents as possible matching a query. Recall may be improved by linguistics processing such as lemmatization, spell check and synonym expansion.

See also Precision.

Recall
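
The Precision and Recall entries describe these measures informally. As a point of comparison, the sketch below computes the standard information-retrieval definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. These formulas are the conventional definitions rather than something stated in this guide, and the document identifiers are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, using set-based definitions."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: four documents returned, three documents truly relevant.
precision, recall = precision_recall(["a", "b", "c", "d"], ["a", "b", "e"])
```

Techniques such as lemmatization and synonym expansion tend to retrieve more matching documents, which raises recall and may lower precision if some of the additional matches are not relevant.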

Set of document summaries that point to the resulting documents returned for a query.

Result set

Defines alternative ways for a query front end to view the index with relation to queries. A result view is defined by the set of fields returned for the query.

Result view

Provides implicit ranking of dimensions (search result fields) based on relevance scores.

Result-based binning

A type of dynamic drill-down navigator. Drill-down navigators are created across an extended but non-exhaustive result set (typically, the 200 highest ranked results).

Result-side (shallow) navigators
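
The contrast between deep navigators (aggregation across the entire result set) and result-side navigators (aggregation across only the highest ranked results) can be sketched as follows. The function name, the `depth` parameter, and the sample data are hypothetical and serve only as an illustration.

```python
from collections import Counter


def navigator_counts(ranked_results, field, depth=None):
    """Aggregate value counts for `field`.

    depth=None aggregates over the whole result set (deep navigator);
    depth=N aggregates over the N highest ranked results (shallow navigator).
    """
    window = ranked_results if depth is None else ranked_results[:depth]
    return Counter(result[field] for result in window)


# Hypothetical result set, sorted by rank.
ranked = [
    {"lang": "en"}, {"lang": "en"}, {"lang": "de"},
    {"lang": "fr"}, {"lang": "de"},
]
deep = navigator_counts(ranked, "lang")              # entire result set
shallow = navigator_counts(ranked, "lang", depth=2)  # top 2 results only
```

The shallow navigator sees only the values that occur among the top-ranked results, so values that appear further down the result set are missing from its counts.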

A search installation may be configured in a row/column configuration for performance and fault-tolerance reasons.

Multiple Columns are used in order to partition the indexed content for large data volumes. Each column contains a unique subset of the indexed content.

Multiple Rows are used for query performance scaling and fault-tolerance. Each row within a column is identical with respect to the indexed content.

Rows and columns
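
The row/column layout can be pictured as a matrix of search nodes. In the minimal sketch below, each document is assigned to exactly one column, and a query is answered by one row, which must consult every column to cover the full index. Hash-based assignment with `zlib.crc32` is an assumption for illustration only; this guide does not describe how FAST ESP distributes content across columns, and both function names are hypothetical.

```python
import zlib


def column_for(doc_id, num_columns):
    """Assign a document to one column (assumed here: hash partitioning)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_columns


def nodes_for_query(row, num_columns):
    """A query handled by one row must reach every column of that row."""
    return [(row, column) for column in range(num_columns)]


column = column_for("doc-1", num_columns=3)
nodes = nodes_for_query(row=1, num_columns=3)
```

Adding columns spreads a larger index over more nodes, while adding rows provides additional identical copies that can serve queries in parallel or take over if a node fails.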

Contains hierarchically structured content. It enables schema flexibility and the ability to conserve hierarchical relationships rather than flattening the data as is often required by meta-data engines.

Scope field

Group of search nodes (row/column matrix) that shares the same index schema (index profile).

Search cluster

Matching engine that enables indexing and search in indexed documents.

Search engine

Group of search engine instances that share one index profile. Also called search cluster.

Search engine cluster

Searchable documents organized as a set of binary indices. After indexing, documents are searchable and form an index.

Search index

Concept used to identify the set of search attributes common for a given search application. This includes global filter constraints (such as collection), query processing parameters (such as linguistics) and result handling parameters (such as navigation settings).

Search profile

Ability to search for similar documents. In a search context, similar may mean similar to a document in a result set or similar to an example document.

See also Document vector.

Similarity searching

The process of ordering documents within the search results.

Sorting

Correcting or proposing corrections to common typing errors in the search query based on a comparison of the query's terms/phrases with a dictionary.

Spell-checking
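
Dictionary-based correction of a query term can be sketched with Python's standard `difflib` module. The dictionary contents and the `suggest` function are hypothetical, and the similarity measure that `difflib` uses is only a stand-in for whatever comparison a real spell-checker applies.

```python
from difflib import get_close_matches

# Hypothetical dictionary of known terms.
DICTIONARY = ["search", "sort", "rank"]


def suggest(term, dictionary=DICTIONARY):
    """Return the closest dictionary term, or the term itself if none is close."""
    if term in dictionary:
        return term  # already a known term, nothing to correct
    matches = get_close_matches(term, dictionary, n=1, cutoff=0.6)
    return matches[0] if matches else term


suggestion = suggest("serach")  # a common transposition typo
```

Whether the suggestion replaces the query term automatically or is only proposed to the user is a separate design decision.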

Searching for parts of a string as with a wildcard search ("*term*").

A word or token (for Asian language documents) is split up into smaller entities, sub-strings, consisting of a defined number of signs.

Sub-string search
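
Splitting a token into sub-strings of a defined length can be sketched as follows. The function name and the choice of overlapping sub-strings are assumptions for illustration; this guide does not specify whether the sub-strings overlap.

```python
def substrings(token, size):
    """Split a token into overlapping sub-strings of `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    if len(token) <= size:
        return [token]  # token is too short to split further
    return [token[i:i + size] for i in range(len(token) - size + 1)]


parts = substrings("search", 3)
```

Indexing these sub-strings makes it possible to match a query such as "*ear*" without scanning every full token.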

A defined hierarchy of categories. A treelike structure of customer or market-specific terminology that defines how categories relate to one another.

Taxonomy

A module that allows you to configure and maintain taxonomies and the mapping of categories.

Taxonomy Explorer

A short summary of a document, returned as part of a search result.

A Static Teaser is generated during document processing (query independent). A Dynamic Teaser is generated based on the actual query, showing the regions of the document matching the query.

Teaser

Splitting up text into word entities. This involves detection of white-space characters and other symbols that separate words from each other and are not relevant for the matching process.

Tokenization
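
A very simple tokenizer for white-space-delimited languages can be sketched with a regular expression. Treating every run of word characters as one token is an assumption for illustration; real tokenizers apply language-specific rules.

```python
import re


def tokenize(text):
    """Split text into word tokens, discarding white space and punctuation."""
    return re.findall(r"\w+", text)


tokens = tokenize("Crawl Web servers, then index.")
```

The comma, the period, and the spaces separate words but are not kept, since they are not relevant for matching.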

Clustering results in result groups based on the document similarity calculated through vectorization.

Unsupervised clustering

Calculating a document vector for a given document. A document vector is the numerical representation of the unstructured textual content of a document.

Vectorization

Can be used to substitute any other character or characters in a string. Common wildcards include "*" (zero or more characters) and "?" (a single character).

Wildcard
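
The behavior of the "*" and "?" wildcards can be tried out with Python's standard `fnmatch` module, which implements the same two symbols (along with bracket expressions that are not discussed here). This only illustrates wildcard semantics and says nothing about how FAST ESP evaluates wildcard queries.

```python
from fnmatch import fnmatchcase

# "*" matches zero or more characters.
star_match = fnmatchcase("searching", "search*")

# "?" matches exactly one character.
question_match = fnmatchcase("term", "t?rm")

# "?" cannot match two characters, so this pattern does not match.
no_match = fnmatchcase("term", "t?m")
```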
