SQL Full Text Search Programming

Download SQLFullTextSearchProgramming - 360.35 KB

1. Continuation to Lucene

Last year, I wrote an article on Lucene search in CodeProject. That strikes me to discover the alternatives of Lucene search. Few technology factors are towards SQL Server 2008 Full Text search concept.

2. Full Text Search

Definition

Full-text search refers to the functionality in SQL Server that supports full-text queries against character-based data. These types of queries can include words and phrases as well as multiple forms of a word or phrase.

Why do we need?

Over the last decade, the focus of the commercial database management community has been primarily on structured data and the industry as a whole has been fairly effective at addressing the needs of these structured storage applications. However, only a small fraction of the data stored and managed each year is fully structured while the vast preponderance of the data stored is either wholly unstructured or only semi-structured in the form of documents, web-pages, spreadsheets, email and other weakly structured formats.

This work investigates the features and capabilities of the full text search access methods.

SQL Server Support

SQL Server 2008 provides the functionality for applications and users to issue full-text queries against character-based data for more than 50 diverse languages, such as English, Spanish, Chinese, Japanese, Arabic and Hindi.

For each supported language, SQL Server provides language-specific linguistic components, including a word breaker and stemmer and an empty thesaurus file.
For each full-text language, SQL Server also provides a file in which you can optionally define language-specific synonyms to extend the scope of search queries as a thesaurus file.
For writing full-text queries, SQL Server provides a set of full-text predicates (CONTAINS and FREETEXT) and row set-valued functions (CONTAINSTABLE and FREETEXTTABLE).
Applications can perform a variety of types of full-text searches, such as searching on a single word or phrase and optionally ranking the result set
Applications can search on a word or phrase close to another word or phrase
Applications can searching on synonymous forms of a specific word.

As the key check list, SQL Server Full text filter daemon service needs to run as below:

Daemon Service

SQL Server provides a system stop list. Developers can alter the system stop list by adding and removing stop words, also known as noise words.

3. Full Text Index

Each full-text index indexes one or more columns from the base table, and each column can have a specific language. We can create a full-text index on a table or indexed view in a database. Only one full-text index is allowed per table or indexed view. The index can contain up to 1024 columns.

Before full-text queries can be run on a given table, the database administrator must create a full-text index on the table. The full-text index includes one or more character-based columns in the table. These columns can have any of the following data types: char, varchar, nchar, nvarchar, text, ntext, image, xml, varbinary, or varbinary(max). A full-text index is made up of word tokens that are derived from the text being indexed.

4. Development Steps

To implement SQL Full Text Search programming, we have five step processes as below:

a. Data Setup

Let us take a classic example of Employee table for our illustration. As the first step, Employee table is created with the name 'EmpName' of below schema:

Table Schema

Our objective is to search the employee records with the free flow search of any field i.e. the user can extract the record which matches any column in the database table. As an example, the user can enter a name ('Raj') to search from EmpName table. The result may be from FirstName, LastName, ManagerFirstName, ManagerLastName. If the user enters 'Paul Raj', the full text search fetches either Paul or Raj from any of the name columns in EmpName table.

Let us load the sample data of 5 below records into EmpName table as:

Sample Data

b. Full Text Column Setup

To initiate the full text search process, let us identify full text column, which is the consolidated data of each text columns that you want to include in the full-text. In our example, let us create a new column FullTextValue in EmpName table as below:

Full Text Column

To fill the data in newly added FullTextValue column, let us execute the below query

<P>
UPDATE EmpTable
SET FullTextValue = [EmpID] +
      + ' ' +  [FirstName]
      + ' ' +  [LastName]
      + ' ' +  [ManagerFirstName]
      + ' ' +  [ManagerLastName]
      + ' ' +  [DivisionName]
      + ' ' +  [Location]
      + ' ' +  [Country] 
</P>

Now, the data elements of EmpName table looks like:

Sample Items

c. Full Text Catalog Setup

A full-text catalog provides a mechanism for organizing full-text indexes. Each catalog can contain zero or more indexes, but each index can be associated with only one catalog A full-text catalog is a logical concept that refers to a group of full-text indexes. The catalog is not associated with a file group.

As the thumb rule, if a given column contains documents stored as binary data, we must specify a table column that identifies the type of each document in the column being indexed.

New full text catalog is created in SQL Server by expanding 'Storage' section of the selected DB instance, right click menu of 'Full Text Catalog' option as below:

New Catalog

The catalog system uses SQL Server full-text catalogs to store and search catalog content. In our example, a new full text catalog namely 'FT_Employee' has been created in the full text catalog wizard as below:

Wizard

Our new catalog is created successfully by running the above wizard. We can proceed with full text index definition.

d. Full Text Index Setup

To match the newly created catalog with the proper DB table, let us run the process of defining the full text index on the table. In our example, let us go EmpTable where we want to achieve the full text search algorithm. On right click of the table in Database explorer, we get 'Full-Text index' menu, sub menu with 'Define Full-Text index' as below:

Define Catalog

On clicking Define Full-Text index option, it launches full text indexing wizard as:

Define Catalog

As the first step of the process, the unique index of the given table 'EmpTable' is selected from the populated drop down box. In our example, it is EmpID table with the primary key definition of PK_EmpTable. So, unique index of the table is selected as:

Define index

Next, we need to select the table columns in terms of full text queries. Developer can select multiple table columns and so the user interface is with check boxes. In our example, we have the consolidated column 'FullTextValue' as the eligible table columns for full text query and so the column is selected as:

Define Column

Another important step is to select an existing full text catalog. If not exists, SQL provides the facility to create a new catalog as marked in a check box. In our example, we created our own full text catalog namely 'FT_Employee' in the previous section. So, let us select the created full text catalog FT_Employee as:

Define Catalog

A stopword can be a word with meaning in a specific language, or it can be a token that does not have linguistic meaning. For example, in the English language, words such as "a," "and," "is," and "the" are left out of the full-text index since they are known to be useless to a search. In our example, we disable the full text stop list by selecting off option from the above highlighted drop down box.

Define Catalog

Now, we successfully created full text index on EmpTable with 0 error and 0 warning. If we encountered with any issue, it might be reported in the Report expand button with the details.

e. Full Text Query

It is all set now. As the last step, we can fetch the result of the full text query as:

<P>
SELECT tbl.*
FROM EmpTable tbl
   JOIN containstable 
    (EmpTable, FullTextValue, '"3"') as key_tbl
   ON tbl.EmpID = key_tbl.[Key]
ORDER BY key_tbl.[Rank] asc
</P>

In this example, we are trying to fetch the records with the contained data '3' in any of the column of EmpTable. It returns the third row of EmpTable because it contains 3 in all the columns EmpID, FirstName, LastName, ManagerFirstName, ManagerLastName, DivisionName, Location and Country. SQL returns this record even if any one of the above column contains 3.

Thus, we can achieve the full text search mechanism in SQL Server.

5. Architecture

With the above implemented five steps, let us see the details of architecture components of full text search.

Architecture

Let us split the above architecture in three layers namely SQL layer, Transport layer and Daemon layer.

a. SQL Layer

SQL Layers contains user tables, filter daemon manager, thesaurus & stop list, query processor and full text index.

Filters are invoked by the daemon based on the type of the content. Filters parse the content and emit chunks of processed text. A chunk is a contiguous portion of text along with some relevant information about the text segment like the language-id of the text, attribute information if any etc. Filters emit chunks separately for any properties in the content. Properties can be items such as title or author and are specific to the content types and therefore understood by the filters.

When searching for a keyword, all indexes in this stack need to be searched so there is some advantage in keeping the number of indexes to a small number. During insertion and merge operations, distribution and frequency statistics are maintained for internal query processing use and for ranking purposes.

Indexed text in SQL Server can range from a simple character string data to documents of many types, including Word, Powerpoint, PDF, Excel, HTML, and XML. The document filter support is a public interface, enabling custom filters for proprietary document formats to be integrated into SQL Server. In our example, it is pretty straight forward by putting all the search content in a single column namely 'FullTextValue'; but SQL supports various document types too.

b. Transport Layer

In general, transport layer ensures the reliable arrival of messages and provides error checking mechanisms and data flow controls. In this architecture, there are three types of message flow through named pipe and shared memory.

Database data to be indexed resides in shared memory of the server. Later, protocol handler pulls the data from this memory for further processing and accesses data from a user table in a specified database. Some data types require filtering before the data in a document can be full-text indexed. Batch requests are processed to filter the chunks of text from the document for the stream of textual information result. Indexed key words such as word breakers and stremmers are loaded in the shared memory, as mentioned in the transport layer.

c. Daemon Layer

Daemon layer is a process that is started by the Full-Text Engine. It runs the following full-text search components, which are responsible for accessing, filtering, and word breaking data from tables, as well as for word breaking and stemming the query input.

Protocol Handler gathers data from the columns being full-text indexed and pass it to the filter daemon host.

Filter component performs the extra processing of indexing documents in an varbinary, varbinary(max), image, or xml data.

In the right most Word breakers process, the filter extracts the textual information from the document and then sends the text to the word-breaker component for the language associated with the table column. Word breakers and stemmers perform linguistic analysis on all full-text indexed data. Linguistic analysis involves finding word boundaries and conjugating verbs i.e. stemming. For a given language, a word breaker identifies individual words by determining where word boundaries exist based on the lexical rules of the language. Each word or token is inserted into the full-text index using a compressed representation to reduce its size. The stemmer generates inflectional forms of a particular word based on the rules of that language.

FDHOST Launcher service (MSSQLFDLauncher) Launcher service is used by full-text search in SQL Server 2008 R2 to start the filter daemon host process, which handles full-text search filtering and word breaking. This service must be running to use full-text search. The service is an instance-aware service that is associated with a specific instance of SQL Server. In our example, we validated this service by getting into the services section of computer management, as shown in the first image.

6. Points of Interest

On comparing full-text-search possibilities using Lucene and SQL server, it depends on the data itself and your needs.

In terms of scale out web-servers, lucene is better due to load handling on par with SQL. At the same time, SQL Server's Index Server capabilities to full-text index and query Word and PDF documents quite successfully. If you are doing constraints that lucene can't provide, then you will almost certainly want to use SQL methodology.

In a nutshell, SQL Server full text search is better than Lucene when the priority is towards result relevance & ranking, searching & indexing, ease of use and ease of integration with Microsoft technologies. In terms of big data volume, Lucene serves better.

History

Version 1.0 - Initial Version.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

GanesanSenthilvel

Architect
India

Currently working as IT Solution Architect for Financial Services applications. Out of 18+ years, spent six career years at major IT firms in USA. Basically from C, C++, VC++, C# family, extending to the emerging Big Data and Cloud technology. Loves driving, spending time with friends&family;, building my farm house.

Article Top