OpenSearchServer is a comprehensive suite of software tools and an applications server. With these, you can develop index-related applications with full text capabilities.
When we created OpenSearchServer, we wanted a solution suited to specialists in full text processing, and we also wanted professional developers who have not yet tried these technologies to be able to enter this field easily.
We have noticed in the last few years that the use of Search-based applications has been increasing in situations where developers want to optimize access to a specific document among masses of documents. Search-based applications can do this far better than database applications, which is why their use is increasing so much in this situation. More generally, Search-based applications provide a critical improvement when fast access to specific data is the challenge, when interpretation of a query is necessary before providing an answer to a human query (see Analyzers) or when the documents to be searched do not have a similar structure.
Search tools and technologies were first widely used by non-specialists for searching the web, either through a search engine portal (e.g. Google, Bing, Yahoo, etc.), or for an in-web search (finding data in a specific website). These technologies have also been used to create Enterprise Search applications for accessing and processing both structured and non-structured data.
The advantages of this new category of Search-based development methods and tools will help you to create unsurpassed applications in situations where performance and the cost of development are key factors. More professional developers will want to try and adopt these methods and tools and we were thinking of these developers when we designed OpenSearchServer. It consists of a series of tabs and, even when creating the most sophisticated applications, you just need to tick boxes and fill in some information. You do not have to write a single line of code! You just concentrate on configuring your data and using analyzers which help you to deliver the job.
Because we know that no two projects have the same needs and because when full text is involved you will need to enter specific context vocabulary, OpenSearchServer allows you to have full access to any function or rule provided, so that nothing in OpenSearchServer is a “black box” for developers and any customization that you can imagine will be possible, to help you to create your very own specific application. You can learn more about all these possibilities by reading the comprehensive free online documentation and by using our Quickstart module to install and configure your first application easily.
We know that for some of you, these tools are quite new and we will answer some of the questions we are often asked below. We will also publish some business case studies showing some concrete business situations which our customers have experienced and showing how OpenSearchServer helped them to progress and create value for their company and their customers.
People often say this to us. As we said at the beginning of this article, we are aware of this and we constantly kept database specialists in mind when we were creating OpenSearchServer. Of course, most of our team were educated on databases first and had used them for years and had discovered their limitations. If we need to do a search, database applications have two kinds of limitations:
The time to access data with a Search application is between 40 and 70 times shorter than the time needed for a query in a relational database application, so the performance advantages of Search applications are very easy to understand. The goal of the creation of the relational database concept was to facilitate the sophisticated processing of structured data, including queries, comparisons, merging, printing, formatting and back-office administrative tasks. This goal drove development of the well-known tables structure and join commands, which allow developers to build a fully relational data model, with access to information in tables, using indexed fields.
In a Search application, on the other hand, all the information is in the index, so accessing information is much faster. A second issue has an obvious impact on performance: in database structures most of the data is stored on hard drives. By consulting the table index (generally one field) held in RAM, the application learns where the data is on the disk and can then access it. Database applications therefore continuously make a huge number of accesses to data stored on disk, and we all know that this is the slowest operation a computer can perform.
Search applications work in a completely different way. When the index is created, the entire contents of the documents are included in the index, with a pointer to the original document. Data in the index are written and sorted using very sophisticated algorithms, allowing very fast and powerful browsing of the index to identify entries matching the user's query and to extract corresponding documents sorted by relevance.
This is a very important reason why database applications cannot compete with Search applications in the area of data access performance. And of course, when the only need is to access data to answer a query, database performance falls dramatically below any user's expectations. Benchmarks show that, when database applications access data for a query, they can be up to 100 times slower than Search-based applications.
With the pre-eminence of internet usage today and the leadership of websites like Google.com or Bing.com, no professional user or company executive can understand why a query needs 3 seconds to return an answer when they see that any teenager can get more than 207 million results in less than 0.16 seconds by typing "Michael Jackson" in a Google search box.

Selective access to data in relational databases is made through SQL (Structured Query Language). For example, if you wish to extract records from a products database for products that are still unsold (that is, still in stock), you might write the following statement:
SELECT product_number, product_description, stock_count FROM Products WHERE stock_count > 0 ORDER BY product_number;
SQL is very well designed for accessing structured data, but it has serious limitations if we need to retrieve information from a huge volume of unstructured data.
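To make the SQL example concrete, here is a minimal sketch that runs the same kind of statement against an in-memory SQLite database from Python. The table contents are invented for illustration, and the misspelled lookup at the end is our own addition to show that a plain SQL comparison only returns exact matches.

```python
import sqlite3

# In-memory database with a hypothetical Products table (sample rows are invented).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Products ("
    "product_number INTEGER, product_description TEXT, stock_count INTEGER)"
)
conn.executemany(
    "INSERT INTO Products VALUES (?, ?, ?)",
    [(3, "Television", 4), (1, "DVD player", 0), (2, "Radio", 7)],
)

# The statement from the text: products that still have stock, ordered by number.
in_stock = conn.execute(
    "SELECT product_number, product_description, stock_count "
    "FROM Products WHERE stock_count > 0 ORDER BY product_number"
).fetchall()
print(in_stock)  # [(2, 'Radio', 7), (3, 'Television', 4)]

# A misspelled description finds nothing, because the comparison is exact.
misspelled = conn.execute(
    "SELECT product_number FROM Products WHERE product_description = ?",
    ("Televison",),
).fetchall()
print(misspelled)  # []
```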
Full text search (i.e. a search on unstructured data) needs a very different approach. In our previous example, if the user enters the query incorrectly (e.g. "Michael Jakson" or "Michael Jacksonn") the query will still be correctly processed and the search engine will also recognize these possible errors and will automatically propose that you re-focus the query on "Michael Jackson". Also, if you search for "televisions" on an e-commerce website, you will also get answers for "television". In some sites, you would also have an automatic display of DVD players because people often purchase a DVD when they purchase a television.
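The spelling suggestion described above can be approximated with Python's standard difflib module. This is only a sketch under simple assumptions: the list of known queries is hypothetical, and real search engines use more elaborate techniques than a similarity ratio over whole strings.

```python
import difflib

def suggest(query, known_queries, cutoff=0.8):
    """Return the known query closest to the user's input, or None."""
    matches = difflib.get_close_matches(
        query.lower(), known_queries, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None

# Hypothetical list of queries the engine already knows about.
known = ["michael jackson", "michael jordan", "janet jackson"]

print(suggest("Michael Jakson", known))    # michael jackson
print(suggest("Michael Jacksonn", known))  # michael jackson
print(suggest("zzzz", known))              # None
```

The cutoff value is a design choice: a lower value proposes corrections more often, at the risk of proposing irrelevant ones.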
To reach this level of text processing and query interpretation with SQL would require a very significant amount of programming. Due to the way data are stored, this will result in very low performance, as we saw above.
Search software applies several text analyzers to data when it is being added to the index and the same analyzers are applied to the string used as the query. The resulting interpretation of queries returns a complete set of answers, which are ranked by relevance using very powerful algorithms. These algorithms were developed using very advanced mathematics.
In conclusion, we frequently have to answer this question about the differences between relational database applications and Search applications. Of course our answer depends on the amount of data. If a website has only a few pages, or if you have an eCommerce website with a very restricted number of products on offer, then using a database application would not have a big negative impact. On the other hand, let's imagine that you operate a community site with millions of members and a lot of information held on each member, and you want to allow any member to find other users living in the same country and/or having studied at the same university. In a situation like this, if you wish to provide an optimum experience for the user, a Search application is the only logical choice.
Search engines are mainly devoted to non-structured data, such as web pages, documents from various applications and nearly any type of digital data file other than database tables and binary program files.
A search engine allows a user to ask a query and then to get in response a list of documents matching this query. The response is sorted by relevance, with the most relevant coming first.
So, for each query process there are three steps:
Before being able to process any query, the search engine needs to index the full set of documents to which queries will be applied. This is done by a module called an Indexer.
This operation collects all the different words included in each document and stores them in a big file (the index) with the following information:
The index is stored in a very powerful data structure and algorithms allow extremely fast access to data and browsing in the data and calculating the relevance to a query.
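As a rough illustration of such an index, the sketch below builds a small inverted index in Python. It assumes that each word is stored with the documents containing it and the positions where it occurs; this layout and the function names are our own simplification, not the actual OpenSearchServer format.

```python
from collections import defaultdict

def build_index(documents):
    """Map each word to {document id: [positions of the word in that document]}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(position)
    return index

def search(index, query):
    """Return the ids of documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], {}))
    for word in words[1:]:
        result &= set(index.get(word, {}))
    return result

documents = {
    1: "full text search with an index",
    2: "relational database with tables and joins",
    3: "search applications use an index",
}
index = build_index(documents)

print(index["index"])                         # {1: [5], 3: [4]}
print(sorted(search(index, "search index")))  # [1, 3]
```

Answering a query only requires looking up a few keys in the index; the original documents are never scanned.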
To extract content (words) from each document it has to add to the index, the indexer uses specific libraries called parsers. There is a parser for HTML pages, one for PDF documents, one for Word documents, one for Outlook, one for MP3 files, etc.
As parsers send individual words to the indexer, many text analyzers and optimizers are applied to them, which gives searches in the index their power and versatility.
The lemmatization analyzer will recognize the different inflected forms for each word and store only the lemma. “Drank”, “drinking”, “drinks”, “drinkable”, “drunk” all have the same lemma, which is “drink”.
The Lowercase filter converts all characters to lower case and the ISOLatin1 filter replaces accented characters in Latin-script languages with their unaccented equivalents (é, ê, ë, è are replaced by e), so that the user does not have to type accented characters to find relevant answers.
Also stopwords are removed (e.g. words like “of”, “the”, “for” etc.).
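The lowercase, accent and stopword steps described above can be sketched in a few lines of Python. The stopword list and function names are illustrative assumptions, and lemmatization is left out because it needs a language dictionary.

```python
import unicodedata

# A tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"of", "the", "for", "a", "an", "and"}

def fold_accents(word):
    """Replace accented characters with their unaccented base letters."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def analyze(text):
    """Lowercase, fold accents, strip punctuation and drop stopwords."""
    tokens = []
    for raw in text.split():
        word = fold_accents(raw.lower()).strip(".,;:!?\"'()")
        if word and word not in STOPWORDS:
            tokens.append(word)
    return tokens

print(analyze("The Café of the Élysée"))  # ['cafe', 'elysee']
```

Because the same function would be applied to the query, a user typing "cafe" and a document containing "Café" end up with the same token.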
When indexing has been completed, a large file called the index has been created. When a query is applied to it, the search engine will very quickly list the documents matching this query, showing the most relevant first. According to Wikipedia, browsing an index of 10,000 large documents can be done in milliseconds but browsing all the words of these 10,000 documents sequentially can take hours.
So when the index has been created, the system is ready to receive and answer queries.
After a user enters a query and hits the Search button, all the analyzers used to create the index are applied to the query, with the same effect that was described for index building. The result of this analysis is then processed by powerful algorithms that browse the index, find relevant entries, sort them and return the results. We see a big difference here between database SQL queries and search application queries: in search engines the query is interpreted in order to get a fuller understanding of its meaning, but SQL only returns data which match the query exactly (data which is a perfect match).
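To give an idea of how matching entries can be sorted, here is a minimal sketch of one classic scoring scheme, TF-IDF (term frequency multiplied by inverse document frequency). We use it purely as an illustration; it is an assumption on our part and not a description of the relevance algorithm of OpenSearchServer or of any other engine.

```python
import math
from collections import Counter

def rank(documents, query):
    """Return ids of matching documents, most relevant first (TF-IDF score)."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in documents.items()}
    n_docs = len(tokenized)
    scores = {}
    for doc_id, words in tokenized.items():
        counts = Counter(words)
        score = 0.0
        for term in query.lower().split():
            # Number of documents that contain the term at least once.
            df = sum(1 for other in tokenized.values() if term in other)
            if df == 0 or counts[term] == 0:
                continue
            tf = counts[term] / len(words)      # how often the term occurs here
            idf = math.log(n_docs / df) + 1.0   # rarer terms weigh more
            score += tf * idf
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)

documents = {
    "a": "television television stand",
    "b": "television remote control batteries",
    "c": "garden chairs",
}
print(rank(documents, "television"))  # ['a', 'b']
```

Document "a" comes first because the query term makes up a larger share of its words; document "c" does not match and is left out.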
When a query is processed, all documents linked to entries included in the answer will be displayed on the screen. In order to facilitate the reading and understanding of this list, several powerful functions are applied by the search engine. We list some of them below, described with an example of web search.
First, the answers are sorted. The document with the highest relevance to the query comes first. This relevance is calculated by an algorithm when browsing the index.
If you do a web search and you get a large number of answers, it is useful to see a small description of each entry you get. This is called a snippet and it is automatically extracted from the original document by the search engine.
It is certainly nicer if web pages included in your answer and belonging to the same website are clustered and presented together (listed together with no regard to relevance).
Automatic categorization of answers is a very powerful feature which makes navigation easier when there are hundreds of thousands of answers. Using very powerful algorithms, the search engine categorizes answers in different ways, and you are then able to restrict the display to videos, images, music, blogs, etc., or to documents dated today, last week, last month, last year, etc.
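Three of the functions listed above (snippets, grouping by website and categorized display) can be sketched as follows. The result records, URLs and function names are hypothetical, and a real engine computes all of this from its index rather than from a Python list.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical answers to a query; the fields are invented for illustration.
results = [
    {"url": "http://example.com/a", "type": "video",
     "text": "Concert footage of Michael Jackson on tour in 1988."},
    {"url": "http://example.org/b", "type": "blog",
     "text": "A fan blog about pop music."},
    {"url": "http://example.com/c", "type": "image",
     "text": "Album cover art gallery."},
    {"url": "http://example.org/d", "type": "video",
     "text": "Interview recorded for television."},
]

def make_snippet(text, term, width=15):
    """Extract a short passage of the document around the first match of term."""
    pos = text.lower().find(term.lower())
    if pos == -1:
        return text[: 2 * width]
    start = max(0, pos - width)
    end = min(len(text), pos + len(term) + width)
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + text[start:end] + suffix

def cluster_by_site(results):
    """Group the URLs of the answers by the website they belong to."""
    clusters = {}
    for result in results:
        clusters.setdefault(urlparse(result["url"]).netloc, []).append(result["url"])
    return clusters

def facet_counts(results, field):
    """Count how many answers fall into each category of the given field."""
    return Counter(result[field] for result in results)

def restrict(results, field, value):
    """Keep only the answers belonging to one category."""
    return [result for result in results if result[field] == value]

print(make_snippet(results[0]["text"], "jackson"))
print(cluster_by_site(results))
print(facet_counts(results, "type"))
print([r["url"] for r in restrict(results, "type", "video")])
```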
Search engines are used to make it easier for people to find the digital files they need to access, in various modes. We all know about websites like Google.com and Bing.com, where a user can enter a query and all the URLs related to that specific search are found and returned, sorted by relevance. In this business model, the companies which created these websites (Google and Microsoft, respectively) use search engine technology as a free service to find websites which match internet users' queries. By displaying some advertising information, Google or Microsoft receive the revenue for these websites from third parties, not from the users to whom they are offering the Search services.
This model is also used and offered by many companies offering portals with search capabilities on specific fields.
But no one knows exactly how the algorithms are used or made, and it is very difficult to anticipate the relevancy of a website to a specific query. Worse, Bing or Google periodically change their algorithms, and years of work to fine-tune and optimize rankings go up in smoke. This portal model is offered by many companies, with search capabilities on specific activities or subjects (e.g. sport, cinema, medicines). In these websites the only part of the search engine that you see on the screen is the search box where you write your query. It is like the tip of an iceberg: you do not see the huge power and mass of the search engine behind it.
More generally, we can categorize the uses of search engines as follows:
Content sites (newspapers, special interest groups, etc.), e-commerce sites and community sites are incorporating more and more data. In these models, website owners have a big interest in helping their users or customers to find the information they wish to access easily, in the site owner's own website.
Due to the growing audience of websites, it is very difficult to anticipate how people want to access information and in which order they want to get it. For one website, different users may have very different types of visits. Of course, everyone knows how important it is for e-commerce sites to direct visitors as quickly as possible to the product they want to buy.
Another type of business model is called “freemium”. In this model, basic services are free, but the website owner charges a premium for advanced or special features. In websites that use “freemium”, the website's owners need to be sure not only that free users are happy so they continue using the service and provide the third parties with advertising revenues but also that paying users are kept satisfied, in order to have a low churn rate.
And of course in a corporate website, in which the intention is to present a company in the best possible way, nothing could give users a worse impression than not being able to find information easily when the information they are looking for is somewhere on the company's website.
In a recent study, IDC has shown that a knowledge worker spends an average of 9 hours a week searching for information. That represents a cost of $14,000 per employee per year.
Also, anyone who has worked for a large company on a multi-department project knows how difficult it is to share information with colleagues or partners and how difficult it is to get information from them. Hence, we can easily understand why enterprise IT specialists have tried their best to use and adapt Search technologies in the enterprise field.
Enterprise applications have many specific features that have had a great influence on search engine development, leading to the creation of a brand new category of Search application called Enterprise Search applications, which have the following characteristics:
In a company, documents are of many different types, not only web pages. They can be external, such as websites, or internal files that are local or remote, and they come from various sources and formats (MS Office documents, inboxes, mail servers, central databases, ERP, Enterprise Content Management systems, CRM, etc.).
In corporations, not all users have the right to see all documents, so user profiles should also apply to Search applications.
The word “banner” does not have the same meaning in an advertising company as in a toy manufacturer. Also, if your company's best-selling product is called Paris, it is important that queries do not confuse your product with Paris the city.
Unlike Bing or Google, where you get as many answers as possible, in a corporate environment the best answer to a query is very often one single document. Enterprise Search applications return far fewer answers, but those answers are much more relevant. Google may omit some answers that it considers not very relevant to a query, and this is not normally a problem for Google users. However, in Enterprise Search applications, it might be disastrous to miss even a single document: the system might completely lose credibility.
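The point about user rights can be illustrated with a small sketch in which each indexed document carries the groups allowed to see it, and the answers are filtered against the groups of the user who is searching. The field name allowed_groups and the sample documents are assumptions made for this example, not an actual OpenSearchServer feature description.

```python
def visible_results(results, user_groups):
    """Keep only the documents that at least one of the user's groups may see."""
    return [doc for doc in results if doc["allowed_groups"] & user_groups]

# Hypothetical answers to a query, each with its access rights.
results = [
    {"title": "Holiday schedule", "allowed_groups": {"everyone"}},
    {"title": "Salary review", "allowed_groups": {"hr"}},
    {"title": "Product roadmap", "allowed_groups": {"management", "engineering"}},
]

titles = [doc["title"] for doc in visible_results(results, {"everyone", "engineering"})]
print(titles)  # ['Holiday schedule', 'Product roadmap']
```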
In the field of interconnection with legacy systems and others, Enterprise Search has become a very specialized and very interesting domain. We will issue a specific white paper about this soon. Watch out for this on our website so that you can read our views on this.
Desktop search applications like Google Desktop aim to give an individual user the best possible access to the documents stored on their personal computer. Here again, search technologies are used, and we now see more and more features handled directly by the new version of Windows, including some features obtained from Fast (acquired by Microsoft), and by Mac OS, with its excellent Spotlight technology.
If you want to distribute a large collection of non-structured information on portable media (CD, DVD), you can use a search engine to index this data and allow easy and fast access by entering a query, just as with an internet search engine.
Because search engines have far better performance when accessing data than any other technology, many researchers have tried to find out how to use this power to speed up data access in a non-Search application (e.g. a database application) before the data is processed. Thus, when performing complex processing, if a non-Search application (such as a database application) cannot provide a fast enough access time, this data access sub-process can be given to a search engine. When the data has been found, it is delivered to the rest of the application for processing, if required.
So this new programming approach has two parts:
In this approach, the text of the query is not entered by a human user, but sent by a sub-process to the search engine and the answer is sent back by the search engine to the database.
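This division of work can be sketched with an in-memory SQLite database standing in for the database application and a small dictionary standing in for the search engine's index. The members table echoes the community site example given earlier; all names, data and functions here are hypothetical.

```python
import sqlite3

# The database part: a hypothetical members table with sample rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE members ("
    "id INTEGER PRIMARY KEY, name TEXT, country TEXT, university TEXT)"
)
conn.executemany(
    "INSERT INTO members VALUES (?, ?, ?, ?)",
    [
        (1, "Alice", "France", "Sorbonne"),
        (2, "Bob", "France", "MIT"),
        (3, "Chloe", "France", "Sorbonne"),
        (4, "Dan", "Canada", "Sorbonne"),
    ],
)

# The search part: an index from each term to the ids of matching members.
index = {}
for member_id, country, university in conn.execute(
    "SELECT id, country, university FROM members"
):
    for term in (country.lower(), university.lower()):
        index.setdefault(term, set()).add(member_id)

def search_ids(*terms):
    """Search sub-process: return ids of members matching every term."""
    if not terms:
        return set()
    return set.intersection(*(index.get(term.lower(), set()) for term in terms))

def find_members(*terms):
    """Ask the index for ids, then fetch full rows from the database by key."""
    ids = sorted(search_ids(*terms))
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    sql = f"SELECT id, name FROM members WHERE id IN ({placeholders}) ORDER BY id"
    return conn.execute(sql, ids).fetchall()

print(find_members("France", "Sorbonne"))  # [(1, 'Alice'), (3, 'Chloe')]
```

Note that the index has to be rebuilt or updated whenever the table changes; the sketch ignores this synchronization question entirely.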
An application produced like this is called a Search-Based Application (SBA) and it provides high performance and a better ROI, because SBAs can be created much more quickly. This approach redefines the way people create applications. It seems very promising and, at OpenSearchServer, we are very interested in investigating this direction further. Also, synchronization of the database and the index becomes a key factor, and a lot remains to be done on this subject. We are preparing a full white paper about Search-Based Applications, which will be available soon.