Thursday 17 Apr 2008

Create an intelligent search engine

When we build a website containing many pages based on the same template (blog, shop, forum, directory...), we always need a search engine to extract results from the database.

This is where things get harder: a search for "videos" won't find articles containing the word "video", even though it should. :-(

Fortunately, there is a solution: use a soundex. It's an algorithm that compares words based on their phonetics, so the results sound close to the searched term.
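
As a quick illustration, PHP's built-in soundex() returns a four-character key, and two words are treated as sounding alike when their keys are equal. Here is a minimal sketch; the helper name soundsAlike is made up for this example.

```php
<?php
// Hypothetical helper: true when two words share the same Soundex key.
function soundsAlike(string $a, string $b): bool
{
    return soundex($a) === soundex($b);
}

var_dump(soundex('Robert'));                // string(4) "R163"
var_dump(soundsAlike('Robert', 'Rupert'));  // bool(true): both map to R163
var_dump(soundsAlike('Robert', 'Video'));   // bool(false): the keys start with different letters
```

The key always starts with the first letter of the word, so two words that begin with different letters never match.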

After some research, I found that MySQL offers a built-in soundex, unfortunately designed mostly for English. The other problem is that the searched word is not necessarily stored alone in a database field. If it's part of a longer text, that doesn't work anymore.

'reseach' is my word to find:
SELECT * FROM articles WHERE SOUNDEX(titre) = SOUNDEX('reseach')
equivalent to:
SELECT * FROM articles WHERE titre SOUNDS LIKE 'reseach'

We can also create a "full text" index in MySQL. That allows searching for a single word within a larger text with MATCH:

SELECT * FROM articles WHERE MATCH (titre) AGAINST ('reseach')

But now we only match the exact word, not words close to it. Ideally we could combine both approaches, to find ANY word in the database that is CLOSE to the searched term.

But that doesn't exist (for now)... :-|

(Sorry, I don't really know databases other than MySQL, even if I would prefer to work with PostgreSQL.)

My solution: create a dictionary table in the DB that contains the soundexes of all the words on the website, each with a reference to the article where it can be found.

By searching for the soundex of "videos" in my dictionary table, I find the references to the articles containing "video", and I only need to print the results.

One caveat: the matched words are sometimes very far from the searched term, even when the phonetics are close. So I use the levenshtein function before showing a result. If it's too far, I discard it.
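
The filtering step could look something like the sketch below. The function name filterByDistance and the threshold of 2 are assumptions made for this example, not values from my real code; the right threshold depends on your content.

```php
<?php
// Keep only the candidate words whose edit distance to the search term is small.
// $maxDistance = 2 is an arbitrary threshold chosen for this sketch.
function filterByDistance(string $term, array $candidates, int $maxDistance = 2): array
{
    $term = strtolower($term);
    $kept = [];
    foreach ($candidates as $word) {
        if (levenshtein($term, strtolower($word)) <= $maxDistance) {
            $kept[] = $word;
        }
    }
    return $kept;
}

// All three candidates share the Soundex key of "Robert" (R163),
// but "Robertson" is 3 edits away and gets discarded.
print_r(filterByDistance('Robert', ['Rupert', 'Robertson', 'Roberts']));
// keeps "Rupert" (distance 2) and "Roberts" (distance 1)
```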

The hard part is actually saving the articles. You must add a function that takes care of:

  1. deleting the existing references to the article, if it was already indexed
  2. splitting the article word by word (and dropping words shorter than 3 letters)
  3. calculating the soundex of these words
  4. checking whether they are already in the database and, if so, adding the reference to the article
  5. otherwise adding a new "sound" to the dictionary, with the article reference.
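
The steps above can be sketched as follows. To keep the example self-contained, a plain PHP array stands in for the dictionary table (sound => article ids); a real version would run the equivalent DELETE and INSERT queries. The function name indexArticle and the ASCII-only word splitting are assumptions made for this sketch.

```php
<?php
// In-memory stand-in for the dictionary table: soundex key => [article id => true].
function indexArticle(array &$dictionary, int $articleId, string $text): void
{
    // 1. Drop any existing references to this article.
    foreach ($dictionary as $sound => $articles) {
        unset($dictionary[$sound][$articleId]);
        if ($dictionary[$sound] === []) {
            unset($dictionary[$sound]); // no article left for this sound
        }
    }

    // 2. Split the text into words (ASCII letters only in this sketch)
    //    and skip the words shorter than 3 letters.
    $words = preg_split('/[^a-z]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        if (strlen($word) < 3) {
            continue;
        }
        // 3. Compute the sound. 4-5. Add the article reference,
        //    creating the dictionary entry if it does not exist yet.
        $dictionary[soundex($word)][$articleId] = true;
    }
}

$dictionary = [];
indexArticle($dictionary, 1, 'A video about Robert');
indexArticle($dictionary, 2, 'Rupert is here');

// Articles containing a word that sounds like "Robert": ids 1 and 2.
print_r(array_keys($dictionary[soundex('Robert')] ?? []));
```

Saving an article again simply calls the same function: the first step clears its old references before the new ones are added.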

So, to do this you need a function that produces the "sound" of a word. You can stick to the soundex function provided by PHP.

If your website is not in English, or if you need more powerful matching, you will need to build your own soundex. You only need to associate groups of letters with a sound. Check this soundex example for French. You can even test it at the bottom of the page by typing a word to see its soundex.
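
To give an idea of what "associating groups of letters with a sound" can mean, here is a deliberately tiny sketch. The function name simpleSoundKey and its handful of rules are invented for this example; they are not the rules of the French soundex mentioned above, and a usable version needs many more rules, plus accent handling.

```php
<?php
// Hypothetical, very small rule set: each group of letters maps to one "sound".
function simpleSoundKey(string $word): string
{
    $rules = [
        'eau' => 'o',
        'au'  => 'o',
        'ph'  => 'f',
        'qu'  => 'k',
        'ck'  => 'k',
        'c'   => 'k',
    ];
    // strtr() tries the longest group first and never re-replaces its own output.
    $key = strtr(strtolower($word), $rules);
    // Simplifying assumption: a final "s" is silent (typical French plural).
    $key = preg_replace('/s$/', '', $key);
    // Collapse doubled letters ("tt" => "t").
    return preg_replace('/(.)\1+/', '$1', $key);
}

var_dump(simpleSoundKey('videos') === simpleSoundKey('video')); // bool(true)
var_dump(simpleSoundKey('photos'));                             // string(4) "foto"
```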

Don't forget that refining with levenshtein is essential.

I built my own implementation on top of this soundex. It now works great, and I use it on the website of one of my clients. :-)

