Creating better search suggestions with Sphider

The problem

If you use Sphider, the PHP web spider and search engine, you may have noticed that the "did you mean" function often isn't very accurate. Why is that? Simple, here is what it does with the search terms:

  1. Use MySQL's SOUNDEX function to find the close matches
  2. Use PHP's levenshtein function to find the first closest levenshtein distance.

Well, this is OK, but it often gives poor results, because it keeps the first word it finds at the smallest distance and ignores later words at that same distance, even if one of them is a better match.
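
To see the problem in isolation, here is a minimal sketch of the original loop, run over a plain array instead of a MySQL result. The function name first_closest and the starting distance of 100 are assumptions made for this illustration, not code taken from Sphider.

```php
<?php
// Hypothetical helper for illustration only: mirrors the original loop,
// but reads candidate words from an array instead of a MySQL result.
function first_closest($word, $candidates)
{
    $near_word = "";
    $max_distance = 100; // assumed starting value, larger than any real distance

    foreach ($candidates as $candidate) {
        $distance = levenshtein($candidate, $word);
        // Strict "<" means a later word at the same distance never replaces
        // the one already stored.
        if ($distance < $max_distance && $distance < 4) {
            $max_distance = $distance;
            $near_word = $candidate;
        }
    }
    return $near_word;
}

// Both candidates are one edit away from "wether"; the first one wins.
echo first_closest("wether", array("wither", "weather")), "\n"; // wither
```

Swapping the order of the two candidates swaps the suggestion, which is why the result depends on the order of the rows returned by the database.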

The solution

So what I have done is add two more levels of matching. After getting the levenshtein distance, we try to match the metaphone keys, and finally compare the keywords using PHP's similar_text function to see if there is a better match than the current one. So this is what we end up with:

  1. Use MySQL's SOUNDEX function to find the closest matches
  2. Use PHP's levenshtein function to find the closest levenshtein distance.
  3. Use PHP's metaphone function to match keys, if they match, then we perform the next step (#4).
  4. Use PHP's similar_text function to see if this result is better than the last one.
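
The four steps can be sketched as a standalone function. This is only an illustration under stated assumptions: suggest_word is a hypothetical name, PHP's soundex stands in for MySQL's SOUNDEX filter (the two are not guaranteed to agree on every word), and the candidates come from an array rather than the keywords table. Unlike the patch further down, the sketch also records the similar_text score when a strictly closer word is found, so that the result does not depend on candidate order.

```php
<?php
// Hypothetical helper, not part of Sphider.
function suggest_word($word, $candidates)
{
    $near_word = "";
    $max_distance = 100; // assumed starting value, larger than any real distance
    $max_similar = 0;

    foreach ($candidates as $candidate) {
        // Step 1 (stand-in for the MySQL SOUNDEX filter): skip words that
        // do not sound like the query.
        if (soundex($candidate) !== soundex($word)) {
            continue;
        }
        // Step 2: skip words that are too far away or worse than the best so far.
        $distance = levenshtein($candidate, $word);
        if ($distance >= 4 || $distance > $max_distance) {
            continue;
        }
        $similar = similar_text($candidate, $word);
        if ($distance < $max_distance) {
            // Strictly closer than the current best: take it.
            $max_distance = $distance;
            $max_similar = $similar;
            $near_word = $candidate;
        } elseif (metaphone($candidate) === metaphone($word)
            && $similar >= $max_similar) {
            // Steps 3 and 4: same distance, matching metaphone keys, and at
            // least as many characters in common as the current best.
            $max_similar = $similar;
            $near_word = $candidate;
        }
    }
    return $near_word;
}

echo suggest_word("wether", array("wither", "weather")), "\n"; // weather
```

Both candidates are one edit away from "wether" and share its metaphone key, so similar_text decides: "weather" has 6 characters in common with the query against 5 for "wither".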

The code

OK, how do we implement this? Easy: open the include/searchfuncs.php file and, around line 333, find the following;

$near_word ="";
while ($row=mysql_fetch_row($result)) {

    $distance = levenshtein($row[0], $word);
    if ($distance < $max_distance && $distance <4) {
        $max_distance = $distance;
        $near_word = $row[0];
    }
}

Now REPLACE that with the following code (indent as appropriate);

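// BEGIN BETTER SEARCH SUGGESTION FIX
$near_word ="";
$max_similar = 0;
while ($row=mysql_fetch_row($result)) {
    $distance = levenshtein($row[0], $word);
    // Only consider words at least as close as the current best and under 4 edits away
    if ($distance <= $max_distance && $distance < 4) {
        if ($distance == $max_distance) {
            // Same distance as the current best: accept the word only if the
            // metaphone keys match and similar_text scores it at least as high
            if (metaphone($row[0]) == metaphone($word)) {
                $similar = similar_text($row[0],$word);
                if ($similar >= $max_similar) {
                    $max_distance = $distance;
                    $max_similar = $similar;
                    $near_word = $row[0];
                }
            }
        } else {
            // Strictly closer than the current best: take it
            $max_distance = $distance;
            $near_word = $row[0];
        }
    }
}
// END BETTER SEARCH SUGGESTION FIX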

Finish

Now save and upload the file, and your "did you mean" search suggestions should be noticeably more accurate.

Comments


Thanks again Matt,

It's true, the "Did you mean" suggestions are a weak point in the original.

This is another nice improvement for Sphider.

Btw. I just read your follow up answers in the other article concerning the hotmail problems. As I didn't receive notifications I hadn't read them before. :-) I don't have spam problems on my hotmail account yet, and it's the account I use for forum signups and such so I can't really be bothered with Uncle Bill's policy. But it sounds really strange, microsoft seems to be working hard to remain funny in certain aspects...

Greetings!


Hi, thanks for the article. I have already applied the changes, though I haven't seen a difference yet =) If you have more like this, please write more articles. I will be glad to read them and improve the script.


Thanks for the info, was about to ask the same question at the sphider forum.


Thanks, Matt.
I was just learning sNews from your website.


When I enter a word in uppercase that gives no results, "did you mean" suggests the exact query I entered. For example, on this site I entered GAMINGO. This also happens on the demo at the Sphider forum and, of course, on my site. Thanks


That's odd, Iswandi. I will have a look sometime.


Iswandi,

I have found and fixed this bug. Sphider isn't converting the query string to lower case as expected.

Inside the file include/searchfuncs.php, find the following line;

$query=str_replace('"','',$query);


and replace it with this;

$query=strtolower(str_replace('"','',$query));


*Edit: As Willy noted, there are 2 instances to change.
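
A quick way to see why the case matters: PHP's string comparison and levenshtein function are both case-sensitive, so an uppercase query never equals a lowercase keyword until it has been normalised. The helper name normalize_query below is hypothetical; its body is the same expression as the replacement line above.

```php
<?php
// Hypothetical helper wrapping the one-line fix: strip double quotes, then
// lowercase the query so it can be compared with lowercase keywords.
function normalize_query($query)
{
    return strtolower(str_replace('"', '', $query));
}

// Without normalising, every letter counts as a difference.
echo levenshtein("GAMINGO", "gamingo"), "\n";   // 7

// After normalising, the query and the keyword are identical.
var_dump(normalize_query('"GAMINGO"') === "gamingo"); // bool(true)
```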


Thanks Matt,

Another small but important improvement to Sphider.

I found two instances of this piece of code, I changed both. Is that correct?

Greetings!


Hi Willy,

Yes, do that on both instances... to make it easier, we could simply put this at the top of the function below the list of global variables, and that would cover it throughout;

$query=strtolower($query);


but either way works :)


Hi,
Did you mean that this whole block should be replaced with the code you provided?
starting from
$near_word ="";

ends with

$near_words[$word] = $near_word;
}

}

second question:

Sphider does not do case-insensitive searches for non-English characters. How can we correct this?


Hi there,

Yes, as I noted, you need to REPLACE those 4 lines with the code I provided.

I'm not certain about case sensitivity with non-Latin characters. I will have to investigate that & get back to you.


What I meant is that the code does not actually consist of 4 lines in version 1.3.5.
In my case, the 4th line and the remaining part are:

if ($distance < $max_distance && $distance <4) {
    $max_distance = $distance;
    $near_word = $row[0];
    }
}
if ($near_word != "" && $word != $near_word) {
    $near_words[$word] = $near_word;
    }
}


When I replace this whole block with the code you provided, the "did you mean" function stops working.


Regarding my other question..

My html page charsets are windows-1254. I changed my database collation to latin5_turkish_ci.
I added this code to database.php:
mysql_query("SET NAMES 'latin5'");
mysql_query("SET CHARACTER SET latin5");
mysql_query("SET COLLATION_CONNECTION = 'latin5_turkish_ci'");
No problems displaying the characters. I have "AĞRI" in one of my page titles. When I search for "ağrı", it only matches "ağrı", not "AĞRI", and as a result highlights only "ağrı".

Thanks again for your help in advance.


Hi again,

Sorry about that, I just took a look at the code. It seems when I moved to a new content management system recently a part of the code block to be replaced was stripped out of the original article. Try replacing the code block now listed above.

To lowercase non-ASCII characters, you can try using PHP's multibyte function;
http://php.net/manual/en/function.mb-strto...

Something like this maybe?
$query = mb_strtolower(str_replace('"','',$query), 'iso-8859-9');


Thank you again. I appreciate your help. The "did you mean" function is now working, but one "}" character should be deleted.

As for the case problem: I read the URL you supplied, but I don't want the query to be converted to lower case or upper case. What I want is this:

When you search for "book", it matches every occurrence regardless of case, right? (It highlights book, Book and BOOK.) But with Latin 5 characters it doesn't work the same way. It becomes case-sensitive. Do you have any idea how to tell the script to search everything case-insensitively?


Ahhh, you're quite right, haha, this one got butchered. I'd better check my other articles. Thanks for the troubleshooting!

Regarding the case-sensitive searching, I believe that Sphider itself converts all the strings to lowercase using PHP's strtolower function before storing them in the database so that the md5 hashes match... i.e. searching on "GaMe" would match "game" or "GAME". It's possible that it's mucking up the latin 5 characters when it's doing that, so the md5 hashes aren't matching when searching.

For a quick fix, I would try replacing all the strtolower function calls in spiderfuncs.php, searchfuncs.php, etc with the mb_strtolower function then re-indexing & checking again, but that's just a wild guess at this point.
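
As a rough illustration of that idea, here is a minimal sketch of a Turkish-aware lowercase helper. It rests on several assumptions: the name lower_tr is made up, the text is UTF-8 (windows-1254 or latin5 data would have to be converted first), the PHP source file is saved as UTF-8, and the mbstring extension is installed. The two capital I forms are mapped by hand because the default case mapping turns "I" into "i" rather than "ı".

```php
<?php
// Hypothetical helper, not part of Sphider. Assumes UTF-8 input, a UTF-8
// source file, and the mbstring extension.
function lower_tr($text)
{
    // Handle the Turkish capital I forms first: "I" becomes dotless "ı",
    // and dotted "İ" becomes plain "i".
    $text = str_replace(array('I', 'İ'), array('ı', 'i'), $text);
    // Everything else follows the normal multibyte lowercase mapping.
    return mb_strtolower($text, 'UTF-8');
}

echo lower_tr("AĞRI"), "\n";      // ağrı
echo lower_tr("ÇANAKKALE"), "\n"; // çanakkale
```

Note that this mapping is only appropriate for Turkish text; applied to English words it would turn "INDEX" into "ındex".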


It sounded like a good idea, so I tested it. But this time it didn't highlight the word found, and it also didn't find words with capital letters :)
In the previous version, when I typed çanakkale it asked "did you mean Çanakkale", and clicking on Çanakkale brought up results for Çanakkale but not ÇANAKKALE. In this version, when I type Çanakkale it asks "did you mean çanakkale", which is weird because I only have Çanakkale and ÇANAKKALE on the page. It finds the page, but there is no highlighting. :))

I also removed those special characters from the remove-accents part of the commonfuncs.php file, but it didn't help.

It is really interesting and drives me crazy. I could not find any solution to this. It clearly recognizes the words but we need to tell ğ=Ğ, ş=Ş or Ç=ç so that it matches all and highlights them.

If you find any ideas or any solutions to this, I'll keep an eye on this page..

Thank you for everything.


Hi again,

I've been searching, and I guess the solution lies in "preg_match" in spiderfuncs.php and the "$entities = array" part of commonfuncs.php. But I could not get it to work.
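
For anyone exploring the regular-expression route, here is a minimal sketch of case-insensitive highlighting. It is not Sphider's highlighting code: highlight_ci is a hypothetical name, and the sketch assumes the text and the PHP source file are UTF-8, so windows-1254 pages would need converting first. The "u" modifier makes the pattern UTF-8 aware and "i" makes it ignore case, which covers pairs such as ç/Ç, ğ/Ğ and ş/Ş.

```php
<?php
// Hypothetical helper, not part of Sphider. Assumes UTF-8 text.
function highlight_ci($text, $term)
{
    // preg_quote escapes any regex metacharacters in the search term.
    $pattern = '/' . preg_quote($term, '/') . '/iu';
    // $0 is the matched text, so the original casing is preserved.
    return preg_replace($pattern, '<b>$0</b>', $text);
}

echo highlight_ci("Çanakkale ve ÇANAKKALE", "çanakkale"), "\n";
// <b>Çanakkale</b> ve <b>ÇANAKKALE</b>
```

The Turkish dotted and dotless i may still need separate handling, since they do not follow the usual upper/lower pairing.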

Copyleft 2002 - 2017 Matt Jones