Creating better search suggestions with Sphider
The problem
If you use Sphider, the PHP web spider and search engine, you may have noticed that its "did you mean" function often isn't very accurate. Why is that? Here is what it does with the search terms:
- Use MySQL's SOUNDEX function to find close matches.
- Use PHP's levenshtein function and keep the first match with the smallest Levenshtein distance.
This works, but it often gives poor results, because it keeps the first of the closest results even when a better match at the same distance appears later in the result array.
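To make the problem concrete, here is a minimal sketch of that selection loop run over a plain array instead of a MySQL result. The helper name first_closest(), the sample words, and the starting value of 100 for $max_distance are assumptions made for illustration; they are not taken from Sphider.

```php
<?php
// Sketch of the original selection logic, with an array standing in for
// the mysql_fetch_row() loop. first_closest() is a hypothetical helper.
function first_closest(array $candidates, $word)
{
    $max_distance = 100; // assumed starting value, larger than any real distance
    $near_word = "";
    foreach ($candidates as $candidate) {
        $distance = levenshtein($candidate, $word);
        // Strict "<" means a later candidate at the same distance never wins.
        if ($distance < $max_distance && $distance < 4) {
            $max_distance = $distance;
            $near_word = $candidate;
        }
    }
    return $near_word;
}

// "nice" and "note" are both one edit away from "nite",
// so whichever row comes back first is the one that is kept.
echo first_closest(array("nice", "note"), "nite"), "\n"; // nice
echo first_closest(array("note", "nice"), "nite"), "\n"; // note
```

The suggestion therefore depends on row order rather than on which word is actually the better match.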
The solution
So I have added two further levels of matching. After getting the Levenshtein distance, we try to match the metaphone keys, and finally compare the keywords with PHP's similar_text function to see whether there is a better match than the current one. This is what we end up with:
- Use MySQL's SOUNDEX function to find the closest matches.
- Use PHP's levenshtein function to find the closest Levenshtein distance.
- Use PHP's metaphone function to compare keys; if they match, perform the next step.
- Use PHP's similar_text function to see if this result is better than the last one.
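The steps above can be sketched on a plain array as follows. This is one possible reading of the idea, not Sphider's code: the helper name best_suggestion(), the sample words, and the starting value of 100 are assumptions, and the SOUNDEX pre-filter is assumed to have happened already in the SQL query.

```php
<?php
// Sketch: Levenshtein distance first, then metaphone plus similar_text as a
// tie-breaker. best_suggestion() is a hypothetical helper, not part of Sphider.
function best_suggestion(array $candidates, $word)
{
    $max_distance = 100; // assumed starting value
    $max_similar = 0;
    $near_word = "";
    $word_key = metaphone($word);
    foreach ($candidates as $candidate) {
        $distance = levenshtein($candidate, $word);
        if ($distance < $max_distance && $distance < 4) {
            // Strictly closer than anything seen so far: take it.
            $max_distance = $distance;
            $max_similar = similar_text($candidate, $word);
            $near_word = $candidate;
        } elseif ($distance == $max_distance && metaphone($candidate) == $word_key) {
            // Same distance and same phonetic key: keep it only if it
            // shares at least as many characters with the search word.
            $similar = similar_text($candidate, $word);
            if ($similar >= $max_similar) {
                $max_similar = $similar;
                $near_word = $candidate;
            }
        }
    }
    return $near_word;
}

// "note" sounds like "nite" (same metaphone key), "nice" does not,
// so "note" is chosen regardless of row order.
echo best_suggestion(array("nice", "note"), "nite"), "\n"; // note
echo best_suggestion(array("note", "nice"), "nite"), "\n"; // note
```

Row order no longer decides the outcome when two candidates are equally close.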
The code
How do you implement it? Open the include/searchfuncs.php file and, around line 333, find the following:
$near_word ="";
while ($row=mysql_fetch_row($result)) {
    $distance = levenshtein($row[0], $word);
    if ($distance < $max_distance && $distance <4) {
        $max_distance = $distance;
        $near_word = $row[0];
    }
}
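Now REPLACE that with the following code (indent as appropriate):
// BEGIN BETTER SEARCH SUGGESTION FIX
$near_word = "";
$max_similar = 0;
while ($row = mysql_fetch_row($result)) {
    $distance = levenshtein($row[0], $word);
    // Assumed intent: compare against the best distance so far. (The
    // published condition was "$distance = $distance", an assignment
    // that is true for every non-zero distance.)
    if ($distance == $max_distance && $distance < 4) {
        // Same distance as the current best: let metaphone and
        // similar_text decide whether this row is the better match.
        if (metaphone($row[0]) == metaphone($word)) {
            $similar = similar_text($row[0], $word);
            if ($similar >= $max_similar) {
                $max_similar = $similar;
                $near_word = $row[0];
            }
        }
    } elseif ($distance < $max_distance && $distance < 4) {
        // Strictly closer than anything seen so far: take it.
        $max_distance = $distance;
        $max_similar = similar_text($row[0], $word);
        $near_word = $row[0];
    }
}
// END BETTER SEARCH SUGGESTION FIX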
Finish
Now save and upload the file. Your "did you mean" search suggestions should be noticeably more accurate.
Comments
Willy
It's true, the "Did you mean" suggestions are a weak point in the original.
This is another nice improvement for Sphider.
Btw. I just read your follow-up answers in the other article concerning the Hotmail problems. As I didn't receive notifications, I hadn't read them before. :-) I don't have spam problems on my Hotmail account yet, and it's the account I use for forum signups and such, so I can't really be bothered with Uncle Bill's policy. But it sounds really strange; Microsoft seems to be working hard to remain funny in certain aspects...
Greetings!
pchenk
I was just learning snews from your web
Matt
I have found and fixed this bug. Sphider isn't converting the query string to lower case as expected.
Inside the file include/searchfuncs.php, find the following line:
$query=str_replace('"','',$query);
and replace it with this:
$query=strtolower(str_replace('"','',$query));
*Edit: As Willy noted, there are 2 instances to change.
Willy
Another small but important improvement to Sphider.
I found two instances of this piece of code, I changed both. Is that correct?
Greetings!
Matt
Yes, do that on both instances... to make it easier, we could simply put this at the top of the function below the list of global variables, and that would cover it throughout;
but either way works :)
Test Sphider
Did you mean replacing this whole block with the code you provided?
starting from
$near_word ="";
ends with
$near_words[$word] = $near_word;
}
}
Second question:
Sphider does not do case-insensitive searches for non-English characters. How can we correct this?
Matt
Yes, as I noted, you need to REPLACE those 4 lines with the code I provided.
I'm not certain about case sensitivity with non-Latin languages; I will have to investigate that and get back to you.
Test Sphider
The 4th line and the remaining part in my case are:
    if ($distance < $max_distance && $distance <4) {
        $max_distance = $distance;
        $near_word = $row[0];
    }
}
if ($near_word != "" && $word != $near_word) {
    $near_words[$word] = $near_word;
}
}
When I replace this whole block with the code you provided, the "did you mean" function stops working.
Test Sphider
My html page charsets are windows-1254. I changed my database collation to latin5_turkish_ci.
I added this code to database.php:
mysql_query("SET NAMES 'latin5'");
mysql_query("SET CHARACTER SET latin5");
mysql_query("SET COLLATION_CONNECTION = 'latin5_turkish_ci'");
No problems displaying the characters. I have "AĞRI" in one of my page titles. When I search for "ağrı", it only matches "ağrı", not "AĞRI", and as a result highlights only "ağrı".
Thanks again for your help in advance.
Matt
Sorry about that, I just took a look at the code. It seems when I moved to a new content management system recently a part of the code block to be replaced was stripped out of the original article. Try replacing the code block now listed above.
For finding the lowercase, you can try using PHP's multibyte function;
http://php.net/manual/en/function.mb-strtolower.php
Something like this maybe?
$query = mb_strtolower(str_replace('"','',$query), 'iso-8859-9');
Test Sphider
For the case problem: I read the URL you supplied, but I don't want the query to be converted to lower case or upper case. What I want is this:
When you search for "book", it matches every occurrence regardless of case, right? (It highlights book, Book and BOOK.) But with Latin 5 characters it doesn't work the same way; it becomes case-sensitive. Do you have any idea how to tell the script to search everything case-insensitively?
Matt
Regarding the case-sensitive searching, I believe that Sphider itself converts all the strings to lowercase using PHP's strtolower function before storing them in the database so that the md5 hashes match... i.e. searching on "GaMe" would match "game" or "GAME". It's possible that it's mucking up the latin 5 characters when it's doing that, so the md5 hashes aren't matching when searching.
For a quick fix, I would try replacing all the strtolower function calls in spiderfuncs.php, searchfuncs.php, etc with the mb_strtolower function then re-indexing & checking again, but that's just a wild guess at this point.
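As a rough illustration of that suggestion, the sketch below lowercases a page title with mb_strtolower so that it compares equal to a lowercase query. It assumes the mbstring extension is loaded and that the strings are UTF-8 (the site discussed here uses windows-1254 / latin5, so the encoding argument would differ); the variable names are made up for the example.

```php
<?php
// Sketch only: assumes mbstring is available and the text is UTF-8.
$title = "ÇANAKKALE";
$query = "çanakkale";

// mb_strtolower() knows how to lowercase multibyte letters such as Ç, Ğ and Ş.
$normalized = mb_strtolower($title, "UTF-8");

// After normalizing, the strings (and therefore their md5 hashes) match.
var_dump($normalized === $query);           // bool(true)
var_dump(md5($normalized) === md5($query)); // bool(true)
```

Note that the Turkish dotted and dotless I pairs (I/ı and İ/i) are a well-known special case in case conversion, so words containing them may still need separate handling.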
Test Sphider
In previous version when I typed çanakkale it asked "did you mean Çanakkale" and clicking on Çanakkale brought Çanakkale results not ÇANAKKALE. In this version, when I type Çanakkale it asked "did you mean çanakkale" which is weird because I only have Çanakkale and ÇANAKKALE in page. It finds the page but no highlighting. :))
I also removed those special characters from the remove accents part of the commonfuncs.php file, but it didn't help.
It is really interesting and drives me crazy. I could not find any solution to this. It clearly recognizes the words but we need to tell ğ=Ğ, ş=Ş or Ç=ç so that it matches all and highlights them.
If you find any ideas or any solutions to this, I'll keep an eye on this page..
Thank you for everything.
Test Sphider
I've been searching, and I guess the solution lies in "preg_match" in spiderfuncs.php and the "$entities = array" part of commonfuncs.php, but I could not get it working.
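One direction that might help with the highlighting side, offered only as a sketch: PHP's preg functions can match case-insensitively across non-ASCII letters when the pattern uses both the "i" and "u" modifiers. This assumes the text is UTF-8, so windows-1254 pages would need converting first. highlight_term() is a hypothetical helper, not Sphider's own highlighting code, which is not shown on this page.

```php
<?php
// Sketch only: assumes UTF-8 input. highlight_term() is a hypothetical helper.
function highlight_term($text, $term)
{
    // "i" makes the match case-insensitive; "u" treats pattern and subject
    // as UTF-8, so case folding also covers letters such as ç/Ç.
    $pattern = '/' . preg_quote($term, '/') . '/iu';
    return preg_replace($pattern, '<b>$0</b>', $text);
}

echo highlight_term("Çanakkale and ÇANAKKALE", "çanakkale"), "\n";
// <b>Çanakkale</b> and <b>ÇANAKKALE</b>
```

Because the original text is wrapped rather than replaced, each occurrence keeps its own capitalization in the output.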