Enabling more robust hyphenated word indexing in Sphider
The problem
Lately I've been hacking around in Sphider. I've been using it as the search engine behind my website for a couple of months now. It's quite encompassing; however, there are quite a few annoyances for me personally.
So slowly I've been fixing these items as time permits. Many of the hacks are rather specific to my site, but this one could be of help to anyone who wants better indexing of hyphenated words.
Essentially, Sphider currently indexes hyphenated words as a single word, so "light-weight" is simply indexed as "light-weight". I have an article containing this word within the title, so I would have hoped that searching for "light weight" would bring up that article. Unfortunately, it did not. So I rooted around and found where the words were prepared and inserted into the array.
The solution
This little modification takes a hyphenated word, such as "light-weight", and turns it into four separate words: "light", "weight", "light-weight", and "lightweight". These are added to a new array, which is then used to index them all against the URL. Now if you drop the hyphen and simply search for the component words, your search results should still be accurate. Not perfect, but it works well for me.
The code
Making this change is easy: simply back up your spiderfuncs.php file and work off a copy. Within the function unique_array (around line 320), find the following code:
if ($stem_words == 1) {
    $newarr = Array();
    foreach ($arr as $val) {
        $newarr[] = stem($val);
    }
    $arr = $newarr;
}
sort($arr);
reset($arr);
$newarr = array ();
Now add the new code just before the array is sorted and reset:
if ($stem_words == 1) {
    $newarr = Array();
    foreach ($arr as $val) {
        $newarr[] = stem($val);
    }
    $arr = $newarr;
}
// BEGIN HYPHENATED WORD INDEXING FIX
// This code splits and also joins hyphenated words for indexing, while keeping the whole hyphenated word
// example: light-weight becomes four separate words, "light, weight, light-weight, and lightweight"
$hyphenarray = array(); // create placeholder array
foreach ($arr as $val) {
    // strpos() returns 0 for a leading hyphen, so compare strictly against false
    if (strpos($val, "-") !== false) { // word contains a hyphen
        $bits = explode("-", $val); // let's break it up
        foreach ($bits as $bitsval) {
            if ($bitsval !== "") { // skip empty pieces from leading, trailing or doubled hyphens
                $hyphenarray[] = $bitsval; // add each individual word to array
            }
        }
        $joined = str_replace("-", "", $val); // remove hyphens to join word
        if ($joined !== "") { // skip a "word" that was nothing but hyphens
            $hyphenarray[] = $joined; // add the joined word to array
        }
        $hyphenarray[] = $val; // add the hyphenated word to array
    } else {
        $hyphenarray[] = $val; // word does not contain hyphen, add to array & carry on
    }
}
$arr = $hyphenarray;
// END HYPHENATED WORD INDEXING FIX
sort($arr);
reset($arr);
$newarr = array ();
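If you would like to try the splitting logic on its own before touching spiderfuncs.php, here is a minimal standalone sketch. The function name expand_hyphenated_words is made up for this illustration and is not part of Sphider; it only mirrors the idea of the fix (component words, joined word, original word) and assumes a plain array of words as input.

```php
<?php
// Hypothetical helper for illustration only; it is NOT a Sphider function.
// It expands each hyphenated word into its parts, its joined form,
// and the original hyphenated form, then sorts the result.
function expand_hyphenated_words($words)
{
    $expanded = array();
    foreach ($words as $word) {
        // Strict comparison: strpos() returns 0 when the hyphen is the first character
        if (strpos($word, "-") !== false) {
            foreach (explode("-", $word) as $part) {
                if ($part !== "") { // ignore empty pieces such as in "-foo" or "foo--bar"
                    $expanded[] = $part;
                }
            }
            $joined = str_replace("-", "", $word);
            if ($joined !== "") {
                $expanded[] = $joined;
            }
        }
        $expanded[] = $word; // always keep the word as it was found
    }
    sort($expanded);
    return $expanded;
}

print_r(expand_hyphenated_words(array("light-weight", "search")));
```

Run from the command line, this should list light, light-weight, lightweight, search and weight, which are the entries you would expect to end up indexed against the URL.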
Finish
And that's it. Sphider should now index hyphenated words as their component words too. Pages indexed before the change will need to be re-indexed to pick it up.
Comments
Willy
It's very nice to see that someone that understands coding is making some very good modifications for Sphider available.
Greetings!
Matt
I have a couple more useful modifications for Sphider that I will try to publish in the near future, as time permits.
Willy
Unfortunately I'm not a coder so I can't offer very much to the Sphider project. I just do the little that I can do.
I'm looking forward to your further contributions.
Btw, I ticked the option to be notified of follow-up comments but did not receive the email. And no, it is not in my spambox. Thought you might like to know this, as I see that feature is still in beta. ;)
Matt
Hmmm, hotmail accepted the message and queued it for delivery on Apr 2 09:44:28 EST. Once they've accepted it, it's up to their system to deliver.
I've heard hotmail is very stringent, though I'm not blacklisted and I have other hotmail users that have received their emails. I should probably set up domain sender keys, I think that may help with hotmail.
In the meantime, I've toggled your account to verified, so hopefully you'll receive the follow-up on this.
Matt
I've added SPF (http://www.openspf.org/) and SenderID (http://www.microsoft.com/mscorp/safety/technologies/senderid/default.mspx) records to my DNS. Let's see if that helps with hotmail/msn/live.
Matt
They recommended I enroll my mail server IP in some proprietary program of theirs, so that their users can get their own email... sorry, but that's simply not happening.
I just wanted you to be aware that Microsoft is taking it upon themselves to drop legitimate email from clean IPs. They are not flagging it; they are simply accepting it and then dropping it. Meanwhile, I signed up for a hotmail account for testing, and there are already 12 Viagra/porn spam emails in the account. You be the judge.
When I get time, I will add some checks to my subscription system to notify hotmail/msn/live users that they cannot use the service due to Microsoft's policies.
Tina
My comment is a bit "off topic", forgive me. Recently, I have been trying to improve the normal search function in snews (adding fulltext indexes, weighting and highlighting). Sphider is nice, but too much for me at present. I noticed that you show the page size in your results, but it does not seem to match the real page size when you go to that page. For example, http://www.mdj.us/web-development/ajax-javascript/live-comment-previewing-using-the-jquery-library/ is listed as 38.1kb, but if I call that page, Firefox says the page size is 10.17 kB. Maybe it's because of gzip? Just thought I'd mention it.
Matt
If you download the page and then check it out locally, it should be roughly the equivalent of what sphider is reporting.
dimkaUA
Thank you very much, this is a good fix. I have also fixed and added a lot in this script, for example image search.
Demo: http://demo.wmshop.tj/
Official site: http://magicsearch.pp.ua/
------------------
(The Russian original of this comment followed here, with its Cyrillic characters mangled by the comment form.)
Matt
I have another little hack I need to release at some point. I've improved the "did you mean" function by running some additional checks & comparisons.
Not sure, but the form may have mangled the Cyrillic characters on the second part of your comment.
Dmitrii Lavrinyuk
Sorry if anything I wrote is unclear. I do not understand English and translated this with Google Translate. A request: please send it by email to dimasrap5@gmail.com
Matt
Yes, I will see if I can post a tutorial for it this weekend. Check back in a day or two.
Dmitrii Lavrinyuk
When you have written the tutorial, please post about it in the comments here. I have just subscribed to this post.
Matt
Try this hack;
http://www.mdj.us/web-development/php-programming/creating-better-search-suggestions-with-sphider/
Willy
I just found four notifications in my spambox at hotmail. Three are dated the fourth of June and the latest is from today. So something happened there...
Greetings.
tintinboss
Thanks again