Enabling more robust hyphenated word indexing in Sphider


The problem

Lately I've been hacking around in Sphider, which I've been using as the search engine behind my website for a couple of months now. It's quite comprehensive, but there are a few things about it that annoy me personally.

So slowly I've been fixing these items as time permits. Many of the hacks are rather specific to my site, but this one could be of help to anyone who wants better indexing of hyphenated words.

Essentially, Sphider currently indexes hyphenated words as a single word, so "light-weight" is simply indexed as "light-weight". I have an article containing this word within the title, so I would have hoped that searching for "light weight" would bring up that article. Unfortunately, it did not. So I rooted around and found where the words were prepared and inserted into the array.

The solution

This little modification takes a hyphenated word, such as "light-weight", and expands it into four separate words: "light", "weight", "light-weight" and "lightweight". All four are added to an array, and that array is then indexed against the URL, so a search that drops the hyphen or uses only the component words should still find the page. Not perfect, but it works well for me.
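To make the expansion concrete, here is a minimal standalone sketch of the idea. The function name expand_hyphenated is hypothetical and not part of Sphider; it only illustrates which forms of a word end up in the array.

```php
<?php
// Hypothetical helper (not a Sphider function): returns every form of a
// word that should be indexed.
function expand_hyphenated($word) {
    if (strpos($word, "-") === false) {
        return array($word);  // no hyphen, nothing to expand
    }
    // Component words; array_filter with 'strlen' drops empty pieces
    // left behind by leading, trailing or doubled hyphens.
    $forms = array_values(array_filter(explode("-", $word), 'strlen'));
    $forms[] = str_replace("-", "", $word);  // joined form
    $forms[] = $word;                        // original hyphenated form
    return $forms;
}

print_r(expand_hyphenated("light-weight"));
// Contains: light, weight, lightweight, light-weight
```

Keeping the original hyphenated form in the list means existing searches for "light-weight" keep working as before.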

The code

Making this change is easy. Back up your spiderfuncs.php file and work on a copy. Within the function unique_array (around line 320), find the following code:

if ($stem_words == 1) {
    $newarr = Array();
    foreach ($arr as $val) {
        $newarr[] = stem($val);
    }
    $arr = $newarr;
}
sort($arr);
reset($arr);
$newarr = array ();

Now add the new code just before the array is sorted and reset:

if ($stem_words == 1) {
    $newarr = Array();
    foreach ($arr as $val) {
        $newarr[] = stem($val);
    }
    $arr = $newarr;
}
// BEGIN HYPHENATED WORD INDEXING FIX
// This code splits and also joins hyphenated words for indexing, while keeping the whole hyphenated word.
// Example: light-weight is indexed as four separate words: "light", "weight", "lightweight" and "light-weight"
$hyphenarray = array();  // create placeholder array
foreach ($arr as $val) {
    if (strpos($val, "-") !== false) {  // word contains a hyphen (strict check, so position 0 counts too)
        $bits = explode("-", $val);  // let's break it up
        foreach ($bits as $bitsval) {
            if ($bitsval !== "") {  // skip empty pieces left by leading, trailing or doubled hyphens
                $hyphenarray[] = $bitsval;  // add each individual word to array
            }
        }
        $joined = str_replace("-", "", $val);  // remove hyphens to join word
        if ($joined !== "") {
            $hyphenarray[] = $joined;  // add the joined word to array
        }
        $hyphenarray[] = $val;  // add the hyphenated word to array
    } else {
        $hyphenarray[] = $val;  // word does not contain hyphen, add to array & carry on
    }
}
$arr = $hyphenarray;
// END HYPHENATED WORD INDEXING FIX
sort($arr);
reset($arr);
$newarr = array ();

Finish

And that's it. Sphider should now index hyphenated words as their component words too. Pages that were indexed before the change will need to be re-indexed to pick it up.
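If you want to sanity-check the idea outside Sphider first, a rough sketch like the one below can help. Everything in it is an assumption for illustration: index_words is a hypothetical stand-in for the patched loop, array_unique stands in for whatever unique_array does after sorting, and the query handling (split on spaces, look each term up as-is) is a simplification rather than Sphider's actual search code.

```php
<?php
// Hypothetical stand-in for the patched loop; $words plays the role of $arr.
function index_words(array $words) {
    $out = array();
    foreach ($words as $val) {
        if (strpos($val, "-") !== false) {
            foreach (explode("-", $val) as $bit) {
                if ($bit !== "") {
                    $out[] = $bit;                 // component word
                }
            }
            $out[] = str_replace("-", "", $val);   // joined word
        }
        $out[] = $val;                             // word as it appeared
    }
    sort($out);
    return array_values(array_unique($out));       // assumption: duplicates collapse
}

// Hypothetical page title, already split into lowercase words.
$index = index_words(array("a", "light-weight", "framework"));

// Assumption: a query is split on spaces and every term must be in the index.
foreach (array("light weight", "lightweight", "light-weight") as $query) {
    $terms = explode(" ", $query);
    $hits  = array_intersect($terms, $index);
    echo $query, ": ", (count($hits) === count($terms) ? "match" : "no match"), "\n";
}
```

With the expansion in place, all three spellings of the query find the sample title; without it, only "light-weight" would.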


Comments


Thanks again,

It's very nice to see that someone that understands coding is making some very good modifications for Sphider available.

Greetings!


Thanks Willy, I always try to give something back to the open source projects that I utilize.

I have a couple more useful modifications for Sphider that I will try to publish in the near future, as time permits.


That's what I try to do too,

Unfortunately I'm not a coder so I can't offer very much to the Sphider project. I just do the little that I can do.

I'm looking forward to your further contributions.

By the way, I ticked the option to be notified of follow-up comments but did not receive the email. And no, it is not in my spam box. Thought you might like to know this, as I see that feature is still in beta. ;)


Willy, I hear you, answering questions on forums and such is just as helpful as writing code if you ask me.

Hmmm, Hotmail accepted the message and queued it for delivery on Apr 2 09:44:28 EST. Once they've accepted it, it's up to their system to deliver it.

I've heard hotmail is very stringent, though I'm not blacklisted and I have other hotmail users that have received their emails. I should probably set up domain sender keys, I think that may help with hotmail.

In the meantime, I've toggled your account to verified, so hopefully you'll receive the follow-up on this.


Willy, Microsoft responded to my inquiries. They basically said my IP was filtered b/c of their internal spam filtering settings and to follow *abc* directions, which I have already done.

They recommended I enroll my mail server IP in some proprietary program of theirs, so that their users can get their own email... sorry, but that's simply not happening.

I just wanted you to be aware that Microsoft is taking it upon themselves to drop legitimate email from clean IPs. Not flagging it, they are simply accepting it and then dropping it. Meanwhile, I signed up for a hotmail account for testing, there's already 12 Viagra/porn spam emails to the account. You be the judge.

When I get time, I will add some checks to my subscription system to notify hotmail/msn/live users that they can not use the service due to Microsoft's policies.


Hi Matt,
My comment is a bit "off topic", forgive me. Recently, I have been trying to improve the normal search function in snews (adding fulltext indexes, weighting and highlighting). Sphider is nice, but too much for me at present. I noticed that you show the page size in your results, but it does not seem to match the real page size when you go to that page, e.g. http://www.mdj.us/web-development/ajax-jav... - 38.1kb, but if I call that page, Firefox says the page size is 10.17 kB. Maybe it's because of gzip? Just thought I'd mention it.


Hi Tina, you got it, everything is gzip compressed, so it ends up far smaller on most modern browsers.

If you download the page and then check it out locally, it should be roughly the equivalent of what sphider is reporting.


FYI, hotmail appears to be getting through now, though it seems to be going into the "junk" folder, so if you receive it, make sure to mark it "not spam".


Thank you very much. Good fix.
I have also fixed a lot in this script and added a lot, for example image search.
Demo: http://demo.wmshop.tj/
Official site: http://magicsearch.pp.ua/
(The same message followed in Russian, with garbled Cyrillic characters.)


Cool dimkaUA, nice looking sites you've got there.

I have another little hack I need to release at some point. I've improved the "did you mean" function by running some additional checks & comparisons.

Not sure, but the form may have mangled the Cyrillic characters on the second part of your comment.


Hello. I am the developer of the improvements in the Magic Search script. You said you improved the "did you mean" function. Could you share this improvement? Many thanks in advance.
Sorry if something I wrote is not clear; I do not understand English and translated this with Google Translate. Please send it by email: dimasrap5@gmail.com


Hi Dmitrii,

Yes, I will see if I can post a tutorial for it this weekend. Check back in a day or two.


Ok, I will wait.
When you have made the tutorial, please write about it in the comments. I have just subscribed to this post.


Hi Matt,

I just found four notifications in my spam box at Hotmail. Three are dated the fourth of June and the latest is from today. So something happened there...

Greetings.


w0w! Matt, thanks a lot for this awesome fix. Worked like a charm on my site. I mimicked this code for words containing a dot (.) and that worked too.

Thanks again


Copyleft 2002 - 2017 Matt Jones