Monday, April 02, 2007

live.com - malware?

When looking through my Webalizer stats recently I noticed that *.search.live.com is transferring about four times as much data from my domain as *.google.com. This wouldn't concern me if I saw some people being referred to my site from live.com. However, I see almost none, while google.com is responsible for referring about half the traffic to my site!

Then I looked through the aggregate stats for all web sites hosted on my ISP and noticed that live.com has three times the bandwidth use of google while not showing up in referrals.
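For anyone who wants to check the same thing on their own server, here is a minimal sketch of summing bytes sent per client domain straight from an access log. It assumes an Apache-style common/combined log format with client hostnames already resolved; the function name `bytes_by_domain`, the regular expression, and the sample log lines are all hypothetical and not taken from Webalizer or from my real logs.

```python
import re
from collections import Counter
from typing import Iterable

# Matches the start of a common/combined log line:
# host ident user [time] "request" status bytes ...
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]*\] "[^"]*" \d{3} (?P<bytes>\d+|-)'
)


def bytes_by_domain(lines: Iterable[str], domains: Iterable[str]) -> Counter:
    """Sum the bytes sent to clients whose hostname falls under each domain."""
    totals = Counter()
    domains = list(domains)
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        host = match.group("host").lower()
        sent = match.group("bytes")
        size = 0 if sent == "-" else int(sent)  # "-" means no body was sent
        for domain in domains:
            if host == domain or host.endswith("." + domain):
                totals[domain] += size
    return totals


# Hypothetical sample lines, for illustration only.
sample = [
    'crawler1.search.live.com - - [02/Apr/2007:10:00:00 +1000] "GET / HTTP/1.1" 200 4000 "-" "bot"',
    'crawler2.search.live.com - - [02/Apr/2007:10:00:05 +1000] "GET /bonnie++/ HTTP/1.1" 200 6000 "-" "bot"',
    'crawl-1.google.com - - [02/Apr/2007:10:01:00 +1000] "GET / HTTP/1.1" 304 - "-" "bot"',
    'crawl-1.google.com - - [02/Apr/2007:10:02:00 +1000] "GET /new.html HTTP/1.1" 200 2500 "-" "bot"',
]
totals = bytes_by_domain(sample, ["live.com", "google.com"])
print(totals)
```

Comparing these totals against the number of requests whose referrer field names the same domain gives the bandwidth-versus-referrals picture described above.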

I did a couple of test searches with live.com and it seems that one reason I'm not getting hits is that the search engine just isn't much good. The search string "bonnie++" does not return any links to my program on the first page (maybe live.com can't handle a '+' character).

So I'm now wondering whether there is any reason to permit the live.com servers to use my bandwidth. It's costing my ISP money for no apparent good cause.

In the past there was an earlier MS search engine that I had to block because its attacks (which can not be described in any other way) were using half the web bandwidth of the entire ISP. This case is not so obviously an attack and I'm wondering whether I should permit it to continue for a while just in case they end up giving me some useful referrals.

Of course the other possibility is that if we all block their servers then the live.com results will become even more useless than they currently are and they'll give up on the idea.

I look forward to comments on this issue.

6 comments:

Patrick said...

So, M$ re-indexes your site more often than Google? Or just as often, however they pull files that Google won't?

Might be fun to see if M$'s bot can be corralled. ;)

Roland said...

Apparently msnbot never gets a code 304 from my server. I infer that they don't use the "If-Modified-Since" header, and they get the full contents of the page every time instead of only when it's been changed.
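To illustrate the mechanism being described, here is a minimal sketch of the decision a server makes for a conditional GET. The function name `status_for_request` and the example dates are hypothetical; real web servers implement this logic internally and also consider other headers such as `If-None-Match`.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from typing import Optional


def status_for_request(if_modified_since: Optional[str],
                       last_modified: datetime) -> int:
    """Return 304 if the client's cached copy is still current, else 200."""
    if if_modified_since is None:
        # No conditional header: the server has to send the full page.
        return 200
    try:
        client_time = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        # Unparseable date: treat the request as unconditional.
        return 200
    if client_time.tzinfo is None:
        client_time = client_time.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop microseconds first.
    if last_modified.replace(microsecond=0) <= client_time:
        return 304
    return 200


# Hypothetical page that last changed on 1 March 2007.
page_changed = datetime(2007, 3, 1, 12, 0, 0, tzinfo=timezone.utc)
header = format_datetime(page_changed, usegmt=True)

print(status_for_request(None, page_changed))    # crawler sends no header
print(status_for_request(header, page_changed))  # crawler sends the date it last saw
```

A crawler that never sends the header always falls into the first branch, which would match the absence of 304 responses in the logs.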

Thumper said...

There was a big campaign to block live.com's spider when it was first launched, iirc.

(by the way, is your feed regenerated every time someone adds a comment to a post or something? It breaks Planet Debian and constantly bumps your posts to the top of the page.)

- Chris

Chris said...

That thing is stupid enough to pull the same file from my server again and again even though it must have got a 404 for months now. Or maybe I'm just ignorant of how search engines work...

etbe said...

Patrick, a casual inspection of my personal site http://www.coker.com.au/ will show that most content is not changed often (in fact much of it has not changed for years) and that almost none of it is time critical (a google search based on last month's data should give the same result as a search based on today's data). It seems reasonable to expect search engines to recognise this and reduce the frequency of their scanning. Of course my blog would need to be scanned regularly - that's what RSS is for!

Thumper, when old posts receive comments they do not move to the top of the list. I believe that this is evidence that commenting does not affect how the posts are displayed in the planet. I don't know why my posts supposedly stay at the top in Planet Debian.

Chris, it does make some sense to do a repeat search after a 404 (sometimes they are temporary due to misconfiguration - I've made such mistakes before). But there should be an exponential back-off for such things.
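As a rough sketch of what such an exponential back-off could look like on the crawler side: the function name `next_recheck_delay`, the one-day base interval, and the 64-day cap are assumptions made up for illustration, not anything a real search engine is known to use.

```python
from datetime import timedelta


def next_recheck_delay(consecutive_404s: int,
                       base: timedelta = timedelta(days=1),
                       cap: timedelta = timedelta(days=64)) -> timedelta:
    """Delay before re-requesting a URL that has returned 404 N times in a row."""
    if consecutive_404s < 1:
        return timedelta(0)  # no failures yet, no extra delay
    # Limit the exponent so a very long 404 streak cannot overflow timedelta.
    exponent = min(consecutive_404s - 1, 20)
    # Double the wait after every further 404, up to the cap.
    return min(base * (2 ** exponent), cap)


for failures in (1, 2, 3, 4, 7, 100):
    print(failures, next_recheck_delay(failures))
```

With these assumed values a temporarily misconfigured page is retried the next day, while a page that has been gone for months is requested only about once every two months.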

Soyuz said...

I already had the discussion with Russell but will post anyway for others to ponder upon it.

Is it ethical to do things that might kill/prohibit the growth of a technology/product?

Yes. First of all, Russell has full legal rights over his site and its access rights. Secondly, he has the complete right to opt out of an incurred *loss* (expended bandwidth vs no hits/referrals).

But then again, when we come to think about it, this is a situation where a technology/product is still growing (OK, let's forget for a moment that M$ is the owner of it). Who knows, a great search algorithm might come out of it.


What I am trying to point out is that we all help others, and that's what we do as humans. Especially for technology like this, if the code/system is not tested on real-world data it's not going to get better anyway.

So I am wondering: are we crossing an "unethical line" here?

(Please don't make an anti-M$ post chain out of this. My comment is not about the company but about ethics/humanity in technology development.)