Scheduler request failed: Failure when receiving data from the peer



Message boards : Number crunching : Scheduler request failed: Failure when receiving data from the peer

Author Message
S@NL - Nick Succorso
Joined: 24 May 00
Posts: 15
Credit: 23,074,763
RAC: 0
Netherlands
Message 1284023 - Posted: 15 Sep 2012, 11:18:23 UTC

I do realise that this subject has been discussed here more often. Still, I need to bring it up again, since the possible solutions that were given here do not help me at all. My main 'rig', an i5-3550 running under W7 and equipped with a GTX 560 Ti video card, is in the well-known, but oh so hated, state of "Scheduler request failed: Failure when receiving data from the peer".
This machine runs under BOINC 7.0.28 and has built up a load of over 500 results, which are all nicely uploaded, but all scheduler requests fail.

I am at my wits' end, since my other 'rigs' (a smaller E7300 running under Vista and equipped with a 9500GT video card, and my laptop, an i7-2670QM running under W7 and equipped with a GT555M video card) are reporting fine. The Vista machine runs under BOINC 6.12.26 and the laptop under BOINC 6.12.34; all three machines run the Lunatics optimized applications.

My question is simply this: how can I 'force' my i5-3550 to report properly, or perhaps I should say, how can I seduce Berkeley into accepting my reports? Manual forcing does not work. I have put a cc_config.xml with a 250 parameter in the BOINC directory, which is nicely accepted, and since then BOINC Manager tries to report in chunks of 250 units, but to no avail. I have also tried smaller numbers, and that doesn't help either.
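(For reference, the "250 parameter" described here is presumably the <max_tasks_reported> option from the BOINC client configuration; a minimal sketch of such a cc_config.xml, assuming that option name, would look like this:)

```xml
<cc_config>
  <options>
    <!-- Cap the number of completed tasks reported per scheduler request.
         250 matches the value described above; lower it to report in
         smaller chunks. -->
    <max_tasks_reported>250</max_tasks_reported>
  </options>
</cc_config>
```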

Last week I had a similar problem, until my buffers ran dry over the weekend. I don't know what happened, but while I was away something changed, and upon my return all of a sudden all units were reported and my buffers were filled again. Until the end of last Thursday morning things went fine: automatic scheduler requests were accepted and new units were downloaded. After that, scheduler requests failed again.

Any help in this is greatly appreciated.

____________

Claggy
Volunteer tester
Joined: 5 Jul 99
Posts: 4087
Credit: 32,994,949
RAC: 5,750
United Kingdom
Message 1284028 - Posted: 15 Sep 2012, 11:27:15 UTC - in response to Message 1284023.
Last modified: 15 Sep 2012, 11:58:52 UTC

Try setting NNT (if you're not asking for work, the server won't try to send any; it's one less overhead the server's scheduler and the upload and download links have to deal with), and reporting 100 tasks at a time. If that won't work, try 20 tasks at a time. If it works, hit Update immediately again; there's no need to wait 5 minutes if you're just reporting.

Once you've got everything reported, you might find you can't get through when asking for work, and that getting 20 tasks resent at once is too much too. Then try setting a very low cache setting, and try to get a few tasks resent at a time.
(I had to do this yesterday on my XP3200+/HD4650/8400GS/GT430 host; it couldn't get 20 tasks resent at once, but could get 2 or 3 resent O.K.)

Claggy

S@NL - Nick Succorso
Joined: 24 May 00
Posts: 15
Credit: 23,074,763
RAC: 0
Netherlands
Message 1284070 - Posted: 15 Sep 2012, 13:53:03 UTC - in response to Message 1284028.

Hi Claggy, thanks for replying. I have done what you proposed, but so far the results are minimal. I have succeeded in reporting 25 tasks twice, but most of the time it ends in tears: either "timeout reached" or "Failure when receiving data from the peer". It seems I just have to sit here and keep hitting the retry button ;)

Any other suggestions, except from just waiting until things resolve magically?

Do you (or anyone else) have any idea where this tedious mechanism of additional reporting comes from? I mean, once uploaded, the result is there, isn't it? Why this extra step, which seems to cause most of the traffic and is bl..dy annoying if you ask me.
____________

Claggy
Volunteer tester
Joined: 5 Jul 99
Posts: 4087
Credit: 32,994,949
RAC: 5,750
United Kingdom
Message 1284075 - Posted: 15 Sep 2012, 14:04:20 UTC - in response to Message 1284070.
Last modified: 15 Sep 2012, 14:13:06 UTC

S@NL - Nick Succorso wrote:
> Hi Claggy, thanks for replying. I have done what you proposed, but so far, the results are minimal. I have succeeded to report twice 25 tasks, but most of the time it ends in tears. Either "timeout reached" or "Failure when receiving data from the peer". It seems I just have to sit here and keep hitting the retry button ;)
>
> Any other suggestions, except from just waiting until things resolve magically?
>
> Do you (or anyone else) have any idea where this tedious mechanism of additionally reporting comes from? I mean, once uploaded the result is there, isn't it? Why this extra step, which seems to cause most of the traffic and is bl..dy annoying if you ask me.
Try a figure of 10 instead.

It'll probably resolve itself when all the AP WUs are split and sent out; there's something in the system causing a bottleneck and extra-long scheduler contacts.
(I think the server is limiting the number of AP tasks sent out per scheduler contact to two or three at a time, and hosts, rather than getting a 'no tasks available for the application you requested' message and putting a minimal load on the scheduler,
are now getting one or two WUs at a time and putting a much bigger load on the scheduler.)

Uploading is a simple file transfer; the server then needs to be told the file has been uploaded and is ready for validation.

Claggy

S@NL - Nick Succorso
Joined: 24 May 00
Posts: 15
Credit: 23,074,763
RAC: 0
Netherlands
Message 1284081 - Posted: 15 Sep 2012, 14:12:54 UTC - in response to Message 1284075.

I'll try lower numbers and let it run.

The point I am trying to make is that once a result is uploaded, an internal message within the SETI@Home system in Berkeley could handle the reporting. Why do those tens (hundreds?) of thousands of clients have to make a connection just to tell the system "hey, I just uploaded a result to you (just in case you forgot)"?

Funny also that this scheduling process, which apparently takes up some resources, is on the same server as the AP splitter.

Thanks again for your help, Claggy, and have a nice remainder of the weekend!

Nick
____________

BilBg
Volunteer tester
Joined: 27 May 07
Posts: 2680
Credit: 6,064,447
RAC: 4,288
Bulgaria
Message 1284084 - Posted: 15 Sep 2012, 14:20:57 UTC - in response to Message 1284070.

S@NL - Nick Succorso wrote:
> Any other suggestions, except from just waiting until things resolve magically?

Try a proxy server (try to read most of the thread before acting; there are corrections/additions in the later posts):
http://setiathome.berkeley.edu/forum_thread.php?id=64691


S@NL - Nick Succorso wrote:
> I mean, once uploaded the result is there, isn't it?

Yes, but this is only a file copy.

Let's say you go to a friend's computer, he's not there, and you decide to copy some nature photos to:
C:\Photos
(and you add more over the next days)

Do you think he will find them?
Do you think he will know who put them there?

Not before you call him (after a week) and say "Hey, look in C:\Photos , I put some 1000 files for you"

Which is better than:
you copy 1 file, make a phone call "see the file #1"
you copy 1 file, make a phone call "see the file #2"
you copy 1 file, make a phone call "see the file #3"
.........


The reporting tells the server all the filenames of your result files (they go in different folders on the server) at once, which is like "one phone call tells it all".
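The phone-call analogy above can be sketched in a few lines of code, counting "calls" (scheduler contacts) for per-file reporting versus one batched report. This is purely illustrative Python, not BOINC's actual implementation:

```python
def report_individually(result_names):
    """One scheduler contact per uploaded result file."""
    return [[name] for name in result_names]

def report_batched(result_names):
    """A single scheduler contact listing every uploaded file at once."""
    return [list(result_names)]

names = [f"result_{i}" for i in range(1000)]
print(len(report_individually(names)))  # 1000 contacts for 1000 files
print(len(report_batched(names)))       # 1 contact for the same 1000 files
```

Either way the same filenames reach the server; the batched version just collapses a thousand round trips into one.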


____________



- ALF - "Find out what you don't do well ..... then don't do it!" :)

Claggy
Volunteer tester
Joined: 5 Jul 99
Posts: 4087
Credit: 32,994,949
RAC: 5,750
United Kingdom
Message 1284087 - Posted: 15 Sep 2012, 14:33:29 UTC - in response to Message 1284081.
Last modified: 15 Sep 2012, 14:52:44 UTC

At present I make it 12,552,531 MB or AP WUs either out in the field or awaiting validation; that's a lot of result files to keep track of.

A lot of the problem is down to how popular Seti is and the small size (in computation length) of its WUs: it's the biggest project out there, but is stuck with a single 100 Mbit/s scheduler/upload/download pipe.
Other projects like Einstein have download servers in different parts of the world and their WUs take a lot longer, and even then Einstein only has 481,984 tasks in progress and only 397,951 WUs awaiting validation,
while Collatz has only 95,136 tasks in progress, LHC only 58,288, CPDN only 136,581, PrimeGrid only 99,091, Milkyway only 280,277, etc.

Claggy

Fred E.
Volunteer tester
Joined: 22 Jul 99
Posts: 768
Credit: 24,139,004
RAC: 30
United States
Message 1284094 - Posted: 15 Sep 2012, 14:58:19 UTC

S@NL - Nick Succorso wrote:
> Funny also that this scheduling process, which is apparently asking some resources is on the same server as the AP splitter.

I've been thinking along those lines. Synergy seems to be handling more stuff in addition to scheduling than Bane handled, and it seems to bog down more when AP is being split. Fewer AP splitters would spread out the AP work flow, as Richard Haselgrove recently suggested in the News forum to ease the download congestion, and Synergy might do better without that extra load.

I've been using NNT to report completions since these intermittent problems started a couple of weeks ago. Work requests with more than 1 or 2 results reported seldom get through when it is tough, but a NNT update reporting results often gets a response. I'm also experimenting with the cc_config.xml option

<report_results_immediately>1</report_results_immediately>

to avoid building up a stack to report.

____________
Another Fred
Support SETI@home when you search the Web with GoodSearch or shop online with GoodShop.

S@NL - Nick Succorso
Joined: 24 May 00
Posts: 15
Credit: 23,074,763
RAC: 0
Netherlands
Message 1284101 - Posted: 15 Sep 2012, 15:14:28 UTC

@All: Thanks for your observations and reactions

@msattler: LOL2, After 11.5 years of SETI@Home involvement I never shed one tear. Swallowed a few though ;)

@BilBg: Points taken. Nevertheless, we have computers these days to keep track of such things, so why not the uploaded results too? Within the bigger system at Berkeley I imagine the database is available for more purposes, so why not this one as well? It could save an awful lot of datacom traffic, although the processing load would increase. Still, that seems easier to expand than the datacom facilities at Berkeley.

@Claggy: I know SETI is big, and might grow out of control if it continues like this. I also understand that there are no budgets available, and I therefore have deep respect and appreciation for the volunteers working on this project. Nevertheless, coming from a business/management background, I have great difficulty with the seemingly relaxed way this project is being managed. I await the day it comes to a grinding stop. Until that day, I shall crunch on happily, or perhaps a little less happily than in the past.

@Fred E.: Perhaps when our magician Mat returns from his musical travels he will be able to do something. Let's all pray his bands are not becoming too successful, or Mat may decide to opt for his musical career altogether.

____________

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 8491
Credit: 49,742,438
RAC: 55,600
United Kingdom
Message 1284108 - Posted: 15 Sep 2012, 15:37:21 UTC - in response to Message 1284081.

S@NL - Nick Succorso wrote:
> The point I am trying to make is that once a result is uploaded, an internal message withing the SETI@Home system in Berkely could handle the reporting. Why do those tens (hundereds?) of thousands of clients have to make connection to just tell the system "hey, I just uploaded a result to you, (just in case you forgot?)

Funny how much difference a single letter can make. It's that 'I' that makes all the difference.

As others have said, file upload is just a dumb copy - moving data from one hard disk (yours) to another hard disk (theirs). At that point, there's no 'I' involved.

I assume you'd like to receive credit for that task? You'd like your name to go up in lights if it's the one which finds ET?

Yes, thought so. So how do you match the file that just appeared on their hard disk with your computer, your account, your team? Don't talk about IP addresses. You might be using a laptop, and currently attached to a completely different WiFi hotspot from the one you were using when you downloaded the file. Or, like many of us, you might have several computers behind a NAT router, all with the same public IP address. Which one uploaded the file?

No, all you can use is the file name. That means you have to search the entire database. Sure, the database could be optimised to accelerate a text search on a flat-file table of 6,691,321 rows - but it doesn't sound smart to me. And what security do you have? Anybody could do the file transfer: say you get into an argument with somebody on the board. They could look at your task list, work out what all your 'in progress' result files are going to be called, and upload garbage under those names. Can't filter by IP, remember - so all your work gets invalidated.

So, that's why we report work separately. When you report, that 'I' contains your userID, the hostID, and an authenticator derived from your email and password. Your fingerprints are all over the report. The user and host IDs help the database narrow down the search for the task you're reporting - to a few hundred or a few thousand tasks, rather than a few million. And once you've got that shortlist of task names, you can actually save more time by matching up multiple tasks reported in a single batch - if they did it your way, processing result files individually as they arrive, the million-fold search would have to be done singly, for each separate returned file.

Have I convinced you yet? I can probably think of more if you'd like it...
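Richard's narrowing argument can be sketched as a toy example: matching reported result names against a full table scan versus pre-filtering by the userID/hostID carried in the report. This is illustrative Python with an in-memory list standing in for the server's database (the real server uses an indexed MySQL database, so treat it purely as a sketch of the idea):

```python
# Toy "result" table: rows of (userid, hostid, result_name).
rows = [(u, h, f"result_{u}_{h}_{i}")
        for u in range(100) for h in range(3) for i in range(50)]

def match_by_name_only(table, reported_names):
    """No identity in the report: every name is matched against the whole table."""
    wanted = set(reported_names)
    return [row for row in table if row[2] in wanted]

def match_with_identity(table, userid, hostid, reported_names):
    """The report's userID/hostID shrink the candidate set before name matching."""
    wanted = set(reported_names)
    candidates = [row for row in table
                  if row[0] == userid and row[1] == hostid]
    return [row for row in candidates if row[2] in wanted]

batch = [f"result_7_2_{i}" for i in range(10)]
# Both find the same 10 tasks, but the second only name-matches the 50 rows
# belonging to host 2 of user 7 instead of all 15,000.
print(len(match_by_name_only(rows, batch)))
print(len(match_with_identity(rows, 7, 2, batch)))
```

The batch aspect matters too: with identity, one candidate shortlist serves all ten reported names at once, rather than repeating the full search per file.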

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5811
Credit: 58,779,670
RAC: 48,398
Australia
Message 1284264 - Posted: 15 Sep 2012, 21:31:28 UTC - in response to Message 1284108.


I've been getting a few Scheduler request timeouts again this morning.
____________
Grant
Darwin NT.

S@NL - Nick Succorso
Joined: 24 May 00
Posts: 15
Credit: 23,074,763
RAC: 0
Netherlands
Message 1284282 - Posted: 15 Sep 2012, 22:22:00 UTC - in response to Message 1284108.

Ah, Richard, please do not misunderstand me. I am just a faithful cruncher dedicated to the project. It is just that I am apparently very ignorant about the background of everything. As such, I do not want to dive too deeply into the technical background, as many of us won't. And please don't forget, the English you guys sometimes use in this forum to describe things is not always that easy to follow for us non-English speakers.

Indeed, some recognition for what I dedicate my 'rigs' to against today's price of energy is appreciated nonetheless. And if I keep those rigs powered-up for SETI@Home they better work as hard as they can all of the time. Like kitties chasing their own tail, msattler probably would say.

It's just a couple of simple questions I was asking, which I never asked before, so no one had given me the answers so far. Now I did ask them, and today I got some answers. Still no answer, though, as to why one rig reports flawlessly and the other simply doesn't!

Your explanation helps me a lot in understanding the background of separate reporting a bit more. After all, I also took part in SETI Classic, and remember very vividly what happened there, also on my Dutch end of the project. So, thanks for your explanation, which is indeed very helpful.
____________

MarkJ
Volunteer tester
Joined: 17 Feb 08
Posts: 938
Credit: 23,742,445
RAC: 74,931
Australia
Message 1284339 - Posted: 16 Sep 2012, 2:03:21 UTC

I have had a lot of problems getting work requests through in the last few days on all my rigs (Seti only). As others have noted, reporting completed work and not asking for any seems to get through. All machines have a fairly minimal cache, with the largest being about 0.25 of a day, so we're not talking about a lot of work units.
____________
BOINC blog


Copyright © 2014 University of California