Panic Mode On (79) Server Problems?




Message boards : Number crunching : Panic Mode On (79) Server Problems?

Previous · 1 . . . 3 · 4 · 5 · 6 · 7 · 8 · 9 . . . 23 · Next
juan BFB
Volunteer tester
Avatar
Send message
Joined: 16 Mar 07
Posts: 4922
Credit: 266,965,597
RAC: 342,592
Brazil
Message 1308508 - Posted: 21 Nov 2012, 21:21:37 UTC - in response to Message 1308497.
Last modified: 21 Nov 2012, 21:29:02 UTC

Things seem to have started again, and this time we're talking to Synergy over the Campus data network (128.32.18.157) - anybody using a manually configured hosts file please note. We're still using setiboinc.ssl.berkeley.edu, so the proxies should pick up the change automatically.

So far, the only difference that I've noticed (apart from the fact that it works...) is a re-allocation and download of some of the little graphics files used in Simple View.


No real change on my side: DL is very slow without a proxy (<0.5 kbps), a little better with one (still <5 kbps). The same host/connection gives >1 MBps when DLing an Einstein WU, so the slowness is not my internet connection. The scheduler works very slowly too, and UL is fast but takes a lot of time to clear from the screen (I don't know the name you give to the task state after the UL completes 100%).

Bad, but at least data is flowing with little or no error (I really haven't seen any scheduler error yet)... but it takes more time to DL than to crunch... so at these rates the caches will never fill on the fastest hosts, even with 100 WU.

But the AP splitters are still offline...
____________

Claggy
Volunteer tester
Send message
Joined: 5 Jul 99
Posts: 4039
Credit: 32,691,076
RAC: 800
United Kingdom
Message 1308511 - Posted: 21 Nov 2012, 21:31:21 UTC - in response to Message 1308497.
Last modified: 21 Nov 2012, 21:41:10 UTC

Things seem to have started again, and this time we're talking to Synergy over the Campus data network (128.32.18.157) - anybody using a manually configured hosts file please note. We're still using setiboinc.ssl.berkeley.edu, so the proxies should pick up the change automatically.

So far, the only difference that I've noticed (apart from the fact that it works...) is a re-allocation and download of some of the little graphics files used in Simple View.

Brilliant, scheduler contacts now just work. Even if I get no work (at the Main project), at least we've got a workaround for the scheduler timeouts. For example:

21/11/2012 21:37:27 SETI@home Beta Test [sched_op_debug] Starting scheduler request
21/11/2012 21:37:27 SETI@home Beta Test Sending scheduler request: To fetch work.
21/11/2012 21:37:27 SETI@home Beta Test Requesting new tasks for CPU and GPU
21/11/2012 21:37:27 SETI@home Beta Test [sched_op_debug] CPU work request: 82381.29 seconds; 0.00 CPUs
21/11/2012 21:37:27 SETI@home Beta Test [sched_op_debug] NVIDIA GPU work request: 0.00 seconds; 0.00 GPUs
21/11/2012 21:37:27 SETI@home Beta Test [sched_op_debug] ATI GPU work request: 20239.74 seconds; 0.00 GPUs
21/11/2012 21:37:33 SETI@home Beta Test Scheduler request completed: got 20 new tasks
21/11/2012 21:37:33 SETI@home Beta Test [sched_op_debug] Server version 701
21/11/2012 21:37:33 SETI@home Beta Test Project requested delay of 7 seconds
21/11/2012 21:37:33 SETI@home Beta Test [sched_op_debug] estimated total CPU job duration: 76666 seconds
21/11/2012 21:37:33 SETI@home Beta Test [sched_op_debug] estimated total NVIDIA GPU job duration: 0 seconds
21/11/2012 21:37:33 SETI@home Beta Test [sched_op_debug] estimated total ATI GPU job duration: 19036 seconds
21/11/2012 21:37:33 SETI@home Beta Test [sched_op_debug] Deferring communication for 7 sec
21/11/2012 21:37:33 SETI@home Beta Test [sched_op_debug] Reason: requested by project


Claggy

Richard Haselgrove
Volunteer tester
Send message
Joined: 4 Jul 99
Posts: 8373
Credit: 46,528,440
RAC: 13,188
United Kingdom
Message 1308512 - Posted: 21 Nov 2012, 21:32:47 UTC - in response to Message 1308508.

Things seem to have started again, and this time we're talking to Synergy over the Campus data network (128.32.18.157) - anybody using a manually configured hosts file please note. We're still using setiboinc.ssl.berkeley.edu, so the proxies should pick up the change automatically.

So far, the only difference that I've noticed (apart from the fact that it works...) is a re-allocation and download of some of the little graphics files used in Simple View.

No real change on my side: DL is very slow without a proxy (<0.5 kbps), a little better with one (still <5 kbps). The same configuration gives >1 MBps when DLing an Einstein WU, so the slowness is not my internet connection. The scheduler works very slowly too, and UL is fast but takes a lot of time to clear from the screen (I don't know the name you give to the task state after the UL completes 100%).

Bad, but at least data is flowing... but it takes more time to DL than to crunch... so at these rates the caches will never fill on the fastest hosts, even with 100 WU.

I'm trying to keep up a modest (not overwhelming) running commentary to Eric.

From here, it's looking as if the scheduler change has made a big improvement, and work is being allocated pretty much on demand (while stocks last...)

That puts us under the classic "post outage" download pressure, even without AP in the mix. And there's an added problem: Vader seems to be handling the load on its own, while GeorgeM is having the same timeout problems as Synergy was. Maybe a clue there, since they're both newer servers? Anyway, that's hopefully next on the list - they've already turned off a couple of background processes on GeorgeM, but I've just had to report 'no change'.

juan BFB
Volunteer tester
Avatar
Send message
Joined: 16 Mar 07
Posts: 4922
Credit: 266,965,597
RAC: 342,592
Brazil
Message 1308513 - Posted: 21 Nov 2012, 21:38:05 UTC - in response to Message 1308512.
Last modified: 21 Nov 2012, 21:40:01 UTC


I'm trying to keep up a modest (not overwhelming) running commentary to Eric.

Agreed - let's avoid any more wear on Eric. I just make these reports so you know how things are on our side of the world; normally, the farther you are from the lab, the bigger the problem, due to the normal internet connection delays.

I think the fix they made is a good first step on the path to finally solving the problem. At least now everything is working. Very slowly, but it's alive!
____________

Profile HAL9000
Volunteer tester
Avatar
Send message
Joined: 11 Sep 99
Posts: 3840
Credit: 106,262,354
RAC: 88,333
United States
Message 1308515 - Posted: 21 Nov 2012, 21:47:26 UTC - in response to Message 1308506.

Message from Eric:

We've got some things to try. Let us know if it starts working.


I find these two helpful:

findstr /C:"[SETI@home] Scheduler request failed: " stdoutdae.txt >sched_failures-%computername%.txt

findstr /C:"[SETI@home] Scheduler request completed: " stdoutdae.txt >sched_successes-%computername%.txt


They work in the "command prompt" environment in Windows.

Save them (separately or together) in one or two files in BOINC's Data directory: give the files names with the extension ".cmd"

Then, double-clicking the file(s) will quickly give you an overview of how well the scheduler requests have been going.

Don't swamp Eric with data, but if a few of us (those who feel confident working with that minimalist instruction - don't bother if you're not comfortable doing that) keep an eye on his experiments and provide feedback, it may help. Remember your logs will be timestamped in your local timezone - please supply the UTC offset so he can match them up with the server changes.

I had not thought to check the logs that way. Quite a good idea. I took it a bit further and added a third search, for "[SETI@home] Scheduler request failed: Timeout was reached", to separate out the other failures. Then I have the batch file count the lines and give me the failure percentage, total and timeout. So far, checking several machines with data going back to the 5th, the failure rate is between 14% and 19% for all failures.

Quick script to provide the percentage of failures on your machine.
@ECHO OFF
cls
REM Run from BOINC's Data directory (pushd moves to wherever this script lives).
pushd %~dp0
findstr /C:"[SETI@home] Scheduler request failed: " stdoutdae.txt >sched_failures-%computername%.txt
findstr /C:"[SETI@home] Scheduler request failed: Timeout was reached" stdoutdae.txt >sched_failures_timeout-%computername%.txt
findstr /C:"[SETI@home] Scheduler request completed: " stdoutdae.txt >sched_successes-%computername%.txt
REM Count the lines in each results file.
for /f %%a in ('Find /V /C "" ^< sched_failures-%computername%.txt') do set fails=%%a
for /f %%a in ('Find /V /C "" ^< sched_failures_timeout-%computername%.txt') do set time_fails=%%a
for /f %%a in ('Find /V /C "" ^< sched_successes-%computername%.txt') do set success=%%a
set /a schdreqcnt=%fails%+%success%
@ECHO Scheduler Requests: %schdreqcnt%
REM Integer percentages (set /a truncates towards zero).
set /a schdsuccesspct=%success%*100/%schdreqcnt%
@ECHO Scheduler Success: %schdsuccesspct% %%
set /a schdfailspct=%fails%*100/%schdreqcnt%
@ECHO Scheduler Failure: %schdfailspct% %%
set /a schdtofailspct=%time_fails%*100/%schdreqcnt%
@ECHO Scheduler Timeout: %schdtofailspct% %% of total
set /a schdtmfailspct=%time_fails%*100/%fails%
@ECHO Scheduler Timeout: %schdtmfailspct% %% of failures
pause
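For anyone crunching on Linux or a Mac, a rough shell equivalent can be sketched like this - my sketch, not tested against the project, and it assumes the same "[SETI@home]" message strings appear in your stdoutdae.txt (pass your own path to the function):

```shell
# Rough shell equivalent of the batch script above (a sketch; assumes the
# same "[SETI@home]" message strings, and you supply the path to your own
# stdoutdae.txt as the first argument).
sched_stats() {
    log="$1"
    fails=$(grep -cF '[SETI@home] Scheduler request failed: ' "$log")
    time_fails=$(grep -cF '[SETI@home] Scheduler request failed: Timeout was reached' "$log")
    success=$(grep -cF '[SETI@home] Scheduler request completed: ' "$log")
    total=$((fails + success))
    echo "Scheduler Requests: $total"
    # Guard against an empty log so we don't divide by zero.
    if [ "$total" -gt 0 ]; then
        echo "Scheduler Success: $((success * 100 / total)) %"
        echo "Scheduler Failure: $((fails * 100 / total)) %"
        echo "Scheduler Timeout: $((time_fails * 100 / total)) % of total"
    fi
    if [ "$fails" -gt 0 ]; then
        echo "Scheduler Timeout: $((time_fails * 100 / fails)) % of failures"
    fi
}
```

Call it as, e.g., `sched_stats /var/lib/boinc-client/stdoutdae.txt` - that path is a common Linux package default, but yours may differ.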

____________
SETI@home classic workunits: 93,865 CPU time: 863,447 hours

Join the BP6/VP6 User Group today!

Richard Haselgrove
Volunteer tester
Send message
Joined: 4 Jul 99
Posts: 8373
Credit: 46,528,440
RAC: 13,188
United Kingdom
Message 1308516 - Posted: 21 Nov 2012, 21:48:29 UTC - in response to Message 1308511.

Brilliant, scheduler contacts now just work...

As they do at the Main project:

21/11/2012 21:39:00 | SETI@home | [sched_op] Starting scheduler request
21/11/2012 21:39:00 | SETI@home | Sending scheduler request: Requested by user.
21/11/2012 21:39:00 | SETI@home | Reporting 36 completed tasks, requesting new tasks for NVIDIA GPU
21/11/2012 21:39:00 | SETI@home | [sched_op] CPU work request: 0.00 seconds; 0.00 CPUs
21/11/2012 21:39:00 | SETI@home | [sched_op] NVIDIA GPU work request: 70953.66 seconds; 0.00 GPUs
21/11/2012 21:39:07 | SETI@home | Scheduler request completed: got 18 new tasks

So, both reporting and new work in 7 seconds flat - they were resends for work I'd ghosted earlier today, but I don't think that makes much difference.

Horacio
Send message
Joined: 14 Jan 00
Posts: 536
Credit: 68,594,783
RAC: 91,289
Argentina
Message 1308521 - Posted: 21 Nov 2012, 21:59:31 UTC

Since Richard posted saying that we are "talking" to Synergy through the campus network, everything has been working fine here...
(It seems that Synergy was offended by some of my hosts because "he" refused to talk with them previously... ;D )

I can't say whether the downloads are slow or not, because I've reached the limits and haven't been able to catch a host while it was downloading the allowed replacements for reported tasks... (I guess that's a kind of evidence that downloads are not a big issue here...)
____________

mikeej42
Send message
Joined: 26 Oct 00
Posts: 109
Credit: 787,234,227
RAC: 57,764
United States
Message 1308526 - Posted: 21 Nov 2012, 22:15:44 UTC - in response to Message 1308515.

The .cmd (or .bat) file needs to be stored in %ProgramData%\BOINC for the pushd to point to the error files.
____________

Richard Haselgrove
Volunteer tester
Send message
Joined: 4 Jul 99
Posts: 8373
Credit: 46,528,440
RAC: 13,188
United Kingdom
Message 1308528 - Posted: 21 Nov 2012, 22:23:32 UTC - in response to Message 1308526.

The .cmd (or .bat) file needs to be stored in %ProgramData%\BOINC for the pushd to point to the error files.

No, it needs to be stored in whatever location you have chosen for "BOINC's Data directory". If you took BOINC's default suggestion, and you're running Windows Vista or Windows 7, then your data will indeed have ended up there.

On my WindowsXP machines, it tends to end up in D:\BOINCdata - my choice, not BOINC's.

mikeej42
Send message
Joined: 26 Oct 00
Posts: 109
Credit: 787,234,227
RAC: 57,764
United States
Message 1308531 - Posted: 21 Nov 2012, 22:37:33 UTC - in response to Message 1308528.

The .cmd (or .bat) file needs to be stored in %ProgramData%\BOINC for the pushd to point to the error files.

No, it needs to be stored in whatever location you have chosen for "BOINC's Data directory". If you took BOINC's default suggestion, and you're running Windows Vista or Windows 7, then your data will indeed have ended up there.

On my WindowsXP machines, it tends to end up in D:\BOINCdata - my choice, not BOINC's.

Poor wording on my part, thanks for the correction.
____________

juan BFB
Volunteer tester
Avatar
Send message
Joined: 16 Mar 07
Posts: 4922
Credit: 266,965,597
RAC: 342,592
Brazil
Message 1308535 - Posted: 21 Nov 2012, 22:43:01 UTC - in response to Message 1308531.
Last modified: 21 Nov 2012, 22:57:27 UTC

The .cmd (or .bat) file needs to be stored in %ProgramData%\BOINC for the pushd to point to the error files.

No, it needs to be stored in whatever location you have chosen for "BOINC's Data directory". If you took BOINC's default suggestion, and you're running Windows Vista or Windows 7, then your data will indeed have ended up there.

On my WindowsXP machines, it tends to end up in D:\BOINCdata - my choice, not BOINC's.

Poor wording on my part, thanks for the correction.

Works perfectly, and BTW no scheduler errors so far.

Now the problem is how to DL the WUs; they are still too slow and finally stall, as in this example:

21/11/2012 20:49:16 SETI@home Temporarily failed upload of 25au12ac.20695.9065.140733193388037.10.81_1_0: HTTP error
____________

mikeej42
Send message
Joined: 26 Oct 00
Posts: 109
Credit: 787,234,227
RAC: 57,764
United States
Message 1308543 - Posted: 21 Nov 2012, 23:20:32 UTC

11/21/2012 5:13:39 PM | SETI@home | Started download of ap_30au12ad_B2_P1_00087_20121118_28020.wu
11/21/2012 5:15:30 PM | SETI@home | Temporarily failed download of ap_31au12ad_B4_P0_00073_20121120_02469.wu: transient HTTP error

The scheduler appears to be more consistent in returning a completed response, but I'm still having trouble downloading the tasks I do get assigned.
____________

Profile Bernie Vine
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 26 May 99
Posts: 6782
Credit: 24,408,685
RAC: 26,858
United Kingdom
Message 1308558 - Posted: 22 Nov 2012, 0:18:13 UTC
Last modified: 22 Nov 2012, 0:18:26 UTC

Quick script to provide the percentage of failures on your machine.


Thanks


____________


Today is life, the only life we're sure of. Make the most of today.

Richard Haselgrove
Volunteer tester
Send message
Joined: 4 Jul 99
Posts: 8373
Credit: 46,528,440
RAC: 13,188
United Kingdom
Message 1308561 - Posted: 22 Nov 2012, 0:25:52 UTC - in response to Message 1308558.

Quick script to provide the percentage of failures on your machine.


Thanks


Now we just need a small tweak to divide those into 'before tonight' and 'after tonight', so we know what effect Eric's changes have had.
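One possible shape for that tweak, sketched in shell rather than batch - an assumption-laden sketch: it expects log lines starting with the dd/mm/yyyy hh:mm:ss timestamps quoted earlier in the thread (adjust the parsing if your stdoutdae.txt differs), and the yyyymmddHHMMSS cutoff format is my own choice:

```shell
# Count scheduler failures before/after a cutoff time (sketch; assumes log
# lines begin "dd/mm/yyyy HH:MM:SS" as in the excerpts quoted in this thread).
# Cutoff is yyyymmddHHMMSS, e.g. 20121121210000 for tonight's server change.
count_fails_split() {
    log="$1"; cutoff="$2"
    awk -v cutoff="$cutoff" '
        /\[SETI@home\] Scheduler request failed: / {
            split($1, d, "/")          # $1 is dd/mm/yyyy
            gsub(":", "", $2)          # $2 becomes HHMMSS
            key = d[3] d[2] d[1] $2    # rearranged to yyyymmddHHMMSS
            if (key < cutoff) before++; else after++
        }
        END { printf "before: %d  after: %d\n", before + 0, after + 0 }
    ' "$log"
}
```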

Lionel
Send message
Joined: 25 Mar 00
Posts: 544
Credit: 215,686,983
RAC: 198,154
Australia
Message 1308585 - Posted: 22 Nov 2012, 2:21:16 UTC - in response to Message 1307895.

I think for the time being, I'm treating it as what they always warn us about - a period of a few hours congestion after an outage.

We've had 200,000 results returned in an hour, 1,400 queries a second, and we've still got 94 Mbit/sec. Assuming they leave the splitters turned off until all the current results ready to send have been allocated and downloaded (which I hope they do), we'll get a better idea how well the scheduler copes with 'report only'.


Richard, good luck with the clearing exercise ... just by way of info, I had work allocated to me around 6 weeks ago that is still trying to get here ... great chunks are now aborting due to not being downloaded in time to be completed ... my guess would be that the problems run deeper than people think ... the continual HTTP gateway timeouts aren't helping either; they only add to the aborted total ...

cheers



____________

tbret
Volunteer tester
Avatar
Send message
Joined: 28 May 99
Posts: 2598
Credit: 186,680,205
RAC: 415,982
United States
Message 1308636 - Posted: 22 Nov 2012, 5:41:35 UTC - in response to Message 1308528.

Richard, I just want to add my thanks to you (and others) for trying to work-through this mess.

Grant (SSSF)
Send message
Joined: 19 Aug 99
Posts: 5683
Credit: 56,042,472
RAC: 49,996
Australia
Message 1308645 - Posted: 22 Nov 2012, 6:25:29 UTC - in response to Message 1308511.
Last modified: 22 Nov 2012, 6:26:51 UTC

Brilliant, scheduler contacts now just work. Even if I get no work (at the Main project), at least we've got a workaround for the scheduler timeouts.

Yep, whatever they did during that mini outage today, it's worked.
I went back through my log and had a look: on the first try after the outage the Scheduler responded within 7 seconds. A few requests an hour or two later took around 45 seconds, but most often it responds within 4, sometimes within 2.
I can't see any sign of Scheduler-related errors at all in either of my systems' logs since today's outage.

Well done Eric & co.



EDIT: I also notice the Master Database queries are way down - less than 700/s since the initial burst after the outage.
____________
Grant
Darwin NT.

Grant (SSSF)
Send message
Joined: 19 Aug 99
Posts: 5683
Credit: 56,042,472
RAC: 49,996
Australia
Message 1308647 - Posted: 22 Nov 2012, 6:37:53 UTC - in response to Message 1308516.
Last modified: 22 Nov 2012, 6:38:36 UTC

So, both reporting and new work in 7 seconds flat - they were resends for work I'd ghosted earlier today, but I don't think that makes much difference.

For me, when things were borked, if a Scheduler request didn't time out it was taking at least 2 minutes to get a response to a request for work. Even with NNT set it would often take more than a minute.
Now, reporting & requesting more work is getting a response within 4-6 seconds in most cases at the moment.
____________
Grant
Darwin NT.

Grant (SSSF)
Send message
Joined: 19 Aug 99
Posts: 5683
Credit: 56,042,472
RAC: 49,996
Australia
Message 1308653 - Posted: 22 Nov 2012, 7:12:30 UTC - in response to Message 1308647.


C:\Windows\system32>ping setiboinc.ssl.berkeley.edu

Pinging synergy.ssl.berkeley.edu [128.32.18.157] with 32 bytes of data:
Reply from 128.32.18.157: bytes=32 time=238ms TTL=48
Reply from 128.32.18.157: bytes=32 time=237ms TTL=48
Reply from 128.32.18.157: bytes=32 time=237ms TTL=48
Reply from 128.32.18.157: bytes=32 time=238ms TTL=48

Ping statistics for 128.32.18.157:
Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 237ms, Maximum = 238ms, Average = 237ms

C:\Windows\system32>ping setiboinc.ssl.berkeley.edu

Pinging synergy.ssl.berkeley.edu [128.32.18.157] with 32 bytes of data:
Reply from 128.32.18.157: bytes=32 time=237ms TTL=48
Reply from 128.32.18.157: bytes=32 time=238ms TTL=48
Reply from 128.32.18.157: bytes=32 time=238ms TTL=48
Reply from 128.32.18.157: bytes=32 time=237ms TTL=48

Ping statistics for 128.32.18.157:
Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 237ms, Maximum = 238ms, Average = 237ms

C:\Windows\system32>ping setiboinc.ssl.berkeley.edu

Pinging synergy.ssl.berkeley.edu [128.32.18.157] with 32 bytes of data:
Reply from 128.32.18.157: bytes=32 time=237ms TTL=48
Reply from 128.32.18.157: bytes=32 time=237ms TTL=48
Reply from 128.32.18.157: bytes=32 time=237ms TTL=48
Reply from 128.32.18.157: bytes=32 time=237ms TTL=48

Ping statistics for 128.32.18.157:
Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 237ms, Maximum = 237ms, Average = 237ms


Pinging setiboincdata.ssl.berkeley.edu gives 25-75% packet loss.
Pinging boinc2.ssl.berkeley.edu gives 50-75% packet loss (and that's with .13 configured in my hosts file).
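For anyone who wants to compare routes the same way without eyeballing each summary, a tiny helper can pull the loss figure out of ping's output - a sketch that assumes the Linux iputils ping summary format; Windows ping words its summary line differently, so the pattern would need adjusting there:

```shell
# Extract the packet-loss percentage from a ping summary line (Linux iputils
# format assumed, e.g. "4 packets transmitted, 1 received, 75% packet loss").
loss_pct() {
    sed -n 's/.*[ ,]\([0-9][0-9]*\)% packet loss.*/\1/p' | head -n 1
}

# Canned example; a live check would be:
#   ping -c 4 setiboinc.ssl.berkeley.edu | loss_pct
echo '4 packets transmitted, 1 received, 75% packet loss, time 3004ms' | loss_pct
# prints 75
```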
____________
Grant
Darwin NT.

msattler
Volunteer tester
Avatar
Send message
Joined: 9 Jul 00
Posts: 38149
Credit: 555,357,711
RAC: 616,257
United States
Message 1308658 - Posted: 22 Nov 2012, 7:31:25 UTC

So far, today's server reconfig seems to have made a difference here.
The kitties had 5 rigs doing Einstein this morning, only one is still doing so at present.

We shall see.....
____________
*********************************************
Embrace your inner kitty...ya know ya wanna!

I have met a few friends in my life.
Most were cats.



Copyright © 2014 University of California