Panic Mode On (79) Server Problems?
| Author | Message |
|---|---|
juan BFP Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799
|
Agreed, let's avoid putting any more strain on Eric. I only made the report so you know how things are on our side of the world; normally the problem gets worse the farther you are from the lab, because of the usual internet connection delays. I think the fix they made is a good first step toward finally solving the problem. At least everything is working now: very slow, but alive!
|
HAL9000 Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57
|
Message from Eric: Quick script to provide the percentage of failures on your machine.
@ECHO OFF
cls
pushd %~dp0
findstr /C:"[SETI@home] Scheduler request failed: " stdoutdae.txt >sched_failures-%computername%.txt
findstr /C:"[SETI@home] Scheduler request failed: Timeout was reached" stdoutdae.txt >sched_failures_timeout-%computername%.txt
findstr /C:"[SETI@home] Scheduler request completed: " stdoutdae.txt >sched_successes-%computername%.txt
for /f %%a in ('Find /V /C "" ^< sched_failures-%computername%.txt') do set fails=%%a
for /f %%a in ('Find /V /C "" ^< sched_failures_timeout-%computername%.txt') do set time_fails=%%a
for /f %%a in ('Find /V /C "" ^< sched_successes-%computername%.txt') do set success=%%a
set /a schdreqcnt=%fails%+%success%
@ECHO Scheduler Requests: %schdreqcnt%
rem Avoid division by zero when the log holds no requests or no failures
if %schdreqcnt% EQU 0 set schdreqcnt=1
set faildiv=%fails%
if %faildiv% EQU 0 set faildiv=1
set /a schdsuccesspct=%success%*100/%schdreqcnt%
@ECHO Scheduler Success: %schdsuccesspct% %%
set /a schdfailspct=%fails%*100/%schdreqcnt%
@ECHO Scheduler Failure: %schdfailspct% %%
set /a schdtofailspct=%time_fails%*100/%schdreqcnt%
@ECHO Scheduler Timeout: %schdtofailspct% %% of total
set /a schdtmfailspct=%time_fails%*100/%faildiv%
@ECHO Scheduler Timeout: %schdtmfailspct% %% of failures
pause
SETI@home classic workunits: 93,865 CPU time: 863,447 hours Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[
|
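For hosts where a batch file is not convenient, the same counts can be sketched in Python. This is a minimal sketch, not Eric's script: it assumes the same three search strings and the same log file name (stdoutdae.txt) that the batch file uses, and the names `scheduler_stats` and `pct` are made up for illustration.

```python
from pathlib import Path

# Search strings copied from the batch file's findstr commands.
FAILED = "[SETI@home] Scheduler request failed: "
TIMEOUT = "[SETI@home] Scheduler request failed: Timeout was reached"
COMPLETED = "[SETI@home] Scheduler request completed: "


def pct(part, whole):
    """Integer percentage, returning 0 instead of dividing by zero."""
    return part * 100 // whole if whole else 0


def scheduler_stats(lines):
    """Count scheduler outcomes in log lines and return integer percentages."""
    lines = list(lines)
    fails = sum(FAILED in line for line in lines)
    timeouts = sum(TIMEOUT in line for line in lines)
    successes = sum(COMPLETED in line for line in lines)
    requests = fails + successes
    return {
        "requests": requests,
        "success_pct": pct(successes, requests),
        "failure_pct": pct(fails, requests),
        "timeout_pct_of_total": pct(timeouts, requests),
        "timeout_pct_of_failures": pct(timeouts, fails),
    }


if __name__ == "__main__":
    log = Path("stdoutdae.txt")  # assumes the script runs from the BOINC data directory
    if log.exists():
        with log.open(errors="replace") as handle:
            for name, value in scheduler_stats(handle).items():
                print(f"{name}: {value}")
```

Like the batch file, this works in whole percentages; unlike it, an empty log gives 0 rather than an error.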
Richard Haselgrove Joined: 4 Jul 99 Posts: 14690 Credit: 200,643,578 RAC: 874
|
"Brilliant, scheduler contacts now just work..."
As they do at the Main project:
21/11/2012 21:39:00 | SETI@home | [sched_op] Starting scheduler request
So, both reporting and new work in 7 seconds flat - they were resends for work I'd ghosted earlier today, but I don't think that makes much difference. |
|
Horacio Joined: 14 Jan 00 Posts: 536 Credit: 75,967,266 RAC: 0
|
Since the time at which Richard posted saying that we are "talking" to Synergy through the campus, everything has been working fine here... (It seems that Synergy was offended by some of my hosts, because "he" refused to talk with them previously... ;D ) I can't say whether the downloads are slow or not, because I've reached the limits and I've not been able to catch a host while it was downloading the allowed replacements for the reported tasks... (I guess that's a kind of evidence that the downloads are not a big issue here...)
|
|
mikeej42 Joined: 26 Oct 00 Posts: 109 Credit: 791,875,385 RAC: 9
|
The .cmd (or .bat) file needs to be stored in %ProgramData%\BOINC for the pushd to point to the error files.
|
Richard Haselgrove Joined: 4 Jul 99 Posts: 14690 Credit: 200,643,578 RAC: 874
|
"The .cmd (or .bat) file needs to be stored in %ProgramData%\BOINC for the pushd to point to the error files."
No, it needs to be stored in whatever location you have chosen for "BOINC's Data directory". If you took BOINC's default suggestion, and you're running Windows Vista or Windows 7, then your data will indeed have ended up there. On my Windows XP machines, it tends to end up in D:\BOINCdata - my choice, not BOINC's. |
|
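Since the data directory differs from machine to machine, a script can check a few candidate locations for the log instead of hard-coding one. A minimal sketch, assuming the log is called stdoutdae.txt as in the batch file; the candidate list is hypothetical and only contains the two locations mentioned in this thread, so add your own directory to it.

```python
import os
from pathlib import Path

LOG_NAME = "stdoutdae.txt"  # log file name used by the batch file


def find_boinc_log(candidates):
    """Return the first existing log among candidate data directories, else None."""
    for directory in candidates:
        log = Path(directory) / LOG_NAME
        if log.is_file():
            return log
    return None


def default_candidates():
    """Hypothetical search list: only the locations mentioned in this thread."""
    dirs = []
    program_data = os.environ.get("ProgramData")  # set on Windows Vista and later
    if program_data:
        dirs.append(Path(program_data) / "BOINC")
    dirs.append(Path("D:/BOINCdata"))  # one poster's own choice, not a BOINC default
    return dirs


if __name__ == "__main__":
    found = find_boinc_log(default_candidates())
    print(found if found else "stdoutdae.txt not found; add your data directory to the list")
```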
mikeej42 Joined: 26 Oct 00 Posts: 109 Credit: 791,875,385 RAC: 9
|
"The .cmd (or .bat) file needs to be stored in %ProgramData%\BOINC for the pushd to point to the error files."
Poor wording on my part, thanks for the correction.
|
juan BFP Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799
|
"The .cmd (or .bat) file needs to be stored in %ProgramData%\BOINC for the pushd to point to the error files."
Works perfectly, and BTW no scheduler errors so far. Now the problem is downloading the WUs: they are still too slow and eventually stop, as in this example:
21/11/2012 20:49:16 SETI@home Temporarily failed upload of 25au12ac.20695.9065.140733193388037.10.81_1_0: HTTP error
|
|
mikeej42 Joined: 26 Oct 00 Posts: 109 Credit: 791,875,385 RAC: 9
|
11/21/2012 5:13:39 PM | SETI@home | Started download of ap_30au12ad_B2_P1_00087_20121118_28020.wu
The scheduler appears to be more consistent in getting a completed response, but I'm still having trouble downloading the tasks I do get assigned.
|
Bernie Vine Joined: 26 May 99 Posts: 9960 Credit: 103,452,613 RAC: 328
|
"Quick script to provide the percentage of failures on your machine."
Thanks |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14690 Credit: 200,643,578 RAC: 874
|
"Quick script to provide the percentage of failures on your machine."
Now we just need a small tweak to divide those into 'before tonight' and 'after tonight', so we know what effect Eric's changes have had. |
|
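One way the 'before tonight' / 'after tonight' split could be done is to parse the timestamp at the start of each log line and tally the two periods separately. A rough sketch under stated assumptions: it assumes stdoutdae.txt lines start with a timestamp like `21-Nov-2012 21:39:00` (an assumption about the log layout, not something shown in this thread), and the cutoff time is a placeholder, since the exact time of Eric's change is not given here. The names `split_counts` and `success_pct` are illustrative.

```python
from datetime import datetime

FAILED = "[SETI@home] Scheduler request failed: "
COMPLETED = "[SETI@home] Scheduler request completed: "
STAMP_FORMAT = "%d-%b-%Y %H:%M:%S"  # assumed timestamp layout, 20 characters long
STAMP_LENGTH = 20


def split_counts(lines, cutoff):
    """Tally scheduler failures and successes before and after a cutoff time."""
    counts = {
        "before": {"fails": 0, "successes": 0},
        "after": {"fails": 0, "successes": 0},
    }
    for line in lines:
        try:
            stamp = datetime.strptime(line[:STAMP_LENGTH], STAMP_FORMAT)
        except ValueError:
            continue  # skip lines without a parseable timestamp
        bucket = counts["before" if stamp < cutoff else "after"]
        if FAILED in line:
            bucket["fails"] += 1
        elif COMPLETED in line:
            bucket["successes"] += 1
    return counts


def success_pct(bucket):
    """Integer success percentage for one period, 0 if it saw no requests."""
    total = bucket["fails"] + bucket["successes"]
    return bucket["successes"] * 100 // total if total else 0


if __name__ == "__main__":
    cutoff = datetime(2012, 11, 21, 21, 0, 0)  # placeholder, not the real change time
    try:
        with open("stdoutdae.txt", errors="replace") as handle:
            counts = split_counts(handle, cutoff)
    except FileNotFoundError:
        counts = split_counts([], cutoff)
    for period, bucket in counts.items():
        print(period, bucket, f"{success_pct(bucket)}% success")
```

Month names are parsed with `%b`, so this only works as written with English month abbreviations in both the log and the Python locale.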
Lionel Joined: 25 Mar 00 Posts: 680 Credit: 563,640,304 RAC: 597
|
I think for the time being, I'm treating it as what they always warn us about - a period of a few hours' congestion after an outage.
Richard, good luck with the clearing exercise ... just by way of info, I had work allocated to me around 6 weeks ago that is still trying to get here ... great chunks are now aborting because they could not be downloaded and completed in time ... my guess would be that the problems are more extensive than people think ... the continual HTTP gateway timeout isn't helping either, it only adds to the aborted total ... cheers |
|
tbret Joined: 28 May 99 Posts: 3380 Credit: 296,162,071 RAC: 40
|
Richard, I just want to add my thanks to you (and others) for trying to work through this mess. |
|
Grant (SSSF) Joined: 19 Aug 99 Posts: 14015 Credit: 208,696,464 RAC: 304
|
"Brilliant, scheduler contacts now just work, even if I get no work (at the Main project) at least we've got a workaround for the scheduler timeouts,"
Yep, whatever they did for that mini outage today, it's worked. I went back through my log & had a look - on the first try after the outage the Scheduler responded within 7 seconds; there were a few an hour or 2 later that took around 45 sec, but most often it's within 4. Sometimes it's within 2. I can't see any sign of errors relating to the Scheduler at all in either of my systems' logs since today's outage. Well done Eric & co.
EDIT - I also notice the Master Database queries are way down, less than 700/s since the initial burst after the outage.
Grant
Darwin NT |
|
Grant (SSSF) Joined: 19 Aug 99 Posts: 14015 Credit: 208,696,464 RAC: 304
|
"So, both reporting and new work in 7 seconds flat - they were resends for work I'd ghosted earlier today, but I don't think that makes much difference."
For me, when things were borked, if a Scheduler request didn't time out it was taking at least 2 minutes to get a response to a request for work. Even with NNT set it would often take more than a minute. Now, reporting & requesting more work is getting a response within 4-6 seconds in most cases at the moment.
Grant
Darwin NT |
|
Grant (SSSF) Joined: 19 Aug 99 Posts: 14015 Credit: 208,696,464 RAC: 304
|
C:\Windows\system32>ping setiboinc.ssl.berkeley.edu
Pinging synergy.ssl.berkeley.edu [128.32.18.157] with 32 bytes of data:
Reply from 128.32.18.157: bytes=32 time=238ms TTL=48
Reply from 128.32.18.157: bytes=32 time=237ms TTL=48
Reply from 128.32.18.157: bytes=32 time=237ms TTL=48
Reply from 128.32.18.157: bytes=32 time=238ms TTL=48
Ping statistics for 128.32.18.157:
Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 237ms, Maximum = 238ms, Average = 237ms

C:\Windows\system32>ping setiboinc.ssl.berkeley.edu
Pinging synergy.ssl.berkeley.edu [128.32.18.157] with 32 bytes of data:
Reply from 128.32.18.157: bytes=32 time=237ms TTL=48
Reply from 128.32.18.157: bytes=32 time=238ms TTL=48
Reply from 128.32.18.157: bytes=32 time=238ms TTL=48
Reply from 128.32.18.157: bytes=32 time=237ms TTL=48
Ping statistics for 128.32.18.157:
Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 237ms, Maximum = 238ms, Average = 237ms

C:\Windows\system32>ping setiboinc.ssl.berkeley.edu
Pinging synergy.ssl.berkeley.edu [128.32.18.157] with 32 bytes of data:
Reply from 128.32.18.157: bytes=32 time=237ms TTL=48
Reply from 128.32.18.157: bytes=32 time=237ms TTL=48
Reply from 128.32.18.157: bytes=32 time=237ms TTL=48
Reply from 128.32.18.157: bytes=32 time=237ms TTL=48
Ping statistics for 128.32.18.157:
Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 237ms, Maximum = 237ms, Average = 237ms

Pinging setiboincdata.ssl.berkeley.edu gives 25-75% packet loss.
Pinging boinc2.ssl.berkeley.edu gives 50-75% packet loss (and that's with .13 configured in my hosts file).
Grant
Darwin NT |
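To compare servers over time without reading each run by eye, the loss percentage and average round-trip time can be pulled out of saved ping output. A small sketch, assuming English-language Windows ping output in the format pasted above; the function name `summarize_ping` is made up for illustration, and it handles one ping run at a time.

```python
import re

# Patterns match the summary lines of English-language Windows ping output.
LOSS_RE = re.compile(r"\((\d+)% loss\)")
AVG_RE = re.compile(r"Average = (\d+)ms")


def summarize_ping(output):
    """Extract packet loss (%) and average round trip (ms) from one ping run."""
    loss = LOSS_RE.search(output)
    avg = AVG_RE.search(output)
    return {
        "loss_pct": int(loss.group(1)) if loss else None,
        # The round-trip summary is absent when no replies arrive at all.
        "avg_ms": int(avg.group(1)) if avg else None,
    }


if __name__ == "__main__":
    sample = (
        "Ping statistics for 128.32.18.157:\n"
        "Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),\n"
        "Approximate round trip times in milli-seconds:\n"
        "Minimum = 237ms, Maximum = 238ms, Average = 237ms\n"
    )
    print(summarize_ping(sample))
```

Four packets per run is a small sample, so figures like "25-75% loss" are better judged over several runs than from any single one.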
kittyman Joined: 9 Jul 00 Posts: 51583 Credit: 1,018,363,574 RAC: 1,004
|
So far, today's server reconfig seems to have made a difference here. The kitties had 5 rigs doing Einstein this morning, only one is still doing so at present. We shall see..... "Time is simply the mechanism that keeps everything from happening all at once."
|
|
Grant (SSSF) Joined: 19 Aug 99 Posts: 14015 Credit: 208,696,464 RAC: 304
|
AP being split & sent off in large numbers will be the real test. Grant Darwin NT |
kittyman Joined: 9 Jul 00 Posts: 51583 Credit: 1,018,363,574 RAC: 1,004
|
Agreed. If I understand it, the change was in the routing of the pipe to the servers? "Time is simply the mechanism that keeps everything from happening all at once."
|
|
Grant (SSSF) Joined: 19 Aug 99 Posts: 14015 Credit: 208,696,464 RAC: 304
|
Just the Scheduler. Apparently (at least for now) they're able to use the campus network for the Scheduler traffic. If you look at the network graphs at present, instead of being around 14-20Mb/s it's been sitting around 10-12Mb/s inbound. I did some pings (posted a few posts back, from memory). No packet loss at all, whereas the download server (I use .13 exclusively) is around 50-75% & the upload server is around 50% packet loss.
EDIT - & the other real test will be to bump up the limits & see if things fall over again or not. Maybe 400 per core & 1200 per GPU to start with? Hint, hint.
Grant
Darwin NT |