Panic Mode On (79) Server Problems? |
Message boards : Number crunching : Panic Mode On (79) Server Problems?
| Author | Message |
|---|---|
"Things seem to have started again, and this time we're talking to Synergy over the Campus data network (128.32.18.157) - anybody using a manually configured hosts file please note. We're still using setiboinc.ssl.berkeley.edu, so the proxies should pick up the change automatically." No real changes on my side: DL is very slow without a proxy (<0.5 kbps) and only a little better with one (still <5 kbps). The same host and connection give >1 MBps when DLing an Einstein WU, so the slowness is not down to my internet connection. The scheduler is very slow too, and UL is fast but takes a long time to clear from the screen (I don't know the name you give to the task after the UL reaches 100%). Bad, but at least data is flowing with few or no errors (I really haven't seen any scheduler error yet)... but it takes more time to DL than to crunch, so at these rates the caches will never fill on the fastest hosts, even with 100 WU. And the AP splitters are still offline... | |
| ID: 1308508 · | |
"Things seem to have started again, and this time we're talking to Synergy over the Campus data network (128.32.18.157) - anybody using a manually configured hosts file please note. We're still using setiboinc.ssl.berkeley.edu, so the proxies should pick up the change automatically." Brilliant, scheduler contacts now just work. Even if I get no work (at the Main project), at least we've got a workaround for the scheduler timeouts. For example: 21/11/2012 21:37:27 SETI@home Beta Test [sched_op_debug] Starting scheduler request Claggy | |
| ID: 1308511 · | |
"Things seem to have started again, and this time we're talking to Synergy over the Campus data network (128.32.18.157) - anybody using a manually configured hosts file please note. We're still using setiboinc.ssl.berkeley.edu, so the proxies should pick up the change automatically." I'm trying to keep up a modest (not overwhelming) running commentary to Eric. From here, it's looking as if the scheduler change has made a big improvement, and work is being allocated pretty much on demand (while stocks last...). That puts us under the classic "post outage" download pressure, even without AP in the mix. And there's an added problem: Vader seems to be handling the load on its own, while GeorgeM is having the same timeout problems as Synergy was. Maybe a clue there, since they're both newer servers? Anyway, hopefully that's next on the list - they've already turned off a couple of background processes on GeorgeM, but I've just had to report 'no change'. | |
| ID: 1308512 · | |
I agree with you: let's avoid wearing Eric down any further. I'm just reporting so you know how things are on our side of the world; normally the problem gets bigger the farther you are from the lab, due to the normal internet connection delays. I think the fix they made is a good first step on the path to finally solving the problem. At least everything is working now: very slow, but alive! | |
| ID: 1308513 · | |
Message from Eric: Quick script to provide the percentage of failures on your machine.
@ECHO OFF
cls
REM Work in the folder this script lives in (it must sit next to stdoutdae.txt)
pushd %~dp0
REM Pull the scheduler failures, timeout failures and successes into separate files
findstr /C:"[SETI@home] Scheduler request failed: " stdoutdae.txt >sched_failures-%computername%.txt
findstr /C:"[SETI@home] Scheduler request failed: Timeout was reached" stdoutdae.txt >sched_failures_timeout-%computername%.txt
findstr /C:"[SETI@home] Scheduler request completed: " stdoutdae.txt >sched_successes-%computername%.txt
REM Count the lines in each file
for /f %%a in ('Find /V /C "" ^< sched_failures-%computername%.txt') do set fails=%%a
for /f %%a in ('Find /V /C "" ^< sched_failures_timeout-%computername%.txt') do set time_fails=%%a
for /f %%a in ('Find /V /C "" ^< sched_successes-%computername%.txt') do set success=%%a
REM Integer percentages only; set /a reports a divide-by-zero error if a divisor is 0
set /a schdreqcnt=%fails%+%success%
@ECHO Scheduler Requests: %schdreqcnt%
set /a schdsuccesspct=%success%*100/%schdreqcnt%*100/100
@ECHO Scheduler Success: %schdsuccesspct% %%
set /a schdfailspct=%fails%*100/%schdreqcnt%*100/100
@ECHO Scheduler Failure: %schdfailspct% %%
set /a schdtofailspct=%time_fails%*100/%schdreqcnt%*100/100
@ECHO Scheduler Timeout: %schdtofailspct% %% of total
set /a schdtmfailspct=%time_fails%*100/%fails%*100/100
@ECHO Scheduler Timeout: %schdtmfailspct% %% of failures
pause
____________ SETI@home classic workunits: 93,865 CPU time: 863,447 hours Join the BP6/VP6 User Group today! | |
| ID: 1308515 · | |
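For anyone not on Windows, the same counts can be sketched in Python. This is only a rough equivalent, not the script posted above: it assumes stdoutdae.txt sits in the current directory and searches for the same three strings as the batch file. The names `count_requests`, `pct` and `report` are made up for the example. Unlike `set /a`, it returns 0 rather than an error when there are no requests or no failures to divide by.

```python
from pathlib import Path

# Same search strings as the batch script's findstr calls.
FAILED = "[SETI@home] Scheduler request failed: "
TIMEOUT = "[SETI@home] Scheduler request failed: Timeout was reached"
COMPLETED = "[SETI@home] Scheduler request completed: "


def count_requests(lines):
    """Count (failures, timeout failures, successes) in an iterable of log lines."""
    fails = timeouts = successes = 0
    for line in lines:
        if FAILED in line:
            fails += 1
            if TIMEOUT in line:  # timeouts are a subset of the failures
                timeouts += 1
        elif COMPLETED in line:
            successes += 1
    return fails, timeouts, successes


def pct(part, whole):
    """Integer percentage (truncated, like set /a); 0 when whole is 0."""
    return part * 100 // whole if whole else 0


def report(lines):
    """Return the five figures the batch script prints, as a dict."""
    fails, timeouts, successes = count_requests(lines)
    total = fails + successes
    return {
        "requests": total,
        "success_pct": pct(successes, total),
        "failure_pct": pct(fails, total),
        "timeout_pct_of_total": pct(timeouts, total),
        "timeout_pct_of_failures": pct(timeouts, fails),
    }


if __name__ == "__main__":
    log = Path("stdoutdae.txt")  # assumption: run from BOINC's data directory
    if log.exists():
        with log.open(errors="replace") as fh:
            for name, value in report(fh).items():
                print(f"{name}: {value}")
    else:
        print("stdoutdae.txt not found; run this from BOINC's data directory")
```

As with the batch file, it only sees what is still in stdoutdae.txt, so the figures cover whatever period that log happens to span.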
"Brilliant, scheduler contacts now just work..." As they do at the Main project: 21/11/2012 21:39:00 | SETI@home | [sched_op] Starting scheduler request So, both reporting and new work in 7 seconds flat - they were resends for work I'd ghosted earlier today, but I don't think that makes much difference. | |
| ID: 1308516 · | |
|
Ever since Richard posted that we are "talking" to Synergy through the campus network, everything has been working fine here... | |
| ID: 1308521 · | |
|
The .cmd (or .bat) file needs to be stored in %ProgramData%\BOINC for the pushd to point to the error files. | |
| ID: 1308526 · | |
"The .cmd (or .bat) file needs to be stored in %ProgramData%\BOINC for the pushd to point to the error files." No, it needs to be stored in whatever location you have chosen for "BOINC's Data directory". If you took BOINC's default suggestion, and you're running Windows Vista or Windows 7, then your data will indeed have ended up there. On my Windows XP machines, it tends to end up in D:\BOINCdata - my choice, not BOINC's. | |
| ID: 1308528 · | |
"The .cmd (or .bat) file needs to be stored in %ProgramData%\BOINC for the pushd to point to the error files." Poor wording on my part, thanks for the correction. | |
| ID: 1308531 · | |
"The .cmd (or .bat) file needs to be stored in %ProgramData%\BOINC for the pushd to point to the error files." Works perfectly, and BTW no scheduler errors so far. Now the problem is how to DL the WUs: they are still too slow and finally stop, as in this example: 21/11/2012 20:49:16 SETI@home Temporarily failed upload of 25au12ac.20695.9065.140733193388037.10.81_1_0: HTTP error | |
| ID: 1308535 · | |
11/21/2012 5:13:39 PM | SETI@home | Started download of ap_30au12ad_B2_P1_00087_20121118_28020.wu The scheduler appears to be more consistent in getting a completed response, but I'm still having trouble downloading the tasks I do get assigned. | |
| ID: 1308543 · | |
"Quick script to provide the percentage of failures on your machine." Thanks | |
| ID: 1308558 · | |
"Quick script to provide the percentage of failures on your machine." Now we just need a small tweak to divide those into 'before tonight' and 'after tonight', so we know what effect Eric's changes have had. | |
| ID: 1308561 · | |
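One way that before/after split might look, sketched in Python rather than as a tweak to the batch file. It is an illustration under assumptions: each stdoutdae.txt line is assumed to start with a timestamp like `21-Nov-2012 21:37:27`, and the cutoff is whatever time you pick yourself, since the thread does not give the exact moment of the change. The name `split_counts` is invented for the example.

```python
from datetime import datetime

FAILED = "[SETI@home] Scheduler request failed: "
COMPLETED = "[SETI@home] Scheduler request completed: "
STAMP = "%d-%b-%Y %H:%M:%S"  # assumed timestamp layout at the start of each line
STAMP_LEN = 20               # length of e.g. "21-Nov-2012 21:37:27"


def split_counts(lines, cutoff):
    """Return {'before': [fails, successes], 'after': [fails, successes]}.

    Lines stamped earlier than `cutoff` count as 'before', the rest as 'after'.
    Lines without a parseable leading timestamp are ignored.
    """
    counts = {"before": [0, 0], "after": [0, 0]}
    for line in lines:
        try:
            when = datetime.strptime(line[:STAMP_LEN], STAMP)
        except ValueError:
            continue  # not a timestamped log line
        period = "before" if when < cutoff else "after"
        if FAILED in line:
            counts[period][0] += 1
        elif COMPLETED in line:
            counts[period][1] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical cutoff; replace with the time the change reached your host.
    cutoff = datetime(2012, 11, 21, 20, 0, 0)
    example = [
        "21-Nov-2012 10:00:00 [SETI@home] Scheduler request failed: Timeout was reached",
        "21-Nov-2012 21:39:00 [SETI@home] Scheduler request completed: got 2 new tasks",
    ]
    print(split_counts(example, cutoff))
```

Note that `%b` parses month abbreviations according to the current locale, so this assumes English month names in the log and an English or C locale when running it.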
"I think for the time being, I'm treating it as what they always warn us about - a period of a few hours' congestion after an outage." Richard, good luck with the clearing exercise ... just by way of info, I had work allocated to me around 6 weeks ago that is still trying to get here ... great chunks are now aborting because they couldn't be downloaded and completed in time ... my guess would be that the problems are more manifest than people think ... the continual HTTP gateway timeout isn't helping either, it only adds to the aborted total ... cheers | |
| ID: 1308585 · | |
|
Richard, I just want to add my thanks to you (and others) for trying to work through this mess. | |
| ID: 1308636 · | |
"Brilliant, scheduler contacts now just work, even if I get no work (at the Main project), at least we've got a workaround for the scheduler timeouts," Yep, whatever they did for that mini outage today, it's worked. I went back through my log & had a look: on the first try after the outage the Scheduler responded within 7 seconds, there were a few an hour or 2 later that took around 45 sec, but most often it's within 4. Sometimes it's within 2. I can't see any sign of errors relating to the Scheduler at all in either of my systems' logs since today's outage. Well done Eric & co. EDIT- I also notice the Master Database queries are way down, less than 700/s since the initial burst after the outage. ____________ Grant Darwin NT. | |
| ID: 1308645 · | |
"So, both reporting and new work in 7 seconds flat - they were resends for work I'd ghosted earlier today, but I don't think that makes much difference." For me, when things were borked, if a Scheduler request didn't time out it was taking at least 2 minutes to get a response to a request for work. Even with NNT set it would often take more than a minute. Now, reporting & requesting more work is getting a response within 4-6 seconds in most cases at the moment. ____________ Grant Darwin NT. | |
| ID: 1308647 · | |
|
So far, today's server reconfig seems to have made a difference here. | |
| ID: 1308658 · | |
| Copyright © 2013 University of California |