Panic Mode On (79) Server Problems? |
Message boards : Number crunching : Panic Mode On (79) Server Problems?
| Author | Message |
|---|---|
|
Start date: 2012-11-26 | |
| ID: 1310617 · | |
|
I find these files in my BOINC area: | |
| ID: 1310629 · | |
So, are these files transferred for each request? And are they being transferred as-is or with some compression? For each request? Yes, one in each direction. Are they compressed? You would need to ask David Anderson, but I believe not. Remember to factor in the time overhead for compressing and decompressing. At the volunteer's end? It doesn't matter, we provide the hardware and the time. At the server end? Choose an algorithm which runs efficiently on Linux. It might make more sense to compress the request files than the reply files, given the sizes. | |
| ID: 1310634 · | |
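The trade-off raised above (bytes saved on the wire versus time spent compressing and decompressing) can be estimated with a rough sketch like the one below. The payload is a made-up stand-in, not a real sched_request file, and the function names and the 100 Mbps link speed are assumptions chosen for illustration. The transfer time ignores latency and protocol overhead, so treat the numbers as a first approximation only.

```python
import gzip
import time


def transfer_seconds(num_bytes, link_bits_per_second):
    """Idealised time to push num_bytes over a link (no latency, no protocol overhead)."""
    return num_bytes * 8 / link_bits_per_second


def compression_tradeoff(payload, link_bits_per_second):
    """Compare sending payload as-is against gzip + send + gunzip."""
    start = time.perf_counter()
    compressed = gzip.compress(payload)
    compress_s = time.perf_counter() - start

    start = time.perf_counter()
    restored = gzip.decompress(compressed)
    decompress_s = time.perf_counter() - start
    if restored != payload:
        raise ValueError("gzip round trip did not reproduce the payload")

    return {
        "raw_bytes": len(payload),
        "compressed_bytes": len(compressed),
        "raw_transfer_s": transfer_seconds(len(payload), link_bits_per_second),
        # CPU time at both ends is added to the shorter transfer time.
        "compressed_total_s": transfer_seconds(len(compressed), link_bits_per_second)
        + compress_s
        + decompress_s,
    }


# Hypothetical, highly repetitive XML-like text standing in for a request file.
sample = b"<result><name>hypothetical_task</name><state>5</state></result>\n" * 2000
for key, value in compression_tradeoff(sample, 100e6).items():
    print(key, value)
```

Running it against a real request or reply file (read in binary mode) instead of the synthetic sample would give a more meaningful ratio, since the synthetic text compresses unrealistically well.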
So, are these files transferred for each request? And are they being transferred as-is or with some compression? OK, so I'm not completely barking up the wrong tree.
Yes that's an obvious question to ask. At the moment "some" of us are more convinced that the problem is with network congestion rather than hardware response, so it might be that the reduction in network traffic outweighs the extra computation. Hard to say until you set up a test case. It might make more sense to compress the request files than the reply files, given the sizes. Undoubtedly, go for the lowest-hanging fruit first! Unfortunately the code I worked with for MICE was rather gnarly; I don't remember how much of that was to do with the libraries involved and how much was about the transfer protocols. I can probably recover it (it might even be Googlable[1] if anyone cares) if there's interest. [1] In fact this looks publicly available: http://indico.cern.ch/getFile.py/access?contribId=94&sessionId=6&resId=1&materialId=slides&confId=116711 Apologies if it isn't. ____________ | |
| ID: 1310641 · | |
So, are these files transferred for each request? And are they being transferred as-is or with some compression? Any change to compress sched_request files would have to be done in the client, so there would be a considerable delay. In addition, the upload side of the pipe isn't congested, though having fewer packets to ack through the download side could improve reliability. OTOH, sending sched_reply with gzip or deflate compression by using mod_gzip or mod_deflate on the scheduling server might be a very minimal change. Clients already advertise that they can accept that: 20/11/2012 00:05:54 | SETI@home | [http] [ID#1] Sent header to server: Accept-Encoding: deflate, gzip That's from an earlier post in this thread by Richard Haselgrove; the BOINC 6.10.58 I'm using also sends that header when contacting the Scheduler. An interesting post on Taka's RunLoop blog gives some idea of the time taken to compress a file versus how long it takes to transfer it on a 100 Mbps link. I think Synergy would take less time for compression than the Atom being used as a server there :^) Joe | |
| ID: 1310666 · | |
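To illustrate the server-side idea in the post above, here is a minimal sketch of content-coding negotiation: read the client's `Accept-Encoding` value and compress the reply body accordingly. This is not how mod_deflate or the BOINC scheduler is implemented; `parse_accept_encoding` and `encode_reply` are hypothetical helper names, and the parser deliberately ignores q-values for brevity.

```python
import gzip
import zlib
from typing import List, Tuple


def parse_accept_encoding(header_value: str) -> List[str]:
    """Return the codings named in an Accept-Encoding value, lower-cased.

    Simplification: q-values such as "gzip;q=0" are stripped and ignored.
    """
    codings = []
    for item in header_value.split(","):
        name = item.split(";")[0].strip().lower()
        if name:
            codings.append(name)
    return codings


def encode_reply(body: bytes, accept_encoding: str) -> Tuple[bytes, str]:
    """Pick a coding the client advertised and return (encoded body, coding name)."""
    accepted = parse_accept_encoding(accept_encoding)
    if "gzip" in accepted:
        return gzip.compress(body), "gzip"
    if "deflate" in accepted:
        # The HTTP "deflate" coding is the zlib container, which zlib.compress emits.
        return zlib.compress(body), "deflate"
    return body, "identity"


# Header value as quoted in the thread; the reply body is a made-up placeholder.
reply = b"<scheduler_reply>hypothetical content</scheduler_reply>\n" * 500
encoded, coding = encode_reply(reply, "deflate, gzip")
print(coding, len(reply), "->", len(encoded), "bytes")
```

A client that sends no `Accept-Encoding` header would get the body back unchanged, so older clients would keep working.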
If anyone wishes to use it you can find it here: Every user may have a different date format (I needed to change that part; my date format is 27.11.2012). Obviously this .bat is not very useful at the start of a month (it can't go back to the previous month), so use it now; it will be too late in December ;) Since I changed too many places, here is my edited .bat:
@ECHO OFF
cls
pushd %~dp0
set check_days=20
@ECHO %date%
::: --- Change the following lines to match your DATE format:
set cdy=%date:~0,2%
set cmo=%date:~3,2%
set cyr=%date:~6,4%
set edychk1=-
set edychk2=0
set Start_date=%cyr%-%cmo%-%cdy%
@ECHO Start date: %Start_date%
set Out_file=$com_check_full__%computername%__%Start_date%.txt
@ECHO Out file: %Out_file%
@ECHO.
pause
REM.> %Out_file%
findstr /C:"[SETI@home] Scheduler request failed: " stdoutdae.txt >$sched_failures-%computername%.txt
findstr /C:"[SETI@home] Scheduler request failed: Timeout was reached" stdoutdae.txt >$sched_failures_timeout-%computername%.txt
findstr /C:"[SETI@home] Scheduler request completed: " stdoutdae.txt >$sched_successes-%computername%.txt
:start
set /a lct=%lct%+1
findstr /C:"%cdy%-" $sched_failures-%computername%.txt >sched_failures-%computername%_%cyr%-%cmo%-%cdy%.txt
findstr /C:"%cdy%-" $sched_failures_timeout-%computername%.txt >sched_failures_timeout-%computername%_%cyr%-%cmo%-%cdy%.txt
findstr /C:"%cdy%-" $sched_successes-%computername%.txt >sched_successes-%computername%_%cyr%-%cmo%-%cdy%.txt
REM ---start output section
for /f %%a in ('Find /V /C "" ^< sched_failures-%computername%_%cyr%-%cmo%-%cdy%.txt') do set fails=%%a
for /f %%a in ('Find /V /C "" ^< sched_failures_timeout-%computername%_%cyr%-%cmo%-%cdy%.txt') do set time_fails=%%a
for /f %%a in ('Find /V /C "" ^< sched_successes-%computername%_%cyr%-%cmo%-%cdy%.txt') do set success=%%a
@ECHO Results for calendar date: %cyr%-%cmo%-%cdy%
@ECHO Results for calendar date: %cyr%-%cmo%-%cdy% >> %Out_file%
set /a schdreqcnt=%fails%+%success%
@ECHO Scheduler Requests: %schdreqcnt%
@ECHO Scheduler Requests: %schdreqcnt% >> %Out_file%
set /a schdsuccesspct=%success%*100/%schdreqcnt%
@ECHO Scheduler Success: %schdsuccesspct% %%
@ECHO Scheduler Success: %schdsuccesspct% %% >> %Out_file%
set /a schdfailspct=%fails%*100/%schdreqcnt%
@ECHO Scheduler Failure: %schdfailspct% %%
@ECHO Scheduler Failure: %schdfailspct% %% >> %Out_file%
set /a schdtofailspct=%time_fails%*100/%schdreqcnt%
@ECHO Scheduler Timeout: %schdtofailspct% %% of total
@ECHO Scheduler Timeout: %schdtofailspct% %% of total >> %Out_file%
REM Avoid a divide-by-zero error on days with no failures
set schdtmfailspct=0
if not %fails%==0 set /a schdtmfailspct=%time_fails%*100/%fails%
@ECHO Scheduler Timeout: %schdtmfailspct% %% of failures
@ECHO Scheduler Timeout: %schdtmfailspct% %% of failures >> %Out_file%
REM ---end output section
set /a cdy=%cdy%-1
if %cdy:~0,1%==%edychk1% @ECHO Date makes no sense. I Can't math! & goto end
if %cdy:~0,1%==%edychk2% @ECHO Date makes no sense. I Can't math! & goto end
if %lct%==%check_days% goto end
goto start
:end
@ECHO.
@ECHO End date: %cyr%-%cmo%-%cdy%
@ECHO.
@ECHO *** All the above text you can find now in file:
@ECHO %Out_file%
pause
del sched_*-%computername%_*.txt
%Out_file%
:pause
And my Results for 20 days:

| Calendar date | Scheduler Requests | Scheduler Success (%) | Scheduler Failure (%) | Scheduler Timeout (% of total) | Scheduler Timeout (% of failures) |
|---|---|---|---|---|---|
| 2012-11-27 | 52 | 67 | 32 | 0 | 0 |
| 2012-11-26 | 193 | 25 | 74 | 1 | 1 |
| 2012-11-25 | 117 | 17 | 82 | 1 | 2 |
| 2012-11-24 | 90 | 0 | 100 | 1 | 1 |
| 2012-11-23 | 78 | 44 | 55 | 0 | 0 |
| 2012-11-22 | 103 | 80 | 19 | 0 | 0 |
| 2012-11-21 | 134 | 45 | 54 | 13 | 24 |
| 2012-11-20 | 185 | 72 | 27 | 4 | 18 |
| 2012-11-19 | 29 | 82 | 17 | 6 | 40 |
| 2012-11-18 | 72 | 61 | 38 | 38 | 100 |
| 2012-11-17 | 25 | 48 | 52 | 48 | 92 |
| 2012-11-16 | 38 | 65 | 34 | 34 | 100 |
| 2012-11-15 | 39 | 61 | 38 | 33 | 86 |
| 2012-11-14 | 56 | 82 | 17 | 17 | 100 |
| 2012-11-13 | 53 | 77 | 22 | 9 | 41 |
| 2012-11-12 | 217 | 92 | 7 | 5 | 75 |
| 2012-11-11 | 35 | 54 | 45 | 8 | 18 |
| 2012-11-10 | 19 | 89 | 10 | 10 | 100 |
| 2012-11-9 | 41 | 70 | 29 | 19 | 66 |
| 2012-11-8 | 87 | 59 | 40 | 37 | 94 |

____________ - ALF - "Find out what you don't do well ..... then don't do it!" :) | |
| ID: 1310686 · | |
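For anyone who needs the same tally across a month boundary, which the batch file above cannot do, here is a rough Python sketch of the same counting. It assumes each line of stdoutdae.txt starts with a timestamp such as `27-Nov-2012 10:11:12`; that format, the function names and the sample lines in the usage note are assumptions, not taken from the thread. The search strings are the ones the batch file uses. Because it compares whole dates, a single-digit day cannot also match the 19th or 29th. Percentages use integer division like `set /a`, which is why success and failure can add up to 99 %: 35 successes and 17 failures out of 52 requests would give 67 % and 32 %, consistent with the 2012-11-27 row.

```python
from collections import Counter
from datetime import date, timedelta

MONTHS = {name: number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}

# Same search strings as the batch file.
FAILED = "[SETI@home] Scheduler request failed: "
TIMEOUT = "[SETI@home] Scheduler request failed: Timeout was reached"
COMPLETED = "[SETI@home] Scheduler request completed: "


def line_date(line):
    """Parse an assumed leading DD-Mon-YYYY stamp; return None if it is absent."""
    try:
        day, month, year = line.split(" ", 1)[0].split("-")
        return date(int(year), MONTHS[month], int(day))
    except (ValueError, KeyError):
        return None


def tally(lines):
    """Count successes, failures and timeouts per calendar date."""
    counts = {}
    for line in lines:
        stamp = line_date(line)
        if stamp is None:
            continue
        day_counts = counts.setdefault(stamp, Counter())
        if COMPLETED in line:
            day_counts["success"] += 1
        elif FAILED in line:
            day_counts["fails"] += 1
            if TIMEOUT in line:
                day_counts["timeouts"] += 1
    return counts


def pct(part, whole):
    """Truncating percentage, as `set /a` computes it; 0 when there is nothing to divide by."""
    return part * 100 // whole if whole else 0


def report(counts, start, days):
    """Rows of (date, requests, success %, failure %, timeout % of total, timeout % of failures)."""
    rows = []
    for offset in range(days):
        day = start - timedelta(days=offset)  # steps back across month boundaries
        c = counts.get(day, Counter())
        total = c["success"] + c["fails"]
        rows.append((day.isoformat(), total, pct(c["success"], total),
                     pct(c["fails"], total), pct(c["timeouts"], total),
                     pct(c["timeouts"], c["fails"])))
    return rows
```

Typical use would be `report(tally(open("stdoutdae.txt", errors="replace")), date.today(), 20)`, printing each row as needed.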
|
Okay, Tuesday maintenance has been done, Server Status shows almost all up and running. | |
| ID: 1310755 · | |
|
Server status page shows one of the MB splitters isn't running; however, the result creation rate is still only slightly more than 0. | |
| ID: 1310923 · | |
|
I guess shoving that wad of $20.00s under the lab door didn't hurt. :) | |
| ID: 1310959 · | |
It seems that Oscar is not splitting MB units now, and several channels have errors. Positive side: Cricket is not maxed out and Synergy responds really fast to scheduling requests :) ____________ | |
| ID: 1311100 · | |
Help to make the SETI dilithium crystals stronger! :) :) :) :) :) ____________ | |
| ID: 1311123 · | |
|
& we're back! Thanks guys. | |
| ID: 1311356 · | |
|
Yep, thanks guys, at least forums are up. | |
| ID: 1311359 · | |
|
Well this is exciting. I noticed the crickets were chirping again and allowed network comms again. Two failures were then followed by: | |
| ID: 1311391 · | |
|
Beta is down hard | |
| ID: 1311401 · | |
Beta is down hard yep, noticed that too... well, at least I had 400+ beta tasks to keep the main cruncher happy during this outage. As far as luck goes: I would run out of beta tasks by tomorrow morning. Don't bother abusing the retry button, as that won't help Eric glue the database together again ;-) ____________ get your bright shining star for just $10 | |
| ID: 1311404 · | |
Well this is exciting. I noticed the crickets were chirping again and allowed network comms again. Two failures were then followed by: Looks like my home machine managed to get allocated 97 GPU WUs -- but the response never got through so the ghost population just increased. Continue to get "Project communication failed": ____________ | |
| ID: 1311409 · | |
| Copyright © 2013 University of California |