Panic Mode On (101) Server Problems?

Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1737241 - Posted: 26 Oct 2015, 18:24:10 UTC

Funny thing.. that's what I had originally postulated in my post, but then I decided to go a different route. I had originally typed up "maybe they doubled the RTS to better deal with the spate of shorties from these 2011 tapes?"
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving up)
ID: 1737241 · Report as offensive
Swibby Bear

Joined: 1 Aug 01
Posts: 246
Credit: 7,945,093
RAC: 0
United States
Message 1737286 - Posted: 26 Oct 2015, 20:38:10 UTC - in response to Message 1737241.  

Ohhhh, Cosmo, I like your hypothesis of shorties. Simple, obvious, and elegant idea. The thought behind my idea of a long-running operation was some shot at starting up Green Bank processing. Just guessing (hoping), y'know.
ID: 1737286 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1737443 - Posted: 27 Oct 2015, 7:14:34 UTC - in response to Message 1737233.  

Or maybe someone did increase the RTS buffer to about 600K? Just my guess...

We wait and see.

Just got back from holidays, and usually I'm worried about a lack of a ready-to-send buffer. I get back and find it way over 300,000 which usually indicates problems of another sort.
But it appears it's sitting around the 600,000 mark now; so as WezH has suggested, maybe they've bumped it up in anticipation of a longer than usual weekly outage?
Grant
Darwin NT
ID: 1737443 · Report as offensive
JaundicedEye
Joined: 14 Mar 12
Posts: 5375
Credit: 30,870,693
RAC: 1
United States
Message 1737550 - Posted: 27 Oct 2015, 21:14:03 UTC

Replica 19,492 behind master....hopefully just a hangover from today's outrage?

"Sour Grapes make a bitter Whine." <(0)>
ID: 1737550 · Report as offensive
OTS
Volunteer tester

Joined: 6 Jan 08
Posts: 369
Credit: 20,533,537
RAC: 0
United States
Message 1737574 - Posted: 27 Oct 2015, 22:23:20 UTC

And losing ground. Now 23,689 seconds behind :(.
ID: 1737574 · Report as offensive
Darth Beaver
Joined: 20 Aug 99
Posts: 6728
Credit: 21,443,075
RAC: 3
Australia
Message 1737585 - Posted: 27 Oct 2015, 23:01:34 UTC

Thanks guys. I was going to say "Houston, you have a problem", however the first 2 posts at the top of the thread say:

Replica 19,492 behind master....hopefully just a hangover from today's outrage?


And losing ground. Now 23,689 seconds behind :(.


which answers the question:

Why are the servers reporting that I have more units in progress than I actually have?

Answer: the replica database is 24,800-plus seconds behind :-)
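
For a sense of scale, the replica lag figures quoted in this thread can be converted from seconds to hours and minutes. A minimal Python sketch; the helper name `format_lag` is made up for illustration and is not part of any BOINC or SETI@home tool.

```python
def format_lag(seconds: int) -> str:
    """Render a replica lag given in seconds as 'Hh MMm SSs'."""
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"


# Lag values reported in the posts above, in seconds.
for lag in (19_492, 23_689, 24_800):
    print(f"{lag:,} s = {format_lag(lag)}")
```

So the replica was somewhere between roughly five and a half and seven hours behind the master, which is why in-progress counts read from the replica looked stale.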
ID: 1737585 · Report as offensive
Ulrich Metzner
Volunteer tester
Joined: 3 Jul 02
Posts: 1256
Credit: 13,565,513
RAC: 13
Germany
Message 1737588 - Posted: 27 Oct 2015, 23:04:51 UTC

@23:00 UTC:

Replica seconds behind master 7,772 sec
Aloha, Uli

ID: 1737588 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1737600 - Posted: 28 Oct 2015, 0:07:22 UTC - in response to Message 1737588.  
Last modified: 28 Oct 2015, 0:07:50 UTC

Looks like it was an average outage, and the ready-to-send buffer is above 300,000 and the splitters are running.
So it looks like it has been reset to 600,000 or so.
Grant
Darwin NT
ID: 1737600 · Report as offensive
Sleepy
Volunteer tester
Joined: 21 May 99
Posts: 219
Credit: 98,947,784
RAC: 28,360
Italy
Message 1739020 - Posted: 2 Nov 2015, 10:47:21 UTC - in response to Message 1737600.  

Also, it seems that the splitters are running at a higher rate than before. Now, even when they reach 600,000 RTS, they very seldom throttle down to low splitting rates; as far as I can see, they almost always split at >30 MB WUs/s.

I am mostly an AP-only cruncher at the moment, but those asking for MBs, who in the past would often run dry or understocked, are probably better supplied now, since the average splitting rate seems higher and RTS is stable at the new 600,000 level, not sky-rocketing.

Therefore we should probably see an increase in processed MB units (which is also good for AP, since this way the MB queues finish quicker and the AP reloading cycle is faster).

Am I wrong?

Happy crunching!

Sleepy
ID: 1739020 · Report as offensive
Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1739035 - Posted: 2 Nov 2015, 12:54:47 UTC

Rolling forward my estimate of how much MB work has been processed with the SaH v7 app (including autocorrelations), and how much might conceivably remain to be reprocessed.

The 'TOTAL' column is what I've posted in the data distribution thread, including new tapes received since the start of 2015 (nothing new arrived this month). The v7 column is what I've done since the SaH v7 launch date (June 2013), and the v6 only column is the difference between the two. Looks like we've processed roughly 350 tapes since I last drew up this table two months ago - see PMO 100 for the earlier figures.

At that rate - and assuming we're not diverted by the arrival of new tapes - we should have finished reprocessing 2011 by mid-February. No, I don't know whether we'll go back to take a second look at the rest of 2010 then.

Current figures:
Recorded      TOTAL   Processed with SaH v7      Processed with SaH v6 only
                      (since launch June 2013)   (derived)

2007            350                          4                          346
2008            916                        874                           42
2009            548                        456                           92
2010            762                        138                          624
2011           1138                        558                          580
2012            846                        819                           27
2013            590                        585                            5
2014            260                        260                          n/a
2015            164                        164                          n/a

Grand total    5574                       3858                         1716
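
The mid-February estimate can be reproduced with some rough arithmetic. This is only a sketch: it assumes the rate of roughly 350 tapes per two months holds, that all of it goes to the 580 outstanding 2011 tapes, and that a month averages about 30.44 days. The variable names are illustrative.

```python
from datetime import date, timedelta

# Figures taken from the table above (tape counts per recording year).
total = {2007: 350, 2008: 916, 2009: 548, 2010: 762, 2011: 1138, 2012: 846, 2013: 590}
v7 = {2007: 4, 2008: 874, 2009: 456, 2010: 138, 2011: 558, 2012: 819, 2013: 585}

# "v6 only" is derived as TOTAL minus "processed with v7".
v6_only = {year: total[year] - v7[year] for year in total}

# Assumption: ~350 tapes per two months, all spent on the remaining 2011 tapes.
tapes_per_month = 350 / 2
months_needed = v6_only[2011] / tapes_per_month

# Assumption: an average month is about 30.44 days.
posted = date(2015, 11, 2)
estimate = posted + timedelta(days=round(months_needed * 30.44))

print(f"2011 tapes left:  {v6_only[2011]}")
print(f"months needed:    {months_needed:.1f}")
print(f"estimated finish: {estimate.isoformat()}")
```

Under those assumptions the finish date lands in the second week of February 2016, consistent with the "mid-February" figure.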
ID: 1739035 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1739184 - Posted: 2 Nov 2015, 21:38:38 UTC - in response to Message 1739020.  

Also, it seems that the splitters are running at a higher rate than before. Now, even when they reach 600,000 RTS, they very seldom throttle down to low splitting rates; as far as I can see, they almost always split at >30 MB WUs/s.

I am mostly an AP-only cruncher at the moment, but those asking for MBs, who in the past would often run dry or understocked, are probably better supplied now, since the average splitting rate seems higher and RTS is stable at the new 600,000 level, not sky-rocketing.

Therefore we should probably see an increase in processed MB units (which is also good for AP, since this way the MB queues finish quicker and the AP reloading cycle is faster).

There are a lot of shorties about at the moment, and not much AP.
The end result is that the number of MB WUs returned per hour, which is usually around the 85,000 mark, has been around or above 100,000/hr for almost a week now.
The fact that the splitters have been able to keep up is a good sign. Either they've been able to sort out the random splitter slowdowns, or it's sorted itself out. Either way, it's been a long time since there has been a load such as this, and the splitters have been able to keep up. A very long time.
Grant
Darwin NT
ID: 1739184 · Report as offensive
Brent Norman
Volunteer tester
Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1739423 - Posted: 3 Nov 2015, 22:44:41 UTC

Something happened during maintenance, all current files have errors on them :(
ID: 1739423 · Report as offensive
Akio
Joined: 18 May 11
Posts: 375
Credit: 32,129,242
RAC: 0
United States
Message 1739450 - Posted: 4 Nov 2015, 0:47:58 UTC - in response to Message 1739423.  
Last modified: 4 Nov 2015, 0:56:31 UTC

I'm not getting any new tasks :/

[EDIT: Did a reboot and got new tasks. Sorry about that.]
ID: 1739450 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1739459 - Posted: 4 Nov 2015, 1:28:41 UTC - in response to Message 1739423.  
Last modified: 4 Nov 2015, 1:35:36 UTC

Something happened during maintenance, all current files have errors on them :(

Hmmm.

Just noticed one of my latest downloads has an estimated run time of 10,776 hours. It had been running for 12 minutes with only 0.001% completed.
I suspended it, then resumed it after a couple of minutes. It restarted from scratch with a 5 minute estimated run time, got as far as 0.001% again, and then the time to completion froze while the elapsed time continued to tick by.
I've exited BOINC & restarted, and again the WU starts from scratch and progress freezes at 0.001%.

I aborted that WU, then picked up 2 more with 4min 09sec estimated run times that go high priority due to the short deadline date (11/11/2015).
Both of those WUs get to 0.001% & then stop progressing, even though the Elapsed time clock is running. Aborting them as well.


EDIT- those 2 problem WUs are:
14jl11ac.12197.15609.3.12.158_0
14jl11ac.12197.15609.3.12.156_1


Next WUs downloaded,
08ap11ae.30787.24607.9.12.242_1
08ap11ae.30787.24607.9.12.88_1
08ap11ae.30787.24607.9.12.248_0
2 to GPU, 1 to CPU.

GPU estimated run times were under 3 min, took 1:43.
CPU estimated run time about 35min, 10% done in 3 min.
Grant
Darwin NT
ID: 1739459 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1739471 - Posted: 4 Nov 2015, 2:54:23 UTC - in response to Message 1739459.  

Lots of channels ended in error, some of the work going out un-crunchable, and splitter output down in the dumps again.
Grant
Darwin NT
ID: 1739471 · Report as offensive
Jimbocous
Volunteer tester
Posts: 1849
Credit: 268,616,081
RAC: 1,349
United States
Message 1739481 - Posted: 4 Nov 2015, 4:33:04 UTC - in response to Message 1739471.  
Last modified: 4 Nov 2015, 4:36:21 UTC

Lots of channels ended in error, some of the work going out un-crunchable, and splitter output down in the dumps again.

And those 600k WUs are gone, just like that.
Sure wish there was a way to share WUs between one's own machines ...
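
A back-of-the-envelope sketch of why a 600,000 buffer can empty so quickly. It assumes work goes out at about the returned-per-hour figures quoted in this thread (100,000 to 120,000/hr), which is only an approximation of demand, and that splitter output is close to zero. `hours_until_empty` is an illustrative name, not a server-side function.

```python
def hours_until_empty(buffer_wus: float, demand_per_hour: float,
                      split_per_hour: float = 0.0) -> float:
    """Hours until the ready-to-send buffer drains, or inf if the splitters keep up."""
    net_drain = demand_per_hour - split_per_hour
    if net_drain <= 0:
        return float("inf")
    return buffer_wus / net_drain


# Assumed demand levels, taken from the returned-per-hour figures in the thread.
for demand in (100_000, 120_000):
    print(f"{demand:,}/hr -> {hours_until_empty(600_000, demand):.1f} h")
```

With no splitter output, that is about 5 to 6 hours of work, so a stalled splitter after the outage is enough to explain the buffer vanishing.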
ID: 1739481 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1739514 - Posted: 4 Nov 2015, 7:28:56 UTC - in response to Message 1739481.  

6 PFB splitters show as running (usually it's 7), but on the Splitter Status it shows only 4. And splitter output is still way down.
Something certainly got tangled up during the outage.
Grant
Darwin NT
ID: 1739514 · Report as offensive
rob smith
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22160
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1739535 - Posted: 4 Nov 2015, 9:17:57 UTC

Not too worried about the splitter speed - they are behaving pretty much as they do after any outrage, just with a bigger cache to play with.

What I am concerned about is the way the validators are behaving just now - quite a number of tasks are going straight to "invalid" without any obvious reason. This may be something to do with the channels containing errors, or is it something wrong with the validators????
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1739535 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1739544 - Posted: 4 Nov 2015, 9:50:00 UTC - in response to Message 1739535.  
Last modified: 4 Nov 2015, 9:50:29 UTC

Not too worried about the splitter speed - they are behaving pretty much as they do after any outrage, just with a bigger cache to play with.

Usually after an outage they get up to speed reasonably quickly, they just can't meet the initial demand.
This time around they haven't even gotten close to getting up to speed, so even less chance of them meeting the demand.

What I am concerned about is the way the validators are behaving just now - quite a number of tasks are going straight to "invalid" without any obvious reason. This may be something to do with the channels containing errors, or is it something wrong with the validators????

I suspect something isn't right.
Before & after the outage the number returned per hour was around 100,000. For the last 4-5 hours it's been steadily rising & now it's up to 120,000/hr.

Looking at my results there are 5 errors (odd WUs I aborted, one by accident).
There's 1 Invalid, and I don't know why.


21no11aa.14796.23491.4.12.104_0

<core_client_version>7.6.6</core_client_version>
<![CDATA[
<stderr_txt>

Build features: SETI7 Non-graphics FFTW USE_AVX x64
CPUID: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz

Cache: L1=64K L2=256K

CPU features: FPU TSC PAE CMPXCHG8B APIC SYSENTER MTRR CMOV/CCMP MMX FXSAVE/FXRSTOR SSE SSE2 HT SSE3 SSSE3 SSE4.1 SSE4.2 AVX
ar=1.132150 NumCfft=50749 NumGauss=0 NumPulse=28186176308 NumTriplet=28186176308
In v_BaseLineSmooth: NumDataPoints=1048576, BoxCarLength=8192, NumPointsInChunk=32768

Windows optimized S@H v7 application
Based on Intel, Core 2-optimized v8-nographics V5.13 by Alex Kan
AVXxjf Win64 Build 2549 , Ported by : Raistmer, JDWhale

SETI7 update by Raistmer
Work Unit Info:
...............
Credit multiplier is : 2.85
WU true angle range is : 1.132150
Spike: peak=25.56701, time=57.04, d_freq=1421012325.12, chirp=-1.4788, fft_len=64k

Best spike: peak=25.56701, time=57.04, d_freq=1421012325.12, chirp=-1.4788, fft_len=64k
Best gaussian: peak=0, mean=0, ChiSq=0, time=-2.122e+011, d_freq=0,
score=-12, null_hyp=0, chirp=0, fft_len=0
Best pulse: peak=6.667967, time=80.66, period=1.073, d_freq=1421012063.24, score=0.8988, chirp=42.86, fft_len=64
Best triplet: peak=0, time=-2.122e+011, period=0, d_freq=0, chirp=0, fft_len=0


Flopcounter: 5608098081970.542969

Spike count: 1
Pulse count: 0
Triplet count: 0
Gaussian count: 0
Wallclock time elapsed since last restart: 1658.9 seconds

14:39:23 (3620): called boinc_finish

</stderr_txt>
]]>
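
When comparing an invalid result against its wingmates, it can help to pull the reported signal counts out of the stderr text. A minimal sketch, assuming the "Spike count: N" style lines shown above; `signal_counts` is a hypothetical helper, not part of BOINC or the SETI@home applications.

```python
import re

# Matches lines such as "Spike count: 1" in the stderr text above.
COUNT_LINE = re.compile(
    r"^(Spike|Pulse|Triplet|Gaussian) count:[ \t]*(\d+)[ \t]*$",
    re.MULTILINE,
)


def signal_counts(stderr_text: str) -> dict[str, int]:
    """Return the per-type signal counts reported at the end of a task's stderr."""
    return {kind: int(n) for kind, n in COUNT_LINE.findall(stderr_text)}


# Excerpt copied from the stderr output above.
sample = """\
Flopcounter: 5608098081970.542969

Spike count: 1
Pulse count: 0
Triplet count: 0
Gaussian count: 0
"""

print(signal_counts(sample))
```

Running the same helper over each wingmate's stderr makes it easy to see whether the hosts reported different signals, or the same signals that were still marked invalid.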
Grant
Darwin NT
ID: 1739544 · Report as offensive
Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1739550 - Posted: 4 Nov 2015, 10:14:37 UTC

The significant thing about the invalid tasks seems to be that they are invalid for all application versions - e.g. my WU 1954045249. That certainly points at either a data error or a validator error.

I've got to go out soon, so I'm setting NNT on all hosts, until I can look in more detail later in the day, and perhaps see what the lab staff say when they get in.
ID: 1739550 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.