Panic Mode On (101) Server Problems?
Author | Message |
---|---|
Cosmic_Ocean Joined: 23 Dec 00 Posts: 3027 Credit: 13,516,867 RAC: 13 |
Funny thing... that's what I had originally postulated in my post, but then I decided to go a different route. I had originally typed up "maybe they doubled the RTS to better deal with the spate of shorties from these 2011 tapes?" Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving up) |
Swibby Bear Joined: 1 Aug 01 Posts: 246 Credit: 7,945,093 RAC: 0 |
Ohhhh, Cosmo, I like your hypothesis of shorties. Simple, obvious, and elegant idea. The thought behind my idea of a long-running operation was some shot at starting up Green Bank processing. Just guessing (hoping), y'know. |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13720 Credit: 208,696,464 RAC: 304 |
"Or maybe someone did increase the RTS buffer to about 600K? Just my guess..." Just got back from holidays, and usually I'm worried about a lack of a ready-to-send buffer. I get back and find it way over 300,000, which usually indicates problems of another sort. But it appears it's sitting around the 600,000 mark now, so as WezH has suggested, maybe they've bumped it up in anticipation of a longer-than-usual weekly outage? Grant Darwin NT |
JaundicedEye Joined: 14 Mar 12 Posts: 5375 Credit: 30,870,693 RAC: 1 |
Replica 19,492 behind master....hopefully just a hangover from today's outrage? "Sour Grapes make a bitter Whine." <(0)> |
OTS Joined: 6 Jan 08 Posts: 369 Credit: 20,533,537 RAC: 0 |
And losing ground. Now 23,689 seconds behind :(. |
Darth Beaver Joined: 20 Aug 99 Posts: 6728 Credit: 21,443,075 RAC: 3 |
Thanks guys. I was going to say "Houston, you have a problem", but the first two posts at the top of the thread ("Replica 19,492 behind master....hopefully just a hangover from today's outrage?" and "And losing ground. Now 23,689 seconds behind :(.") answer the question of why the servers are reporting that I have more units in progress than I actually do. Answer: the replica database is 24,800-plus seconds behind :-) |
Ulrich Metzner Joined: 3 Jul 02 Posts: 1256 Credit: 13,565,513 RAC: 13 |
@23:00 UTC: Replica seconds behind master 7,772 sec Aloha, Uli |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13720 Credit: 208,696,464 RAC: 304 |
Looks like it was an average outage: the ready-to-send buffer is above 300,000 and the splitters are running. So it looks like it has been reset to 600,000 or so. Grant Darwin NT |
Sleepy Joined: 21 May 99 Posts: 219 Credit: 98,947,784 RAC: 28,360 |
Also, it seems that the splitters are running at a higher rate than before. Now, even when they reach 600,000 RTS, they very seldom throttle down to low splitting rates; as far as I can see, they almost always split at >30 MB WUs/s. I am mostly an AP-only cruncher at the moment, but those asking for MB work would probably have run dry or been understocked many times in the past. Now the average splitting rate seems higher and RTS is stable at the new 600,000 level, not sky-rocketing. So we should probably see an increase in processed MB units (which is also good for AP, since MB queues end quicker this way and the AP reloading cycle is faster). Am I wrong? Happy crunching! Sleepy |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14649 Credit: 200,643,578 RAC: 874 |
Rolling forward my estimate of how much MB work has been processed with the SaH v7 app (including autocorrelations), and how much might conceivably remain to be reprocessed. The 'TOTAL' column is what I've posted in the data distribution thread, including new tapes received since the start of 2015 (nothing new arrived this month). The v7 column is what I've done since the SaH v7 launch date (June 2013), and the v6 only column is the difference between the two. Looks like we've processed roughly 350 tapes since I last drew up this table two months ago - see PMO 100 for the earlier figures. At that rate - and assuming we're not diverted by the arrival of new tapes - we should have finished reprocessing 2011 by mid-February. No, I don't know whether we'll go back to take a second look at the rest of 2010 then. Current figures, listed as Recorded: TOTAL / Processed with SaH v7 (since launch June 2013) / Processed with SaH v6 only (derived). 2007: 350 / 4 / 346; 2008: 916 / 874 / 42; 2009: 548 / 456 / 92; 2010: 762 / 138 / 624; 2011: 1138 / 558 / 580; 2012: 846 / 819 / 27; 2013: 590 / 585 / 5; 2014: 260 / 260 / n/a; 2015: 164 / 164 / n/a; Grand total: 5574 / 3858 / 1716 |
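The "derived" column and the mid-February estimate in the post above can be checked with a few lines of arithmetic. This is only a minimal sketch, not anything from the project itself: the variable names are made up for illustration, and it assumes (as the estimate implicitly does) that the recent rate of roughly 350 tapes per two months holds steady and goes entirely towards the 2011 backlog.

```python
# Tape counts copied from the figures in the post:
# year recorded -> (TOTAL, processed with SaH v7)
tapes = {
    2007: (350, 4),
    2008: (916, 874),
    2009: (548, 456),
    2010: (762, 138),
    2011: (1138, 558),
    2012: (846, 819),
    2013: (590, 585),
    2014: (260, 260),
    2015: (164, 164),
}

# "v6 only" is derived as TOTAL minus v7; the post shows n/a for 2014 and
# 2015, which comes out as 0 here.
v6_only = {year: total - v7 for year, (total, v7) in tapes.items()}

grand_total = sum(total for total, _ in tapes.values())
grand_v7 = sum(v7 for _, v7 in tapes.values())
grand_v6_only = sum(v6_only.values())

# Assumption: about 350 tapes per two months, all spent on 2011 tapes.
tapes_per_month = 350 / 2
months_to_finish_2011 = v6_only[2011] / tapes_per_month

print(grand_total, grand_v7, grand_v6_only)
print(f"2011 backlog: {v6_only[2011]} tapes, "
      f"about {months_to_finish_2011:.1f} months at the assumed rate")
```

On those assumptions the 580 remaining 2011 tapes take a little over three months, and the column sums agree with the grand totals quoted in the post.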
Grant (SSSF) Joined: 19 Aug 99 Posts: 13720 Credit: 208,696,464 RAC: 304 |
"Also it seems that splitters are running at a higher rate than before. Now, even when they reach 600,000 RTS they very seldom throttle down to low splitting rates, but they almost always, as far as I can see, split at >30 MB WUs/s." There are a lot of shorties about at the moment, and not much AP. The end result is that the number of MB WUs returned per hour, which is usually around the 85,000 mark, has been around or above 100,000/hr for almost a week now. The fact that the splitters have been able to keep up is a good sign. Either they've been able to sort out the random splitter slowdowns, or it's sorted itself out. Either way, it's been a long time since there has been a load such as this, and the splitters have been able to keep up. A very long time. Grant Darwin NT |
Brent Norman Joined: 1 Dec 99 Posts: 2786 Credit: 685,657,289 RAC: 835 |
Something happened during maintenance, all current files have errors on them :( |
Akio Joined: 18 May 11 Posts: 375 Credit: 32,129,242 RAC: 0 |
I'm not getting any new tasks :/ [EDIT: Did a reboot and got new tasks. Sorry about that.] |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13720 Credit: 208,696,464 RAC: 304 |
"Something happened during maintenance, all current files have errors on them :(" Hmmm. Just noticed one of my latest downloads has an estimated run time of 10,776 hours. It had been running for 12 minutes with only 0.001% completed. I suspended it, then resumed it after a couple of minutes, and it restarted from scratch with a 5 minute estimated run time. It gets as far as 0.001% again, and time to completion freezes as elapsed time continues to tick by. I've exited BOINC & restarted, and again the WU starts from scratch and progress freezes at 0.001%. I aborted that WU, then picked up 2 more with 4min 09sec estimated run times that go high priority due to the short deadline date (11/11/2015). Both of those WUs get to 0.001% & then stop progressing, even though the Elapsed time clock is running. Aborting them as well. EDIT: those 2 problem WUs are 14jl11ac.12197.15609.3.12.158_0 and 14jl11ac.12197.15609.3.12.156_1. Next WUs downloaded: 08ap11ae.30787.24607.9.12.242_1, 08ap11ae.30787.24607.9.12.88_1 and 08ap11ae.30787.24607.9.12.248_0. 2 to GPU, 1 to CPU. GPU estimated run times were under 3 min, took 1:43. CPU estimated run time about 35min, 10% done in 3 min. Grant Darwin NT |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13720 Credit: 208,696,464 RAC: 304 |
Lots of channels ended in error, some of the work going out un-crunchable, and splitter output down in the dumps again. Grant Darwin NT |
Jimbocous Joined: 1 Apr 13 Posts: 1849 Credit: 268,616,081 RAC: 1,349 |
"Lots of channels ended in error, some of the work going out un-crunchable, and splitter output down in the dumps again." And those 600k WUs are gone, just like that. Sure wish there was a way to share WUs between one's own machines ... |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13720 Credit: 208,696,464 RAC: 304 |
6 PFB splitters show as running (usually it's 7), but on the Splitter Status it shows only 4. And splitter output is still way down. Something certainly got tangled up during the outage. Grant Darwin NT |
rob smith Joined: 7 Mar 03 Posts: 22160 Credit: 416,307,556 RAC: 380 |
Not too worried about the splitter speed - they are behaving pretty much as they do after any outrage, just with a bigger cache to play with. What I am concerned about is the way the validators are behaving just now - quite a number of tasks are going straight to "invalid" without any obvious reason. This may be something to do with the channels containing errors, or is it something wrong with the validators???? Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13720 Credit: 208,696,464 RAC: 304 |
"Not too worried about the splitter speed - they are behaving pretty much as they do after any outrage, just with a bigger cache to play with." Usually after an outage they get up to speed reasonably quickly, they just can't meet the initial demand. This time around they haven't even gotten close to getting up to speed, so even less chance of them meeting the demand. "What I am concerned about is the way the validators are behaving just now - quite a number of tasks are going straight to "invalid" without any obvious reason. This may be something to do with the channels containing errors, or is it something wrong with the validators????" I suspect something isn't right. Before & after the outage the returned per hour was around 100,000. For the last 4-5 hours it's been steadily rising & now it's up to 120,000/hr. Looking at my results there are 5 errors (odd WUs I aborted, one by accident). There's 1 Invalid, and I don't know why. 21no11aa.14796.23491.4.12.104_0
<core_client_version>7.6.6</core_client_version>
<![CDATA[
<stderr_txt>
Build features: SETI7 Non-graphics FFTW USE_AVX x64
CPUID: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
Cache: L1=64K L2=256K
CPU features: FPU TSC PAE CMPXCHG8B APIC SYSENTER MTRR CMOV/CCMP MMX FXSAVE/FXRSTOR SSE SSE2 HT SSE3 SSSE3 SSE4.1 SSE4.2 AVX
ar=1.132150 NumCfft=50749 NumGauss=0 NumPulse=28186176308 NumTriplet=28186176308
In v_BaseLineSmooth: NumDataPoints=1048576, BoxCarLength=8192, NumPointsInChunk=32768
Windows optimized S@H v7 application
Based on Intel, Core 2-optimized v8-nographics V5.13 by Alex Kan
AVXxjf Win64 Build 2549 , Ported by : Raistmer, JDWhale
SETI7 update by Raistmer
Work Unit Info:
...............
Credit multiplier is : 2.85
WU true angle range is : 1.132150
Spike: peak=25.56701, time=57.04, d_freq=1421012325.12, chirp=-1.4788, fft_len=64k
Best spike: peak=25.56701, time=57.04, d_freq=1421012325.12, chirp=-1.4788, fft_len=64k
Best gaussian: peak=0, mean=0, ChiSq=0, time=-2.122e+011, d_freq=0, score=-12, null_hyp=0, chirp=0, fft_len=0
Best pulse: peak=6.667967, time=80.66, period=1.073, d_freq=1421012063.24, score=0.8988, chirp=42.86, fft_len=64
Best triplet: peak=0, time=-2.122e+011, period=0, d_freq=0, chirp=0, fft_len=0
Flopcounter: 5608098081970.542969
Spike count: 1
Pulse count: 0
Triplet count: 0
Gaussian count: 0
Wallclock time elapsed since last restart: 1658.9 seconds
14:39:23 (3620): called boinc_finish
</stderr_txt>
]]>
Grant Darwin NT |
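For anyone comparing several results by hand, the signal counts at the end of a pasted stderr like the one above can be pulled out with a short script. This is a rough sketch only: the pattern is based purely on the excerpt quoted in this post, the function name is hypothetical, and it says nothing about why the validator marked the task invalid.

```python
import re

# Excerpt copied from the tail of the stderr pasted in the post above.
STDERR_EXCERPT = (
    "Flopcounter: 5608098081970.542969 Spike count: 1 Pulse count: 0 "
    "Triplet count: 0 Gaussian count: 0 "
    "Wallclock time elapsed since last restart: 1658.9 seconds"
)

# Matches e.g. "Spike count: 1"; assumes the labels appear exactly as above.
COUNT_RE = re.compile(r"\b(Spike|Pulse|Triplet|Gaussian) count:\s*(\d+)")


def signal_counts(stderr_text: str) -> dict:
    """Return the reported count for each signal type found in the text."""
    return {name: int(value) for name, value in COUNT_RE.findall(stderr_text)}


counts = signal_counts(STDERR_EXCERPT)
total_signals = sum(counts.values())
print(counts, total_signals)
```

Because the pattern only looks for the label followed by a number, it works whether the stderr text is on one line or split across several.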
Richard Haselgrove Joined: 4 Jul 99 Posts: 14649 Credit: 200,643,578 RAC: 874 |
The significant thing about the invalid tasks seems to be that they are invalid for all application versions - e.g. my WU 1954045249. That certainly points at either a data error or a validator error. I've got to go out soon, so I'm setting NNT on all hosts, until I can look in more detail later in the day, and perhaps see what the lab staff say when they get in. |