Panic Mode On (101) Server Problems?


Previous · 1 . . . 5 · 6 · 7 · 8 · 9 · 10 · 11 . . . 27 · Next

ChrisD
Volunteer tester
Joined: 25 Sep 99
Posts: 158
Credit: 2,496,342
RAC: 0
Denmark
Message 1739655 - Posted: 4 Nov 2015, 16:50:33 UTC - in response to Message 1739649.  

LOL, how many Commodore Cassettes does it take to hold 50GB?


Anyone still have a Commodore to verify? :)

As far as I remember, data was written at 1200 baud and each block was written twice for error correction. (Maybe I'm wrong; it might have been only 600 baud. Anyone still have that manual?)

At 50 bytes/sec, a 60-minute cassette will hold about 175 kilobytes.

Where can we store 285,715 cassettes? :) :)
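The back-of-the-envelope figures above can be checked with a few lines of Python. This is my sketch, assuming the commonly quoted ~50 bytes/sec effective Datasette throughput from the post; none of the constants are verified against an actual manual.

```python
# Back-of-the-envelope check of the cassette math above.
# Assumption (not verified against a manual): 1200 baud with every block
# written twice gives an effective throughput of roughly 50 bytes/sec.

BYTES_PER_SEC = 50
TAPE_SECONDS = 60 * 60          # one 60-minute cassette

tape_bytes = BYTES_PER_SEC * TAPE_SECONDS   # 180,000 bytes, ~175 KiB
data_bytes = 50 * 10**9                     # 50 GB

# Ceiling division: you can't store data on a fraction of a cassette.
cassettes = -(-data_bytes // tape_bytes)

print(f"One cassette holds {tape_bytes:,} bytes")
print(f"Cassettes needed for 50 GB: {cassettes:,}")
```

With the exact 180,000-byte figure this gives 277,778 cassettes; rounding the capacity down to 175,000 bytes, as the post does, reproduces the quoted 285,715.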

ChrisD
ID: 1739655
WezH
Volunteer tester
Joined: 19 Aug 99
Posts: 576
Credit: 67,033,957
RAC: 95
Finland
Message 1739656 - Posted: 4 Nov 2015, 16:52:39 UTC - in response to Message 1739649.  

...that was the event I was thinking about - totally frustrating, but at least this time the splitters are running out of tapes very rapidly....


And they are out now.

And funny, we still speak about tapes; I doubt any of this data has ever been on tape :D


LOL, how many Commodore Cassettes does it take to hold 50GB?


Hmm...

A 90-minute tape (45 minutes per side) will hold on the order of 150 kilobytes per side if no compression or fast loader is used.


More than I can carry :D
ID: 1739656
Mike (Special Project $75 donor)
Volunteer tester
Joined: 17 Feb 01
Posts: 34258
Credit: 79,922,639
RAC: 80
Germany
Message 1739664 - Posted: 4 Nov 2015, 17:05:31 UTC - in response to Message 1739610.  
Last modified: 4 Nov 2015, 17:11:40 UTC

Just noticed one of my latest downloads has an estimated run time of 10,776 hours. It had been running for 12 minutes with only 0.001% completed.
I just suspended it, then resumed it after a couple of minutes; it restarted from scratch with a 5-minute estimated run time, got as far as 0.001% again, and time to completion froze while elapsed time continued to tick by.
I've exited BOINC & restarted; again the WU started from scratch and progress froze at 0.001%.

I aborted that WU, then picked up 2 more with 4min 09sec estimated run times that go high priority due to the short deadline date (11/11/2015).
Both of those WUs get to 0.001% & then stop progressing, even though the Elapsed time clock is running. Aborting them as well.


EDIT - those 2 problem WUs are:
14jl11ac.12197.15609.3.12.158_0
14jl11ac.12197.15609.3.12.156_1

I'm running one like that at the moment - 16ap11aa.20222.23380.9.12.20_1

Other notable features - very high CPU usage, explained by stderr.txt entry

Find triplets Cuda kernel encountered too many triplets, or bins above threshold, reprocessing this PoT on CPU...

Deadline is 11 Nov 2015, 2:22:06 UTC (7 days from issue) - that takes us back to very old processing parameters.


I noticed that also.
I'm still at work so can't look any closer atm.

We had such an issue at beta not that long ago.

All invalids have no autocorr section.


With each crime and every kindness we birth our future.
ID: 1739664
Mike (Special Project $75 donor)
Volunteer tester
Joined: 17 Feb 01
Posts: 34258
Credit: 79,922,639
RAC: 80
Germany
Message 1739669 - Posted: 4 Nov 2015, 17:51:40 UTC

This task has autocorr but also ran 100% on CPU:

http://setiathome.berkeley.edu/result.php?resultid=4496130911


With each crime and every kindness we birth our future.
ID: 1739669
Gene (Project Donor)
Joined: 26 Apr 99
Posts: 150
Credit: 48,393,279
RAC: 118
United States
Message 1739670 - Posted: 4 Nov 2015, 17:57:14 UTC
Last modified: 4 Nov 2015, 17:59:40 UTC

21no11aa batch of work throwing "triplets >30" kind of error.

So far this morning, I've had 12 tasks end with computation error. They are all from the 21no11aa.994.18891.5.12.xx batch. Doesn't seem to matter whether they're for CPU or GPU. I picked one GPU failed work unit and reran it (in a "benchmark" sandbox) as a CPU task; it ended with the same stderr failure.

There are 8 more of these in the work buffer. They only take a few seconds to exit so I'll just let them pass through in turn. Some wingmen I can find are also showing the same triplet count error, but the tasks are being reissued to reach a quorum.

/EDIT: The autocorr count is missing in the stderr result.
ID: 1739670
Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 1739673 - Posted: 4 Nov 2015, 18:17:44 UTC

Just so you know we're working on the splitter problem - a new bit of splitter code was put into play yesterday. It was working well enough in beta, but apparently it still wasn't ready for prime time. We have some debugging and cleaning up to do but we'll be back soon enough with more workunits....

- Matt
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 1739673
betreger (Project Donor)
Joined: 29 Jun 99
Posts: 11361
Credit: 29,581,041
RAC: 66
United States
Message 1739678 - Posted: 4 Nov 2015, 18:32:11 UTC - in response to Message 1739673.  

It was working well enough in beta, but apparently it still wasn't ready for prime time

That is why it is called the weekly outrage.
ID: 1739678
Zalster (Special Project $250 donor)
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1739679 - Posted: 4 Nov 2015, 18:33:18 UTC - in response to Message 1739678.  
Last modified: 4 Nov 2015, 18:33:25 UTC

Thanks Matt...
ID: 1739679
Darth Beaver (Crowdfunding Project Donor, Special Project $75 and $250 donor)
Joined: 20 Aug 99
Posts: 6728
Credit: 21,443,075
RAC: 3
Australia
Message 1739696 - Posted: 4 Nov 2015, 20:22:22 UTC

Hello Houston, are you there!! Houston, Ozzie 1 here, are you reading us!!! Houston, hello!! We're having difficulty reading you, Houston, are you there!!!!

As the crew start to panic: "What's happened down there? We haven't heard from them in hours," says one crew member. Another says, "Oh no, WW3 has started, that's why we can't hear them. Houston has been hit with a nuke, aaaaaaaaaaaaaaahhhhhhhhhhhhh, we're doomed!"

Hope you can get things sorted soon. I'm out of GPU work and I'll be out of CPU work in a few more hours. Anyway, fingers crossed you can fix the problem soon.
ID: 1739696
Wild6-NJ
Volunteer tester
Joined: 4 Aug 99
Posts: 43
Credit: 100,336,791
RAC: 140
Message 1739700 - Posted: 4 Nov 2015, 20:44:50 UTC - in response to Message 1739696.  
Last modified: 4 Nov 2015, 20:46:44 UTC

Hello Houston, are you there!! Houston, Ozzie 1 here, are you reading us!!! Houston, hello!! We're having difficulty reading you, Houston, are you there!!!!

As the crew start to panic: "What's happened down there? We haven't heard from them in hours," says one crew member. Another says, "Oh no, WW3 has started, that's why we can't hear them. Houston has been hit with a nuke, aaaaaaaaaaaaaaahhhhhhhhhhhhh, we're doomed!"

Hope you can get things sorted soon. I'm out of GPU work and I'll be out of CPU work in a few more hours. Anyway, fingers crossed you can fix the problem soon.




(Apologies to the vegans out there)
ID: 1739700
Dr Grey
Joined: 27 May 99
Posts: 154
Credit: 104,147,344
RAC: 21
United Kingdom
Message 1739710 - Posted: 4 Nov 2015, 21:23:16 UTC

ID: 1739710
Jeff Buck (Crowdfunding Project Donor, Special Project $75 and $250 donor)
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1739735 - Posted: 4 Nov 2015, 22:26:20 UTC - in response to Message 1739710.  

Still, can't say we didn't try

Yeah, that's the same thing that happened back in January. Even though the splitter gets fixed, if they don't do anything to block resends for those WUs, new tasks just keep getting created and sent back out again until the WU maxes out with 10 Invalids, doing nothing but wasting host resources along the way. Very irritating!

Back in January and February, I managed to abort most of the ones I received that I could identify. I'll probably start doing it again shortly with these. The thing is, that earlier batch all came from one original file, as I recall, whereas this time there seem to be multiple source files.
ID: 1739735
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1739737 - Posted: 4 Nov 2015, 22:38:35 UTC - in response to Message 1739735.  

Woke up this morning to find 55 Invalids. I also notice that the Server Status shows 5 splitters running, but the Splitter Status shows only 3, all working on the one last file.

Work in progress has dropped by around 1 million.
It's going to take a very long time to recover from this outage once the splitters are sorted out.
Grant
Darwin NT
ID: 1739737
Richard Haselgrove (Project Donor)
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1739738 - Posted: 4 Nov 2015, 22:40:29 UTC

I got my first batch from the new tape about half an hour ago. All shorties, and with the new reduced file size (which I suspect is deliberate - it doesn't seem to be a problem by itself). But at least the v7 processing seems to be working properly for this batch.

I've also seen a couple of changes made to the splitters, to make it less likely they'll lose their configuration data, and to shut them down automatically if it all goes wrong. Time will tell.
ID: 1739738
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1739750 - Posted: 4 Nov 2015, 23:35:48 UTC - in response to Message 1739738.  
Last modified: 4 Nov 2015, 23:36:27 UTC

More odd WUs.

26oc11ac.1324.4157.12.12.89_1
26oc11ac.1324.4157.12.12.107_1
26oc11ac.1324.4157.12.12.65_0
26oc11ac.1324.4157.12.12.59_0
26oc11ac.1324.4157.12.12.71_1
26oc11ac.1324.4157.12.12.77_1
26oc11ac.1324.4157.12.12.83_1
26oc11ac.1324.4157.12.12.234_1
26oc11ac.1324.4157.12.12.113_1
26oc11ac.1324.4157.12.12.240_1
26oc11ac.1324.4157.12.12.47_0
26oc11ac.1324.4157.12.12.41_1

All shorties.
They start running and % Progress counts up until they get to about 5%, then it resets to zero. Elapsed time continues to run while Progress just sits at 0%.

Aborted all.
Grant
Darwin NT
ID: 1739750
Richard Haselgrove (Project Donor)
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1739757 - Posted: 4 Nov 2015, 23:56:09 UTC - in response to Message 1739750.  
Last modified: 5 Nov 2015, 0:02:06 UTC

At least you let them run long enough to display

Find triplets Cuda kernel encountered too many triplets, or bins above threshold, reprocessing this PoT on CPU...

in stderr.txt

I gave Jason one of those (run to completion, so we could be sure it wasn't the "too many triplets" half of that information message), but he hasn't commented on the alternative threshold levels yet.

Reporting pseudo-progress until the first checkpoint is standard for your v7.6.6 client. It's supposed to reassure you that something is happening.

Edit: 26oc11ac? That tape was split some 14 hours ago, while you were asleep. Since then, we've had several hours without work, and now new tapes with new splitters. I'll reserve judgement until the morning.
ID: 1739757
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1739758 - Posted: 5 Nov 2015, 0:10:10 UTC - in response to Message 1739757.  

Reporting pseudo-progress until the first checkpoint is standard for your v7.6.6 client. It's supposed to reassure you that something is happening.

It certainly does.
Watching time ticking away with no progress being made is... unsettling.
Especially since the day before I had a WU that ran for 30min with progress stuck at 0.001% and the estimated run time had climbed to 10,776 hours.

WUs in question,
14jl11ac.12197.15609.3.12.158_0
14jl11ac.12197.15609.3.12.156_1


The next WUs I got ran OK, but were very, very, very short.

08ap11ae.30787.24607.9.12.242_1
08ap11ae.30787.24607.9.12.88_1
08ap11ae.30787.24607.9.12.248_0
2 to GPU, 1 to CPU.

GPU estimated run times were under 3 min, took 1:43 (usual shorty estimate 12min)
CPU estimated run time about 35min, 10% done in 3 min (usual shorty estimate 1hr 40m). Completed OK.
Grant
Darwin NT
ID: 1739758
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1739772 - Posted: 5 Nov 2015, 0:59:26 UTC - in response to Message 1739758.  
Last modified: 5 Nov 2015, 1:00:59 UTC

Another couple of odd WUs.
16se11ab.25031.20517.5.12.238_3
23oc11ah.12765.24804.6.12.19_3

GPU WUs; I don't know what the estimated run times were, but they completed in just over 3 min 30 s. The usual time to completion for GPU shorties is 13-16 min.
The result of no autocorrelation?
Even so, before autocorrelation was introduced, shortie WUs (running 2 at a time) took way longer than 3 min 30 s to process.


Should be out of GPU work on this system in the next 30min or so.
Grant
Darwin NT
ID: 1739772
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1739776 - Posted: 5 Nov 2015, 1:27:07 UTC - in response to Message 1739772.  
Last modified: 5 Nov 2015, 1:31:18 UTC

Just noticed some more anomalies, these ones VLARs.
Usual VLAR runtime on this system is 4-4.5 hrs; the estimated run times for these are 1 hr 50 min to 2 hr 2 min.

Would normally take most of the day to get to them.
Will suspend other work & see how they go.

16oc11aa.20967.14169.7.12.90.vlar_0
29ap11ad2518.14791.8.12.129.vlar_2
21no11aa.31868.24617.13.12.60.vlar_2

EDIT-
All of the 16oc11aa WUs ran for 4 secs & then finished.
Same with the 26oc11ac & 26ap11ab WUs.
One 21no11aa completed after 2 min 20 s; the other WUs are still running.




And another of the running-but-no-progress WUs: 13 min and counting, 0.000% done, estimated time remaining 23,664 hrs & climbing. Aborted.
26ap11ab.9472.85225.14.12.42_2
Grant
Darwin NT
ID: 1739776
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1739815 - Posted: 5 Nov 2015, 5:12:33 UTC - in response to Message 1739776.  

Got to love the perversity of chance.

I've got 2 systems, a Core 2 Duo & an i7. Naturally the i7 can do a lot more work than the C2D.
With the present lack of work, the C2D gets work every 45min or so. The i7, every 2 (or more) hours.
Grant
Darwin NT
ID: 1739815
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.