The Saga Begins (LotsaCores 2.0)

Profile Zalster Special Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1815007 - Posted: 4 Sep 2016, 13:55:54 UTC - in response to Message 1815006.  

Sorry about that, Bill.

My big clumsy fingers late at night, lol


@Al...

If you are feeling brave, you can increase the sbs to 1024. That would be the max value.
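
For reference, a minimal sketch of where that setting typically lives, assuming the SoG app reads its extra options from an mb_cmdline*.txt file in the SETI@home project directory (the exact file name depends on the build, and some installs put the options in app_info.xml instead):

    -sbs 1024

That's just an illustration of the single option being discussed here; as noted above, 1024 is the maximum value.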

I'm out of here for a while Al.

Good Luck


Zalster
ID: 1815007 · Report as offensive
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13727
Credit: 208,696,464
RAC: 304
Australia
Message 1818407 - Posted: 20 Sep 2016, 8:52:39 UTC - in response to Message 1815007.  

Looks like the new system is about to overtake the original.
Grant
Darwin NT
ID: 1818407 · Report as offensive
Al Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Avatar

Send message
Joined: 3 Apr 99
Posts: 1682
Credit: 477,343,364
RAC: 482
United States
Message 1818442 - Posted: 20 Sep 2016, 12:20:10 UTC - in response to Message 1818407.  

It's kind of looking that way. I really haven't changed anything, but over the last week or so I've noticed that the original one's RAC has been dropping. I had been hoping that this would be the first machine I had that would go over 100k RAC, but after hitting around 97k it's been dropping, and according to the log on that machine it's now in the high 80s and still trending down. I don't understand why; maybe there's just a weird string of WUs that it's been crunching through. Actually, all of my machines have been dropping now that I've looked at them. Is this something that is happening across the board, or am I just the lucky one?

ID: 1818442 · Report as offensive
Profile betreger Project Donor
Avatar

Send message
Joined: 29 Jun 99
Posts: 11361
Credit: 29,581,041
RAC: 66
United States
Message 1818473 - Posted: 20 Sep 2016, 20:31:22 UTC - in response to Message 1818442.  

Creditscrew strikes again.
ID: 1818473 · Report as offensive
Profile Zalster Special Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1818475 - Posted: 20 Sep 2016, 20:40:18 UTC - in response to Message 1818442.  

It's kind of looking that way. I really haven't changed anything, but over the last week or so I've noticed that the original one's RAC has been dropping. I had been hoping that this would be the first machine I had that would go over 100k RAC, but after hitting around 97k it's been dropping, and according to the log on that machine it's now in the high 80s and still trending down. I don't understand why; maybe there's just a weird string of WUs that it's been crunching through. Actually, all of my machines have been dropping now that I've looked at them. Is this something that is happening across the board, or am I just the lucky one?


Yes, something is afoot. I've seen my own RAC dropping from around 115k down to the mid-80k range.
ID: 1818475 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer tester

Send message
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1818479 - Posted: 20 Sep 2016, 20:53:43 UTC - in response to Message 1818442.  

Actually, all of my machines have been dropping now that I've looked at them. Is this something that is happening across the board, or am I just the lucky one?

I think it may have been due to 75% of the pfb splitters getting stuck last week. That skewed the WUs heavily in favor of guppi VLARs, with very few Arecibo non-VLAR tasks getting sent out. Hopefully, with that splitter situation corrected on Sunday, and with APs currently flowing, things will get back to normal shortly (whatever "normal" is).
ID: 1818479 · Report as offensive
Al Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Avatar

Send message
Joined: 3 Apr 99
Posts: 1682
Credit: 477,343,364
RAC: 482
United States
Message 1818528 - Posted: 20 Sep 2016, 23:56:37 UTC - in response to Message 1818479.  

Ahh, ok, that might explain it. Thanks for the heads up; I couldn't figure out what might have been wrong. Hopefully it starts its march at least back to the mid 90s again.

ID: 1818528 · Report as offensive
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13727
Credit: 208,696,464
RAC: 304
Australia
Message 1818585 - Posted: 21 Sep 2016, 7:03:14 UTC - in response to Message 1818479.  

Actually, all of my machines have been dropping now that I've looked at them. Is this something that is happening across the board, or am I just the lucky one?

I think it may have been due to 75% of the pfb splitters getting stuck last week.

That was my thought as well; there was very little to almost no Arecibo work for a while there, which would result in a significant drop in Credit.
Grant
Darwin NT
ID: 1818585 · Report as offensive
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13727
Credit: 208,696,464
RAC: 304
Australia
Message 1819282 - Posted: 24 Sep 2016, 2:00:53 UTC - in response to Message 1818585.  

I've noticed a few more "Finish file present too long" errors over the last couple of days.
Are your system/data drive(s) SSDs or HDDs?

If you've got Process Explorer on that system, what does it show for CPU load on "Hardware interrupts and DPCs"?
Grant
Darwin NT
ID: 1819282 · Report as offensive
Al Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Avatar

Send message
Joined: 3 Apr 99
Posts: 1682
Credit: 477,343,364
RAC: 482
United States
Message 1819327 - Posted: 24 Sep 2016, 12:28:25 UTC - in response to Message 1819282.  
Last modified: 24 Sep 2016, 12:31:20 UTC

Just installed it; it seems to be running between 0.39 and 0.76. I checked the RAC situation just now, and it's dived down to around 78k, and the curve hasn't flattened yet. Here is one question I have: I am running Mr Kevvy's program on this machine, automatically every 30 minutes or so. If I remember correctly, this was sort of the anti-SoG setup, with more active management as opposed to the more set-and-forget approach after configuring with SoG?

When I was looking at the tasks running this morning, I noticed that 1) they were pretty much all guppis, which isn't surprising after reading about the issues we are having with getting good AR tasks right now, and 2) the CPU tasks are running on version 8.0 and the GPU tasks are running on 8.12, which says it is SoG. Is this an issue? I didn't think I had installed SoG on this machine, because I was going to try using the active management on this one and SoG on the 2.0, but here it is.

Might this be part of the problem, trying to actively manage the SoG client, hence causing at least some of the plummeting RAC? Not sure if it has any bearing at all, but I thought I'd toss it out there for discussion.

ID: 1819327 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer tester

Send message
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1819360 - Posted: 24 Sep 2016, 15:43:28 UTC - in response to Message 1819327.  

First of all, it's not really SoG you're trying to manage, but guppi VLARs and Arecibo non-VLARs. It doesn't matter if you're running Cuda or SoG on the GPUs.

Secondly, yes, the frequency of your rescheduling could have an impact, both on your RAC and on those "finish file present too long" errors you're getting. With so many concurrent tasks running on that box, every time BOINC is shut down during a rescheduling run you run the risk of catching one or more tasks in their shutdown phase, where the finish file has already been written by the app but BOINC hasn't yet checked for its existence. By the time BOINC comes back up, more than 10 seconds have passed and the error is generated.
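
As a rough illustration of that race (a sketch of the behaviour as described here, not BOINC's actual source; the marker file name and the 10-second limit are taken from this discussion):

    import os
    import time

    FINISH_FILE = "boinc_finish_called"   # marker the science app writes when it is done
    MAX_AGE_SECONDS = 10                  # grace period described above

    def finish_file_too_old(slot_dir):
        """True if the finish marker has been sitting in the slot directory
        longer than the grace period, i.e. the condition that produces the
        'finish file present too long' error."""
        path = os.path.join(slot_dir, FINISH_FILE)
        if not os.path.exists(path):
            return False
        return (time.time() - os.path.getmtime(path)) > MAX_AGE_SECONDS

If the client is down for longer than that window between the app writing the marker and the client checking it, the task gets errored out even though the work finished cleanly.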

Rescheduling every 30 minutes could also hurt your overall throughput if you still have checkpoints set to the default, which I think is 300 seconds. Any time tasks are restarted, they have to go back to the last checkpoint so, on average, you could be losing close to 2.5 minutes of processing time for each task that's restarted after a rescheduling run. That could be significant.

Personally, I would recommend a far longer interval between rescheduling runs and, if you haven't already made such a change, a shorter checkpoint interval. (But not too short, since there is some additional overhead incurred with each checkpoint. Mine is usually set in the 120 to 150 second range.)
ID: 1819360 · Report as offensive
Al Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Avatar

Send message
Joined: 3 Apr 99
Posts: 1682
Credit: 477,343,364
RAC: 482
United States
Message 1819382 - Posted: 24 Sep 2016, 16:38:55 UTC - in response to Message 1819360.  

Jeff, thanks for the advice. I'll try backing off the schedule to every hour and see what effect that has. Where does one go to shorten the checkpoint interval?

ID: 1819382 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer tester

Send message
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1819383 - Posted: 24 Sep 2016, 16:51:44 UTC - in response to Message 1819382.  

Where does one go to shorten the checkpoint interval?

In BOINC Manager > Options, on the Computing tab, all the way at the bottom there's an option to "Request tasks to checkpoint at most every ___ seconds."
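
For anyone who prefers editing files, the same preference can (as far as I know) be set via global_prefs_override.xml in the BOINC data directory; a minimal sketch, with 120 seconds as an example value, which the client picks up after Options > Read local prefs file (or boinccmd --read_global_prefs_override):

    <global_preferences>
        <disk_interval>120</disk_interval>
    </global_preferences>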
ID: 1819383 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1819387 - Posted: 24 Sep 2016, 17:51:21 UTC - in response to Message 1819382.  

Jeff has very accurately described the mechanism by which BOINC fails in its file management. I too learned that if you shut down the Manager right before a task finishes up, you risk the "finish file present too long" error message. I am very careful to check where all tasks are in their completion percentage before running the Qopt program. I also know that on the 1070, especially with SoG, the percentage to completion is not tracked accurately from about 90% onward; the actual percentage is likely already at 99% when the Manager indicates 90% or so. If I have any SoG tasks already past 90% completion, I let them finish up and check the log file to make sure they have uploaded completely before running Qopt. I haven't had the finish error since I changed my routines.
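
A rough sketch of how that manual check could be scripted, using boinccmd's task listing to skip a rescheduling run whenever anything is close to done (the 90% threshold mirrors the rule of thumb above, and run_rescheduler() is a hypothetical stand-in for whatever rescheduling tool is actually in use):

    import subprocess

    THRESHOLD = 0.90  # skip rescheduling if any task is past this fraction

    def any_task_near_completion():
        """Parse 'boinccmd --get_tasks' output and flag tasks close to finishing."""
        out = subprocess.run(["boinccmd", "--get_tasks"],
                             capture_output=True, text=True, check=True).stdout
        for line in out.splitlines():
            line = line.strip()
            if line.startswith("fraction done:"):
                if float(line.split(":", 1)[1]) >= THRESHOLD:
                    return True
        return False

    def run_rescheduler():
        pass  # hypothetical placeholder for the actual rescheduling script

    if __name__ == "__main__":
        if any_task_near_completion():
            print("Task(s) near completion; skipping this rescheduling run.")
        else:
            run_rescheduler()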
Seti@Home classic workunits: 20,676, CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1819387 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1819396 - Posted: 24 Sep 2016, 18:25:08 UTC - in response to Message 1819387.  
Last modified: 24 Sep 2016, 18:33:14 UTC

These 'odd things' come up from time to time, and in general they trace back to some poor design decisions. I'd be happy to detail those if requested; however, repeated attempts to submit fixes have been met with narcissistic crap, which I have no time for anymore.

[Edit:]
Example of thread safety: Request shutdown, worker shuts down, acknowledges shutdown, shutdown happens.

Example of crap: Just kill things
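
To make the contrast concrete, a minimal illustration in Python (not BOINC code, just the general pattern): the cooperative version asks the worker to stop and waits for the acknowledgement, while the "just kill things" version terminates it with no chance to finish a write.

    import threading
    import time

    stop_requested = threading.Event()   # "request shutdown"
    stopped = threading.Event()          # "worker acknowledges shutdown"

    def worker():
        while not stop_requested.is_set():
            time.sleep(0.1)               # ... do a unit of work, checkpoint, etc. ...
        stopped.set()                     # flush/finish cleanly, then acknowledge

    t = threading.Thread(target=worker)
    t.start()

    stop_requested.set()                  # request shutdown
    stopped.wait(timeout=5)               # wait for the acknowledgement
    t.join()

    # The "just kill things" alternative is terminating the process outright
    # (os.kill / TerminateProcess), which can interrupt it mid-write.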
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1819396 · Report as offensive
Al Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Avatar

Send message
Joined: 3 Apr 99
Posts: 1682
Credit: 477,343,364
RAC: 482
United States
Message 1819400 - Posted: 24 Sep 2016, 18:33:04 UTC - in response to Message 1819383.  

Just pulled it up, mine is set to 60 seconds. Too often?

ID: 1819400 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1819401 - Posted: 24 Sep 2016, 18:34:29 UTC - in response to Message 1819400.  

Just pulled it up, mine is set to 60 seconds. Too often?


Go with the rate of work you're willing to lose. I set mine to 1 hour (3600 seconds), though individual tasks will obviously write more often.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1819401 · Report as offensive
Al Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Avatar

Send message
Joined: 3 Apr 99
Posts: 1682
Credit: 477,343,364
RAC: 482
United States
Message 1819403 - Posted: 24 Sep 2016, 18:36:50 UTC - in response to Message 1819401.  
Last modified: 24 Sep 2016, 18:37:40 UTC

Jason, sorry, I don't understand the mechanism. Could you please explain the upsides/downsides of more/less frequent checkpointing? Thanks!

*edit* I mean, I don't want to lose anything, obviously, but I also want it to run as efficiently as possible.

ID: 1819403 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1819406 - Posted: 24 Sep 2016, 18:41:59 UTC - in response to Message 1819403.  
Last modified: 24 Sep 2016, 18:43:35 UTC

Jason, sorry, I don't understand the mechanism. Could you please explain the upsides/downsides of more/less frequent checkpointing? Thanks!

*edit* I mean, I don't want to lose anything, obviously, but I also want it to run as efficiently as possible.


Sure, Al.
Let's say all tasks were 2+ hours long and you had a power failure. The last time each task saved its state would be somewhere between when it started and the power failure; with a 60-second setting, or whatever you set, that last save would be at most that long before the failure. When the tasks resume, whatever was saved at that point lets them continue from there. If you set the interval too long and have failures, you potentially waste energy reprocessing work you already did. Too short, and you thrash your disks.
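
A back-of-the-envelope way to weigh that trade-off (a sketch with made-up numbers; the checkpoint write cost and the interruption rate are assumptions to replace with your own):

    # Rough model: total overhead = time spent writing checkpoints
    #                             + expected rework after interruptions.
    task_length_s = 2 * 3600        # a 2+ hour task, per the example above
    checkpoint_cost_s = 0.5         # assumed time to write one checkpoint
    interruptions_per_task = 0.1    # assumed restarts/power failures per task

    for interval_s in (60, 300, 3600):
        writes = task_length_s / interval_s
        write_overhead = writes * checkpoint_cost_s
        expected_rework = interruptions_per_task * interval_s / 2   # lose half an interval on average
        print(interval_s, "s interval:", round(write_overhead + expected_rework, 1), "s overhead")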
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1819406 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer tester

Send message
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1819419 - Posted: 24 Sep 2016, 19:36:14 UTC - in response to Message 1819400.  

Just pulled it up, mine is set to 60 seconds. Too often?

Considering how frequently you're running that rescheduler, 60 seconds is probably reasonable. Depending on the timing of each rescheduling run relative to the last checkpoint, each of your running tasks would lose between 0 and 59.999+ seconds of processing each time BOINC restarts, so on average probably about 30 seconds per task. I suspect that task housekeeping, during both the BOINC shutdown and restart, probably adds some to that, though I don't know how much.

The tradeoff, of course, is whatever overhead is involved with writing each checkpoint. I don't know what that amounts to, however.
ID: 1819419 · Report as offensive