Panic Mode On (111) Server Problems?

Message boards : Number crunching : Panic Mode On (111) Server Problems?

Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34308
Credit: 79,922,639
RAC: 80
Germany
Message 1929814 - Posted: 13 Apr 2018, 21:21:58 UTC - in response to Message 1929770.  
Last modified: 13 Apr 2018, 21:22:35 UTC

OK, first results are in. Arecibo VLAR on NVidia:

GTX 670, up from <10 min to <20 min
GTX 750Ti, up from <12.5 min to <25 min
GTX 470, up from 12 min to <24 min

So I agree with the general proposition that these tasks run roughly twice as long as BLC VLARs - which is roughly what they do on CPUs, too. I think that they have quietly and gradually tweaked the processing parameters of BLC tasks - people have commented on the different runtimes of different batches - so that BLC VLARs run for roughly the same time as mid-AR Arecibos. And it's worked - y'all haven't been frightened away yet. I wonder what this new psychological experiment will do to the crunchership?

Usability - the 670 is definitely showing screen lag. It can be used, but there's a distinct feeling of "There's something not quite right about my computer this morning". That's the biggest risk to the project: if many people get that feeling, and if they track it down to the true cause (most won't), they might switch off BOINC and their other projects too. That would be a shame.

The 750Ti shows no lag at all :-). That may be because the monitor is connected to the GTX 970 in the same box, and I don't waste a 970 on SETI, while GPUGrid/CUDA80 has work. Which it does this morning.

The GTX 470 is, indeed, a Fermi. Specifically, it's the Fermi I drove across town to collect - and paid £239.99 plus sales tax for - on or around 14 May 2010, because no-one would answer my question (Beta message 39386). It's now sitting in my dual CPU Dell Precision Workstation, to replace the Quadro 1500 (one step below CUDA - I should have waited for the 1700). It's also driving 2520x1200 pixels of dual monitors. It doesn't feel too bad as I type and it crunches, but certain screen redraws are very slow, and I'm used to this old machine - now running 32-bit Windows 7 in 4 GB of quad-channel RAM - feeling clunky by modern standards.

Yes, I'm running r3584 SoG. When running usability tests, remember that Raistmer switched round the processing order compared with CUDA. With VLAR on CUDA, the slow, clunky, bits come at the beginning, which is a real turn off: with OpenCL/SoG, they come towards the end, and the progress %age rate slows to a crawl (not that progress is accurately assessed by any SETI application).


Increasing -period_iterations_num would certainly fix those screen lags, but to be honest most questions I get are about how to speed up crunching, not how to reduce lag.
Default is 50, but maybe it should be increased to 80 or more for slower NV cards (Fermi).
I made this suggestion when I tested VLARs on AMD GPUs approx. 3 years ago.
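
For anyone wanting to try this, here is a minimal sketch of building the option string. The flag name and the values 50 and 80 come from the post above. The "flag value" spacing, the helper name build_cmdline and the sanity check are assumptions for illustration only, so check your app's own documentation for the exact syntax and for where the string has to go.

```python
DEFAULT_PERIOD_ITERATIONS = 50  # default value as stated in the post above


def build_cmdline(period_iterations_num=DEFAULT_PERIOD_ITERATIONS):
    """Return an option string for the SoG app (hypothetical helper).

    Assumes the flag is written as '-period_iterations_num <integer>'.
    """
    if not isinstance(period_iterations_num, int) or period_iterations_num < 1:
        raise ValueError("period_iterations_num must be a positive integer")
    return f"-period_iterations_num {period_iterations_num}"


# Suggested starting point for slower NV cards (Fermi), per the post above.
fermi_cmdline = build_cmdline(80)
print(fermi_cmdline)
```

Treat 80 as a starting point rather than a tuned value; the post only says "80 or more".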


With each crime and every kindness we birth our future.
ID: 1929814
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1929824 - Posted: 13 Apr 2018, 22:15:33 UTC - in response to Message 1929795.  

OK, thanks for the update Richard on how the Fermi card fared. I see it didn't bring the system to its knees as predicted. Was that with just the stock SoG tuning the project applies? Or have you given it a custom tuning tweak with a -use_sleep, -period_iterations_num or -sbs adjustment to alleviate any lagginess?
I honestly don't know, though I suppose I could look it up if you twisted my arm. I really don't like the idea of deploying Raistmer's SoG app as a stock 'wallop it out to the set-and-forget punters that don't even understand English', when you need a brain the size of a planet just to understand the command line. It would be much, much better for general project use (not the 0.01% who read and post in this thread) if the app was made intelligently self-tuning (and if it wasn't written in OpenCL, thus taking away CPU resources from other worthwhile scientific research). But I know I'm preaching to the wrong audience here.

I probably picked one of Mike's pre-cooked suggested lines and dropped it somewhere, about two years ago. The Fermi is in host 2901600 - you can probably pick the details out of one of Raistmer's humongous stderr_txt files.

Edit - downstairs again. Looks like I didn't bother. This is the machine I used to build Lunatics installers on, so I've got every conceivable app available, but I expect them to work as supplied (and that would be what I was looking at while testing), not taking them dirt-track racing.


. . Hi Richard,

. . I annoyed Raistmer with heaps of messages when SoG was first released and I was feeling my way with it, and I know he put a lot of work into making it as idiot friendly as possible. It works very well with the defaults set into it. No need to "tune" unless you are determined to squeeze the absolute MOST out of your cruncher, in which case it allows you a large scope for doing just that. I think it has a little self tuning built in, in terms of -sbs and -period_iterations_num, to suit the more extreme end of hardware available, but very few ppl seem to have any problems running the stock version; v8.20 and v8.22 seem pretty bulletproof. Running it straight out of the box is pretty successful across the board.

. . Of course Keith and others do like to play with that very complicated command line, but it only needs to have three or four parameters tweaked to get near to the best from your GPU. But I think Keith was wondering if you had to play with those commands to get it to behave on the GTX470. It seems the answer is no, the defaults are doing OK.

Stephen

:)
ID: 1929824
Profile Unixchick Project Donor
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 1929843 - Posted: 14 Apr 2018, 0:27:45 UTC - in response to Message 1929811.  

Thanks Tbar and Keith for taking the time to look and answer. I'll keep just using the CPUs to crunch and forget about using the GPU, as it doesn't look like it will add much.
ID: 1929843
Profile petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1929878 - Posted: 14 Apr 2018, 6:20:03 UTC - in response to Message 1929732.  

Volta does Arecibo tasks in 75 seconds and a 1080 does them in 150 seconds. Nice!

Half the time, similar clock speeds?


1080@1974MHz GPU / 10124MHz mem and
VOLTA@1575MHz GPU / 1944MHz mem
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1929878
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13797
Credit: 208,696,464
RAC: 304
Australia
Message 1929881 - Posted: 14 Apr 2018, 7:08:50 UTC - in response to Message 1929878.  

Volta does Arecibo tasks in 75 seconds and a 1080 does them in 150 seconds. Nice!

Half the time, similar clock speeds?


1080@1974MHz GPU / 10124MHz mem and
VOLTA@1575MHz GPU / 1944MHz mem

Interesting.
Lower clock for the GPU, higher clock for the memory.
So how much of the improvement would you put down to architecture, or is it the greater memory bandwidth that's most responsible?
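
A rough way to frame that question is to normalise the runtimes by GPU clock. The sketch below uses only the figures quoted above. It knows nothing about core counts or memory bus width, neither of which is given in this thread, so it cannot separate architecture from memory bandwidth, and the two memory clocks are not directly comparable without the bus widths. Variable names are made up for illustration.

```python
# Runtimes and GPU clocks as quoted in the post above.
gtx1080 = {"seconds_per_task": 150, "gpu_mhz": 1974}
volta = {"seconds_per_task": 75, "gpu_mhz": 1575}

# Raw speedup: how many times faster Volta finishes one Arecibo task.
raw_speedup = gtx1080["seconds_per_task"] / volta["seconds_per_task"]

# Volta runs at a lower GPU clock, so per clock tick it does even more.
clock_ratio = volta["gpu_mhz"] / gtx1080["gpu_mhz"]
per_clock_speedup = raw_speedup / clock_ratio

print(f"raw speedup:       {raw_speedup:.2f}x")
print(f"clock ratio:       {clock_ratio:.2f}")
print(f"per-clock speedup: {per_clock_speedup:.2f}x")  # roughly 2.5x
```

So on these numbers Volta does about two and a half times the work per GPU clock, whatever mix of architecture and bandwidth is behind it.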
Grant
Darwin NT
ID: 1929881
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1930012 - Posted: 14 Apr 2018, 23:11:30 UTC - in response to Message 1929991.  

Well, sending these new VLARs to Nvidia GPUs has decreased the "Results received in last hour" figure from ≈ 130,000 to ≈ 100,000.
Not good at all for the total crunching of tasks.

But then, on the other hand, since there are millions and millions of results from years and years of crunching that have not been analyzed yet, it really doesn't matter when it comes to the question of whether we may already have found ET.


. . But there are no more complaints of Nvidia Q's being starved of WUs. And there are many who consider the lower results-returned-per-hour number a good thing for the operation of the splitters.

. . And judging by the number of VLAR tasks I have in all Q's, there are a heck of a lot of them, so it definitely needed to be done. I am also seeing a lot of Arecibo VLAR resends already. Not sure whether that indicates a lot of volunteers with slower machines aborting them from their caches or slower machines biting the big one and dropping out. Hopefully, if there are very many of the second case, some will be sorted and return to crunching.

Stephen

:)
ID: 1930012
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1930017 - Posted: 14 Apr 2018, 23:37:08 UTC

I am also seeing a lot of Arecibo VLAR resends already


I'm not. I have maybe 1 or 2 resends on each machine's 400 or 500 tasks.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1930017
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1930023 - Posted: 14 Apr 2018, 23:53:55 UTC - in response to Message 1930016.  
Last modified: 14 Apr 2018, 23:55:55 UTC

Well, Arecibo will not last for long, if I interpret things right. So, in the longer run, whether or not Arecibo VLARs go to Nvidia will make no big difference in the coming months/years.


. . I have the philosophy "it's not over until it's over". If and when Arecibo ceases to provide data for crunching that is what will be then, but until then Arecibo is still a great source of material. And, as someone else remarked, the VLAR scans will increase signal sensitivity and may be where we will find the really WOW signal.

Stephen

:)
ID: 1930023
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1930025 - Posted: 14 Apr 2018, 23:55:07 UTC - in response to Message 1930017.  

I am also seeing a lot of Arecibo VLAR resends already


I'm not. I have maybe 1 or 2 resends on each machine's 400 or 500 tasks.


. . Maybe I just got a couple of batches in a random strike.

Stephen

??
ID: 1930025
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1930029 - Posted: 15 Apr 2018, 0:13:01 UTC - in response to Message 1930027.  


But no matter which telescope, we will never find the WOW signal until they analyze the results we sent back. That is not happening now.
Nebula is just spinning its wheels, just as the Near Time Persistency Checker did, before it died a painful death. We just have too much data
to analyze for anything the project has to do it with.
They're drowning in the data we send back, and they don't have a clue how to deal with it.

I wouldn't say Nebula is spinning its wheels. It is trying multiple configurations to find out how best to analyze the data. The first part of an experiment involves identifying the test parameters and then structuring the experiment so it produces the expected theoretical result. This is the state Nebula is in now. Once they nail down just how to analyze the data, they can proceed to the actual analysis. That brings up the next hurdle, I believe: where to get the processing horsepower to actually run through the datasets. Quantum computing, anyone?
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1930029
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13797
Credit: 208,696,464
RAC: 304
Australia
Message 1930033 - Posted: 15 Apr 2018, 0:26:03 UTC - in response to Message 1930017.  

I am also seeing a lot of Arecibo VLAR resends already

I'm not. I have maybe 1 or 2 resends on each machine's 400 or 500 tasks.

More Scheduler weirdness.
One of my systems has got mostly Arecibo VLAR resends. The other picked up mostly GBT & AP resends. Overall, at this stage, Arecibo VLAR resends still aren't a significant proportion of the total number of resends.
Grant
Darwin NT
ID: 1930033
Profile Zalster Special Project $250 donor
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1930037 - Posted: 15 Apr 2018, 0:32:49 UTC

I actually don't mind the Arecibo vlars. Keeps the GPUs busy.
ID: 1930037
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1930039 - Posted: 15 Apr 2018, 0:44:35 UTC - in response to Message 1930027.  


But no matter which telescope, we will never find the WOW signal until they analyze the results we sent back. That is not happening now.
Nebula is just spinning its wheels, just as the Near Time Persistency Checker did, before it died a painful death. We just have too much data
to analyze for anything the project has to throw at it.
They're drowning in the data we send back, and they don't have a clue how to deal with it.


. . Well maybe they need to network that processing as well? Problems provoke solutions ... I guess Einstein cannot donate enough computer time for Nebula to make as much progress as we need.

Stephen

? ?
ID: 1930039
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1930041 - Posted: 15 Apr 2018, 0:49:08 UTC - in response to Message 1930033.  

I am also seeing a lot of Arecibo VLAR resends already

I'm not. I have maybe 1 or 2 resends on each machine's 400 or 500 tasks.

More Scheduler weirdness.
One of my systems has got mostly Arecibo VLAR resends. The other picked up mostly GBT & AP resends. Overall, at this stage, Arecibo VLAR resends still aren't a significant proportion of the total number of resends.


. . Out of Bertie's 220 tasks 20 or 30 are Arecibo VLAR resends. Not a huge number but enough that I consider it significant.

Stephen

? > ?
ID: 1930041
Profile Wiggo
Joined: 24 Jan 00
Posts: 35567
Credit: 261,360,520
RAC: 489
Australia
Message 1930047 - Posted: 15 Apr 2018, 1:07:08 UTC

. . Out of Bertie's 220 tasks 20 or 30 are Arecibo VLAR resends. Not a huge number but enough that I consider it significant.

Stephen
Did you check out what was the cause of those resends?

Cheers.
ID: 1930047
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1930049 - Posted: 15 Apr 2018, 1:09:29 UTC - in response to Message 1930037.  

I actually don't mind the Arecibo vlars. Keeps the GPUs busy.

Data is Data, I don't care which with my hardware.

I have mostly been getting nothing but Arecibo shorties on the fastest systems today.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1930049
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1930066 - Posted: 15 Apr 2018, 6:10:17 UTC - in response to Message 1930047.  

. . Out of Bertie's 220 tasks 20 or 30 are Arecibo VLAR resends. Not a huge number but enough that I consider it significant.

Stephen
Did you check out what was the cause of those resends?

Cheers.


. . The only way I know to do that is from stderr.txt when the task is complete. By then they are hard to track (unless I write down each task number :( )

Stephen

?
ID: 1930066
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1930067 - Posted: 15 Apr 2018, 6:12:10 UTC - in response to Message 1930049.  
Last modified: 15 Apr 2018, 6:14:31 UTC

I actually don't mind the Arecibo vlars. Keeps the GPUs busy.

Data is Data, I don't care which with my hardware.

I have mostly been getting nothing but Arecibo shorties on the fastest systems today.

. . Bertie has nearly 50% Arecibo VLARs, the other machines are running between 30 and 50 %.

. . But no one is complaining that the schedulers are not sending them work, so I am calling that successful :)

Stephen

. .
ID: 1930067
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1930070 - Posted: 15 Apr 2018, 7:20:35 UTC - in response to Message 1930047.  

Did you check out what was the cause of those resends?
Cheers.


. . OK, I scanned through my valids and found only half a dozen resends that had been validated. Four were inconclusives where I was the decider (they all validated for all users), one was a compute error on an iGPU on an i3 rig, and the last was from a delinquent host trashing every task, both CPU and GPU ...

http://setiathome.berkeley.edu/show_host_detail.php?hostid=8359266

. . So nothing so far indicating any problem with nvidia cards.

Stephen

:(
ID: 1930070
Ghia
Joined: 7 Feb 17
Posts: 238
Credit: 28,911,438
RAC: 50
Norway
Message 1930081 - Posted: 15 Apr 2018, 9:41:36 UTC

Talking about weird hosts, how can a status get to look like this?
State: All (9551) · In progress (9259) · Validation pending (0) · Validation inconclusive (0) · Valid (0) · Invalid (0) · Error (292)
Application: All (9578) · AstroPulse v7 (27) · SETI@home v8 (9551)
Humans may rule the world...but bacteria run it...
ID: 1930081

 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.