Setting up Linux to crunch CUDA90 and above for Windows users

juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2031589 - Posted: 9 Feb 2020, 14:41:14 UTC
Last modified: 9 Feb 2020, 14:43:06 UTC

FYI We are working on it. Just the GPU part (on Linux special sauce only). The CPU will remain untouched.

Hope we have news in the following weeks. Keep the fingers crossed.
ID: 2031589 · Report as offensive     Reply Quote
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2031607 - Posted: 9 Feb 2020, 18:04:57 UTC - in response to Message 2031589.  

The Check-pointing, and other items have been on the list for Years. Good Luck with that.
Freaking Hilarious some think this hasn't been thought of before.
ID: 2031607 · Report as offensive     Reply Quote
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2031618 - Posted: 9 Feb 2020, 19:03:47 UTC - in response to Message 2031607.  

I gather you have attempted to find the checkpointing code in the science app before and were unsuccessful.

Maybe Juan will be luckier.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2031618 · Report as offensive     Reply Quote
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2031625 - Posted: 9 Feb 2020, 19:29:47 UTC - in response to Message 2031618.  

Yep, Good Luck with that. Go back a few years on this board and you will find a post by Petri about removing the Checkpoint. A few people here were strongly against it; seems they should have taken him up on it back then. The Main problem with Bad Best Pulses on Shorties was around back when Jason was here, and how long ago was that? Both those items were at the Top of last year's to-do list. Instead we got 0.99, which fixed nothing and made the code Linux only. The problem with missed pulses after boot has been narrowed down to the type of memory on the card, seems it disappears when you go from a 1070 to a 1080Ti. Unfortunately, most here don't have 1080s, and everything lower has the problem. At least time should fix that one, when people stop using GDDR5 GPUs.
ID: 2031625 · Report as offensive     Reply Quote
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2031632 - Posted: 9 Feb 2020, 19:39:39 UTC - in response to Message 2031625.  
Last modified: 9 Feb 2020, 19:41:45 UTC

How exactly did V0.99 make the app “linux only”? The only thing that was added was the mutex. And you don’t even have to use it if you don’t want to. When not running 2 WUs at a time, the code is functionally identical to V0.98.

Not that anyone was working on porting V0.98 to Windows anyway. So even without adding mutex, V0.98 would have still been linux only. The mutex bit isn’t the limiting factor for porting to Windows, and I would imagine the same thing can be implemented there. The limiting factor is lack of developer talent for CUDA on Windows to do what petri has done in Linux.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2031632 · Report as offensive     Reply Quote
Oddbjornik Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 15 May 99
Posts: 220
Credit: 349,610,548
RAC: 1,728
Norway
Message 2031639 - Posted: 9 Feb 2020, 20:01:25 UTC - in response to Message 2031632.  

How exactly did V0.99 make the app “linux only”?
I think TBar is referring to the Apple family of operating systems, which I gather is a legitimate complaint.
ID: 2031639 · Report as offensive     Reply Quote
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2031640 - Posted: 9 Feb 2020, 20:07:25 UTC - in response to Message 2031632.  
Last modified: 9 Feb 2020, 20:08:00 UTC

The Mutex makes System calls which are not on the UNIX based Mac. If it doesn't work on another UNIX machine, the chances of it working on Windows are extremely Slim.
If you have modern Storage, the Mutex is useless. Definitely not worth making the code Linux only.
ID: 2031640 · Report as offensive     Reply Quote
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14680
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2031641 - Posted: 9 Feb 2020, 20:17:38 UTC - in response to Message 2031625.  

The problem with missed pulses after boot has been narrowed down to the type of memory on the card, seems it disappears when you go from a 1070 to a 1080Ti.
That sounds like a highly dubious claim. I can only assume that it implies a race condition: different kernels finishing at different times, data not being available at times when the code assumes that it is ready. That's a synchronisation problem which it is the duty of the programmer to solve.
ID: 2031641 · Report as offensive     Reply Quote
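To illustrate the kind of synchronisation problem described in the post above, here is a minimal host-side C++ analogy. It is not taken from the special app: `launch_sum` and `sum_when_ready` are hypothetical names, and a `std::async` task stands in for a GPU kernel. The only point is that the consumer calls `get()`, which blocks until the producer has finished, before it touches the result. In CUDA code the corresponding wait would be a call such as `cudaStreamSynchronize` or `cudaDeviceSynchronize`.

```cpp
#include <future>
#include <numeric>
#include <vector>

// Hypothetical stand-in for a GPU kernel: sums the samples on another thread.
// The caller must keep `samples` alive until the returned future is consumed,
// because the worker only holds a reference to it.
std::future<long> launch_sum(const std::vector<int>& samples) {
    return std::async(std::launch::async, [&samples] {
        return std::accumulate(samples.begin(), samples.end(), 0L);
    });
}

// Consumer that explicitly waits for the producer before using its value.
long sum_when_ready(const std::vector<int>& samples) {
    std::future<long> pending = launch_sum(samples);
    // get() blocks until the worker has finished, so the value read here is
    // always complete. Reading a shared buffer without such a wait is the
    // race: the outcome would depend on which side happens to run first.
    return pending.get();
}
```

Calling `sum_when_ready` on a vector holding 1, 2, 3, 4 returns 10 regardless of how the worker thread is scheduled, which is the property a missing wait would break.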
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2031642 - Posted: 9 Feb 2020, 20:19:34 UTC

So far my search of the message base for "checkpoint" didn't turn up anything useful. Most posts found for "checkpoint" were writing about checkpointing in context for SSD/hard drive longevity.

The only useful post I found from Petri in discussion with Raistmer was about the resorting of found pulses after stopping for a checkpoint. That devolved into the discussion of noisy work units and whether they were useful in the first place and the back and forth of whether to count 30 noisy triplets first or 30 noisy pulses first.

I'll keep looking.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2031642 · Report as offensive     Reply Quote
Oddbjornik Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 15 May 99
Posts: 220
Credit: 349,610,548
RAC: 1,728
Norway
Message 2031645 - Posted: 9 Feb 2020, 20:29:13 UTC - in response to Message 2031642.  

I'll keep looking.
From a quick glance at the code and one of my state files, I suspect that you'll have to write to the state file as you go along, because it is the data in the state file that will eventually be returned as the result.
So what you should probably do instead is ignore the state file and always start from zero when the program starts (or, case in point, restarts).
ID: 2031645 · Report as offensive     Reply Quote
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14680
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2031649 - Posted: 9 Feb 2020, 20:33:45 UTC

Going back to my original suggestion which started this conversation, I said

just don't write a 'state.sah' file. That's all.
I'll run some tests on the side effects of that tomorrow - I have some suspicions that when the application 'restarts from scratch' after a recorded, but crippled, checkpoint, the timing record may be corrupted. But that's a relatively minor issue. Otherwise, the suggested change is trivial.
ID: 2031649 · Report as offensive     Reply Quote
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2031651 - Posted: 9 Feb 2020, 20:38:18 UTC - in response to Message 2031640.  
Last modified: 9 Feb 2020, 20:58:54 UTC

The Mutex makes System calls which are not on the UNIX based Mac. If it doesn't work on another UNIX machine, the chances of it working on Windows are extremely Slim.
If you have modern Storage, the Mutex is useless. Definitely not worth making the code Linux only.


which system call? can you point out which line in the namedMutex.cpp file?

is there a functionally equivalent system call for MacOS? have you checked? a slight edit is probably possible to make it work on MacOS for the handful of people that are running the special app and MacOS (only you and TL come to mind). Looking through the top 500 systems, there are only 6 Darwin systems, only 2 of which (both belonging to TBar) are running Nvidia cards. Expand that to top 1000 and you have about 20 Darwin systems, and only 3 total that use nvidia cards/special app.

Apple effectively killed nvidia support on MacOS anyway...
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2031651 · Report as offensive     Reply Quote
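On the portability question raised above, one possible approach is sketched here. The thread does not say which system call namedMutex.cpp uses, so this is not the app's implementation. It assumes an advisory file lock taken with `flock`, a call that exists on both Linux and macOS, and the class name `FileLock` and the lock-file path are hypothetical.

```cpp
#include <fcntl.h>     // open, O_CREAT, O_RDWR
#include <sys/file.h>  // flock, LOCK_EX, LOCK_NB, LOCK_UN
#include <unistd.h>    // close
#include <string>

// Hypothetical cross-process lock built on an advisory file lock.
// Every process (or task instance) opens the same path and calls lock().
class FileLock {
public:
    explicit FileLock(const std::string& path)
        : fd_(::open(path.c_str(), O_CREAT | O_RDWR, 0644)) {}

    // Closing the descriptor also releases any lock still held on it.
    ~FileLock() {
        if (fd_ >= 0) ::close(fd_);
    }

    FileLock(const FileLock&) = delete;
    FileLock& operator=(const FileLock&) = delete;

    // Block until the exclusive lock is acquired; false if open() or flock() failed.
    bool lock() { return fd_ >= 0 && ::flock(fd_, LOCK_EX) == 0; }

    // Try to acquire without blocking; false if another holder has the lock.
    bool try_lock() { return fd_ >= 0 && ::flock(fd_, LOCK_EX | LOCK_NB) == 0; }

    // Release the lock so that the next waiter can proceed.
    void unlock() {
        if (fd_ >= 0) ::flock(fd_, LOCK_UN);
    }

private:
    int fd_;  // -1 if the lock file could not be opened
};
```

Two instances opened on the same path exclude each other: while the first holds the lock, `try_lock()` on the second returns false until the first calls `unlock()`. Whether this would be an acceptable substitute for the existing mutex is a question for whoever maintains the code.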
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2031652 - Posted: 9 Feb 2020, 20:39:33 UTC - in response to Message 2031641.  

The problem with missed pulses after boot has been narrowed down to the type of memory on the card, seems it disappears when you go from a 1070 to a 1080Ti.
That sounds like a highly dubious claim. I can only assume that it implies a race condition: different kernels finishing at different times, data not being available at times when the code assumes that it is ready. That's a synchronisation problem which it is the duty of the programmer to solve.
The 'Fix' for the lower end cards is to plug the monitor into the misbehaving card until you get a desktop, then switch the monitor back to the GPU that doesn't have the problem. How dubious is that? Since adding the two 1080Ti to the Mac neither one has had the problem, the 1070s still have the problem. So, how do you explain that? Pretty definitive proof in my book, 1080Ti no problem....1070s = problem. This machine, https://setiathome.berkeley.edu/show_host_detail.php?hostid=6796479
ID: 2031652 · Report as offensive     Reply Quote
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14680
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2031653 - Posted: 9 Feb 2020, 20:39:55 UTC - in response to Message 2031645.  
Last modified: 9 Feb 2020, 20:50:43 UTC

From a quick glance at the code and one of my state files, I suspect that you'll have to write to the state file as you go along, because it is the data in the state file that will eventually be returned as the result.
No, the final (returned) result file is written to incrementally as 'result.sah' in the slot directory. That's a soft link to the named upload file in the project directory. Obviously, it needs to be re-created from scratch if the application starts 'cold', without a state file to indicate the (partial) status of the internal counters on restart.

Edit - the 'state.sah' file starts with data like

<ncfft>94713</ncfft>
<cr>6.857660e+001</cr>
<fl>512</fl>
<prog>0.85297058</prog>
<potfreq>-1</potfreq>
<potactivity>0</potactivity>
<signal_count>15</signal_count>
<flops>0.000000</flops>
<spike_count>9</spike_count>
<autocorr_count>0</autocorr_count>
<pulse_count>4</pulse_count>
<gaussian_count>0</gaussian_count>
<triplet_count>2</triplet_count>
which does not form part of the final result. It does also contain details of the 'best so far' of the five signal types, in case there isn't anything better in the remaining portion of the run.
ID: 2031653 · Report as offensive     Reply Quote
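The 'state.sah' excerpt above, together with the earlier suggestion to ignore the state file on restart, can be sketched as follows. This is only an illustration: `tag_value` and `starting_progress` are hypothetical helpers, not functions from the science app, and the sketch assumes the flat `<tag>value</tag>` layout shown in the excerpt.

```cpp
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>

// Returns the text between <tag> and </tag>, or nothing if the tag is absent.
std::optional<std::string> tag_value(const std::string& text, const std::string& tag) {
    const std::string open = "<" + tag + ">";
    const std::string close = "</" + tag + ">";
    const std::size_t start = text.find(open);
    if (start == std::string::npos) return std::nullopt;
    const std::size_t begin = start + open.size();
    const std::size_t end = text.find(close, begin);
    if (end == std::string::npos) return std::nullopt;
    return text.substr(begin, end - begin);
}

// Hypothetical restart policy: resume from the saved <prog> fraction, or start
// cold at 0.0 when checkpoints are ignored or no usable <prog> value exists.
double starting_progress(const std::string& state_text, bool ignore_checkpoint) {
    if (ignore_checkpoint) return 0.0;
    const std::optional<std::string> prog = tag_value(state_text, "prog");
    if (!prog) return 0.0;
    try {
        return std::stod(*prog);
    } catch (const std::exception&) {
        return 0.0;  // unparsable value: treat it like a missing checkpoint
    }
}
```

With the excerpt above as input, `tag_value(text, "signal_count")` yields "15", and `starting_progress(text, true)` yields 0.0, which corresponds to the 'start from zero' behaviour. As the post notes, a cold start also means 'result.sah' has to be re-created from scratch, which this sketch does not cover.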
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14680
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2031654 - Posted: 9 Feb 2020, 20:43:37 UTC - in response to Message 2031652.  

The 'Fix' for the lower end cards is ...
That's not a 'fix', it's a workaround. It'll only become a fix when somebody demonstrates knowledge of how the observed behaviour is caused by the hardware in question, and devises some code to mitigate that observed behaviour.
ID: 2031654 · Report as offensive     Reply Quote
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2031656 - Posted: 9 Feb 2020, 20:53:50 UTC - in response to Message 2031654.  

Only TBar's systems seem to have made the connection between a missed pulse and a monitor connection. IIRC, TL was trying to work around a similar issue on his Mac, but couldn't fix it with the same monitor workaround. Sounds more like a problem with the Mac platform honestly. The hundreds of Linux systems don't seem to be affected by this issue nearly as much, even with the majority of them running GDDR5 memory GPUs.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2031656 · Report as offensive     Reply Quote
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14680
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2031657 - Posted: 9 Feb 2020, 20:55:44 UTC - in response to Message 2031656.  

I still think it's a timing problem, with the code not waiting for a result to be returned before using it.
ID: 2031657 · Report as offensive     Reply Quote
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2031658 - Posted: 9 Feb 2020, 20:55:52 UTC - in response to Message 2031654.  

When you plug the monitor into the card and display a desktop it changes the way the vram is being used, that's All it takes to get the card to start reporting Pulses instead of missing Pulses...a change to the vram usage.
ID: 2031658 · Report as offensive     Reply Quote
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14680
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2031661 - Posted: 9 Feb 2020, 21:10:22 UTC - in response to Message 2031658.  

When you plug the monitor into the card and display a desktop it changes the way the vram is being used, that's All it takes to get the card to start reporting Pulses instead of missing Pulses...a change to the vram usage.
Bo**ox.

If switching to 'display mode' changed the contents of a defined memory location, or changed the value returned when that memory location is interrogated, it would be such a fundamental design error that the whole production run would have to be recalled and scrapped.
ID: 2031661 · Report as offensive     Reply Quote
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2031662 - Posted: 9 Feb 2020, 21:12:27 UTC - in response to Message 2031657.  

I still think it's a timing problem, with the code not waiting for a result to be returned before using it.


very possible. some kind of idiosyncrasy with timing on the MacOS version of the app.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2031662 · Report as offensive     Reply Quote


 