Mac Client Bug

Message boards : Number crunching : Mac Client Bug

Havoc
Volunteer tester

Joined: 18 May 99
Posts: 38
Credit: 1,454,156
RAC: 0
United Kingdom
Message 1898638 - Posted: 2 Nov 2017, 7:15:51 UTC

Does this WU suggest a problem/bug in the Mac client?

http://setiathome.berkeley.edu/workunit.php?wuid=2723540432
ID: 1898638
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1898644 - Posted: 2 Nov 2017, 8:38:14 UTC - in response to Message 1898638.  
Last modified: 2 Nov 2017, 8:39:34 UTC

> Does this WU suggest a problem/bug in the Mac client?
>
> http://setiathome.berkeley.edu/workunit.php?wuid=2723540432
Yes - or possibly a recent update to the Mac operating system that the old client can't cope with.
ID: 1898644
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1898660 - Posted: 2 Nov 2017, 12:04:31 UTC - in response to Message 1898644.  

Quite unlikely... was the last I heard.
ID: 1898660
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1898661 - Posted: 2 Nov 2017, 12:10:22 UTC - in response to Message 1898660.  
Last modified: 2 Nov 2017, 12:11:15 UTC

The applications page says:

Mac OS X/64-bit Intel | 8.20 (opencl_ati5_mac) | 17 Oct 2017, 23:49:50 UTC | 17,835 GigaFLOPS

so quite a lot of people have been completing quite a lot of valid work over the last two weeks. Doesn't sound like a deployment error to me.
ID: 1898661
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1898662 - Posted: 2 Nov 2017, 12:29:00 UTC - in response to Message 1898661.  

Not a single Mac on Beta got that error in almost a year; then, when the app is moved to Main, EVERY Mac gets that error.
What does it sound like then?
ID: 1898662
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1898663 - Posted: 2 Nov 2017, 12:45:17 UTC - in response to Message 1898662.  

If every Mac gets the error (on every task?), how does the apps page show a FLOPS count higher than that of any other Mac app apart from the original v8.03 from January 2016?

What's your evidence for that "every"?
ID: 1898663
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1898665 - Posted: 2 Nov 2017, 12:56:32 UTC - in response to Message 1898663.  
Last modified: 2 Nov 2017, 13:02:37 UTC

ID: 1898665
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1898666 - Posted: 2 Nov 2017, 13:05:14 UTC

OK, let's talk about evidence. Havoc's post opening this thread linked to a workunit with two failed Mac OS X tasks, but successful completions from two Windows computers. From that workunit we can see that it was an Arecibo task with "WU true angle range is : 0.561684".

The two failures were on these hosts:

All tasks for computer 8312767
All tasks for computer 8022187

Both machines have vastly more successful, valid, results with application 'SETI@home v8 v8.20 (opencl_ati5_mac)' than they have errors. And the oldest valid tasks were issued before the newest errors - so it isn't simply Eric fixing the deployment.

No, there's something else at play to cause those (rare) 'ERR_TOO_MANY_EXITS' outcomes. I don't own a Mac, so I'll have to leave you to track it down. For the record, it won't be the first time that an application has tested out fine at Beta, but has failed when exposed to the far wider range of task types distributed through Main. We had one a few years ago which suddenly used a huge amount of memory when, IIRC, it found a large number of pulses during the first main loop of the search.
ID: 1898666
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1898668 - Posted: 2 Nov 2017, 13:08:39 UTC - in response to Message 1898665.  

> Simply look at the list of machines, you can start with this one and work down, https://setiathome.berkeley.edu/results.php?hostid=8144040&state=6

Sure, but look at https://setiathome.berkeley.edu/results.php?hostid=8144040&state=4 - same URL, with just the state number flipped to show the valid results. Lots of them with the same app.
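
The two links in this post differ only in the `state` query parameter. A minimal Python sketch of building both, assuming only what the thread shows: state=4 lists valid results, and state=6 is the list TBar pointed to. The helper name `results_url` is made up for illustration.

```python
from urllib.parse import urlencode

# Base URL taken from the links quoted in this thread.
BASE = "https://setiathome.berkeley.edu/results.php"


def results_url(hostid: int, state: int) -> str:
    """Build a per-host results URL for one state filter (hypothetical helper)."""
    return f"{BASE}?{urlencode({'hostid': hostid, 'state': state})}"


if __name__ == "__main__":
    # Host 8144040 is the machine discussed above.
    print(results_url(8144040, 6))  # the list TBar linked
    print(results_url(8144040, 4))  # valid results, per the reply
```

Looking at both lists for the same host is what separates "this host sometimes fails" from "this host always fails".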
ID: 1898668
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1898670 - Posted: 2 Nov 2017, 13:21:09 UTC - in response to Message 1898666.  

I've already tracked it down. The evidence is indisputable. The app was run on Beta for almost a year without a single 'ERR_TOO_MANY_EXITS'. Once moved to Main, every Mac gets 'ERR_TOO_MANY_EXITS'.
Look at the slightly newer app on Main running under Anonymous platform: not a single 'ERR_TOO_MANY_EXITS'.
https://setiathome.berkeley.edu/results.php?hostid=6105482
https://setiathome.berkeley.edu/results.php?hostid=8243589
https://setiathome.berkeley.edu/results.php?hostid=8248108&state=2
Anonymous platform = Good, SETI Server = Bad.
ID: 1898670
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1898671 - Posted: 2 Nov 2017, 13:23:50 UTC - in response to Message 1898670.  
Last modified: 2 Nov 2017, 13:29:46 UTC

Or perhaps 'old app bad, slightly newer app better'?

(r3552 vs r3610)
ID: 1898671
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1898674 - Posted: 2 Nov 2017, 13:33:59 UTC - in response to Message 1898671.  

The first thing I would check is the API. It should be version 7.5.
But, I said that days ago...
ID: 1898674
rob smith Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22158
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1898678 - Posted: 2 Nov 2017, 13:41:20 UTC

Passing thought....
Did someone re-build the application between the "successful" Beta operations and "problematic" main operations?
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1898678
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1898681 - Posted: 2 Nov 2017, 13:51:57 UTC - in response to Message 1898678.  

> Passing thought....
> Did someone re-build the application between the "successful" Beta operations and "problematic" main operations?
It would be unusual. Eric is usually extremely careful not to introduce any variation at that stage - I believe he only keeps one copy of the actual binary executable online, and soft-links to it from the different mount points used for Main and Beta deployments.

The acid test would be to attach the same computer to both Main and Beta, and download both instances. Then perform an exhaustive comparison of both the downloaded files and the deployment metadata contained in client_state.xml.
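
The file half of that comparison could be sketched in Python as below. This is only an illustration under assumptions: the two directory arguments stand for wherever the BOINC client stored the Main and Beta downloads on the test machine, files are matched by name, and the function names are invented here. It covers the files only, not the metadata in client_state.xml.

```python
import hashlib
import sys
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large binaries are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def digest_map(directory: Path) -> dict[str, str]:
    """Map file name -> SHA-256 for the regular files directly in a directory."""
    return {p.name: sha256_of(p) for p in sorted(directory.iterdir()) if p.is_file()}


def compare_dirs(first: Path, second: Path) -> dict[str, str]:
    """Report files that are missing from one side or whose contents differ."""
    a, b = digest_map(first), digest_map(second)
    report = {}
    for name in sorted(a.keys() | b.keys()):
        if name not in b:
            report[name] = "only in first"
        elif name not in a:
            report[name] = "only in second"
        elif a[name] != b[name]:
            report[name] = "differs"
    return report


if __name__ == "__main__":
    if len(sys.argv) == 3:
        # Arguments: the Main and Beta project directories (paths are the user's to supply).
        for name, status in compare_dirs(Path(sys.argv[1]), Path(sys.argv[2])).items():
            print(f"{status:15} {name}")
    else:
        print("usage: compare_dirs.py MAIN_PROJECT_DIR BETA_PROJECT_DIR")
```

An empty report for the application files would support the single-binary, soft-linked deployment described above; any "differs" line would point at a rebuild or a different support file.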
ID: 1898681
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1898683 - Posted: 2 Nov 2017, 14:05:00 UTC - in response to Message 1898674.  

> The first thing I would check is the API. It should be version 7.5.
> But, I said that days ago...
The key requirement is that the API version string - embedded in the application binary at the linker stage, from the compiled API library - matches the actual behaviour of the API codebase used. The embedded string is picked up by the deployment script and transferred to the appropriate place in the <app_version> declaration, so that the BOINC client uses the correct protocols when controlling the app behaviour.

Actually, I would expect a version number of at least 7.7.0, or perhaps even 7.9.0, to pick up the changes made in the Mac OS X API by Charlie Fenton over the last month, to be compatible with both old and new versions of the Mac screensaver code (you are aware that Apple released a new version of OS X last month, I'm sure?)

If this app has been 'running at Beta for a year', then it can't have that update yet.
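
Checking what the client actually recorded could look like the sketch below. It assumes that client_state.xml holds `<app_version>` blocks containing `<app_name>`, `<version_num>`, `<plan_class>` and `<api_version>` elements; the embedded sample is invented for illustration and is not copied from any real host. A regular expression is used instead of an XML parser on the assumption that the file may not always be strictly well-formed.

```python
import re

# Illustrative sample only: layout and values are assumptions, not real host data.
SAMPLE = """
<app_version>
    <app_name>setiathome_v8</app_name>
    <version_num>820</version_num>
    <plan_class>opencl_ati5_mac</plan_class>
    <api_version>7.5.0</api_version>
</app_version>
"""

APP_VERSION_RE = re.compile(r"<app_version>(.*?)</app_version>", re.DOTALL)


def field(block: str, tag: str):
    """Return the text of the first <tag>...</tag> in a block, or None."""
    match = re.search(rf"<{tag}>\s*(.*?)\s*</{tag}>", block, re.DOTALL)
    return match.group(1) if match else None


def parse_version(text: str) -> tuple:
    """Turn '7.5' or '7.5.0' into a 3-part tuple so versions compare numerically."""
    parts = [int(p) for p in text.strip().split(".")]
    return tuple(parts + [0] * (3 - len(parts)))


def app_versions(state_text: str) -> list:
    """Extract the fields of interest from every <app_version> block."""
    tags = ("app_name", "version_num", "plan_class", "api_version")
    return [{t: field(b, t) for t in tags} for b in APP_VERSION_RE.findall(state_text)]


def is_older_than(api_version: str, minimum=(7, 7, 0)) -> bool:
    """True if the recorded API version is below the given minimum."""
    return parse_version(api_version) < minimum


if __name__ == "__main__":
    for entry in app_versions(SAMPLE):
        note = "older than 7.7.0" if is_older_than(entry["api_version"]) else "7.7.0 or newer"
        print(entry["app_name"], entry["plan_class"], entry["api_version"], note)
```

To run it against a real installation, read the host's own client_state.xml into a string and pass that to `app_versions` in place of `SAMPLE`.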
ID: 1898683
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1898692 - Posted: 2 Nov 2017, 14:41:20 UTC - in response to Message 1898683.  

Simply have someone check the client_state and see if it shows api version 7.5 in the apps section.
This user has a Mac in both Main & Beta but hasn't run many tasks: http://setiathome.berkeley.edu/show_user.php?userid=7781668
It appears you need to run many tasks to see the error.
ID: 1898692
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1898708 - Posted: 2 Nov 2017, 16:56:59 UTC - in response to Message 1898692.  

> It appears you need to run many tasks to see the error.
Then it CANNOT be a 'missing CL file' deployment error, as you tried to imply by referring me to the Beta thread this morning. I've posted in that thread myself now.
ID: 1898708
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1898715 - Posted: 2 Nov 2017, 17:33:40 UTC

When a task fails with ERR_TOO_MANY_EXITS, then it will have attempted to run 100 times before being killed. There will be 100 'boinc_temporary_exit()' reports in the Event Log (with reasons), and there will be 100 copies of stderr_txt in the task report.

Unfortunately, Raistmer's stderr_txt files are too big for 100 copies to fit into the 64 KB of data reported by the failed task. But if you start reading at the bottom, the failure point seems to occur after

Work Unit Info:
...............
Credit multiplier is :  2.85
WU true angle range is :  0.447592
Used GPU device parameters are:
	Number of compute units: 32
	Single buffer allocation size: 128MB
	Total device global memory: 6144MB
	max WG size: 256
	local mem type: Real
	LotOfMem path: no
	LowPerformanceGPU path: no
	HighPerformanceGPU path: no
period_iterations_num=50
and, by implication, the app started by writing

OpenCL platform detected: Apple
Number of OpenCL devices found : 2 
BOINC assigns slot on device #1 of 2 devices.
Info: BOINC provided OpenCL device ID used
That's from one of the examples we looked at this morning, and it appears that the 100 attempts alternated between device #1 and device #2. That particular machine - host 8144040 - appears to have two identical "AMD Radeon HD - FirePro D700 Compute Engine" devices, but I could have picked any of them.
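
For anyone continuing the hunt, the device alternation can be checked mechanically. The sketch below assumes the stderr text of a failed task has been saved locally and relies only on the "BOINC assigns slot on device #N of M devices" line quoted above; the function names are invented here. Because only the tail of the output survives the 64 KB limit, the count will normally be well short of 100.

```python
import re
from collections import Counter

# Matches the slot-assignment line quoted from the task's stderr output.
SLOT_RE = re.compile(r"BOINC assigns slot on device #(\d+) of (\d+) devices")


def device_sequence(stderr_text: str) -> list:
    """Return the device number of each surviving attempt, in order."""
    return [int(m.group(1)) for m in SLOT_RE.finditer(stderr_text)]


def alternates(sequence: list) -> bool:
    """True if no two consecutive attempts used the same device."""
    return len(sequence) > 1 and all(a != b for a, b in zip(sequence, sequence[1:]))


if __name__ == "__main__":
    # Made-up stand-in for the tail of a real stderr report.
    sample = (
        "BOINC assigns slot on device #1 of 2 devices.\n"
        "period_iterations_num=50\n"
        "BOINC assigns slot on device #2 of 2 devices.\n"
        "period_iterations_num=50\n"
        "BOINC assigns slot on device #1 of 2 devices.\n"
    )
    seq = device_sequence(sample)
    print("attempts seen:", len(seq))
    print("per device:", dict(Counter(seq)))
    print("strictly alternating:", alternates(seq))
```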

TBar asserts that *every* Mac user running this app will encounter errors. By the law of averages, at least one of you must read this thread, and you will be able to continue the hunt for evidence from here. As I said this morning, I don't possess a (current) Mac (I do have an LC475), and I don't propose to go out and buy one just for this. I'll have to leave this one to the Mac community, but I hope I've left you enough clues about how to proceed.
ID: 1898715
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1898716 - Posted: 2 Nov 2017, 17:37:53 UTC - in response to Message 1898708.  

Actually, I NEVER implied it was a missing CL file. What I DID imply was that Raistmer also thinks it's not likely an app can run on Beta, and Main, for almost a year without displaying a single error of the type that suddenly appeared on Main. People ran r3552 on Main last year under Anonymous platform and never saw that error. It seems it only appears when distributed by the server on Main. What I did suggest was to check the API version; at this point I couldn't care less. Since it's all trash talk you can fix it yourself.
ID: 1898716
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1898725 - Posted: 2 Nov 2017, 17:50:53 UTC

No, it should be evidence-based analysis and diagnosis. But as I said, I don't possess the necessary equipment.
ID: 1898725


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.