Posts by Raistmer

21) Message boards : Number crunching : GPU FLOPS: Theory vs Reality (Message 1937318)
Posted 26 days ago by Profile Raistmer
Post:
@Shaggie76
Thanks for the data.

@iwazaru
I never said that I need additional motivation besides the project's goal.
All I said in another thread is that CreditScrew discourages optimization. It does. But who cares? ;)
And, just to be precise, your proposal for additional motivation is the wrong one.
If app performance is increased, GPU power consumption will increase too. We will do more work per hour... but we will also consume more energy.
So, Greenpeace will be disappointed :P
22) Message boards : Number crunching : Download sources (Message 1937238)
Posted 26 days ago by Profile Raistmer
Post:
https://boinc.berkeley.edu/trac/wiki/DownloadOther
It would be good to get both Mike's and arkayn's sites with SETI opt apps listed here too.
23) Message boards : Number crunching : Intel GPU and CPU at once (Message 1937074)
Posted 27 days ago by Profile Raistmer
Post:
Answered my own question by running Einstein's LATeah on both. The GPU core is 4 times faster than 1 CPU core on an i5-3570K. Probably even more of a difference on the 8th generation chips, as their graphics is twice as fast, but a core is only 1.5 times faster.

It depends on the app/data/model.

For a netbook on the new Atoms the iGPU part is much faster; for desktop processors the results are quite different.
Benchmarking of the particular model is required - there is no single recipe here.
24) Message boards : Number crunching : Ryzen 16T / 8C vs. 8T / 8C. How much better? (Message 1937067)
Posted 27 days ago by Profile Raistmer
Post:
Thanks for the study.


(2a) Establish 6 concurrent Seti CPU tasks + 1 Seti GPU task + 1 other BOINC (NFS@home) to fully load all 8 cores;

(3a) Establish 13 concurrent Seti CPU tasks + 1 Seti GPU task + 2 other BOINC (NFS@home and asteroids@home) to fully load all 16 threads;


If you have the time and inclination, could you please repeat a similar experiment on the same hardware with all cores kept busy by SETI CPU tasks?

That is:
(2a_mod) Establish 8 concurrent Seti CPU tasks + 0 Seti GPU task + 0 other BOINC (NFS@home) to fully load all 8 cores;

(3a_mod) Establish 16 concurrent Seti CPU tasks + 0 Seti GPU task + 0 other BOINC (NFS@home and asteroids@home) to fully load all 16 threads;

It would be quite interesting to see whether that 19% improvement still applies.
25) Message boards : Number crunching : Let's Play CreditNew (Credit & RAC support thread) (Message 1937065)
Posted 27 days ago by Profile Raistmer
Post:

The biggest impediment to cross project comparisons would be projects providing realistic estimates for their tasks.

Exactly.
But we should not forget that the "FLOPs counting" method isn't something new. It was used before. And it was against the background of its usage that David decided to develop CreditScrew.
I think he also had more interesting things to spend his time on... still, he did it.
Apparently the estimates were of too low a quality.... AFAIK his aim was precisely inter-project comparison, above all else.
And that aim ruined our own "small" SETI-credits world (while itself remaining, I would say, unachieved).
26) Message boards : Number crunching : Let's Play CreditNew (Credit & RAC support thread) (Message 1937061)
Posted 27 days ago by Profile Raistmer
Post:
We can't force other projects to adopt whatever solution we come up with, but you know what they said would happen to the person who built a better mousetrap...

Only if their aim is mouse catching ;)
No matter how scientifically good your credit system is (in terms of how accurately it measures FLOPS via RAC or something else), one of the goals of a credit system is to attract resources to the project (social engineering). And people tend to just like bigger numbers. And to compare them :)
27) Message boards : Number crunching : Let's Play CreditNew (Credit & RAC support thread) (Message 1937054)
Posted 27 days ago by Profile Raistmer
Post:

The amount of Credit awarded is based on the work (operations) estimated to process the WU.

That's the issue - the estimate can be done differently. That's OK for a single project. It's not OK for inter-project comparison (just to be clear).
And I have no proposals for how to make inter-project comparisons at all. Also, I don't think they really matter. What does it matter if someone is very advanced in counting the number of sand grains on the beach, if I don't care what that number is at all? ;)
Some projects are worthless from my own personal point of view no matter how much credit they pay, and for some it's the reverse. Quantities of different dimensionality...
28) Message boards : Number crunching : Let's Play CreditNew (Credit & RAC support thread) (Message 1937053)
Posted 27 days ago by Profile Raistmer
Post:
(1) Examples were provided earlier. A real computational device does not do only arithmetic operations. It also does branching, and it also does memory accesses. All of this is a non-negligible part of any real program.
That's why a FLOPs count is always a very approximate estimation.

That is to actually process a WU, but for the actual calculations required to process the WU, there must be some estimate for what is required, the number of mathematical operations.
That is the FLOPs I am referring to. It's not about the work that is done, or how it is done, but the work that needs to be done. The number of operations that would be required to produce a result, without any shortcuts etc.

Yep, with a single but important addition: an ESTIMATE of the work. And estimates tend to differ between people/projects. It will not work as an inter-project base.

(2) And regarding SETI work per se and FLOPs - our work (the definition of what I mean when I use the word "work") is to determine the number and properties of particular signals in a given length of radio signal in a given frequency band.
So ideally one should get more credits from SETI if one completes more such work.

And FLOPs does that, particularly with the second Credit system. Those with lower angle ranges (VLAR) require more processing, they have greater estimated FLOPs, they get more Credit. Those that have higher angle ranges (VHAR) require less processing, they have lesser estimated FLOPs, they get less Credit. Even now that occurs, just without any consistency in what is granted. And much less than what the Cobblestone definition says we should get.

Yep, in such an implementation it's the same "fixed number of points (call it "FLOPs" or whatever) awarded for a given block of work" that we are all talking about. (*)



And this definition can't be translated into FLOPs. Partly because of point (1), partly because the same work can be done differently arithmetic-wise. We can change the order of elementary arithmetic operations and the NUMBER of them (i.e. FLOPs!) and still achieve the same result with a given precision. That's why I say simple FLOPs counting is inadequate too.

True, hence my suggestion for Credit to be based on the work that has to be done, the operations that would be required to process the WU without any optimisations. If optimisations result in a huge boost in performance by skipping 50% of the operations, but still give a valid result, then they still earn the value of Credit they would if all the operations were performed.

Yep, same (*).

So you also propose to establish some fixed payment for a particular block of work.
Agreed; I think it is better than CreditScrew. As I recall, Eric also favored "FLOPs counting" in such an implementation, but it's outside his range of decisions.
29) Message boards : Number crunching : Let's Play CreditNew (Credit & RAC support thread) (Message 1937050)
Posted 27 days ago by Profile Raistmer
Post:

What matters is that a particular WU would require a certain number of operations to be performed if there were no optimisations, short cuts or other operation minimisation used.
That maximum possible number of operations should be what is used in determining the work done. The reference machine in the Cobblestone definition would perform all of those operations in order to process that WU, and so that number of required FLOPs would give the amount of Credit due for that WU.


Hm... and who will decide that the same work can't be done in an even GREATER number of operations? ;)

Consider the simplest example I wrote before: a function value is computed on each iteration (even though it remains the same for the whole inner loop), or it is computed only once per inner loop and stored in a variable.
For a big function it's a very obvious way to optimize. So obvious that hardly anyone would consider it an optimization, just a good programming habit.
So, to compute the maximum number one would have to unroll any such computation... and that would be another hardly achievable task for project programmers....

I say all this just to stress a simple thing: the awarding of points is ARBITRARY in reality. And it depends on the skills of the programmer who coded the initial algorithm.
30) Message boards : Number crunching : Let's Play CreditNew (Credit & RAC support thread) (Message 1937047)
Posted 27 days ago by Profile Raistmer
Post:

I would suggest:
Baseline a set of APs and MBs on a defined processor (even a "theoretical" one), decide how much that is worth per %blanked or per degree AR and scale each task to get its "value". Ignore the differences in processor and application (apart from those needed in the validation process) as those are user choices not project decisions.
This way there is time-consistency in that a user will know that an x% blanked AR will have a value of y, and an M-degree MB will have a value of n. Independent of the processor or application.

Think about it - today even running the same task, but with a different pair of validating crunchers, will probably get you a different value, and that's what gets most people's backs up.

As to the argument about cross-project comparability in value per task - that doesn't exist today, and may never exist, and that is something I think we have to live with. The truth is many projects do not use the fully adaptive scoring system that SETI does; the majority appear to use either a simple fixed, or a more complex time-based scoring system. I wouldn't be surprised if one or two actually use a variation on the one I've outlined above.


Something similar was applied before CreditScrew.
But not only was each task awarded a number of "FLOPs"/points (I listed the reasons why real FLOPs have very little connection to work measurement, so I will call these fake "FLOPs" just "points" to reduce confusion); attempts to account for variations in tasks were also made by awarding different points for each block of computations.
That became an issue when we optimized AP to skip some blocks completely - I had to carefully restore the points accounting separately from the actual computations so as not to ruin the credit claims.
Same with the GPU apps, where the points arithmetic is done on the CPU while the actual work is done on the GPU.
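
A minimal sketch of that separation (hypothetical names and values, not the actual AP code): the points claim advances for every block, whether or not the optimized path actually executes the block's computation.

#include <stdio.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the real blanking test and block processing. */
static bool block_is_blanked(int block) { return block % 4 == 0; }
static void process_block(int block)    { (void)block; /* the real work would go here */ }

int main(void)
{
    const int    num_blocks       = 16;
    const double points_per_block = 2.5;      /* arbitrary nominal "worth" of one block */
    double       claimed_points   = 0.0;

    for (int block = 0; block < num_blocks; block++) {
        claimed_points += points_per_block;   /* always claim the block's nominal work... */
        if (!block_is_blanked(block))
            process_block(block);             /* ...even when the optimized path skips the block */
    }
    printf("claimed points: %.1f\n", claimed_points);
    return 0;
}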

The system has its own degrees of freedom (in terms of the arbitrary decisions about what block/task is worth what points).
The issue: two different mixes of tasks will result in different pay-offs in credits on different hardware. That is, "cherry-picking" for a particular host.
But I think it's an issue only when network bandwidth is the limiting stage. As long as bandwidth allows, it's just another degree of optimization - to do the work that best fits the particular hardware.
31) Message boards : Number crunching : Let's Play CreditNew (Credit & RAC support thread) (Message 1937043)
Posted 27 days ago by Profile Raistmer
Post:
(that is, work, not FLOPs, done per unit of time)

What would you use to determine work done, other than FLOPs? As FLOPS is the metric used to determine arithmetic capability/work done on a computer.


(1) Arithmetic capability - yes (with the restrictions of (3)). Work done - no.

Examples were provided earlier. A real computational device does not do only arithmetic operations. It also does branching, and it also does memory accesses. All of this is a non-negligible part of any real program.
That's why a FLOPs count is always a very approximate estimation.

(2) And regarding SETI work per se and FLOPs - our work (the definition of what I mean when I use the word "work") is to determine the number and properties of particular signals in a given length of radio signal in a given frequency band.
So ideally one should get more credits from SETI if one completes more such work.

And this definition can't be translated into FLOPs. Partly because of point (1), partly because the same work can be done differently arithmetic-wise. We can change the order of elementary arithmetic operations and the NUMBER of them (i.e. FLOPs!) and still achieve the same result with a given precision. That's why I say simple FLOPs counting is inadequate too.

And (3) a FLOP represents any arithmetic operation. A real computation device spends a different number of ticks per instruction type.
It's known that (for example) multiplication is slower than addition, but division is the slowest of all. To such a degree that Windows computes the reciprocal of the processor speed in the early boot phase and keeps that number so it can do only multiplications in any timer-related routines (SETI code does the same too, of course). I never heard about a MOP (Multiplication OPeration) or a DOP (Division OPeration); FLOP binds them all....
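
A minimal sketch of that reciprocal trick (made-up numbers, not the actual Windows or SETI code): pay for one division up front, and every later timing conversion becomes a multiplication.

#include <stdio.h>

int main(void)
{
    const double ticks_per_second = 3.5e9;                       /* measured once, e.g. at boot */
    const double seconds_per_tick = 1.0 / ticks_per_second;      /* the single slow division */

    unsigned long long elapsed_ticks = 7000000000ULL;            /* some measured tick count */
    double elapsed_seconds = elapsed_ticks * seconds_per_tick;   /* later conversions multiply only */

    printf("%.3f s\n", elapsed_seconds);
    return 0;
}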
32) Message boards : Number crunching : Let's Play CreditNew (Credit & RAC support thread) (Message 1937035)
Posted 27 days ago by Profile Raistmer
Post:
we had to do proper benchmarking of the processor/task combination.

And that's the single robust method of speed comparison I know of.
One should always compare quantities of the same dimension. It's the first thing one should remember in physics (and not only there).

For the SETI application that means that if one wants to compare processing speed (that is, work, not FLOPs, done per unit of time), one should separate AP with a particular blanking level and MB with a particular AR.
Then the comparison has a chance of being correct.

Of course this nullifies the idea of inter-project comparison, but one cannot have everything, and free of charge at that ;)
33) Message boards : Number crunching : Let's Play CreditNew (Credit & RAC support thread) (Message 1937033)
Posted 27 days ago by Profile Raistmer
Post:

Well, same work (to analyse N seconds of radio signal) done faster, time and efforts put in optimization... just to get same credits.

Actually, I think I was not precise here, because I suspect that stock will have the same RAC (!) as before. And RAC is credits per time (the speed of earning credits), not an amount of credits.
If CreditScrew indeed does this (and it seems it does), then with a stock improvement we will observe a REDUCTION in the credits paid for a single identical task (and because stock now operates faster it does more tasks per day, so RAC remains the same - that's the renormalization I spoke about). Is that what we really wanted to get as the pay-off for optimization? I would say no.

Actually this is the fundamental difference between CreditScrew and FLOPs counting.
Neither method can account for the work done.
But in the situation I describe they will act differently. FLOPs counting will pay the same credit per task; tasks per day increase, so the stock host gets a bigger RAC (as one would infer from the word "optimization"). Opt hosts get the same as before (not affected), again as common sense suggests.

CreditScrew instead will renormalize RAC, completely hiding the effects of the stock optimization (the optimizer's pay-off is ZERO) and reducing RAC for anonymous-platform hosts (though they did not change at all - and that's a negative pay-off for optimization).
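
A toy illustration of the above with made-up numbers: before the stock speed-up, stock does 10 tasks/day at 50 credits each (RAC ~500) and an anonymous-platform opt host does 20 tasks/day at 50 credits each (RAC ~1000). Now stock doubles its speed to 20 tasks/day. With fixed per-task points, stock rises to RAC ~1000 and the opt host stays at ~1000. With CreditScrew's renormalization the per-task credit drops to ~25, so stock stays at ~500 and the unchanged opt host falls from ~1000 to ~500.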

So, from an optimizer's point of view FLOPs counting is much better (though it is inadequate too).
34) Message boards : Number crunching : Let's Play CreditNew (Credit & RAC support thread) (Message 1937032)
Posted 27 days ago by Profile Raistmer
Post:

So, what's our definition of a FLOP? Is it the number of CPU instructions executed (fewer), or the number of pairs of numbers multiplied together (exactly the same)? I'll need to go and find some proper references, but my gut feeling is that it should be the second.


IMHO, the second.
So, going SIMD will not change the FLOPs number.
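
A minimal illustration in code (SSE intrinsics, assuming an x86 target; hypothetical functions, not SETI code): both versions below perform exactly the same additions per group of four elements, i.e. the same FLOPs, whether issued as four scalar instructions or as one packed instruction.

#include <xmmintrin.h>   /* SSE intrinsics */

/* Scalar version: four separate additions per group of four elements. */
void add_scalar(float *a, const float *b, int n)
{
    for (int i = 0; i < n; i++)
        a[i] += b[i];
}

/* SIMD version: one packed addition per group of four elements - the FLOPs count is unchanged. */
void add_sse(float *a, const float *b, int n)
{
    for (int i = 0; i + 4 <= n; i += 4) {            /* any tail elements would need a scalar loop */
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(a + i, _mm_add_ps(va, vb));
    }
}

Whether the packed loop is actually faster is a separate question.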

Also, it will not necessarily improve speed(!) It depends on the implementation of the particular instruction on the particular device.
And there were at least 2 examples already where going to the next SIMD level actually slowed things down (!):
SSE3 on Venice, and AVX on the first generation of AMD CPUs that supported it.
It can sound weird - why implement SIMD at all then, one could say - but it has its own understandable reasons.
The competitor (Intel) extended the instruction set (for computation speed improvement, of course). AMD was not able to implement those instructions as effectively as Intel did. But if they had just left out the implementations, all software using those instructions would simply fail to run on AMD chips. So they implemented the corresponding SIMD levels as fast as they could at that moment, for compatibility reasons. Their single SSE3 horizontal addition took, let's say (for example, not a real number), 8 CPU ticks while Intel did the same operation in, let's say, 6 ticks.
Eventually the microcode implementation improved and SSE3 HADD indeed became faster than 4 scalar additions.

What does all this mean for the FLOPs/credit issue? We can't just add a fixed coefficient to account for SIMD usage versus scalar operations (if we want not just to count FLOPs but to estimate the WORK done by the device in a given time), because SIMD implementations differ wildly in speed.
35) Message boards : Number crunching : Let's Play CreditNew (Credit & RAC support thread) (Message 1937030)
Posted 27 days ago by Profile Raistmer
Post:

I'm not sure I understand exactly what you are saying. I think you are saying:
"Since the FLOPs are the same, tasks get the same amount of credit as they did when the stock app was slower".
Isn't that exactly what we want?


Well, the same work (to analyse N seconds of radio signal) is done faster, time and effort were put into optimization... just to get the same credits.
Moreover, if stock gets the same credits, all opt hosts will now get lower credits - not because they became slower, but just because stock became faster.
And then we attempt to compare SETI credits, where optimization has been going on since the project's founding, with the credits of other projects where computations are done in virtual machines (for example) and computation speed isn't a priority at all.

So, if the aim is to analyse the radio signal faster - yes, of course stock optimization is exactly what we want. But the credits will not reflect the advance towards achieving that goal.
36) Message boards : Number crunching : Let's Play CreditNew (Credit & RAC support thread) (Message 1936859)
Posted 29 days ago by Profile Raistmer
Post:
I was going off memory but I'll try and dig up his old posts.

But for now my daughter is demanding Masha & the Bear :D

(Really)

I would say it's for quite small children.
Try to find "Смешарики" (not sure whether it was translated into other languages though) - much more informative and educational cartoons (actually with a layer for adults as well).

Well, regarding FLOPs:
Consider these two loops:

for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++) a[i][j] += 1;    // row-wise: walks memory sequentially
and
for (int j = 0; j < N; j++)
    for (int i = 0; i < N; i++) a[i][j] += 1;    // column-wise: strided, cache-unfriendly access

They involve exactly the same FLOPs. But the performance will be quite different on all modern devices (GPUs included).

Next example:
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++) {
        ....
        a[i][j] += f(i);      // f(i) recomputed on every inner-loop iteration
        ...
    }
and
for (int i = 0; i < N; i++) {
    float temp = f(i);        // loop-invariant value computed once and reused
    for (int j = 0; j < N; j++) {
        ....
        a[i][j] += temp;
        ...
    }
}
Obviously the second has fewer FLOPs. But what about performance? It depends!

How big the implementation of f() is, whether temp ends up in a register, whether temp stays in cache....

Sometimes more FLOPs will provide better performance.

All this illustrates that performance != FLOPs at all. On modern devices with complex memory architectures it's very noticeable.

That's why even FLOPs counting will fail as long as different types of hardware are used for the computations and the computations are inhomogeneous (as MultiBeam at different ARs is).

Regarding CreditScrew and the discouragement of optimization:

Consider the stock app and an opt one. The opt one provides better performance (mostly due to memory access patterns, which can't be measured in FLOPs, btw). So opt hosts do the same work faster and earn more credits.
But at some point stock implements the same optimizations... and recalibration occurs. The FLOPs are the same, but the stock app now processes the same work on the same hardware faster... so stock gets recalibrated to receive the same credits as before.
What will happen to the opt hosts now?...

As I said, it directly inhibits optimization...

BTW, that issue (the habit of not accounting for memory access costs) shows itself very vividly in the latest Spectre exploits. They are very elegant in the way they extract additional info just from timings. It very much resembles some quantum physics cases where one can get additional info about a system just because some possibilities exist (are non-zero), even if they are not realized in the particular experiment at all.
37) Message boards : Number crunching : ROCm 1.8 (Message 1936801)
Posted 29 days ago by Profile Raistmer
Post:
I'm not sure the Linux build implements the -tt option.
Time targeting uses the profiling abilities of the OpenCL runtime - it's worth checking whether Urs ported that block of code to Linux or not.
Look into stderr to see what it reports.


I was using -tt on this system when I had the ProDuo cards with AMD standard drivers and it didn't cause a problem. Now with ROCm, I kept the args the same as before and then simplified to what I use in Windows. I could try to remove it to see if it makes a difference, but I want it to run for a while to see if switching to non-SoG makes a difference.


A non-implemented option will just be ignored, so it will not cause trouble per se. But if it is unsupported, then the app will use the older way of selecting kernel size, and this could lead to errors vs the Windows version.
If errors continue, try increasing the -period_iterations_num value.
38) Message boards : Number crunching : Let's Play CreditNew (Credit & RAC support thread) (Message 1936799)
Posted 29 days ago by Profile Raistmer
Post:

I just hope Raistmer is in the mood to explain why he's saying his optimizations are getting punished.

Not the mood, but the time to write an example; will do it later.

And why Eric is saying the opposite (pretty much).

Where? Could you provide a link?
39) Message boards : Number crunching : ROCm 1.8 (Message 1936798)
Posted 29 days ago by Profile Raistmer
Post:
I'm not sure the Linux build implements the -tt option.
Time targeting uses the profiling abilities of the OpenCL runtime - it's worth checking whether Urs ported that block of code to Linux or not.
Look into stderr to see what it reports.
40) Message boards : Number crunching : Let's Play CreditNew (Credit & RAC support thread) (Message 1936634)
Posted 22 May 2018 by Profile Raistmer
Post:

The 'stock' (server application) case is easier to grasp. Each WU has a fixed size, calculated by the splitter from the AR. That's expressed in fpops, and will be the same every time for WUs of the same AR, such as the blc vlars. If you study <rsc_fpops_est>, you'll know exactly how big tasks of each AR have been assessed to be - that's a real figure. In the meantime, the server is also keeping track of how fast your machine has been processing tasks recently - the APR - and passes back that number to your client every time new work is allocated. It's the fixed task size, and the varying speed, which your client uses to estimate how long each task is going to take.

This is the first big point of failure, especially with AstroPulse, where the initial CPU code was not optimised at all and optimization was done at the algorithm level. That just ruins the idea of pre-calculated FLOPs.

Any errors in the estimation of FLOPs vs AR (for MB), or FLOPs vs blanking % (for AP - though I don't remember whether anyone bothered to account for changes in FLOPs due to blanking, even though in reality blanking can easily change a task's computational weight at least two-fold), will add to the deviation from reality.

CreditScrew is just what it is, based on too many assumptions that are not valid in real life.
The most hated (for me) part of it: it directly discourages any app optimization. I think it's counterproductive. I think Petry would agree...

