nbody benchmark in asm

  • Thread starter: bmaxa

bmaxa

The fastest entry is this one: https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/nbody-gcc-9.html
Mine is attached.
Let me just comment on one part.
This:
Code:
        sqrtpd xmm7,xmm3        ; xmm7 = sqrt(xmm3)
        mulpd xmm3,xmm7         ; xmm3 = xmm3 * xmm7
        divpd xmm6,xmm3         ; xmm6 = xmm6 / xmm3
is sufficient on Zen processors, in place of the optimization meant for Intel:
Code:
;       cvtpd2ps xmm4,xmm3
;       rsqrtps xmm4,xmm4
;       mulpd xmm3,dqword[L2]
;       cvtps2pd xmm4,xmm4
        ;--------------------

;       movapd xmm7, xmm4

;       movapd xmm8,xmm3
;       mulpd xmm8, xmm7
;       mulpd xmm8, xmm7
;       mulpd xmm8, xmm7

;       mulpd xmm7,dqword[L1]

;       subpd xmm7,xmm8

        ;------------------------

;       movapd xmm8,xmm3
;       mulpd xmm8, xmm7
;       mulpd xmm8, xmm7
;       mulpd xmm8, xmm7

;       mulpd xmm7,dqword[L1]

;       subpd xmm7,xmm8 ; distance -> xmm7
;       mulpd xmm6,xmm7 ; mag -> xmm6
The results for mine are:
Code:
-0.169075164
-0.169059907

Performance counter stats for './nbody2 50000000':

          3,045.51 msec task-clock:u              #    0.999 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
                52      page-faults:u             #    0.017 K/sec
    10,284,747,996      cycles:u                  #    3.377 GHz                      (62.50%)
         1,663,774      stalled-cycles-frontend:u #    0.02% frontend cycles idle     (62.50%)
     6,560,200,491      stalled-cycles-backend:u  #   63.79% backend cycles idle      (62.50%)
    32,440,786,887      instructions:u            #    3.15  insn per cycle
                                                  #    0.20  stalled cycles per insn  (62.50%)
     1,948,850,295      branches:u                #  639.909 M/sec                    (62.51%)
            16,650      branch-misses:u           #    0.00% of all branches          (62.50%)
    13,253,138,548      L1-dcache-loads:u         # 4351.697 M/sec                    (62.50%)
            20,261      L1-dcache-load-misses:u   #    0.00% of all L1-dcache accesses  (62.49%)
   <not supported>      LLC-loads:u
   <not supported>      LLC-load-misses:u

       3.048611231 seconds time elapsed

       3.033987000 seconds user
       0.000000000 seconds sys
The results for C:
Code:
-0.169075164
-0.169059907

Performance counter stats for './fastc 50000000':

          2,638.85 msec task-clock:u              #    0.995 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
                66      page-faults:u             #    0.025 K/sec
     8,901,583,969      cycles:u                  #    3.373 GHz                      (62.36%)
           513,262      stalled-cycles-frontend:u #    0.01% frontend cycles idle     (62.40%)
     8,131,537,782      stalled-cycles-backend:u  #   91.35% backend cycles idle      (62.47%)
     7,848,064,835      instructions:u            #    0.88  insn per cycle
                                                  #    1.04  stalled cycles per insn  (62.58%)
        50,051,004      branches:u                #   18.967 M/sec                    (62.65%)
             3,021      branch-misses:u           #    0.01% of all branches          (62.61%)
     5,099,695,311      L1-dcache-loads:u         # 1932.543 M/sec                    (62.49%)
            10,158      L1-dcache-load-misses:u   #    0.00% of all L1-dcache accesses  (62.43%)
   <not supported>      LLC-loads:u
   <not supported>      LLC-load-misses:u

       2.652433932 seconds time elapsed

       2.631922000 seconds user
       0.000000000 seconds sys
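The C version also optimizes the energy calculation, which is why it is faster; I only optimized advance ;)
Also, the C version retires only about 1 instruction per cycle (0.88 in the output above), while mine retires about 3 (3.15) ;)
This is a recent entry, and it is quite a bit faster than the runner-up ;)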
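To make the IPC remark concrete, here is a small sketch that derives the same numbers from the raw counters in the two perf outputs above. The helper names are made up, and dividing by the step count assumes that the command-line argument 50000000 is the number of simulation steps.

```c
/* Instructions per cycle from two raw perf counters. */
double ipc(unsigned long long instructions, unsigned long long cycles)
{
    return (double)instructions / (double)cycles;
}

/* Average cycles spent per simulation step. */
double cycles_per_step(unsigned long long cycles, unsigned long long steps)
{
    return (double)cycles / (double)steps;
}
```

With the counters above, `ipc(32440786887ULL, 10284747996ULL)` comes out at about 3.15 for the asm version and `ipc(7848064835ULL, 8901583969ULL)` at about 0.88 for C. The cycle count is what decides the run time, though: roughly 206 cycles per step for asm against roughly 178 for C, so the higher IPC does not make the asm version faster.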
 

By the way, the runner-up, Rust, https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/nbody-rust-7.html
gives this perf output on my machine:
Code:
 Performance counter stats for './fastrs 50000000':

          3,828.79 msec task-clock:u              #    0.997 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
               113      page-faults:u             #    0.030 K/sec
    12,928,803,695      cycles:u                  #    3.377 GHz                      (62.41%)
         3,135,882      stalled-cycles-frontend:u #    0.02% frontend cycles idle     (62.41%)
     8,981,265,210      stalled-cycles-backend:u  #   69.47% backend cycles idle      (62.49%)
    42,382,263,116      instructions:u            #    3.28  insn per cycle
                                                  #    0.21  stalled cycles per insn  (62.50%)
     1,999,086,419      branches:u                #  522.120 M/sec                    (62.58%)
            12,830      branch-misses:u           #    0.00% of all branches          (62.59%)
    17,049,593,439      L1-dcache-loads:u         # 4453.003 M/sec                    (62.59%)
            19,770      L1-dcache-load-misses:u   #    0.00% of all L1-dcache accesses  (62.43%)
   <not supported>      LLC-loads:u
   <not supported>      LLC-load-misses:u

       3.840771217 seconds time elapsed

       3.813848000 seconds user
       0.003320000 seconds sys
These are optimized for Intel, so the time here is somewhat worse than on the i5 that the site uses (3.4 sec there); the same goes for the first-place C program (2.2 sec there).
The site uses an i5 3330, which has a max clock of 3.2 GHz.
I lowered my clock to 3.4 GHz and undervolted to cut power consumption, since I run WCG 24/7 ;)
The CPU is a 2700X.
 
And here is the C version optimized for AMD, look at it now:
Code:
Performance counter stats for './fastc 50000000':

          1,818.72 msec task-clock:u              #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
                65      page-faults:u             #    0.036 K/sec
     6,143,099,913      cycles:u                  #    3.378 GHz                      (62.39%)
           711,089      stalled-cycles-frontend:u #    0.01% frontend cycles idle     (62.39%)
     5,405,708,505      stalled-cycles-backend:u  #   88.00% backend cycles idle      (62.39%)
     6,399,232,187      instructions:u            #    1.04  insn per cycle
                                                  #    0.84  stalled cycles per insn  (62.48%)
        50,081,345      branches:u                #   27.537 M/sec                    (62.65%)
             1,451      branch-misses:u           #    0.00% of all branches          (62.68%)
     4,002,502,915      L1-dcache-loads:u         # 2200.726 M/sec                    (62.59%)
             6,239      L1-dcache-load-misses:u   #    0.00% of all L1-dcache accesses  (62.43%)
   <not supported>      LLC-loads:u
   <not supported>      LLC-load-misses:u

       1.819344629 seconds time elapsed

       1.812157000 seconds user
       0.000000000 seconds sys
So I cut the time from 2.60 to 1.80 sec just by switching the code over to AMD ;)
For comparison, the time on the site is 2.2 sec, and that is the i5 at 3.2 GHz; mine is a 2700X at 3.4 GHz.

edit: No, I made a mistake: the guy computes 1/sqrt(distance), and I missed that.
The new result is:
Code:
-0.169075164
-0.169059907

Performance counter stats for './fastc 50000000':

          2,381.75 msec task-clock:u              #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
                66      page-faults:u             #    0.028 K/sec
     8,040,337,309      cycles:u                  #    3.376 GHz                      (62.41%)
         1,002,183      stalled-cycles-frontend:u #    0.01% frontend cycles idle     (62.54%)
     7,321,731,860      stalled-cycles-backend:u  #   91.06% backend cycles idle      (62.59%)
     6,600,765,648      instructions:u            #    0.82  insn per cycle
                                                  #    1.11  stalled cycles per insn  (62.59%)
        50,065,728      branches:u                #   21.021 M/sec                    (62.59%)
             2,015      branch-misses:u           #    0.00% of all branches          (62.52%)
     4,500,540,581      L1-dcache-loads:u         # 1889.592 M/sec                    (62.40%)
             6,630      L1-dcache-load-misses:u   #    0.00% of all L1-dcache accesses  (62.35%)
   <not supported>      LLC-loads:u
   <not supported>      LLC-load-misses:u

       2.382371906 seconds time elapsed

       2.373932000 seconds user
       0.000000000 seconds sys
So the i5 at 3.2 GHz takes 2.2 sec and the AMD 2700X at 3.4 GHz takes 2.4 sec, which means Intel is better at AVX/SSE than Zen 1 ;)
I don't have a Zen 2/3 to try ;)
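A rough way to put the two machines on the same footing is to convert each time into cycles. This is only a sketch: it assumes both CPUs really ran at the quoted clock for the whole run, and the i5 numbers are the ones quoted from the site, not something measured here. The function name is made up.

```c
/* Approximate cycle count, in billions, for a run of `seconds`
   at a constant clock of `clock_ghz`. */
double giga_cycles(double seconds, double clock_ghz)
{
    return seconds * clock_ghz;
}
```

`giga_cycles(2.2, 3.2)` gives about 7.04 billion cycles for the i5, and `giga_cycles(2.4, 3.4)` gives about 8.16 billion for the 2700X (perf counted 8.04 billion above), so under these assumptions the 2700X needs around 15% more cycles for the same work.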
 
I'll optimize further to overtake the guy ;)
Meanwhile, I quickly tried the x86 "gather" instructions, and they turned out to be catastrophically slow ;(
I mean, they are like loop and a few other instructions that are there just for the sake of compatibility ;)
 
By the way, here is what I get with the CPU running at normal speed.
fastc, from the guy who annoyed me by optimizing to the max (AVX):
Code:
-0.169075164
-0.169059907

 Performance counter stats for './fastc 50000000':

          1,988.28 msec task-clock:u              #    0.999 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
                67      page-faults:u             #    0.034 K/sec
     8,059,303,822      cycles:u                  #    4.053 GHz                      (62.46%)
           200,129      stalled-cycles-frontend:u #    0.00% frontend cycles idle     (62.46%)
     7,382,686,755      stalled-cycles-backend:u  #   91.60% backend cycles idle      (62.46%)
     6,589,592,669      instructions:u            #    0.82  insn per cycle
                                                  #    1.12  stalled cycles per insn  (62.46%)
        49,984,983      branches:u                #   25.140 M/sec                    (62.47%)
             1,753      branch-misses:u           #    0.00% of all branches          (62.57%)
     4,498,871,311      L1-dcache-loads:u         # 2262.693 M/sec                    (62.57%)
             8,279      L1-dcache-load-misses:u   #    0.00% of all L1-dcache accesses  (62.55%)
   <not supported>      LLC-loads:u
   <not supported>      LLC-load-misses:u

       1.990374103 seconds time elapsed

       1.982009000 seconds user
       0.000000000 seconds sys

My basic SSE2 version:
Code:
-0.169075164
-0.169059907

 Performance counter stats for './nbodysse2 50000000':

          2,603.44 msec task-clock:u              #    0.999 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
                53      page-faults:u             #    0.020 K/sec
    10,290,993,689      cycles:u                  #    3.953 GHz                      (62.35%)
           545,799      stalled-cycles-frontend:u #    0.01% frontend cycles idle     (62.37%)
     6,563,060,268      stalled-cycles-backend:u  #   63.77% backend cycles idle      (62.49%)
    32,393,147,312      instructions:u            #    3.15  insn per cycle
                                                  #    0.20  stalled cycles per insn  (62.61%)
     1,943,915,974      branches:u                #  746.673 M/sec                    (62.70%)
            12,391      branch-misses:u           #    0.00% of all branches          (62.61%)
    13,260,365,081      L1-dcache-loads:u         # 5093.407 M/sec                    (62.50%)
            11,338      L1-dcache-load-misses:u   #    0.00% of all L1-dcache accesses  (62.38%)
   <not supported>      LLC-loads:u
   <not supported>      LLC-load-misses:u

       2.606247643 seconds time elapsed

       2.585024000 seconds user
       0.003319000 seconds sys
My SIMD-optimized version, with arrays instead of structures:
Code:
-0.169075164
-0.169059907

 Performance counter stats for './nbody2 50000000':

          3,188.49 msec task-clock:u              #    0.999 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
                53      page-faults:u             #    0.017 K/sec
    12,961,532,262      cycles:u                  #    4.065 GHz                      (62.49%)
           267,650      stalled-cycles-frontend:u #    0.00% frontend cycles idle     (62.49%)
    10,087,414,286      stalled-cycles-backend:u  #   77.83% backend cycles idle      (62.49%)
    22,690,576,224      instructions:u            #    1.75  insn per cycle
                                                  #    0.44  stalled cycles per insn  (62.49%)
     1,498,517,339      branches:u                #  469.976 M/sec                    (62.49%)
            10,791      branch-misses:u           #    0.00% of all branches          (62.52%)
    11,845,468,225      L1-dcache-loads:u         # 3715.066 M/sec                    (62.52%)
            10,205      L1-dcache-load-misses:u   #    0.00% of all L1-dcache accesses  (62.52%)
   <not supported>      LLC-loads:u
   <not supported>      LLC-load-misses:u

       3.191446269 seconds time elapsed

       3.178639000 seconds user
       0.000000000 seconds sys
So fewer instructions and fewer branches, yet slower than the basic SSE2 version?
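For readers who have not seen the "arrays instead of structures" idea, here is a minimal sketch of the two layouts. The actual layout in the attached asm is not shown in the post, so the field names, the body count of 5 and the helper functions are all assumptions made for illustration.

```c
#define NBODIES 5

/* Array of structures (AoS): one record per body. */
struct body { double x, y, z, vx, vy, vz, mass; };

/* Structure of arrays (SoA): one array per component. */
struct bodies_soa {
    double x[NBODIES], y[NBODIES], z[NBODIES];
    double vx[NBODIES], vy[NBODIES], vz[NBODIES];
    double mass[NBODIES];
};

/* Copy an AoS body array into the SoA layout. */
void aos_to_soa(const struct body *b, struct bodies_soa *s)
{
    for (int i = 0; i < NBODIES; i++) {
        s->x[i] = b[i].x;   s->y[i] = b[i].y;   s->z[i] = b[i].z;
        s->vx[i] = b[i].vx; s->vy[i] = b[i].vy; s->vz[i] = b[i].vz;
        s->mass[i] = b[i].mass;
    }
}

/* Squared distance between bodies i and j, AoS layout. */
double dist2_aos(const struct body *b, int i, int j)
{
    double dx = b[i].x - b[j].x, dy = b[i].y - b[j].y, dz = b[i].z - b[j].z;
    return dx * dx + dy * dy + dz * dz;
}

/* Same computation, SoA layout. */
double dist2_soa(const struct bodies_soa *s, int i, int j)
{
    double dx = s->x[i] - s->x[j], dy = s->y[i] - s->y[j], dz = s->z[i] - s->z[j];
    return dx * dx + dy * dy + dz * dz;
}
```

In the SoA layout the x coordinates of bodies j and j+1 sit next to each other in memory, so one 128-bit load can fill both lanes of an SSE2 register, whereas with AoS the two values are a whole record apart. Whether that pays off depends on the rest of the loop, and the numbers above show that here it did not.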

But this guy optimized the algorithm to the max, and it shows: there are almost no branches, and the
instruction count is minimal as well.
 
For reference, here is the second-place entry, Rust.
Code:
-0.169075164
-0.169059907

 Performance counter stats for './fastrs 50000000':

          3,228.52 msec task-clock:u              #    0.999 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
               139      page-faults:u             #    0.043 K/sec
    12,897,808,753      cycles:u                  #    3.995 GHz                      (62.40%)
           650,765      stalled-cycles-frontend:u #    0.01% frontend cycles idle     (62.37%)
     8,978,149,955      stalled-cycles-backend:u  #   69.61% backend cycles idle      (62.48%)
    42,419,527,620      instructions:u            #    3.29  insn per cycle
                                                  #    0.21  stalled cycles per insn  (62.57%)
     2,000,289,340      branches:u                #  619.568 M/sec                    (62.69%)
             8,920      branch-misses:u           #    0.00% of all branches          (62.59%)
    17,047,855,495      L1-dcache-loads:u         # 5280.393 M/sec                    (62.49%)
            18,294      L1-dcache-load-misses:u   #    0.00% of all L1-dcache accesses  (62.40%)
   <not supported>      LLC-loads:u
   <not supported>      LLC-load-misses:u

       3.232975405 seconds time elapsed

       3.216925000 seconds user
       0.003314000 seconds sys


By the way, all of these fastest entries took my original SSE2 optimization, which held first place for a long time :)
 
