c - int vs short vectorization

- June 15, 2011

I have the following kernel, vectorized for arrays of integers:

    long valor = 0, i=0;

    __m128i vsum, vecpi, vecci, vecqci;

    vsum = _mm_set1_epi32(0);

    int32_t * const pa = a->data;
    int32_t * const pb = b->data;

    int sumdot[1];

    for( ; i<size-3 ;i+=4){
        vecpi = _mm_loadu_si128((__m128i *)&(pa)[i] );
        vecci = _mm_loadu_si128((__m128i *)&(pb)[i] );
        vecqci = _mm_mullo_epi32(vecpi,vecci);
        vsum = _mm_add_epi32(vsum,vecqci);
    }

    vsum = _mm_hadd_epi32(vsum, vsum);
    vsum = _mm_hadd_epi32(vsum, vsum);
    _mm_storeu_si128((__m128i *)&(sumdot), vsum);

    for( ; i<size; i++)
        valor += a->data[i] * b->data[i];

    valor += sumdot[0];

It works fine. However, if I change the datatype of a and b to short instead of int, shouldn't I use the following code?

    long valor = 0, i=0;

    __m128i vsum, vecpi, vecci, vecqci;

    vsum = _mm_set1_epi16(0);

    int16_t * const pa = a->data;
    int16_t * const pb = b->data;

    int sumdot[1];

    for( ; i<size-7 ;i+=8){
        vecpi = _mm_loadu_si128((__m128i *)&(pa)[i] );
        vecci = _mm_loadu_si128((__m128i *)&(pb)[i] );
        vecqci = _mm_mullo_epi16(vecpi,vecci);
        vsum = _mm_add_epi16(vsum,vecqci);
    }

    vsum = _mm_hadd_epi16(vsum, vsum);
    vsum = _mm_hadd_epi16(vsum, vsum);
    _mm_storeu_si128((__m128i *)&(sumdot), vsum);

    for( ; i<size; i++)
        valor += a->data[i] * b->data[i];

    valor += sumdot[0];

This second kernel doesn't work, and I don't know why. I know that the entries of the vectors are the same in the first and second case (and there is no overflow either). Can someone help me find the mistake?

Thanks.

Here are a few things I see.

In both the int and the short case, when you store the __m128i to sumdot, you use _mm_storeu_si128 on a target much smaller than 128 bits. This means you've been corrupting memory, and you were lucky not to be bitten.
- Related to this, because sumdot is an int[1] even in the short case, you were storing two shorts in one int and then reading it back as an int.
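As a minimal sketch of two ways to avoid the undersized store, the helpers below either give `_mm_storeu_si128` a full 16-byte destination or move only the low lane out of the register. The function names are hypothetical and are not part of the original code; only SSE2 intrinsics are used.

```c
#include <emmintrin.h>  /* SSE2: _mm_storeu_si128, _mm_cvtsi128_si32 */
#include <stdint.h>

/* Hypothetical helper: store into a buffer that really is 128 bits wide,
 * then read lane 0 of four 32-bit lanes. */
static int32_t low_lane32_via_buffer(__m128i v)
{
    int32_t lanes[4];                       /* 4 * 4 bytes = 16 bytes */
    _mm_storeu_si128((__m128i *)lanes, v);
    return lanes[0];
}

/* Hypothetical helper: skip memory and copy only the low 32 bits. */
static int32_t low_lane32_via_cvt(__m128i v)
{
    return _mm_cvtsi128_si32(v);
}

/* Hypothetical helper for the short case: the buffer element type matches
 * the lane width, so lane 0 is read as one int16_t, not as two shorts
 * packed into an int. */
static int16_t low_lane16_via_buffer(__m128i v)
{
    int16_t lanes[8];                       /* 8 * 2 bytes = 16 bytes */
    _mm_storeu_si128((__m128i *)lanes, v);
    return lanes[0];
}
```

In the kernels above, either `sumdot` could be declared with four (or eight) elements, or the final store could be replaced by a direct lane read.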
In the short case you're missing one horizontal vector reduction step. Remember that now that you've got 8 shorts per vector, you must have log_2(8) = 3 vector reduction steps.
```
vsum = _mm_hadd_epi16(vsum, vsum);
vsum = _mm_hadd_epi16(vsum, vsum);
vsum = _mm_hadd_epi16(vsum, vsum);
```
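To see why the third step matters, here is a small sketch comparing lane 0 after two and after three `_mm_hadd_epi16` steps. The helper names are hypothetical, and the `target("ssse3")` attribute is a GCC/Clang convenience assumed here so that the snippet builds without `-mssse3`; with that flag it can be dropped.

```c
#include <emmintrin.h>  /* SSE2: _mm_extract_epi16 */
#include <tmmintrin.h>  /* SSSE3: _mm_hadd_epi16 */
#include <stdint.h>

/* Hypothetical helper: lane 0 after only two steps. With 8 lanes this is
 * just the sum of lanes 0..3, i.e. a partial result. */
static __attribute__((target("ssse3"))) int16_t lane0_after_two_hadds(__m128i v)
{
    v = _mm_hadd_epi16(v, v);  /* 8 values -> 4 distinct pair sums */
    v = _mm_hadd_epi16(v, v);  /* 4 -> 2 */
    return (int16_t)_mm_extract_epi16(v, 0);
}

/* Hypothetical helper: full sum of the eight 16-bit lanes, wrapping
 * modulo 2^16 like the vector adds themselves. */
static __attribute__((target("ssse3"))) int16_t hsum_epi16(__m128i v)
{
    v = _mm_hadd_epi16(v, v);  /* 8 -> 4 */
    v = _mm_hadd_epi16(v, v);  /* 4 -> 2 */
    v = _mm_hadd_epi16(v, v);  /* 2 -> 1, replicated in every lane */
    return (int16_t)_mm_extract_epi16(v, 0);
}
```

For lanes holding 1 through 8, two steps leave 1 + 2 + 3 + 4 = 10 in lane 0, while three steps give the full 36.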
(Optional) Since you're on SSE4.1 already, you might as well use one of the goodies it has: the pextr* instructions. They take the index of the lane to extract. You're only interested in the bottom lane (lane 0), because that's where the sum ends up after the vector reduction.

~~/* 32-bit */ sumdot[0] = _mm_extract_epi32(vsum, 0); /* 16-bit */ sumdot[0] = _mm_extract_epi16(vsum, 0);~~

Edit: apparently the compiler doesn't sign-extend the 16-bit word extracted with _mm_extract_epi16. You must convince it to do so yourself.
```
/* 32-bit */
sumdot[0] = (int32_t)_mm_extract_epi32(vsum, 0);
/* 16-bit */
sumdot[0] = (int16_t)_mm_extract_epi16(vsum, 0);
```
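The effect of the cast can be checked in isolation. In this sketch (helper names are hypothetical), `_mm_extract_epi16` returns the lane zero-extended to `int`, so a negative lane comes back as a large positive number unless it is first narrowed to `int16_t`.

```c
#include <emmintrin.h>  /* SSE2: _mm_extract_epi16 (pextrw) */
#include <stdint.h>

/* Hypothetical helper: the raw result is zero-extended, so a lane
 * holding -5 (0xFFFB) is returned as 65531. */
static int extract_lane0_raw(__m128i v)
{
    return _mm_extract_epi16(v, 0);
}

/* Hypothetical helper: narrowing to int16_t restores the sign before
 * the value is widened back to int. */
static int extract_lane0_signed(__m128i v)
{
    return (int16_t)_mm_extract_epi16(v, 0);
}
```

Non-negative lanes are unaffected, which is why the missing cast can go unnoticed until a sum turns negative.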
Edit 2: I found an even better solution! It uses exactly the instruction you need (pmaddwd), and it is identical to the 32-bit code except that the iteration bounds are different and, instead of _mm_mullo_epi16, you use _mm_madd_epi16 in the loop. It needs only two 32-bit vector reduction stages. http://pastebin.com/a9ibkmwp
(Optional) It is good style, though it makes no difference, to use the _mm_setzero_*() functions instead of _mm_set1_*(0).
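The pastebin code is not reproduced here. What follows is an independent sketch of the approach as described, assuming a hypothetical helper `dot_i16` that takes plain pointers and a length, and assuming the running sum fits in 32 bits. The `target("ssse3")` attribute is a GCC/Clang convenience so that `_mm_hadd_epi32` builds without `-mssse3`.

```c
#include <emmintrin.h>  /* SSE2: _mm_madd_epi16, _mm_add_epi32 */
#include <tmmintrin.h>  /* SSSE3: _mm_hadd_epi32 */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: dot product of two int16_t arrays of length n.
 * Assumes the accumulated sum does not overflow a 32-bit lane. */
static __attribute__((target("ssse3")))
long dot_i16(const int16_t *pa, const int16_t *pb, size_t n)
{
    __m128i vsum = _mm_setzero_si128();
    size_t i = 0;

    /* 8 shorts per iteration; "i + 8 <= n" avoids unsigned underflow. */
    for (; i + 8 <= n; i += 8) {
        __m128i va = _mm_loadu_si128((const __m128i *)&pa[i]);
        __m128i vb = _mm_loadu_si128((const __m128i *)&pb[i]);
        /* pmaddwd: eight signed 16x16 products, adjacent pairs summed
         * into four signed 32-bit lanes. */
        vsum = _mm_add_epi32(vsum, _mm_madd_epi16(va, vb));
    }

    /* Two 32-bit reduction stages: 4 lanes -> 2 -> 1. */
    vsum = _mm_hadd_epi32(vsum, vsum);
    vsum = _mm_hadd_epi32(vsum, vsum);
    long total = _mm_cvtsi128_si32(vsum);

    /* Scalar tail for the remaining 0..7 elements. */
    for (; i < n; i++)
        total += (long)pa[i] * pb[i];

    return total;
}
```

Because the products are widened to 32 bits inside the instruction, the accumulator no longer wraps at 16 bits, and the sign-extension issue with `_mm_extract_epi16` does not arise.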
