C - int vs short vectorization
I have the following kernel, vectorized for arrays of integers:
    long valor = 0, i = 0;
    __m128i vsum, vecpi, vecci, vecqci;
    vsum = _mm_set1_epi32(0);
    int32_t * const pa = a->data;
    int32_t * const pb = b->data;
    int sumdot[1];
    for ( ; i < size-3; i += 4) {
        vecpi  = _mm_loadu_si128((__m128i *)&(pa)[i]);
        vecci  = _mm_loadu_si128((__m128i *)&(pb)[i]);
        vecqci = _mm_mullo_epi32(vecpi, vecci);
        vsum   = _mm_add_epi32(vsum, vecqci);
    }
    vsum = _mm_hadd_epi32(vsum, vsum);
    vsum = _mm_hadd_epi32(vsum, vsum);
    _mm_storeu_si128((__m128i *)&(sumdot), vsum);
    for ( ; i < size; i++)
        valor += a->data[i] * b->data[i];
    valor += sumdot[0];
and it works fine. However, if I change the datatype of a and b to short instead of int, shouldn't I use the following code:
    long valor = 0, i = 0;
    __m128i vsum, vecpi, vecci, vecqci;
    vsum = _mm_set1_epi16(0);
    int16_t * const pa = a->data;
    int16_t * const pb = b->data;
    int sumdot[1];
    for ( ; i < size-7; i += 8) {
        vecpi  = _mm_loadu_si128((__m128i *)&(pa)[i]);
        vecci  = _mm_loadu_si128((__m128i *)&(pb)[i]);
        vecqci = _mm_mullo_epi16(vecpi, vecci);
        vsum   = _mm_add_epi16(vsum, vecqci);
    }
    vsum = _mm_hadd_epi16(vsum, vsum);
    vsum = _mm_hadd_epi16(vsum, vsum);
    _mm_storeu_si128((__m128i *)&(sumdot), vsum);
    for ( ; i < size; i++)
        valor += a->data[i] * b->data[i];
    valor += sumdot[0];
This second kernel doesn't work, and I don't know why. I know that the entries of the vectors are the same in the first and the second case (and there is no overflow either). Can someone help me find the mistake?
Thanks.
Here are a few things I see.
- In both the int and the short case, when you store the __m128i to sumdot, you use _mm_storeu_si128 on a target much smaller than 128 bits. This means you've been corrupting memory, and were lucky not to be bitten.
- Related to this, because sumdot is an int[1], in the short case you are storing two shorts into one int and then reading them back as a single int.
- In the short case you're missing one horizontal vector reduction step. Remember that you now have 8 shorts per vector, so you need log_2(8) = 3 vector reduction steps:
    vsum = _mm_hadd_epi16(vsum, vsum);
    vsum = _mm_hadd_epi16(vsum, vsum);
    vsum = _mm_hadd_epi16(vsum, vsum);
- (Optional) Since you're on SSE4.1 already, you might as well use one of the goodies it has: the pextr* instructions. They take the index of the lane to extract. You're only interested in the bottom lane (lane 0), because that's where the sum ends up after the vector reduction.
    /* 32-bit */
    sumdot[0] = _mm_extract_epi32(vsum, 0);
    /* 16-bit */
    sumdot[0] = _mm_extract_epi16(vsum, 0);
  Edit: Apparently the 16-bit word extracted by _mm_extract_epi16 is not sign-extended (it comes back zero-extended in an int). You must convince the compiler to do so yourself:
    /* 32-bit */
    sumdot[0] = (int32_t)_mm_extract_epi32(vsum, 0);
    /* 16-bit */
    sumdot[0] = (int16_t)_mm_extract_epi16(vsum, 0);
  Edit2: I found an even better solution! It uses exactly the instruction we need (pmaddwd), and it is identical to the 32-bit code except that the iteration bounds are different and, instead of _mm_mullo_epi16, you use _mm_madd_epi16 in the loop. This needs only two 32-bit vector reduction stages. http://pastebin.com/a9ibkmwp
- (Optional) It's good style, though it makes no difference, to use the _mm_setzero_*() functions instead of _mm_set1_*(0).
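To make the points above concrete, here is a minimal sketch of a 16-bit dot product. It is not the code behind the pastebin link, and the function names (dot16_scalar, dot16_narrow, dot16_madd) are made up for illustration. Two assumptions are worth flagging: the horizontal reductions use shift/shuffle-and-add rather than _mm_hadd_*, and the 32-bit result is read with _mm_cvtsi128_si32 rather than _mm_extract_epi32, so that the whole thing needs nothing beyond SSE2. The number of reduction steps is the same as with hadd: three for 8 shorts, two for 4 ints.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Scalar reference: products are accumulated in 32 bits. */
static int32_t dot16_scalar(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (int32_t)a[i] * b[i];
    return sum;
}

/* 16-bit accumulation, as in the question. Only correct while every
 * partial sum fits in an int16_t. */
static int32_t dot16_narrow(const int16_t *a, const int16_t *b, size_t n)
{
    __m128i vsum = _mm_setzero_si128();
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m128i va = _mm_loadu_si128((const __m128i *)&a[i]);
        __m128i vb = _mm_loadu_si128((const __m128i *)&b[i]);
        vsum = _mm_add_epi16(vsum, _mm_mullo_epi16(va, vb));
    }
    /* 8 lanes need log2(8) = 3 reduction steps: 8 -> 4 -> 2 -> 1. */
    vsum = _mm_add_epi16(vsum, _mm_srli_si128(vsum, 8));
    vsum = _mm_add_epi16(vsum, _mm_srli_si128(vsum, 4));
    vsum = _mm_add_epi16(vsum, _mm_srli_si128(vsum, 2));
    /* The extracted word is zero-extended; the cast restores the sign. */
    int32_t sum = (int16_t)_mm_extract_epi16(vsum, 0);
    for (; i < n; i++)          /* scalar tail */
        sum += a[i] * b[i];
    return sum;
}

/* pmaddwd version: 16-bit inputs, 32-bit accumulation. */
static int32_t dot16_madd(const int16_t *a, const int16_t *b, size_t n)
{
    __m128i vsum = _mm_setzero_si128();
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m128i va = _mm_loadu_si128((const __m128i *)&a[i]);
        __m128i vb = _mm_loadu_si128((const __m128i *)&b[i]);
        /* Multiplies 8 signed 16-bit pairs and adds adjacent products,
         * producing 4 signed 32-bit partial sums. */
        vsum = _mm_add_epi32(vsum, _mm_madd_epi16(va, vb));
    }
    /* 4 lanes need only 2 reduction steps: 4 -> 2 -> 1. */
    vsum = _mm_add_epi32(vsum, _mm_shuffle_epi32(vsum, _MM_SHUFFLE(1, 0, 3, 2)));
    vsum = _mm_add_epi32(vsum, _mm_shuffle_epi32(vsum, _MM_SHUFFLE(2, 3, 0, 1)));
    /* Read lane 0 directly; no 128-bit store into a small buffer. */
    int32_t sum = _mm_cvtsi128_si32(vsum);
    for (; i < n; i++)          /* scalar tail */
        sum += (int32_t)a[i] * b[i];
    return sum;
}
```

Writing the loop bound as i + 8 <= n instead of i < n - 7 avoids wrap-around when n is an unsigned type smaller than 7. Neither vector function stores a full __m128i into a 4-byte buffer, which was the memory corruption described in the first bullet.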