C - int vs. short vectorization
I have the following vectorized kernel for arrays of integers:

    long valor = 0, i = 0;
    __m128i vsum, vecpi, vecci, vecqci;
    vsum = _mm_set1_epi32(0);
    int32_t * const pa = a->data;
    int32_t * const pb = b->data;
    int sumdot[1];

    for( ; i<size-3 ;i+=4){
        vecpi  = _mm_loadu_si128((__m128i *)&(pa)[i] );
        vecci  = _mm_loadu_si128((__m128i *)&(pb)[i] );
        vecqci = _mm_mullo_epi32(vecpi,vecci);
        vsum   = _mm_add_epi32(vsum,vecqci);
    }

    vsum = _mm_hadd_epi32(vsum, vsum);
    vsum = _mm_hadd_epi32(vsum, vsum);
    _mm_storeu_si128((__m128i *)&(sumdot), vsum);

    for( ; i<size; i++)
        valor += a->data[i] * b->data[i];

    valor += sumdot[0];

and it works fine. However, if I change the datatype of a and b to short instead of int, shouldn't I use the following code?

    long valor = 0, i = 0;
    __m128i vsum, vecpi, vecci, vecqci;
    vsum = _mm_set1_epi16(0);
    int16_t * const pa = a->data;
    int16_t * const pb = b->data;
    int sumdot[1];

    for( ; i<size-7 ;i+=8){
        vecpi  = _mm_loadu_si128((__m128i *)&(pa)[i] );
        vecci  = _mm_loadu_si128((__m128i *)&(pb)[i] );
        vecqci = _mm_mullo_epi16(vecpi,vecci);
        vsum   = _mm_add_epi16(vsum,vecqci);
    }

    vsum = _mm_hadd_epi16(vsum, vsum);
    vsum = _mm_hadd_epi16(vsum, vsum);
    _mm_storeu_si128((__m128i *)&(sumdot), vsum);

    for( ; i<size; i++)
        valor += a->data[i] * b->data[i];

    valor += sumdot[0];

This second kernel doesn't work, and I don't know why. I know that the entries of the vectors are the same in the first and second case (and there is no overflow either). Can someone help me find the mistake?

Thanks.
Here are a few things I see.

- In both the int and the short case, when you store the __m128i vector to sumdot, you use _mm_storeu_si128 on a target much smaller than 128 bits. That store writes 16 bytes, so you've been corrupting memory and were lucky not to be bitten.

    - Related to this: because sumdot is an int[1] even in the short case, you were storing two shorts in one int and then reading them back as a single int.

- In the short case you're missing one horizontal vector reduction step. Remember that you now have 8 shorts per vector, so you need log_2(8) = 3 vector reduction steps:

      vsum = _mm_hadd_epi16(vsum, vsum);
      vsum = _mm_hadd_epi16(vsum, vsum);
      vsum = _mm_hadd_epi16(vsum, vsum);

- (Optional) Since you're on SSE4.1 already, you might as well use one of the goodies it has: the pextr* instructions. They take the index of the lane to extract. You're interested in the bottom lane (lane 0), because that's where the sum ends up after the vector reduction.

      /* 32-bit */
      sumdot[0] = _mm_extract_epi32(vsum, 0);
      /* 16-bit */
      sumdot[0] = _mm_extract_epi16(vsum, 0);

  EDIT: Apparently the compiler doesn't sign-extend the 16-bit word extracted with _mm_extract_epi16. You must convince it to do so yourself:

      /* 32-bit */
      sumdot[0] = (int32_t)_mm_extract_epi32(vsum, 0);
      /* 16-bit */
      sumdot[0] = (int16_t)_mm_extract_epi16(vsum, 0);

  EDIT2: I found an even better solution! It uses exactly the instruction you need (pmaddwd), and it is identical to the 32-bit code except that the iteration bounds are different and you use _mm_madd_epi16 instead of _mm_mullo_epi16 in the loop. Because the accumulator then holds four 32-bit lanes, it needs only two 32-bit vector reduction stages. http://pastebin.com/a9ibkmwp

- (Optional) It's good style, though it makes no difference, to use the _mm_setzero_*() functions instead of _mm_set1_*(0).
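
For reference, the pmaddwd approach from EDIT2 might look roughly like the sketch below. This is not the content of the pastebin link; it is a minimal illustration under a few assumptions: the function name dot_i16 and its signature are hypothetical, the total is assumed to fit in 32 bits, and the two reduction stages use a shuffle plus add instead of _mm_hadd_epi32 so that only SSE2 is required. The result is read with _mm_cvtsi128_si32 rather than a 128-bit store, which avoids writing past a small buffer.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (also provides _MM_SHUFFLE) */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not from the original post: dot product of two
 * int16_t arrays. Assumes the total fits in a 32-bit signed integer. */
int32_t dot_i16(const int16_t *pa, const int16_t *pb, size_t size)
{
    __m128i vsum = _mm_setzero_si128();
    size_t i = 0;

    /* 8 shorts per iteration; "i + 8 <= size" avoids unsigned underflow
     * that "size - 7" would cause for small sizes. */
    for (; i + 8 <= size; i += 8) {
        __m128i va = _mm_loadu_si128((const __m128i *)(pa + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(pb + i));
        /* pmaddwd: multiply 8 pairs of int16, widen to 32 bits, and add
         * adjacent products, giving 4 x int32 partial sums. */
        vsum = _mm_add_epi32(vsum, _mm_madd_epi16(va, vb));
    }

    /* Two 32-bit reduction stages: fold the upper half onto the lower
     * half, then fold lane 1 onto lane 0. */
    vsum = _mm_add_epi32(vsum, _mm_shuffle_epi32(vsum, _MM_SHUFFLE(1, 0, 3, 2)));
    vsum = _mm_add_epi32(vsum, _mm_shuffle_epi32(vsum, _MM_SHUFFLE(2, 3, 0, 1)));

    /* Read lane 0 directly instead of storing all 128 bits. */
    int32_t total = _mm_cvtsi128_si32(vsum);

    /* Scalar tail for the remaining size % 8 elements. */
    for (; i < size; i++)
        total += (int32_t)pa[i] * pb[i];

    return total;
}
```

Since the products are widened to 32 bits before they are accumulated, this variant also tolerates individual products that would not fit in a short, which the _mm_mullo_epi16 version does not.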