I am trying to optimize parts of OpenCV using NEON. Here is the block of code I am working on. (Note: in case it matters, the full source is in "opencvfolder / modules / video / src / lkpyramid.cpp". It is an implementation of an object tracking algorithm.)
for( ; x < colsn; x++ )
{
deriv_type t0 = (deriv_type)(trow0[x+cn] - trow0[x-cn]);
deriv_type t1 = (deriv_type)((trow1[x+cn] + trow1[x-cn])*3 + trow1[x]*10);
drow[x*2] = t0; drow[x*2+1] = t1;
}
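To pin down exactly what the loop computes, here is a minimal self-contained sketch of it as a function. It assumes deriv_type is a signed 16-bit integer and that both temporary rows can be read cn elements before index 0 and cn elements past colsn. The name deriv_row_scalar is made up for illustration and is not an OpenCV function.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Assumption: deriv_type is a signed 16-bit integer (2 bytes).
typedef int16_t deriv_type;

// Hypothetical helper: fills one interleaved derivative row.
// trow0/trow1 must be readable on [-cn, colsn + cn);
// drow must have room for 2 * colsn elements.
void deriv_row_scalar(const deriv_type* trow0, const deriv_type* trow1,
                      deriv_type* drow, int colsn, int cn)
{
    for (int x = 0; x < colsn; x++)
    {
        // Horizontal difference of the first temporary row.
        deriv_type t0 = (deriv_type)(trow0[x + cn] - trow0[x - cn]);
        // 3-10-3 weighted sum of the second temporary row.
        deriv_type t1 = (deriv_type)((trow1[x + cn] + trow1[x - cn]) * 3 + trow1[x] * 10);
        // The two results are stored interleaved: t0, t1, t0, t1, ...
        drow[x * 2] = t0;
        drow[x * 2 + 1] = t1;
    }
}
```

Any vectorized version should produce exactly the same drow contents as this function, which makes it usable as a reference when checking results.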
In this code, deriv_type is 2 bytes wide. Below is the NEON assembly I wrote. With the original code I measure 10-11 frames per second; with the NEON version it is worse, only 5-6 frames per second. I don't know much about NEON, so there are probably a lot of mistakes in this code. What am I doing wrong? Thanks.
for( ; x < colsn; x+=4 )
{
__asm__ __volatile__(
"vld1.16 d2, [%2] \n\t"
"vld1.16 d3, [%3] \n\t"
"vsub.i16 d9, d2, d3 \n\t"
"vld1.16 d4, [%4] \n\t"
"vld1.16 d5, [%5] \n\t"
"vld1.16 d6, [%6] \n\t"
"vmov.i16 d7, #3 \n\t"
"vmov.i16 d8, #10 \n\t"
"vadd.i16 d4, d4, d5 \n\t"
"vmul.i16 d10, d4, d7 \n\t"
"vmla.i16 d10, d6, d8 \n\t"
"vst2.16 {d9,d10}, [%0] \n\t"
:
:"r"(drow+x*2), "r"(drow+x*2+1), "r"(trow0+x+cn), "r"(trow0+x-cn), "r"(trow1+x+cn), "r"(trow1+x-cn), "r"(trow1+x)
:"d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10", "memory"
);
}
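As far as I understand it, the final vst2.16 is what produces the drow[x*2] / drow[x*2+1] layout: it writes the lanes of d9 and d10 alternately, so a single destination pointer is enough. Here is a plain C++ model of that store, only to show the layout I expect. The name store_interleaved2 is made up for illustration; it is not a NEON or OpenCV function.

```cpp
#include <cassert>
#include <cstdint>

// Plain C++ model of a 2-way interleaving store of two registers
// holding four 16-bit lanes each (hypothetical helper, illustration only).
void store_interleaved2(const int16_t a[4], const int16_t b[4], int16_t* dst)
{
    for (int lane = 0; lane < 4; lane++)
    {
        dst[lane * 2] = a[lane];      // lanes of the first register go to even slots
        dst[lane * 2 + 1] = b[lane];  // lanes of the second register go to odd slots
    }
}
```

With a holding the four t0 values and b holding the four t1 values, dst ends up in the same order as the scalar loop writes drow.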
EDIT
This is a version using intrinsics. It is almost the same as before, and it is still slow.
const int16x8_t vk3 = { 3, 3, 3, 3, 3, 3, 3, 3 };
const int16x8_t vk10 = { 10, 10, 10, 10, 10, 10, 10, 10 };
for( ; x < colsn; x+=8 )
{
int16x8x2_t loaded;
int16x8_t t0a = vld1q_s16(&trow0[x + cn]);
int16x8_t t0b = vld1q_s16(&trow0[x - cn]);
loaded.val[0] = vsubq_s16(t0a, t0b);
loaded.val[1] = vld1q_s16(&trow1[x + cn]);
int16x8_t t1b = vld1q_s16(&trow1[x - cn]);
int16x8_t t1c = vld1q_s16(&trow1[x]);
loaded.val[1] = vaddq_s16(loaded.val[1], t1b);
loaded.val[1] = vmulq_s16(loaded.val[1], vk3);
loaded.val[1] = vmlaq_s16(loaded.val[1], t1c, vk10);
vst2q_s16(&drow[x*2], loaded); // interleaved store: drow[2x] = t0, drow[2x+1] = t1
}
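For completeness, here is a minimal sketch of the intrinsics loop as a standalone function, with an interleaved store and a scalar tail so that rows whose length is not a multiple of 8 are not overrun. It assumes deriv_type is int16_t and that the rows are padded by cn elements on both sides. The name deriv_row_simd is hypothetical. The NEON path is only compiled when the compiler defines __ARM_NEON; elsewhere the scalar loop does all the work. This is only meant to show the computation I am after, not a claim that it is fast.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Hypothetical helper, not part of OpenCV.
// trow0/trow1 must be readable on [-cn, colsn + cn);
// drow must have room for 2 * colsn elements.
void deriv_row_simd(const int16_t* trow0, const int16_t* trow1,
                    int16_t* drow, int colsn, int cn)
{
    int x = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    const int16x8_t vk3 = vdupq_n_s16(3);
    const int16x8_t vk10 = vdupq_n_s16(10);
    // Loop only while a full vector of 8 elements still fits in the row.
    for (; x <= colsn - 8; x += 8)
    {
        int16x8x2_t out;
        // t0 = trow0[x + cn] - trow0[x - cn]
        out.val[0] = vsubq_s16(vld1q_s16(trow0 + x + cn), vld1q_s16(trow0 + x - cn));
        // t1 = (trow1[x + cn] + trow1[x - cn]) * 3 + trow1[x] * 10
        int16x8_t sum = vaddq_s16(vld1q_s16(trow1 + x + cn), vld1q_s16(trow1 + x - cn));
        out.val[1] = vmlaq_s16(vmulq_s16(sum, vk3), vld1q_s16(trow1 + x), vk10);
        // Interleaved store: t0, t1, t0, t1, ...
        vst2q_s16(drow + x * 2, out);
    }
#endif
    // Scalar tail (and the whole row when NEON is not available).
    for (; x < colsn; x++)
    {
        drow[x * 2] = (int16_t)(trow0[x + cn] - trow0[x - cn]);
        drow[x * 2 + 1] = (int16_t)((trow1[x + cn] + trow1[x - cn]) * 3 + trow1[x] * 10);
    }
}
```

Both paths wrap to 16 bits in the same way, so the output should match the original scalar loop element for element.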