Fix the 3-channel bug in matrix operations; perf and bug fixes for LUT, haardetect, convertC3C4, resize, warpaffine, copytom, settom; add convolve; remove stereo
Add 3-channel support; add a fast transfer path between CPU and GPU for aligned data