Here is an example array:
alignas(16) double c[voiceSize][blockSize];
This is the function I am trying to optimize:
inline void Process(int voiceIndex, int blockSize) {
    double *pC = c[voiceIndex];
    double value = start + step * delta;
    double deltaValue = rate * delta;
    for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
        pC[sampleIndex] = value + deltaValue * sampleIndex;
    }
}
And here is my attempt using intrinsics (SSE2):
inline void Process(int voiceIndex, int blockSize) {
    double *pC = c[voiceIndex];
    double value = start + step * delta;
    double deltaValue = rate * delta;
    __m128d value_add = _mm_set1_pd(value);
    __m128d deltaValue_mul = _mm_set1_pd(deltaValue);
    for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex += 2) {
        __m128d result_mul = _mm_setr_pd(sampleIndex, sampleIndex + 1);
        result_mul = _mm_mul_pd(result_mul, deltaValue_mul);
        result_mul = _mm_add_pd(result_mul, value_add);
        _mm_store_pd(pC + sampleIndex, result_mul);
    }
}
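Unfortunately, it is slower than the original "scalar" code, even when the scalar code is auto-optimized by the compiler.
In your opinion, where is the bottleneck? What am I doing wrong?
I am using MSVC, Release/x86, with the /O2 optimization flag (Favor fast code).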
EDIT: after doing the following, as suggested by @wim, performance seems to be better than the C version:
inline void Process(int voiceIndex, int blockSize) {
    double *pC = c[voiceIndex];
    double value = start + step * delta;
    double deltaValue = rate * delta;
    __m128d value_add = _mm_set1_pd(value);
    __m128d deltaValue_mul = _mm_set1_pd(deltaValue);
    __m128d sampleIndex_acc = _mm_set_pd(-1.0, -2.0);
    __m128d sampleIndex_add = _mm_set1_pd(2.0);
    for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex += 2) {
        sampleIndex_acc = _mm_add_pd(sampleIndex_acc, sampleIndex_add);
        __m128d result_mul = _mm_mul_pd(sampleIndex_acc, deltaValue_mul);
        result_mul = _mm_add_pd(result_mul, value_add);
        _mm_store_pd(pC + sampleIndex, result_mul);
    }
}
Why is _mm_setr_pd so expensive?
Start with something like __m128d sampleIndex_vec = _mm_set_pd(-1.0,-2.0); and __m128d sampleIndex_add = _mm_set1_pd(2.0); outside the loop. Inside the loop, you can replace __m128d result_mul = _mm_setr_pd(sampleIndex, sampleIndex + 1); with sampleIndex_vec = _mm_add_pd(sampleIndex_vec, sampleIndex_add); and result_mul = sampleIndex_vec. That way you get rid of the nasty _mm_setr_pd(sampleIndex, sampleIndex + 1);. (Untested.) - wim
[...] sampleIndex counter, which is much more efficient than converting two integers to double on every iteration. Use gcc -S to see the difference between the two versions. - wim
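To make the point of the comments concrete, here is a minimal side-by-side sketch of the two loop styles. It is only an illustration: the function names RampSetr and RampAccum are hypothetical, the ramp parameters are passed as arguments instead of being read from the members used in Process, and unaligned stores are used so the buffers need no particular alignment. Both functions assume blockSize is even.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>

// Hypothetical variant A: rebuild the index vector from the integer counter.
// Each iteration converts two ints to double and packs them into one register.
inline void RampSetr(double *out, int blockSize, double value, double deltaValue) {
    assert(blockSize % 2 == 0);  // assumption: two samples per iteration
    const __m128d value_add = _mm_set1_pd(value);
    const __m128d deltaValue_mul = _mm_set1_pd(deltaValue);
    for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex += 2) {
        __m128d idx = _mm_setr_pd(sampleIndex, sampleIndex + 1);  // int -> double x2
        __m128d result = _mm_add_pd(_mm_mul_pd(idx, deltaValue_mul), value_add);
        _mm_storeu_pd(out + sampleIndex, result);
    }
}

// Hypothetical variant B: keep the indices as doubles and advance them by 2.0.
// The loop body contains only SIMD add/mul/store, with no conversions.
inline void RampAccum(double *out, int blockSize, double value, double deltaValue) {
    assert(blockSize % 2 == 0);  // assumption: two samples per iteration
    const __m128d value_add = _mm_set1_pd(value);
    const __m128d deltaValue_mul = _mm_set1_pd(deltaValue);
    const __m128d idx_step = _mm_set1_pd(2.0);
    __m128d idx = _mm_setr_pd(0.0, 1.0);  // low lane = 0.0, high lane = 1.0
    for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex += 2) {
        __m128d result = _mm_add_pd(_mm_mul_pd(idx, deltaValue_mul), value_add);
        _mm_storeu_pd(out + sampleIndex, result);
        idx = _mm_add_pd(idx, idx_step);  // next pair of indices
    }
}
```

Since small integers are exactly representable as doubles, stepping the index vector by 2.0 introduces no rounding error, so both variants feed identical inputs to the same multiply and add and should write the same values. Which one is faster on a given compiler is best checked by timing it and by inspecting the generated assembly, as suggested above.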