Using break and continue in an OpenCL kernel

Question

c opencl

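How can these be implemented in an OpenCL kernel? Are 'return;' and 'break;' equivalent?

I am using OpenCL 1.2.

I want to use 3 nested for loops to iterate over a nested array of typedef structs.

EDIT

I realised I need to show some code to make the question easier to understand.

In the kernel...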

typedef struct tag_sfextras
{
    float *high;
    float *low;
}sfextras;

typedef struct tag_sdirection
{
    int time;
    float result;
    sfextras *fextras;
}sdirection;

__kernel void Call(sdirection *_direction,
                   int _index,
                   int _start,
                   int _stop,
                   __global float *_result)
{

    float _sum = 0.0f;

    if (_index > 1)
    {
        _result[0] = 0.0f;

        int i = get_global_id(0);

        if (_direction[i].time >= _stop)
        {
            break;//or return?...                     
        }

        if (_direction[i].time < _start)
        {
            continue;// what to put here?...          
        }
        else
        {
            _start = _direction[i].time + (1440 * 60);
        }

        int d = get_global_id(1);
        int f = get_global_id(2);

        float _fextras_weight = 0.0f;// need to zeroize on each inner loop (for f)

        _fextras_weight += (float)pow(_direction[_index - 1].fextras[d].high[f] - _direction[i].fextras[d].high[f], 2.0f);
        _fextras_weight += (float)pow(_direction[_index - 1].fextras[d].low[f] - _direction[i].fextras[d].low[f], 2.0f);

        _result[0] += _fextras_weight*_direction[i].result;
        _sum += _fextras_weight;

    }

    if (_sum > 0.0f)
    {
        _result[0] /= _sum;
    }
}

On the host (the code I am trying to replicate in the kernel, for efficiency):

            if(_direction_index > 1)
          {
              _fextras = 0.0f;
              for(int i=0;i<_direction_index-1;i++)
                {
                    if(_direction[i].time >= _stop)
                      {
                          break;
                      }

                    if(_direction[i].time < _start)
                      {
                          continue;
                      }
                    else
                      {
                          _direction_start = _direction[i].time + (1440*60);
                      }

                    for(int d=0;d<_DIRECTION;d++)
                      {
                          for(int f=0;f<_FEXTRAS;f++)
                            {
                                float _fextras_weight = 0.0f;

                                _fextras_weight += (float)pow(_direction[_direction_index-1].fextras[d].high[f]-_direction[i].fextras[d].high[f],2.0f);
                                _fextras_weight += (float)pow(_direction[_direction_index-1].fextras[d].low[f]-_direction[i].fextras[d].low[f],2.0f);

                                _fextras += _fextras_weight*_direction[i].result;
                                _sum += _fextras_weight;
                            }
                      }
                }

               if(_sum > 0.0f)
                 {
                    _fextras /= _sum;
                 }
          }

- ssn

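(As always) show your code. - abelenky

I suggest you read a book on C and look up break, return, and goto (whatever anyone says, goto can be used to jump out of nested loops). - Paul Ogilvie

Your question is about the C language. - Paul Ogilvie

@PaulOgilvie Not quite... OpenCL 1.2 - ssn

A function should not have a final ; after its last }. - abelenky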

1 Answer

Content provided by Stack Overflow.

- huseyin tugrul buyukisik · Accepted Answer

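Cancelling all other OpenCL threads would cause undefined behaviour, because many of them may be in the middle of reading or writing global/local memory, and it could also kill threads that are still running (stopping all the other/remaining threads of the kernel). That is probably why OpenCL has nothing of the sort.

However, you can add an output array to which each thread writes its last state. If any element holds a "returned" code, you then check for "after returned" codes to discard those threads' results and accept only the "before returned" results. This also requires atomic operations at the output stage, so it is slower, which is bad.

You can, however, safely return from an individual kernel instance (work-item):

The following code compiles fine on an HD7870 and an R7-240 and exits early via "return" (ending kernel execution for some threads, but not all) without errors, because "return" is not one of the restrictions OpenCL imposes.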

__kernel void rarToVideo(__global int * p,__global char * c)
{
    ...
    if (tmp)
    {
        foo=1;
    }
    else
    {
        return;
    }

    ...
}

This was with the OpenCL 1.2 headers for C++.
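To relate this to the loop in the question, here is a minimal sketch in plain C (not OpenCL) that emulates one work-item per function call. The names `work_item` and `run_range` and the placeholder workload are hypothetical. Since each work-item runs what used to be one iteration of the outer loop, `return` takes over the role `continue` had on the host. It does not stop any other work-item, so it only matches the host's `break` if every later element would fail the same test anyway, for instance if the times are sorted in ascending order (an assumption, not something the question states).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one kernel instance (work-item).
   gid plays the role of get_global_id(0). */
void work_item(size_t gid, const int *time, int start, int stop, float *out)
{
    if (time[gid] >= stop) {
        return; /* the host loop used break here; only this work-item ends */
    }
    if (time[gid] < start) {
        return; /* the host loop used continue here */
    }
    out[gid] = 1.0f; /* placeholder for the real per-element work */
}

/* Emulates the NDRange: every index is visited. A real device gives no
   ordering guarantee between work-items; this loop is sequential only
   because it is an emulation. Returns how many items did real work. */
int run_range(const int *time, size_t n, int start, int stop, float *out)
{
    assert(time != NULL && out != NULL);
    int processed = 0;
    for (size_t gid = 0; gid < n; ++gid) {
        out[gid] = 0.0f;
        work_item(gid, time, start, stop, out);
        if (out[gid] != 0.0f) {
            ++processed;
        }
    }
    return processed;
}
```

With times {5, 10, 20, 30, 40}, start 10 and stop 30, only the elements with times 10 and 20 do any work; the other three work-items return early without affecting them.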

But if you still need a fake return, and one thread does not affect the input/output of the others, something like this will help:

  // beginning phase of this thread
  if(globalAtomicElement[0]>=RETURNED)
  {
      // finished this thread so it doesn't waste ALU/LD-ST/....
      // leaves room for other wavefronts at least
      outputState[threadId]=NOT_STARTED;
      return;
  }
   ...
    ...
  // ending phase of this thread
  // localState has information if this thread needed a "return"
  // 0=NOT_RETURNED
  // 1=RETURNED
  // 2=NOT_STARTED
  lastResult=atomic_add(globalAtomicElement,localState);
  if(lastResult>=RETURNED)
  {
       outputState[threadId]=AFTER_RETURNED; // you omit
                                             // this thread's result
                                             // because another thread
                                             // pretends to stop all

        // so this thread wasted cycles, but don't worry:
        // they would be wasted anyway, even if you don't use
        // a core, on GCN 1.0 - GCN 3.0 architectures.
        // a core always spins within a compute unit if a
        // core/shader is working on something.
        // the Polaris architecture will have the ability
        // to shut down unused cores, so that will not be
        // a problem for power consumption either.
  }
  else if(lastResult==NOT_RETURNED && thisThreadReturned)
  {
       outputState[threadId]=RETURNED; // this is returning
                                       // thread
                                       //(finishing,pretending to stop all)
  }
  else if(lastResult==NOT_RETURNED && !thisThreadReturned)
  {
       outputState[threadId]=BEFORE_RETURNED; // you accept this thread's
                                              // results because no thread
                                              // has ever stopped and this
                                              // thread surely computed
                                              //everything before that
  }

Then, on the host side, you keep only the "BEFORE_RETURNED" and "RETURNED" results and discard the "AFTER_RETURNED" ones.
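A minimal host-side sketch of that filtering step in plain C follows. The function name `sum_accepted` is hypothetical, and the numeric values of BEFORE_RETURNED and AFTER_RETURNED are assumptions; the answer's comments only fix NOT_RETURNED=0, RETURNED=1 and NOT_STARTED=2. Summing the accepted results is just one example of what the host might do with them.

```c
#include <assert.h>
#include <stddef.h>

enum thread_state {
    NOT_RETURNED = 0,    /* values 0..2 come from the answer's comments */
    RETURNED = 1,
    NOT_STARTED = 2,
    BEFORE_RETURNED = 3, /* assumed value */
    AFTER_RETURNED = 4   /* assumed value */
};

/* Sums the results of work-items whose state is BEFORE_RETURNED or
   RETURNED and stores how many were accepted in *accepted.
   AFTER_RETURNED and NOT_STARTED entries are skipped. */
float sum_accepted(const int *output_state, const float *result,
                   size_t n, size_t *accepted)
{
    assert(output_state != NULL && result != NULL && accepted != NULL);
    float sum = 0.0f;
    *accepted = 0;
    for (size_t i = 0; i < n; ++i) {
        if (output_state[i] == BEFORE_RETURNED || output_state[i] == RETURNED) {
            sum += result[i];
            ++*accepted;
        }
    }
    return sum;
}
```

Here `output_state` and `result` stand for the arrays read back from the device after the kernel finishes.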

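In OpenCL 2.0, you could try the following:

- start with 1 thread
- return? no = spawn 2 threads, yes = stop
- compute recursively: return? no = spawn 2 threads, yes = stop
- all the work is completed recursively

This can save at least half of the threads (or 1/4, or 1/8, ... or 1/N), but it is less efficient because only 2 threads are spawned at a time.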