Problem Analysis for the 2nd TPU Programming Contest: Transpose

SOPHGO

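Function description:

transpose permutes the axes of an array. A multidimensional array has several axes (the commonly used 4D tensor, for example, has four index dimensions), and transpose rearranges the data according to a given permutation of those axis indices.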

Parameter structure passed from the host to the device:

typedef struct {
    int N, C, H, W;
    int order[4];
    unsigned long long output_addr;
    unsigned long long input_addr;
} __attribute__((packed)) param_t;

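Notes:

N, C, H, W: the sizes of the four dimensions;

order[4]: the axis permutation. For example, order = {0, 2, 3, 1} changes the axis order from {0, 1, 2, 3} to {0, 2, 3, 1}, i.e. the tensor goes from N*C*H*W to N*H*W*C;

output_addr: address of the output data (global memory);

input_addr: address of the input data (global memory);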
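To make the meaning of order concrete, here is a minimal sketch of how the output shape follows from the input shape. The helper name transposed_shape is hypothetical and is not part of the contest code; it only assumes that output dimension i takes its size from input dimension order[i], as in the example above.

```cpp
#include <array>
#include <cassert>

// Hypothetical helper (not from the contest code): output dimension i
// takes its size from input dimension order[i].
std::array<int, 4> transposed_shape(const std::array<int, 4>& shape,
                                    const std::array<int, 4>& order) {
    std::array<int, 4> out{};
    for (int i = 0; i < 4; ++i) {
        assert(order[i] >= 0 && order[i] < 4);  // each entry must name a valid axis
        out[i] = shape[order[i]];
    }
    return out;
}
```

With shape = {2, 3, 4, 5} (N, C, H, W) and order = {0, 2, 3, 1}, the helper returns {2, 4, 5, 3}, which is the N*H*W*C layout.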

What transpose computes:

See the transpose function in "okkernel/host/transpose.cpp". Its result is compared with the device output to determine whether the device result is correct:

template <typename T>
void transpose(const T *input, T *buffer,
               const int *input_shape,
               const int *trans_order,
               const int *trans_shape,
               const int shape_dims) {
    (void)shape_dims;  // unused: this reference version handles exactly 4 dimensions
    for (int n = 0; n < input_shape[0]; n++) {
        for (int c = 0; c < input_shape[1]; c++) {
            for (int h = 0; h < input_shape[2]; h++) {
                for (int w = 0; w < input_shape[3]; w++) {
                    int nchw[4] = {n, c, h, w};
                    // Linear index of this element in the transposed layout.
                    int dst_idx =
                        nchw[trans_order[0]] * trans_shape[1] * trans_shape[2] * trans_shape[3] +
                        nchw[trans_order[1]] * trans_shape[2] * trans_shape[3] +
                        nchw[trans_order[2]] * trans_shape[3] +
                        nchw[trans_order[3]];
                    // Linear index of this element in the original NCHW layout.
                    int src_idx =
                        n * input_shape[1] * input_shape[2] * input_shape[3] +
                        c * input_shape[2] * input_shape[3] +
                        h * input_shape[3] +
                        w;
                    buffer[dst_idx] = input[src_idx];
                }
            }
        }
    }
}

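Data is stored in memory in N-C-H-W order. The function traverses all n*c*h*w elements, computes for each element input[src_idx] its index dst_idx after the transpose, and writes it to buffer[dst_idx], which completes the transpose.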
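The index computation described above can be checked on a single element. The following is a small sketch, not taken from the contest code: dst_index is a hypothetical helper that evaluates the same dst_idx expression in nested (Horner) form, assuming that the transposed shape satisfies trans_shape[i] = shape[order[i]].

```cpp
#include <cassert>

// Hypothetical helper (not from the contest code): destination index of the
// input element (n, c, h, w) after transposing by `order`.
int dst_index(const int shape[4], const int order[4],
              int n, int c, int h, int w) {
    const int nchw[4] = {n, c, h, w};
    int idx = 0;
    for (int i = 0; i < 4; ++i) {
        assert(order[i] >= 0 && order[i] < 4);
        // Horner form of i0*d1*d2*d3 + i1*d2*d3 + i2*d3 + i3,
        // where d_i = shape[order[i]] and i_i = nchw[order[i]].
        idx = idx * shape[order[i]] + nchw[order[i]];
    }
    return idx;
}
```

For shape = {2, 3, 4, 5} and order = {0, 2, 3, 1}, the element at (n, c, h, w) = (0, 1, 2, 3) has src_idx = 33 in the input and lands at dst_idx = 40 in the output; with the identity order {0, 1, 2, 3} it stays at 33.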

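Example approach:

Consider a tensor with NCHW = (2, 5, h, w); the diagram on the left shows how it is laid out once sent to the TPU. If order[4] = {1, 0, 2, 3}, the output tensor has shape (5, 2, h, w). The simplest approach is to first use gdmaS2L to move the data with index N=0 into the TPU, then use gdmaL2S to move it back out, choosing the strides so that each piece lands at the correct output position, and then process the data with index N=1 in the same way.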
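The slice-by-slice idea can be modeled on the host to see which strides are needed. This is only a sketch under stated assumptions: std::memcpy stands in for the S2L and L2S transfers, the function name swap_nc_by_slices is hypothetical, and no real device API is called. For order = {1, 0, 2, 3}, channel c of slice n must be written at element offset (c*N + n)*H*W, i.e. the output stride between consecutive channels is N*H*W.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Host-side model only (hypothetical function): std::memcpy stands in for the
// S2L / L2S DMA transfers. Handles order = {1, 0, 2, 3}: (N,C,H,W) -> (C,N,H,W).
std::vector<float> swap_nc_by_slices(const std::vector<float>& in,
                                     int N, int C, int H, int W) {
    const std::size_t hw = static_cast<std::size_t>(H) * W;
    assert(in.size() == static_cast<std::size_t>(N) * C * hw);
    std::vector<float> out(in.size());
    std::vector<float> local(C * hw);  // stands in for TPU local memory
    for (int n = 0; n < N; ++n) {
        // "S2L": load the contiguous slice with batch index n.
        std::memcpy(local.data(), in.data() + n * C * hw, C * hw * sizeof(float));
        // "L2S": store channel by channel; the channel stride in the output
        // is N*H*W, and slice n starts at offset n*H*W.
        for (int c = 0; c < C; ++c) {
            std::memcpy(out.data() + (static_cast<std::size_t>(c) * N + n) * hw,
                        local.data() + c * hw, hw * sizeof(float));
        }
    }
    return out;
}
```

On the device, the inner loop would presumably be expressed as a single strided store rather than C separate copies; that mapping onto the actual DMA interface is not shown here.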
