https://vengineer.hatenablog.com/entry/71292598

Vengineerの戯言 (Vengineer's Ramblings): Twitter, Slideshare
Welcome to the world of SystemVerilog. It all began with the release of SystemC v0.9.

GUINNESS, the BNN (binarized neural network) tool from Professor Nakahara of Tokyo Institute of Technology, has been released on GitHub.

@muojp has already uploaded a Docker image.

The documentation is stored on Dropbox. Clicking here downloads the PDF.

The supported environment is Ubuntu 14.04 or 16.04. This seems to have been chosen so that Xilinx SDSoC can be used.

It runs on Python 2.6+, and CUDA 8.0 (with the cuDNN library) is required if you want to use an NVIDIA GPU.
The framework is Chainer 1.23.0 or 1.24.0.

SDSoC is 2016.4 (or 2017.1), although 2017.2 was released just the other day...
The supported FPGA boards are the Xilinx ZC702 and ZCU102 and the Digilent Zedboard and Zybo.
PYNQ support is also planned, which suggests it should run on the Arty-Z7 as well.

The GUI uses PyQt4. For graph plotting and related tasks, matplotlib, python-opencv2, numpy, and scipy are also required.

You select the CNN Specification in the GUINNESS GUI, but I wonder whether only the models supported in advance are available.

Looking at the source code, guinness.py, the following models appear to be supported:

　・LeNet5
　・TinyCNN
　・VGG9ave
　・VGG11ave
　・VGG16ave
　・VGG19ave

After loading a model, you can delete the Layers that make it up or add new Layers.
The supported Layers are:

　・Conv (Int)
　・Conv (Bin)
　・Max Pool
　・Ave Pool
　・Dense

Once you have created and saved the model, you train it with the training data and training labels.
The following Optimizers are provided:

　・SGD
　・Adam

Using a GPU is available as an option.

After training, you can select a board for mapping to the FPGA.

　・Zed
　・Zybo
　・VC702 <= is this a typo for ZC702?
　・ZCU102

These are the boards that SDSoC supports out of the box.
Once SDSoC platforms for PYNQ and Arty-Z7 become available, I think simply adding them here would be enough.

GUINNESS uses Xilinx SDSoC as its backend, so for
any board that carries a Zynq or Zynq UltraScale+ MPSoC,
all you need to do is create an SDSoC platform for that board.

The generated files are:

　・HLS directory
　・sdsoc directory
　・config.pickle
　・eval.py
　・net3.py
　・temp.model
　・temp_log.csv
　・test1.proj

The following C++ source files for SDSoC are generated in the sdsoc directory.

　・Makefile
　・cnn.cpp
　・main.cpp
　・socket_main.cpp

These files are used in SDSoC to generate the bitstream for the selected board.

Everything up to this point is what the documentation above describes.

From here on, it is my specialty: source code analysis.

The C++ code for SDSoC is generated in this file.

It generates the following files from templates:
template_Makefile is the template for project/sdsoc/Makefile,
template_cpp_r7_bcnn.cpp is the template for project/sdsoc/cnn.cpp,
template_cpp_r7_main.cpp is the template for project/sdsoc/main.cpp, and
template_cpp_r7_socket_main.cpp is the template for project/sdsoc/socket_main.cpp.

Looking at template_Makefile, we find:

PLATFORM = (TARGET_BOARD)
SDSFLAGS = -sds-pf ${PLATFORM} \
	-sds-hw BinCNN (CNN_C_SOURCE) -sds-end \
	-poll-mode 1

CC = sds++ ${SDSFLAGS}

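The target board selected in the GUINNESS GUI becomes the SDSoC platform (the -sds-pf option).
The -sds-hw option names the hardware function (BinCNN) together with the generated CNN source file, and the sds++ command then generates the bitstream.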
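
To make the template step concrete, here is a minimal Python 3 sketch of filling the two placeholders seen in template_Makefile. It assumes plain string replacement. The function name render_makefile and the sample values "zed" and "cnn.cpp" are my own illustration, not taken from the GUINNESS source.

```python
# Hypothetical sketch: fill a Makefile template by plain string replacement.
# The placeholders (TARGET_BOARD) and (CNN_C_SOURCE) appear in template_Makefile;
# the function name and the sample values are assumptions for illustration.

TEMPLATE = """PLATFORM = (TARGET_BOARD)
SDSFLAGS = -sds-pf ${PLATFORM} \\
\t-sds-hw BinCNN (CNN_C_SOURCE) -sds-end \\
\t-poll-mode 1

CC = sds++ ${SDSFLAGS}
"""


def render_makefile(template, board, cnn_source):
    """Return the template text with both placeholders substituted."""
    return (template
            .replace("(TARGET_BOARD)", board)
            .replace("(CNN_C_SOURCE)", cnn_source))


makefile_text = render_makefile(TEMPLATE, "zed", "cnn.cpp")
print(makefile_text)
```

Because the placeholders are written with parentheses rather than ${...}, they cannot collide with the Makefile's own variable references, so a simple replace is sufficient in this sketch.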

According to the SDSoC documentation, poll-mode can be set to 0 or 1.
A value of 1 enables DMA polling mode, and 0 (the default) enables interrupt mode.

BinCNN, the function named with -sds-hw, is found near the end of template_cpp_r7_bcnn.cpp.
It looks like this:

#ifdef __SDSCC__
#pragma SDS data access_pattern(t_bin_convW: SEQUENTIAL)
#pragma SDS data access_pattern(t_BNFb: SEQUENTIAL)
#pragma SDS data access_pattern(t_in_img: SEQUENTIAL)
#pragma SDS data zero_copy(t_bin_convW[0:(WEIGHT_SIZ)])
#pragma SDS data zero_copy(t_BNFb[0:(BIAS_SIZ)])
#pragma SDS data zero_copy(t_in_img[0:(IMGSIZ)*(IMGSIZ)])
#endif
void BinCNN(
#ifdef __SDSCC__
        int *t_bin_convW,
        int *t_BNFb,
        ap_int<64> t_in_img[(IMGSIZ)*(IMGSIZ)],
        int fc_result[(OUT_DENSE_SIZ)],
        int init
#else 
        int t_bin_convW[(WEIGHT_SIZ)],
        int t_BNFb[(BIAS_SIZ)],
        ap_int<64> t_in_img[(IMGSIZ)*(IMGSIZ)],
        int fc_result[(OUT_DENSE_SIZ)],
        int init
#endif
)
{
	if( init == 1)
		setup( t_bin_convW, t_BNFb);
	else
		kernel( t_in_img, fc_result);
}

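The init argument switches between the setup function and the kernel function.
Setting init to 1 runs the setup function, which initializes the weight and bias memories.
Setting init to 0 runs the kernel function, which performs inference on the input image data.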
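
The two-phase calling convention can be modeled in a few lines of Python 3. This is only a toy model of the control flow: the class name BinCNNModel is my own, and the "inference" is a placeholder dot product, not the binarized CNN that the real kernel function computes.

```python
# Toy model of the init-flag protocol: one entry point, two behaviors.
# Everything here is an illustrative assumption except the init == 1 / 0 split.

class BinCNNModel:
    def __init__(self):
        self.weights = None  # stands in for the on-chip weight memory
        self.biases = None   # stands in for the on-chip bias memory

    def setup(self, weights, biases):
        # init == 1: copy the parameters into the (simulated) local memories
        self.weights = list(weights)
        self.biases = list(biases)

    def kernel(self, image):
        # init == 0: placeholder computation using the stored parameters
        if self.weights is None:
            raise RuntimeError("called with init=0 before setup")
        return sum(w * x for w, x in zip(self.weights, image)) + sum(self.biases)

    def __call__(self, weights, biases, image, init):
        if init == 1:
            self.setup(weights, biases)
            return None
        return self.kernel(image)


bin_cnn = BinCNNModel()
bin_cnn([1, -1, 1], [0], [0, 0, 0], init=1)           # load parameters once
result = bin_cnn([1, -1, 1], [0], [2, 3, 4], init=0)  # then run inference
print(result)
```

The point of the sketch is that the parameters persist between calls, so they only have to be transferred once before any number of inference calls.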

In the main function in template_cpp_r7_main.cpp, we find:

    printf("setup... \n");
    BinCNN( t_bin_convW, t_BNFb, t_tmp_img, fc_result, 1);

Here the last argument, init, is set to 1 to load the weights and biases.

In the latter half, the part below runs inference the number of times given by the program's second argument.

    printf("Inference %d times ... ", cnt);
    for( i = 0; i < cnt; i++){
        BinCNN( t_bin_convW, t_BNFb, t_tmp_img, fc_result, 0);
    }
    printf("OK\n");

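template_cpp_r7_socket_main.cpp appears to communicate with a server over a socket, receive image data, and run inference on it.
Using this program together with https://github.com/HirokiNakahara/GUINNESS-Tutorial/blob/master/cnn_capture.py,
images captured by the PC's camera are inferred on the FPGA board and the results are sent back to the PC.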

Addendum, 2017.09.09

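The part that generates the code for each CNN layer is the following section of gen_cpp_code_v3.py.
It calls the execution code for each layer according to layer_type.
In circuit terms, this presumably ends up as a selector.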

Quote:
#(DEF_CNN_LAYER)
from collections import Counter
def_cnn_layer = ''

bn_idx = 0
dense_idx = 0
counter = Counter(initial_options)
for layer_type, cnt in counter.items():
        if layer_type == 0 and cnt > 0:
                for i in range(len(initial_options)):
                        if initial_options[i] == 0:
                                def_cnn_layer += '            case %d:\n' % i
                def_cnn_layer += '            int_conv2d_layer<bit_64, bit_%d, 64, %d, %d, %d>\n            ( in_img, fb_tmp, conv0W, b0_BNFb);\n            break;\n' % (max_bconv_width,max_bconv_width,int(infmap_siz[0]),int(infmap_siz[0]))

        elif layer_type == 1 and cnt > 0:
                for i in range(len(initial_options)):
                        if initial_options[i] == 1:
                                def_cnn_layer += '            case %d:\n' % i
                def_cnn_layer += '            bin_conv2d_pipeline(fb_tmp,bin_layer_idx,fsize[layer],n_in[layer],n_out[layer]);\n            bin_layer_idx++;\n            break;\n'

        elif layer_type == 2 and cnt > 0:
                for i in range(len(initial_options)):
                        if initial_options[i] == 2:
                                def_cnn_layer += '            case %d:\n' % i
                def_cnn_layer += '            max_pooling_layer<bit_%d, %d, %d>(fb_tmp);\n            break;\n' % (max_bconv_width,int(imgsiz),int(infmap_siz[i]))

        elif layer_type == 3 and cnt > 0:
                for i in range(len(initial_options)):
                        if initial_options[i] == 3:
                                def_cnn_layer += '            case %d:\n' % i
                                def_cnn_layer += '            {\n'
                                def_cnn_layer += '                ap_int<%d>mask = 0x1;\n' % int(n_in_fmaps[i])
                                def_cnn_layer += '                for( of = 0; of < %d; of++){\n' % int(n_ou_fmaps[i])
                                def_cnn_layer += '                      ap_int<11> tmp = 0;\n'
                                def_cnn_layer += '                      for( y = 0; y < %d; y++){\n' % int(infmap_siz[i])
                                def_cnn_layer += '                              for( x = 0; x < %d; x++){\n' % int(infmap_siz[i])
                                def_cnn_layer += '                                      if( (fb_tmp[y][x] & mask) != 0)\n'
                                def_cnn_layer += '                                              tmp++;\n'
                                def_cnn_layer += '                              }\n'
                                def_cnn_layer += '                      }\n'
                                def_cnn_layer += '                      if( tmp >= %d*%d/2)\n' % (int(infmap_siz[i]),int(infmap_siz[i]))
                                def_cnn_layer += '                              fc_tmp[of] = 1;\n'
                                def_cnn_layer += '                      else\n'
                                def_cnn_layer += '                              fc_tmp[of] = 0;\n'
                                def_cnn_layer += '                      mask = mask << 1;\n'
                                def_cnn_layer += '                }\n                }\n            break;\n'

        elif layer_type == 4 and cnt > 0:
                for i in range(len(initial_options)):
                        if initial_options[i] == 4:
                                def_cnn_layer += '            case %d:\n' % i
                                def_cnn_layer += '            fc_layer< %d, %d>( fc_tmp, fc%dW, b%d_BNFb, fc_result);\n            break;\n' % (int(n_ou_fmaps[i]),int(n_in_fmaps[i]),dense_idx,bn_idx)
                                bn_idx += 1
                                dense_idx += 1
                        elif initial_options[i] == 0 or initial_options[i] == 1:
                                bn_idx += 1

def_cnn_layer += '            default: break;\n'
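
For layer types 0 to 2, the quoted generator emits several case labels that fall through to one shared body. The following Python 3 sketch reproduces just that grouping pattern. The function emit_cases, the body strings, and the sample network are hypothetical simplifications, not the GUINNESS code.

```python
from collections import Counter


def emit_cases(layer_types, bodies):
    """Emit a C switch body where all layers of one type share one case body."""
    out = ""
    # Counter keeps first-seen order in Python 3.7+, so output is deterministic.
    for layer_type in Counter(layer_types):
        for idx, t in enumerate(layer_types):
            if t == layer_type:
                out += "case %d:\n" % idx  # one label per layer index
        # The body is appended once, after all labels of this type.
        out += "    %s\n    break;\n" % bodies[layer_type]
    out += "default: break;\n"
    return out


# Hypothetical network: int conv, bin conv, max pool, bin conv, max pool
layer_types = [0, 1, 2, 1, 2]
bodies = {0: "int_conv();", 1: "bin_conv();", 2: "max_pool();"}
switch_body = emit_cases(layer_types, bodies)
print(switch_body)
```

Sharing one body per layer type means the generated switch contains each layer implementation only once, which fits the guess that the hardware ends up as a selector in front of a small set of layer engines.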

That is roughly how it looks...
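
The C++ emitted for layer type 3 (Ave Pool) amounts to a majority vote over binary activations: for each output feature map it counts the set bits across the whole feature map and outputs 1 if the count reaches half the number of pixels. Below is a small Python 3 model of that logic. The function name binary_average_pool and the sample data are assumptions, and Python's unbounded integers stand in for the fixed-width ap_int types.

```python
def binary_average_pool(fb_tmp, n_out):
    """Majority-vote pooling over bit-packed binary feature maps.

    fb_tmp is a square 2D list of ints; bit k of each word is the
    activation of feature map k at that pixel.
    """
    size = len(fb_tmp)
    threshold = (size * size) // 2  # same as C integer division in N*N/2
    fc_tmp = []
    mask = 0x1
    for _ in range(n_out):
        # Count pixels whose bit for the current feature map is set.
        count = sum(1 for row in fb_tmp for word in row if word & mask)
        fc_tmp.append(1 if count >= threshold else 0)
        mask <<= 1  # move on to the next feature map's bit
    return fc_tmp


# 2x2 example with two feature maps packed into bits 0 and 1
fb = [[0b01, 0b11],
      [0b01, 0b00]]
pooled = binary_average_pool(fb, n_out=2)
print(pooled)
```

In this example bit 0 is set in three of four pixels and bit 1 in only one, so with a threshold of 2 the result is 1 for the first feature map and 0 for the second.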

Addendum, 2017.09.24
Open-source GUINNESS makes FPGA-accelerated, binarized neural networks easy to pour right from the SDSoC tap