Notes on MXNet Deployment

I have been learning MXNet lately; this post records how I deploy and use MXNet on different platforms.

1. Deployment on Different Platforms

1.1 University Server

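Servers run by the university or the department are usually free, so it is great if you can grab a machine. Besides competing for machines, keep the following in mind:

There is no root access, so many operations that need sudo are not possible.
Disks are usually mounted from a storage array, and data travels between disk and CPU over the network, so never read a very large file into memory in one go; otherwise you will be waiting on a load that takes forever.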
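
The warning about large files can be made concrete with a minimal sketch that streams a file in fixed-size chunks instead of reading it all at once. The helper names iter_chunks and count_bytes and the 1 MiB default chunk size are assumptions for illustration, not part of the original setup.

```python
from typing import Iterator


def iter_chunks(path: str, chunk_size: int = 1024 * 1024) -> Iterator[bytes]:
    """Yield the file at `path` piece by piece (hypothetical helper).

    Only `chunk_size` bytes are read per step, so the whole file never
    has to sit in memory at once.
    """
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty bytes object means end of file
                break
            yield chunk


def count_bytes(path: str, chunk_size: int = 1024 * 1024) -> int:
    """Example consumer: total file size computed from the streamed chunks."""
    return sum(len(chunk) for chunk in iter_chunks(path, chunk_size))
```

Any per-chunk processing (parsing, hashing, copying to local scratch space) can replace the len() call in count_bytes.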

1.1.1 Installation

Step 1: Check CUDA Version

This step does not need root privileges, i.e. no sudo is required.

cat /usr/local/cuda/version.txt

Step 2: conda Environment

conda env list

If necessary, remove a previous environment:

conda env remove --name ENVIRONMENT_NAME
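
When this check is scripted, the machine-readable form of the listing is easier to handle. The following is a minimal sketch that extracts environment names from the text printed by conda env list --json; the helper names and the sample string are assumptions for illustration. Because only the last path component is used, the base environment shows up under the name of its install directory.

```python
import json
import os
from typing import List


def env_names(conda_json: str) -> List[str]:
    """Return environment names from the output of `conda env list --json`.

    The JSON holds an "envs" list of environment paths; the last path
    component is used as the name here.
    """
    paths = json.loads(conda_json).get("envs", [])
    return [os.path.basename(p.rstrip("/")) for p in paths]


def has_env(conda_json: str, name: str) -> bool:
    """True if an environment called `name` is present."""
    return name in env_names(conda_json)


# Illustrative sample, not output captured from a real machine.
sample = '{"envs": ["/home/me/miniconda3", "/home/me/miniconda3/envs/gluon"]}'
print(has_env(sample, "gluon"))
```

On a real machine the JSON text could come from subprocess.run(["conda", "env", "list", "--json"], capture_output=True, text=True).stdout.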

You can also update the existing package managers:

conda update conda
pip install --upgrade pip

Step 3: Create Gluon Environment

Download Gluon Tutorial:

git clone https://github.com/mli/gluon-tutorials-zh

Create Virtual Environment gluon and activate it

cd gluon-tutorials-zh
conda env create -f environment.yml
source activate gluon

Step 4: Install notedown plugin

pip install https://github.com/mli/notedown/tarball/master
jupyter notebook --generate-config
echo "c.NotebookApp.contents_manager_class = 'notedown.NotedownContentsManager'" >>~/.jupyter/jupyter_notebook_config.py

Step 5: Install the GPU version of MXNet

Our CUDA version is 9.0, so install the matching mxnet build:

pip uninstall -y mxnet
pip install mxnet-cu90mkl
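
The link between Step 1 and the package name can be sketched in a few lines. This assumes that version.txt contains a line like "CUDA Version 9.0.176" and that the package follows the mxnet-cu<major><minor> pattern used in this post; not every CUDA version necessarily has a published package, so check pip before relying on the result. The function name is hypothetical.

```python
import re


def mxnet_package_for(version_text: str, mkl: bool = False) -> str:
    """Map the text of /usr/local/cuda/version.txt to a pip package name.

    Assumes the text looks like "CUDA Version 9.0.176" and that the
    package is named mxnet-cu<major><minor>, optionally with an mkl suffix.
    """
    match = re.search(r"CUDA Version (\d+)\.(\d+)", version_text)
    if match is None:
        raise ValueError("could not find a CUDA version in: %r" % version_text)
    major, minor = match.groups()
    name = "mxnet-cu%s%s" % (major, minor)
    return name + "mkl" if mkl else name


print(mxnet_package_for("CUDA Version 9.0.176", mkl=True))  # mxnet-cu90mkl
```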

To test whether the installation succeeded, run the following Python code:

import mxnet as mx
mx.nd.array([1,2], ctx=mx.gpu(0))

Step 6: Install gluonbook

pip install gluonbook

Step 7: Install Gluon-CV

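Before installing, you can run yolk -V gluoncv to check which gluon-cv versions are available on pip (install yolk first with pip install yolk3k). The pip release usually lags behind the latest code, so it is better to install manually as follows.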

cd ~
git clone https://github.com/dmlc/gluon-cv
cd gluon-cv && python setup.py install --user

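However, Gluon-CV 0.3.0 fails on import with cannot import name 'SimpleQueue'; rolling back to 0.2.0 avoids the problem (as of 2018-08-06). If you really need 0.3.0, the fix should be to build MXNet 1.3 yourself, since the official pip packages currently only go up to 1.2.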
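
The observation above can be turned into a small guard that runs before importing gluoncv. This is only a sketch of the rule as I experienced it (Gluon-CV 0.3.0 and later assumed to need MXNet 1.3 or later); the function names are hypothetical and only plain dotted version numbers are handled.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn "1.2.1" into (1, 2, 1); only plain dotted numbers are handled."""
    return tuple(int(part) for part in text.split("."))


def gluoncv_fits_mxnet(gluoncv_version: str, mxnet_version: str) -> bool:
    """Assumed rule from the note above: Gluon-CV >= 0.3 needs MXNet >= 1.3."""
    if parse_version(gluoncv_version) >= (0, 3):
        return parse_version(mxnet_version) >= (1, 3)
    return True


print(gluoncv_fits_mxnet("0.3.0", "1.2.1"))
```

The version strings would normally be read from gluoncv.__version__ and mxnet.__version__.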

Step 8: Clone from Gitlab

cd ~
git clone https://gitlab.com/XXXX/YYYY.git

Step 9: Download Models from the Model Zoo

from gluoncv import model_zoo
# load a ResNet model trained on CIFAR10
cifar_resnet20 = model_zoo.get_model('cifar_resnet20_v1', pretrained=True)
# load a pre-trained ssd model
ssd0 = model_zoo.get_model('ssd_300_vgg16_atrous_voc', pretrained=True)
# load ssd model with pre-trained feature extractors
ssd1 = model_zoo.get_model('ssd_512_vgg16_atrous_voc', pretrained_base=True)
# load ssd model without initialization
ssd2 = model_zoo.get_model('ssd_512_resnet50_v1_voc', pretrained_base=False)

1.1.2 Using

Checking Issues Before Using the Machines

Which GPUs are available? - nvidia-smi
Is there enough host memory? - htop
Are there available CPUs? - htop
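
The GPU check can also be scripted. Below is a minimal sketch that parses the CSV form of nvidia-smi (queried with --query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits) and lists GPUs that look idle. The 100 MiB threshold and the function names are assumptions, and low memory use is only a rough hint that nobody is using a GPU.

```python
import subprocess
from typing import List, Tuple

QUERY = ["nvidia-smi", "--query-gpu=index,memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_gpu_memory(csv_text: str) -> List[Tuple[int, int, int]]:
    """Parse lines such as "0, 11, 11178" into (index, used_mib, total_mib)."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field) for field in line.split(","))
        rows.append((index, used, total))
    return rows


def idle_gpus(csv_text: str, max_used_mib: int = 100) -> List[int]:
    """Indices of GPUs using at most `max_used_mib` MiB (assumed threshold)."""
    return [i for i, used, _ in parse_gpu_memory(csv_text) if used <= max_used_mib]


def query_gpus() -> str:
    """Run nvidia-smi and return its CSV text; needs an NVIDIA driver."""
    return subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
```

On a GPU machine, idle_gpus(query_gpus()) returns the candidate device indices, which can then be passed to mx.gpu().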

Using the machines in general

The screen program keeps your programs running after the SSH connection is closed. Please make sure to close screens you no longer use.

AWS

Step 1: Update System

sudo apt-get update && sudo apt-get install -y build-essential git libgfortran3

Step 2: Install CUDA

Go to https://developer.nvidia.com/cuda-toolkit-archive and download the matching CUDA installer.

I chose version 9.0:

wget https://developer.nvidia.com/compute/cuda/9.0/Prod/local_installers/cuda_9.0.176_384.81_linux-run
sudo sh cuda_9.0.176_384.81_linux-run

Then check that the installation succeeded:

nvidia-smi

Finally, add CUDA to the library path so that other libraries can find it. I installed 9.0, so the path below uses 9.0; adjust it to your version.

echo "export LD_LIBRARY_PATH=\${LD_LIBRARY_PATH}:/usr/local/cuda-9.0/lib64" >> ~/.bashrc
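
Running the echo line a second time appends a duplicate export. If the setup is likely to be repeated, a small idempotent helper avoids that. This is a sketch only; append_line_once is a hypothetical name, and the example call is commented out because it would edit a real shell profile.

```python
import os


def append_line_once(path: str, line: str) -> bool:
    """Append `line` to the file at `path` unless an identical line exists.

    Returns True if the file was changed. Hypothetical helper.
    """
    path = os.path.expanduser(path)
    text = ""
    if os.path.exists(path):
        with open(path) as f:
            text = f.read()
    if line in text.splitlines():
        return False
    with open(path, "a") as f:
        if text and not text.endswith("\n"):
            f.write("\n")  # keep the new line separate from the last one
        f.write(line + "\n")
    return True


CUDA_EXPORT = "export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/cuda-9.0/lib64"
# Example call (edits your real shell profile, so it is left commented out):
# append_line_once("~/.bashrc", CUDA_EXPORT)
```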

Step 3: Install Conda

wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

After installation, run source ~/.bashrc once so that the CUDA and conda settings take effect:

source ~/.bashrc

Update the existing package managers:

conda update conda
pip install --upgrade pip

Step 4: Install MXNet

cd ~
git clone https://github.com/mli/gluon-tutorials-zh.git
cd gluon-tutorials-zh/
conda env create -f environment.yml

The default environment installs the CPU version of MXNet. Now we replace it with the GPU version of MXNet (1.2.1).

source activate gluon
pip uninstall mxnet
pip install mxnet-cu90==1.2.1

Step 5: Run Jupyter Notebook

cd to the target folder, then run:

jupyter notebook

To test whether the installation succeeded, run the following Python code:

import mxnet as mx
mx.nd.array([1,2], ctx=mx.gpu(0))

Step 6: Install gluonbook

pip install gluonbook

Step 7: Install Gluon-CV

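Before installing, you can run yolk -V gluoncv to check which gluon-cv versions are available on pip. The pip release usually lags behind the latest code, so install from source as in the University Server section if you need the newest version.

Install yolk

pip install yolk3k

Install gluoncv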

pip install gluoncv

Step 8: Clone from Gitlab

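First we need to set up SSH, i.e. add the key generated on the AWS machine to GitLab's SSH keys so that the two recognise each other. Because we cannot open a graphical interface on the AWS machine, the public key generated there has to be downloaded to my local machine first, and its contents are then pasted into the GitLab web page.

Generate the key on the AWS machine. Since the AWS machine runs Linux, follow the Linux instructions: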

Generating a new SSH key

ssh-keygen -t rsa -b 4096 -C "your_email@example.com"

Adding your SSH key to the ssh-agent

eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_rsa

Download id_rsa.pub to the local machine. Run this in a terminal on the local machine (not the one logged in to AWS):

scp -i "XXXX.pem" username@remote:~/.ssh/id_rsa.pub ~/Downloads

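The username@remote part can be found in the login command shown under Instances -> Connect in the AWS console. Open the downloaded file and copy its contents into GitLab under the avatar in the top-right corner -> Settings -> SSH Keys.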

Clone from Gitlab

cd ~
git clone https://gitlab.com/XXXX/YYYY.git

Step 9: Download Models from the Model Zoo

Run in Python:

from gluoncv import model_zoo
# load a ResNet model trained on CIFAR10
cifar_resnet20 = model_zoo.get_model('cifar_resnet20_v1', pretrained=True)
# load a pre-trained ssd model
ssd0 = model_zoo.get_model('ssd_300_vgg16_atrous_voc', pretrained=True)
# load ssd model with pre-trained feature extractors
ssd1 = model_zoo.get_model('ssd_512_vgg16_atrous_voc', pretrained_base=True)
# load ssd model without initialization
ssd2 = model_zoo.get_model('ssd_512_resnet50_v1_voc', pretrained_base=False)

Step 10: Upload your own dataset

On your own local machine:

scp -i "XXXX.pem" -r /file/to/send username@remote:/where/to/put

-r is for folders: it copies the whole directory recursively.

Step 11: Forward the Remote Port to the Local Machine

ssh -i "XXXX.pem" -N -f -L 8888:localhost:8890 user_name@server_address

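Step xxx: Stop Unused Instances

Because cloud services bill by usage time, we usually shut an instance down when it is not in use.

If you will restart the instance fairly soon, right-click the instance shown in Figure 11.16 and choose “Instance State” → “Stop” to stop it; next time, choose “Instance State” → “Start” to start it again. In this case the restarted instance keeps what was stored on its disk before it was stopped (for example, there is no need to reinstall CUDA and the other runtime environments). However, a stopped instance still incurs a small charge for the disk space it retains.

If you will not restart the instance for a long time, right-click the instance shown in Figure 11.16 and choose “Image” → “Create” to create an image. Then choose “Instance State” → “Terminate” to terminate the instance (the disk is no longer billed). Next time, follow the steps in this section for creating and running an EC2 instance to create a new instance from the saved image. The only difference is that in the first step “1. Choose AMI” in Figure 11.10, you select the saved image via “My AMIs” in the left column. An instance created this way keeps the disk contents stored in the image (for example, there is no need to reinstall CUDA and the other runtime environments).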

ImportError: libGL.so.1: cannot open shared object file: No such file or directory

sudo apt install libgl1-mesa-glx

Own Computer

Common Operations

List all available versions of a package

https://stackoverflow.com/questions/4888027/python-and-pip-list-all-versions-of-a-package-thats-available

For Python 3:

pip install yolk3k
yolk -V gluoncv

Uninstall a package

pip uninstall xxxx

Map remote port to local port

First, activate the virtual environment on the remote server, e.g.,
1. cd gluon_tutorial
2. source activate gluon
Second, set up port forwarding with ssh -N -f -L 8888:localhost:8890 qcn484@imgpu1, where 8890 is the port on the remote server and 8888 is the port on localhost, i.e. the one opened on my laptop.
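
Since it is easy to swap the two port numbers, the forwarding command can be assembled by a small function. This is a minimal sketch: tunnel_command is a hypothetical helper, and the host string is the one from the example above.

```python
import shlex
from typing import List


def tunnel_command(local_port: int, remote_port: int, host: str) -> List[str]:
    """Build the argv for `ssh -N -f -L local:localhost:remote host`.

    -L forwards local_port on this machine to localhost:remote_port as
    seen from the server; -N runs no remote command; -f sends ssh to
    the background.
    """
    forward = "%d:localhost:%d" % (local_port, remote_port)
    return ["ssh", "-N", "-f", "-L", forward, host]


cmd = tunnel_command(8888, 8890, "qcn484@imgpu1")
print(" ".join(shlex.quote(part) for part in cmd))
```

Passing cmd to subprocess.run(cmd, check=True) starts the tunnel; the notebook is then reachable at localhost:8888 on the laptop.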

Upload and Download files via FTP Client

Choose SFTP Protocol, only SFTP
Server: ssh-diku-image.science.ku.dk
KU Username
KU Password
Port 22, Done

Check the occupied port and kill the process

lsof -i:xxxx  # xxxx is the port you want to check
kill yyyy     # yyyy is the pid you want to kill
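
Before starting a notebook or a tunnel, it can help to test from Python whether a port is already taken. The sketch below only reports whether something accepts TCP connections on the port; it does not say which process owns it, so lsof is still needed for the pid. The function name is hypothetical.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception
        return sock.connect_ex((host, port)) == 0
```

For example, port_in_use(8888) tells you whether the local end of the tunnel is already occupied.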

SSH Login Without Password Using `ssh-keygen` & `ssh-copy-id`

Following "3 Steps to Perform SSH Login Without Password Using ssh-keygen & ssh-copy-id":

Assuming you already have ~/.ssh/id_rsa.pub, just run ssh-copy-id -i ~/.ssh/id_rsa.pub remote-host.

Using `scp`

If you are on the computer from which you want to send file to a remote computer:

scp /file/to/send username@remote:/where/to/put

scp <file> username@ipaddress:

On the other hand if you are on the computer wanting to receive file from a remote computer:

scp username@remote:/file/to/send /where/to/put

Do not forget the final : after the remote host; without it, scp treats the destination as a local file name.
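
One way to never lose the colon is to build the remote argument in code. The following is a small sketch; remote_target and upload_command are hypothetical helpers that only assemble the command and do not run it.

```python
from typing import List


def remote_target(user: str, host: str, path: str = "") -> str:
    """Format an scp remote argument; the ":" is always present.

    An empty `path` means the remote home directory, e.g. "me@server:".
    """
    return "%s@%s:%s" % (user, host, path)


def upload_command(local_path: str, user: str, host: str,
                   remote_path: str = "", recursive: bool = False) -> List[str]:
    """Build the argv for copying a local file (or, with -r, a folder)."""
    cmd = ["scp"]
    if recursive:
        cmd.append("-r")
    return cmd + [local_path, remote_target(user, host, remote_path)]


print(upload_command("/file/to/send", "username", "remote", "/where/to/put"))
```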

Sync code among different machines

git pull origin master:master

Add SSH key to GitHub/GitLab/Coding.net Account

on Mac

$ pbcopy < ~/.ssh/id_rsa.pub
# Copies the contents of the id_rsa.pub file to your clipboard

on Windows

$ clip < ~/.ssh/id_rsa.pub
# Copies the contents of the id_rsa.pub file to your clipboard

on Linux

$ sudo apt-get install xclip
# Downloads and installs xclip. If you don't have `apt-get`, you might need to use another installer (like `yum`)
$ xclip -sel clip < ~/.ssh/id_rsa.pub
# Copies the contents of the id_rsa.pub file to your clipboard

Push to Multiple Remotes at Once

Add the following to .git/config; afterwards you can run git push all master. You also need to add this machine's SSH key to the websites hosting both repos.

[remote "all"]
	url = git@git.coding.net:XXXX/YYYY.git
	url = git@gitlab.com:XXXX/YYYY.git

Uninstall CUDA and NVIDIA Driver

To uninstall the CUDA Toolkit, run the uninstall script in /usr/local/cuda-9.0/bin
To uninstall the NVIDIA Driver, run nvidia-uninstall

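Automation Script

Deploying as described above is not a once-and-for-all job. Because I am on a budget, I only use spot instances on AWS, which have just two states: running and terminated. You can create an image and relaunch from it next time, but the NVIDIA driver sometimes breaks. An automation script would save a lot of trouble here.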
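
As a starting point, here is a minimal sketch of such a script, assuming the commands from the AWS steps above. It defaults to a dry run that only prints the plan. Each command runs in its own shell, so the pip lines assume the gluon environment is already active when the script starts; reinstalling CUDA or the NVIDIA driver is not covered and would have to be added.

```python
import subprocess
from typing import List

# Commands collected from the AWS steps in this post; adjust to your setup.
SETUP_COMMANDS = [
    "sudo apt-get update",
    "sudo apt-get install -y build-essential git libgfortran3",
    "pip uninstall -y mxnet",
    "pip install mxnet-cu90==1.2.1",
    "pip install gluonbook gluoncv",
]


def run_setup(commands: List[str], dry_run: bool = True) -> List[str]:
    """Process each command in order and stop at the first failure.

    With dry_run=True (the default) the commands are only printed.
    Returns the list of commands that were processed.
    """
    done = []
    for command in commands:
        print("+ " + command)
        if not dry_run:
            subprocess.run(command, shell=True, check=True)
        done.append(command)
    return done


run_setup(SETUP_COMMANDS)  # dry run: prints the plan without executing it
```

Calling run_setup(SETUP_COMMANDS, dry_run=False) on a fresh instance executes the steps for real.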
