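- Create a spot instance
  - Request one spot instance (for example p2.xlarge). Create an EBS root volume of the minimum size your environment needs, for example 10 GB (ignore the size limit if cost is no concern), and uncheck "Delete on Termination" so the volume survives when the instance is terminated. Choose an Amazon AMI, for example Ubuntu 16.04, and launch.
  - Once connected, set up your own environment: install software and configure settings, for example CUDA, Anaconda, TensorFlow. After testing that everything works, terminate the instance. (Note: running poweroff on a spot instance is treated as termination.)
  - Take a snapshot of the EBS volume the instance left behind.
  - Create an AMI from that snapshot.
  - When requesting a spot instance again, use this AMI so the environment does not have to be set up again. If something needs to change, modify an instance launched from this AMI, then take a new snapshot of its EBS volume and create a new AMI.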
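The snapshot and AMI steps above can also be scripted. The following is a minimal sketch, assuming the AWS CLI is installed and configured; the volume ID, the AMI name, and the root device name `/dev/sda1` are placeholders rather than values from these notes, and some instance types may need extra `register-image` options. With `DRY_RUN=1` (the default here) the commands are only printed.

```shell
#!/usr/bin/env bash
# Sketch: turn the root EBS volume left by a terminated spot instance into an AMI.
# The volume ID, AMI name and root device name are placeholders (assumptions).
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

make_ami() {
  local volume_id="$1" ami_name="$2" root_device="${3:-/dev/sda1}"
  local snapshot_id
  if [ "$DRY_RUN" = "1" ]; then
    snapshot_id="snap-PLACEHOLDER"   # no real snapshot exists in dry-run mode
    run aws ec2 create-snapshot --volume-id "$volume_id" --description "$ami_name root"
  else
    snapshot_id="$(aws ec2 create-snapshot --volume-id "$volume_id" \
      --description "$ami_name root" --query SnapshotId --output text)"
  fi
  # Wait until the snapshot is complete before registering an image from it.
  run aws ec2 wait snapshot-completed --snapshot-ids "$snapshot_id"
  run aws ec2 register-image --name "$ami_name" \
    --architecture x86_64 --virtualization-type hvm \
    --root-device-name "$root_device" \
    --block-device-mappings "DeviceName=$root_device,Ebs={SnapshotId=$snapshot_id}"
}
```

For example, `make_ami vol-0123 my-gpu-ami` prints the three commands; running it with `DRY_RUN=0` would execute them against your account.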
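- Data persistence
  - Create an EBS volume for storing data, attach it to the instance, then format and mount it:

        lsblk
        sudo mkfs -t ext4 /dev/device_name
        sudo mkdir mount_point
        sudo mount /dev/device_name mount_point

  - Run your program, download data, or save checkpoints.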
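  - Stop the instance.
  - Note that the data EBS volume is retained.
  - Create or request an instance again, attach the data EBS volume above, run lsblk to find it, and mount it. Do not run mkfs again, since that would erase the data.
  - Use the data saved earlier.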
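Because mkfs must run only the first time a data volume is used, the format step can be guarded. This is a minimal sketch, assuming `file -s` reports `<device>: data` for a volume without a filesystem (the check described in the EC2 documentation); the device and mount point names are placeholders.

```shell
# Sketch: format a data volume only when it has no filesystem yet, then mount it.

# needs_format takes the output of `file -s <device>` and succeeds when the
# device holds raw data, meaning no recognizable filesystem.
needs_format() {
  case "$1" in
    *": data") return 0 ;;
    *) return 1 ;;
  esac
}

mount_data_volume() {
  local device="$1" mount_point="$2"
  if needs_format "$(sudo file -s "$device")"; then
    sudo mkfs -t ext4 "$device"   # first use only: this erases the volume
  fi
  sudo mkdir -p "$mount_point"
  sudo mount "$device" "$mount_point"
}
```

A call such as `mount_data_volume /dev/xvdf /data` (both names hypothetical) can then be used on the first and on every later instance.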
- Connecting over SSH and transferring data

      ssh -i /path/my-key-pair.pem user_name@public_dns_name
      scp -i /path/my-key-pair.pem /path/SampleFile.txt user_name@public_dns_name:destination_path

- Environment setup
  - Anaconda

        wget https://repo.continuum.io/archive/Anaconda3-5.1.0-Linux-x86_64.sh
        bash Anaconda3-5.1.0-Linux-x86_64.sh
        export PATH=~/anaconda3/bin:$PATH

  - CUDA 9.0
    - Install make and gcc first.
    - Check that the GPU is visible:

          lspci | grep -i nvidia

    - Install:

          wget https://developer.nvidia.com/compute/cuda/9.0/Prod/local_installers/cuda_9.0.176_384.81_linux-run
          sudo sh cuda_9.0.176_384.81_linux-run --tmpdir=<path>

    - Environment setup:

          export PATH=/usr/local/cuda-9.0/bin${PATH:+:${PATH}}
          export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
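The `${PATH:+:${PATH}}` expansion in the CUDA exports adds a colon and the old value only when the variable is already set and non-empty, which avoids a stray trailing colon. The sketch below wraps that idiom in a small helper; the helper name `prepend_dir` and the install prefix `/usr/local/cuda-9.0` are assumptions for illustration.

```shell
# prepend_dir CURRENT DIR: print DIR prepended to CURRENT, using the same
# ${var:+...} idiom so that an empty CURRENT yields no trailing colon.
prepend_dir() {
  local current="$1" dir="$2"
  printf '%s\n' "${dir}${current:+:${current}}"
}

CUDA_ROOT="${CUDA_ROOT:-/usr/local/cuda-9.0}"   # assumed install prefix
export PATH="$(prepend_dir "$PATH" "$CUDA_ROOT/bin")"
export LD_LIBRARY_PATH="$(prepend_dir "${LD_LIBRARY_PATH:-}" "$CUDA_ROOT/lib64")"
```

Exports made this way last only for the current shell, so they are usually added to a startup file such as `~/.bashrc` as well.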
  - cuDNN: first download the installation file to your local disk, then use scp to transfer it to the instance.

        sudo dpkg -i libcudnn7_7.0.3.11-1+cuda9.0_amd64.deb
        export CUDA_HOME=/usr/local/cuda

  - Install the libcupti-dev library

        sudo apt-get install cuda-command-line-tools
        export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/extras/CUPTI/lib64

  - TensorFlow

        sudo apt-get install python3-pip python3-dev
        pip3 install tensorflow-gpu

  - Validate

        # Uses the TensorFlow 1.x session API
        import tensorflow as tf
        hello = tf.constant('Hello, TensorFlow!')
        with tf.device('/gpu:0'), tf.Session() as sess:
            print(sess.run(hello))
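If the TensorFlow check fails, it can help to confirm first that the CUDA toolkit on the PATH is the release that was installed. A minimal sketch, assuming `nvcc --version` prints a line of the form `Cuda compilation tools, release 9.0, V9.0.176`; the function names are hypothetical.

```shell
# cuda_release: read `nvcc --version` output on stdin and print the release
# number found after the word "release".
cuda_release() {
  sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# check_cuda EXPECTED: compare the release reported by nvcc with EXPECTED.
check_cuda() {
  local expected="$1" found
  found="$(nvcc --version 2>/dev/null | cuda_release)"
  if [ "$found" = "$expected" ]; then
    echo "CUDA $found found"
  else
    echo "expected CUDA $expected, found '${found:-none}'" >&2
    return 1
  fi
}
```

Running `check_cuda 9.0` before the Python check separates PATH problems from TensorFlow problems.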
  - Download code

        git clone https://github.com/hcz28/style_transfer.git

- Others
  - The security group must allow SSH in its inbound rules.
- References