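Copying and In-Place Operations in PyTorch Explained

Preface

In PyTorch we often process data with NumPy and then convert it to a Tensor. Whenever that data is modified later, we need to check whether the method in use shares the underlying memory, because this affects how the whole network is updated. This post summarizes the points to watch in in-place operations and copy operations.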

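In-place operations

In PyTorch, in-place operations carry a trailing underscore, as in .add_() or .scatter_(). An in-place operation changes the contents of the given Tensor directly without making a copy, so no new memory is allocated for the result. Python's augmented assignments such as += and *= are also in-place operations on tensors. (The tensor adds to itself~)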
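
The difference can be checked on a small tensor. The sketch below is only an illustration: the variable names are made up here, and it uses data_ptr(), which returns the address of the first element of a tensor's storage, to tell whether a result reuses the original memory.

```python
import torch

x = torch.ones(3)
ptr_before = x.data_ptr()        # address of the first element of x's storage

x.add_(1)                        # in-place: note the trailing underscore
x += 1                           # augmented assignment is in-place for tensors too
same_storage = x.data_ptr() == ptr_before

y = x.add(1)                     # out-of-place: the result is a new tensor
new_storage = y.data_ptr() != ptr_before

print(x, same_storage)
print(y, new_storage)
```

Here x keeps its storage through both in-place calls, while y is backed by freshly allocated memory.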

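Why can in-place operations help reduce memory usage when working with high-dimensional data? The example below defines a simple function that measures the memory allocated by PyTorch's out-of-place ReLU and its in-place ReLU: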

import torch # import main library
import torch.nn as nn # import modules like nn.ReLU()
import torch.nn.functional as F # import torch functions like F.relu() and F.relu_()

def get_memory_allocated(device, inplace=False):
    '''
    Measure the memory allocated before and after a ReLU call.
    INPUT:
    - device: GPU device to run the operation on
    - inplace: True to run ReLU in-place, False for the normal ReLU call
    '''

    # Create a large tensor
    t = torch.randn(10000, 10000, device=device)

    # Measure allocated memory (in MiB) before the call
    torch.cuda.synchronize()
    start_max_memory = torch.cuda.max_memory_allocated() / 1024**2
    start_memory = torch.cuda.memory_allocated() / 1024**2

    # Call in-place or normal ReLU
    if inplace:
        F.relu_(t)
    else:
        output = F.relu(t)

    # Measure allocated memory (in MiB) after the call
    torch.cuda.synchronize()
    end_max_memory = torch.cuda.max_memory_allocated() / 1024**2
    end_memory = torch.cuda.memory_allocated() / 1024**2

    # Return the amount of memory allocated by the ReLU call
    return end_memory - start_memory, end_max_memory - start_max_memory

# Set up the device. The torch.cuda counters used above only track GPU memory,
# so this measurement needs a CUDA device to run.
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Run the test
# Call the function to measure the allocated memory for the out-of-place ReLU
memory_allocated, max_memory_allocated = get_memory_allocated(device, inplace = False)
print('Allocated memory: {}'.format(memory_allocated))
print('Allocated max memory: {}'.format(max_memory_allocated))
'''
Allocated memory: 382.0
Allocated max memory: 382.0
'''
#Then call the in-place ReLU as follows:
memory_allocated_inplace, max_memory_allocated_inplace = get_memory_allocated(device, inplace = True)
print('Allocated memory: {}'.format(memory_allocated_inplace))
print('Allocated max memory: {}'.format(max_memory_allocated_inplace))
'''
Allocated memory: 0.0
Allocated max memory: 0.0
'''
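
The measurement above relies on the torch.cuda counters and therefore needs a GPU. As a rough CPU-friendly cross-check (a sketch with made-up variable names, not a memory measurement), one can compare storage addresses to see which call reuses the input buffer:

```python
import torch
import torch.nn.functional as F

t = torch.randn(1000, 1000)

out = F.relu(t)                          # out-of-place: result gets its own buffer
out_is_new = out.data_ptr() != t.data_ptr()

res = F.relu_(t)                         # in-place: result is written into t
inplace_reuses = res.data_ptr() == t.data_ptr()

print(out_is_new, inplace_reuses)
```

This does not report how many bytes were allocated; it only shows that the out-of-place call produces a second buffer while the in-place call does not.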

It looks like in-place operations can help us save some GPU memory. However, they should be used with great care.

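According to the official documentation, in-place operations have two main drawbacks:

1. They can overwrite values that are needed to compute gradients, which breaks the training of the model.

2. Every in-place operation requires the implementation to rewrite the computational graph. Out-of-place operations simply allocate new objects and keep references to the old graph, while in-place operations require changing the creator of all inputs to the Function that represents the operation.

Supporting in-place operations in Autograd is hard, and their use is discouraged in most cases. Autograd's aggressive buffer freeing and reuse makes it very efficient, and there are very few occasions when in-place operations actually lower memory usage. Unless you are running under heavy memory pressure, you may never need them.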
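
The first drawback is easy to reproduce. In the sketch below (names chosen for illustration), exp() keeps its output for the backward pass, so editing that output in place makes the later backward() call fail, while the out-of-place version works:

```python
import torch

a = torch.ones(3, requires_grad=True)
b = a.exp()            # exp saves its output to compute the gradient later
b.add_(1)              # in-place edit overwrites that saved output

failed = False
try:
    b.sum().backward()
except RuntimeError as err:
    failed = True
    print("backward failed:", err)

# Out-of-place version: the saved output of exp stays untouched
a2 = torch.ones(3, requires_grad=True)
c = a2.exp() + 1
c.sum().backward()
print(a2.grad)         # gradient of exp(a2) + 1, i.e. exp(a2)
```

Autograd detects the modification through the tensor's version counter and raises an error instead of silently computing a wrong gradient.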

Summary: Autograd is already very good, so use in-place operations with caution.

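Copy methods

Shallow-copy methods: these share the memory address of the data, so changes show up on both sides

* a.numpy() # Tensor -> NumPy array

* view() # changes the tensor's shape but shares the data memory; do not rely on id() alone to check this

* y = x[:] # indexing (slicing)

* torch.from_numpy() # NumPy array -> Tensor

* Tensor.detach() # the new tensor is detached from the computational graph and takes no part in gradient computation

* model.forward()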
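
Most of the entries in this list can be checked in a few lines. The sketch below uses illustrative names and a CPU tensor (numpy() only works directly on CPU tensors); it compares storage addresses and then writes through the original tensor:

```python
import numpy as np
import torch

x = torch.zeros(2, 3)

v = x.view(6)              # reshaped view
s = x[:]                   # basic slicing
d = x.detach()             # detached from the graph
n = x.numpy()              # NumPy array backed by the same buffer
f = torch.from_numpy(n)    # back to a Tensor, still the same buffer

shares = [t.data_ptr() == x.data_ptr() for t in (v, s, d, f)]

x[0, 0] = 7.0              # write through the original tensor
print(shares)
print(v[0].item(), s[0, 0].item(), d[0, 0].item(), n[0, 0], f[0, 0].item())
```

Every alias sees the write, even though each one is a distinct Python object with its own id.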

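The in-place operations discussed above also write to the existing memory instead of allocating new storage.

Selection functions such as index_select(), masked_select(), and gather() behave differently: they return a new tensor that does not share storage with the input.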
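
Whether a selection function shares memory with its input can be checked by writing to the source afterwards and seeing whether the result changes. In this sketch (illustrative names and values), the results keep their values after the source is zeroed:

```python
import torch

x = torch.arange(6.0).reshape(2, 3)      # [[0, 1, 2], [3, 4, 5]]

picked = torch.index_select(x, 1, torch.tensor([0, 2]))
masked = torch.masked_select(x, x > 2)
gathered = torch.gather(x, 1, torch.tensor([[0], [1]]))

x.zero_()                                # in-place write to the source

print(picked)
print(masked)
print(gathered)
```

The three results are unaffected by zero_(), which indicates that each of them owns its storage.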

Deep-copy methods:

* torch.clone() # the new tensor stays in the computational graph and takes part in gradient computation
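
A small sketch of both properties, with names chosen for illustration: the clone has its own storage, yet gradients still flow through it back to the source. Combining detach() with clone() is a common way to get an independent copy outside the graph.

```python
import torch

x = torch.ones(3, requires_grad=True)
y = x.clone()                    # new storage, still attached to the graph

separate_storage = y.data_ptr() != x.data_ptr()

(y * 2).sum().backward()         # the gradient flows through clone back to x
print(separate_storage, x.grad)

z = x.detach().clone()           # independent copy with no graph history
z[0] = 5.0                       # does not affect x
print(x, z)
```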

Let's verify this, starting with the shallow copy:

import torch as t
import numpy as np
a = np.ones(4)
b = t.from_numpy(a) # Numpy->Tensor
print(a)
print(b)
'''Output:
[1. 1. 1. 1.]
tensor([1., 1., 1., 1.], dtype=torch.float64)
'''
b.add_(1)  # add_ modifies b itself
print(a)
print(b)
'''Output:
[2. 2. 2. 2.]
tensor([2., 2., 2., 2.], dtype=torch.float64)
After the add_ on b, a and b have both changed
'''

The Tensor and the NumPy array share memory here (a shallow copy), so converting between them is fast and a change to one shows up in the other.

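In torch, an operation such as y = x + y allocates new memory and then rebinds y to it. To verify this, we can use Python's built-in id function: if two instances have the same ID, they are the same object and therefore occupy the same memory. Note, however, that in torch id() alone is not a reliable test of memory sharing, as mentioned for view() above.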
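
A minimal sketch of this check, with illustrative names. It records both id() and data_ptr() (the address of the tensor's first element), since two tensor objects with different ids can still be backed by the same memory:

```python
import torch

x = torch.ones(2)
y = torch.zeros(2)

id_before, ptr_before = id(y), y.data_ptr()
y = x + y                         # out-of-place: y now names a new tensor
rebinding_changed_id = id(y) != id_before
rebinding_changed_ptr = y.data_ptr() != ptr_before

id_before, ptr_before = id(y), y.data_ptr()
y += x                            # in-place: same object, same memory
inplace_kept_both = id(y) == id_before and y.data_ptr() == ptr_before

v = y.view(2, 1)                  # different object, shared memory
view_shares = id(v) != id(y) and v.data_ptr() == y.data_ptr()

print(rebinding_changed_id, rebinding_changed_ptr, inplace_kept_both, view_shares)
```

The last line is the case where id() is misleading: v and y have different ids but point at the same data.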