"Illegal instruction" when running import torch on a Jetson

The problem and attempted fixes

On a Jetson Xavier NX, running import torch in Python suddenly started failing with "Illegal instruction", as the session below shows:

$ python  
Python 3.6.9 (default, Mar 10 2023, 16:46:00)  
[GCC 8.4.0] on linux  
Type "help", "copyright", "credits" or "license" for more information.  
>>> import torch  
Illegal instruction

After the error, Python crashes outright.

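Web searches and AI answers nearly all give the same cause and fix: "This is usually caused by an installed PyTorch wheel that is not fully compatible with the Jetson Xavier NX's CPU architecture or JetPack version. To fix it, use the prebuilt PyTorch wheels that NVIDIA officially provides for the Jetson platform, which are optimized for the ARM aarch64 architecture and for specific JetPack versions."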

After removing the installed PyTorch and TorchVision with pip3 uninstall torch torchvision, I downloaded, installed, and tested torch following NVIDIA's official instructions:

$ wget https://nvidia.box.com/shared/static/fjtbno0vpo676a25cgvuqc1wty0fkkg6.whl -O torch-1.10.0-cp36-cp36m-linux_aarch64.whl
$ pip3 install torch-1.10.0-cp36-cp36m-linux_aarch64.whl  
Defaulting to user installation because normal site-packages is not writeable  
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple  
Processing ./torch-1.10.0-cp36-cp36m-linux_aarch64.whl  
Requirement already satisfied: dataclasses in ./.local/lib/python3.6/site-packages (from torch==1.10.0) (0.8)  
Requirement already satisfied: typing-extensions in ./.local/lib/python3.6/site-packages (from torch==1.10.0) (4.1.1)  
Installing collected packages: torch  
Successfully installed torch-1.10.0
$ python  
Python 3.6.9 (default, Mar 10 2023, 16:46:00)  
[GCC 8.4.0] on linux  
Type "help", "copyright", "credits" or "license" for more information.  
>>> import torch  
Illegal instruction
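Before blaming the wheel, it can help to confirm that its filename tags match the running interpreter. The following is a minimal sketch, not part of the original troubleshooting: the helper names parse_wheel_filename and wheel_matches_interpreter are hypothetical, and the check only compares the CPython version tag and the CPU architecture.

```python
import platform
import sys


def parse_wheel_filename(filename):
    """Split a wheel filename into (name, version, python tag, ABI tag, platform tag)."""
    stem = filename[:-len(".whl")] if filename.endswith(".whl") else filename
    parts = stem.split("-")
    # A wheel name has 5 fields, or 6 when an optional build tag is present.
    if len(parts) not in (5, 6):
        raise ValueError("not a wheel filename: " + filename)
    name, version = parts[0], parts[1]
    python_tag, abi_tag, platform_tag = parts[-3:]
    return name, version, python_tag, abi_tag, platform_tag


def wheel_matches_interpreter(filename):
    """Rough check: CPython major/minor and CPU architecture both match."""
    _, _, python_tag, _, platform_tag = parse_wheel_filename(filename)
    expected_python = "cp{}{}".format(*sys.version_info[:2])
    return python_tag == expected_python and platform_tag.endswith(platform.machine())


if __name__ == "__main__":
    wheel = "torch-1.10.0-cp36-cp36m-linux_aarch64.whl"
    print(parse_wheel_filename(wheel))
    print("matches this interpreter:", wheel_matches_interpreter(wheel))
```

pip performs a stricter version of this check itself, so a wheel that installs cleanly, as it did here, has already passed it. That is one more hint that the wheel was not the culprit.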

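Clearly the installation itself is fine: every version on the officially supported list installs successfully, every time. But import torch still crashes immediately. And yes, I also tried the commonly suggested fix of adding OPENBLAS_CORETYPE=ARMV8 to .bashrc and reloading it.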
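Because .bashrc is only read by interactive bash shells, it is worth checking that the variable actually reached the Python process being tested. A minimal sketch follows; the function name openblas_coretype is hypothetical and only inspects the environment, it does not change OpenBLAS behaviour.

```python
import os


def openblas_coretype(environ=None):
    """Return the OPENBLAS_CORETYPE value visible to this process, or None."""
    env = os.environ if environ is None else environ
    return env.get("OPENBLAS_CORETYPE")


if __name__ == "__main__":
    value = openblas_coretype()
    if value is None:
        print("OPENBLAS_CORETYPE is not set in this process")
    else:
        print("OPENBLAS_CORETYPE =", value)
```

If this prints "not set" in the same shell where import torch is tested, the variable was assigned without being exported, or the shell never sourced .bashrc.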

The correct fix

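With no answer in sight, and before resorting to reflashing the system, I decided to make one last attempt and look at Python's verbose debug output: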

$ python -vvv  
import _frozen_importlib # frozen  
import _imp # builtin  
import sys # builtin  
import '_warnings' # <class '_frozen_importlib.BuiltinImporter'>
.........
# trying /usr/lib/python3.6/dist-packages/usercustomize.pyc  
import 'site' # <_frozen_importlib_external.SourceFileLoader object at 0x7f90847d30>  
Python 3.6.9 (default, Mar 10 2023, 16:46:00)  
[GCC 8.4.0] on linux  
Type "help", "copyright", "credits" or "license" for more information.  
.........
# trying /usr/lib/python3.6/rlcompleter.py  
# /usr/lib/python3.6/__pycache__/rlcompleter.cpython-36.pyc matches /usr/lib/python3.6/rlcompleter.py  
# code object from '/usr/lib/python3.6/__pycache__/rlcompleter.cpython-36.pyc'  
import 'rlcompleter' # <_frozen_importlib_external.SourceFileLoader object at 0x7f9068fbe0>  
>>> import torch  
# trying /home/jetson/torch.cpython-36m-aarch64-linux-gnu.so  
# trying /home/jetson/torch.abi3.so  
# trying /home/jetson/torch.so  
# trying /home/jetson/torch.py
.........
# trying /home/jetson/.local/lib/python3.6/site-packages/numpy/core/overrides.py  
# /home/jetson/.local/lib/python3.6/site-packages/numpy/core/__pycache__/overrides.cpython-36.pyc matches /home/jetson/.local/lib/python3.6/site-packages/numpy/core/overrides.py  
# code object from '/home/jetson/.local/lib/python3.6/site-packages/numpy/core/__pycache__/overrides.cpython-36.pyc'  
# trying /home/jetson/.local/lib/python3.6/site-packages/numpy/core/_multiarray_umath.cpython-36m-aarch64-linux-gnu.so  
Illegal instruction
$

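The output is extremely long, but the last few lines gave me hope: Python hits the problem while loading the shared library _multiarray_umath.cpython-36m-aarch64-linux-gnu.so. My guess is that this NumPy library contains instructions that the Jetson's ARM CPU cannot recognize or execute. I asked the AI again, and it told me to uninstall and then reinstall. The difference this time was that the numpy package was uninstalled as well.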
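The same diagnosis can be reached without reading the whole verbose log, by importing each suspect module in a fresh child interpreter and checking how that process ended. This is a sketch under the assumption of a POSIX system, where a negative return code means the child was killed by a signal; the helper name probe is hypothetical.

```python
import signal
import subprocess
import sys


def probe(code):
    """Run `code` in a fresh interpreter and report how that process ended."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    if result.returncode == 0:
        return "ok"
    if result.returncode < 0:
        # On POSIX a negative return code means the child was killed by that signal.
        return "killed by " + signal.Signals(-result.returncode).name
    return "exited with status {}".format(result.returncode)


if __name__ == "__main__":
    for module in ("numpy", "torch"):
        print(module, "->", probe("import " + module))
```

If numpy alone is reported as killed by SIGILL, the crash is in NumPy rather than in PyTorch, which is what the verbose trace above suggests.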

Uninstall, install, and test:

$ pip3 uninstall torch torchvision torchaudio numpy  
Found existing installation: torch 1.8.0  
Uninstalling torch-1.8.0:  
 Would remove:  
   /home/jetson/.local/bin/convert-caffe2-to-onnx  
   /home/jetson/.local/bin/convert-onnx-to-caffe2  
   /home/jetson/.local/lib/python3.6/site-packages/caffe2/*  
   /home/jetson/.local/lib/python3.6/site-packages/torch-1.8.0.dist-info/*  
   /home/jetson/.local/lib/python3.6/site-packages/torch/*  
Proceed (Y/n)? y  
 Successfully uninstalled torch-1.8.0  
WARNING: Skipping torchvision as it is not installed.  
WARNING: Skipping torchaudio as it is not installed.  
Found existing installation: numpy 1.19.5  
Uninstalling numpy-1.19.5:  
 Would remove:  
   /home/jetson/.local/bin/f2py  
   /home/jetson/.local/bin/f2py3  
   /home/jetson/.local/bin/f2py3.6  
   /home/jetson/.local/lib/python3.6/site-packages/numpy-1.19.5.dist-info/*  
   /home/jetson/.local/lib/python3.6/site-packages/numpy.libs/libgfortran-daac5196.so.5.0.0  
   /home/jetson/.local/lib/python3.6/site-packages/numpy.libs/libopenblasp-r0-32ff4d91.3.13.so  
   /home/jetson/.local/lib/python3.6/site-packages/numpy.libs/libz-558a5e64.so.1.2.7  
   /home/jetson/.local/lib/python3.6/site-packages/numpy/*  
Proceed (Y/n)? y  
 Successfully uninstalled numpy-1.19.5
$ pip3 install torch-1.10.0-cp36-cp36m-linux_aarch64.whl  
Defaulting to user installation because normal site-packages is not writeable  
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple  
Processing ./torch-1.10.0-cp36-cp36m-linux_aarch64.whl  
Requirement already satisfied: dataclasses in ./.local/lib/python3.6/site-packages (from torch==1.10.0) (0.8)  
Requirement already satisfied: typing-extensions in ./.local/lib/python3.6/site-packages (from torch==1.10.0) (4.1.1)  
Installing collected packages: torch  
Successfully installed torch-1.10.0  
$ python  
Python 3.6.9 (default, Mar 10 2023, 16:46:00)  
[GCC 8.4.0] on linux  
Type "help", "copyright", "credits" or "license" for more information.  
>>> import torch  
>>> print(torch.__version__)  
1.10.0  
>>> import numpy  
>>> print(numpy.__version__)  
1.13.3  
>>> quit()
$ python3 -c "  
> import torch  
> print('PyTorch:', torch.__version__)  
> print('CUDA available:', torch.cuda.is_available())  
> if torch.cuda.is_available():  
>     print('Device:', torch.cuda.get_device_name(0))  
> "  
PyTorch: 1.10.0  
CUDA available: True  
Device: Xavier

So the root of the problem is that the numpy 1.19.5 installed in user space is incompatible with the torch build supported on the Jetson NX platform.

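Why did it suddenly become incompatible? My guess is this: the NumPy that originally came with this Jetson Xavier NX system is 1.13.3, and at some point someone installed NumPy 1.19.5 in user space. In user space, python3 pointed to python3.6 while pip3 pointed to Python 3.7's pip; a few days ago I changed this so that both python3 and pip3 point to Python 3.6.9. I am not certain, but this is most likely the trigger, because nothing else changed on the system in between.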
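To see which copy of a package a given interpreter would pick up, the import machinery can be queried without actually importing (and therefore without crashing on) the module. A minimal sketch, with module_origin as a hypothetical helper name:

```python
import importlib.util
import site
import sys


def module_origin(name):
    """Return the file a top-level module would be loaded from, without importing it."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("version:    ", sys.version.split()[0])
    print("user site:  ", site.getusersitepackages())
    for name in ("numpy", "torch"):
        print(name, "->", module_origin(name))
```

A path under ~/.local/lib/python3.6/site-packages means the user-space copy shadows the system one. Invoking pip as python3 -m pip ties it to that same interpreter, which avoids the python3/pip3 mismatch described above.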

Probing the edge of the crash

I later found that the SciPy on this system needs NumPy 1.14.5, so I tried installing it:

$ pip3 install numpy==1.14.5
......
Successfully built numpy  
Installing collected packages: numpy  
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the  
source of the following dependency conflicts.  
imgaug 0.4.0 requires opencv-python, which is not installed.  
tifffile 2020.9.3 requires numpy>=1.15.1, but you have numpy 1.14.5 which is incompatible.  
seaborn 0.11.2 requires numpy>=1.15, but you have numpy 1.14.5 which is incompatible.  
scikit-image 0.17.2 requires numpy>=1.15.1, but you have numpy 1.14.5 which is incompatible.  
pandas 1.1.5 requires numpy>=1.15.4, but you have numpy 1.14.5 which is incompatible.  
opencv-contrib-python 4.5.5.62 requires numpy>=1.19.3; python_version >= "3.6" and platform_system == "Linux" and platform_machin  
e == "aarch64", but you have numpy 1.14.5 which is incompatible.  
imgaug 0.4.0 requires numpy>=1.15, but you have numpy 1.14.5 which is incompatible.  
Successfully installed numpy-1.14.5

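As the output shows, the strictest requirement is numpy>=1.19.3, so I may need to go up to NumPy 1.19.3, yet the version that caused the problem earlier was NumPy 1.19.5. Either way, I tested the current install first: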
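The 1.19.3 figure is simply the highest of the lower bounds that pip reported. As a small sketch of that reasoning, the snippet below extracts every numpy>= bound from the conflict lines (copied from the output above, with the wrapped opencv line joined) and takes the maximum. The helper names are hypothetical, and only ">=" constraints are handled.

```python
import re

# Lines copied from the pip output above (the wrapped opencv line is joined).
PIP_CONFLICTS = """\
tifffile 2020.9.3 requires numpy>=1.15.1, but you have numpy 1.14.5 which is incompatible.
seaborn 0.11.2 requires numpy>=1.15, but you have numpy 1.14.5 which is incompatible.
scikit-image 0.17.2 requires numpy>=1.15.1, but you have numpy 1.14.5 which is incompatible.
pandas 1.1.5 requires numpy>=1.15.4, but you have numpy 1.14.5 which is incompatible.
opencv-contrib-python 4.5.5.62 requires numpy>=1.19.3; python_version >= "3.6" and platform_system == "Linux" and platform_machine == "aarch64", but you have numpy 1.14.5 which is incompatible.
imgaug 0.4.0 requires numpy>=1.15, but you have numpy 1.14.5 which is incompatible.
"""


def version_key(text):
    """Turn '1.19.3' into (1, 19, 3) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def strictest_lower_bound(pip_output, package="numpy"):
    """Return the highest '>=' bound pip reported for `package`, or None."""
    pattern = re.compile(r"requires " + re.escape(package) + r">=([0-9.]+[0-9])")
    bounds = pattern.findall(pip_output)
    return max(bounds, key=version_key) if bounds else None


if __name__ == "__main__":
    print(strictest_lower_bound(PIP_CONFLICTS))  # prints 1.19.3
```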

$ python3 -c "import numpy
print(numpy.__version__)
import torch
print(torch.__version__)
print('CUDA available: ' + str(torch.cuda.is_available()))
print('cuDNN version: ' + str(torch.backends.cudnn.version()))
a = torch.cuda.FloatTensor(2).zero_()
print('Tensor a = ' + str(a))
b = torch.randn(2).cuda()
print('Tensor b = ' + str(b))
c = a + b*2
print('Tensor c = ' + str(c))
"

The result: NumPy 1.14.5 does not crash Python when PyTorch is imported. Next I wanted to try NumPy 1.19.5 again, just to see; if it failed, I would remove it.

Unfortunately, this brought back "Illegal instruction". So, following pip's compatibility hints above, I installed NumPy 1.19.3 and tested again:

$ python3 -c "import numpy  
print(numpy.__version__)  
import torch  
print(torch.__version__)"  
1.19.3  
1.10.0

No problems!

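Installing the earlier NumPy versions required building a wheel after the download; later versions, from 1.15.4 on, are downloaded directly as prebuilt whl packages. That shows how badly "outdated" the earlier versions really are.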