"Illegal instruction" when running import torch on Jetson

A record of how I resolved an "Illegal instruction" error on Jetson that was caused by an incompatible NumPy version.

The problem and first attempts

On a Jetson Xavier NX, running import torch in Python suddenly started failing with "Illegal instruction", as the following session shows:

$ python 
Python 3.6.9 (default, Mar 10 2023, 16:46:00) 
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch 
Illegal instruction

After the error, Python crashes outright.

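Web searches and AI answers almost all gave the same cause and fix: "This is usually caused by an installed PyTorch wheel that is not fully compatible with the CPU architecture of the Jetson Xavier NX or with the JetPack version. To solve it, use the prebuilt PyTorch wheels that NVIDIA officially provides for the Jetson platform; these wheels are optimized for the ARM aarch64 architecture and for specific JetPack versions."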

After removing the installed PyTorch and TorchVision with pip3 uninstall torch torchvision, I downloaded torch following NVIDIA's official instructions, then installed and tested it:

$ wget https://nvidia.box.com/shared/static/fjtbno0vpo676a25cgvuqc1wty0fkkg6.whl -O torch-1.10.0-cp36-cp36m-linux_aarch64.whl 
$ pip3 install torch-1.10.0-cp36-cp36m-linux_aarch64.whl 
Defaulting to user installation because normal site-packages is not writeable 
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple 
Processing ./torch-1.10.0-cp36-cp36m-linux_aarch64.whl 
Requirement already satisfied: dataclasses in ./.local/lib/python3.6/site-packages (from torch==1.10.0) (0.8) 
Requirement already satisfied: typing-extensions in ./.local/lib/python3.6/site-packages (from torch==1.10.0) (4.1.1) 
Installing collected packages: torch 
Successfully installed torch-1.10.0 
$ python 
Python 3.6.9 (default, Mar 10 2023, 16:46:00) 
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
Illegal instruction

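Clearly the installation itself was fine: every officially supported version installed successfully, every time. Yet import torch crashed immediately regardless. And yes, I also tried the commonly suggested fix of adding OPENBLAS_CORETYPE=ARMV8 to .bashrc and reloading it.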
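
A check like this can be repeated without crashing the interactive shell by importing the module in a child interpreter and looking at how the child ended. The following is only a minimal sketch under that assumption; the helper name probe_import is hypothetical, not something from the original troubleshooting. On POSIX systems a negative return code from subprocess means the child was killed by a signal, and "Illegal instruction" corresponds to SIGILL.

```python
import os
import signal
import subprocess
import sys


def probe_import(module, extra_env=None):
    """Import `module` in a child interpreter and describe how the child ended."""
    env = dict(os.environ)
    env.update(extra_env or {})
    proc = subprocess.run(
        [sys.executable, "-c", "import " + module],
        env=env,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    if proc.returncode == 0:
        return "ok"
    if proc.returncode < 0:
        # On POSIX, a negative return code means the child died from a signal.
        try:
            name = signal.Signals(-proc.returncode).name
        except ValueError:
            name = "signal {}".format(-proc.returncode)
        return "killed by " + name
    return "failed with exit code {}".format(proc.returncode)


if __name__ == "__main__":
    # Module names are taken from the command line, e.g.: numpy torch
    for name in sys.argv[1:]:
        print(name, "->", probe_import(name))
        print(name, "with OPENBLAS_CORETYPE=ARMV8 ->",
              probe_import(name, {"OPENBLAS_CORETYPE": "ARMV8"}))
```

Saved under any file name and run with numpy and torch as arguments, it would report each module with and without the environment variable, which makes it easy to see whether the variable changes anything on a given machine.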

The fix that actually worked

With no answer in sight, and before resorting to reflashing the system, I decided to take one last look myself at Python's verbose output:

$ python -v
import _frozen_importlib # frozen 
import _imp # builtin 
import sys # builtin 
import '_warnings' # <class '_frozen_importlib.BuiltinImporter'>
.........
# trying /usr/lib/python3.6/dist-packages/usercustomize.pyc 
import 'site' # <_frozen_importlib_external.SourceFileLoader object at 0x7f90847d30> 
Python 3.6.9 (default, Mar 10 2023, 16:46:00) 
[GCC 8.4.0] on linux 
Type "help", "copyright", "credits" or "license" for more information. 
.........
# trying /usr/lib/python3.6/rlcompleter.py 
# /usr/lib/python3.6/__pycache__/rlcompleter.cpython-36.pyc matches /usr/lib/python3.6/rlcompleter.py 
# code object from '/usr/lib/python3.6/__pycache__/rlcompleter.cpython-36.pyc' 
import 'rlcompleter' # <_frozen_importlib_external.SourceFileLoader object at 0x7f9068fbe0> 
>>> import torch 
# trying /home/jetson/torch.cpython-36m-aarch64-linux-gnu.so 
# trying /home/jetson/torch.abi3.so 
# trying /home/jetson/torch.so 
# trying /home/jetson/torch.py
.........
# trying /home/jetson/.local/lib/python3.6/site-packages/numpy/core/overrides.py 
# /home/jetson/.local/lib/python3.6/site-packages/numpy/core/__pycache__/overrides.cpython-36.pyc matches /home/jetson/.local/lib/python3.6/site-packages/numpy/core/overrides.py 
# code object from '/home/jetson/.local/lib/python3.6/site-packages/numpy/core/__pycache__/overrides.cpython-36.pyc' 
# trying /home/jetson/.local/lib/python3.6/site-packages/numpy/core/_multiarray_umath.cpython-36m-aarch64-linux-gnu.so 
Illegal instruction
$

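There is a huge amount of output, but the last few lines gave me hope: Python hits the problem while loading the shared library _multiarray_umath.cpython-36m-aarch64-linux-gnu.so. My guess was that this NumPy library contains instructions that the Jetson's ARM CPU cannot recognize or execute. I asked the AI again, and it gave the same steps: uninstall first, then reinstall. The difference this time was that the numpy package was uninstalled as well.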
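
Scrolling to the end of that output by hand works, but the same idea can be sketched in a few lines: capture the interpreter's verbose output from a child process and report the last file it mentioned before dying. The function names below are hypothetical, and the line patterns are assumed from the log shown above; depending on the Python version, the "# trying" lines may only appear at a higher verbosity level.

```python
import re
import subprocess
import sys

# Matches the verbose-output lines that name a file on disk, as seen in the log.
_PATH_LINE = re.compile(r"^# (?:trying|code object from) '?(?P<path>[^']+?)'?\s*$")


def last_file_mentioned(verbose_log):
    """Return the last file path named in interpreter verbose output, or None."""
    last = None
    for line in verbose_log.splitlines():
        match = _PATH_LINE.match(line)
        if match:
            last = match.group("path")
    return last


def verbose_import_log(module):
    """Run `python -v -c "import module"` in a child and return its stderr text."""
    proc = subprocess.run(
        [sys.executable, "-v", "-c", "import " + module],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return proc.stderr
```

When the child crashes, last_file_mentioned(verbose_import_log("torch")) would point at the file being loaded at the time. When the import succeeds, the result is simply the last file the interpreter happened to touch and carries no special meaning.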

Uninstalling, installing, and testing:

$ pip3 uninstall torch torchvision torchaudio numpy  
Found existing installation: torch 1.8.0  
Uninstalling torch-1.8.0:  
 Would remove:  
   /home/jetson/.local/bin/convert-caffe2-to-onnx  
   /home/jetson/.local/bin/convert-onnx-to-caffe2  
   /home/jetson/.local/lib/python3.6/site-packages/caffe2/*  
   /home/jetson/.local/lib/python3.6/site-packages/torch-1.8.0.dist-info/*  
   /home/jetson/.local/lib/python3.6/site-packages/torch/*  
Proceed (Y/n)? y  
 Successfully uninstalled torch-1.8.0  
WARNING: Skipping torchvision as it is not installed.  
WARNING: Skipping torchaudio as it is not installed.  
Found existing installation: numpy 1.19.5  
Uninstalling numpy-1.19.5:  
 Would remove:  
   /home/jetson/.local/bin/f2py  
   /home/jetson/.local/bin/f2py3  
   /home/jetson/.local/bin/f2py3.6  
   /home/jetson/.local/lib/python3.6/site-packages/numpy-1.19.5.dist-info/*  
   /home/jetson/.local/lib/python3.6/site-packages/numpy.libs/libgfortran-daac5196.so.5.0.0  
   /home/jetson/.local/lib/python3.6/site-packages/numpy.libs/libopenblasp-r0-32ff4d91.3.13.so  
   /home/jetson/.local/lib/python3.6/site-packages/numpy.libs/libz-558a5e64.so.1.2.7  
   /home/jetson/.local/lib/python3.6/site-packages/numpy/*  
Proceed (Y/n)? y  
 Successfully uninstalled numpy-1.19.5
$ pip3 install torch-1.10.0-cp36-cp36m-linux_aarch64.whl  
Defaulting to user installation because normal site-packages is not writeable  
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple  
Requirement already satisfied: dataclasses in ./.local/lib/python3.6/site-packages (from torch==1.10.0) (0.8) 
Requirement already satisfied: typing-extensions in ./.local/lib/python3.6/site-packages (from torch==1.10.0) (4.1.1) 
Installing collected packages: torch 
Successfully installed torch-1.10.0 
$ python
Python 3.6.9 (default, Mar 10 2023, 16:46:00) 
[GCC 8.4.0] on linux 
Type "help", "copyright", "credits" or "license" for more information. 
>>> import torch 
>>> print(torch.__version__) 
1.10.0 
>>> import numpy 
>>> print(numpy.__version__) 
1.13.3 
>>> quit()
$ python3 -c " 
> import torch 
> print('PyTorch:', torch.__version__) 
> print('CUDA available:', torch.cuda.is_available()) 
> if torch.cuda.is_available(): 
>     print('Device:', torch.cuda.get_device_name(0)) 
> " 
PyTorch: 1.10.0 
CUDA available: True 
Device: Xavier

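So the root of the problem was that the numpy 1.19.5 installed in the user site-packages is incompatible with the torch build supported on the Jetson NX platform.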

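So why did it suddenly become incompatible? My guess is this: the NumPy that originally came with this Jetson Xavier NX system is version 1.13.3, and at some point someone installed NumPy 1.19.5 into the user site-packages. In the user environment, python3 pointed to python3.6 while pip3 pointed to the pip of python3.7. A few days ago I changed this so that both python3 and pip3 point to Python 3.6.9. I cannot be sure, but this is most likely the cause, because nothing else happened on the system in between.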
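
A guess like this can be checked by asking the interpreter where it would load a package from, without actually importing it (which matters when the import itself crashes). The sketch below is illustrative only; describe_module is a hypothetical helper, and it relies on importlib.util.find_spec and site.getusersitepackages from the standard library.

```python
import importlib.util
import site
import sys


def describe_module(name):
    """Report where `name` would be imported from, without importing it."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return {"found": False, "origin": None, "in_user_site": False}
    origin = spec.origin
    user_site = site.getusersitepackages()
    # True when the package lives under ~/.local/... rather than the system paths.
    in_user_site = bool(origin) and origin.startswith(user_site)
    return {"found": True, "origin": origin, "in_user_site": in_user_site}


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("version:", sys.version.split()[0])
    print("numpy:", describe_module("numpy"))
```

Running it with each python3 on the machine shows which interpreter sees which NumPy. Installing with python3 -m pip instead of a bare pip3 also ties pip to the interpreter that runs it, which avoids the python3/pip3 mismatch described above.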

Probing the edge of the crash

Later I found that the SciPy on the system needs NumPy 1.14.5, so I tried installing it:

$ pip3 install numpy==1.14.5
......
Successfully built numpy 
Installing collected packages: numpy 
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
imgaug 0.4.0 requires opencv-python, which is not installed. 
tifffile 2020.9.3 requires numpy>=1.15.1, but you have numpy 1.14.5 which is incompatible. 
seaborn 0.11.2 requires numpy>=1.15, but you have numpy 1.14.5 which is incompatible. 
scikit-image 0.17.2 requires numpy>=1.15.1, but you have numpy 1.14.5 which is incompatible. 
pandas 1.1.5 requires numpy>=1.15.4, but you have numpy 1.14.5 which is incompatible. 
opencv-contrib-python 4.5.5.62 requires numpy>=1.19.3; python_version >= "3.6" and platform_system == "Linux" and platform_machine == "aarch64", but you have numpy 1.14.5 which is incompatible.
imgaug 0.4.0 requires numpy>=1.15, but you have numpy 1.14.5 which is incompatible. 
Successfully installed numpy-1.14.5

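As the output shows, the strictest requirement asks for NumPy 1.19.3 or later, so I may need to go at least that high, yet the version that caused the problem earlier was NumPy 1.19.5. Either way, I tried the current setup first: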

$ python3 -c "import numpy 
print(numpy.__version__) 
import torch 
print(torch.__version__) 
print('CUDA available: ' + str(torch.cuda.is_available())) 
print('cuDNN version: ' + str(torch.backends.cudnn.version())) 
a = torch.cuda.FloatTensor(2).zero_() 
print('Tensor a = ' + str(a)) 
b = torch.randn(2).cuda() 
print('Tensor b = ' + str(b)) 
c = a + b*2 
print('Tensor c = ' + str(c)) "

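The result: NumPy 1.14.5 did not crash Python when importing PyTorch. Next I wanted to try NumPy 1.19.5 again. I just wanted to see what would happen; if it failed, I would remove it again.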

Unfortunately, this brought back "Illegal instruction". So, following pip's compatibility hints above, I installed NumPy 1.19.3 and tested again:

$ python3 -c "import numpy
print(numpy.__version__)
import torch
print(torch.__version__)"
1.19.3
1.10.0

No problems!
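
The choice of 1.19.3 follows from the pip conflict messages shown earlier: it is the highest of the minimum versions those packages ask for. A small sketch of that reasoning, with the minimums copied by hand from the messages (the helper names are hypothetical, and the simple dotted-number parsing is an assumption that holds for these version strings only):

```python
def parse_version(text):
    """Turn a dotted version such as '1.19.3' into a comparable tuple of ints."""
    return tuple(int(part) for part in text.split("."))


# Minimum NumPy versions copied from the pip conflict messages.
MINIMUMS = {
    "tifffile 2020.9.3": "1.15.1",
    "seaborn 0.11.2": "1.15",
    "scikit-image 0.17.2": "1.15.1",
    "pandas 1.1.5": "1.15.4",
    "opencv-contrib-python 4.5.5.62": "1.19.3",
    "imgaug 0.4.0": "1.15",
}


def strictest_minimum(minimums):
    """Return the (package, version) pair with the highest minimum version."""
    package = max(minimums, key=lambda name: parse_version(minimums[name]))
    return package, minimums[package]


def unmet(version, minimums):
    """List the packages whose minimum is higher than `version`."""
    have = parse_version(version)
    return sorted(name for name, low in minimums.items()
                  if parse_version(low) > have)


if __name__ == "__main__":
    print(strictest_minimum(MINIMUMS))
    print(unmet("1.14.5", MINIMUMS))
    print(unmet("1.19.3", MINIMUMS))
```

This only captures lower bounds. It says nothing about which builds crash on the Jetson, which still had to be found by trying them.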

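When installing the earlier, older NumPy versions, pip had to build a wheel after downloading; for the later versions, from 1.15.4 on, it downloaded a ready-made .whl package directly. Those earlier versions really are badly out of date.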

This article was published on 水景一页. Permalink: <http://cnzhx.net/blog/illegal-instruction-when-import-torch-on-jetson-nx/>. When reposting, please keep this notice and the link.
