When I ran import torch in Python on a Jetson Xavier NX, I suddenly hit "Illegal instruction", as the session and output below show:
$ python
Python 3.6.9 (default, Mar 10 2023, 16:46:00)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
Illegal instruction
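After the error, Python crashes outright.
Nearly everything I found online, and nearly every AI answer, pointed to the same cause and fix: "This is usually caused by the installed PyTorch wheel not being fully compatible with the Jetson Xavier NX's CPU architecture or JetPack version. To solve it, use the prebuilt PyTorch wheels that NVIDIA officially provides for the Jetson platform; these wheels are optimized for the ARM aarch64 architecture and for specific JetPack versions."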
pip3 uninstall torch torchvision
After uninstalling the currently installed PyTorch and TorchVision, I downloaded torch following NVIDIA's official instructions, then installed and tested it:
$ wget https://nvidia.box.com/shared/static/fjtbno0vpo676a25cgvuqc1wty0fkkg6.whl -O torch-1.10.0-cp36-cp36m-linux_aarch64.whl
$ pip3 install torch-1.10.0-cp36-cp36m-linux_aarch64.whl
Defaulting to user installation because normal site-packages is not writeable
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Processing ./torch-1.10.0-cp36-cp36m-linux_aarch64.whl
Requirement already satisfied: dataclasses in ./.local/lib/python3.6/site-packages (from torch==1.10.0) (0.8)
Requirement already satisfied: typing-extensions in ./.local/lib/python3.6/site-packages (from torch==1.10.0) (4.1.1)
Installing collected packages: torch
Successfully installed torch-1.10.0
$ python
Python 3.6.9 (default, Mar 10 2023, 16:46:00)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
Illegal instruction
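Clearly the installation itself is fine: any version officially listed as supported installs successfully every time. But as soon as I import torch, Python crashes. Of course, I had also tried the fix commonly suggested online: adding OPENBLAS_CORETYPE=ARMV8 to .bashrc and reloading it.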
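Before ruling that workaround out, it may be worth confirming that the variable actually reaches the Python process: a line in .bashrc only affects shells started (or re-sourced) afterwards, and the variable has to be exported to be visible to child processes. A minimal sketch, where the helper name openblas_coretype is purely illustrative:

```python
import os

def openblas_coretype(environ=None):
    """Return the OPENBLAS_CORETYPE value visible in `environ`, or None if unset."""
    environ = os.environ if environ is None else environ
    return environ.get("OPENBLAS_CORETYPE")

# Report what this interpreter actually sees.
value = openblas_coretype()
print("OPENBLAS_CORETYPE =", value if value is not None else "(not set)")
```

If this prints "(not set)" in the same shell where import torch crashes, the .bashrc change never took effect there.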
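The correct fix
I could not find an answer, so before re-flashing the system I decided to make one last attempt and look at Python's verbose debug output myself: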
$ python -vvv
import _frozen_importlib # frozen
import _imp # builtin
import sys # builtin
import '_warnings' # <class '_frozen_importlib.BuiltinImporter'>
.........
# trying /usr/lib/python3.6/dist-packages/usercustomize.pyc
import 'site' # <_frozen_importlib_external.SourceFileLoader object at 0x7f90847d30>
Python 3.6.9 (default, Mar 10 2023, 16:46:00)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
.........
# trying /usr/lib/python3.6/rlcompleter.py
# /usr/lib/python3.6/__pycache__/rlcompleter.cpython-36.pyc matches /usr/lib/python3.6/rlcompleter.py
# code object from '/usr/lib/python3.6/__pycache__/rlcompleter.cpython-36.pyc'
import 'rlcompleter' # <_frozen_importlib_external.SourceFileLoader object at 0x7f9068fbe0>
>>> import torch
# trying /home/jetson/torch.cpython-36m-aarch64-linux-gnu.so
# trying /home/jetson/torch.abi3.so
# trying /home/jetson/torch.so
# trying /home/jetson/torch.py
.........
# trying /home/jetson/.local/lib/python3.6/site-packages/numpy/core/overrides.py
# /home/jetson/.local/lib/python3.6/site-packages/numpy/core/__pycache__/overrides.cpython-36.pyc matches /home/jetson/.local/lib/python3.6/site-packages/numpy/core/overrides.py
# code object from '/home/jetson/.local/lib/python3.6/site-packages/numpy/core/__pycache__/overrides.cpython-36.pyc'
# trying /home/jetson/.local/lib/python3.6/site-packages/numpy/core/_multiarray_umath.cpython-36m-aarch64-linux-gnu.so
Illegal instruction
$
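The output is extremely long, but the last few lines gave me a glimmer of hope: Python hits the problem while loading the shared library _multiarray_umath.cpython-36m-aarch64-linux-gnu.so. My guess is that this numpy library contains instructions that the Jetson's ARM CPU cannot recognize or execute. I asked an AI first, and it gave me commands to uninstall and then reinstall. The difference this time was that the numpy package was uninstalled as well.
Uninstall, install, and test: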
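Rather than reading through the whole python -vvv log, the same question ("which import kills the interpreter?") can be asked one package at a time. The following is only a sketch of that idea, not what was actually run here: it imports each module in a fresh child interpreter and reports whether the child exited normally or was killed by a signal. The helper names run_snippet and probe_import are made up for illustration.

```python
import os
import signal
import subprocess
import sys

def run_snippet(code, extra_env=None):
    """Run `code` in a fresh interpreter and describe how the child ended."""
    env = dict(os.environ)
    env.update(extra_env or {})
    proc = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    if proc.returncode == 0:
        return "ok"
    if proc.returncode < 0:
        # On POSIX, a negative return code means the child died from a signal.
        return "killed by " + signal.Signals(-proc.returncode).name
    return "python error (exit code %d)" % proc.returncode

def probe_import(module, extra_env=None):
    """Try `import <module>` in a child process so a crash cannot take us down."""
    return run_snippet("import " + module, extra_env)
```

Calling probe_import("numpy") and then probe_import("torch") on the affected machine should show which of the two dies with SIGILL; passing extra_env={"OPENBLAS_CORETYPE": "ARMV8"} repeats the probe with that variable set for the child only.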
$ pip3 uninstall torch torchvision torchaudio numpy
Found existing installation: torch 1.8.0
Uninstalling torch-1.8.0:
Would remove:
/home/jetson/.local/bin/convert-caffe2-to-onnx
/home/jetson/.local/bin/convert-onnx-to-caffe2
/home/jetson/.local/lib/python3.6/site-packages/caffe2/*
/home/jetson/.local/lib/python3.6/site-packages/torch-1.8.0.dist-info/*
/home/jetson/.local/lib/python3.6/site-packages/torch/*
Proceed (Y/n)? y
Successfully uninstalled torch-1.8.0
WARNING: Skipping torchvision as it is not installed.
WARNING: Skipping torchaudio as it is not installed.
Found existing installation: numpy 1.19.5
Uninstalling numpy-1.19.5:
Would remove:
/home/jetson/.local/bin/f2py
/home/jetson/.local/bin/f2py3
/home/jetson/.local/bin/f2py3.6
/home/jetson/.local/lib/python3.6/site-packages/numpy-1.19.5.dist-info/*
/home/jetson/.local/lib/python3.6/site-packages/numpy.libs/libgfortran-daac5196.so.5.0.0
/home/jetson/.local/lib/python3.6/site-packages/numpy.libs/libopenblasp-r0-32ff4d91.3.13.so
/home/jetson/.local/lib/python3.6/site-packages/numpy.libs/libz-558a5e64.so.1.2.7
/home/jetson/.local/lib/python3.6/site-packages/numpy/*
Proceed (Y/n)? y
Successfully uninstalled numpy-1.19.5
$ pip3 install torch-1.10.0-cp36-cp36m-linux_aarch64.whl
Defaulting to user installation because normal site-packages is not writeable
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Processing ./torch-1.10.0-cp36-cp36m-linux_aarch64.whl
Requirement already satisfied: dataclasses in ./.local/lib/python3.6/site-packages (from torch==1.10.0) (0.8)
Requirement already satisfied: typing-extensions in ./.local/lib/python3.6/site-packages (from torch==1.10.0) (4.1.1)
Installing collected packages: torch
Successfully installed torch-1.10.0
jetson@SSD:~$ python
Python 3.6.9 (default, Mar 10 2023, 16:46:00)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.__version__)
1.10.0
>>> import numpy
>>> print(numpy.__version__)
1.13.3
>>> quit()
$ python3 -c "
> import torch
> print('PyTorch:', torch.__version__)
> print('CUDA available:', torch.cuda.is_available())
> if torch.cuda.is_available():
>     print('Device:', torch.cuda.get_device_name(0))
> "
PyTorch: 1.10.0
CUDA available: True
Device: Xavier
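This shows that the crux of the problem is that the numpy-1.19.5 installed in user space is incompatible with the torch build supported on the Jetson NX platform.
So why did it suddenly become incompatible? My guess is this: the numpy that originally came with this Jetson Xavier NX system is version 1.13.3, and at some point someone installed numpy 1.19.5 in user space. In user space, python3 pointed to python3.6 while pip3 pointed to the pip for python3.7; a few days ago I changed that so that both python3 and pip3 point to Python 3.6.9. I cannot be sure, but this is most likely the cause, because nothing else happened on the system in between.
Probing the edge of a crash
Later I found that the SciPy on this system needs NumPy 1.14.5, so I tried installing it: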
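One way to check a guess like this is to ask the interpreter where it would load numpy from, without actually importing it (importing is exactly what crashes). The sketch below relies on importlib.util.find_spec, which for a top-level package only locates the module and does not execute it. The helper names locate and in_user_site are illustrative.

```python
import importlib.util
import os
import site
import sys

def locate(module):
    """Return the file a top-level module would be loaded from, or None if not found."""
    spec = importlib.util.find_spec(module)
    if spec is None:
        return None
    return spec.origin

def in_user_site(path):
    """True if `path` lives under the per-user site-packages directory."""
    user_site = site.getusersitepackages()
    return path is not None and path.startswith(user_site + os.sep)

print("interpreter:", sys.executable)
for name in ("numpy", "torch"):
    path = locate(name)
    print(name, "->", path, "(user site)" if in_user_site(path) else "")
```

Running pip as python3 -m pip --version additionally shows which interpreter a given pip belongs to, which is the mismatch suspected above.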
$ pip3 install numpy==1.14.5
......
Successfully built numpy
Installing collected packages: numpy
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
imgaug 0.4.0 requires opencv-python, which is not installed.
tifffile 2020.9.3 requires numpy>=1.15.1, but you have numpy 1.14.5 which is incompatible.
seaborn 0.11.2 requires numpy>=1.15, but you have numpy 1.14.5 which is incompatible.
scikit-image 0.17.2 requires numpy>=1.15.1, but you have numpy 1.14.5 which is incompatible.
pandas 1.1.5 requires numpy>=1.15.4, but you have numpy 1.14.5 which is incompatible.
opencv-contrib-python 4.5.5.62 requires numpy>=1.19.3; python_version >= "3.6" and platform_system == "Linux" and platform_machine == "aarch64", but you have numpy 1.14.5 which is incompatible.
imgaug 0.4.0 requires numpy>=1.15, but you have numpy 1.14.5 which is incompatible.
Successfully installed numpy-1.14.5
As the messages show, I may need to go as high as NumPy 1.19.3, yet the version that caused the problem earlier was NumPy 1.19.5. Either way, I tried it first:
$ python3 -c "import numpy
print(numpy.__version__)
import torch
print(torch.__version__)
print('CUDA available: ' + str(torch.cuda.is_available()))
print('cuDNN version: ' + str(torch.backends.cudnn.version()))
a = torch.cuda.FloatTensor(2).zero_()
print('Tensor a = ' + str(a))
b = torch.randn(2).cuda()
print('Tensor b = ' + str(b))
c = a + b*2
print('Tensor c = ' + str(c))
"
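The result: NumPy 1.14.5 did not make Python crash when importing PyTorch. Next I decided to try NumPy 1.19.5 again. I just wanted to see what would happen, and I could always remove it if it failed.
Unfortunately, this hit "Illegal instruction" once more. So, following pip's compatibility messages above, I installed NumPy 1.19.3 and tested again: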
$ python3 -c "import numpy
print(numpy.__version__)
import torch
print(torch.__version__)"
1.19.3
1.10.0
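No problem!
When installing the earlier NumPy versions, pip had to build a wheel after downloading the source; for the later versions, from 1.15.4 on, it downloads a prebuilt whl package directly. That shows just how badly outdated those earlier versions are.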
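To keep the results straight, the combinations tried in this post can be written down as a small lookup table. This is only a record of what was observed with torch 1.10.0 on this one machine; versions not listed were never tested, and the names TESTED and numpy_status are illustrative.

```python
# Outcome of "import torch" (torch 1.10.0) with each NumPy version tried in this post.
TESTED = {
    "1.13.3": True,
    "1.14.5": True,
    "1.19.3": True,
    "1.19.5": False,  # "Illegal instruction" on import
}

def numpy_status(version):
    """Describe what this post observed for a given NumPy version string."""
    if version not in TESTED:
        return "untested"
    return "works" if TESTED[version] else "crashes"

for version in sorted(TESTED, key=lambda v: tuple(int(p) for p in v.split("."))):
    print(version, "->", numpy_status(version))
```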
This article was published on 水景一页. Permalink: <http://cnzhx.net/blog/illegal-instruction-when-import-torch-on-jetson-nx/>. Please keep this notice and the link when reposting.