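This post records how I resolved an "Illegal instruction" crash on Jetson that was caused by an incompatible NumPy version.
Problem and attempted fixes
On a Jetson Xavier NX, running import torch in Python suddenly started failing with "Illegal instruction", as the session below shows: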
$ python
Python 3.6.9 (default, Mar 10 2023, 16:46:00)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
Illegal instruction
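After the error, Python crashes outright.
Nearly everything I found online, and nearly every answer from AI, pointed to the same cause and the same fix: "This is usually caused by an installed PyTorch wheel that is not fully compatible with the Jetson Xavier NX's CPU architecture or JetPack version. To fix it, use the prebuilt PyTorch wheels that NVIDIA officially provides for the Jetson platform; these wheels are optimized for the ARM aarch64 architecture and for specific JetPack versions."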
pip3 uninstall torch torchvision
After uninstalling the currently installed PyTorch and TorchVision, I downloaded torch following NVIDIA's official instructions, then installed and tested it:
$ wget https://nvidia.box.com/shared/static/fjtbno0vpo676a25cgvuqc1wty0fkkg6.whl -O torch-1.10.0-cp36-cp36m-linux_aarch64.whl
$ pip3 install torch-1.10.0-cp36-cp36m-linux_aarch64.whl
Defaulting to user installation because normal site-packages is not writeable
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Processing ./torch-1.10.0-cp36-cp36m-linux_aarch64.whl
Requirement already satisfied: dataclasses in ./.local/lib/python3.6/site-packages (from torch==1.10.0) (0.8)
Requirement already satisfied: typing-extensions in ./.local/lib/python3.6/site-packages (from torch==1.10.0) (4.1.1)
Installing collected packages: torch
Successfully installed torch-1.10.0
$ python
Python 3.6.9 (default, Mar 10 2023, 16:46:00)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
Illegal instruction
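Clearly the installation itself is fine: every version on the official support list installs successfully every time. But import torch crashes immediately. And yes, I also tried the commonly suggested fix of adding OPENBLAS_CORETYPE=ARMV8 to .bashrc and reloading it.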
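Because the crash kills the interpreter outright, it can help to attempt the import in a child process and look at how that process died. The following is a minimal sketch and not part of the original troubleshooting session: `probe_import` is a hypothetical helper name, and it assumes a POSIX system, where a child killed by a signal reports a negative return code. It also allows trying OPENBLAS_CORETYPE=ARMV8 for a single attempt without touching .bashrc.

```python
import os
import signal
import subprocess
import sys


def probe_import(module, extra_env=None):
    """Try `import <module>` in a child interpreter and describe the outcome.

    Returns "ok", "killed by <SIGNAL>", or "failed (exit code N)".
    """
    env = dict(os.environ)
    env.update(extra_env or {})
    result = subprocess.run(
        [sys.executable, "-c", "import " + module],
        env=env,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    if result.returncode == 0:
        return "ok"
    if result.returncode < 0:
        # On POSIX a negative return code means the child died from a signal;
        # an "Illegal instruction" crash shows up as SIGILL.
        return "killed by " + signal.Signals(-result.returncode).name
    return "failed (exit code {})".format(result.returncode)


if __name__ == "__main__":
    for name in ("numpy", "torch"):
        print(name, "plain:", probe_import(name))
        print(name, "with OPENBLAS_CORETYPE=ARMV8:",
              probe_import(name, {"OPENBLAS_CORETYPE": "ARMV8"}))
```

If numpy alone is already reported as killed by SIGILL, the problem is probably not in the PyTorch wheel.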
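The fix that actually worked
With no answer in sight, and before resorting to reflashing the system, I decided to make one last attempt and look at Python's verbose debugging output myself: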
$ python -v
import _frozen_importlib # frozen
import _imp # builtin
import sys # builtin
import '_warnings' # <class '_frozen_importlib.BuiltinImporter'>
.........
# trying /usr/lib/python3.6/dist-packages/usercustomize.pyc
import 'site' # <_frozen_importlib_external.SourceFileLoader object at 0x7f90847d30>
Python 3.6.9 (default, Mar 10 2023, 16:46:00)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
.........
# trying /usr/lib/python3.6/rlcompleter.py
# /usr/lib/python3.6/__pycache__/rlcompleter.cpython-36.pyc matches /usr/lib/python3.6/rlcompleter.py
# code object from '/usr/lib/python3.6/__pycache__/rlcompleter.cpython-36.pyc'
import 'rlcompleter' # <_frozen_importlib_external.SourceFileLoader object at 0x7f9068fbe0>
>>> import torch
# trying /home/jetson/torch.cpython-36m-aarch64-linux-gnu.so
# trying /home/jetson/torch.abi3.so
# trying /home/jetson/torch.so
# trying /home/jetson/torch.py
.........
# trying /home/jetson/.local/lib/python3.6/site-packages/numpy/core/overrides.py
# /home/jetson/.local/lib/python3.6/site-packages/numpy/core/__pycache__/overrides.cpython-36.pyc matches /home/jetson/.local/lib/python3.6/site-packages/numpy/core/overrides.py
# code object from '/home/jetson/.local/lib/python3.6/site-packages/numpy/core/__pycache__/overrides.cpython-36.pyc'
# trying /home/jetson/.local/lib/python3.6/site-packages/numpy/core/_multiarray_umath.cpython-36m-aarch64-linux-gnu.so
Illegal instruction
$
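The output is extremely long, but the last few lines gave me hope: Python hits the problem while loading the shared library _multiarray_umath.cpython-36m-aarch64-linux-gnu.so. My guess was that this NumPy library contains instructions from an instruction set that the Jetson's ARM CPU cannot recognize or execute. I asked AI again, and it told me to uninstall and then reinstall. The difference this time was that the numpy package was uninstalled as well.
Uninstalling, installing, and testing: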
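Scrolling through the verbose trace by hand works, but the last "# trying" line can also be pulled out automatically. The sketch below is an illustration only: `last_attempted_path` and `trace_import` are hypothetical helper names. It passes `-vv` instead of `-v` on the assumption that some CPython versions print the "# trying" lines only at the second verbosity level. The last path tried is a hint about where the crash happened, not proof.

```python
import subprocess
import sys


def last_attempted_path(verbose_log):
    """Return the path from the last '# trying <path>' line, or None."""
    prefix = "# trying "
    last = None
    for line in verbose_log.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            last = line[len(prefix):]
    return last


def trace_import(module):
    """Import `module` in a child interpreter with verbose import tracing.

    Returns (returncode, last path the import system tried).
    """
    result = subprocess.run(
        [sys.executable, "-vv", "-c", "import " + module],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,  # the verbose trace is written to stderr
        universal_newlines=True,
    )
    return result.returncode, last_attempted_path(result.stderr)


if __name__ == "__main__":
    code, path = trace_import("torch")
    print("return code:", code)
    print("last file tried:", path)
```

A negative return code together with a path inside numpy/core would match the situation described here.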
$ pip3 uninstall torch torchvision torchaudio numpy
Found existing installation: torch 1.8.0
Uninstalling torch-1.8.0:
Would remove:
/home/jetson/.local/bin/convert-caffe2-to-onnx
/home/jetson/.local/bin/convert-onnx-to-caffe2
/home/jetson/.local/lib/python3.6/site-packages/caffe2/*
/home/jetson/.local/lib/python3.6/site-packages/torch-1.8.0.dist-info/*
/home/jetson/.local/lib/python3.6/site-packages/torch/*
Proceed (Y/n)? y
Successfully uninstalled torch-1.8.0
WARNING: Skipping torchvision as it is not installed.
WARNING: Skipping torchaudio as it is not installed.
Found existing installation: numpy 1.19.5
Uninstalling numpy-1.19.5:
Would remove:
/home/jetson/.local/bin/f2py
/home/jetson/.local/bin/f2py3
/home/jetson/.local/bin/f2py3.6
/home/jetson/.local/lib/python3.6/site-packages/numpy-1.19.5.dist-info/*
/home/jetson/.local/lib/python3.6/site-packages/numpy.libs/libgfortran-daac5196.so.5.0.0
/home/jetson/.local/lib/python3.6/site-packages/numpy.libs/libopenblasp-r0-32ff4d91.3.13.so
/home/jetson/.local/lib/python3.6/site-packages/numpy.libs/libz-558a5e64.so.1.2.7
/home/jetson/.local/lib/python3.6/site-packages/numpy/*
Proceed (Y/n)? y
Successfully uninstalled numpy-1.19.5
$ pip3 install torch-1.10.0-cp36-cp36m-linux_aarch64.whl
Defaulting to user installation because normal site-packages is not writeable
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Requirement already satisfied: dataclasses in ./.local/lib/python3.6/site-packages (from torch==1.10.0) (0.8)
Requirement already satisfied: typing-extensions in ./.local/lib/python3.6/site-packages (from torch==1.10.0) (4.1.1)
Installing collected packages: torch
Successfully installed torch-1.10.0
jetson@SSD:~$ python
Python 3.6.9 (default, Mar 10 2023, 16:46:00)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.__version__)
1.10.0
>>> import numpy
>>> print(numpy.__version__)
1.13.3
>>> quit()
$ python3 -c "
> import torch
> print('PyTorch:', torch.__version__)
> print('CUDA available:', torch.cuda.is_available())
> if torch.cuda.is_available():
> print('Device:', torch.cuda.get_device_name(0))
> "
PyTorch: 1.10.0
CUDA available: True
Device: Xavier
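As this shows, the root of the problem was that the user-space numpy-1.19.5 is incompatible with the torch builds supported on the Jetson NX platform. Once it was removed, Python fell back to the system's NumPy 1.13.3 and the import succeeded.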
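So why did it suddenly become incompatible? My guess is this: the NumPy that originally came with this Jetson Xavier NX system is 1.13.3, and at some point someone installed NumPy 1.19.5 in user space. In user space, python3 pointed to python3.6 while pip3 pointed to the pip of Python 3.7; a few days earlier I had changed this so that both python3 and pip3 point to Python 3.6.9. I am not certain, but this is most likely the cause, since nothing else changed on the system in the meantime.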
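A quick way to see which copy of a package a given interpreter would pick up is to ask the import system for the module's location without importing it. This is a minimal sketch under my own assumptions: `describe_module` is a hypothetical helper name, and it only handles top-level module names such as numpy or torch.

```python
import importlib.util
import os
import site
import sys


def describe_module(name):
    """Report where top-level module `name` would be imported from.

    The module itself is not executed, so this is safe to run even when
    importing it would crash the interpreter.
    """
    spec = importlib.util.find_spec(name)
    if spec is None:
        return {"module": name, "origin": None, "in_user_site": False}
    origin = spec.origin or ""
    user_site = site.getusersitepackages()
    return {
        "module": name,
        "origin": origin,
        # True when the module lives under ~/.local/... and therefore
        # shadows a system-wide copy of the same package.
        "in_user_site": origin.startswith(user_site + os.sep),
    }


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("version:", sys.version.split()[0])
    for module_name in ("numpy", "torch"):
        print(describe_module(module_name))
```

Running pip as `python3 -m pip ...` instead of `pip3 ...` also ties each install to the interpreter that will import the package, which avoids the python3/pip3 mismatch described above.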
Poking at the edge of the crash
Later I found that the SciPy on this system requires NumPy 1.14.5, so I tried installing it:
$ pip3 install numpy==1.14.5
......
Successfully built numpy
Installing collected packages: numpy
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
imgaug 0.4.0 requires opencv-python, which is not installed.
tifffile 2020.9.3 requires numpy>=1.15.1, but you have numpy 1.14.5 which is incompatible.
seaborn 0.11.2 requires numpy>=1.15, but you have numpy 1.14.5 which is incompatible.
scikit-image 0.17.2 requires numpy>=1.15.1, but you have numpy 1.14.5 which is incompatible.
pandas 1.1.5 requires numpy>=1.15.4, but you have numpy 1.14.5 which is incompatible.
opencv-contrib-python 4.5.5.62 requires numpy>=1.19.3; python_version >= "3.6" and platform_system == "Linux" and platform_machine == "aarch64", but you have numpy 1.14.5 which is incompatible.
imgaug 0.4.0 requires numpy>=1.15, but you have numpy 1.14.5 which is incompatible.
Successfully installed numpy-1.14.5
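The conflict lines that pip prints can be read mechanically to find the strictest lower bound on NumPy. The sketch below is only an illustration: the function names are hypothetical, the regular expression assumes the exact "requires numpy>=X" wording seen in this output, and SAMPLE is a subset of those lines. It says nothing about upper bounds or about which builds run on the Jetson CPU.

```python
import re

SAMPLE = """\
imgaug 0.4.0 requires opencv-python, which is not installed.
tifffile 2020.9.3 requires numpy>=1.15.1, but you have numpy 1.14.5 which is incompatible.
pandas 1.1.5 requires numpy>=1.15.4, but you have numpy 1.14.5 which is incompatible.
opencv-contrib-python 4.5.5.62 requires numpy>=1.19.3; python_version >= "3.6" and platform_system == "Linux" and platform_machine == "aarch64", but you have numpy 1.14.5 which is incompatible.
"""

# "<package> <version> requires numpy>=<minimum>"
_PATTERN = re.compile(
    r"^(\S+) \S+ requires numpy>=([0-9]+(?:\.[0-9]+)*)", re.MULTILINE
)


def numpy_lower_bounds(pip_output):
    """Map package name -> minimum NumPy version as a tuple of ints."""
    bounds = {}
    for package, version in _PATTERN.findall(pip_output):
        bounds[package] = tuple(int(part) for part in version.split("."))
    return bounds


def strictest_bound(pip_output):
    """Return (package, bound) with the highest lower bound, or None."""
    bounds = numpy_lower_bounds(pip_output)
    if not bounds:
        return None
    return max(bounds.items(), key=lambda item: item[1])


if __name__ == "__main__":
    name, bound = strictest_bound(SAMPLE)
    # prints: opencv-contrib-python needs numpy >= 1.19.3
    print("{} needs numpy >= {}".format(name, ".".join(map(str, bound))))
```

Since these are all lower bounds, 1.19.3 is the smallest version that satisfies every package listed.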
From this, I may need to go as high as NumPy 1.19.3, yet the version that caused the problem earlier was NumPy 1.19.5. Either way, test first:
$ python3 -c "import numpy
print(numpy.__version__)
import torch
print(torch.__version__)
print('CUDA available: ' + str(torch.cuda.is_available()))
print('cuDNN version: ' + str(torch.backends.cudnn.version()))
a = torch.cuda.FloatTensor(2).zero_()
print('Tensor a = ' + str(a))
b = torch.randn(2).cuda()
print('Tensor b = ' + str(b))
c = a + b*2
print('Tensor c = ' + str(c))
"
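The result: NumPy 1.14.5 did not crash Python when importing PyTorch. Next I wanted to try NumPy 1.19.5 again. I just wanted to see what would happen; if it failed, I would remove it again.
Unfortunately, "Illegal instruction" came back. So, following pip's compatibility messages above, I installed NumPy 1.19.3 and tested again: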
$ python3 -c "import numpy
print(numpy.__version__)
import torch
print(torch.__version__)"
1.19.3
1.10.0
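No problems!
When installing the earlier, older NumPy versions, pip had to build a wheel after downloading; for the later versions, 1.15.4 and up, it downloaded a prebuilt whl package directly. The older versions really are badly "outdated".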
This article was published on 水景一页. Permalink: <http://cnzhx.net/blog/illegal-instruction-when-import-torch-on-jetson-nx/>. When reposting, please keep this notice and the corresponding link.