Need some help replicating wheel building for 2.0.0+cu117
Hi everyone. I’m currently working on a project that is pinned to PyTorch 2.0.0+cu117. During the course of the project I ran into some problems, identified the root cause, and came up with a solution to patch the behavior.
I’m currently in the process of upstreaming the patch, but until it is reviewed, accepted, and merged, that’s going to take some time. So in the meantime I decided to apply my patch on top of 2.0.0+cu117, rebuild the wheel, and host it internally so that my colleagues can benefit from the fix. This brings us to my problem.
Notwithstanding the changes made in the patch, I’m struggling to create a wheel package with the exact same contents as torch-2.0.0+cu117-cp39-cp39-linux_x86_64.whl, distributed from https://download.pytorch.org/whl/torch/. I was wondering if someone with more CI/release knowledge could help me identify what I’m missing.
So currently I’ve made the following changes to the Dockerfile in the root folder:
diff --git Dockerfile Dockerfile
index e6ade308499..b50cbea3ea4 100644
--- Dockerfile
+++ Dockerfile
@@ -40,7 +40,7 @@ COPY requirements.txt .
RUN chmod +x ~/miniconda.sh && \
bash ~/miniconda.sh -b -p /opt/conda && \
rm ~/miniconda.sh && \
- /opt/conda/bin/conda install -y python=${PYTHON_VERSION} cmake conda-build pyyaml numpy ipython && \
+ /opt/conda/bin/conda install -y python=${PYTHON_VERSION} cmake conda-build mkl-include pyyaml numpy ipython && \
/opt/conda/bin/python -mpip install -r requirements.txt && \
/opt/conda/bin/conda clean -ya
@@ -50,13 +50,18 @@ COPY . .
RUN git submodule update --init --recursive
FROM conda as build
+ARG PYTORCH_BUILD_VERSION
+ARG PYTORCH_BUILD_NUMBER="0"
+ENV PYTORCH_BUILD_VERSION=${PYTORCH_BUILD_VERSION}
+ENV PYTORCH_BUILD_NUMBER=${PYTORCH_BUILD_NUMBER}
WORKDIR /opt/pytorch
COPY --from=conda /opt/conda /opt/conda
COPY --from=submodule-update /opt/pytorch /opt/pytorch
RUN --mount=type=cache,target=/opt/ccache \
- TORCH_CUDA_ARCH_LIST="3.5 5.2 6.0 6.1 7.0+PTX 8.0" TORCH_NVCC_FLAGS="-Xfatbin -compress-all" \
+ TORCH_CUDA_ARCH_LIST="5.2 6.0 6.1 7.0+PTX 7.5 8.0" TORCH_NVCC_FLAGS="-Xfatbin -compress-all" \
CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" \
- python setup.py install
+ BUILD_TEST=0 \
+ python setup.py install bdist_wheel
FROM conda as conda-installs
ARG PYTHON_VERSION=3.8
@@ -101,3 +106,4 @@ WORKDIR /workspace
FROM official as dev
# Should override the already installed version from the official-image stage
COPY --from=build /opt/conda /opt/conda
+COPY --from=build /opt/pytorch /opt/pytorch
and then I’m kicking off the build process with:
DOCKER_BUILDKIT=1 docker build . --build-arg PYTHON_VERSION=3.9 --build-arg PYTORCH_BUILD_VERSION=2.0.0r1+cu117 --build-arg BASE_IMAGE=nvidia/cuda:11.7.1-cudnn8-devel-ubuntu18.04 --tag pytorch:2.0.0r1-cu117
After 3 hours I have a wheel inside /opt/pytorch/dist that I can pull out of the container.
I’ve compared the contents of torch-2.0.0+cu117-cp39-cp39-linux_x86_64.whl (original) with torch-2.0.0r1+cu117-cp39-cp39-linux_x86_64.whl (patched), and I’ve noticed I’m missing a couple of crucial libraries, as you can see below (note that .so objects common to both wheels are not displayed). Most of them are CUDA libraries that I expected to be present and available in the base image nvidia/cuda:11.7.1-cudnn8-devel-ubuntu18.04, plus something related to gomp.
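In case it helps anyone reproduce the comparison: since a .whl is just a zip archive, the diff of bundled shared objects can be scripted with the standard library. This is only a sketch of how I compared the two wheels; the filenames in the example are the ones mentioned above.

```python
import zipfile


def so_members(wheel_path):
    """Return the set of shared-object members bundled in a wheel (a .whl is a zip)."""
    with zipfile.ZipFile(wheel_path) as whl:
        return {
            name
            for name in whl.namelist()
            if name.endswith(".so") or ".so." in name  # also catch versioned libs, e.g. .so.11
        }


def diff_wheels(original, patched):
    """List .so members present in the original wheel but missing from the patched one."""
    return sorted(so_members(original) - so_members(patched))


# Example usage with the two wheels compared above:
# for missing in diff_wheels("torch-2.0.0+cu117-cp39-cp39-linux_x86_64.whl",
#                            "torch-2.0.0r1+cu117-cp39-cp39-linux_x86_64.whl"):
#     print(missing)
```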
Any idea what I’m missing to produce the same wheel as the one in production? I’ve skimmed through the repo’s GitHub workflows, but I’m struggling to pinpoint exactly how the release wheels are built.
I appreciate any help,
Sérgio