Need some help replicating the Python wheel for 2.0.0+cu117

Hi everyone. I’m currently working on a project that is pinned to PyTorch 2.0.0+cu117. During the course of the project, I ran into some problems, identified the root cause, and came up with a solution to patch the behavior.

I’m currently in the process of upstreaming the patch, but until it is reviewed, accepted, and merged (:crossed_fingers:), that’s going to take some time. So in the meantime, I decided to apply my patch on top of 2.0.0+cu117, rebuild the wheel, and host it internally so that my colleagues can benefit from the fix. This brings us to my problem.

Notwithstanding the changes made in the patch, I’m struggling to create a wheel with exactly the same contents as torch-2.0.0+cu117-cp39-cp39-linux_x86_64.whl, as distributed from https://download.pytorch.org/whl/torch/. I was wondering if someone with more CI/release knowledge could help identify what I’m missing.

So far, I’ve made the following changes to the Dockerfile in the root folder:

diff --git Dockerfile Dockerfile
index e6ade308499..b50cbea3ea4 100644
--- Dockerfile
+++ Dockerfile
@@ -40,7 +40,7 @@ COPY requirements.txt .
 RUN chmod +x ~/miniconda.sh && \
     bash ~/miniconda.sh -b -p /opt/conda && \
     rm ~/miniconda.sh && \
-    /opt/conda/bin/conda install -y python=${PYTHON_VERSION} cmake conda-build pyyaml numpy ipython && \
+    /opt/conda/bin/conda install -y python=${PYTHON_VERSION} cmake conda-build mkl-include pyyaml numpy ipython && \
     /opt/conda/bin/python -mpip install -r requirements.txt && \
     /opt/conda/bin/conda clean -ya
 
@@ -50,13 +50,18 @@ COPY . .
 RUN git submodule update --init --recursive
 
 FROM conda as build
+ARG PYTORCH_BUILD_VERSION
+ARG PYTORCH_BUILD_NUMBER="0"
+ENV PYTORCH_BUILD_VERSION=${PYTORCH_BUILD_VERSION}
+ENV PYTORCH_BUILD_NUMBER=${PYTORCH_BUILD_NUMBER}
 WORKDIR /opt/pytorch
 COPY --from=conda /opt/conda /opt/conda
 COPY --from=submodule-update /opt/pytorch /opt/pytorch
 RUN --mount=type=cache,target=/opt/ccache \
-    TORCH_CUDA_ARCH_LIST="3.5 5.2 6.0 6.1 7.0+PTX 8.0" TORCH_NVCC_FLAGS="-Xfatbin -compress-all" \
+    TORCH_CUDA_ARCH_LIST="5.2 6.0 6.1 7.0+PTX 7.5 8.0" TORCH_NVCC_FLAGS="-Xfatbin -compress-all" \
     CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" \
-    python setup.py install
+    BUILD_TEST=0 \
+    python setup.py install bdist_wheel
 
 FROM conda as conda-installs
 ARG PYTHON_VERSION=3.8
@@ -101,3 +106,4 @@ WORKDIR /workspace
 FROM official as dev
 # Should override the already installed version from the official-image stage
 COPY --from=build /opt/conda /opt/conda
+COPY --from=build /opt/pytorch /opt/pytorch

and then I’m kicking off the build process with

DOCKER_BUILDKIT=1 docker build . --build-arg PYTHON_VERSION=3.9 --build-arg PYTORCH_BUILD_VERSION=2.0.0r1+cu117 --build-arg BASE_IMAGE=nvidia/cuda:11.7.1-cudnn8-devel-ubuntu18.04 --tag pytorch:2.0.0r1-cu117

After 3 hours (:smiling_face_with_tear:) I have a wheel inside /opt/pytorch/dist that I can pull out of the container.

I’ve compared the contents of torch-2.0.0+cu117-cp39-cp39-linux_x86_64.whl (original) with torch-2.0.0r1+cu117-cp39-cp39-linux_x86_64.whl (patched) and noticed that I’m missing a couple of crucial libraries (common .so objects between the two wheels are not shown). Most of them are CUDA libraries that I expected to be present and available in the base image nvidia/cuda:11.7.1-cudnn8-devel-ubuntu18.04, plus something related to gomp.
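
For reference, a wheel is just a zip archive, so the comparison can be reproduced by listing and diffing the bundled libraries, e.g.:

unzip -Z1 torch-2.0.0+cu117-cp39-cp39-linux_x86_64.whl 'torch/lib/*' | sort > original.txt
unzip -Z1 torch-2.0.0r1+cu117-cp39-cp39-linux_x86_64.whl 'torch/lib/*' | sort > patched.txt
diff original.txt patched.txt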

Any idea what I’m missing to produce the same wheel as the production one? I’ve skimmed through the repo’s GitHub workflows, but I’m struggling to pinpoint exactly how the wheels for the project are built.

I appreciate any help,
Sérgio

Hi Sergio,

Did you see our wheel build-scripts here: GitHub - pytorch/builder: Continuous builder and binary build scripts for pytorch
These are the entry-point scripts to build our wheels: https://github.com/pytorch/builder/tree/main/manywheel
See the README or the build_all_docker.sh

I think what you are missing is this logic, which unzips the wheel, copies over a bunch of dependencies, and then rezips the wheel and fixes the wheel RECORD file.
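
Roughly, and heavily simplified (the real logic lives in the linked script), the post-build step does something like:

unzip torch-2.0.0+cu117-cp39-cp39-linux_x86_64.whl -d unpacked
cp /usr/local/cuda/lib64/libcudart.so.11.0 unpacked/torch/lib/      # plus cublas, cudnn, nvrtc, gomp, ...
patchelf --set-rpath '$ORIGIN' unpacked/torch/lib/libcudart.so.11.0  # make the bundled copy relocatable
# recompute hash + size of every file into torch-*.dist-info/RECORD, then:
(cd unpacked && zip -rq ../torch-repacked.whl .)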

I completely missed that entire repo. Thank you for the pointers. I’ll have a look.

That did the trick @smth. I finally had some time to revisit the issue. I’ll document the experience below in case someone else has the same need.

Context: in my case I wanted to build 2.0.0 with a custom patch. There are things here that are specific to building this particular version at this particular point in time; expect those to have changed by the time you read this.

Steps:

  1. After adding your patch on top of the PyTorch version you want to modify, you will need to tag the resulting commit. Reason: the build script currently does not accept setting the build version manually (see url) and will unconditionally look for a tag corresponding to your current git HEAD (in your pytorch repo).
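
    For example (the tag name below is just an illustration; upstream uses v-prefixed version tags like v2.0.0, and the version-parsing logic dictates the exact format it accepts):

    git tag v2.0.0r1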

  2. You need to build the Docker image that will be used to build and package PyTorch. To do that, clone GitHub - pytorch/builder: Continuous builder and binary build scripts for pytorch; the image you need from this repo depends on the type of package you’re after. I was interested in generating a package for PyTorch 2.0.0, Linux, with CUDA 11.7, for Python 3.9. At the time of writing this requires a couple of patches to work (see step 4).

  3. Check out the correct branch/ref of the builder repo for your PyTorch version.

    git checkout release/2.0
    
  4. Patch the build process if needed. For 2.0, I had to make the following changes:

    diff --git common/install_magma.sh common/install_magma.sh
    index b524c92..efaff05 100644
    --- common/install_magma.sh
    +++ common/install_magma.sh
    @@ -16,7 +16,7 @@ function do_install() {
            set -x
            tmp_dir=$(mktemp -d)
            pushd ${tmp_dir}
    -        wget -q https://anaconda.org/pytorch/magma-cuda${cuda_version_nodot}/${MAGMA_VERSION}/download/linux-64/${magma_archive}
    +        curl -OLs https://anaconda.org/pytorch/magma-cuda${cuda_version_nodot}/${MAGMA_VERSION}/download/linux-64/${magma_archive}
            tar -xvf "${magma_archive}"
            mkdir -p "${cuda_dir}/magma"
            mv include "${cuda_dir}/magma/include"
    diff --git common/install_patchelf.sh common/install_patchelf.sh
    index 032e3cc..869a20e 100644
    --- common/install_patchelf.sh
    +++ common/install_patchelf.sh
    @@ -2,7 +2,8 @@
    
    set -ex
    
    -git clone https://github.com/NixOS/patchelf
    +# see https://github.com/pytorch/pytorch/issues/97266
    +git clone -b 0.17.2 https://github.com/NixOS/patchelf
    cd patchelf
    sed -i 's/serial/parallel/g' configure.ac
    ./bootstrap.sh
    diff --git manywheel/build_docker.sh manywheel/build_docker.sh
    index b807490..0dbbbf7 100755
    --- manywheel/build_docker.sh
    +++ manywheel/build_docker.sh
    @@ -33,7 +33,7 @@ case ${GPU_ARCH_TYPE} in
            DOCKER_TAG=cuda${GPU_ARCH_VERSION}
            LEGACY_DOCKER_IMAGE=${DOCKER_REGISTRY}/pytorch/manylinux-cuda${GPU_ARCH_VERSION//./}
            # Keep this up to date with the minimum version of CUDA we currently support
    -        GPU_IMAGE=nvidia/cuda:10.2-devel-centos7
    +        GPU_IMAGE=nvidia/cuda:11.7.1-devel-centos7
            DEVTOOLSET_VERSION="9"
            if [[ ${GPU_ARCH_VERSION:0:2} == "10" ]]; then
                DEVTOOLSET_VERSION="7"
    diff --git manywheel/build_scripts/build_utils.sh manywheel/build_scripts/build_utils.sh
    index 5b8216a..5b5035d 100755
    --- manywheel/build_scripts/build_utils.sh
    +++ manywheel/build_scripts/build_utils.sh
    @@ -4,7 +4,7 @@
    # XXX: the official https server at www.openssl.org cannot be reached
    # with the old versions of openssl and curl in Centos 5.11 hence the fallback
    # to the ftp mirror:
    -OPENSSL_DOWNLOAD_URL=ftp://ftp.openssl.org/source/old/1.1.1/
    +OPENSSL_DOWNLOAD_URL=https://ftp.openssl.org/source/old/1.1.1/
    # Ditto the curl sources
    CURL_DOWNLOAD_URL=http://curl.askapache.com/download
    

    More recent versions might work out of the box.

  5. We’re now ready to build the image. This is what I ran for my particular needs.

    TOPDIR=$(git rev-parse --show-toplevel)
    CUDA_VERSION=11.7
    GPU_ARCH_TYPE=cuda GPU_ARCH_VERSION="${CUDA_VERSION}" "${TOPDIR}/manywheel/build_docker.sh"
    

    At the end, the new image will be tagged as docker.io/pytorch/manylinux-cuda117 and docker.io/pytorch/manylinux-builder:cuda11.7-2.0. Take note of the image name and tag.
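
    You can confirm the resulting tags with, e.g.:

    docker image ls | grep manylinux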

  6. The next step is to replicate what is being done by CI, namely in here and here. From a terminal, run:

    export BINARY_ENV_FILE=/tmp/env
    export BUILD_ENVIRONMENT=linux-binary-manywheel
    export BUILDER_ROOT=/builder
    export BUILDER_HOST=<the path to your pytorch/builder repo>
    export DESIRED_CUDA=cu117
    export DESIRED_PYTHON="3.9"
    export DOCKER_IMAGE=docker.io/pytorch/manylinux-builder:cuda11.7-2.0 # the image built in the previous step
    export GPU_ARCH_VERSION=11.7
    export GPU_ARCH_TYPE=cuda
    export PYTORCH_FINAL_PACKAGE_DIR=/artifacts
    export PACKAGE_TYPE=manywheel
    export PYTORCH_HOST=<the path to your pytorch/pytorch repo>
    export PYTORCH_ROOT=/pytorch
    export SKIP_ALL_TESTS=1
    
    mkdir -p /tmp/artifacts/
    container_name=$(docker run \
        -e BINARY_ENV_FILE \
        -e BUILDER_ROOT \
        -e BUILD_ENVIRONMENT \
        -e DESIRED_CUDA \
        -e DESIRED_PYTHON \
        -e GITHUB_ACTIONS \
        -e GPU_ARCH_TYPE \
        -e GPU_ARCH_VERSION \
        -e PACKAGE_TYPE \
        -e PYTORCH_FINAL_PACKAGE_DIR \
        -e PYTORCH_ROOT \
        -e SKIP_ALL_TESTS \
        -e PYTORCH_EXTRA_INSTALL_REQUIREMENTS \
        --tty \
        --detach \
        -v "${PYTORCH_HOST}:/pytorch" \
        -v "${BUILDER_HOST}:/builder" \
        -v "/tmp/artifacts:/artifacts" \
        -w / \
        "${DOCKER_IMAGE}"
    )
    docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh"
    docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash /builder/${PACKAGE_TYPE}/build.sh"
    
  7. Get your wheel from inside /tmp/artifacts/
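
    To sanity-check the result (using a glob since the exact filename depends on the tag you created in step 1):

    pip install /tmp/artifacts/torch-*.whl
    python -c "import torch; print(torch.__version__, torch.version.cuda)"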

That’s it.
