Proposal: Add a fast url_large_download() helper for large file downloads (multipart + threading) #6236
Replies: 3 comments 2 replies
-
+1. I was facing similar issues: large guest image bootstrapping was failing in our lab due to urlopen limitations. I was able to fix that issue via #6237, which replaces urllib with requests for streaming downloads. That is more reliable, but it is an external dependency.
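For context, a requests-based streaming download along the lines described here might look like the following. This is a minimal sketch, not the actual code from PR #6237; the function name and parameters are illustrative:

```python
# Hypothetical sketch of a chunked streaming download using the
# third-party "requests" library (names are illustrative, not the
# actual PR #6237 implementation).
import requests


def stream_download(url, dest, chunk_size=1024 * 1024):
    """Download `url` to `dest` in chunks instead of one big read."""
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(dest, "wb") as out:
            # iter_content yields the body incrementally, so multi-GB
            # files never have to fit in memory at once.
            for chunk in resp.iter_content(chunk_size=chunk_size):
                out.write(chunk)
    return dest
```

The key difference from a plain `urlopen().read()` is that the body is consumed incrementally, which avoids holding multi-GB images in memory and tends to cope better with slow or flaky connections.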
-
Hi @bssrikanth, Thank you for your insights on the download improvements and for highlighting the urlopen limitations. I’ve been experimenting with another direction: a multi-threaded multipart downloader using standard-library HTTP Range requests. It splits the file into segments and downloads them in parallel, which gives a noticeable performance boost for large files. Since this method stays within the Python stdlib, it doesn’t add new dependencies and remains consistent with Avocado’s current urllib usage. If you think this approach is valuable, I would be glad to co-author your PR. I would appreciate your thoughts on whether this direction aligns with the project’s expectations. I tried it out with this server … urllib has its own limitations, like …
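To make the idea concrete, here is a minimal stdlib-only sketch of the segmented approach. This is illustrative only, not the actual patch: the function name, the fixed part count, and the whole-segment-in-memory reads are simplifications.

```python
# Sketch: download a file by fetching byte ranges in parallel threads
# using only the Python standard library (urllib + HTTP Range headers).
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def multipart_download(url, dest, parts=4):
    """Download `url` to `dest` in `parts` parallel byte-range fetches."""
    # Probe total size and Range support with a HEAD request.
    head = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(head) as resp:
        size = int(resp.headers["Content-Length"])
        if resp.headers.get("Accept-Ranges") != "bytes":
            raise RuntimeError("server does not support Range requests")

    chunk = size // parts

    def fetch(index):
        # The last segment absorbs the remainder of the division.
        start = index * chunk
        end = size - 1 if index == parts - 1 else start + chunk - 1
        req = urllib.request.Request(
            url, headers={"Range": f"bytes={start}-{end}"})
        with urllib.request.urlopen(req) as resp:
            return start, resp.read()

    # Worker threads fetch the segments; the main thread merges them
    # into the pre-sized destination file at their byte offsets.
    with open(dest, "wb") as out:
        out.truncate(size)
        with ThreadPoolExecutor(max_workers=parts) as pool:
            for start, data in pool.map(fetch, range(parts)):
                out.seek(start)
                out.write(data)
    return size
```

A production version would stream each segment to disk instead of buffering it (so memory use stays bounded), and retry individual segments on failure rather than restarting the whole download.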
-
Hi Sooraj, Thank you for the patch and the detailed implementation! I tested it with my original use case (bootstrapping large guest image files, often several GB in size), and it works really well – no more intermittent connection resets or aborts that I was seeing with the plain urlopen() approach. A couple of observations from the testing/user-experience side:
I agree that your approach solves the original problem while staying within the standard library. My only concern is the amount of new code we’d be adding and the long-term maintenance burden. Unless Avocado has a strict policy against third-party dependencies, a much smaller and still-reliable solution could be achieved with the requests library, as shown in PR #6237. That said, if the project prefers to avoid any new dependencies, I have no problem dropping my PR. I’d like to hear what the core team thinks. Thanks again.
-
Hello guys,
I’ve been exploring Avocado’s download utilities and noticed that while url_download() works well for small and medium files, there isn’t a built-in option optimized for downloading large files efficiently.
Right now, downloads are handled using a single HTTP stream via urllib, which becomes noticeably slow for multi-GB artifacts. For larger workloads, a multi-segment or multi-connection approach can significantly improve performance.
I’d like to contribute a new optional helper, something like url_large_download(), which would:
- perform multipart HTTP Range downloads
- use multiple threads to download file segments in parallel
- merge the segments into the final file
- fall back to wget or the existing method if the server does not support Range requests
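The fallback step above could be sketched like this. Everything here is hypothetical: `url_large_download()` does not exist yet, and a real patch would reuse Avocado's existing download helper for the fallback path rather than the inline single-stream copy shown:

```python
# Hypothetical control flow for the proposed helper; all names are
# illustrative. A real implementation would delegate the fallback to
# Avocado's existing url_download() utility.
import shutil
import urllib.request


def _range_supported(url):
    """HEAD the URL; return (size, True) if byte ranges are advertised."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        size = int(resp.headers.get("Content-Length") or 0)
        ok = resp.headers.get("Accept-Ranges") == "bytes" and size > 0
    return size, ok


def url_large_download(url, dest):
    size, ok = _range_supported(url)
    if not ok:
        # Server cannot serve partial content: fall back to a plain
        # single-stream download (the existing behavior).
        with urllib.request.urlopen(url) as resp, open(dest, "wb") as out:
            shutil.copyfileobj(resp, out)
        return dest
    # ... otherwise split [0, size) into `parts` byte ranges, download
    # them in parallel threads, and merge the segments into `dest`.
    raise NotImplementedError("multipart path omitted from this sketch")
```

Probing `Accept-Ranges` up front keeps the helper safe to use against any server: callers get the fast path when the server cooperates and the current behavior when it does not.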
This would not modify or break existing behavior — it would simply provide a faster option for workloads involving large files.
Before opening a PR, I wanted to ask:
Would you be open to adding such a helper to Avocado’s utilities?
If so, I can prepare an implementation, tests, and documentation aligned with the project’s style.
Additionally, I think aria2c is also a good option; what do you all think about it?
Thanks, and happy to discuss the design details!