Describe the bug
Some websites occasionally don't return the full HTML, but rather an HTML page consisting mostly of script elements. I first noticed this while working with BoersenZeitung.
How to reproduce
from fundus import PublisherCollection, Crawler
from fundus.logging import set_log_level
from logging import DEBUG
publisher = PublisherCollection.de.BoersenZeitung
crawler = Crawler(publisher)
set_log_level(DEBUG)
for article in crawler.crawl(max_articles=50, only_complete=False, error_handling="suppress"):
    print(article.html.responded_url)
    print(article.title)
    print("--------------------------------")
Expected behavior
I would expect a title to be parsed and printed consistently for every article.
Logs and Stack traces
No response
Screenshots
Logs in the first iteration:

Logs in the second iteration:

Additional Context
Here is an example of an incomplete HTML file: test.zip
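To make the symptom easier to spot while debugging, here is a minimal sketch of a heuristic that flags a response as "mostly script elements". It is not part of fundus; the names `looks_script_only` and `max_script_share` and the 0.9 threshold are my own assumptions, and it only uses the standard library `html.parser`.

```python
from html.parser import HTMLParser


class _ScriptShareParser(HTMLParser):
    """Counts non-whitespace text inside and outside <script>/<style> elements."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0      # > 0 while inside <script> or <style>
        self.script_chars = 0     # characters found inside script/style
        self.visible_chars = 0    # characters found everywhere else

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        stripped = data.strip()
        if not stripped:
            return
        if self._skip_depth:
            self.script_chars += len(stripped)
        else:
            self.visible_chars += len(stripped)


def looks_script_only(html: str, max_script_share: float = 0.9) -> bool:
    """Return True if the markup is empty or dominated by script/style text.

    Hypothetical helper: the threshold is a guess, not a tuned value.
    """
    parser = _ScriptShareParser()
    parser.feed(html)
    parser.close()
    total = parser.script_chars + parser.visible_chars
    if total == 0:
        return True
    return parser.script_chars / total > max_script_share
```

Assuming the raw markup of a crawled article is available as a string, calling `looks_script_only` on it inside the loop above should show whether a missing title coincides with a script-only response.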
Environment
python==3.9
aiohttp==3.8.6
aioitertools==0.11.0
aiosignal==1.3.1
async-timeout==4.0.3
attrs==23.2.0
black==23.1.0
Brotli==1.1.0
certifi==2024.2.2
chardet==5.2.0
charset-normalizer==3.3.2
click==8.1.7
colorama==0.4.6
cssselect==1.2.0
decorator==5.1.1
dict2xml==1.7.6
dill==0.3.8
exceptiongroup==1.2.0
FastWARC==0.14.5
feedparser==6.0.11
frozenlist==1.4.1
-e git+https://github.com/flairNLP/fundus.git@05cc97dd8be59ac05d89456ac0db39cddce74e02#egg=fundus
idna==3.6
iniconfig==2.0.0
isort==5.12.0
langdetect==1.0.9
lxml==4.9.4
more-itertools==9.1.0
multidict==6.0.4
mypy==1.9.0
mypy-extensions==1.0.0
numpy==1.26.4
packaging==23.2
pandas==2.2.2
pathspec==0.12.1
platformdirs==4.1.0
pluggy==1.4.0
pytest==7.2.2
python-dateutil==2.8.2
pytz==2024.1
requests==2.31.0
robotspy==0.10.0
sgmllib3k==1.0.0
six==1.16.0
tomli==2.0.1
tqdm==4.66.1
types-colorama==0.4.15.20240106
types-lxml==2023.2.11
types-python-dateutil==2.8.19.20240106
types-requests==2.28.11.17
types-urllib3==1.26.25.14
typing_extensions==4.9.0
tzdata==2024.1
urllib3==2.2.0
validators==0.28.0
xmltodict==0.14.1
yarl==1.9.4