Hi everyone,
I am building an automated scraper for an academic repository (https://repository.lib.cuhk.edu.hk/en/collection/etd/year), which organizes its records by year. However, we've hit a hard block.
Here is the situation:
- Aggressive CAPTCHA: The site uses a mandatory CAPTCHA (image verification) system that triggers for almost every visit, regardless of whether it's a real human browser or a bot.
- AI/Agent Failure: Our automated agents attempted to solve the image verification multiple times but were consistently blocked.
- JS-Heavy: The site relies heavily on JavaScript to load the actual item lists, making simple HTTP requests (requests/curl) useless.
My Questions:
- What is the state-of-the-art stack for handling such aggressive image CAPTCHAs in a fully automated Python pipeline today? (e.g., combining undetected-chromedriver with a specific solver service?)
- Since this is an academic repository, it likely supports OAI-PMH. Does anyone have experience bypassing the frontend completely by finding standard API endpoints (like /oai/request) on similar library systems?
- Any specific configuration recommendations for browser automation to reduce the CAPTCHA difficulty/frequency on this specific type of site?
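On the OAI-PMH point: that is usually the cleanest route, since the protocol is designed for machine harvesting and sidesteps the JS frontend entirely. A minimal sketch of the response-parsing side is below. Note the assumptions: the /oai/request path is only a common convention, not a confirmed endpoint for this repository (you would need to verify it with an Identify request first), and the embedded XML is a hand-written sample, not real output from the site.

```python
import xml.etree.ElementTree as ET

# Namespaces defined by the OAI-PMH spec and Dublin Core.
OAI_NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def parse_list_records(xml_text):
    """Extract (identifier, title) pairs and the resumptionToken, if any,
    from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", OAI_NS):
        ident = rec.findtext(".//oai:identifier", namespaces=OAI_NS)
        title = rec.findtext(".//dc:title", namespaces=OAI_NS)
        records.append((ident, title))
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    token = token_el.text if token_el is not None else None
    return records, token

# Hand-written sample response for illustration only.
sample = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repo:etd-001</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example Thesis</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

records, token = parse_list_records(sample)
# records -> [("oai:repo:etd-001", "Example Thesis")], token -> "page2"
```

In a live harvest you would fetch `{base}?verb=ListRecords&metadataPrefix=oai_dc` (optionally with `from`/`until` date filters, which maps nicely onto a by-year collection), then keep requesting `{base}?verb=ListRecords&resumptionToken=...` until no token is returned.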
Thanks in advance for any insights!