Skip to content

aeshamangukiya/experience-capture-engine

Repository files navigation

Enterprise Website Experience Capture Engine

Production-ready TypeScript Playwright automation platform for scalable website experience capture. Designed for enterprise-scale operations supporting 500+ websites with high reliability and maintainability.

Architecture

Core Components

  • BrowserManager: Manages browser lifecycle, launches browser once and reuses across all captures
  • CaptureEngine: Central capture logic handling desktop and mobile captures with screenshots, HTML, and metadata
  • NavigationManager: Handles safe navigation with retry logic, networkidle waits, and lazy-loading support
  • CookieHandler: Enterprise cookie acceptance logic with custom selector support
  • AutoScroll: Utility for triggering lazy-loaded content on ecommerce sites
  • OutputValidator: Validates capture output structure and file integrity

Project Structure

experience-capture-engine/
├── config/
│   └── sites.json          # Site configurations
├── src/
│   ├── core/               # Core capture engine components
│   │   ├── browserManager.ts
│   │   ├── captureEngine.ts
│   │   ├── navigation.ts
│   │   ├── cookieHandler.ts
│   │   ├── htmlExporter.ts
│   │   └── metadata.ts
│   ├── workflows/          # Workflow orchestrators
│   │   ├── landingCapture.ts
│   │   └── loginCapture.ts
│   ├── utils/              # Utility modules
│   │   ├── logger.ts
│   │   ├── fileSystem.ts
│   │   ├── retry.ts
│   │   ├── autoScroll.ts
│   │   ├── delays.ts
│   │   ├── outputValidator.ts
│   │   └── validate.ts
│   ├── types/              # TypeScript type definitions
│   │   └── index.ts
│   └── index.ts            # Main entry point
├── output/                 # Capture output directory
│   └── {SiteName}/
│       ├── landing/
│       │   ├── desktop.png
│       │   ├── mobile.png
│       │   ├── page.html
│       │   └── metadata.json
│       └── login/
│           └── ...
└── package.json

Features

Enterprise Hardening

  • Navigation Stability: Networkidle waits, 60s timeout, 3-attempt retry with exponential backoff
  • Lazy Loading Support: Auto-scroll utility triggers lazy-loaded content before screenshots
  • Browser Lifecycle: Single browser instance reused across all captures, isolated contexts per capture
  • Desktop + Mobile: Full-page screenshots for both desktop (1920x1080) and mobile (iPhone 13)
  • Cookie Handling: Comprehensive cookie selector support with custom selectors per site
  • Resilience: Continues execution if one site fails, captures error screenshots
  • Anti-Bot Measures: Realistic user agents, random delays between actions, configurable headless mode

Scalability Features

  • Configuration-Driven: JSON-based site configuration with optional custom settings
  • Parallelization-Ready: Architecture prepared for async processing queues
  • Output Validation: Automated validation of capture outputs
  • Structured Logging: Timestamped, color-coded logs with INFO/WARN/ERROR levels

Installation

npm install
npx playwright install chromium

Usage

Basic Commands

# Capture all sites (landing + login pages)
npm run capture

# Capture only landing pages
npm run capture:landing

# Capture only login pages
npm run capture:login

# Validate output structure
npm run validate

Environment Variables

  • PLAYWRIGHT_HEADLESS: Set to "true" or "false" to control headless mode (default: true)
PLAYWRIGHT_HEADLESS=false npm run capture

Configuration

Site Configuration (config/sites.json)

Basic configuration:

[
    {
        "name": "Nike",
        "landingUrl": "https://www.nike.com",
        "loginUrl": "https://www.nike.com/login"
    }
]

Advanced configuration with custom settings:

[
    {
        "name": "Nike",
        "landingUrl": "https://www.nike.com",
        "loginUrl": "https://www.nike.com/login",
        "cookieSelectors": [
            "button#accept-cookies",
            "[data-testid='cookie-accept']"
        ],
        "waitSelectors": [
            ".main-content",
            "#product-grid"
        ],
        "navigationTimeout": 90000,
        "skipMobile": false
    }
]

Configuration Options

  • name (required): Site name used for output directory
  • landingUrl (required): Landing page URL
  • loginUrl (required): Login page URL
  • cookieSelectors (optional): Custom cookie button selectors (tried before common selectors)
  • waitSelectors (optional): Selectors to wait for before capture (ensures page is ready)
  • navigationTimeout (optional): Custom navigation timeout in milliseconds (default: 60000)
  • skipMobile (optional): Skip mobile capture for this site (default: false)

Adding New Websites

  1. Add site configuration to config/sites.json:
{
    "name": "ExampleSite",
    "landingUrl": "https://example.com",
    "loginUrl": "https://example.com/login"
}
  1. Run capture:
npm run capture
  1. Output will be saved to output/ExampleSite/landing/ and output/ExampleSite/login/

Output Structure

Each capture produces:

  • desktop.png: Full-page desktop screenshot (1920x1080)
  • mobile.png: Full-page mobile screenshot (iPhone 13)
  • page.html: Fully rendered HTML after JavaScript execution
  • metadata.json: Capture metadata including URLs, timestamps, viewport, user agent

Metadata Example

{
    "siteName": "Nike",
    "type": "landing",
    "originalUrl": "https://www.nike.com",
    "finalUrl": "https://www.nike.com/us/en",
    "timestamp": "2026-02-13T06:30:12.731Z",
    "deviceType": "desktop",
    "captureDate": "February 13, 2026, 06:30 AM",
    "userAgent": "Mozilla/5.0...",
    "viewport": {
        "width": 1920,
        "height": 1080
    }
}

Validation

The validation command checks:

  • All required files exist (desktop.png, mobile.png, page.html, metadata.json)
  • Files are not empty
  • Metadata.json has valid structure
  • Output directory structure is correct
npm run validate

Scalability Notes

Current Architecture

  • Sequential processing: Sites are processed one at a time
  • Browser reuse: Single browser instance shared across all captures
  • Context isolation: Each capture uses isolated browser context

Future Parallelization

The architecture is prepared for parallelization:

  • Browser instance can support multiple concurrent contexts
  • No shared state conflicts between captures
  • Async processing queue can be added without architectural changes

Performance Considerations

  • Browser Launch: ~2-3 seconds (one-time cost)
  • Per Capture: ~15-30 seconds (navigation + stabilization + screenshots)
  • 500 Sites: ~2-4 hours sequential, ~15-30 minutes with 10x parallelization

Troubleshooting

Navigation Timeouts

If a site consistently times out:

  1. Increase navigationTimeout in site config
  2. Check if site requires authentication
  3. Verify URL is accessible

Cookie Banners Not Detected

  1. Add custom cookieSelectors to site config
  2. Check browser console for cookie banner selectors
  3. Cookie acceptance is non-blocking - captures continue even if cookies aren't accepted

Missing Screenshots

  1. Run validation: npm run validate
  2. Check error screenshots: *-error.png files indicate where capture failed
  3. Review logs for specific error messages

Browser Crashes

  1. Ensure Playwright browsers are installed: npx playwright install chromium
  2. Check system resources (memory, CPU)
  3. Try running with PLAYWRIGHT_HEADLESS=false to see browser behavior

Development

Build

npm run build

Lint

npm run lint

Clean

npm run clean

License

MIT

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors