Website Scraping Automation

Learn how to build intelligent web scraping workflows that can extract data from websites, monitor changes, and process information automatically.

🎯 What You'll Build

A comprehensive web scraping system that:

  • Extracts structured data from websites
  • Handles JavaScript-heavy dynamic content
  • Monitors website changes and updates
  • Processes and cleans extracted data
  • Stores data in databases or spreadsheets
  • Handles anti-scraping protections

📋 Requirements

  • n8n instance with browser automation capabilities
  • Puppeteer/Playwright for browser automation
  • Data storage (database, Google Sheets, etc.)
  • Proxy services (optional for large-scale scraping)
  • Basic understanding of HTML and CSS selectors

🔧 Workflow Overview

Key Components

  1. URL Manager - Manages scraping queue and URLs
  2. Browser Automation - Controls headless browser for content extraction
  3. Data Extractor - Parses and extracts structured data
  4. Data Processor - Cleans and transforms extracted data
  5. Change Detector - Identifies changes between scraping runs
  6. Storage Manager - Saves data to various destinations
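
One way to picture how these components fit together is as a chain of async stages, each consuming the output of the one before it. The sketch below is illustrative only: `runPipeline` and the stage names are hypothetical, and the stage bodies are stand-ins rather than the components built later in this guide.

```javascript
// Hypothetical helper: run named async stages in order, feeding each
// stage's output into the next and recording which stages completed.
async function runPipeline(stages, input) {
  let current = input;
  const completed = [];
  for (const [name, stage] of stages) {
    current = await stage(current);
    completed.push(name);
  }
  return { output: current, completed };
}

// Stand-in stages that mirror the first three components in the list above.
// They use canned HTML instead of a real browser (an assumption for the demo).
const stages = [
  ['urlManager', async (urls) => urls.filter(Boolean)],
  ['browserAutomation', async (urls) => urls.map(url => ({ url, html: '<h1>Demo</h1>' }))],
  ['dataExtractor', async (pages) =>
    pages.map(p => ({ url: p.url, title: p.html.replace(/<[^>]+>/g, '') }))]
];
```

Calling `runPipeline(stages, ['https://example.com'])` resolves to the extracted items plus the list of stages that ran, which makes it easy to see where a run stopped if a stage throws.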

📝 Step-by-Step Guide

1. URL and Site Management

  1. URL Queue System

    // Manage scraping queue with priorities
    class ScrapingQueue {
      constructor() {
        this.urls = new Map();
        this.priorities = { high: 3, medium: 2, low: 1 };
      }
    
      addUrl(url, options = {}) {
        const urlData = {
          url: url,
          priority: options.priority || 'medium',
          selector: options.selector,
          frequency: options.frequency || 'daily',
          lastScraped: options.lastScraped || null,
          metadata: options.metadata || {},
          retryCount: 0,
          maxRetries: options.maxRetries || 3
        };
    
        this.urls.set(url, urlData);
        return urlData;
      }
    
      getNextUrls(limit = 10) {
        const now = new Date();
        const readyUrls = Array.from(this.urls.values())
          .filter(urlData => this.isReadyToScrape(urlData, now))
          .sort((a, b) => this.priorities[b.priority] - this.priorities[a.priority])
          .slice(0, limit);
    
        return readyUrls;
      }
    
      isReadyToScrape(urlData, currentTime) {
        if (urlData.retryCount >= urlData.maxRetries) {
          return false;
        }
    
        if (!urlData.lastScraped) {
          return true;
        }
    
        const timeSinceLastScrape = currentTime - new Date(urlData.lastScraped);
        const scrapeInterval = this.getScrapeInterval(urlData.frequency);
    
        return timeSinceLastScrape >= scrapeInterval;
      }
    
      getScrapeInterval(frequency) {
        const intervals = {
          hourly: 60 * 60 * 1000,
          daily: 24 * 60 * 60 * 1000,
          weekly: 7 * 24 * 60 * 60 * 1000,
          monthly: 30 * 24 * 60 * 60 * 1000
        };
    
        return intervals[frequency] || intervals.daily;
      }
    }
    

  2. Site Configuration Management

    // Site-specific scraping configurations
    const siteConfigs = {
      'ecommerce.example.com': {
        selectors: {
          container: '.product-item', // per-product wrapper; DataExtractor reads selectors.container
          title: '.product-title',
          price: '.price',
          description: '.product-description',
          availability: '.stock-status',
          images: '.product-image img'
        },
        pagination: {
          nextSelector: '.next-page',
          maxPages: 10
        },
        delays: {
          pageLoad: 3000,
          betweenClicks: 1000
        },
        antiScraping: {
          userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
          viewport: { width: 1920, height: 1080 }
        }
      },
      'news.example.com': {
        selectors: {
          container: '.article', // per-article wrapper; DataExtractor reads selectors.container
          title: '.article-title',
          content: '.article-content',
          author: '.author',
          publishDate: '.publish-date',
          category: '.category'
        },
        pagination: {
          infiniteScroll: true,
          scrollDelay: 2000
        }
      }
    };
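
The queue and the site configurations above leave two small gaps: nothing maps a queued URL to its entry in `siteConfigs`, and nothing updates `lastScraped` or `retryCount` after a scrape attempt. The following is a minimal sketch of both, assuming configs are keyed by exact hostname as shown above. The function names `getSiteConfig` and `recordResult` are hypothetical helpers, not part of the original classes.

```javascript
// Look up the config whose key matches the URL's hostname.
// Returns null when no config is registered for that host.
function getSiteConfig(url, configs) {
  const { hostname } = new URL(url);
  return Object.prototype.hasOwnProperty.call(configs, hostname)
    ? configs[hostname]
    : null;
}

// Update a queue entry after a scrape attempt (hypothetical helper).
// Success stamps lastScraped and resets the retry counter, so the
// frequency check skips the URL until it is due again; failure counts
// toward maxRetries.
function recordResult(urlData, succeeded, now = new Date()) {
  if (succeeded) {
    urlData.lastScraped = now.toISOString();
    urlData.retryCount = 0;
  } else {
    urlData.retryCount += 1;
  }
  return urlData;
}
```

With these in place, a scheduler loop could call `getSiteConfig(entry.url, siteConfigs)` for each entry returned by `getNextUrls()` and pass the outcome to `recordResult`.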
    

2. Browser Automation

  1. Advanced Browser Control

    const puppeteer = require('puppeteer');

    // Sophisticated browser automation with Puppeteer.
    // Helper methods that are called below but not shown in this guide
    // (waitForContent, extractData, handlePagination, handleAuthentication,
    // solveCaptcha) must be supplied before this class will run.
    class WebScraper {
      constructor(config = {}) {
        this.config = {
          headless: config.headless !== false,
          viewport: config.viewport || { width: 1920, height: 1080 },
          userAgent: config.userAgent || 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
          timeout: config.timeout || 30000,
          ...config
        };
      }
    
      async scrapeUrl(url, siteConfig) {
        const browser = await puppeteer.launch({
          headless: this.config.headless,
          args: [
            '--no-sandbox',
            '--disable-setuid-sandbox',
            '--disable-dev-shm-usage',
            '--disable-accelerated-2d-canvas',
            '--no-first-run',
            '--no-zygote',
            '--disable-gpu'
          ]
        });
    
        try {
          const page = await browser.newPage();
          await this.setupPage(page, siteConfig);
    
          // Navigate to URL
          await page.goto(url, {
            waitUntil: 'networkidle2',
            timeout: this.config.timeout
          });
    
          // Handle anti-bot measures
          await this.handleAntiBotMeasures(page, siteConfig);
    
          // Wait for dynamic content
          await this.waitForContent(page, siteConfig);
    
          // Extract data
          const data = await this.extractData(page, siteConfig);
    
          // Handle pagination if needed
          if (siteConfig.pagination) {
            const additionalData = await this.handlePagination(page, siteConfig);
            data.push(...additionalData);
          }
    
          return data;
        } finally {
          await browser.close();
        }
      }
    
      async setupPage(page, siteConfig) {
        // Set user agent and viewport
        await page.setUserAgent(this.config.userAgent);
        await page.setViewport(this.config.viewport);
    
        // Handle cookies and authentication
        if (siteConfig.cookies) {
          await page.setCookie(...siteConfig.cookies);
        }
    
        if (siteConfig.authentication) {
          await this.handleAuthentication(page, siteConfig.authentication);
        }
    
        // Inject custom scripts if needed
        if (siteConfig.customScripts) {
          await page.evaluateOnNewDocument(siteConfig.customScripts);
        }
    
        // Set up request interception so unneeded resources can be blocked
        await page.setRequestInterception(true);
        page.on('request', (request) => {
          // Block unnecessary resources for speed
          const resourceType = request.resourceType();
          if (['image', 'stylesheet', 'font'].includes(resourceType)) {
            request.abort();
          } else {
            request.continue();
          }
        });
      }
    
      async handleAntiBotMeasures(page, siteConfig) {
        // Random mouse movements
        await this.simulateHumanBehavior(page);
    
        // Solve CAPTCHAs if needed
        if (siteConfig.antiScraping?.captcha) {
          await this.solveCaptcha(page, siteConfig.antiScraping.captcha);
        }
    
        // Handle rate limiting
        if (siteConfig.antiScraping?.rateLimit) {
          await this.delay(siteConfig.antiScraping.rateLimit);
        }
      }
    
      async simulateHumanBehavior(page) {
        // Random mouse movements
        const mouse = page.mouse;
        const viewport = page.viewport();
    
        for (let i = 0; i < 3; i++) {
          await mouse.move(
            Math.random() * viewport.width,
            Math.random() * viewport.height
          );
          await this.delay(100 + Math.random() * 200);
        }
    
        // Random scrolling
        await page.evaluate(() => {
          window.scrollBy(0, Math.random() * 200);
        });
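        await this.delay(500);
      }

      // Promise-based sleep used by the methods above
      delay(ms) {
        return new Promise(resolve => setTimeout(resolve, ms));
      }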
    }
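
`scrapeUrl` calls a pagination helper that is not shown. The sketch below covers only the `nextSelector` / `maxPages` case from the e-commerce config (not infinite scroll). It is written as a standalone function named `collectPaginated`, a hypothetical name, and takes a caller-supplied `extractPage` callback. It relies on `page.$`, `ElementHandle.click`, and `page.waitForNavigation` from Puppeteer.

```javascript
// Follow "next page" links and collect items from each subsequent page.
// `page` is expected to behave like a Puppeteer Page; `extractPage` is an
// async function supplied by the caller that returns the current page's items.
async function collectPaginated(page, pagination, extractPage) {
  const items = [];
  const maxPages = pagination.maxPages || 1;

  // The first page is assumed to have been scraped by the caller already,
  // so counting starts at page 2.
  for (let pageNumber = 2; pageNumber <= maxPages; pageNumber++) {
    const nextLink = await page.$(pagination.nextSelector);
    if (!nextLink) break; // no further pages

    // Start waiting for the navigation before clicking to avoid a race.
    await Promise.all([
      page.waitForNavigation({ waitUntil: 'networkidle2' }),
      nextLink.click()
    ]);

    const pageItems = await extractPage(page);
    items.push(...pageItems);
  }

  return items;
}
```

Sites that load the next page without a full navigation would need `page.waitForSelector` on the new content instead of `waitForNavigation`.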
    

3. Data Extraction

  1. Smart Data Extraction

    // Extract structured data from web pages
    class DataExtractor {
      constructor(siteConfig) {
        this.config = siteConfig;
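        // Extraction strategies by type. The extract* methods are not shown in this
        // guide; define them on the class first, or these .bind() calls will throw.
        this.extractors = {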
          text: this.extractText.bind(this),
          attribute: this.extractAttribute.bind(this),
          html: this.extractHTML.bind(this),
          table: this.extractTable.bind(this),
          list: this.extractList.bind(this),
          image: this.extractImage.bind(this)
        };
      }
    
      async extractData(page, config) {
        const results = [];
    
        // Find all main container elements
        const containers = await page.$$eval(
          config.selectors.container || 'body',
          elements => elements.map(el => el.outerHTML)
        );
    
        for (const containerHTML of containers) {
          const itemData = await this.extractItemData(page, containerHTML, config);
          if (itemData && Object.keys(itemData).length > 0) {
            results.push(itemData);
          }
        }
    
        return results;
      }
    
      async extractItemData(page, containerHTML, config) {
        const itemData = {};
        const tempContent = await page.evaluate((html, selectors) => {
          const tempDiv = document.createElement('div');
          tempDiv.innerHTML = html;
          document.body.appendChild(tempDiv);
    
          const data = {};
    
          // Extract each field based on configuration
          for (const [field, fieldConfig] of Object.entries(selectors)) {
            if (field === 'container') continue;
    
            try {
              const element = tempDiv.querySelector(fieldConfig);
              if (element) {
                data[field] = {
                  text: element.textContent?.trim(),
                  html: element.outerHTML,
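                  // `this` is not the DataExtractor inside page.evaluate (the callback
                  // runs in the browser), so the attributes are read inline
                  attributes: Object.fromEntries(
                    Array.from(element.attributes, attr => [attr.name, attr.value])
                  )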
                };
              }
            } catch (error) {
              console.warn(`Error extracting ${field}:`, error);
            }
          }
    
          tempDiv.remove();
          return data;
        }, containerHTML, config.selectors);
    
        // Process extracted data
        for (const [field, rawData] of Object.entries(tempContent)) {
          itemData[field] = this.processFieldData(field, rawData, config);
        }
    
        // Add metadata
        itemData.scraped_at = new Date().toISOString();
        itemData.source_url = page.url();
    
        return itemData;
      }
    
      processFieldData(fieldName, rawData, config) {
        let processedData = rawData.text;
    
        // Apply field-specific processing
        if (config.fieldProcessing && config.fieldProcessing[fieldName]) {
          const processing = config.fieldProcessing[fieldName];
    
          // Clean and transform data
          if (processing.clean) {
            processedData = this.cleanText(processedData, processing.clean);
          }
    
          // Parse specific formats
          if (processing.parse) {
            processedData = this.parseData(processedData, processing.parse);
          }
    
          // Validate data
          if (processing.validate) {
            const isValid = this.validateData(processedData, processing.validate);
            if (!isValid) {
              return { error: 'Validation failed', raw: processedData };
            }
          }
        }
    
        return {
          value: processedData,
          confidence: this.calculateConfidence(rawData),
          metadata: {
            html: rawData.html,
            attributes: rawData.attributes
          }
        };
      }
    
      cleanText(text, cleaningConfig) {
        if (!text) return '';
    
        let cleaned = text;
    
        // Remove extra whitespace
        if (cleaningConfig.whitespace !== false) {
          cleaned = cleaned.replace(/\s+/g, ' ').trim();
        }
    
        // Remove HTML entities
        if (cleaningConfig.htmlEntities !== false) {
          cleaned = cleaned.replace(/&[a-zA-Z0-9#]+;/g, '');
        }
    
        // Apply custom regex patterns
        if (cleaningConfig.regex) {
          for (const [pattern, replacement] of Object.entries(cleaningConfig.regex)) {
            cleaned = cleaned.replace(new RegExp(pattern, 'g'), replacement);
          }
        }
    
        return cleaned;
      }
    
      parseData(text, parseConfig) {
        switch (parseConfig.type) {
          case 'number':
            return this.parseNumber(text, parseConfig);
          case 'date':
            return this.parseDate(text, parseConfig);
          case 'currency':
            return this.parseCurrency(text, parseConfig);
          case 'url':
            return this.parseURL(text, parseConfig);
          case 'price':
            return this.parsePrice(text, parseConfig);
          default:
            return text;
        }
      }
    
      parsePrice(text, config) {
        // Extract the first price: a currency symbol followed by digits
        const priceRegex = /[$€£]\s*([0-9,]+\.?[0-9]*)/;
        const match = text.match(priceRegex);
    
        if (match) {
          const numericPrice = parseFloat(match[0].replace(/[^\d.]/g, ''));
          return {
            raw: match[0],
            value: numericPrice,
            currency: this.extractCurrency(match[0])
          };
        }
    
        return null;
      }
    }
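
`parsePrice` and `parseData` rely on helpers such as `extractCurrency` and `parseNumber` that are not shown. A minimal sketch of the two follows, written as standalone functions. The symbol-to-code mapping is an assumption (for example, `$` is treated as USD although several currencies use it), and the number format is assumed to use comma thousands separators with a dot decimal point.

```javascript
// Map a currency symbol found in a price string to an ISO 4217 code.
// Only the three symbols matched by parsePrice are covered; "$" -> USD
// is an assumption, since other currencies share that symbol.
function extractCurrency(priceText) {
  const symbols = { '$': 'USD', '€': 'EUR', '£': 'GBP' };
  for (const [symbol, code] of Object.entries(symbols)) {
    if (priceText.includes(symbol)) return code;
  }
  return null;
}

// Parse the first number in a string such as "1,299.50".
// Returns null when the text contains no digits.
function parseNumber(text) {
  const match = String(text).match(/-?\d[\d,]*(?:\.\d+)?/);
  if (!match) return null;
  return parseFloat(match[0].replace(/,/g, ''));
}
```

Sites that format prices as `1.299,50` would need the separators swapped before parsing, so the format is worth making a per-site setting.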
    

4. Data Processing and Storage

  1. Data Processing Pipeline

    // Process and clean scraped data
    class DataProcessor {
      constructor(config) {
        this.config = config;
        this.transformations = this.loadTransformations();
      }
    
      async processData(rawData, sourceConfig) {
        let processedData = rawData;
    
        // Apply transformations
        for (const transformation of this.transformations) {
          processedData = await this.applyTransformation(processedData, transformation);
        }
    
        // Deduplicate data
        processedData = await this.deduplicateData(processedData, sourceConfig);
    
        // Validate data quality
        const qualityReport = this.assessDataQuality(processedData);
    
        return {
          data: processedData,
          quality: qualityReport,
          processed_at: new Date().toISOString()
        };
      }
    
      async applyTransformation(data, transformation) {
        switch (transformation.type) {
          case 'normalize':
            return this.normalizeData(data, transformation.config);
          case 'enrich':
            return await this.enrichData(data, transformation.config);
          case 'filter':
            return this.filterData(data, transformation.config);
          case 'aggregate':
            return this.aggregateData(data, transformation.config);
          default:
            return data;
        }
      }
    
      async enrichData(data, config) {
        // data is an array of items (see deduplicateData), so enrich each one
        for (const item of data) {
          if (config.geocoding && item.address) {
            item.coordinates = await this.geocodeAddress(item.address);
          }

          if (config.sentiment && item.reviews) {
            item.sentiment_analysis = await this.analyzeSentiment(item.reviews);
          }

          if (config.classification && item.content) {
            item.category = await this.classifyContent(item.content);
          }
        }

        return data;
      }
    
      async deduplicateData(data, sourceConfig) {
        const seen = new Set();
        const deduplicated = [];
    
        for (const item of data) {
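          // Create unique key based on configurable fields. Extracted fields are
          // objects ({ value, confidence, ... }), so compare their values rather
          // than the objects themselves, which would all stringify identically.
          const keyFields = sourceConfig.deduplicationKey || ['title', 'price'];
          const key = keyFields
            .map(field => JSON.stringify(item[field]?.value ?? item[field] ?? null))
            .join('|');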
    
          if (!seen.has(key)) {
            seen.add(key);
            deduplicated.push(item);
          }
        }
    
        return deduplicated;
      }
    
      assessDataQuality(data) {
        const totalItems = data.length;
        const completeItems = data.filter(item => this.isComplete(item)).length;
        const validItems = data.filter(item => this.isValid(item)).length;
    
        return {
          total_items: totalItems,
          // Guard against division by zero when nothing was scraped
          completeness_rate: totalItems ? (completeItems / totalItems) * 100 : 0,
          validity_rate: totalItems ? (validItems / totalItems) * 100 : 0,
          average_confidence: this.calculateAverageConfidence(data),
          issues: this.identifyQualityIssues(data)
        };
      }
    }
    

  2. Multi-destination Storage

    // Store data to multiple destinations
    class StorageManager {
      constructor(config) {
        this.destinations = this.initializeDestinations(config);
      }
    
      async storeData(data, metadata) {
        const results = [];
    
        for (const [name, destination] of Object.entries(this.destinations)) {
          try {
            const result = await this.storeToDestination(data, destination, metadata);
            results.push({ destination: name, success: true, result });
          } catch (error) {
            results.push({
              destination: name,
              success: false,
              error: error.message
            });
          }
        }
    
        return results;
      }
    
      async storeToDestination(data, destination, metadata) {
        switch (destination.type) {
          case 'database':
            return await this.storeToDatabase(data, destination, metadata);
          case 'google_sheets':
            return await this.storeToGoogleSheets(data, destination, metadata);
          case 'json_file':
            return await this.storeToJSONFile(data, destination, metadata);
          case 'api':
            return await this.storeToAPI(data, destination, metadata);
          default:
            throw new Error(`Unknown destination type: ${destination.type}`);
        }
      }
    
      async storeToGoogleSheets(data, config, metadata) {
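        // Written against google-spreadsheet v3; v4 removed useServiceAccountAuth
        // in favor of passing an auth client to the constructor.
        const { GoogleSpreadsheet } = require('google-spreadsheet');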
        const doc = new GoogleSpreadsheet(config.spreadsheetId);
    
        await doc.useServiceAccountAuth({
          client_email: config.clientEmail,
          private_key: config.privateKey
        });
    
        await doc.loadInfo();
        const sheet = doc.sheetsByIndex[config.sheetIndex || 0];
    
        // Prepare rows for Google Sheets
        const rows = data.map(item => this.flattenForSheets(item));
    
        // Add rows in batches to avoid rate limits
        const batchSize = 100;
        for (let i = 0; i < rows.length; i += batchSize) {
          const batch = rows.slice(i, i + batchSize);
          await sheet.addRows(batch);
        }
    
        return {
          rows_added: rows.length,
          sheet_title: sheet.title,
          spreadsheet_url: `https://docs.google.com/spreadsheets/d/${config.spreadsheetId}`
        };
      }
    
      flattenForSheets(item) {
        const flattened = {};
    
        for (const [key, value] of Object.entries(item)) {
          if (typeof value === 'object' && value !== null) {
            // Handle nested objects
            for (const [subKey, subValue] of Object.entries(value)) {
              flattened[`${key}_${subKey}`] = subValue;
            }
          } else {
            flattened[key] = value;
          }
        }
    
        return flattened;
      }
    }
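
Of the destination types dispatched in `storeToDestination`, only Google Sheets is shown. As a point of comparison, here is a minimal sketch of the `json_file` case using Node's built-in `fs/promises`. The `destination.path` field and the shape of the returned summary are assumptions, and the function is written standalone rather than as a class method.

```javascript
const fs = require('fs/promises');
const path = require('path');

// Write scraped items plus run metadata to a JSON file.
// `destination.path` is an assumed config field naming the output file.
async function storeToJSONFile(data, destination, metadata = {}) {
  const payload = {
    metadata: { ...metadata, stored_at: new Date().toISOString() },
    items: data
  };

  // Create the parent directory if it does not exist yet
  await fs.mkdir(path.dirname(destination.path), { recursive: true });
  await fs.writeFile(destination.path, JSON.stringify(payload, null, 2), 'utf8');

  return { items_written: data.length, file: destination.path };
}
```

This version overwrites the file on every run. Keeping one file per run (for example, with a timestamp in the name) would make earlier snapshots available to the change detector.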
    

5. Change Detection and Monitoring

  1. Intelligent Change Detection

    // Detect changes between scraping runs
    class ChangeDetector {
      constructor(storage) {
        this.storage = storage;
        this.hashCache = new Map();
      }
    
      async detectChanges(newData, sourceUrl) {
        const previousData = await this.getPreviousData(sourceUrl);
        const changes = {
          added: [],
          removed: [],
          modified: [],
          unchanged: []
        };
    
        if (!previousData || previousData.length === 0) {
          changes.added = newData;
          return changes;
        }
    
        const previousMap = this.createDataMap(previousData);
        const newMap = this.createDataMap(newData);
    
        // Find new items
        for (const [key, newItem] of newMap.entries()) {
          if (!previousMap.has(key)) {
            changes.added.push(newItem);
          } else {
            const previousItem = previousMap.get(key);
            const changeAnalysis = this.compareItems(previousItem, newItem);
    
            if (changeAnalysis.hasChanges) {
              changes.modified.push({
                current: newItem,
                previous: previousItem,
                changes: changeAnalysis.changes
              });
            } else {
              changes.unchanged.push(newItem);
            }
          }
        }
    
        // Find removed items
        for (const [key, previousItem] of previousMap.entries()) {
          if (!newMap.has(key)) {
            changes.removed.push(previousItem);
          }
        }
    
        return changes;
      }
    
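      compareItems(item1, item2) {
        const changes = {};
        let hasChanges = false;

        // Ignore bookkeeping fields that differ on every run, and walk the fields
        // of both items so fields that exist only in item2 are also reported
        const ignoredFields = new Set(['scraped_at']);
        const fields = new Set([...Object.keys(item1), ...Object.keys(item2)]);

        for (const field of fields) {
          if (ignoredFields.has(field)) continue;

          const value = item1[field];
          if (JSON.stringify(value) !== JSON.stringify(item2[field])) {
            changes[field] = {
              from: value,
              to: item2[field],
              type: this.getChangeType(value, item2[field])
            };
            hasChanges = true;
          }
        }

        return { hasChanges, changes };
      }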
    
      getChangeType(oldValue, newValue) {
        if (typeof oldValue !== typeof newValue) {
          return 'type_change';
        }
    
        if (typeof oldValue === 'number') {
          const diff = newValue - oldValue;
          if (diff > 0) return 'increase';
          if (diff < 0) return 'decrease';
          return 'numerical_change';
        }
    
        if (typeof oldValue === 'string') {
          if (oldValue.length === 0 && newValue.length > 0) return 'added';
          if (oldValue.length > 0 && newValue.length === 0) return 'removed';
          return 'text_change';
        }
    
        return 'value_change';
      }
    }
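
`detectChanges` depends on `createDataMap`, which is not shown. What matters is that the key identifies an item across runs without including fields that are expected to change. If the price were part of the key, a price change would show up as one removal plus one addition instead of a modification. The sketch below is one possible version. The default key of `['title']` and the case-insensitive matching are assumptions.

```javascript
// Index items by a stable identity key so two runs can be compared.
// keyFields defaults to ['title'] (an assumption); prefer fields that
// identify an item but do not change, such as a SKU or canonical URL.
function createDataMap(items, keyFields = ['title']) {
  const map = new Map();

  for (const item of items) {
    const key = keyFields
      .map(field => {
        const fieldValue = item[field];
        // Extracted fields may be wrapped as { value, ... }
        const raw = fieldValue && typeof fieldValue === 'object' && 'value' in fieldValue
          ? fieldValue.value
          : fieldValue;
        return String(raw ?? '').trim().toLowerCase();
      })
      .join('|');

    map.set(key, item); // later duplicates overwrite earlier ones
  }

  return map;
}
```

Because later duplicates overwrite earlier ones, running deduplication before change detection keeps the comparison predictable.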
    

🚀 Advanced Features

Anti-Scraping Evasion

  1. Proxy Rotation and User Agent Management

    // Rotate proxies and user agents
    class AntiDetection {
      constructor() {
        this.proxies = this.loadProxyList();
        this.userAgents = this.loadUserAgents();
        this.currentProxyIndex = 0;
        this.currentUserAgentIndex = 0;
      }
    
      getProxy() {
        const proxy = this.proxies[this.currentProxyIndex];
        this.currentProxyIndex = (this.currentProxyIndex + 1) % this.proxies.length;
        return proxy;
      }
    
      getUserAgent() {
        const userAgent = this.userAgents[this.currentUserAgentIndex];
        this.currentUserAgentIndex = (this.currentUserAgentIndex + 1) % this.userAgents.length;
        return userAgent;
      }
    
      async getFingerprint() {
        // Generate realistic browser fingerprint
        return {
          userAgent: this.getUserAgent(),
          viewport: this.getRandomViewport(),
          timezone: this.getRandomTimezone(),
          language: this.getRandomLanguage(),
          platform: this.getRandomPlatform(),
          webgl: this.generateWebGLParams()
        };
      }
    
      getRandomViewport() {
        const viewports = [
          { width: 1920, height: 1080 },
          { width: 1366, height: 768 },
          { width: 1440, height: 900 },
          { width: 1536, height: 864 }
        ];
    
        return viewports[Math.floor(Math.random() * viewports.length)];
      }
    }
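
The class above picks a proxy but does not show how it reaches the browser. With Puppeteer, the usual route is Chromium's `--proxy-server` launch flag, with credentials supplied per page through `page.authenticate`. The sketch below only builds those values. The proxy object shape (`host`, `port`, `username`, `password`) and both function names are assumptions.

```javascript
// Build Puppeteer launch options that route traffic through a proxy.
// The proxy shape ({ host, port, username, password }) is assumed.
function buildLaunchOptions(proxy, baseArgs = []) {
  const args = [...baseArgs];
  if (proxy) {
    args.push(`--proxy-server=http://${proxy.host}:${proxy.port}`);
  }
  return { headless: true, args };
}

// Credentials are not part of the launch flag; they are passed to
// page.authenticate() on each new page instead.
function proxyCredentials(proxy) {
  if (!proxy || !proxy.username) return null;
  return { username: proxy.username, password: proxy.password };
}
```

A caller would pass the first result to `puppeteer.launch(...)` and, when the second is not null, call `await page.authenticate(credentials)` before navigating.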
    

AI-Powered Content Understanding

  1. Content Analysis with AI

    const OpenAI = require('openai');

    // Use AI to understand and categorize content
    class AIContentAnalyzer {
      constructor() {
        this.openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
      }
    
      async analyzePageContent(data) {
        const analysis = {};
    
        // Extract key information
        analysis.summary = await this.generateSummary(data);
        analysis.categories = await this.categorizeContent(data);
        analysis.entities = await this.extractEntities(data);
        analysis.sentiment = await this.analyzeSentiment(data);
    
        // Identify patterns and insights
        analysis.insights = await this.generateInsights(data);
    
        return analysis;
      }
    
      async generateSummary(data) {
        const content = this.concatenateContent(data);
    
        const prompt = `
        Summarize the following web content in 2-3 sentences:
        ${content.substring(0, 2000)}
        `;
    
        const response = await this.openai.chat.completions.create({
          model: "gpt-3.5-turbo",
          messages: [{ role: "user", content: prompt }],
          max_tokens: 150
        });
    
        return response.choices[0].message.content;
      }
    
      async categorizeContent(data) {
        const content = this.concatenateContent(data);
    
        const prompt = `
        Categorize the following content into one or more of these categories:
        - E-commerce/Product
        - News/Article
        - Blog
        - Forum/Discussion
        - Documentation
        - Social Media
        - Other
    
        Content: ${content.substring(0, 1000)}
    
        Return only the category names.
        `;
    
        const response = await this.openai.chat.completions.create({
          model: "gpt-3.5-turbo",
          messages: [{ role: "user", content: prompt }],
          max_tokens: 50
        });
    
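        // The model may separate categories with commas or newlines
        return response.choices[0].message.content
          .split(/[,\n]/)
          .map(c => c.replace(/^[-\s]+/, '').trim())
          .filter(Boolean);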
      }
    }
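
Both prompt builders call `concatenateContent`, which is not shown. A minimal sketch follows, assuming items look like the extractor's output, where each field is either a plain string or a `{ value: ... }` wrapper. The `maxChars` cap is an assumed parameter to keep prompts bounded.

```javascript
// Flatten scraped items into one plain-text string for use in a prompt.
// Accepts a single item or an array; non-string values are skipped.
function concatenateContent(data, maxChars = 4000) {
  const items = Array.isArray(data) ? data : [data];
  const parts = [];

  for (const item of items) {
    for (const [field, fieldValue] of Object.entries(item)) {
      const raw = fieldValue && typeof fieldValue === 'object'
        ? fieldValue.value
        : fieldValue;
      if (typeof raw === 'string' && raw.trim() !== '') {
        parts.push(`${field}: ${raw.trim()}`);
      }
    }
  }

  return parts.join('\n').slice(0, maxChars);
}
```

Labeling each line with its field name gives the model some structure to work with, at the cost of a few extra tokens per field.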
    

🧪 Testing and Monitoring

Comprehensive Testing

  1. Scraping Test Suite

    // Test scraping functionality
    class ScrapingTester {
      constructor() {
        this.testSites = this.loadTestSites();
      }
    
      async runAllTests() {
        const results = {};
    
        for (const site of this.testSites) {
          try {
            const result = await this.testSite(site);
            results[site.name] = result;
          } catch (error) {
            results[site.name] = {
              success: false,
              error: error.message
            };
          }
        }
    
        return results;
      }
    
      async testSite(siteConfig) {
        const startTime = Date.now();
    
        // Test basic scraping
        const scraper = new WebScraper();
        const data = await scraper.scrapeUrl(siteConfig.url, siteConfig);
    
        // Validate extracted data
        const validation = this.validateExtractedData(data, siteConfig);
    
        // Test performance
        const duration = Date.now() - startTime;
    
        return {
          success: true,
          items_extracted: data.length,
          validation_passed: validation.isValid,
          validation_errors: validation.errors,
          performance_ms: duration,
          data_sample: data.slice(0, 3)
        };
      }
    
      validateExtractedData(data, siteConfig) {
        const errors = [];
        const requiredFields = siteConfig.requiredFields || [];
    
        data.forEach((item, index) => {
          for (const field of requiredFields) {
            // A field is missing when it is absent or its extracted value is empty
            if (!item?.[field] || item[field].value === '') {
              errors.push(`Item ${index}: missing required field: ${field}`);
            }
          }
        });
    
        return {
          isValid: errors.length === 0,
          errors: errors
        };
      }
    }
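
The object returned by `runAllTests()` maps each site name to a result, which can be condensed into a short pass/fail report. A minimal sketch follows; `summarizeResults` is a hypothetical helper, and treating a site as passed only when both `success` and `validation_passed` are true is an assumption about how you want to score a run.

```javascript
// Hypothetical helper: condense the object returned by runAllTests().
function summarizeResults(results) {
  const summary = { total: 0, passed: 0, failed: [], avgDurationMs: null };
  let durationSum = 0;
  let durationCount = 0;

  for (const [name, result] of Object.entries(results)) {
    summary.total += 1;

    // Assumption: a site passes only if scraping succeeded AND validation passed.
    const ok = result.success === true && result.validation_passed === true;
    if (ok) {
      summary.passed += 1;
    } else {
      summary.failed.push(name);
    }

    // Failed scrapes carry no timing, so average only over results that have one.
    if (typeof result.performance_ms === 'number') {
      durationSum += result.performance_ms;
      durationCount += 1;
    }
  }

  if (durationCount > 0) {
    summary.avgDurationMs = Math.round(durationSum / durationCount);
  }
  return summary;
}
```

Usage would look like `const report = summarizeResults(await new ScrapingTester().runAllTests());`, after which `report.failed` lists the sites that need attention.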
    

🔍 Troubleshooting

Common Issues

Blocked by Anti-Scraping

  • IP address blocked
  • CAPTCHA requirements
  • Request rate limiting
  • User agent detection

Content Not Loading

  • JavaScript dependencies
  • AJAX content loading
  • Dynamic content issues
  • Timeouts and delays

Data Extraction Problems

  • Incorrect selectors
  • Page structure changes
  • Missing elements
  • Malformed HTML
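
Rate limiting and timeouts from the lists above are often transient, so one common mitigation is to retry with exponential backoff. The sketch below is illustrative only; `withRetry`, `backoffDelayMs`, and all option names and default values are assumptions, not part of this workflow or of any library.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Delay before the next attempt: base * 2^attempt, capped at maxDelayMs.
function backoffDelayMs(attempt, baseDelayMs, maxDelayMs) {
  return Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
}

// Hypothetical helper: run an async task, retrying on any thrown error.
async function withRetry(task, options = {}) {
  const {
    retries = 3,         // extra attempts after the first one
    baseDelayMs = 1000,
    maxDelayMs = 30000,
    wait = sleep,        // injectable so tests can skip real waiting
  } = options;

  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await task(attempt);
    } catch (error) {
      lastError = error;
      if (attempt === retries) break;

      // Add up to 25% random jitter so parallel workers do not retry in lockstep.
      const delay = backoffDelayMs(attempt, baseDelayMs, maxDelayMs);
      await wait(delay + Math.random() * delay * 0.25);
    }
  }
  throw lastError;
}
```

A call such as `await withRetry(() => scraper.scrapeUrl(url, config), { retries: 4 })` would wrap the scraper used elsewhere in this guide. Retrying does not help with hard blocks such as CAPTCHAs, which need a different response.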

Debug Tools

  1. Detailed Logging and Monitoring
    // Comprehensive logging for debugging
    class ScrapingLogger {
      static logScrapingAttempt(url, config, result) {
        const logEntry = {
          timestamp: new Date().toISOString(),
          url: url,
          config_hash: this.hashConfig(config),
          success: result.success,
          items_extracted: result.items?.length || 0,
          duration_ms: result.duration,
          error: result.error || null,
          performance: result.performance
        };
    
        console.log(JSON.stringify(logEntry));
        this.sendToMonitoring(logEntry);
      }
    
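      static async captureScreenshot(page, url) {
        // `encoding: 'base64'` is a Puppeteer option. Playwright's page.screenshot()
        // returns a Buffer instead, so call .toString('base64') on the result there.
        const screenshot = await page.screenshot({
          fullPage: true,
          encoding: 'base64'
        });
    
        await this.saveScreenshot(url, screenshot);
        return screenshot;
      }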
    }
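
`ScrapingLogger` calls `this.hashConfig(config)` without showing it. One possible implementation, sketched here under the assumption that a short, stable fingerprint of the config is all that is needed, serializes the config with sorted keys and hashes it with Node's built-in `crypto` module. `stableStringify` is a hypothetical helper, and the 12-character truncation is an arbitrary choice.

```javascript
const crypto = require('crypto');

// Serialize with sorted keys so {a, b} and {b, a} produce the same string.
function stableStringify(value) {
  if (Array.isArray(value)) {
    return `[${value.map(stableStringify).join(',')}]`;
  }
  if (value !== null && typeof value === 'object') {
    const body = Object.keys(value)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${stableStringify(value[key])}`)
      .join(',');
    return `{${body}}`;
  }
  // JSON.stringify returns undefined for undefined and functions; map those to 'null'.
  return JSON.stringify(value) ?? 'null';
}

// Short hex fingerprint of a scraping config, suitable for log correlation.
function hashConfig(config) {
  return crypto
    .createHash('sha256')
    .update(stableStringify(config))
    .digest('hex')
    .slice(0, 12);
}
```

Inside the class this could be exposed as `static hashConfig(config) { return hashConfig(config); }`. Because keys are sorted before hashing, two configs that differ only in property order share one hash.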
    

📈 Performance Optimization

Large-Scale Scraping

  1. Distributed Scraping
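    // Coordinate multiple scraping instances
    // Note: PriorityQueue is not a JavaScript built-in; it is assumed to be
    // provided elsewhere in the project or by a library.
    class DistributedScraping {
      constructor() {
        this.instances = new Map();
        this.taskQueue = new PriorityQueue();
      }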
    
      async distributeTasks(urls, config) {
        const chunks = this.chunkArray(urls, this.getOptimalChunkSize());
    
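        for (const chunk of chunks) {
          let availableInstance = this.getAvailableInstance();
          while (!availableInstance) {
            // No free instance: wait (or scale up), then check again
            // so that this chunk is not silently skipped
            await this.waitForInstance();
            availableInstance = this.getAvailableInstance();
          }
          await this.assignTask(availableInstance, chunk, config);
        }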
      }
    
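      getOptimalChunkSize() {
        // Placeholder heuristic. A fuller version would account for:
        // - Instance performance
        // - Target site rate limits
        // - Network conditions
        // Guard against zero registered instances, which would otherwise yield Infinity
        const instanceCount = Math.max(1, this.instances.size);
        return Math.max(1, Math.floor(100 / instanceCount));
      }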
    }
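
`distributeTasks()` relies on a `chunkArray` method that is not shown. A minimal sketch of such a helper, written as a standalone function with an assumed `(items, chunkSize)` signature, could look like this:

```javascript
// Split an array into consecutive chunks of at most chunkSize items.
function chunkArray(items, chunkSize) {
  if (!Number.isInteger(chunkSize) || chunkSize < 1) {
    throw new RangeError('chunkSize must be a positive integer');
  }

  const chunks = [];
  for (let i = 0; i < items.length; i += chunkSize) {
    // slice() clamps at the end of the array, so the last chunk may be shorter
    chunks.push(items.slice(i, i + chunkSize));
  }
  return chunks;
}
```

For example, `chunkArray(['a', 'b', 'c', 'd', 'e'], 2)` returns `[['a', 'b'], ['c', 'd'], ['e']]`. Rejecting non-integer sizes up front avoids an endless or empty loop if the chunk-size calculation ever produces `0`, `NaN`, or `Infinity`.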
    

Related Tutorials:

  • AI-Powered Web Scraping - Advanced AI integration
  • Data Processing - Data handling techniques

Resources:

  • Puppeteer Documentation
  • Cheerio Documentation
  • n8n Browser Automation