# Website Scraping Automation
Learn how to build intelligent web scraping workflows that can extract data from websites, monitor changes, and process information automatically.
## What You'll Build

A comprehensive web scraping system that:

- Extracts structured data from websites
- Handles JavaScript-heavy, dynamic content
- Monitors websites for changes and updates
- Processes and cleans extracted data
- Stores data in databases or spreadsheets
- Works around common anti-scraping protections
## Requirements
- n8n instance with browser automation capabilities
- Puppeteer/Playwright for browser automation
- Data storage (database, Google Sheets, etc.)
- Proxy services (optional for large-scale scraping)
- Basic understanding of HTML and CSS selectors
## Workflow Overview

### Key Components
- **URL Manager** - Manages the scraping queue and URLs
- **Browser Automation** - Controls a headless browser for content extraction
- **Data Extractor** - Parses and extracts structured data
- **Data Processor** - Cleans and transforms extracted data
- **Change Detector** - Identifies changes between scraping runs
- **Storage Manager** - Saves data to various destinations

A minimal sketch of how these components fit together is shown below.
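To make the data flow concrete, here is an orchestration sketch. The class names match the components implemented step by step in this guide; the wiring and config shapes are illustrative, not prescriptive:

```javascript
// Illustrative end-to-end flow; configs here are placeholders
async function runWorkflow(urls, configs) {
  const queue = new ScrapingQueue();                        // URL Manager
  urls.forEach(url => queue.addUrl(url));

  const scraper = new WebScraper();                         // Browser Automation
  const processor = new DataProcessor(configs.processing);  // Data Processor
  const storage = new StorageManager(configs.storage);      // Storage Manager
  const detector = new ChangeDetector(storage);             // Change Detector

  for (const urlData of queue.getNextUrls()) {
    const raw = await scraper.scrapeUrl(urlData.url, configs.site);    // extract
    const processed = await processor.processData(raw, configs.site);  // clean
    const changes = await detector.detectChanges(processed.data, urlData.url);
    await storage.storeData(processed.data, { url: urlData.url, changes });
  }
}
```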
## Step-by-Step Guide

### 1. URL and Site Management
- **URL Queue System**

```javascript
// Manage scraping queue with priorities
class ScrapingQueue {
  constructor() {
    this.urls = new Map();
    this.priorities = { high: 3, medium: 2, low: 1 };
  }

  addUrl(url, options = {}) {
    const urlData = {
      url: url,
      priority: options.priority || 'medium',
      selector: options.selector,
      frequency: options.frequency || 'daily',
      lastScraped: options.lastScraped || null,
      metadata: options.metadata || {},
      retryCount: 0,
      maxRetries: options.maxRetries || 3
    };

    this.urls.set(url, urlData);
    return urlData;
  }

  getNextUrls(limit = 10) {
    const now = new Date();
    const readyUrls = Array.from(this.urls.values())
      .filter(urlData => this.isReadyToScrape(urlData, now))
      .sort((a, b) => this.priorities[b.priority] - this.priorities[a.priority])
      .slice(0, limit);

    return readyUrls;
  }

  isReadyToScrape(urlData, currentTime) {
    if (urlData.retryCount >= urlData.maxRetries) {
      return false;
    }

    if (!urlData.lastScraped) {
      return true;
    }

    const timeSinceLastScrape = currentTime - new Date(urlData.lastScraped);
    const scrapeInterval = this.getScrapeInterval(urlData.frequency);

    return timeSinceLastScrape >= scrapeInterval;
  }

  getScrapeInterval(frequency) {
    const intervals = {
      hourly: 60 * 60 * 1000,
      daily: 24 * 60 * 60 * 1000,
      weekly: 7 * 24 * 60 * 60 * 1000,
      monthly: 30 * 24 * 60 * 60 * 1000
    };

    return intervals[frequency] || intervals.daily;
  }
}
```
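A quick usage sketch of the queue (the URLs and options are placeholders):

```javascript
// Hypothetical usage of ScrapingQueue
const queue = new ScrapingQueue();

queue.addUrl('https://ecommerce.example.com/products', {
  priority: 'high',
  frequency: 'hourly'
});
queue.addUrl('https://news.example.com/tech', { frequency: 'daily' });

// Pull the next batch of URLs that are due for scraping
const batch = queue.getNextUrls(5);
console.log(batch.map(u => `${u.priority}: ${u.url}`));
```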
- **Site Configuration Management**

```javascript
// Site-specific scraping configurations
const siteConfigs = {
  'ecommerce.example.com': {
    selectors: {
      products: '.product-item',
      title: '.product-title',
      price: '.price',
      description: '.product-description',
      availability: '.stock-status',
      images: '.product-image img'
    },
    pagination: {
      nextSelector: '.next-page',
      maxPages: 10
    },
    delays: {
      pageLoad: 3000,
      betweenClicks: 1000
    },
    antiScraping: {
      userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
      viewport: { width: 1920, height: 1080 }
    }
  },
  'news.example.com': {
    selectors: {
      articles: '.article',
      title: '.article-title',
      content: '.article-content',
      author: '.author',
      publishDate: '.publish-date',
      category: '.category'
    },
    pagination: {
      infiniteScroll: true,
      scrollDelay: 2000
    }
  }
};
```
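Configurations can then be resolved per URL by hostname. A minimal helper (an assumption about how configs are keyed, not part of the original setup):

```javascript
// Resolve the site-specific config for a URL by its hostname
function getSiteConfig(url) {
  const hostname = new URL(url).hostname;
  return siteConfigs[hostname] || null;
}
```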
### 2. Browser Automation

- **Advanced Browser Control**
```javascript
// Sophisticated browser automation with Puppeteer
const puppeteer = require('puppeteer');

class WebScraper {
  constructor(config = {}) {
    this.config = {
      headless: config.headless !== false,
      viewport: config.viewport || { width: 1920, height: 1080 },
      userAgent: config.userAgent || 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
      timeout: config.timeout || 30000,
      ...config
    };
  }

  async scrapeUrl(url, siteConfig) {
    const browser = await puppeteer.launch({
      headless: this.config.headless,
      args: [
        '--no-sandbox',
        '--disable-setuid-sandbox',
        '--disable-dev-shm-usage',
        '--disable-accelerated-2d-canvas',
        '--no-first-run',
        '--no-zygote',
        '--disable-gpu'
      ]
    });

    try {
      const page = await browser.newPage();
      await this.setupPage(page, siteConfig);

      // Navigate to URL
      await page.goto(url, {
        waitUntil: 'networkidle2',
        timeout: this.config.timeout
      });

      // Handle anti-bot measures
      await this.handleAntiBotMeasures(page, siteConfig);

      // Wait for dynamic content
      await this.waitForContent(page, siteConfig);

      // Extract data
      const data = await this.extractData(page, siteConfig);

      // Handle pagination if needed
      if (siteConfig.pagination) {
        const additionalData = await this.handlePagination(page, siteConfig);
        data.push(...additionalData);
      }

      return data;
    } finally {
      await browser.close();
    }
  }

  async setupPage(page, siteConfig) {
    // Set user agent and viewport
    await page.setUserAgent(this.config.userAgent);
    await page.setViewport(this.config.viewport);

    // Handle cookies and authentication
    if (siteConfig.cookies) {
      await page.setCookie(...siteConfig.cookies);
    }

    if (siteConfig.authentication) {
      await this.handleAuthentication(page, siteConfig.authentication);
    }

    // Inject custom scripts if needed
    if (siteConfig.customScripts) {
      await page.evaluateOnNewDocument(siteConfig.customScripts);
    }

    // Set up request interception to skip heavy resources
    await page.setRequestInterception(true);
    page.on('request', (request) => {
      // Block unnecessary resources for speed
      const resourceType = request.resourceType();
      if (['image', 'stylesheet', 'font'].includes(resourceType)) {
        request.abort();
      } else {
        request.continue();
      }
    });
  }

  async handleAntiBotMeasures(page, siteConfig) {
    // Random mouse movements
    await this.simulateHumanBehavior(page);

    // Solve CAPTCHAs if needed
    if (siteConfig.antiScraping?.captcha) {
      await this.solveCaptcha(page, siteConfig.antiScraping.captcha);
    }

    // Handle rate limiting
    if (siteConfig.antiScraping?.rateLimit) {
      await this.delay(siteConfig.antiScraping.rateLimit);
    }
  }

  async simulateHumanBehavior(page) {
    // Random mouse movements
    const mouse = page.mouse;
    const viewport = page.viewport();

    for (let i = 0; i < 3; i++) {
      await mouse.move(
        Math.random() * viewport.width,
        Math.random() * viewport.height
      );
      await this.delay(100 + Math.random() * 200);
    }

    // Random scrolling
    await page.evaluate(() => {
      window.scrollBy(0, Math.random() * 200);
    });

    await this.delay(500);
  }

  delay(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
  }

  // waitForContent(), extractData(), handlePagination(), handleAuthentication()
  // and solveCaptcha() are referenced above; extractData() is built out in the
  // Data Extraction step, the rest are left to the reader.
}
```
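Putting the queue, configs, and scraper together, a single scraping pass might look like this (a sketch; `getSiteConfig` is the hypothetical helper shown earlier):

```javascript
// One scraping pass: pull due URLs from the queue and scrape each
async function runScrapingPass(queue) {
  const scraper = new WebScraper({ headless: true });

  for (const urlData of queue.getNextUrls(5)) {
    const siteConfig = getSiteConfig(urlData.url);
    try {
      const items = await scraper.scrapeUrl(urlData.url, siteConfig);
      urlData.lastScraped = new Date().toISOString();
      console.log(`Scraped ${items.length} items from ${urlData.url}`);
    } catch (error) {
      urlData.retryCount += 1;
      console.error(`Failed ${urlData.url}: ${error.message}`);
    }
  }
}
```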
### 3. Data Extraction

- **Smart Data Extraction**
```javascript
// Extract structured data from web pages
class DataExtractor {
  constructor(siteConfig) {
    this.config = siteConfig;
    this.extractors = {
      text: this.extractText.bind(this),
      attribute: this.extractAttribute.bind(this),
      html: this.extractHTML.bind(this),
      table: this.extractTable.bind(this),
      list: this.extractList.bind(this),
      image: this.extractImage.bind(this)
    };
  }

  async extractData(page, config) {
    const results = [];

    // Find all main container elements
    const containers = await page.$$eval(
      config.selectors.container || 'body',
      elements => elements.map(el => el.outerHTML)
    );

    for (const containerHTML of containers) {
      const itemData = await this.extractItemData(page, containerHTML, config);
      if (itemData && Object.keys(itemData).length > 0) {
        results.push(itemData);
      }
    }

    return results;
  }

  async extractItemData(page, containerHTML, config) {
    const itemData = {};

    const tempContent = await page.evaluate((html, selectors) => {
      // Helper defined inside the browser context: class methods
      // are not available within page.evaluate()
      const getElementAttributes = (el) =>
        Object.fromEntries(
          Array.from(el.attributes).map(attr => [attr.name, attr.value])
        );

      const tempDiv = document.createElement('div');
      tempDiv.innerHTML = html;
      document.body.appendChild(tempDiv);

      const data = {};

      // Extract each field based on configuration
      for (const [field, fieldConfig] of Object.entries(selectors)) {
        if (field === 'container') continue;

        try {
          const element = tempDiv.querySelector(fieldConfig);
          if (element) {
            data[field] = {
              text: element.textContent?.trim(),
              html: element.outerHTML,
              attributes: getElementAttributes(element)
            };
          }
        } catch (error) {
          console.warn(`Error extracting ${field}:`, error);
        }
      }

      tempDiv.remove();
      return data;
    }, containerHTML, config.selectors);

    // Process extracted data
    for (const [field, rawData] of Object.entries(tempContent)) {
      itemData[field] = this.processFieldData(field, rawData, config);
    }

    // Add metadata
    itemData.scraped_at = new Date().toISOString();
    itemData.source_url = page.url();

    return itemData;
  }

  processFieldData(fieldName, rawData, config) {
    let processedData = rawData.text;

    // Apply field-specific processing
    if (config.fieldProcessing && config.fieldProcessing[fieldName]) {
      const processing = config.fieldProcessing[fieldName];

      // Clean and transform data
      if (processing.clean) {
        processedData = this.cleanText(processedData, processing.clean);
      }

      // Parse specific formats
      if (processing.parse) {
        processedData = this.parseData(processedData, processing.parse);
      }

      // Validate data
      if (processing.validate) {
        const isValid = this.validateData(processedData, processing.validate);
        if (!isValid) {
          return { error: 'Validation failed', raw: processedData };
        }
      }
    }

    return {
      value: processedData,
      confidence: this.calculateConfidence(rawData),
      metadata: {
        html: rawData.html,
        attributes: rawData.attributes
      }
    };
  }

  cleanText(text, cleaningConfig) {
    if (!text) return '';

    let cleaned = text;

    // Remove extra whitespace
    if (cleaningConfig.whitespace !== false) {
      cleaned = cleaned.replace(/\s+/g, ' ').trim();
    }

    // Remove HTML entities
    if (cleaningConfig.htmlEntities !== false) {
      cleaned = cleaned.replace(/&[a-zA-Z0-9#]+;/g, '');
    }

    // Apply custom regex patterns
    if (cleaningConfig.regex) {
      for (const [pattern, replacement] of Object.entries(cleaningConfig.regex)) {
        cleaned = cleaned.replace(new RegExp(pattern, 'g'), replacement);
      }
    }

    return cleaned;
  }

  parseData(text, parseConfig) {
    switch (parseConfig.type) {
      case 'number':
        return this.parseNumber(text, parseConfig);
      case 'date':
        return this.parseDate(text, parseConfig);
      case 'currency':
        return this.parseCurrency(text, parseConfig);
      case 'url':
        return this.parseURL(text, parseConfig);
      case 'price':
        return this.parsePrice(text, parseConfig);
      default:
        return text;
    }
  }

  parsePrice(text, config) {
    // Extract price from text (handles $, € and £)
    const priceRegex = /[$€£]\s*([0-9,]+\.?[0-9]*)/g;
    const match = text.match(priceRegex);

    if (match) {
      const numericPrice = parseFloat(match[0].replace(/[^\d.]/g, ''));
      return {
        raw: match[0],
        value: numericPrice,
        currency: this.extractCurrency(match[0])
      };
    }

    return null;
  }

  // extractText(), extractAttribute(), extractHTML(), extractTable(),
  // extractList(), extractImage(), parseNumber(), parseDate(), parseCurrency(),
  // parseURL(), validateData(), calculateConfidence() and extractCurrency()
  // are referenced above and assumed to be implemented alongside this class.
}
```
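The `fieldProcessing` options consumed by `processFieldData` are plain config objects. A hypothetical example for the e-commerce site above:

```javascript
// Illustrative fieldProcessing config; patterns and fields are examples
const ecommerceFieldProcessing = {
  fieldProcessing: {
    price: {
      clean: { whitespace: true },
      parse: { type: 'price' }  // routed to parsePrice()
    },
    title: {
      clean: {
        whitespace: true,
        regex: { '\\s*\\|\\s*Example Shop$': '' }  // strip a site-name suffix
      }
    }
  }
};
```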
### 4. Data Processing and Storage
- **Data Processing Pipeline**

```javascript
// Process and clean scraped data
class DataProcessor {
  constructor(config) {
    this.config = config;
    this.transformations = this.loadTransformations();
  }

  async processData(rawData, sourceConfig) {
    let processedData = rawData;

    // Apply transformations
    for (const transformation of this.transformations) {
      processedData = await this.applyTransformation(processedData, transformation);
    }

    // Deduplicate data
    processedData = await this.deduplicateData(processedData, sourceConfig);

    // Validate data quality
    const qualityReport = this.assessDataQuality(processedData);

    return {
      data: processedData,
      quality: qualityReport,
      processed_at: new Date().toISOString()
    };
  }

  async applyTransformation(data, transformation) {
    switch (transformation.type) {
      case 'normalize':
        return this.normalizeData(data, transformation.config);
      case 'enrich':
        return await this.enrichData(data, transformation.config);
      case 'filter':
        return this.filterData(data, transformation.config);
      case 'aggregate':
        return this.aggregateData(data, transformation.config);
      default:
        return data;
    }
  }

  async enrichData(data, config) {
    if (config.geocoding && data.address) {
      data.coordinates = await this.geocodeAddress(data.address);
    }

    if (config.sentiment && data.reviews) {
      data.sentiment_analysis = await this.analyzeSentiment(data.reviews);
    }

    if (config.classification && data.content) {
      data.category = await this.classifyContent(data.content);
    }

    return data;
  }

  async deduplicateData(data, sourceConfig) {
    const seen = new Set();
    const deduplicated = [];

    for (const item of data) {
      // Create unique key based on configurable fields
      const keyFields = sourceConfig.deduplicationKey || ['title', 'price'];
      const key = keyFields.map(field => item[field]).join('|');

      if (!seen.has(key)) {
        seen.add(key);
        deduplicated.push(item);
      }
    }

    return deduplicated;
  }

  assessDataQuality(data) {
    const totalItems = data.length;
    const completeItems = data.filter(item => this.isComplete(item)).length;
    const validItems = data.filter(item => this.isValid(item)).length;

    return {
      total_items: totalItems,
      completeness_rate: (completeItems / totalItems) * 100,
      validity_rate: (validItems / totalItems) * 100,
      average_confidence: this.calculateAverageConfidence(data),
      issues: this.identifyQualityIssues(data)
    };
  }

  // loadTransformations(), normalizeData(), filterData(), aggregateData(),
  // geocodeAddress(), analyzeSentiment(), classifyContent(), isComplete(),
  // isValid(), calculateAverageConfidence() and identifyQualityIssues()
  // are assumed to be implemented elsewhere.
}
```
- **Multi-destination Storage**

```javascript
// Store data to multiple destinations
class StorageManager {
  constructor(config) {
    this.destinations = this.initializeDestinations(config);
  }

  async storeData(data, metadata) {
    const results = [];

    for (const [name, destination] of Object.entries(this.destinations)) {
      try {
        const result = await this.storeToDestination(data, destination, metadata);
        results.push({ destination: name, success: true, result });
      } catch (error) {
        results.push({ destination: name, success: false, error: error.message });
      }
    }

    return results;
  }

  async storeToDestination(data, destination, metadata) {
    switch (destination.type) {
      case 'database':
        return await this.storeToDatabase(data, destination, metadata);
      case 'google_sheets':
        return await this.storeToGoogleSheets(data, destination, metadata);
      case 'json_file':
        return await this.storeToJSONFile(data, destination, metadata);
      case 'api':
        return await this.storeToAPI(data, destination, metadata);
      default:
        throw new Error(`Unknown destination type: ${destination.type}`);
    }
  }

  async storeToGoogleSheets(data, config, metadata) {
    const { GoogleSpreadsheet } = require('google-spreadsheet');

    const doc = new GoogleSpreadsheet(config.spreadsheetId);
    await doc.useServiceAccountAuth({
      client_email: config.clientEmail,
      private_key: config.privateKey
    });

    await doc.loadInfo();
    const sheet = doc.sheetsByIndex[config.sheetIndex || 0];

    // Prepare rows for Google Sheets
    const rows = data.map(item => this.flattenForSheets(item));

    // Add rows in batches to avoid rate limits
    const batchSize = 100;
    for (let i = 0; i < rows.length; i += batchSize) {
      const batch = rows.slice(i, i + batchSize);
      await sheet.addRows(batch);
    }

    return {
      rows_added: rows.length,
      sheet_title: sheet.title,
      spreadsheet_url: doc.url
    };
  }

  flattenForSheets(item) {
    const flattened = {};

    for (const [key, value] of Object.entries(item)) {
      if (typeof value === 'object' && value !== null) {
        // Handle nested objects
        for (const [subKey, subValue] of Object.entries(value)) {
          flattened[`${key}_${subKey}`] = subValue;
        }
      } else {
        flattened[key] = value;
      }
    }

    return flattened;
  }
}
```
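A destinations config for the manager might look like the following; every ID and credential here is a placeholder:

```javascript
// Hypothetical StorageManager configuration; replace placeholders with real values
const storageConfig = {
  primary_db: {
    type: 'database',
    connectionString: process.env.DATABASE_URL
  },
  report_sheet: {
    type: 'google_sheets',
    spreadsheetId: 'your-spreadsheet-id',
    clientEmail: process.env.GOOGLE_CLIENT_EMAIL,
    privateKey: process.env.GOOGLE_PRIVATE_KEY,
    sheetIndex: 0
  },
  backup: {
    type: 'json_file',
    path: './data/scrapes'
  }
};
```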
### 5. Change Detection and Monitoring

- **Intelligent Change Detection**
```javascript
// Detect changes between scraping runs
class ChangeDetector {
  constructor(storage) {
    this.storage = storage;
    this.hashCache = new Map();
  }

  async detectChanges(newData, sourceUrl) {
    const previousData = await this.getPreviousData(sourceUrl);
    const changes = {
      added: [],
      removed: [],
      modified: [],
      unchanged: []
    };

    if (!previousData || previousData.length === 0) {
      changes.added = newData;
      return changes;
    }

    const previousMap = this.createDataMap(previousData);
    const newMap = this.createDataMap(newData);

    // Find new items
    for (const [key, newItem] of newMap.entries()) {
      if (!previousMap.has(key)) {
        changes.added.push(newItem);
      } else {
        const previousItem = previousMap.get(key);
        const changeAnalysis = this.compareItems(previousItem, newItem);

        if (changeAnalysis.hasChanges) {
          changes.modified.push({
            current: newItem,
            previous: previousItem,
            changes: changeAnalysis.changes
          });
        } else {
          changes.unchanged.push(newItem);
        }
      }
    }

    // Find removed items
    for (const [key, previousItem] of previousMap.entries()) {
      if (!newMap.has(key)) {
        changes.removed.push(previousItem);
      }
    }

    return changes;
  }

  compareItems(item1, item2) {
    const changes = {};
    let hasChanges = false;

    for (const [field, value] of Object.entries(item1)) {
      if (JSON.stringify(value) !== JSON.stringify(item2[field])) {
        changes[field] = {
          from: value,
          to: item2[field],
          type: this.getChangeType(value, item2[field])
        };
        hasChanges = true;
      }
    }

    return { hasChanges, changes };
  }

  getChangeType(oldValue, newValue) {
    if (typeof oldValue !== typeof newValue) {
      return 'type_change';
    }

    if (typeof oldValue === 'number') {
      const diff = newValue - oldValue;
      if (diff > 0) return 'increase';
      if (diff < 0) return 'decrease';
      return 'numerical_change';
    }

    if (typeof oldValue === 'string') {
      if (oldValue.length === 0 && newValue.length > 0) return 'added';
      if (oldValue.length > 0 && newValue.length === 0) return 'removed';
      return 'text_change';
    }

    return 'value_change';
  }

  // getPreviousData() is assumed to load the last run's results from storage;
  // createDataMap() is assumed to key items on a stable identifier.
}
```
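A sketch of how change detection can drive alerts, assuming prices are stored as plain numbers and `sendNotification` stands in for your n8n notification node:

```javascript
// Hypothetical: alert on price decreases detected between runs
async function alertOnPriceDrops(detector, newData, sourceUrl) {
  const changes = await detector.detectChanges(newData, sourceUrl);

  for (const mod of changes.modified) {
    const priceChange = mod.changes.price;
    if (priceChange && priceChange.type === 'decrease') {
      await sendNotification(
        `Price drop detected: ${priceChange.from} -> ${priceChange.to}`
      );
    }
  }
}
```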
## Advanced Features

### Anti-Scraping Evasion

- **Proxy Rotation and User Agent Management**
```javascript
// Rotate proxies and user agents
class AntiDetection {
  constructor() {
    this.proxies = this.loadProxyList();
    this.userAgents = this.loadUserAgents();
    this.currentProxyIndex = 0;
    this.currentUserAgentIndex = 0;
  }

  getProxy() {
    const proxy = this.proxies[this.currentProxyIndex];
    this.currentProxyIndex = (this.currentProxyIndex + 1) % this.proxies.length;
    return proxy;
  }

  getUserAgent() {
    const userAgent = this.userAgents[this.currentUserAgentIndex];
    this.currentUserAgentIndex = (this.currentUserAgentIndex + 1) % this.userAgents.length;
    return userAgent;
  }

  async getFingerprint() {
    // Generate realistic browser fingerprint
    return {
      userAgent: this.getUserAgent(),
      viewport: this.getRandomViewport(),
      timezone: this.getRandomTimezone(),
      language: this.getRandomLanguage(),
      platform: this.getRandomPlatform(),
      webgl: this.generateWebGLParams()
    };
  }

  getRandomViewport() {
    const viewports = [
      { width: 1920, height: 1080 },
      { width: 1366, height: 768 },
      { width: 1440, height: 900 },
      { width: 1536, height: 864 }
    ];
    return viewports[Math.floor(Math.random() * viewports.length)];
  }
}
```
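A rotated proxy can be handed to Puppeteer at launch via Chromium's `--proxy-server` flag. A sketch, assuming proxies are stored as `host:port` strings:

```javascript
// Launch a browser through the next proxy in the rotation
// (assumes puppeteer is required as in the WebScraper section)
const antiDetection = new AntiDetection();

async function launchWithProxy() {
  const proxy = antiDetection.getProxy(); // e.g. '203.0.113.10:8080' (placeholder)
  const browser = await puppeteer.launch({
    headless: true,
    args: [`--proxy-server=${proxy}`, '--no-sandbox']
  });

  const page = await browser.newPage();
  await page.setUserAgent(antiDetection.getUserAgent());
  return { browser, page };
}
```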
### AI-Powered Content Understanding

- **Content Analysis with AI**
```javascript
// Use AI to understand and categorize content
const OpenAI = require('openai');

class AIContentAnalyzer {
  constructor() {
    this.openai = new OpenAI({
      apiKey: process.env.OPENAI_API_KEY
    });
  }

  async analyzePageContent(data) {
    const analysis = {};

    // Extract key information
    analysis.summary = await this.generateSummary(data);
    analysis.categories = await this.categorizeContent(data);
    analysis.entities = await this.extractEntities(data);
    analysis.sentiment = await this.analyzeSentiment(data);

    // Identify patterns and insights
    analysis.insights = await this.generateInsights(data);

    return analysis;
  }

  async generateSummary(data) {
    const content = this.concatenateContent(data);

    const prompt = `
      Summarize the following web content in 2-3 sentences:
      ${content.substring(0, 2000)}
    `;

    const response = await this.openai.chat.completions.create({
      model: "gpt-3.5-turbo",
      messages: [{ role: "user", content: prompt }],
      max_tokens: 150
    });

    return response.choices[0].message.content;
  }

  async categorizeContent(data) {
    const content = this.concatenateContent(data);

    const prompt = `
      Categorize the following content into one or more of these categories:
      - E-commerce/Product
      - News/Article
      - Blog
      - Forum/Discussion
      - Documentation
      - Social Media
      - Other

      Content: ${content.substring(0, 1000)}

      Return only the category names.
    `;

    const response = await this.openai.chat.completions.create({
      model: "gpt-3.5-turbo",
      messages: [{ role: "user", content: prompt }],
      max_tokens: 50
    });

    return response.choices[0].message.content.split(',').map(c => c.trim());
  }

  // concatenateContent(), extractEntities(), analyzeSentiment() and
  // generateInsights() are assumed to be implemented alongside this class.
}
```
## Testing and Monitoring

### Comprehensive Testing

- **Scraping Test Suite**
```javascript
// Test scraping functionality
class ScrapingTester {
  constructor() {
    this.testSites = this.loadTestSites();
  }

  async runAllTests() {
    const results = {};

    for (const site of this.testSites) {
      try {
        const result = await this.testSite(site);
        results[site.name] = result;
      } catch (error) {
        results[site.name] = {
          success: false,
          error: error.message
        };
      }
    }

    return results;
  }

  async testSite(siteConfig) {
    const startTime = Date.now();

    // Test basic scraping
    const scraper = new WebScraper();
    const data = await scraper.scrapeUrl(siteConfig.url, siteConfig);

    // Validate extracted data
    const validation = this.validateExtractedData(data, siteConfig);

    // Test performance
    const duration = Date.now() - startTime;

    return {
      success: true,
      items_extracted: data.length,
      validation_passed: validation.isValid,
      validation_errors: validation.errors,
      performance_ms: duration,
      data_sample: data.slice(0, 3)
    };
  }

  validateExtractedData(data, siteConfig) {
    const errors = [];
    const requiredFields = siteConfig.requiredFields || [];

    for (const item of data) {
      for (const field of requiredFields) {
        if (!item[field] || item[field].value === '') {
          errors.push(`Missing required field: ${field}`);
        }
      }
    }

    return {
      isValid: errors.length === 0,
      errors: errors
    };
  }
}
```
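Test sites are plain config objects. A hypothetical entry, showing the `requiredFields` the validator checks:

```javascript
// Example test site definition (illustrative URL and selectors)
const testSites = [
  {
    name: 'ecommerce-smoke-test',
    url: 'https://ecommerce.example.com/products',
    selectors: {
      products: '.product-item',
      title: '.product-title',
      price: '.price'
    },
    requiredFields: ['title', 'price']
  }
];
```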
## Troubleshooting

### Common Issues
**Blocked by Anti-Scraping**

- IP address blocked
- CAPTCHA requirements
- Request rate limiting
- User agent detection

**Content Not Loading**

- JavaScript dependencies
- AJAX content loading
- Dynamic content issues
- Timeouts and delays

**Data Extraction Problems**

- Incorrect selectors
- Page structure changes
- Missing elements
- Malformed HTML
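Many of these failures are transient. A retry wrapper with exponential backoff and jitter (a sketch built on the `WebScraper` class above) resolves a surprising number of them:

```javascript
// Retry a scrape with exponential backoff on transient failures
async function scrapeWithRetry(scraper, url, siteConfig, maxRetries = 3) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await scraper.scrapeUrl(url, siteConfig);
    } catch (error) {
      const isLastAttempt = attempt === maxRetries - 1;
      if (isLastAttempt) throw error;

      // Wait 2s, 4s, 8s... plus jitter so retries look less bot-like
      const backoff = 2000 * 2 ** attempt + Math.random() * 1000;
      console.warn(
        `Attempt ${attempt + 1} failed (${error.message}); retrying in ${Math.round(backoff)}ms`
      );
      await new Promise(resolve => setTimeout(resolve, backoff));
    }
  }
}
```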
### Debug Tools

- **Detailed Logging and Monitoring**
```javascript
// Comprehensive logging for debugging
class ScrapingLogger {
  static logScrapingAttempt(url, config, result) {
    const logEntry = {
      timestamp: new Date().toISOString(),
      url: url,
      config_hash: this.hashConfig(config),
      success: result.success,
      items_extracted: result.items?.length || 0,
      duration_ms: result.duration,
      error: result.error || null,
      performance: result.performance
    };

    console.log(JSON.stringify(logEntry));
    this.sendToMonitoring(logEntry);
  }

  static async captureScreenshot(page, url) {
    const screenshot = await page.screenshot({
      fullPage: true,
      encoding: 'base64'
    });

    await this.saveScreenshot(url, screenshot);
    return screenshot;
  }
}
```
## Performance Optimization

### Large-Scale Scraping

- **Distributed Scraping**
```javascript
// Coordinate multiple scraping instances
class DistributedScraping {
  constructor() {
    this.instances = new Map();
    this.taskQueue = new PriorityQueue(); // assumed external priority-queue implementation
  }

  async distributeTasks(urls, config) {
    const chunks = this.chunkArray(urls, this.getOptimalChunkSize());

    for (const chunk of chunks) {
      const availableInstance = this.getAvailableInstance();
      if (availableInstance) {
        await this.assignTask(availableInstance, chunk, config);
      } else {
        // Wait or scale up
        await this.waitForInstance();
      }
    }
  }

  getOptimalChunkSize() {
    // Calculate optimal chunk size based on:
    // - Instance performance
    // - Target site rate limits
    // - Network conditions
    return Math.max(1, Math.floor(100 / this.instances.size));
  }

  // chunkArray(), getAvailableInstance(), assignTask() and waitForInstance()
  // are assumed to be implemented elsewhere.
}
```
**Related Tutorials:**

- AI-Powered Web Scraping - Advanced AI integration
- Data Processing - Data handling techniques
**Resources:**

- Puppeteer Documentation
- Cheerio Documentation
- n8n Browser Automation