reading-notes

Web Scraping

Notes

Reading Questions

  1. What are the key differences between scraping static and dynamic websites?

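    A static website delivers all of a page's content in the initial HTML response, so a single HTTP request is enough to scrape it. A dynamic website renders content with JavaScript after the page has loaded, so a scraper typically has to run that JavaScript (for example, in a headless browser) before the content exists in the DOM.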
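
    As a rough illustration of that difference, the sketch below parses two made-up HTML strings with Python's standard-library html.parser. STATIC_HTML, DYNAMIC_HTML, TextInId, and extract are hypothetical names invented for this example; no real site is fetched.

```python
from html.parser import HTMLParser


class TextInId(HTMLParser):
    """Collect the text found inside the element with a given id.

    Simplified: assumes every start tag has a matching end tag,
    which holds for the sample strings below.
    """

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # > 0 while we are inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def extract(html, target_id):
    """Return the text chunks inside the element whose id is target_id."""
    parser = TextInId(target_id)
    parser.feed(html)
    return parser.chunks


# Static page: the content is already in the HTML the server sends.
STATIC_HTML = '<html><body><div id="prices"><p>Widget: $5</p></div></body></html>'

# Dynamic page: the container is empty until JavaScript fills it in.
DYNAMIC_HTML = (
    '<html><body><div id="prices"></div>'
    '<script>/* JavaScript fills #prices after load */</script></body></html>'
)

print(extract(STATIC_HTML, "prices"))   # ['Widget: $5']
print(extract(DYNAMIC_HTML, "prices"))  # []
```

    The second result is empty because a plain HTML parser never runs the script, which is why dynamic pages usually need a real browser session.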

  2. Explain at least three techniques or best practices that can be employed to avoid getting blocked while scraping websites.

    • Review the website's robots.txt file and follow its rules
    • Crawl more slowly by adding a delay between requests
    • Use a headless browser through a tool like Playwright
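
    The first two practices above can be sketched with Python's standard-library urllib.robotparser. This is a minimal sketch under stated assumptions: ROBOTS_TXT is a made-up robots.txt, "my-bot" is a hypothetical user agent, and fetch is whatever download function you supply.

```python
import time
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, parsed locally so nothing is fetched.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_text):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def allowed_urls(parser, user_agent, urls):
    """Keep only the URLs that robots.txt lets this user agent fetch."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


def polite_delay(parser, user_agent, default=1.0):
    """Use the site's Crawl-delay if it declares one, else a default."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


def crawl(parser, user_agent, urls, fetch, sleep=time.sleep):
    """Fetch each allowed URL, pausing between requests."""
    delay = polite_delay(parser, user_agent)
    results = {}
    for url in allowed_urls(parser, user_agent, urls):
        results[url] = fetch(url)
        sleep(delay)  # slow down so the site is not hammered
    return results


parser = build_parser(ROBOTS_TXT)
urls = ["https://example.com/public/page", "https://example.com/private/data"]
print(allowed_urls(parser, "my-bot", urls))  # ['https://example.com/public/page']
print(polite_delay(parser, "my-bot"))        # 2.0
```

    Against a live site, RobotFileParser.set_url followed by read would load the real robots.txt instead of the inline string.
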
  3. What is Playwright, and how does it assist in web scraping tasks? Provide an example of a use case where Playwright would be particularly beneficial.

    Playwright is a browser automation library that drives real browsers (Chromium, Firefox, and WebKit), usually in headless mode. It assists in web scraping by providing a browser session that can be scripted to load pages, run their JavaScript, and handle tasks like clicking buttons. It is particularly useful on dynamic pages where buttons must be clicked or forms filled in before the content you want to scrape appears.
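
    A minimal sketch of that use case with Playwright's synchronous Python API follows. It assumes Playwright and its browsers are installed (pip install playwright, then playwright install). The function names, the URL, and the selectors are hypothetical placeholders, not taken from any real site.

```python
def scrape_after_click(page, url, button_selector, content_selector):
    """Open url, click a button, wait for the content, return its text.

    `page` is expected to be a Playwright Page (or any object that
    provides the same four methods).
    """
    page.goto(url)
    page.click(button_selector)
    # Wait until the dynamically rendered element shows up in the DOM.
    page.wait_for_selector(content_selector)
    return page.inner_text(content_selector)


def run(url, button_selector, content_selector):
    """Launch headless Chromium and scrape one page."""
    # Imported here so the helper above can be used without Playwright.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            return scrape_after_click(
                page, url, button_selector, content_selector
            )
        finally:
            browser.close()
```

    A call might look like run("https://example.com", "#load-more", "#results"), where both selectors are invented for illustration.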

  4. Describe the purpose of using XPath in web scraping, and provide an example of an XPath expression to select a specific HTML element from a webpage.

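    XPath provides a standard syntax for selecting specific elements from the DOM with a query expression. This is helpful in web scraping for finding particular content on a page, or for specifying where to interact with a dynamic webpage when using automation. For example, //div[@class="title"]/a selects every a element that is a direct child of a div whose class attribute is exactly "title".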
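
    As a rough sketch of an XPath query being evaluated, the snippet below uses Python's standard-library xml.etree.ElementTree, which supports only a limited subset of XPath and needs well-formed markup. PAGE is a made-up document. Real-world HTML usually calls for a more forgiving parser or a browser automation tool.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample page.
PAGE = """
<html>
  <body>
    <div class="title"><a href="/post/1">First post</a></div>
    <div class="title"><a href="/post/2">Second post</a></div>
    <div class="footer"><a href="/about">About</a></div>
  </body>
</html>
""".strip()

root = ET.fromstring(PAGE)

# ElementTree paths are relative to an element, so the full XPath
# //div[@class="title"]/a is written with a leading "." here.
links = root.findall(".//div[@class='title']/a")

titles = [(a.text, a.get("href")) for a in links]
print(titles)  # [('First post', '/post/1'), ('Second post', '/post/2')]
```

    The footer link is skipped because its parent div does not have the class "title".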

Things I want to know more about

References