Web scraping is a powerful way to collect the vast amounts of data on the web. Data scientists, data journalists, marketers and others will inevitably face this data dilemma: To scrape or not to scrape?
What is Web Scraping?
Web scraping is a term that covers different methods for collecting information on the web. Generally, an automated piece of software extracts data from a website by scraping its pages.
If a website has no public API, or its API is less complete or less well maintained than the site itself, web scraping may get you more data. Relying on a public API can also be risky for a high-stakes project, because an API can change at any time. Twitter, for example, regularly changes its API and has recently put severe limits on its publicly available data, a move that lets the company privatize and monetize that data.
Some people write their own scraper in Python, fetching a page's HTML and parsing it with a library like BeautifulSoup. Others use an open-source framework such as Scrapy, a ready-made scraping tool such as Octoparse, or a scraping service provider such as Scrapinghub or Datahen. The right option depends on your technical expertise and on the cost, complexity and scope of the project.
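The do-it-yourself route can be sketched in a few lines of Python. This is a minimal illustration, not a production scraper: to stay dependency-free it uses the standard library's html.parser rather than BeautifulSoup, and it parses an inlined sample page instead of fetching one. The sample HTML, the LinkExtractor class and the extract_links helper are all hypothetical names made up for this example.

```python
from html.parser import HTMLParser

# Hypothetical sample page, inlined so the sketch runs without network access.
SAMPLE_HTML = """
<html><body>
  <h1>Listings</h1>
  <ul>
    <li><a href="/listing/1">Sunny loft</a></li>
    <li><a href="/listing/2">Garden studio</a></li>
  </ul>
</body></html>
"""


class LinkExtractor(HTMLParser):
    """Collect (href, link text) pairs from anchor tags that have an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> tag currently open, if any
        self._text = []    # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return every (href, text) pair found in the given HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


print(extract_links(SAMPLE_HTML))
# → [('/listing/1', 'Sunny loft'), ('/listing/2', 'Garden studio')]
```

In a real project the HTML would come from an HTTP request rather than a string, and a library like BeautifulSoup would replace the hand-written parser class with a much shorter query.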
How is Web Scraping Used?
Web scraping is used for a variety of valid reasons: data analysis, customer information acquisition, price comparison and other competitor data, data visualization, market research, academic research, product catalog scraping, and more. Here are a few ways web scraping has been used and reported on:
- Discovering that 58% of New York City Airbnb listings were potentially illegal
- Finding the best-fitting jobs on Indeed
- Looking at Wikipedia pages and their connected pages
- Massive scraping by Uber to look at competitor data
- Publicly releasing dating profile data from OkCupid
Is Web Scraping Unethical?
Those last two examples show where the legal and ethical lines start to blur. Study participants have the right to informed consent. In the OkCupid case, users' data was used without that consent, and their identities were not truly anonymized. Web scraping can also be used for many other nefarious purposes, including plagiarism, spamming and identity theft.
Best Practices for Ethical Web Scraping
Web scraping can be unethical because it is essentially a form of data mining, and many websites do not permit data mining. Besides making sure you are not violating a site's terms of service, here are a few other practices for scraping ethically:
- Do not pry into protected data or sensitive user information
- Contact the organization, or file a Freedom of Information Act (FOIA) request, as needed
- Request data at a reasonable rate, so as not to impair the site’s performance
- Make sure your traffic cannot be mistaken for a DDoS attack
- Only save data you need
- Do not pass off content/data as your own when it’s not
- Be sure not to violate copyright laws
- Identify yourself
- Try to provide value back to the site you are scraping, such as driving traffic by crediting the site in the article/post
- Be transparent and consider sharing/publishing your work
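Two of these practices, requesting data at a reasonable rate and identifying yourself, translate fairly directly into code. The sketch below is one possible approach using only the Python standard library. It also checks a robots.txt file, which the list above does not mention but which is one common way a site signals what it permits. The bot name, contact address, example.org URLs and robots.txt content are all hypothetical, and nothing here actually sends a request.

```python
import time
import urllib.request
import urllib.robotparser

# Hypothetical identity: name your bot and give the site a way to reach you.
USER_AGENT = "example-research-bot/0.1 (contact: you@example.org)"

# Hypothetical robots.txt content, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def load_rules(robots_txt):
    """Parse robots.txt text into a rule set that can be queried."""
    rules = urllib.robotparser.RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


class Throttle:
    """Enforce a minimum number of seconds between consecutive requests."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last = None  # time of the previous request, if any

    def wait(self):
        if self._last is not None:
            remaining = self.min_interval - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


def build_request(url, rules):
    """Return an identified Request, or None if robots.txt disallows the URL."""
    if not rules.can_fetch(USER_AGENT, url):
        return None
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


if __name__ == "__main__":
    rules = load_rules(ROBOTS_TXT)
    # Honor the site's requested delay; fall back to an assumed 1 second.
    throttle = Throttle(rules.crawl_delay(USER_AGENT) or 1)
    for url in ["https://example.org/listings", "https://example.org/private/data"]:
        request = build_request(url, rules)
        if request is None:
            print("skipping (disallowed):", url)
            continue
        throttle.wait()  # sleeps only if the previous request was too recent
        print("would fetch:", request.full_url)
```

Passing the request to urllib.request.urlopen would perform the actual fetch. The one-second fallback delay is an arbitrary assumption, and a suitable rate depends on the site being scraped.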