Web scraping with Python

Setup instructions

Install the Anaconda Python distribution

If using your own computer, please install the Anaconda Python distribution from https://www.anaconda.com/download/.
(Note that Python version ≤ 3.0 differs considerably from more recent releases. For this workshop you will need version ≥ 3.4.) Accepting the defaults proposed by the Anaconda installer is generally recommended.

Workshop notes

The class notes for this workshop are available on our website at dss.iq.harvard.edu under Workshop Materials => Python Workshop Materials => Python Web Scraping. Click the All workshop materials link to download the workshop materials.

Extract the PythonWebScraping.zip directory
(Right-click => Extract All on Windows, double-click on Mac). Start the Jupyter Notebook application and open the Exercises.ipynb file in the PythonWebScraping folder you downloaded previously. You may also wish to start a new notebook for your own notes.

Workshop goals and approach

In this workshop you will learn basic web scraping principles and techniques, learn how to use the requests package in Python, and practice making requests and manipulating responses from the server.

This workshop is relatively informal, example-oriented, and hands-on. We will learn by working through an example web scraping project.

Note that this is not an introductory workshop. Familiarity with Python, including but not limited to knowledge of lists and dictionaries, indexing, and loops and/or comprehensions, is assumed. If you need an introduction to Python or a refresher, we recommend the IQSS Introduction to Python.

Note also that this workshop will not teach you everything you need to know in order to retrieve data from any web service you might wish to scrape. You can expect to learn just enough to be dangerous.

Preliminary questions

What is web scraping?

Web scraping is the activity of automating the retrieval of information from a web service designed for human interaction.

Is web scraping legal? Is it ethical?

It depends. If you have legal questions, seek legal counsel. You can mitigate some ethical issues by building delays and restrictions into your web scraping program, so that it does not reduce the availability of the web service for other users or raise the cost of hosting the service for the provider.

Example project overview and goals

In this workshop I will demonstrate web scraping techniques using the Collections page at
and let you use the skills you’ll learn to retrieve information from other parts of the Harvard Art Museums website. The basic strategy is much the same for most scraping projects. We will use our web browser (Chrome or Firefox recommended) to examine the page we wish to retrieve data from, and copy/paste information from the browser into our scraping program.
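
To make the earlier point about delays and restrictions concrete, here is a minimal sketch of a throttled download loop. It is an illustration, not part of the workshop code: the function names fetch_politely and fetch_text, the one-second delay, the 50-request cap, and the example.com URLs in the usage comment are all assumptions chosen for this example.

```python
import time


def fetch_text(url):
    """Download one page and return its body as text (needs the requests package)."""
    import requests  # imported here so the throttling logic has no dependencies

    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text


def fetch_politely(urls, fetch=fetch_text, delay_seconds=1.0, max_requests=50,
                   sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between calls and capping the total.

    delay_seconds and max_requests are illustrative defaults, not recommendations
    from any particular website.
    """
    results = []
    for count, url in enumerate(urls):
        if count >= max_requests:
            break  # restriction: stop once the request budget is used up
        if count > 0:
            sleep(delay_seconds)  # delay: pause between consecutive requests
        results.append(fetch(url))
    return results


# Hypothetical usage (placeholder URLs, not the workshop's target site):
# pages = fetch_politely(["https://example.com/a", "https://example.com/b"])
```

Passing fetch and sleep as arguments is a design choice made for this sketch: it lets you try the throttling logic in a notebook with stand-in functions before pointing it at a real server.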