A funny thing happened when I was playing with Poe, a chatbot aggregator from Quora. I selected the recently released free almost-open-source Llama 2 70B Chat model from Meta and gave it the prompt “Generate a Python program to scrape a website. Include tests for python.org and infoworld.com.” That’s a shorter version of a prompt I used to test CodeWhisperer, Bard, and Copilot X in June. None of those three models generated a working program.
To my utter surprise, Llama 2 70B Chat aced this test:
The bot went on to suggest some follow-up questions:
Tell me more. Can you explain how the BeautifulSoup library works for parsing HTML content? Are there any other libraries or tools commonly used for web scraping in Python? What are some best practices for web scraping to ensure compliance with website terms of service?
The generated code looked good to me, so I copied it into Visual Studio Code, saved it, and ran it. It ran perfectly:
martinheller@Martins-M1-MBP Code % /usr/local/bin/python3 /Users/martinheller/Code/Llama/scrapesite.py
Title: Welcome to Python.org
Link: #content
Link: #python-network
Link: /
Link: https://www.python.org/psf/
Link: https://docs.python.org
Link: https://pypi.org/
Link: /jobs/
Link: /community-landing/
Link: #top
Link: /
Link: https://psfmember.org/civicrm/contribute/transact?reset=1&id=2
Link: #site-map
Link: #
Link: javascript:;
Link: javascript:;
Link: javascript:;
Link: #
Link: https://www.facebook.com/pythonlang?fref=ts
Link: https://twitter.com/ThePSF
Link: /community/irc/
Link: /about/
Link: /about/apps/
Link: /about/quotes/
Link: /about/gettingstarted/
Link: /about/help/
Link: http://brochure.getpython.info/
Link: /downloads/
Link: /downloads/
Link: /downloads/source/
Link: /downloads/windows/
Link: /downloads/macos/
Link: /download/other/
Link: https://docs.python.org/3/license.html
Link: /download/alternatives
Link: /doc/
…
Comparing the Llama-generated code with the CodeWhisperer-generated code, the major difference is that Llama used the html.parser parser for Beautiful Soup, which worked, while CodeWhisperer used the lxml parser, which choked.
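Beautiful Soup delegates the actual parsing to a pluggable backend: html.parser is Python's built-in, lenient pure-Python parser, while lxml is a faster C-based third-party parser that can be stricter about malformed markup. As a rough illustration of what the built-in engine does under the hood, here is a minimal link extractor built directly on the standard library's html.parser module, so it runs without Beautiful Soup installed (the LinkExtractor class and the sample snippet are my own, not from either model's output):

```python
from html.parser import HTMLParser

# Minimal link extractor built on Python's lenient built-in parser --
# the same engine Beautiful Soup uses when you pass it 'html.parser'.
class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Collect the href attribute of every <a> tag we encounter
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)

extractor = LinkExtractor()
# A deliberately sloppy snippet: html.parser tolerates the unclosed tags
# that a stricter parser might handle differently.
extractor.feed('<ul><li><a href="/downloads/">Downloads<li><a href="/doc/">Docs</ul>')
print(extractor.links)  # ['/downloads/', '/doc/']
```

The tolerance for broken markup is often exactly why html.parser "just works" on real-world pages where a stricter backend chokes.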
Llama 2 code explanation
I also asked Llama 2 70B Chat to explain the same sample program I had given to CodeWhisperer, Bard, and Copilot X. CodeWhisperer doesn’t currently have a chat window, so it doesn’t do code explanations, but Bard did a great job on this task and Copilot X did a good job.
Llama’s explanation (shown above) is as good, or possibly better, than what Bard generated. I don’t completely understand why Llama stopped in item 12, but I suspect that it may have hit a token limit, unless I accidentally hit the “stop” button in Poe and didn’t notice.
For more about Llama 2 in general, including discussion of its potential copyright violations and whether it’s open source or not, see “What is Llama 2? Meta’s large language model explained.”
Coding with Code Llama
A couple of days after I finished working with Llama 2, Meta AI released several Code Llama models. A few days after that, at Google Cloud Next 2023, Google announced that they were hosting Code Llama models (among many others) in the new Vertex AI Model Garden. Additionally, Perplexity made one of the Code Llama models available online, along with three sizes of Llama 2 Chat.
So there were several ways to run Code Llama at the time I was writing this article. It’s likely that there will be several more, and several code editor integrations, in the coming months.
Poe didn’t host any Code Llama models when I first tried it, but during the course of writing this article Quora added Code Llama 7B, 13B, and 34B to Poe’s repertoire. Unfortunately, all three models gave me the dreaded “Unable to reach Poe” error message, which I interpret to mean that the model’s endpoint is busy or not yet connected. The following day, Poe updated, and running the Code Llama 34B model worked:
As you can see from the screenshot, Code Llama 34B went one better than Llama 2 and generated programs using both Beautiful Soup and Scrapy.
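For readers who haven't used Scrapy, a minimal spider for the same link-scraping task might look like the sketch below. This is my own illustration, not Code Llama 34B's output, and it guards the import because Scrapy is a third-party package that may not be installed:

```python
import importlib.util

# Scrapy is a third-party framework, so only define the spider if it's installed.
if importlib.util.find_spec("scrapy") is None:
    print("scrapy is not installed; `pip install scrapy` to try this spider")
else:
    import scrapy

    class LinkSpider(scrapy.Spider):
        """Yields every href found on the two pages the article uses as tests."""
        name = "links"
        start_urls = ["https://www.python.org", "https://www.infoworld.com"]

        def parse(self, response):
            for href in response.css("a::attr(href)").getall():
                yield {"link": href}

    # Run from a shell with: scrapy runspider thisfile.py -o links.json
    print("spider defined; run it with `scrapy runspider`")
```

Unlike the Beautiful Soup version, a Scrapy spider handles request scheduling, retries, and output serialization for you, which is why models often suggest it for larger crawls.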
Perplexity is a website that hosts a Code Llama model, as well as several other generative AI models from various companies. I tried the Code Llama 34B Instruct model, optimized for multi-turn code generation, on the Python code-generation task for website scraping:
As far as it went, this wasn’t a bad response. I know that the requests.get() method and bs4 with the html.parser engine work for the two sites I suggested for tests, and finding all the links and printing their href attributes is a good start on processing. A very quick code inspection suggested something obvious was missing, however:
Now this looks more like a command-line utility, but different functionality is now missing. I would have preferred a functional form, but I said “program” rather than “function” when I made the request, so I’ll give the model a pass. On the other hand, the program as it stands will raise NameError for the undefined functions when it runs; Python doesn’t catch missing names until the calls actually execute.
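Python doesn’t flag undefined names at compile time; a script like this compiles cleanly and only fails when the missing function is actually called. A minimal illustration (scrape_page is a hypothetical, deliberately undefined name, standing in for the helpers the model forgot to generate):

```python
# Python compiles this module without complaint; the undefined name is only
# looked up when the call executes, raising NameError at run time.
def main():
    scrape_page("https://www.python.org")  # scrape_page is never defined anywhere

try:
    main()
except NameError as err:
    print(err)  # name 'scrape_page' is not defined
```

This is why a generated "program" can look complete, pass a syntax check, and still blow up on its first run.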
Returning JSON wasn’t really what I had in mind, but for the purposes of testing the model I’ve probably gone far enough.
Llama 2 and Code Llama on Google Cloud
At Google Cloud Next 2023, Google Cloud announced that new additions to Google Cloud Vertex AI’s Model Garden include Llama 2 and Code Llama from Meta, and published a Colab Enterprise notebook that lets you deploy pre-trained Code Llama models using vLLM for the best available serving throughput.
If you need to use a Llama 2 or Code Llama model for less than a day, you can do so for free on Colab, and even run it on a GPU. If you know how, it’s easy; if you don’t, search for “run code llama on colab” and you’ll see a full page of explanations, including lots of YouTube videos and blog posts on the subject. Note that Colab is free but time-limited and resource-limited, while Colab Enterprise costs money but isn’t limited.
If you want to create a website for running LLMs, you can use the same vLLM library used in the Google Cloud Colab notebook to set up an API. Ideally, you’ll set it up on a server with a GPU big enough to hold the model you want to use, but that isn’t totally necessary: You can get by with something like an M1 or M2 Macintosh as long as it has enough RAM to run your model. You can also use LangChain for this, at the cost of writing or copying a few lines of code.
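As a hedged sketch of what the vLLM route looks like in code: the snippet below guards the import (vLLM needs a GPU machine and downloads the model on first use), and the CodeLlama checkpoint name is my assumption — substitute whichever model you actually deploy:

```python
import importlib.util

# vLLM requires a GPU and a downloaded model, so guard the import; the model
# name below is an assumption -- substitute the checkpoint you intend to serve.
if importlib.util.find_spec("vllm") is None:
    print("vllm is not installed; `pip install vllm` on a GPU machine to try this")
else:
    from vllm import LLM, SamplingParams

    # Load the model once; vLLM batches incoming prompts for throughput.
    llm = LLM(model="codellama/CodeLlama-7b-Instruct-hf")
    params = SamplingParams(temperature=0.2, max_tokens=256)
    outputs = llm.generate(
        ["Write a Python function that reverses a string."], params
    )
    print(outputs[0].outputs[0].text)
```

For a web-facing service you would typically run vLLM's OpenAI-compatible API server instead of calling the library directly, and put your site's front end in front of that endpoint.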
Running Llama 2 with Ollama
If you are using an Arm-based Macintosh as your workstation, you can run Llama models locally as a command-line utility. The invaluable Sharon Machlis explains how to use Ollama; it’s easy, although if you don’t have enough RAM for the model it’ll use virtual memory (i.e. SSD or, heaven forfend, spinning disk) and run really slowly. (Linux and Windows support is planned for Ollama.)
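The basic Ollama workflow is a single command; this sketch assumes the Ollama CLI is installed on your Mac, and guards for machines without it (the prompt text is just an example):

```shell
# Assumes the Ollama CLI is installed; guard so this degrades gracefully.
if command -v ollama >/dev/null 2>&1; then
  # One-shot generation from the command line; the model is downloaded
  # automatically the first time it's requested.
  ollama run llama2 "Explain Python list comprehensions in one sentence."
else
  echo "ollama is not installed on this machine"
fi
```

Running `ollama run llama2` with no prompt argument instead drops you into the interactive `>>>` session shown in the transcripts below.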
I tried out Ollama with several models (of the many it supports) on my M1 MacBook Pro, which unfortunately has only 8GB of RAM. I started with my standard Python web-scraping code generation task using Llama 2, apparently one of the smaller models (7B?). The result is similar to what I got from the Llama 2 70B model running on Poe, although not as well-structured. Note that Ollama only downloads the model the first time it needs it.
martinheller@Martins-M1-MBP ~ % ollama run llama2
pulling manifest
pulling 8daa9615cce3... 100% |███████████████| (3.8/3.8 GB, 51 MB/s)
pulling 8c17c2ebb0ea... 100% |██████████████████| (7.0/7.0 kB, 39 MB/s)
pulling 7c23fb36d801... 100% |█████████████████| (4.8/4.8 kB, 5.0 MB/s)
pulling 578a2e81f706... 100% |████████████████████| (95/95 B, 129 kB/s)
pulling e35ab70a78c7... 100% |████████████████████| (90/90 B, 550 kB/s)
pulling 96862bb35d77... 100% |██████████████████| (447/447 B, 2.8 MB/s)
verifying sha256 digest
writing manifest
success
>>> Generate a Python program to scrape a website. Include tests for python.org and infoworld.com

Sure! Here is an example of how you can use the `requests` and `BeautifulSoup` libraries in Python to scrape a website:

```
import requests
from bs4 import BeautifulSoup

# Example website to scrape
url = "https://www.python.org"

# Send a request to the website and get the HTML response
response = requests.get(url)

# Parse the HTML content using Beautiful Soup
soup = BeautifulSoup(response.content, 'html.parser')

# Find the specific information you want to scrape on the website
title = soup.find('h1', class_='main-title').text
print(title) # Output: "Python.org"

# Find all the links on the page
links = soup.find_all('a')
for link in links:
    print(link.get('href')) # Output: ["/python", "/docs", "/learn", etc.]

# Test the code with different websites
url2 = "https://www.infoworld.com"
response2 = requests.get(url2)
soup2 = BeautifulSoup(response2.content, 'html.parser')
title2 = soup2.find('h1', class_='main-title').text
print(title2) # Output: "Infoworld"

# Find all the links on the page
links2 = soup2.find_all('a')
for link in links2:
    print(link.get('href')) # Output: ["/news", "/features", "/opinions", etc.]
```

This code sends a request to the specified website, parses the HTML content using Beautiful Soup, and then finds specific information on the page (in this case, the title and all the links).
You can modify the `url` variable at the top of the script to specify the website you want to scrape. For example, if you want to scrape the Python documentation, you can set `url = "https://www.python.org/doc/` and then use the same code as before to find the title and links on the page. To test the code with different websites, you can modify the `url2` variable at the top of the script to specify a different website. For example, you can set `url2 = "https://www.infoworld.com"` to scrape that website, or any other website you want to test. Note: Make sure you have the necessary permissions and legal rights to scrape a website. Web scraping can be illegal in some cases, so be sure to check the terms of service for the website you're scraping and respect their rules.
With that baseline established, I tried the same prompt using Code Llama. Again, I didn’t specify the model size, but it looks like it is 7B.
martinheller@Martins-M1-MBP ~ % ollama run codellama
pulling manifest
pulling 8268242df6f0... 100% |████████████████████████████████████████████| (3.8/3.8 GB, 50 MB/s)
pulling 2c8743bdc4ad... 100% |███████████████████████████████████████████████| (7.0/7.0 kB, 40 MB/s)
pulling 38fa20ee7daa... 100% |███████████████████████████████████████████████| (4.8/4.8 kB, 28 MB/s)
pulling 578a2e81f706... 100% |█████████████████████████████████████████████████| (95/95 B, 993 kB/s)
pulling 404e21afdc6a... 100% |█████████████████████████████████████████████████| (30/30 B, 187 kB/s)
pulling 86b68a7bb45d... 100% |███████████████████████████████████████████████| (507/507 B, 4.0 MB/s)
verifying sha256 digest
writing manifest
success
>>> Generate a Python program to scrape a website. Include tests for python.org and infoworld.com

To write a Python program that scrapes a website, you will need to use a library such as BeautifulSoup or Scrapy. These libraries allow you to navigate and parse the HTML structure of a webpage, and extract specific data from it. Here is an example of how you might use BeautifulSoup to scrape the Python website at python.org:

```
import requests
from bs4 import BeautifulSoup

# Send a request to the Python website
url = 'https://www.python.org'
response = requests.get(url)

# Parse the HTML content using Beautiful Soup
soup = BeautifulSoup(response.content, 'html.parser')

# Find all links on the page
links = soup.find_all('a')

# Print each link to the console
for link in links:
    print(link.get('href'))
```

To write tests for this program, you can use a testing framework such as unittest or pytest. These frameworks allow you to define test functions that can be run automatically to verify that your code is working correctly.
Here is an example of how you might write tests using the unittest module:

```
import unittest
from bs4 import BeautifulSoup

class TestPythonOrg(unittest.TestCase):
    def setUp(self):
        # Send a request to the Python website
        url = 'https://www.python.org'
        response = requests.get(url)

        # Parse the HTML content using Beautiful Soup
        soup = BeautifulSoup(response.content, 'html.parser')
        self.soup = soup

    def test_title(self):
        title = self.soup.find('title').text
        self.assertEqual(title, 'Python Documentation - Python v3.10')

    def test_description(self):
        description = self.soup.find('meta', {'name': 'description'})['content']
        self.assertIn('The official homepage for the Python Programming Language', description)

    def tearDown(self):
        pass

if __name__ == '__main__':
    unittest.main()
```

This test program sends a request to the Python website, parses the HTML content using Beautiful Soup, and then runs two separate tests: one that verifies that the page title is correct, and another that verifies that the page description contains certain keywords. You can run this test program by saving it to a file (e.g. `test_python_org.py`), opening a terminal or command prompt, and then running the following command:

```
python test_python_org.py
```

This will run all of the test functions in the program, and print a message indicating whether each test passed or failed.
This model took my request for tests a little more seriously, and demonstrated how to use the Python unittest module. Note that the generated test file wouldn’t run as written, though: it never imports requests, and its hard-coded title and description assertions don’t match the live python.org home page.
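Tests that fetch live pages are fragile by nature. A more repeatable pattern, sketched below with only the standard library (the parser class and sample HTML are my own, not the model's output), is to run the parsing logic against a fixed HTML snippet:

```python
import unittest
from html.parser import HTMLParser

# Offline variant of the model's tests: parse a fixed snippet with the
# standard library instead of fetching live pages, so results are repeatable.
SAMPLE = ('<html><head><title>Welcome to Python.org</title></head>'
          '<body><a href="/downloads/">Downloads</a><a href="/doc/">Docs</a></body></html>')

class TitleAndLinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            self.links.extend(v for k, v in attrs if k == "href")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

class TestParsing(unittest.TestCase):
    def setUp(self):
        self.parser = TitleAndLinkParser()
        self.parser.feed(SAMPLE)

    def test_title(self):
        self.assertEqual(self.parser.title, "Welcome to Python.org")

    def test_links(self):
        self.assertEqual(self.parser.links, ["/downloads/", "/doc/"])

suite = unittest.TestLoader().loadTestsFromTestCase(TestParsing)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print("all tests passed:", result.wasSuccessful())  # all tests passed: True
```

Testing against a fixture like this exercises the scraping logic without depending on a site's current content, which is what made the model's hard-coded assertions brittle.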
I wondered how different the output of the Python-tuned Code Llama 7B model would be:
martinheller@Martins-M1-MBP ~ % ollama run codellama:7b-python
pulling manifest
pulling 1dd2448f61e7... 100% |████████████████████████████████████████████| (3.8/3.8 GB, 48 MB/s)
pulling 2c8743bdc4ad... 100% |███████████████████████████████████████████████| (7.0/7.0 kB, 52 MB/s)
pulling 38fa20ee7daa... 100% |███████████████████████████████████████████████| (4.8/4.8 kB, 64 MB/s)
pulling b507b9c2f6ca... 100% |██████████████████████████████████████████████████| (13/13 B, 60 kB/s)
pulling 404e21afdc6a... 100% |█████████████████████████████████████████████████| (30/30 B, 240 kB/s)
pulling b413591f8542... 100% |███████████████████████████████████████████████| (507/507 B, 3.2 MB/s)
verifying sha256 digest
writing manifest
success
>>> Generate a Python program to scrape a website. Include tests for python.org and infoworld.com
"""

import urllib.request, html.parser

class Website(object):
    """
    Class used to store the data scraped from the website

    :param str title: Title of article/page
    :param str author: Author of article/page
    :param str date: Date that the article was published on
    :param str content: Body text of article
    """

    def __init__(self, title=None, author=None, date=None, content=None):
        self.title = title #title of post/article
        self.author = author #author of the post or article
        self.date = date #date that the article was published on
        self.content = content #the body text of the post/article

class Article(object):
    """
    Class used to store data scraped from an infowars article

    :param str title: Title of article/page
    :param str author: Author of article/page
    :param str date: Date that the article was published on
    :param str content: Body text of article
    """

    def __init__(self, title=None, author=None, date=None, content=None):
        self.title = title #title of post/article
        self.author = author #author of the post or article
        self.date = date #date that the article was published on
        self.content = content #the body text of the post/article

def scrape(website_url, htmlparser=html.parser):
    """
    Function used to gather all the data from a website and return it in a form for use later

    :param str url: The URL of the website you are scraping from
    :param object parser: Defaults to the built-in html.parser but can be changed depending on the website's code style
    """
    #create website object and assign it to a variable called 'website'
    website = Website()
    #open url using urlopen function from urllib.request module
    with urllib.request.urlopen(website_url) as url:
        #use read function to get the contents of the page and store it in a variable called 'content'
        content = url.read()
    #use decode_content=True argument for html parsing module to be able to parse the url content
    parsed_content = htmlparser(markup=content, convert_charrefs=True, decode_entities=False, decode_content=True)
    print("\n")
    #search through content of webpage and find all instances where a tag is used to bold/italicize text in the article body text
    title = parsed_content.find("h1", {"class": "entry-title"})
    author = parsed_content.find("div", {"class": "author-info"}).text
    date = parsed_content.find("span", {"class": "updated"}).text
    content = parsed_content.findAll("p")[3].text + '\n' + parsed_content.findAll("p")[4].text + '\n' + parsed_content.findAll("p")[5].text
    #store all the data gathered from the website into variables in the 'website' object created earlier
    website.title = title.text
    website.author = author
    website.date = date
    website.content = content
    return website
All of a sudden, we see Python classes and a function that uses them. This run forgot to generate the unit tests. Let’s try the same prompt and model again and see if we get different output.
martinheller@Martins-M1-MBP ~ % ollama run codellama:7b-python
>>> Generate a Python program to scrape a website. Include tests for python.org and infoworld.com
.
#Python program to scrap data from website#

import requests
from bs4 import BeautifulSoup as soup

url_1 = 'http://python.org/'
reqs_1 = requests.get(url_1)
url_2 = 'https://www.infoworld.com/article/3067955/programming-language/python-the-best-programming-language.html'
reqs_2 = requests.get(url_2)

#HTML parsing from website python.org#
page_1 = reqs_1.text
#HTML parsing from website infoworld.com#
page_2 = reqs_2.text

#BeautifulSoup object for website python.org#
soup_1 = soup(page_1, 'html.parser')
#BeautifulSoup object for website infoworld.com#
soup_2 = soup(page_2,'html.parser')

def scrape():
    #Dictionary to store data from both websites python.org and infoworld.com#
    website_data = {}
    #Scraping all the text from headings of a website#
    headlines = soup_1.find_all('div',{'class':'headline'})
    for line in headlines:
        data = line.text
        if "Python" in data or 'Programming' in data:
            website_data['headlines'] = [data]
    #Scraping all the text from paragraphs of a website#
    paragrphs = soup_1.find_all('p')
    for line in paragrphs:
        data = line.text
        if "Python" in data or 'Programming' in data:
            website_data['paras'] = [data]
    #Scraping all the text from headings of a website#
    headlines2 = soup_2.find_all('p')
    for line in headlines2:
        data = line.text
        if "Python" in data or 'Programming' in data:
            website_data['headlines'] = [data]
    #Scraping all the text from paragraphs of a website#
    paragrphs2 = soup_1.find_all('p')
    for line in paragrphs2:
        data = line.text
        if "Python" in data or 'Programming' in data:
            website_data['paras'] = [data]
    #Return the dictionary of all text scraped from both websites python.org and infoworld.com#
    return website_data
This time the model generated code in a functional style instead of an object-oriented style. It did something ugly, though: Instead of using the home page of infoworld.com for its second test, it used the URL of an article about the Python programming language. Alas, that page does not currently exist, so we either bumped up against stale content in the model’s training data, or the model hallucinated the URL.
Two Llamas for coding
As you’ve seen, Llama 2 Chat can generate and explain Python code quite well, right out of the box. There’s no need to fine-tune it further on code-generation tasks—although Meta has done exactly that for Code Llama.
Llama 2 Chat is not without controversy, however. Meta says that it’s open source, but the OSI begs to disagree, on two counts. Meta says that it’s more ethical and safer than other LLMs, but a class action lawsuit from three authors says that its training has violated their copyrights.
It’s nice that Llama 2 Chat works so well. It’s troubling that to train it to work well Meta may have violated copyrights. Perhaps, sooner rather than later, someone will find a way to train generative AIs to be effective without triggering legal problems.
Code Llama’s nine fine-tuned models offer additional capabilities for code generation, and the Python-specific versions seem to know something about Python classes and testing modules as well as about functional Python.
When the bigger Code Llama models are more widely available online running on GPUs, it will be interesting to see how they stack up against Llama 2 70B Chat. It will also be interesting to see how well the smaller Code Llama models perform for code completion when integrated with Visual Studio Code or another code editor.