Go beyond Twitter API with proxycrawl
Photo by Pete Willis on Unsplash https://unsplash.com/@peetwillis


Each API has its limitations, whether it's a rate limit (how many times we can call it within a given time window) or simply missing functionality. Luckily, we can work around this with web scraping.

Web Scraping

You can write your own spiders with the well-known framework Scrapy, and there are cloud providers which take care of the infrastructure: you just upload your spider logic and that's it.

Scraping is not always an easy task, although things like https://schema.org/ help: schema.org is an initiative to give structure to web pages so they can be more easily scraped by spiders.

Proxycrawl

Proxycrawl is a service that does the scraping on your behalf, so you don't need to worry about it.

Scrape while being anonymous and bypass any restriction, blocks or captchas.

source: https://proxycrawl.com/

Due to the many restrictions on pages like LinkedIn, Twitter or Facebook, scraping may take a lot of time. That's why I really like the crawler feature in Proxycrawl, which allows you to queue your requests and wait for the responses on a webhook endpoint you register with Proxycrawl.

Webhooks are a great idea because they let you decouple your application logic. For example, you can run a Python script which queues requests to Proxycrawl and then have a small microservice responsible only for handling the incoming requests from Proxycrawl. With this approach you can easily scale if you need to increase the amount of scraped data.
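To illustrate the receiving side, here is a minimal sketch of such a webhook endpoint using only Python's standard library. The payload layout (scraped HTML posted as the request body) is an assumption for the demo; check Proxycrawl's webhook documentation for the exact format they send.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Scraped pages received so far; a real service would push these
# onto a queue or into a database instead of an in-memory list.
received_pages = []

class WebhookHandler(BaseHTTPRequestHandler):
    """Minimal webhook receiver: we assume the crawler POSTs the
    scraped HTML as the request body, so we just read and store it."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length).decode("utf-8", errors="replace")
        received_pages.append(body)
        self.send_response(200)  # acknowledge so the sender does not retry
        self.end_headers()

    def log_message(self, fmt, *args):
        pass  # keep the demo quiet

def serve_once(port):
    # Handle a single incoming request, then shut down (enough for a demo).
    server = HTTPServer(("127.0.0.1", port), WebhookHandler)
    server.handle_request()
    server.server_close()
```

A production version would also validate that the request really comes from the crawler before storing anything.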

Twitter suggestions page

We will try to scrape a Twitter suggestions page like this one: https://twitter.com/i/connect_people?user_id=1292861704339169283. There is more about Twitter on Proxycrawl here: https://proxycrawl.com/how-to-scrape-twitter. We are using the Scraper API instead of the raw Proxycrawl API because the scraper returns already-structured data, but it does not work for all pages.

You can always use the raw API and parse the HTML on your own, but a web page is an ever-changing structure. If you depend on scraped data daily, it is better to trust Proxycrawl and let them take care of parsing the data and extracting information from it.

import requests
from urllib.parse import quote_plus

# The target URL must be fully URL-encoded before being passed as a query parameter
twitter_page = quote_plus("https://twitter.com/i/connect_people?user_id=1292861704339169283")
response = requests.get(f"https://api.proxycrawl.com/scraper?token=NORMAL_TOKEN&url={twitter_page}")

I see that Proxycrawl has an online demo; if you try to scrape our Twitter page with it, Proxycrawl will fail because we haven't selected any valid scraper. Unfortunately, the scraper that could help us has not been created yet; for Twitter there is only twitter-tweet.

Let's try to use the Proxycrawl default API.

In Proxycrawl you have two tokens:

You have two tokens; one for normal requests and another one for javascript requests (real browsers).

source: https://proxycrawl.com/

I have never used the JavaScript token, but it seems that it tries to act like a real browser, which could be useful in cases like Google Trends. The Twitter suggestions page has lazy-loading behaviour, and I wonder if Proxycrawl knows how to handle that: will the headless browser scroll to the bottom of the list of suggestions?

Use the javascript token when the content you need to crawl is generated via javascript, either because it's a javascript built page (React, Angular, etc.) or because the content is dynamically generated on the browser.

source: https://proxycrawl.com/docs/crawling-api/headless-browsers/#headless-browsers-javascript-rendering

import requests
from urllib.parse import quote_plus

# Same request, but through the default (crawling) API with the JavaScript token
twitter_page = quote_plus("https://twitter.com/i/connect_people?user_id=1292861704339169283")
response = requests.get(f"https://api.proxycrawl.com/?token=JAVASCRIPT_TOKEN&url={twitter_page}")

As a result we are forced to log in, so there is no data we can scrape.
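A quick way to notice failures like this programmatically is to check both statuses before trusting the response body. This is a sketch based on my understanding that Proxycrawl reports the outcome of the actual crawl in a `pc_status` response header, while `status_code` only tells you the API call itself went through; verify the exact header name against their docs.

```python
def scrape_succeeded(response):
    """Heuristic success check for a Proxycrawl response.

    Assumption: the crawl result is exposed via the `pc_status`
    response header, separate from the HTTP status of the API call.
    """
    return response.status_code == 200 and response.headers.get("pc_status") == "200"
```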

Demo

Let's try to scrape something different, just to find out if Proxycrawl is working: my personal page on Twitter.

import requests
from urllib.parse import quote_plus

# This time the target is a public profile page (put the actual handle here)
twitter_page = quote_plus("https://twitter.com/YOUR_HANDLE")
response = requests.get(f"https://api.proxycrawl.com/?token=JAVASCRIPT_TOKEN&url={twitter_page}")

As a result you get valid HTML, as the personal page doesn't need authentication unless it is protected. You can extract the avatar https://pbs.twimg.com/profile_images/1292871249312600064/eCqrwnVS_400x400.jpg, the number of accounts followed (27), and the number of tweets (8).

In the HTML you can search for the data-model attribute, which refers to repeatable structures on the page (read more on schema.org). You can get some information about trends and topics, but I am not sure if these are personalized; maybe they are only suggested based on your geolocation.
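As a sketch of that kind of extraction using only the standard library's HTML parser (the sample markup in the test is made up; real Twitter markup will differ):

```python
from html.parser import HTMLParser

class DataModelScanner(HTMLParser):
    """Collect the value of every `data-model` attribute encountered."""

    def __init__(self):
        super().__init__()
        self.models = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag
        for name, value in attrs:
            if name == "data-model":
                self.models.append(value)

def find_data_models(html):
    scanner = DataModelScanner()
    scanner.feed(html)
    return scanner.models
```

For heavier parsing you would typically reach for a library like Beautiful Soup, but for a single attribute the stdlib is enough.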

Summary

For research purposes you could try to create a bot which authenticates with Twitter, but just so you know: scraping pages as a bot pretending to be a real user is not allowed by Twitter's policy. In this case we haven't solved the problem, but I must say that Proxycrawl is a really great tool; it helped me a lot to scrape LinkedIn profiles, which was done like this:

(architecture diagram)

For this task we used FastAPI, which handled the incoming data from Proxycrawl via a webhook; we scale it on demand when more data comes in from Proxycrawl. But there is one problem with Google search: even Proxycrawl sets rate limits on some pages, so it is difficult to scale this part. In our case it is a never-ending process which tries to discover new pages based on our search criteria, using Google search operators.
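Those search queries are just strings built from operators like site: and intitle:. A tiny hypothetical helper (not part of any Proxycrawl API) shows the idea:

```python
def build_search_query(keywords, site=None, intitle=None):
    """Compose a Google search query using search operators.

    `site:` restricts results to one domain, `intitle:` requires a
    word in the page title. Hypothetical helper for illustration.
    """
    parts = list(keywords)
    if site:
        parts.append(f"site:{site}")
    if intitle:
        parts.append(f'intitle:"{intitle}"')
    return " ".join(parts)
```

The resulting string is what you would feed into the search URL (after URL-encoding) as the query.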

I hadn't thought about it before, but we could rotate between different search engines; a nice side effect is that we could get more varied results, and some engines like duckduckgo.com offer a free API for search.
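DuckDuckGo's Instant Answer API, for example, is a plain GET endpoint returning JSON. Building the request URL is enough to show the idea (no request is actually sent here); note this API serves instant answers such as abstracts and related topics rather than full web search results:

```python
from urllib.parse import urlencode

def duckduckgo_ia_url(query):
    """Build a request URL for DuckDuckGo's Instant Answer API.

    `format=json` asks for a JSON response; `no_html=1` strips
    HTML from the returned text fields.
    """
    return "https://api.duckduckgo.com/?" + urlencode(
        {"q": query, "format": "json", "no_html": 1}
    )
```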





More articles by Igor Miazek
