How to scrape data from Twitter using Python and Twint

13/02/2019 at 13:57

Do you need to scrape thousands of tweets in a few lines of code?

One word: Twint!

Twint is a really nice and simple Python library for scraping Twitter accounts or keywords. There are a lot of Python modules that can do this task, so what are the benefits of using Twint? Two pros:

  • Can be used anonymously and without Twitter sign up
  • No rate limitations

Take a look at the TwintProject GitHub for more information: https://github.com/twintproject/twint

As an introduction to this incredible module, we are going to get all the tweets of some famous French politicians and analyze which words each of them uses most often.

Installation of Twint

Nothing complicated here; we use pip to install Twint:

pip3 install twint  
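Once installed, a minimal sketch like the following should confirm that everything works (TwitterDev is just an arbitrary public account picked for the test, and Limit caps the number of tweets fetched):

import twint

# Fetch a handful of tweets from one account as a smoke test
c = twint.Config()
c.Username = "TwitterDev"
c.Limit = 20
twint.run.Search(c)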

Get a list of the most famous French politicians and scrape their account names on Twitter

In order to automate the process, I looked for a webpage where all the Twitter accounts were gathered. Fortunately, I discovered this website: http://ymobactus.miaouw.net/labo-top-politiques.php?mode=followers&liste=personnalites&tendance=, a perfect resource for starting our work!

To scrape all the Twitter account names, I used the Python library BeautifulSoup.

By closely examining the website's HTML code, we can see that the information we need is found inside <a></a> tags:
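As an illustration (the page's exact HTML is not reproduced here, so the markup below is a made-up excerpt mirroring its structure), BeautifulSoup lets us pull those links out directly:

from bs4 import BeautifulSoup

# Made-up excerpt mirroring the page's structure
html = '<a href="https://twitter.com/EmmanuelMacron">Emmanuel Macron</a>'
soup = BeautifulSoup(html, "html.parser")
print([a["href"] for a in soup.find_all("a", href=True)])
# ['https://twitter.com/EmmanuelMacron']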


This is the little piece of code I used to scrape all the URLs:

import re
import requests
from bs4 import BeautifulSoup

# Lists for the account names and for stopwords built from those names
compte_twitter = []
stopwords = []
# Make the GET request to the url
r = requests.get('http://ymobactus.miaouw.net/labo-top-politiques.php?mode=followers&liste=personnalites&tendance=')
# Extract the content
d = r.content
# Create a soup object
soup = BeautifulSoup(d, "html.parser")
# Collect every href that links to a Twitter account
handles = [a["href"] for a in soup.find_all("a", href=True) if "twitter" in a["href"]]
# Loop to keep only the end of each url (the account name)
for url in handles:
    compte = re.search(r'[^/]+(?=/$|$)', url)
    stopwords.append(compte.group().lower())
    compte_twitter.append(compte.group())

As output, we have all the URL links gathered in a list, in this form: https://twitter.com/EmmanuelMacron

However, we only need the account names in order to scrape each politician's tweets. A little regex does the trick:

for url in handles:
    compte = re.search(r'[^/]+(?=/$|$)', url)
    # Add the account name to the compte_twitter list
    compte_twitter.append(compte.group())
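For example, applied to one of the scraped links, the lookahead regex keeps only the last path segment, with or without a trailing slash:

import re

url = "https://twitter.com/EmmanuelMacron"
# '[^/]+(?=/$|$)' matches the last non-empty segment of the path
compte = re.search(r'[^/]+(?=/$|$)', url)
print(compte.group())  # EmmanuelMacron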

As output, we get this list:

compte_twitter = ['lists', 'topolitiq', 'EmmanuelMacron', 'NicolasSarkozy', 'fhollande', 'MLP_officiel', 'JLMelenchon', 'Anne_Hidalgo', 'manuelvalls', 'alainjuppe', 'najatvb', 'benoithamon', 'nk_m', 'ChTaubira', 'RoyalSegolene', 'bayrou', 'fleurpellerin', 'FrancoisFillon', 'Lagarde', 'BrunoLeMaire', 'montebourg', 'CecileDuflot', 'EPhilippePM', 'jf_cope', 'laurentwauquiez', 'MarionMarechal', 'jeanmarcayrault', 'vpecresse', 'LaurentFabius', 'BGriveaux', 'BCazeneuve', 'pierremoscovici', 'nadine__morano', 'JLBorloo', 'dupontaignan', 'axellelemaire', 'GilbertCollard', 'f_philippot', 'aurelifil', 'CCastaner', 'jpraffarin', 'MarisolTouraine', 'MartineAubry', 'EvaJoly', 'cestrosi', 'gerardcollomb', 'xavierbertrand', 'PhilippePoutou', 'ECiotti', 'SLeFoll', 'datirachida', 'lepenjm', 'ymobactus', 'ymobactus']

Twint, it’s up to you!

Now we use Twint to scrape all the tweets of our dear friends contained in the list compte_twitter:

import os
import twint
from collections import Counter  # used further down for the word ranking

for compte in compte_twitter:
    # Name of the csv file that will be produced
    filename = compte + ".csv"
    # Set up the twint configuration object
    c = twint.Config()
    # Name of the account we are looking for
    c.Username = compte
    # We want a csv file containing all the collected tweets
    c.Store_csv = True
    # Display the number of tweets collected
    c.Count = True
    # If we already have a csv file of tweets, skip this account
    if os.path.exists(filename):
        print('already there: ' + compte)
    # Otherwise, collect the list of tweets
    else:
        c.Custom["tweet"] = ["tweet"]
        c.Output = filename
        # We do not want the tweets printed to the Python console
        c.Hide_output = True
        # We do not keep the hashtags
        c.Show_hashtags = False
        # We also store the tweets in an object
        c.Store_object = True
        # Launch the search
        twint.run.Search(c)
        # All the tweets are kept in the tweets_as_objects variable
        tweets_as_objects = twint.output.tweets_object

Now you can go outside and get some air, because the task may take a while… However, if you take a look at the console, you should see Twint successfully collecting tweets…

Voilà! After a few hours of processing, a bunch of .csv files have been created, each containing hundreds of tweets!
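A quick way to sanity-check one of the files (the exact columns depend on the Twint version; with the c.Custom["tweet"] setting above, each row should hold just the tweet text):

import csv

# Count the rows of one generated file and peek at the first tweet
with open("EmmanuelMacron.csv", encoding="utf8") as f:
    rows = list(csv.reader(f))
print(len(rows) - 1, "tweets")  # minus the header row
print(rows[1])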

BONUS: analyze the words most used by each politician on Twitter

Here is a small example of what we can do with this amount of data. The idea is to analyze the words most used by each politician on Twitter.

1. Delete all the “polluting words”

We extend our stopwords list (which already contains the lower-cased account names) with A LOT of little French words (like of, who, we, is, have…) in order to clean our dataset.

# Extend the stopword list (which already holds the account names) with common French filler words
stopwords += ['personnes','dire','autres','lui','note','blog','monsieur','dai','via','seront','jour','car','instagr','suivez','veut','player','fortes','dire','mai','tour','cet','donner','dailymotion','actualites','grand','grande','encore','vive','video','bit','mettre','youtube','lui','cela','videos','frontnational','mlp','donnons','interview','toute','celui','retrouvez','parce','doivent','aura','moments','retrouvons','propose','souhaite','sarkozy','nicolas','pen','marine','html','revenu','generationsmvt','leurs','mois','leur','our','place','premier','ici','marche','construire','toujours','premier','facebook','chaque','vie','for','direct','macron','sera','pscp','and','live','devons','the','entre','toutes','plan','belle','lefigaro','accueil','deux','serai','temps','soit','bravo','rien','aussi','visite','ceux','depuis','sans','trop','non','partir','votre','bien','sommes','ils','hier','tinyurl','soutien','quand','status','grand','nouvelle','jamais','avoir','rendez','face','mes','suis','veux','comme','contre','bourdindirect','fait','https','pays','aujourd','notre','de','en','alpes','ladroitederetour','politique','elle','politique','leur','tout','ont','de','la','à','le','les','et','pour','aux','pic','des','com','en','un','pas','une','je','ce','a','est','sur','twitter','cette','hui','être','sont','faut','faire','il','ne','avec','au','qui','que','du','dans','plus','on','http','se','y','nous','mais','fr','son','ai','avons','notre','vous','nos','par','www','tous','merci','ses','fait','doit','soir','demain','matin','soir','très','ces','ans','mon']

2. Count the most frequent words for each Twitter account

Now, for each account, we check the number of tweets collected. If this number is higher than 2 (because when we scraped the Twitter URLs earlier, we also collected URLs that were not real Twitter accounts), we open the file:

        # If the object holds more than 2 tweets (the first items of the
        # account list are non-existent accounts), we open the csv file
        if len(tweets_as_objects) > 2:
            frequency = {}
            p = {}
            # Open the csv file
            document_text = open(filename, 'r', encoding="utf8")
            # Read the file, converting every word to lower case
            text_string = document_text.read().lower()

Next, we clean up the text (already lower-cased above): hashtags and mentions are stripped out, a regex is built from our “polluting words” list, and the text is split into individual words:

            # Also remove every mention and hashtag quoted in the tweets
            text_string = re.sub(r'(\A|\s)@(\w+)', '', text_string)
            text_string = re.sub(r'(\A|\s)#(\w+)', '', text_string)
            # Build a regex matching every banned word from our stopwords list
            regex = re.compile('|'.join(r'\b{}\b'.format(word) for word in stopwords))
            # Split the cleaned text into individual words for the loop below
            match_pattern = re.findall(r'\b\w+\b', text_string)
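To see what these steps do, here is a quick illustration on a made-up tweet (stopwords_demo stands in for the real stopword list):

import re

tweet = "bravo @EmmanuelMacron pour la france #enmarche"
tweet = re.sub(r'(\A|\s)@(\w+)', '', tweet)  # strip mentions
tweet = re.sub(r'(\A|\s)#(\w+)', '', tweet)  # strip hashtags
stopwords_demo = ["la", "pour"]
regex = re.compile('|'.join(r'\b{}\b'.format(w) for w in stopwords_demo))
print([mot for mot in tweet.split() if not regex.match(mot)])
# ['bravo', 'france']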

The last step is a loop that counts, for each politician, the occurrences of every remaining word:

            for mot in match_pattern:
                # If the word is a stopword, do nothing
                if regex.match(mot):
                    p = mot
                # If it is not in the list, check whether it is longer than
                # 2 characters; if so, add it to our frequency object
                else:
                    if len(mot) > 2:
                        count = frequency.get(mot, 0)
                        frequency[mot] = count + 1
            txt_name = compte + '_occurrence.txt'
            print(txt_name)
            print(Counter(frequency).most_common(50))
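As an aside, the same ranking can be written more compactly with Counter alone; this sketch assumes text_string and stopwords are already defined as above:

from collections import Counter
import re

stop = set(stopwords)
# Tokenize into words of 3+ characters, drop stopwords, rank the rest
words = re.findall(r'\b\w{3,}\b', text_string)
print(Counter(w for w in words if w not in stop).most_common(50))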

And TADAAAAA! If you understand French, the results are delicious:

François Fillon:

[('france', 1262), ('ump', 609), ('projet', 393), ('europe', 288), ('militants', 286), ('national', 234), ('gouvernement', 217), ('travail', 214), ('redressement', 206), ('droite', 204), ('moi', 188), ('emploi', 187), ('etat', 184), ('gauche', 174), ('besoin', 170), ('candidat', 167), ('francois', 163), ('campagne', 158), ('pouvoir', 154)]

Emmanuel Macron:

[('france', 902), ('europe', 562), ('projet', 236), ('monde', 218), ('ensemble', 202), ('travail', 196), ('avenir', 172), ('femmes', 139), ('besoin', 127), ('histoire', 125), ('engagement', 113), ('terrorisme', 112), ('emploi', 108), ('confiance', 103), ('chacun', 102)]

Gilbert Collard:

[('france', 528), ('loi', 163), ('migrants', 155), ('passage', 137), ('ministre', 133), ('saint', 131), ('gauche', 129), ('gouvernement', 129), ('medias', 125), ('justice', 113), ('mots', 109), ('droit', 107), ('policiers', 105), ('question', 103), ('mort', 92), ('police', 92), ('pourquoi', 91), ('europe', 91), ('terrorisme', 88), ('homme', 86)]

Laurent Wauquiez:

[('france', 429), ('emploi', 257), ('droite', 233), ('europe', 178), ('auvergne', 117), ('travail', 114), ('entreprises', 107), ('valeurs', 91), ('gouvernement', 90), ('jeunes', 81), ('famille', 76), ('immigration', 74), ('taxes', 67), ('monde', 61), ('classes', 59), ('militants', 59), ('moyennes', 58), ('questions', 57), ('pouvoir', 57), ('droit', 57)]

Christiane Taubira:

[('monde', 123), ('justice', 112), ('gauche', 74), ('droit', 61), ('enfants', 58), ('france', 52), ('victimes', 48), ('loi', 47), ('femmes', 46), ('voix', 41), ('rencontre', 40), ('avenir', 39), ('etat', 34), ('jeunes', 33), ('esprit', 33), ('droits', 32), ('ensemble', 30), ('travail', 29), ('courage', 28), ('valeurs', 28), ('histoire', 28)]
