Languages of Swiss tweeters
Scraping a few days of Swiss tweets to see which languages are used in which cantons.
Where I live, at the meeting point of France, Switzerland and Germany, German is the primary language, but on the street you constantly hear French, English, Spanish and many other languages I don’t recognise. As someone with pretty terrible German, I’m usually saved by the fact that everyone seems to speak English. The question is: what does the linguistic map of Switzerland look like if you pull an easily accessible source of social media data?
Languages of Switzerland
The following map is taken from Wikipedia, and shows what the linguistic map of Switzerland should look like.
I want to compare this to what the map would look like if it were based on tweets sent from Switzerland. There is one big caveat here surrounding the tweets:
Twatters (a slightly derogatory term for people who tweet, favoured by my old supervisor Prof. Griffin) are not representative
People using twitter are a specific subgroup of people in Switzerland. Compared to the general population of Switzerland, I expect twitter users to be much younger, and more likely to be using English. Many people also disable location tracking, although I am less certain about what bias that may introduce.
Capturing the tweets
I used an R package to connect to the twitter streaming API, then left the connection open for several days, scraping all the tweets it detected as being from within Switzerland. It’s really important to note that this means tweets from users who have disabled location tracking (which I think is the majority of users) are not present in my data.
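To make the location caveat concrete, here is a minimal Python sketch of a location filter. It is not the R code used for the analysis: the `lon` and `lat` keys, the helper names and the bounding-box values are all assumptions made for illustration, and the box is only an approximate rectangle around Switzerland.

```python
# Approximate bounding box around Switzerland, in degrees (assumed values).
SWISS_BBOX = {"min_lon": 5.96, "max_lon": 10.49, "min_lat": 45.82, "max_lat": 47.81}


def has_location(tweet):
    """True when the tweet carries an explicit longitude/latitude pair."""
    return tweet.get("lon") is not None and tweet.get("lat") is not None


def in_switzerland(tweet, bbox=SWISS_BBOX):
    """Keep only geotagged tweets whose point falls inside the bounding box."""
    if not has_location(tweet):
        return False  # location tracking disabled: the tweet is silently lost
    return (bbox["min_lon"] <= tweet["lon"] <= bbox["max_lon"]
            and bbox["min_lat"] <= tweet["lat"] <= bbox["max_lat"])


# Hypothetical records, not real data.
tweets = [
    {"id": 1, "lon": 7.59, "lat": 47.56},   # roughly Basel
    {"id": 2, "lon": 2.35, "lat": 48.86},   # roughly Paris, outside the box
    {"id": 3, "lon": None, "lat": None},    # no location attached
]
kept = [t for t in tweets if in_switzerland(t)]
print([t["id"] for t in kept])
```

A plain rectangle also covers strips of the neighbouring countries, so a stricter version would test each point against the actual Swiss border.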
Languages of tweets
In the following table I present the languages detected two ways. Users refers to the language the twitter user has set in the website, while Tweets is the algorithmically determined language of the tweet. This coding is done by twitter. Tweets where the language cannot be determined are usually tweets of links, and these have been discarded.
Language | Users | Tweets |
---|---|---|
English | 1,806 (41%) | 7,559 (40%) |
Deutsch | 1,092 (25%) | 2,529 (13%) |
French | 802 (18%) | 2,767 (15%) |
Other | 456 (10%) | 3,522 (18%) |
Italian | 209 (5%) | 473 (2%) |
So it looks like English was the most common language people had the twitter website set to, and English was also the most common language tweeted within Switzerland. Surprisingly, Italian featured rarely (Spanish gave Italian a run for its money).
The fact that 5% of the users had twitter set to Italian, but only 2% of tweets were in Italian, got me wondering whether speakers of some languages tweet more, or whether users often tweet in a different language to the one they have the twitter website set to. To look at that, I tabulated the proportion of users who sent a tweet in a language different to their twitter settings.
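A count-and-percentage table like the one above can be produced with a few lines of Python. This is only a sketch with made-up input, not the code behind the post, and `language_shares` is a hypothetical helper name.

```python
from collections import Counter


def language_shares(languages):
    """Return {language: (count, percent)} ordered by descending count."""
    counts = Counter(languages)
    total = sum(counts.values())
    return {lang: (n, round(100 * n / total)) for lang, n in counts.most_common()}


# Hypothetical input: one language code per user (or per tweet).
ui_languages = ["en", "en", "de", "fr", "en", "it", "de", "en", "fr", "en"]
for lang, (n, pct) in language_shares(ui_languages).items():
    print(f"{lang}: {n} ({pct}%)")
```

Which rows count towards the total (for example, whether tweets with an undetermined language are included) is a choice this sketch leaves to the caller.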
UI language | Tweeted in a different language |
---|---|
Italian | 56% |
Deutsch | 52% |
French | 45% |
Other | 44% |
English | 37% |
And so it begins to become clearer: users often tweet in a language that is different to the one they set in Twitter. In fact, over half of the Italian-speaking people sending tweets in Switzerland use a language that is not Italian. I also broke this down by language. When Italian, German or French speakers sent a tweet in a different language, it usually tended to be English. Things got more interesting with English speakers: when they tweeted in a different language, it was sometimes German (25%) or French (14%), but there was a greater split across other languages like Arabic, Spanish and Dutch.
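The tabulation of mismatches between interface language and tweet language could be sketched as follows. The record keys (`user`, `ui_lang`, `tweet_lang`) and the function name are assumptions for illustration, and the sketch counts a user as mismatched if at least one of their tweets is in another language, which is one possible reading of the table.

```python
from collections import defaultdict


def mismatch_rate_by_ui_language(tweets):
    """Share of users, per UI language, with at least one tweet in another language."""
    ui_of_user = {}
    mismatched = set()
    for t in tweets:
        ui_of_user[t["user"]] = t["ui_lang"]
        if t["tweet_lang"] != t["ui_lang"]:
            mismatched.add(t["user"])

    users = defaultdict(int)  # number of users per UI language
    hits = defaultdict(int)   # of those, how many tweeted in another language
    for user, ui in ui_of_user.items():
        users[ui] += 1
        if user in mismatched:
            hits[ui] += 1
    return {ui: hits[ui] / users[ui] for ui in users}


# Hypothetical records, not real data.
sample = [
    {"user": "a", "ui_lang": "it", "tweet_lang": "it"},
    {"user": "a", "ui_lang": "it", "tweet_lang": "en"},
    {"user": "b", "ui_lang": "it", "tweet_lang": "it"},
    {"user": "c", "ui_lang": "de", "tweet_lang": "en"},
    {"user": "d", "ui_lang": "en", "tweet_lang": "en"},
]
print(mismatch_rate_by_ui_language(sample))
```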
Plotting density of tweets
In the plot below, I took one tweet from each user, and plotted the location and language of the actual tweet. It looks like the pattern follows what we expect from Wikipedia.
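Taking one tweet per user stops prolific accounts from dominating the map. A minimal sketch of that step is below. It keeps the first tweet seen for each user, which is an assumption, since the post does not say how the tweet was chosen, and the `user` key and function name are hypothetical.

```python
def one_tweet_per_user(tweets):
    """Keep the first tweet seen for each user, preserving input order."""
    seen = set()
    kept = []
    for t in tweets:
        if t["user"] not in seen:
            seen.add(t["user"])
            kept.append(t)
    return kept


# Hypothetical records, not real data.
stream = [
    {"user": "a", "lang": "de"},
    {"user": "b", "lang": "fr"},
    {"user": "a", "lang": "en"},  # second tweet from user "a" is dropped
]
print(one_tweet_per_user(stream))
```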
Plotting proportion of languages
In the plot below, I plot the proportion of each language within each canton. I was worried that the proportions could be misleading if the count of tweets was low in a canton, so if a canton had fewer than 10 tweets I left it out of the plot.
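The per-canton proportions with a minimum-count cut-off can be sketched like this. The input format (pairs of canton and language), the canton codes in the example and the function name are assumptions for illustration. Only the threshold of 10 tweets comes from the text.

```python
from collections import Counter, defaultdict

MIN_TWEETS = 10  # cantons with fewer tweets than this are not plotted


def canton_language_shares(rows, min_tweets=MIN_TWEETS):
    """rows: iterable of (canton, language) pairs.

    Returns {canton: {language: share}}, with None for cantons below the threshold.
    """
    per_canton = defaultdict(Counter)
    for canton, lang in rows:
        per_canton[canton][lang] += 1

    result = {}
    for canton, counts in per_canton.items():
        total = sum(counts.values())
        if total < min_tweets:
            result[canton] = None  # too few tweets for a trustworthy proportion
        else:
            result[canton] = {lang: n / total for lang, n in counts.items()}
    return result


# Hypothetical counts, not real data.
rows = [("ZH", "de")] * 6 + [("ZH", "en")] * 4 + [("UR", "de")] * 3
shares = canton_language_shares(rows)
print(shares)
```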
Plotting tweet density
The plots above don’t take into account the quantity of tweets in each region. The following plots show the density of tweets by language, to help highlight the fact that most tweets come from the Swiss cities.
Interestingly, it looks like a lot of Italian speakers are tweeting from Zurich. I would love to know if these are Italians, or Swiss people who speak Italian, but unfortunately that level of demographic detail goes beyond what is present in the twitter API.
Overall it looks like twitter reflects the distribution of German, French and Italian we expect to see.
Unfortunately for Romansh, that language is almost completely absent from tweets (Portuguese and Indonesian seem more popular, although Romansh may at times have been miscoded by the algorithm as an East European language).
English is the most common language overall, and while I expected twitter to be skewed towards English, I was not expecting the skew to be this strong. Some of these English tweets are likely to be from tourists, and it would be interesting to rerun this in a few months and only include users present at both time points (to remove the tourists).
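The tourist filter suggested above amounts to a set intersection on user ids across two collection periods. A minimal sketch, with hypothetical record keys and function names:

```python
def users_in_both(first_run, second_run):
    """User ids that appear in both collection periods."""
    return {t["user"] for t in first_run} & {t["user"] for t in second_run}


def keep_recurring(tweets, recurring_users):
    """Drop tweets from users who were not seen in both periods."""
    return [t for t in tweets if t["user"] in recurring_users]


# Hypothetical records, not real data.
run_1 = [{"user": "a", "lang": "en"}, {"user": "b", "lang": "en"}]
run_2 = [{"user": "a", "lang": "de"}, {"user": "c", "lang": "fr"}]
recurring = users_in_both(run_1, run_2)
print(keep_recurring(run_1 + run_2, recurring))
```

This would also drop residents who happened to tweet in only one of the two periods, so it trades sample size for a cleaner population.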
Notes
All of the code used for this analysis is in my GitHub account, in the swiss_twitter_languages repo.
While the code for my plots is in the repo, the plots with the relief in the background are based on a tutorial by a Swiss data journalist (timogrossenbacher.ch).