We’re all a little on edge about data privacy these days. Perhaps that’s why, when an Iranian data scientist posted a file of Clubhouse user information, the accusations started flying about a “leak” or a “hack.”
News organizations lumped it in with recent privacy breaches at Facebook and LinkedIn, the latter of which also involved data scraping, prompting audio-chat app Clubhouse to jump into the fray and defend its honor.
“Clubhouse has not been breached or hacked,” the company tweeted on Saturday, responding to a claim that the file was posted to a “hacker forum.” “The data referred to is all public profile information from our app, which anyone can access via the app or our API.”
Indeed, the site where the Clubhouse file was uploaded was not a shady, clandestine “hacker forum,” but an “online community of data scientists and machine learning practitioners” called Kaggle, which is owned by Google. The site was used by about 500,000 data scientists from around the world as of 2017, according to TechCrunch.
Clubhouse’s explanations did not, of course, stop the online freakout over potential privacy concerns. Politico cited Clubhouse’s comments in a piece Tuesday with the headline “‘This was not a breach:’ How Big Tech gaslights the world on data leaks.”
In any event, people using the voice-chat app cannot do so anonymously. Anyone tuning in to its discussion forums can see the profiles of everyone else in a virtual “room,” including names, photos, and Instagram and Twitter account information, as well as who invited each person to the app.
Launched in March 2020, Clubhouse remains in “beta” mode, meaning people must be invited by an existing user to the app. The file posted to Kaggle included data from about 1.3 million profiles.
Although the data came from profiles available to anyone on Clubhouse, grouping the information together in one file could make it much easier to "out" people who have linked Clubhouse accounts to private or anonymous Instagram or Twitter accounts, some Twitter users said.
“My file allows me to see exactly what Twitter users have an Instagram account. I already found a banker in Clubhouse who has a financial bio on Twitter and a very raunchy Instagram account..Sigh,” tweeted social media researcher Henk Van Ess.
“I stopped counting, but the identity of hundreds of anonymous Twitter accounts can be revealed thanks to the Instagram link Clubhouse gave you for free…And the other way around: anonymous Instagram accounts lead to public Twitter accounts,” Van Ess continued.
I tracked down Vahid Baghi, the data scientist who scraped Clubhouse, now a graduate student at the University of Tehran, and asked him what his motives were and what he thought about the controversy. (Baghi had explained on Kaggle that he wanted to “extract the hierarchical structure of invitations.”) Here are his responses.
Business of Business: What made you interested in learning the hierarchical structure of the invitations on Clubhouse, and what did you learn about that from the data?
Baghi: My major is Algorithms and Computing at the University of Tehran. It was here that I became interested in data science. In data science, there is a course called “network science,” which is about social network analysis. When I signed up for Clubhouse, I saw that my profile said ‘nominated by NAME.’ It was interesting to see how these invitations were linked together. When I saw the public Clubhouse API, I was interested in extracting user information to view the hierarchical structure. I put an example of this structure in Kaggle. Since the information was not sensitive I published it in Kaggle.
Hacking is not what I am and I don't do hacking! I’m just a junior data scientist. The person who invited me to Clubhouse doesn’t know me. As the app became a trend, I searched Twitter to see who had an invite. I texted someone, and he was kind and invited me.
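[The invitation hierarchy Baghi describes can be pictured as a tree in which each user points back to the person who invited them. The short Python sketch below is only an illustration of that idea, not Baghi's code: the names, the (user, inviter) record layout, and the helper functions are all hypothetical, and it does not touch Clubhouse's API or the Kaggle file.]

```python
from collections import defaultdict, deque


def build_invite_tree(records):
    """Group users under the person who invited them.

    `records` is a hypothetical list of (user, inviter) pairs;
    an inviter of None marks a user with no recorded inviter.
    """
    children = defaultdict(list)
    for user, inviter in records:
        children[inviter].append(user)
    return children


def invite_depths(children):
    """Return how many invitations separate each user from a root user."""
    depths = {}
    # Start from users with no recorded inviter and walk outward, level by level.
    queue = deque((root, 0) for root in children.get(None, []))
    while queue:
        user, depth = queue.popleft()
        depths[user] = depth
        for invitee in children.get(user, []):
            queue.append((invitee, depth + 1))
    return depths


# Made-up sample data, for illustration only.
records = [("ana", None), ("ben", "ana"), ("cy", "ana"), ("dee", "ben")]
tree = build_invite_tree(records)
depths = invite_depths(tree)
```

[In this made-up sample, "ana" sits at the top of the tree, "ben" and "cy" are one invitation away from her, and "dee" is two invitations away.]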
Have you done projects like this before using publicly available social media data? Has it ever sparked controversy like this?
I have done a similar project before and I have extracted IMDB data. The IMDB project was as simple as the Clubhouse project. However, since Clubhouse has become a trend, the news about its data has also spread.
Has Clubhouse been in touch with you at all? It sounds like they see nothing wrong. Perhaps it fits with their creator ethos.
No, I have not been contacted by Clubhouse yet. Honestly, this project wasn’t really a big or difficult one. Only data was scraped. I even suggested that Clubhouse publish their own data without user specifications to be used for academic analysis, such as the Netflix and MovieLens data sets. [Data sets involved in an academic project.]
Do you have any thoughts on the assertions that this is a "breach" or a “hack?”
This is not a hack. Hacking occurs when the server which stores user information is infiltrated and all information (not only the users’ profiles) gets stolen.
Do you see any negative consequences for data scientists, such as yourself, when there is controversy like this over analyzing data for educational purposes?
Data is power. This power may have negative consequences if the data scientist does not use it properly, depending on the country where he or she lives, because the laws of different countries differ.