
Letโs start with a short disclaimerโโโDoiT International does not promote, condone or advocate licit or illicit drug use.
The use of marijuana for recreational purposes is illegal in Israel. But where is a demand, there is always a supply. Telegrass is a network for purchasing marijuana in Israel. It operates on top of Telegramโโโa free chat application with communications being encrypted end-to-endย . The network offers quick deliveries with the option to write reviews about the sellers. The monthly turnover of Telegras is estimated at $17 million.
Users can read reviews about the sellers so they can make a better decision from who they are going to make a purchase. There is a conventional bot that helps the user to write a review and rate the seller on 4 points while each ratings is between 0 to 5.
- How was communication with the seller?
- What was the quality of the product?
- Was the price fair?
- Was it delivered on time?
Data Acquisition
Letโs start by grabbing the reviews from Telegram. We will use the telethonโโโa python library to work with Telegram API. We are only interested in the message body, since the reviews are submitted by bots (after some human review) so the message metadata is not particularly interesting.
https://gist.github.com/avivl/884ec584b19f4c079153e2f5891181a6
Here is the config file
https://gist.github.com/avivl/007efb0db4a2ebcf5539f054bc82a5d8
Lets run it:
python geth.py >reviews.txt
The output would look something like this:
[โืืืงืืจืช ืขื ืืกืืืจ: @weed1614\nืืืืืจ ืคืขืืืืช: ืืืจ ืฉืืขโโโืืืจ ืฉืืขโ, โืืืืช ืืขืช ืืกืคืจ: #71584\nื ืฉืื ืืืช: @Hoover656\nื ืฉืื ืืชืืจืื: 15:25 22/12/17โ, โืืืืช ืืืขืช:\nืฉืืื ืื ืฉืงื,ืืจืง ืืจืื ืืืฉ ืืืืื,ืืืฉ,ืืขืื,ืืคืืฆืฅ ืืืืงื ืื.โ, โ\nืชืงืฉืืจืช: ๐๐๐๐๐ (5/5)\nืืืืืช: ๐๐๐๐๐ (5/5)\nืืืืจ: ๐๐๐๐๐ (5/5)\nืืืชื ื: ๐๐๐๐๐ (5/5)โ, โืืืืฉืช ืืืืช ืืขืช: @TelegrassBot ืืฉืืจืืชืื!โ]
As you can see (even if you donโt read modern Hebrew) there is a mixture of letters, symbols, and numbers without any apparent structure (also remember the fact Hebrew is right to left language as opposed to English which is left to right). So the next step that we need to apply is prepared the data for analysis
Data Preparation
Before we can query and analyze the data, we would need to prepare it for the analysis i.e. clean it and make it more structured.
However, writing the code that will to this is a very time consuming and relatively dull task with a lot of trial and error. Lucky for us there is a new service by Google (still in beta though)โโโCloud Dataprep. Quoting from the productโs websiteย :
โGoogle Cloud Dataprep is an intelligent data service for visually exploring, cleaning, and preparing structured and unstructured data for analysis. Cloud Dataprep is serverless and works at any scale. There is no infrastructure to deploy or manage. Easy data preparation with clicks and no code.โ
Well, Iโve figured out that the reviews data could give Dataprep a run for itโs money. So in just 60 trivial steps or so (listed below) and without writing a single line of code I was able to create a structured data from the ratings. The DataPrep UI allows you to see in real time the transformation and therefore to visually debug and adjust the process. Some steps involved splitting the data based on charters or patterns and dropping rows with invalid values.
Finally, Cloud Dataprep natively supports writing the resulting data to a large variety of data sources. For my purpose, I have opted for Google BigQueryโโโGoogleโs serverless analytical database.
https://gist.github.com/avivl/3ec0d5fb38f054c8a7e8b673148321c7
Now the data is nice and clean.

Analysis
Once we have the data all nice and tidy, itโs time to analyze it. Letโs check how the price is related to the other ratings using BigQueryโs CORR() function which calculates Pearson Correlation coefficient. The coefficient of -1 indicates strong negative correlation, the coefficient of +1 indicates strong positive correlation and finally coefficient of zero means there is no correlation.
https://gist.github.com/avivl/ed30a0c2d1978ea4eaa0f6b9c51ec5b6
price_wait,price_communication,price_quality
0.679, 0.774, 0.793
As you can see the highest correlation to the satisfaction is the price and the quality of the product. Should we trust the rating? Well, the online rating is often subjected to self-selection bias. Letโs see if the reviews in Telegrass have the same issue.
https://gist.github.com/avivl/e9ecfdb9342a23af9a6515b40d581e4e
As we can see for the results, either our sellers are superb in every possible manner or we have some bias in the user ratings or perhaps maybe after using the product people tend to be more positiveโบ๏ธ

Now letโs see what are the most popular bigrams and are they correlated to the high ratings
https://gist.github.com/avivl/0e6ddbfdf1d7a4ee3d16c60bb0d18708
The most popular bigrams are:
- Amazing man
- Awesome product
- Highly recommended
We can see that there is a high correlation between the high scores and the most common bigram in the reviews.
Can you guess whatโs the most popular time for submitting a review? If I had to imagine, I would say sometime during the late evening, or the night, however, the actual data proved me wrong:
https://gist.github.com/avivl/8e7e828509407676ba129c1f376a06a6

Now letโs see if there is a correlation between the city size and the number of reviews. We will start by getting the top 20 reviews generating cities. The results will be added to a new table (by_city)
https://gist.github.com/avivl/84b0b258e47236b30d298e05f0335e74
Now will will add a new coulmen population and we will populate (pun intended) it with data from the the Israel Central Bureau of Statistics (data is for end of 2016). So letโs see the results:
https://gist.github.com/avivl/0efbadcf2683f14b8c036f6362f4a46chttps://gist.github.com/avivl/0a76b373152e4fcedb1fffdce34a2e72
Before we start to compare the results, one thing is standing out. Almost 1% of the Tel-Aviv population is writing reviews in Telegrass!, not using the product or just buying, but actively participating.
We will take a look at the first and the last results. If youโre a bit familiar with the demographics of Israel, it shouldnโt be a big surprise. Tel-Aviv is a city with young population and a center for many high tech companies, it is the most liberal city in Israel. People are very tech savvy and like to state their opinion. At the bottom we will find Bnei Brak, the city is dominated by low income religious people so itโs pretty obvious why not so many people are writing reviews.
Now it is time to play a bit with ML using TensorFlow. First I created a training set and a test data set.
https://gist.github.com/avivl/92454358fc144da18e0f3a0c962fe5ea
I took the code written by Akshay Pai and adjusted it a bit to my needs.
After loading the data and preparing it the ML code is this:
https://gist.github.com/avivl/3f3e876286670e19698a4bcc60f2313d
Once the training was done I used model.predict in order to predict the score.
https://gist.github.com/avivl/70a83e35793d3800e6c1d25fc55f97e0
I loaded the test set and calculated the RSME. The RMSE is 1.4, the values are between 0 to 20, so I results are not amazing but can predict pretty well from the text what will be the numerical rating.
You donโt have to be a full fledge data scientist or data engineer in order to get some interesting insights out of an un structure text, without writing a single line of code to prepare it for analysis. Then you can analyze it using BigQuery and TensorFlow to get some interesting insights.


