All posts by Roman Egger

ChatGPT as a recommender system for the tourism industry

by Roman Egger

ChatGPT is currently on everyone’s lips. There is no API for ChatGPT yet, but I tried to use the possibilities offered by OpenAI for a simple recommender system in tourism. Here is an example, which I implemented in Python with Flask (it doesn’t win any beauty contests, but it shows the idea behind it) using OpenAI’s GPT-3. GPT-3, the latest version of the Generative Pre-trained Transformer (GPT) model, has the ability to generate human-like text, making it a powerful tool for creating a recommender system in the tourism industry.

A recommender system in tourism is a tool that helps visitors plan their trip by suggesting places to visit, things to do, and accommodations based on their preferences. GPT-3 can be used to create a more personalized and accurate recommendation system by analyzing the user’s input and providing relevant suggestions.

One way GPT-3 can be used for a recommender system in tourism is by creating a chatbot that can understand the user’s needs and preferences. For example, if a user says they are interested in history and culture, the chatbot can recommend historical sites and museums to visit. In this video, you can see a simple input field where a user enters a question and how GPT-3 outputs the result.

GPT-3 can also be used to generate personalized itineraries for visitors based on their preferences and the time they have available. The generated itineraries can include suggested activities, accommodations, and transportation options, making it easy for visitors to plan their trip.

In a second attempt, I downloaded a list of about 700 destinations in Austria. This dataset contains the name of each place and its geocoordinates. With a Python script, I iterated over the list and, for each place, retrieved the answer to the prompt: “Describe {city} as a tourist destination.”
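A minimal sketch of this iteration step could look as follows. The helper names, the example city list, and the way the OpenAI call is wrapped are illustrative assumptions, not the exact script or dataset I used:

```python
# Build the prompt for each destination and collect one answer per place.
# `complete` is any callable that sends a prompt to GPT-3 and returns the
# generated text (e.g. a thin wrapper around the OpenAI API).
def build_prompt(city):
    """Fill the prompt template for one destination."""
    return f"Describe {city} as a tourist destination."

def describe_destinations(cities, complete):
    """Iterate over the list and collect one answer per place."""
    return {city: complete(build_prompt(city)) for city in cities}

# A dummy `complete` shows the mechanics without calling the real API:
answers = describe_destinations(
    ["Salzburg", "Innsbruck"],
    lambda p: f"[GPT-3 answer to: {p}]",
)
```

With the real API, the lambda would be replaced by a call to OpenAI’s completion endpoint with the prompt and an API key.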

Afterwards, the answers were text-preprocessed (lowercasing, stopword removal, etc.) and embedded with BERT. The text was thus converted into a high-dimensional vector that can be further computed.
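The preprocessing step can be sketched in a few lines of plain Python. The tiny stopword set here is only illustrative; in practice one would use a full NLTK or spaCy stopword list before computing the BERT embeddings:

```python
# Simplified stand-in for the preprocessing pipeline: lowercasing,
# punctuation stripping, and stopword removal on whitespace tokens.
STOPWORDS = {"the", "a", "an", "is", "and", "of", "for", "in"}

def preprocess(text):
    """Lowercase, tokenize on whitespace, strip punctuation, drop stopwords."""
    tokens = [t.strip(".,!?;:()").lower() for t in text.split()]
    return [t for t in tokens if t and t not in STOPWORDS]

tokens = preprocess("Salzburg is a beautiful city in the heart of Austria.")
# -> ["salzburg", "beautiful", "city", "heart", "austria"]
```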

Afterwards, the vectors were clustered using the Louvain community detection algorithm. Thus, texts with similar descriptions could be grouped together. Finally, I visualized all locations on a map, with each location colored according to the assigned cluster. One can clearly see the places in Tyrol, which can largely all be assigned to the red cluster, as well as places in the Salzburg Salzach Valley. These are all typical winter and ski destinations. The locations in the blue cluster, primarily around Vienna and in Lower Austria, behave in a thematically similar manner.
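To make the graph-based idea concrete, here is a self-contained toy: build a similarity graph over the embedding vectors and group vectors that are connected above a cosine-similarity threshold. This connected-components grouping is a much simpler stand-in for the Louvain community detection actually used, shown only to illustrate how “texts with similar descriptions get grouped together”:

```python
# Toy grouping of embedding vectors via a cosine-similarity graph.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def group_by_similarity(vectors, threshold=0.9):
    """Group vector indices whose pairwise cosine similarity exceeds the threshold."""
    n = len(vectors)
    parent = list(range(n))  # naive union-find over the similarity graph
    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(vectors[i], vectors[j]) > threshold:
                parent[find(j)] = find(i)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# two "ski town" vectors pointing one way, one "city" vector pointing another
clusters = group_by_similarity([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0]])
# -> [[0, 1], [2]]
```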

It is only a small demo, but it shows the potential that the use of GPT-3 will have for tourism in the future.

In conclusion, GPT-3 is a powerful tool that can be used to create a more personalized and accurate recommender system in the tourism industry. By understanding the user’s needs and preferences, GPT-3 can provide relevant suggestions and make it easier for visitors to plan their trip.

GAIA-X and Tourism

by Roman Egger

Have you ever heard of Gaia-X? No? It is an ambitious project supported by the European Union with the goal of creating an open, secure, and interoperable platform for cloud services and data infrastructure in Europe. It is supported by a consortium of companies and organizations from various industries across Europe and is part of the EU’s strategy to promote digitalization in Europe. The platform aims to enable users to access and connect services and data from different providers, creating new business opportunities and strengthening the competitiveness of the European economy.

So, what does this have to do with the tourism industry?

The tourism industry is a significant contributor to the European economy, employing millions of people. In recent years, however, the industry has undergone significant changes due to technological advancements and digitalization, making it increasingly important for businesses to adapt and use new technologies to stay competitive.

This is where Gaia-X comes in. By leveraging the capabilities of Gaia-X, businesses in the tourism industry can connect and use data from different providers to create personalized offers for tourists. For example, by linking data from airlines, hotels, and other service providers, businesses can offer customized travel routes and packages tailored to the needs and preferences of tourists.

In addition to this, businesses in the tourism industry can also benefit from the security and interoperability of Gaia-X. By using Gaia-X, they can ensure that their data is processed securely and can be easily connected to other services and providers, improving the efficiency and performance of their businesses.

Overall, Gaia-X offers a unique opportunity for businesses in the tourism industry to adapt and use new technologies to stay competitive and create new business opportunities. It is important for businesses in this industry to take advantage of the opportunities provided by Gaia-X and get involved in the project to reap the benefits.

Whisper, whisper….

by Roman Egger

Have you heard about Whisper, OpenAI’s speech recognition model? If not, you’re in for a treat! I recently tried out Whisper for myself and was blown away by its capabilities. Whisper is an automatic speech recognition model developed by OpenAI, trained on 680,000 hours of multilingual and multitask supervised data. You can choose between several model sizes, from tiny and base up to large.

I was particularly impressed with the speech recognition of the large model, which, however, takes very long to produce output if you don’t run it on a GPU. As input, I used an expert interview I recently gave for a PhD student. It recognized everything correctly, even a mixture of German and English at the beginning of the interview – amazing! So I am thinking of setting up Whisper for my students: whenever they do qualitative studies, it can be used for transcription.

Overall, I highly recommend giving Whisper a try if you have the opportunity. It’s super easy to use (but I struggled with ffmpeg, which you also need to install).

import whisper

model = whisper.load_model("base")
options = whisper.DecodingOptions(fp16=False)  # optional: force FP32 decoding on CPU
result = model.transcribe("test.mp3")

Will ChatGPT write our scientific papers soon?

by Roman Egger & ChatGPT

ChatGPT is a large language model trained by OpenAI that can generate human-like text. This technology has the potential to assist with writing scientific articles by providing suggestions for words and phrases, as well as by generating entire sentences or paragraphs based on a given prompt. However, there are also some potential disadvantages and challenges associated with using ChatGPT for writing scientific articles.

One of the main advantages of ChatGPT is that it can help writers overcome writer’s block by providing suggestions for words and phrases. This can be especially useful for writers who are struggling to find the right words to express their ideas. ChatGPT can also assist with the editing and proofreading process by generating alternative word choices and suggesting grammatically correct sentences.

Another potential advantage of ChatGPT is that it can help writers generate ideas for their articles. By providing a prompt or topic, writers can use ChatGPT to generate a list of potential ideas or subtopics to explore in their article. This can be a useful way to brainstorm and come up with new and interesting ideas for scientific articles.

However, there are also some potential disadvantages and challenges associated with using ChatGPT for writing scientific articles. One of the main disadvantages is that ChatGPT is not capable of understanding the context or meaning of the text it generates. This means that the suggestions and sentences generated by ChatGPT may not always be relevant or accurate, and may require additional editing and proofreading by the writer.

Another potential disadvantage is that ChatGPT is not capable of generating original content. This means that writers who rely heavily on ChatGPT may be at risk of inadvertently committing plagiarism. Plagiarism is a serious concern in the scientific community, and can have serious consequences for writers who are found to have copied the work of others without proper attribution.

Overall, ChatGPT has the potential to assist with the writing of scientific articles by providing suggestions for words and phrases, as well as by generating ideas and sentences based on a given prompt. However, there are also some potential disadvantages and challenges associated with using ChatGPT, including the risk of generating irrelevant or inaccurate content, and the potential for plagiarism. Writers who use ChatGPT should be aware of these potential drawbacks and take steps to ensure that their use of the technology does not compromise the quality or integrity of their work.

The following video shows ChatGPT in action, writing a short essay about topic modeling in tourism.

The whole article was AI-generated by ChatGPT

It only remains to mention that the quotes are also “fictitious”: none of the given sources actually exists. So if our students use ChatGPT in the future to write their papers, we will find out quickly with a Google Scholar comparison.

A Topic Modeling Comparison Between LDA, NMF, Top2Vec, and BERTopic to Demystify Twitter Posts

by Joanne Yu & Roman Egger

Full paper published by: Roman Egger and Joanne Yu, Frontiers in Sociology.

Have you been wondering what would be the best way to analyse short, text-heavy, and unstructured content from social media? Certainly, social media has opened an entirely new path for social science research, especially when it comes to the overlap between human relations and technology. In the 21st century, data-driven approaches provide brand-new perspectives on interpreting a phenomenon. Yet, methodological challenges emerge in both the data collection and analysis process. To shed light on the efficacy of different algorithms, this article takes tweets with #covidtravel as well as the combination of #covid and #travel as the reference points and evaluates the performance of four topic modeling algorithms, namely latent Dirichlet allocation (LDA), non-negative matrix factorization (NMF), Top2Vec, and BERTopic.

Introduction to the four models

LDA is a generative probabilistic model for discrete datasets. It is a three-level hierarchical Bayesian model, where each collection item is represented as a finite mixture over an underlying set of topics, and each topic is represented as an infinite mixture over a collection of topic probabilities. Although the number of topics must be specified in advance, LDA provides researchers with an efficient resource to obtain an explicit representation of a document.

In contrast to LDA, NMF is a decompositional, non-probabilistic algorithm using matrix factorization and belongs to the group of linear-algebraic algorithms. NMF works on TF-IDF-transformed data by breaking down a matrix into two lower-ranking matrices. Specifically, NMF decomposes its input, which is a term-document matrix (A), into a product of a terms-topics matrix (W) and a topics-documents matrix (H). W contains the basis vectors, and H contains the corresponding weights.
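The decomposition A ≈ W·H can be made concrete with a toy implementation of the classic multiplicative update rules (Lee & Seung). In practice one would run sklearn’s or gensim’s NMF on TF-IDF-transformed data; this pure-Python version, with an invented block-structured matrix, only shows the mechanics of the factorization:

```python
# Toy NMF via multiplicative updates: factorize A (terms x docs) into
# nonnegative W (terms x topics) and H (topics x docs).
import random

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(r) for r in zip(*X)]

def nmf(A, k, iters=200, seed=0):
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    W = [[rng.random() for _ in range(k)] for _ in range(m)]
    H = [[rng.random() for _ in range(n)] for _ in range(k)]
    eps = 1e-9
    for _ in range(iters):
        # H <- H * (W^T A) / (W^T W H)
        WH, Wt = matmul(W, H), transpose(W)
        num, den = matmul(Wt, A), matmul(Wt, WH)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n)] for i in range(k)]
        # W <- W * (A H^T) / (W H H^T)
        WH, Ht = matmul(W, H), transpose(H)
        num, den = matmul(A, Ht), matmul(WH, Ht)
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)] for i in range(m)]
    return W, H

def error(A, W, H):
    WH = matmul(W, H)
    return sum((A[i][j] - WH[i][j]) ** 2 for i in range(len(A)) for j in range(len(A[0])))

# an invented rank-2 "term-document" matrix with two clear topics
A = [[5, 5, 0, 0], [5, 5, 0, 0], [0, 0, 4, 4], [0, 0, 4, 4]]
W, H = nmf(A, k=2)
```

After a few hundred iterations the reconstruction error drops close to zero, and the two columns of W align with the two topic blocks of A.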

Top2Vec is a comparatively new algorithm that uses word embeddings. That is, the vectorization of text data makes it possible to locate semantically similar words, sentences, or documents within spatial proximity. For example, words like “mom” and “dad” should be closer than words like “mom” and “apple.” Since word vectors that emerge closest to the document vectors seem to best describe the topic of the document, the number of documents that can be grouped together represents the number of topics.

BERTopic builds upon the mechanisms of Top2Vec and provides document embedding extraction with a sentence-transformers model for more than 50 languages. BERTopic also supports UMAP for dimension reduction and HDBSCAN for document clustering. The main difference from Top2Vec is the application of a class-based c-TF-IDF algorithm, which compares the importance of terms within a cluster and creates term representations.
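The class-based c-TF-IDF idea can be sketched in a few lines: join all documents of a cluster into one “class document” and score each term by its within-class frequency weighted against its overall frequency. BERTopic’s actual implementation differs in details such as normalization; the tiny token lists below are invented, and this is only an illustration of the principle:

```python
# Sketch of class-based c-TF-IDF: score(t, c) = tf(t, c) * log(1 + A / f(t)),
# where A is the average number of words per class and f(t) the term's
# total frequency across all classes.
import math
from collections import Counter

def c_tf_idf(classes):
    """classes: dict mapping class name -> list of tokens (the joined documents)."""
    counts = {c: Counter(tokens) for c, tokens in classes.items()}
    total = Counter()
    for cnt in counts.values():
        total.update(cnt)
    avg_words = sum(len(t) for t in classes.values()) / len(classes)
    scores = {}
    for c, cnt in counts.items():
        size = sum(cnt.values())
        scores[c] = {t: (n / size) * math.log(1 + avg_words / total[t]) for t, n in cnt.items()}
    return scores

scores = c_tf_idf({
    "ski": ["snow", "ski", "ski", "lift"],
    "city": ["museum", "opera", "museum", "snow"],
})
top_ski = max(scores["ski"], key=scores["ski"].get)  # -> "ski"
```

The shared term “snow” gets down-weighted in both classes, while cluster-specific terms rise to the top – exactly the term representation BERTopic uses to label topics.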

Results explanation of LDA and NMF

Starting from LDA, three hyperparameters are required. A grid search was performed for the number of topics (K) as well as for beta and alpha. The search for an optimal number of topics in our study started with a range from two to 15, with a step of one. During the process, only one hyperparameter varied, and the others remained unchanged until reaching the highest coherence score. To facilitate a clear interpretation of the extracted information from a fitted LDA topic model, pyLDAvis was used to generate an intertopic distance map.
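The one-hyperparameter-at-a-time search described above can be expressed as a small generic routine. The `coherence` callable stands in for the real coherence computation (e.g. gensim’s CoherenceModel over a fitted LDA model); the toy scoring function below is invented purely to demonstrate the loop:

```python
# Vary one hyperparameter over its grid while the others stay fixed,
# keep the value with the highest coherence, then move on.
def tune(grids, coherence, start):
    """grids: dict param -> list of candidate values; start: initial setting."""
    best = dict(start)
    for param, values in grids.items():
        scored = []
        for v in values:
            trial = dict(best)
            trial[param] = v
            scored.append((coherence(trial), v))
        best[param] = max(scored)[1]  # keep the highest-coherence value
    return best

# toy coherence that peaks at K=5 and alpha=0.1
toy = lambda p: -abs(p["K"] - 5) - abs(p["alpha"] - 0.1)
best = tune({"K": list(range(2, 16)), "alpha": [0.01, 0.1, 1.0]}, toy, {"K": 2, "alpha": 0.01})
# -> {"K": 5, "alpha": 0.1}
```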

As for NMF, an open-source Python library, Gensim, was used to estimate the optimal number of topics. By computing the highest coherence score, 10 topics could be identified in our research. Due to a clear distinction between all the identified topics in the NMF model (see detailed results in our paper), we conclude that the results obtained from NMF are more in line with human judgment, thereby outperforming LDA in general. Yet, as both models do not allow for an in-depth understanding of the phenomenon, the next section focuses on the topic models that use embedding representations.

Results explanation of BERTopic and Top2Vec

By relying on an embedding model, BERTopic and Top2Vec require an interactive process for topic inspection. Both algorithms allow researchers to discover highly relevant topics revolving around a specific term for a more in-depth understanding. Using Top2Vec for demonstration purposes, presuming that we are interested in the term “cancel” during COVID-19, Top2Vec produces relevant topics based on the order of their cosine similarity, ranging from 0 to 1. Thereafter, the most important keywords for a particular topic can be retrieved. But, ultimately, an inspection of individual tweets is also highly recommended. For example, the keywords for the topic “cancel” include the following:

[“refund,” “booked,” “ticket,” “cancelled,” “tickets,” “booking,” “cancel,” “flight,” “my,” “hi,” “trip,” “phone,” “email,” “myself,” “hello,” “couldn,” “pls,” “having,” “guys,” “am,” “sir,” “supposed,” “hopefully,” “me,” “excited,” “postpone,” “so,” “days,” “dad,” “paid,” “option,” “customers,” “request,” “bihar,” “thanks,” “amount,” “due,” “waiting,” “to,” “got,” “back,” “impossible,” “service,” “hours,” “complete,” “before,” “wait,” “nice,” “valid,” “book”].

Turning to BERTopic, since some of the topics are close in proximity, visualization and topic reduction would provide a better understanding of how the topics truly relate to each other. To reduce the number of topics, hierarchical clustering can be performed based on the cosine distance matrix between topic embeddings. Our study took 100 topics as an example to provide an overview of how and to which extent topics can be reduced.


For an overall evaluation based on human interpretation, this study supports the potency of BERTopic and NMF, followed by Top2Vec and LDA, in analyzing Twitter data. While, in general, both BERTopic and NMF provide a clear cut between any identified topics, the results obtained from NMF can still be considered relatively “standard.” The table below summarizes the pros and cons of applying LDA, NMF, BERTopic, and Top2Vec in order to help facilitate social scientists in the necessary preprocessing steps, proper hyperparameter tuning, and comprehensible evaluation of their results. Please refer to our study for a complete step-by-step guide and detailed results.

How to cite: Egger, R., & Yu, J. (2022). A Topic Modeling Comparison Between LDA, NMF, Top2Vec, and BERTopic to Demystify Twitter Posts. Frontiers in Sociology, 7.

Looking behind the scenes at dark tourism: a comparison between academic publications and user-generated content using natural language processing

by Joanne Yu & Roman Egger

Full paper published by: Joanne Yu and Roman Egger, Journal of Heritage Tourism

Transferring knowledge between academia and tourists has never been an easy task. Proportionally, the number of journal articles that describe empirical research is not increasing as rapidly as the amount of information shared on the Internet. With the rise of today’s visual-based social media, such as Instagram, shared content has the potential to offer new aspects of how visitors comprehend and appreciate a particular experience. Thus, our study takes the emerging phenomenon of dark tourism as the research context to provide solutions in revealing the knowledge discrepancies between academic publications and user-generated content.

Data extraction for academic publications and Instagram posts

Academic publications relevant to dark tourism were selected based on the top 10% of most frequently cited literature according to citation frequency on the Web of Science from the 1990s and onwards. Specifically, ‘dark tourism’ was used as a keyword, and the result was refined based on hospitality, leisure, and tourism categories. In total, 26 journal papers and two book chapters in English were selected. Thereafter, the full texts (in PDF format) were parsed, and reference lists were removed using PyPDF2 in Python. Turning to Instagram data, we extracted Instagram posts captioned with the hashtag #darktourism. Data was crawled in March 2020 based on 26,581 public posts. After the removal of content shared by business accounts and non-English posts, the final dataset contained 12,835 posts published by 4,711 personal accounts. Extracted data included captions, the date of the post, the check-in location, and the post URL.

Both academic texts and Instagram captions were pre-processed using numerous natural language processing modules in Python. A list of stopwords was prepared, and non-informative texts were removed. Unknown characters, numbers, and usernames were excluded. Slang words were reformed, and diacritics were converted to the basic format. Finally, texts were tokenised into small units.

Geographic flow map

The study’s results start from a geographic flow map to visualise tourist movements and provide an overview of dark tourism spots based on the data. Based on the collected Instagram data, entity recognition was done in Python using spaCy to extract tourists’ country of origin from users’ Instagram profiles. The check-in location of a post is considered a tourist’s travel destination. The corresponding geographic coordinates of locations were identified using Google’s geocoding API, which resulted in 2,954 trips between pairs of tourists’ origins and travel destinations.

Concentric circles show where the tourists come from, and solid circles refer to the dark sites where tourists travelled to. The two major dark sites are in Ukraine (e.g., Pripyat and Chernobyl), with the major tourists to Ukraine coming from the UK, Argentina, and the USA. Readers who are interested in more details can refer to the interactive map on

Scatterplot: Comparison between the viewpoints of scholars and tourists

This study visualises the highly associated terms that appeared in the top 10% of academic publications and Instagram posts using statistical natural language processing. The total word count for Instagram posts was 400,567, and for academic publications was 161,726. A model-based term scoring algorithm was applied to compute the association of the terms based on precision, recall, non-redundancy, and characteristicness. The terms were assigned to either one of the categories on a two-dimensional scatterplot (see Kessler (2017) for the complete mathematical procedure:

Each point represents the usage of a word based on its term frequencies, suggesting the plot coordinate for the word. The scatterplot also features a scaled F-score, ranging from −1 to +1. Words with scores near +1 are closer to the y-axis and are used more on Instagram (darker shades of blue); words with scores closer to −1 and the x-axis appear more in the literature (darker shades of red). If the terms present themselves more equally in both categories, the colour becomes more transparent. The top scoring terms unique in either one category are listed on the right side of the plot. Readers who are interested can refer to the interactive scatterplot URL:
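The underlying contrast between the two corpora can be approximated with a simple frequency-ratio score in [−1, +1]. Note that Scattertext’s actual scaled F-score (Kessler, 2017) combines normalized precision and frequency and is more involved than this sketch; the token lists here are invented for illustration:

```python
# For each word, compare its relative frequency in two corpora and map the
# contrast to a score in [-1, +1] (positive = more typical of corpus A).
from collections import Counter

def contrast_scores(tokens_a, tokens_b):
    fa, fb = Counter(tokens_a), Counter(tokens_b)
    na, nb = len(tokens_a), len(tokens_b)
    scores = {}
    for w in set(fa) | set(fb):
        pa, pb = fa[w] / na, fb[w] / nb
        scores[w] = (pa - pb) / (pa + pb)
    return scores

scores = contrast_scores(
    ["chernobyl", "trip", "amazing", "trip"],   # Instagram-like corpus
    ["tourism", "dark", "theory", "trip"],      # literature-like corpus
)
# "amazing" -> +1 (Instagram only), "theory" -> -1 (literature only),
# "trip" lands in between because it appears in both corpora
```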

The scatterplot provides a snapshot of large and unstructured datasets by using statistical NLP to automatically compare documents in a language-independent way. In doing so, the interactive plot allows researchers to investigate topics and locations that are worth being looked at in a particular discourse based on the disclosure of authentic materials. Thus, it is shown whether the reality in tourists’ minds matches the research agendas and where unsolved thematic issues are revealed as relevant for future research directions.

Holistically, since texts are the dominant form of user-generated content, the example applied in our study (i.e., user-generated content vs academic publications) can be used to reveal common and uncommon ground across two categories in other contexts. For example, texts that can be contrasted and analysed in this way include the comparison between texts of tourists vs residents or domestic vs international travellers, amongst others.

How to cite: Yu, J., & Egger, R. (2022). Looking behind the scenes at dark tourism: a comparison between academic publications and user-generated content using natural language processing. Journal of Heritage Tourism, 1-15.

Measuring tourism with big data?

by Dirk Schmücker and Julian Reif
Paper: Measuring tourism with big data? Empirical insights from comparing passive GPS data and passive mobile data. In: Annals of Tourism Research Empirical Insights, 3 (2022)

Currently, there are two valid approaches for measuring tourism frequencies and flows: (a) locally installed one-spot sensors and (b) tracking solutions based upon signal chains coming from GNSS-equipped smartphones or the mobile network they are connected to. There are some more variants of data sources, for example, “mini-signal chains” (constructed from local sensors catching the Bluetooth or wifi signal, or from public social media postings) or using water consumption in a destination, but basically, there are these two approaches: local sensors and smartphone-based tracking. In this article, we deal with smartphone-based tracking data.

In the first step, we tried to identify and classify the different data sources relevant to tourism research. For this purpose, we propose four categories in the paper:

  • Cat. A: Multi-Spot Measurements
  • Cat. B: Coupled Spots
  • Cat. C: Single Spot Measurements
  • Cat. D: Other Measurements

In order to better compare the data sources, we have also developed a set of 13 evaluation indicators summarized in four dimensions (Figure 1):

Figure 1: Categories of data sources

  • Specific tourist dimensions
  • Time and Space
  • Generic dimensions
  • Social and organizational dimensions

In addition to this more theoretical look at the subject, we work with empirical data, more precisely with tracking data. Tracking data are Big Data in terms of being 3V (volume, velocity, variety) and also in terms of being “exhaust data”: these data are not generated with the goal of tracking users, but for billing, technical operations, or locating one’s device to get the correct weather forecast. However, once the data are there, why not reuse them? This idea has turned into a vibrant industry, and datasets are commercialized for considerable sums of money.

There is a growing body of academic research on such data sources, and researchers usually do fancy things with the data, mostly applying advanced ARIMA models or machine learning algorithms. We were more interested in the relation of the data sources to the real world: would they be able to reflect the results from reference data sources, and would they be able to identify different types of mobility?

Therefore, we use two data sources (passive mobile phone data and passive GPS location events) to get empirical insights on day and overnight tourists in four different destinations in Schleswig-Holstein, Germany (St. Peter-Ording, Büsum, Amrum, Multimar Wattforum). We compare Big Data with local reference data from tourist destinations. Figure 2 shows an example of a visual comparison of different data sources for overnight visitors. Results show that mobile network data are on a plausible level compared to the local reference data and are able to predict the temporal pattern to a very high degree. GPS app-based data also perform well but are less plausible and precise than mobile network data.
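One simple way to quantify how well a big-data series tracks a local reference series over time is Pearson’s r on the per-period counts. The numbers in this sketch are invented for illustration and are not our study’s data:

```python
# Pearson correlation between a reference time series (e.g. counted
# overnight stays) and a big-data series (e.g. mobile network device counts).
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

reference = [120, 150, 300, 280, 90]       # invented daily reference counts
mobile = [1100, 1400, 2950, 2700, 900]     # invented mobile-network counts
r = pearson(reference, mobile)             # close to 1: same temporal pattern
```

A high r means the big-data source reproduces the temporal pattern of the reference data even if its absolute level is on a different scale.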

Figure 2: An example of comparing different data sources for overnight tourists

Here is a link to the open access paper:

Moving Patterns – a GIS Analysis

by Roman Egger

By its very nature, tourism is closely linked to geospatial data. Nevertheless, publications dealing with geo-analytical issues are rather the exception in tourism literature. This is probably due to the fact that geographers, if they publish in a tourism context, tend to publish in their geo-community rather than in tourism-related journals. This is also the reason why I have included this topic in the book under chapter 24. Andrei Kirilenko provides insights into basic terminology, typical problems and methodological approaches to make them accessible to tourism researchers. I, too, have been dealing with geodata to a greater or lesser extent for some time and would like to present an exciting method for tourism at this point. Geodata is available to us either through observation, GPS tracking (see our study in the city of Salzburg or the study of the “Freilichtmuseum” I did with my students), and recently also increasingly through the use of mobile phone data. In this example, I use GPS data from mobile devices of tourists in Salzburg (3 months in 2019), which I received from NEAR. NEAR sells GPS data from 1.6 billion users in 44 countries, with 5 billion events processed per day. So this easily becomes real big data!

On the one hand, these data can be segmented according to the markets of origin in order to identify possibly different behavior across markets. For city destinations, however, the movement patterns of tourists are of particular interest. So what are the destination’s “beaten tracks” – a question that is particularly relevant in destinations struggling with the overtourism phenomenon.

The first graphic shows the available GPS data of tourists in the city of Salzburg. In addition, red dots can be seen, marking the detected cluster points.

German Tourists (2019)

Here I used the Python Library “MovingPandas” by Anita Graser. Cluster points are created and subsequently, the frequency between the points can be visualized.

The thicker the line, the more frequented the route between the two clusters. Thus, it can be visualized very well how tourists navigate through the city. Unfortunately, MovingPandas still lacks the possibility to snap the edges to the street network, so the lines go across the river or over buildings. Nevertheless, it is an exciting approach to capturing typical movement patterns of tourists.

And this is how it’s done (code adapted from Anita’s examples):

import pandas as pd
import geopandas as gpd
import movingpandas as mpd
import hvplot.pandas  # noqa: registers the .hvplot accessor
from shapely.geometry import Point
from datetime import timedelta
from holoviews import dim

df = ...  # load your dataset with columns "Latitude", "Longitude", "Date", "Time of Day", "DeviceID"

geometry = [Point(xy) for xy in zip(df["Longitude"], df["Latitude"])]
geo_df = gpd.GeoDataFrame(df, crs="EPSG:4326", geometry=geometry)  # build GeoDataFrame

geo_df["Date"] = pd.to_datetime(geo_df["Date"] + " " + geo_df["Time of Day"])
geo_df = geo_df.set_index("Date")  # the date needs to be your index

# build the trajectory collection (one trajectory per device)
traj_collection = mpd.TrajectoryCollection(geo_df, traj_id_col="DeviceID", obj_id_col="DeviceID")

# split trajectories into trips wherever there is an observation gap of more than 5 minutes
trips = mpd.ObservationGapSplitter(traj_collection).split(gap=timedelta(minutes=5))

# aggregate the trips into significant points, clusters, and flows
# (the aggregator was missing in the original snippet; the distance and
# duration values are placeholders to adjust to your data)
aggregator = mpd.TrajectoryCollectionAggregator(
    trips, max_distance=1000, min_distance=100, min_stop_duration=timedelta(minutes=5))

pts = aggregator.get_significant_points_gdf()
clusters = aggregator.get_clusters_gdf()
(pts.hvplot(geo=True, tiles="OSM", frame_width=800) *
 clusters.hvplot(geo=True, color="red"))

flows = aggregator.get_flows_gdf()
(flows.hvplot(geo=True, hover_cols=["weight"], line_width=dim("weight") * 0.01,
              alpha=0.5, color="#1f77b3", tiles="OSM", frame_width=500) *
 clusters.hvplot(geo=True, color="red", size=5))

Analyzing the Aesthetic Perception of Destination Pictures

by: Roman Egger

In my research seminar “eTourism Research” at our study program “Innovation and Management in Tourism” at the Salzburg University of Applied Sciences I always manage to write a publishable paper with a group. Together with my students Diana Hauser, Antonia Leopold, Hasini Ganewita, and Leonie Herrgesell we published the paper “Aesthetic perception analysis of destination pictures using #beautifuldestinations on Instagram” in the “Journal of Destination Marketing & Management”.

In this article, we focus on the perception of aesthetics of destination pictures. We used machine learning to cluster photos from Instagram with the hashtag #beautifuldestinations according to their themes (mountain, architecture, coast, city, nature, bird’s-eye view, etc.) and then analyzed them with regard to the relevance of design elements such as color, light, line, angle of view, and focus.

The results clearly show that different design elements are relevant to the perception of aesthetics for different topics.

The paper is available via open access – download and read the full paper here

TourBERT: A pretrained language model for the tourism industry.

by Veronika Arefieva and Roman Egger


The Bidirectional Encoder Representations from Transformers (BERT) is currently the most important and state-of-the-art natural language model (Tenney et al., 2019) since its launch in 2018 by Google. BERT Large, which is based on a Transformer architecture, is considered one of the most powerful language models with 24 layers, 16 attention heads, and 340 million parameters (Lan et al., 2019). BERT is a pretrained model and can be fine-tuned to perform numerous downstream tasks such as text classification, question answering, sentiment analysis, extractive summarization, named entity recognition, or sentence similarity (Egger, 2022). The model was pretrained on a huge English corpus in a self-supervised way. Raw texts from a BookCorpus of over 11,000 books and English Wikipedia were used to generate this state-of-the-art model. From this, it can already be concluded that BERT was trained on a huge generic corpus (Edwards et al., 2020). However, it has been shown in the past that for domain-specific applications and downstream tasks, it is helpful to pretrain BERT on a large domain-specific corpus to enable it to learn the linguistic peculiarities better (Gururangan et al., 2020). For example, BERT variants have been pretrained for the financial sector (FinBERT) (Araci, 2019), the medical sector (Clinical BERT) (Alsentzer et al., 2019), for biomedical texts (BioBERT) (Lee et al., 2020), or SciBERT (Beltagy and Cohan, 2019) for biomedical and computer science.

Tourism is one of the most important economic sectors in the world (Hollenhorst et al., 2014), and its services have many characteristics that distinguish them from other products. Services are not tangible and cannot be tested in advance, which is why the customer assumes an increased risk before starting the trip. The service is co-created together with the customer, so the customer is an active co-creator of the service. Services are subject to the uno-actu principle, which means they are produced at the same time as they are consumed, and they are considered bilateral, i.e. a reciprocal relationship between persons (Chehimi, 2014). In addition, tourism services are relatively expensive compared to everyday products and have an intercultural dimension. All this means that tourism services are extremely description-intensive (Doolin et al., 2002). In addition to detailed descriptions by the supply side, user-generated content is becoming increasingly important (Yu and Egger, 2021). Whether on review platforms such as TripAdvisor or social media channels such as Twitter, Facebook or Instagram, people everywhere are sharing their travel experiences, thus influencing other users (Daxböck et al., 2021). This content is of particular importance for tourism providers, as they are losing control over UGC (Saraiva, 2013).

The automated analysis of texts using natural language processing methods is therefore becoming increasingly important for both academia and the tourism industry (Egger and Gokce, 2022).

In order to meet the requirements of tourism, we introduce TourBERT in this paper. It was pretrained on 3.6 million tourist reviews and about 50k descriptions of tourist services, attractions, and sights from more than 20 countries around the world. The intercultural context, in particular, leads to linguistic peculiarities that BERT, as a general language model, cannot cope with. In the following, we introduce TourBERT and describe how it was trained and evaluated.

We therefore pretrained TourBERT from scratch for 1 million steps, using the BERT-Base architecture, a WordPiece tokenizer, and our crawled, tourism-specific vocabulary with the same vocabulary size as BERT-Base. For the pretraining procedure, we followed the recommendations of the official BERT repository.

Technical description

TourBERT uses BERT-Base-uncased as its underlying architecture and was trained from scratch, so no initial checkpoints were used, as was done for BioBERT or FinBERT. The whole corpus was preprocessed by lowercasing the data and splitting it into sentences using punctuation as separators. A custom WordPiece tokenizer was trained to create custom inputs for the TourBERT model, using 30,522 tokens in total, the same number as BERT-Base.
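Such a tokenizer can be trained with the Hugging Face `tokenizers` library. The sketch below uses a two-sentence toy corpus and an illustrative file name in place of the real pretraining corpus; it is an assumption about the setup, not the authors' exact code:

```python
from tokenizers import BertWordPieceTokenizer

# A toy corpus file standing in for the lowercased, sentence-split
# tourism corpus (file name and content are illustrative).
with open("corpus.txt", "w") as f:
    f.write("the hotel room was clean and the staff were friendly\n")
    f.write("great view of the beach from the hotel balcony\n")

# Train a WordPiece tokenizer with the same vocabulary size as BERT-Base.
tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(
    files=["corpus.txt"],
    vocab_size=30522,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.save_model(".")  # writes vocab.txt for the BERT pretraining scripts

print(tokenizer.encode("clean hotel room").tokens)
```

The resulting `vocab.txt` is what the official BERT pretraining scripts consume in place of the stock BERT-Base vocabulary.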

The pretraining was performed for 1M steps on a single TPU instance provided by Google Colab Pro, which took about three days in total.

TourBERT Model evaluation

In order to evaluate TourBERT, both quantitative and qualitative measurements were applied. A supervised task, namely sentiment classification, was used for the quantitative evaluation.

Evaluation Task 1: Sentiment Classification

To perform classification with BERT, a number of options are available. First, a softmax layer on top of the BERT architecture could be used, which is one of the most widely used approaches. Second, an LSTM neural network can be used as a separate classification model; this approach is useful if an input example cannot be represented as a single vector, i.e. if the length of an input text significantly exceeds the maximum input length allowed by the BERT model. In our case, however, we decided to use a single feed-forward layer on top of the BERT architecture, as this is a simple way to construct a classifier and is widely used to benchmark different BERT models against each other. After the feed-forward layer is attached on top of the BERT architecture, all layers of the BERT model itself are frozen, i.e. only the classifier is trained.
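A minimal sketch of this frozen-encoder setup using PyTorch and the Hugging Face `transformers` library. A tiny randomly initialized BERT stands in for TourBERT or BERT-Base here; this illustrates the idea and is not the authors' exact code:

```python
import torch
import torch.nn as nn
from transformers import BertConfig, BertModel

class FrozenBertClassifier(nn.Module):
    """A BERT encoder with frozen weights and a single trainable
    feed-forward layer on top, as used in the evaluation above."""
    def __init__(self, bert: BertModel, num_labels: int):
        super().__init__()
        self.bert = bert
        for p in self.bert.parameters():
            p.requires_grad = False              # freeze all BERT layers
        self.head = nn.Linear(bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return self.head(out.last_hidden_state[:, 0])  # classify on [CLS]

# Tiny random BERT for illustration; in practice one would load the real
# weights with BertModel.from_pretrained(...).
bert = BertModel(BertConfig(vocab_size=100, hidden_size=32,
                            num_hidden_layers=2, num_attention_heads=2,
                            intermediate_size=64))
model = FrozenBertClassifier(bert, num_labels=3)

# A batch of 4 token sequences of length 16 -> one logit vector per review.
logits = model(torch.randint(0, 100, (4, 16)),
               torch.ones(4, 16, dtype=torch.long))
print(logits.shape)
```

Because only the head's parameters have `requires_grad=True`, an optimizer built from `model.parameters()` updates nothing but the classifier, exactly matching the frozen-BERT benchmarking setup described above.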

The authors are aware that this does not usually yield state-of-the-art results, but the goal of this evaluation was not to achieve the highest score on a given dataset but to show that the quality of TourBERT embeddings surpasses the BERT-Base model.

The sentiment task was performed on two different datasets. The first is a Tripadvisor hotel review dataset from Ray et al. (2021), which is available at:

This dataset has three labels: {-1: “negative”, 0: ”neutral”, 1: “positive”} and a total of 69,308 reviews.

The second dataset is the “515k reviews from Europe hotels” dataset available at; we used only reviews with either a negative or a positive label, thus turning the problem into a binary classification with the following two labels:

{-1: “negative”, 1: “positive”}. We sampled 35,000 positive and 35,000 negative reviews, resulting in 70,000 samples in total. Table 1 shows the evaluation results for both datasets; TourBERT outperforms BERT-Base on both tasks.
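The balanced sampling step can be sketched with pandas. The toy frame and column names below are illustrative stand-ins for the real 515k-review dataset:

```python
import pandas as pd

# Hypothetical frame standing in for the "515k reviews" dataset:
# one text column and a label in {-1, 1}.
df = pd.DataFrame({
    "review": [f"review {i}" for i in range(10)],
    "label":  [-1, 1] * 5,
})

# Balance the classes by sampling the same number of positive and negative
# reviews (35,000 each in the study; 3 each in this toy example).
n = 3
balanced = pd.concat([
    df[df.label == -1].sample(n, random_state=42),
    df[df.label == 1].sample(n, random_state=42),
]).sample(frac=1, random_state=42)  # shuffle the combined sample

print(balanced.label.value_counts().to_dict())
```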

Table 1. Results of supervised evaluation for TourBERT and BERT-Base on two sentiment classification datasets.

Figure 1. The Area under ROC-Curve (AUC) scores for TourBERT and BERT-Base

For the qualitative evaluation, unsupervised tasks and a user study were performed. The unsupervised tasks comprised a topic modeling task, a synonym search, and a within-vocabulary word-similarity distribution task.

Evaluation Task 2: Unsupervised Evaluation / Visualization of Photos

The first unsupervised evaluation was the visualization of photos in the TensorBoard Projector. For this, a dataset of 48 photos showing different tourism activities, such as sports, visiting sights, and shopping, was used. Next, a sample of 622 people was engaged to manually label these photos by assigning two bi-gram tags to each photo. These annotations were then visualized using the TensorBoard Projector API, which allows the original photos to be displayed on a 2D or 3D plot, centered at their respective cluster centers. Finally, the evaluation was done after performing UMAP, by inspecting and comparing the quality of the group separation on the plot.

Figure 2. BERT-Base / TensorBoard Projector

Figure 3. TourBERT / TensorBoard Projector

The purpose of such a visualization is to evaluate the separation of the clusters that naturally result from the down-projection method. We can see that TourBERT vectors result in better-separated groups, and pictures within the same group have similar content. Looking at the results produced with BERT-Base vectors, we observe that the photos are heavily mixed and do not allow well-separated groups to be identified.

Evaluation Task 3: Unsupervised Evaluation / Topic Modeling

A subsequent unsupervised evaluation was done by applying a topic modeling approach. For this, 5,000 Instagram posts from public accounts with the hashtag #wanderlust were crawled using the Python Scrapy library. Instagram is based on photos, which are the primary source of information, while the textual description of a post is often limited to hashtags and emojis, unrelated to the photo, or missing entirely. The images were therefore annotated using the Google Cloud Vision API, and a TourBERT vector was generated for each photo annotation. A K-means clustering approach was then used to group the annotation vectors; K was chosen using the silhouette score, which resulted in 25 clusters. A PCA down-projection was then used to visualize the points on a plot, which allowed us to determine how distinct the topic words are and whether they overlap. This approach did not aim to develop a new topic modeling method, as embedding-based approaches with a similar logic already exist, such as Top2Vec (Angelov, 2020) and BERTopic (Grootendorst, 2020); the aim was only to compare the performance of BERT-Base and TourBERT.

For the evaluation, an interactive visualization board was developed using Python’s Sklearn and Altair libraries, similar to pyLDAvis for LDA topic modeling, containing two blocks. The block on the left visualizes the cluster centers on a 2D plot, where the size of a cluster center is proportional to the cluster’s population size. Each point is clickable and defines the output of the second block on the right: a horizontal bar chart visualizing the 15 most frequent words for a topic. The darker bars represent the word distribution within the entire dataset; the lighter bars show the word distribution within the selected topic. The evaluation can then address two aspects: the goodness of the topic separation (i.e. cluster centers do not overlap and are placed far away from each other) and how similar the words within a cluster are.
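The clustering pipeline (silhouette-based choice of K, K-means, PCA down-projection of the centers) can be sketched with scikit-learn; the sizes and random data below are illustrative stand-ins for the real embeddings:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# Random vectors standing in for the TourBERT embeddings of the annotated
# Instagram posts (dimensions reduced for illustration).
rng = np.random.default_rng(0)
vectors = rng.normal(size=(200, 64))

# Choose K via the silhouette score (the study arrived at K = 25).
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10,
                    random_state=0).fit_predict(vectors)
    scores[k] = silhouette_score(vectors, labels)
best_k = max(scores, key=scores.get)

# Fit the final model and down-project the cluster centers with PCA
# for the 2D plot of topic separation.
km = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit(vectors)
centers_2d = PCA(n_components=2).fit_transform(km.cluster_centers_)
print(best_k, centers_2d.shape)
```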

Figure 4. Topic Modeling – Results for BERT-Base

Figure 5. Topic Modeling – Results for TourBERT

Figures 4 and 5 show that the cluster centers produced with down-projected TourBERT vectors are clearly better separated than those produced with BERT-Base vectors. Some of the topic words are shown in tables 2 and 3 for BERT-Base and TourBERT, respectively:

Table 2. Topic words for 25 topics produced with BERT-Base vectors.

Table 3. Topic words for 25 topics produced with TourBERT vectors.

To better understand the quality of the topics, we output the ten nearest samples for each cluster and look at the photos for which the annotations were produced, as shown in figures 6 and 7 below. Each figure contains a table whose first column shows the words for a given topic and whose subsequent columns show the ten most similar samples, i.e. photos, for that topic.

Figure 6. The first six topics with cluster words and top-10 most similar images, produced by the K-Means model using TourBERT vectors.

Figure 7. The first six topics with cluster words and top-10 most similar images, produced by the K-Means model using BERT-Base vectors.

Comparing the results of both models, we can see that the clusters obtained with TourBERT vectors are much more homogeneous within clusters and heterogeneous between clusters than those obtained with BERT-Base, which sometimes places relatively dissimilar photos in the same topic, as in topic 3.

To further investigate the quality of each topic produced by the model and prove our assumptions, we conducted a user study to statistically evaluate the results, which is described in detail in the next section.

Evaluation Task 4: Unsupervised Evaluation / User Study

For the same dataset of images and annotations, a user study was designed. A set of the ten most similar photos for each of the 25 clusters of BERT-Base and of TourBERT was created, and users were asked to rate how similar the photos within each of the 50 clusters are on a 7-point Likert scale ranging from “very similar” to “very different”. This evaluation approach captures an intersubjective perception of cluster quality, similar to measuring intercoder reliability in qualitative studies. The image clusters were shown to the participants in randomly alternating order.

Figure 8. Two examples of image clusters

For the evaluation of the study results, a paired-samples t-test was performed with SPSS. The coding ranged from [1 – very similar] to [7 – very different], and the mean values were 3.75 for BERT-Base and 2.50 for TourBERT, a highly significant difference (p < .001, two-sided). The effect size, measured with Cohen’s d, was 0.517, a medium-sized effect.
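An equivalent analysis can be sketched in Python with SciPy; the ratings below are simulated around the reported means, not the study’s actual data:

```python
import numpy as np
from scipy import stats

# Simulated paired ratings on the 7-point scale (1 = very similar,
# 7 = very different); the means mirror the reported 3.75 vs. 2.50.
rng = np.random.default_rng(1)
bert_base = rng.normal(3.75, 1.0, size=100)
tourbert = rng.normal(2.50, 1.0, size=100)

# Paired-samples t-test, the Python analogue of the SPSS procedure.
t_stat, p_value = stats.ttest_rel(bert_base, tourbert)

# Cohen's d for paired samples: mean difference over the SD of differences.
diff = bert_base - tourbert
cohens_d = diff.mean() / diff.std(ddof=1)
print(f"t={t_stat:.2f}, p={p_value:.4f}, d={cohens_d:.2f}")
```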

Table 4. Results of the paired samples mean comparison

These results show that the similarity of the annotated images was perceived as significantly higher with TourBERT than with BERT-Base.

Evaluation Task 5: Synonyms Search

Based on the assumption that BERT-Base, with its generic training corpus, produces more generic results than TourBERT, which was trained on a tourism-specific corpus, we expected a similarity search for tourism-related terms to turn out better with TourBERT than with BERT-Base. For the similarity search, we chose the words displayed in the first row of each table: “authenticity”, “experience”, “entrance”, and so on. We output the eight most similar words for each, as shown in the figures below.
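Such a nearest-neighbor search over word vectors can be sketched as follows; the vocabulary and vectors are toy stand-ins, not actual BERT or TourBERT embeddings:

```python
import numpy as np

# Toy static word vectors standing in for embeddings extracted from a
# model's vocabulary (words and vectors are illustrative only).
vocab = ["destination", "spot", "attraction", "place", "banana", "keyboard"]
rng = np.random.default_rng(0)
vectors = rng.normal(size=(len(vocab), 16))
vectors[1] = vectors[0] + 0.1 * rng.normal(size=16)  # make "spot" close
vectors[2] = vectors[0] + 0.1 * rng.normal(size=16)  # make "attraction" close

def most_similar(word, k=3):
    """Rank vocabulary words by cosine similarity to the query word."""
    q = vectors[vocab.index(word)]
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)                 # descending similarity
    return [vocab[i] for i in order if vocab[i] != word][:k]

print(most_similar("destination"))
```

With real embeddings, the same cosine-similarity ranking produces the tourism-specific neighbors ("spot", "attraction", "place") reported for TourBERT.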

Figure 9. Synonyms Search with BERT-Base

Figure 10. Synonyms Search with TourBERT

When comparing the synonyms produced by BERT-Base and TourBERT, one can see that TourBERT almost perfectly captures the tourism-specific meaning of each word, whereas BERT-Base captures a more generic meaning. For example, TourBERT associates the word “destination” with words like “spot”, “attraction”, and “place”, whereas BERT-Base considers the same word to be similar to terms like “dying”, “choice”, and “lame”.

Summary and model usage

All evaluation tasks have demonstrated the performance of TourBERT in a tourism-specific context. TourBERT outperforms BERT-Base in all tasks and thus represents a suitable language model for both academia and the tourism industry. The TensorFlow checkpoint of TourBERT has been converted to the PyTorch binary format and is released on the Hugging Face Hub at the following URL: One can simply load the TourBERT model and tokenizer using the following three lines of code:
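A sketch of those three lines using the Hugging Face `transformers` library; the repository identifier below is an assumption and should be taken from the TourBERT listing on the Hub:

```python
from transformers import AutoModel, AutoTokenizer

# The repository identifier is illustrative; use the one shown on the
# TourBERT page on the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("veroman/TourBERT")
model = AutoModel.from_pretrained("veroman/TourBERT")
```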

TourBERT Release V1: 16.01.2022


Alsentzer, E., Murphy, J. R., Boag, W., Weng, W. H., Jin, D., Naumann, T., & McDermott, M. (2019). Publicly available clinical BERT embeddings. arXiv preprint arXiv:1904.03323.

Angelov, D. (2020). Top2Vec: Distributed representations of topics. arXiv preprint arXiv:2008.09470.

Araci, D. (2019). FinBERT: Financial sentiment analysis with pre-trained language models. arXiv preprint arXiv:1908.10063.

Beltagy, I., Lo, K., & Cohan, A. (2019). SciBERT: A pretrained language model for scientific text. arXiv preprint arXiv:1903.10676.

Chehimi, N. (2014). Tourist Information Search. In The Social Web in the Hotel Industry (pp. 49-70). Springer Gabler, Wiesbaden.

Daxböck, J., Dulbecco, M. L., Kursite, S., Nilsen, T. K., Rus, A. D., Yu, J., & Egger, R. (2021). The Implicit and Explicit Motivations of Tourist Behaviour in Sharing Travel Photographs on Instagram: A Path and Cluster Analysis. In Information and Communication Technologies in Tourism 2021 (pp. 244-255). Springer, Cham.

Doolin, B., Burgess, L., & Cooper, J. (2002). Evaluating the use of the Web for tourism marketing: a case study from New Zealand. Tourism Management, 23(5), 557-561.

Edwards, A., Camacho-Collados, J., De Ribaupierre, H., & Preece, A. (2020, December). Go simple and pre-train on domain-specific corpora: On the role of training data for text classification. In Proceedings of the 28th International Conference on Computational Linguistics (pp. 5522-5529).

Egger, R. (2022). Text Representations and Word Embeddings: Vectorizing Textual Data. In: Egger, R. (Ed.) Applied Data Science in Tourism: Interdisciplinary Approaches, Methodologies and Applications. Springer (forthcoming).

Egger, R., & Gokce, E. (2022). Natural Language Processing: An Introduction. In: Egger, R. (Ed.) Applied Data Science in Tourism: Interdisciplinary Approaches, Methodologies and Applications. Springer (forthcoming).

Grootendorst, M. (2020). BERTopic: Leveraging BERT and c-TF-IDF to create easily interpretable topics. Available at:

Gururangan, S., Marasović, A., Swayamdipta, S., Lo, K., Beltagy, I., Downey, D., & Smith, N. A. (2020). Don’t stop pretraining: adapt language models to domains and tasks. arXiv preprint arXiv:2004.10964.

Hollenhorst, S. J., Houge-Mackenzie, S., & Ostergren, D. M. (2014). The trouble with tourism. Tourism Recreation Research, 39(3), 305-319.

Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2019). ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.

Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H., & Kang, J. (2020). BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4), 1234-1240.

Saraiva, J. P. D. P. M. (2013). Web 2.0 in restaurants: insights regarding TripAdvisor’s use in Lisbon (Doctoral dissertation).

Tenney, I., Das, D., & Pavlick, E. (2019). BERT rediscovers the classical NLP pipeline. arXiv preprint arXiv:1905.05950.

Yu, J., & Egger, R. (2021). Tourist Experiences at Overcrowded Attractions: A Text Analytics Approach. In Information and Communication Technologies in Tourism 2021 (pp. 231-243). Springer, Cham.