Last update: Monday, October 19, 2015. This is a list of some available lexicons and corpora for Sentiment Analysis (also called Opinion Mining). I will try to keep this list as up to date as possible. Have fun!
Lexicons
- Opinion Lexicon by Bing Liu
- MPQA Subjectivity Lexicon
- SentiWordNet
- Harvard General Inquirer
- Linguistic Inquiry and Word Count (LIWC)
- Vader Lexicon
Datasets
- MPQA Datasets
- Sentiment140
- STS-Gold
- Customer Review Dataset
- Pros and Cons Dataset
- Comparative Sentences
- Sanders Analytics Twitter Sentiment Corpus
- Spanish tweets
- SemEval 2014
- Various Datasets
- Various Datasets #2
Lexicons
- Opinion Lexicon by Bing Liu
- URL: http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#lexicon
- PAPERS: Mining and summarizing customer reviews
- NOTES: Included in the NLTK Python platform.
- SentiWordNet
- URL: https://github.com/aesuli/SentiWordNet
- NOTES: Included in the NLTK Python platform.
- Linguistic Inquiry and Word Count (LIWC)
- URL: http://www.liwc.net
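As a rough illustration of how a polarity lexicon such as the Opinion Lexicon can be used, here is a minimal Python sketch that counts positive and negative words in a text. It assumes the NLTK copy of the lexicon is reachable as `nltk.corpus.opinion_lexicon` (after `nltk.download('opinion_lexicon')`); if NLTK or the data is not installed, it falls back to two tiny hand-made word lists that are purely hypothetical. The function names `load_opinion_lexicon` and `polarity` are made up for this example.

```python
import re

# Tiny stand-in word lists (hypothetical, for illustration only).
FALLBACK_POSITIVE = frozenset({"good", "great", "love", "excellent"})
FALLBACK_NEGATIVE = frozenset({"bad", "poor", "hate", "terrible"})


def load_opinion_lexicon():
    """Return (positive, negative) word sets.

    Prefers NLTK's copy of the Opinion Lexicon; falls back to the
    stand-in lists when NLTK or its data is missing.
    """
    try:
        from nltk.corpus import opinion_lexicon
        return set(opinion_lexicon.positive()), set(opinion_lexicon.negative())
    except (ImportError, LookupError):
        return set(FALLBACK_POSITIVE), set(FALLBACK_NEGATIVE)


def polarity(text, positive, negative):
    """Count positive words minus negative words in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum((t in positive) - (t in negative) for t in tokens)


if __name__ == "__main__":
    pos, neg = load_opinion_lexicon()
    print(polarity("I love this great phone", pos, neg))
```

This kind of word counting ignores negation, intensity and context, so it is only a baseline, not a full sentiment classifier.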
Datasets
- MPQA Datasets
- URL: http://mpqa.cs.pitt.edu
- NOTES: GNU General Public License.
- Political Debate data
- Product Debate data
- Subjectivity Sense Annotations
- Sentiment140 (Tweets)
- STS-Gold (Tweets)
- URL: http://www.tweenator.com/index.php?page_id=13
- PAPERS: Evaluation datasets for twitter sentiment analysis (Saif, Fernandez, He, Alani)
- NOTES: Like Sentiment140, but smaller and labeled by human annotators. It comes with 3 files: tweets, entities (with their sentiment), and an aggregate set.
- Customer Review Dataset (Product reviews)
- URL: http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#datasets
- PAPERS: Mining and summarizing customer reviews
- NOTES: Title of review, product feature, positive/negative label with opinion strength, other info (comparisons, pronoun resolution, etc.). Included in the NLTK Python platform.
- Pros and Cons Dataset (Pros and cons sentences)
- URL: http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#datasets
- PAPERS: Mining Opinions in Comparative Sentences (Ganapathibhotla, Liu 2008)
- NOTES: A list of sentences tagged <pros> or <cons>. Included in the NLTK Python platform.
- Comparative Sentences (Reviews)
- URL: http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#datasets
- PAPERS: Identifying Comparative Sentences in Text Documents (Nitin Jindal and Bing Liu), Mining Opinion Features in Customer Reviews (Minqing Hu and Bing Liu)
- NOTES: Sentence, POS-tagged sentence, entities, comparison type (non-equal, equative, superlative, non-gradable). Included in the NLTK Python platform.
- Sanders Analytics Twitter Sentiment Corpus (Tweets)
- URL: http://www.sananalytics.com/lab/twitter-sentiment
- NOTES: “5513 hand-classified tweets wrt 4 different topics. Because of Twitter’s ToS, a small Python script is included to download all of the tweets. The sentiment classifications themselves are provided free of charge and without restrictions. They may be used for commercial products. They may be redistributed. They may be modified.”
- Spanish tweets (Tweets)
- SemEval 2014 (Tweets)
- URL: http://alt.qcri.org/semeval2014/task9
- NOTES: “You MUST NOT re-distribute the tweets, the annotations or the corpus obtained” (from the readme file)
- Various Datasets (Reviews)
- Various Datasets #2 (Reviews)
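For datasets made of tagged sentences, such as the Pros and Cons Dataset listed above, a few lines of Python are usually enough to turn the raw text into labeled pairs. The sketch below assumes each sentence is wrapped in a matching pair of tags, e.g. `<Pros>...</Pros>` or `<Cons>...</Cons>`; the exact file layout is an assumption, so check the downloaded files before relying on it. The name `parse_tagged` is made up for this example.

```python
import re

# Matches <pros>...</pros> or <cons>...</cons>, ignoring tag case.
# The closing-tag layout is an assumption about the file format.
TAGGED = re.compile(r"<(pros|cons)>(.*?)</\1>", re.IGNORECASE | re.DOTALL)


def parse_tagged(text):
    """Return a list of (label, sentence) pairs found in `text`."""
    return [(m.group(1).lower(), m.group(2).strip())
            for m in TAGGED.finditer(text)]


if __name__ == "__main__":
    sample = "<Pros>Easy to use</Pros>\n<Cons>Short battery life</Cons>"
    for label, sentence in parse_tagged(sample):
        print(label, "|", sentence)
```

The resulting pairs can be fed directly to a classifier as training examples, with `pros` and `cons` as the two class labels.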