Awesome Linguistics Resources for Spanish
A curated list of linguistic resources for Spanish NLP & CL.
Clustering
Speech
Part of Speech Taggers (POS Taggers)
Named Entity Recognition (NER)
Corpora
Shared tasks
Corpora
-
Multilingual Aligned Annotated Corpus (CRATER)
-
UAM Treebank - 1,500 syntactically annotated sentences extracted from
newspapers (El País Digital and Compra Maestra)
-
POS-tagged/syntactic dependencies - European Corpus Initiative
Multilingual Corpus I
-
The Corpus of Contemporary Spanish (POS tags, lemmas)
-
Lemmas Dictionary
-
esTenTen Spanish (POS-tagged)
-
Europarl Corpus (Parallel Corpus English-Spanish)
-
Colombian Political Speeches
-
South American Slang Expressions/MTWE
-
Syntactic and Semantic Annotations (subset of the AnCora Corpus)
-
Plurilingual Specific Corpus on Economics, Medicine, Computer
Science
-
Copenhagen Treebank (Dependency Parsing)
-
Reuters Corpora RCV2 - News Corpora
-
MolinoLabs Corpus - News Corpora from Spain, Argentina and Mexico
-
PANACEA - Legislation Corpus
-
PANACEA - Legislation Ngram Corpus
-
PANACEA - Dependency Parsed Corpus
-
PANACEA - Monolingual Lexica (MWE, Frames, Semantic Classes)
-
Opinion Mining - User reviews on cars, hotels, washing machines,
books, cell phones, music, etc.
-
Cross Lingual Textual Entailment (CLTE) Corpus (English-Spanish)
-
Ngram Frequencies from Colombian News Corpora
-
Sagan Textual Entailment Test Suite
-
Garcia, Marcos and Pablo Gamallo, 2013 - Portuguese and Spanish
biographical relation extraction corpora (Garcia, Marcos and Pablo
Gamallo, 2013. Exploring the Effectiveness of Linguistic Knowledge for
Biographical Relation Extraction. Natural Language Engineering,
CJO2013. doi:10.1017/S1351324913000314.)
-
Garcia, Marcos and Pablo Gamallo, 2014 - Portuguese, Spanish and
Galician coreference corpora (Garcia, Marcos and Pablo Gamallo, 2014.
Multilingual corpora with coreferential annotation of person entities.
In Proceedings of the 9th edition of the Language Resources and
Evaluation Conference (LREC 2014), Reykjavik: 3229-3233.)
-
COW (Corpora from the Web) - Ngram/Annotated People’s Names Corpora
-
Wikicorpus - Portion of the 2006 Wikipedia annotated with WordNet
synsets and POS tags
-
Spanish Billion Words Corpus with word2vec Embeddings
Misc
Contribute
Contributions welcome! Read the
contribution guidelines first.
License
To the extent possible under law,
David Przybilla has waived all
copyright and related or neighboring rights to this work.