The language resources available in this catalogue are divided into four categories: "Speech and Related Resources", "Written Resources", "Terminological Resources", and "Multimodal/Multimedia Resources".
1/ Spoken LRs
a - Telephone recordings
The databases catalogued in this section have been produced with speaker recordings made over the telephone (fixed or mobile) network, or through a microphone. You will find speech resources recorded in various environments, and covering a large number of European and non-European languages, e.g. the databases produced in the framework of the SpeechDat project.
b - Desktop/Microphone recordings
The databases catalogued in this section have been produced with speaker recordings made over a microphone, e.g. the databases produced in the framework of the BABEL project.
c - Broadcast Resources
The databases catalogued in this section have been produced with speaker recordings made over radio, television or internet, such as the Italian Broadcast News Corpus.
d - Speech Related Resources
In this section you will find pronunciation and phonetic lexicons, such as the BDLEX, PHONOLEX, and MHATLEX databases.
2/ Written LRs
a - Corpora
This section contains monolingual and multilingual corpora, parallel or not, which may also be annotated. Examples of the resources you will find in this section include the corpora developed in the framework of the MULTEXT project, the Multilingual and Parallel Corpora (MLCC), French scientific corpora, newspaper corpora in Arabic, etc.
b - Monolingual lexicons
The section dedicated to monolingual lexicons contains various types of dictionaries, e.g. a dictionary of French verbs, the Japanese word dictionary, some PAROLE lexicons in many languages, etc.
c - Multilingual lexicons
Here you can find either bilingual or multilingual dictionaries and lexicons, such as the EuroWordNet databases.
3/ Terminological LRs
Monolingual, bilingual and multilingual terminological databases are available. They cover a large number of specialised domains, e.g. automobile engineering, insurance, linguistics, finance, etc., in a wide variety of languages.
4/ Multimodal/Multimedia LRs
The resources you will find in this section have been produced using different modalities, including speech. An example of such resources is the database produced in the framework of the M2VTS project.
LATEST UPDATES:
New Resources
W0050: The CINTIL Corpus – International Corpus of Portuguese
CINTIL-Corpus Internacional do Português is a linguistically interpreted written and spoken corpus of European Portuguese. It is composed of one million annotated tokens, each of which has been verified by human expert annotators. The annotation comprises information on part-of-speech, open class lemma and inflection, multi-word expressions pertaining to the class of adverbs and to the closed POS classes, and multi-word proper names (for named entity recognition). The corpus was developed from raw textual materials of several types, of which 30% are spoken materials.
M0050: The MWN.PT - MultiWordnet of Portuguese
MWN.PT - MultiWordnet of Portuguese (version 1) covers over 17,200 manually validated concepts/synsets, linked under the semantic relations of hyponymy and hypernymy. These concepts comprise over 21,000 word senses/word forms and 16,000 lemmas from both European and American variants of Portuguese. They are aligned with the translationally equivalent concepts of the English Princeton WordNet and, transitively, of the MultiWordNets of Italian, Spanish, Hebrew, Romanian and Latin.
M0049: Basque WordNet
The Basque WordNet models nouns, verbs and adjectives. Each sense is linked to a so-called synset (for a total of 30,281 synsets). Every synset encodes the synonymy relation between one or more words (synonyms) that express the same lexical meaning and belong to one and the same part of speech (specified in the POS tag value). Each synset is related to the corresponding synset in the English WordNet 1.6 via its identification number (ID), which includes the synset number and the POS tag. The only exceptions are newly created synsets that account for cultural concepts not present in WordNet 1.6.
S0300: SIGNUM Database
The SIGNUM Database contains both isolated and continuous utterances of various signers. The corpus was recorded on video. For quick random access to individual frames, each video clip is stored as a sequence of images. The vocabulary comprises 450 basic signs in German Sign Language (DGS) representing different word types. Based on this vocabulary, a total of 780 sentences were constructed. Each sentence ranges from two to eleven signs in length. The entire corpus was performed once by 25 native signers of different sexes and ages. One of them was chosen to be the so-called reference signer. His performances were recorded three times.
M0048: LatinWordNet
LatinWordNet contains information about the following aspects of the Latin and English lexicon: lexical relations between words, semantic relations between lexical concepts, and correspondences between Latin and English lexical concepts. LatinWordNet covers nouns, verbs, adjectives and adverbs, and contains 8,978 synsets in correspondence with the English equivalents (and with all the MultiWordNet-based wordnets).
(last update: July 2009)