WebCelex Help - Dutch uit den Boogaart

Dutch uit den Boogaart contains the following items:

| Item | Description | Type |
|---|---|---|
| Code1 | grammatical code for the word or first part of the word | numeric |
| Code2 | grammatical code for the second part of the word | numeric |
| FreqS | spoken frequency for the word | numeric |
| FreqW | written frequency for the word | numeric |
| Word1 | word or first part of multiword | string |
| Word2 | second part of a multiword if applicable | string |

Uit den Boogaart Dutch written and spoken frequencies


In the early 1970s, a survey of contemporary, general-domain Dutch written and spoken word frequencies was conducted by a team of researchers cooperating in the Working Group Dutch Frequency Research. The idea was to emulate English-language efforts similar in method and scope, such as the Brown corpus. The results were published in a book called "Woordfrequenties: in Geschreven en Gesproken Nederlands", by P.C. uit den Boogaart (Utrecht: Oosthoek, Scheltema & Holkema, 1975). However, the statistical treatments of the data presented in the book were all generated from a computer file, so it is useful to return to this file when flexible data manipulation is needed.

At the then Technical College in Eindhoven (now: University) the original tape, encompassing 727,302 tokens (605,733 written, 121,569 spoken), was preserved. From Eindhoven, a great number of copies were distributed, usually going under the name of the 'Eindhoven Corpus'. Some versions of the corpus incorporate additional material, such as Jan Renkema's survey of government Dutch ('De Taal van Den Haag'), while more substantial extensions were undertaken within the framework of the Esprit project, under the auspices of the European Union. This 'Esprit Corpus' also contains articles from the Dutch regional newspaper 'De Gelderlander', fragments from novels, etc. The Esprit version amounts to roughly 1.5 million tokens.

The Uit den Boogaart file is offered for historical reasons, but also because no spoken frequencies for Dutch are represented in CELEX as yet. In this respect, the file will probably soon be superseded by the forthcoming results of the Dutch Spoken Corpus Project (1998-2003). Note that there is *no correspondence* between the Uit den Boogaart list and the other CELEX lists in terms of vocabulary, frequency measures or grammatical categories, as these were gathered and encoded on the basis of different corpora and distinct criteria.

The CELEX version

The CELEX version of the Eindhoven word list contains only the original Uit den Boogaart corpora and corresponds to the A1 and A2 lists of the printed version. Hapax legomena (words appearing only once in the full corpus) from the A2 list have been merged with the more frequent A1 items.

Two types of tokens have been removed from the list, however:

1. All proper nouns (codes 010 and 012).
2. All 'rest'-items: foreign language citations, non-lexical speech sounds, punctuation marks and meta-data (codes 990, 991, 998 and 999).
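The two removal rules above amount to a filter on the grammatical code. The following is a minimal sketch, assuming entries are available as (word, code) pairs with the code held as a three-character string; the sample entries and the function name `keep_entry` are hypothetical and only illustrate the rule.

```python
# Codes that the text says were removed from the CELEX version.
PROPER_NOUN_CODES = {"010", "012"}
REST_CODES = {"990", "991", "998", "999"}
REMOVED_CODES = PROPER_NOUN_CODES | REST_CODES


def keep_entry(code: str) -> bool:
    """Return True if an entry with this code stays in the CELEX version."""
    return code not in REMOVED_CODES


# Hypothetical sample entries, not taken from the actual file.
entries = [("organisatie", "000"), ("amsterdam", "010"), ("xxx", "999")]
kept = [(word, code) for word, code in entries if keep_entry(code)]
print(kept)
```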

Please take note, too, that there are two character representation differences between the CELEX version and the Uit den Boogaart list:

1. All capital letters have been reduced to lowercase (Uit den Boogaart retains capital letters that were used to indicate the start of sentences in running text).
2. All divided words have been joined, while Uit den Boogaart retains hyphenated transcriptions of words broken at the end of lines.

Frequencies for these variant spellings have been merged with the default forms.

So, where Uit den Boogaart has:

organisatie 000 49 2
Organisatie 000 3
or- ganisatie 000 3
orga- nisatie 000 2
organi- satie 000 3

the CELEX version lists a single entry with cumulative frequency:

organisatie 000 60 2

This adaptation reduces the number of tokens in the CELEX version from 727,302 to 676,969 (comprising 57,539 unique type-grammatical code pairs).
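The merging of variant spellings shown in the 'organisatie' example can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used to build the CELEX version: rows are assumed to be (word, code, first frequency, second frequency) tuples in the column order of the example, a missing second frequency is assumed to mean 0, and a divided word is assumed to be written with a hyphen followed by a space. The names `normalise` and `merge_variants` are hypothetical.

```python
from collections import defaultdict


def normalise(word: str) -> str:
    """Lowercase a word and rejoin a divided form such as 'or- ganisatie'."""
    return word.replace("- ", "").lower()


def merge_variants(rows):
    """Sum both frequency columns over all spellings of the same word and code."""
    totals = defaultdict(lambda: [0, 0])
    for word, code, freq1, freq2 in rows:
        key = (normalise(word), code)
        totals[key][0] += freq1
        totals[key][1] += freq2
    return {key: tuple(freqs) for key, freqs in totals.items()}


# The five Uit den Boogaart rows from the example; a missing second
# frequency is entered as 0 (an assumption).
rows = [
    ("organisatie", "000", 49, 2),
    ("Organisatie", "000", 3, 0),
    ("or- ganisatie", "000", 3, 0),
    ("orga- nisatie", "000", 2, 0),
    ("organi- satie", "000", 3, 0),
]
merged = merge_variants(rows)
print(merged)  # {('organisatie', '000'): (60, 2)}
```

Keying on the pair (normalised word, code) keeps homographs with different grammatical codes apart, which matches the count of unique type-grammatical code pairs given above.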

Minor adjustments

- One item 'hij' with obviously incorrect code 380 was subsumed under code 300.
- One item "d'r ... bij" with code 545 was subsumed under 540 ... 610.
- One item 'er' with code 516 was subsumed under 510.
- One item 'hoor' with code 587 was subsumed under 287.
- One item 'keizersgracht', being a proper noun, but coded as an adverb, was removed.

Explanation of the three-digit grammatical codes

First of all, note that the Uit den Boogaart file contains 108 items where a grammatical code was not assigned at all or could only be partly assigned, mainly due to transcription problems in the spoken domain. In these cases, one or more dots replace the corresponding digits of the code.


Grammatical code table

| First digit | Meaning | Second digit | Meaning | Third digit | Meaning |
|---|---|---|---|---|---|
| 0 | Noun | 0 | Common noun ('aardappel') | 0 | Base form ('aardappel') |
| | | 2 | Noun used as adjective ('plastic') | 1 | Plural ('aardappels') |
| | | 8 | Noun used as interjection ('hemel') | 2 | Genitive ('aanschijns') |
| | | 9 | Word used in noun-like citation or apposition ('het woord "jouw" gebruik je...') | 3 | Other inflection ('bate') |
| 1 | Adjective | 0 | Ordinary adjective ('boos') | 0 | Base form ('boos') |
| | | 2 | Adjective used as noun ('het groen') | 1 | Plural ('bejaarden') |
| | | 5 | Adjective used as adverb ('aanstekelijk') | 2 | Genitive ('belachelijks') |
| | | 8 | Adjective used as interjection ('juist') | 3 | Other inflection ('boze', 'voorbedachten') |
| | | | | 4 | Uninflected comparative ('bozer') |
| | | | | 5 | Genitive comparative ('betere', 'ouderen') |
| | | | | 6 | Other comparative inflection ('betere', 'ouderen') |
| | | | | 7 | Uninflected superlative ('best') |
| | | | | 8 | Other superlative inflection ('beste', 'besten') |
| 2 | Verb | 0 | Infinitive/participle, intransitive verb | 0 | Ordinary infinitive ('krijgen') |
| | | 1 | Infinitive/participle, transitive verb | 1 | Infinitive used as noun ('het krijgen') |
| | | 2 | Infinitive/participle, reflexive verb | 2 | Uninflected present participle ('durend') |
| | | 3 | Infinitive/participle, auxiliary or copula | 3 | Inflected present participle ('kwetsende') |
| | | | | 4 | Plural inflected present participle ('levenden') |
| | | | | 5 | Present participle, used as adverb ('doortastend') |
| | | | | 6 | Uninflected past participle ('gedroogd') |
| | | | | 7 | Inflected past participle ('gedroogde') |
| | | | | 8 | Plural inflected past participle ('afgevaardigden') |
| | | | | 9 | Past participle, used as adverb ('gejaagd') |
| 2 | Verb | 4 | Finite verb form, intransitive verb | 1 | First person singular present ('doe') |
| | | 5 | Finite verb form, transitive verb | 2 | Second person singular present ('hebt') |
| | | 6 | Finite verb form, reflexive verb | 3 | Third person singular present ('is') |
| | | 7 | Finite verb form, auxiliary or copula | 4 | All persons plural present ('gaan') |
| | | 8 | Finite verb form, used as interjection | 5 | All persons singular past ('antwoordde') |
| | | | | 6 | All persons plural past ('antwoordden') |
| | | | | 7 | Imperative without subject ('ga') |
| | | | | 8 | Imperative with subject ('gaat u') |
| | | | | 9 | Subjunctive ('moge') |
| 3 | Pronoun A | 0 | Personal pronoun ('zij') | 0 | Base form |
| | | 2 | Possessive pronoun, pronominal use ('hare') | 1 | Plural ('degenen') |
| | | 3 | Possessive pronoun, determinative use ('haar') | 2 | Genitive ('diens') |
| | | 4 | Reflexive pronoun, pronominal use ('zichzelf') | 3 | Other inflection ('onze', 'dezen') |
| | | 5 | Reflexive pronoun, determinative use ('hetzelfde', 'eigen') | | |
| | | 6 | Demonstrative pronoun, pronominal use ('deze', 'degene') | | |
| | | 7 | Demonstrative pronoun, determinative use ('deze', 'zodanige') | | |
| 4 | Pronoun B | 0 | Interrogative pronoun, pronominal use ('wie') | 0 | Base form ('ieder') |
| | | 1 | Interrogative pronoun, determinative use ('welk') | 1 | Plural ('meesten') |
| | | 2 | Relative pronoun, pronominal use ('die') | 2 | Genitive ('aller', 'niemands') |
| | | 3 | Relative pronoun, determinative use ('veel' (mensen)) | 3 | Other inflection ('iedere') |
| | | 4 | Indefinite pronoun, pronominal use ('welk') | 4 | Uninflected comparative ('meer') |
| | | 5 | Indefinite pronoun, determinative use ('veel' (mensen)) | 6 | |
| | | 6 | Cardinal numeral, pronominal use ('drie' (waren aanwezig)) | 7 | Uninflected superlative ('meest') |
| | | 7 | Cardinal numeral, determinative use ('drie' (boeken)) | 9 | Other superlative inflection ('meeste', 'meesten') |
| | | 8 | Ordinal numeral, pronominal use ((de) 'derde' (was er al)) | | |
| | | 9 | Ordinal numeral, determinative use ((de) 'derde' (man)) | | |
| 5 | Adverb | 0 | Ordinary adverb ('alleen') | 0 | Base form ('achteraf') |
| | | 1 | Demonstrative or indefinite adverb ('daar') | 3 | Other inflection ('hele') |
| | | 2 | Interrogative adverb ('hoe') | 4 | Uninflected comparative ('vaker') |
| | | 3 | Relative adverb ('waar') | 7 | Uninflected superlative ('vaakst') |
| | | 4 | Demonstrative or indefinite pronominal adverb ("voornaamwoordelijk bijwoord") ('daarmee', 'er' (...vandoor)) | 9 | Other superlative inflection ('zeerste') |
| | | 5 | Interrogative pronominal adverb ('waarmee') | | |
| | | 6 | Relative pronominal adverb ('waaraan') | | |
| | | 8 | Adverb used as interjection ('nou') | | |
| 6 | Preposition and postposition | 0 | Preposition ('boven') | 0 | Base form ('aan') |
| | | 1 | Second part of split pronominal adverb ((er...) 'mee') | 3 | Other inflection ('ter') |
| | | 2 | Non-verbal part of separable compound verb ('toe' (kijken)) | | |
| | | 3 | Second part of discontinuous preposition ((door...) 'heen') | | |
| | | 4 | Postposition ((het kanaal) 'langs') | | |
| | | 5 | (Preposition followed by) 'te'-infinitive ('te', 'door' (te stoppen)) | | |
| | | 6 | Preposition followed by subordinate clause ('zonder' (dat hij het wist)) | | |
| 7 | Conjunction | 0 | Coordinating conjunction ('maar') | 0 | Base form ('alsof') |
| | | 1 | Subordinating conjunction ('hoewel') | | |
| | | 2 | Comparative subordinator ('als', 'behalve') | | |
| | | 3 | Conjunction with verb-initial clause ('al' (ga ik nog zo vaak...)) | | |
| | | 4 | First part of correlative coordinator ('noch', 'zowel') | | |
| 8 | Interjection | 0 | Ordinary interjection ('ach', 'foei') | 0 | Base form ('boem') |
| | | 1 | Noun-like onomatopoeia ((ze doen maar) 'klik-klik') | | |
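
As an illustration of how the three-digit codes can be taken apart, the following sketch splits a code into its digits, treats a dot as "not assigned", and looks up the word class from the first digit. It assumes the code is held as a three-character string (a numeric field would first need zero-padding to three digits, which is an assumption about storage). The function names are hypothetical, and only the first-digit meanings and the finite/non-finite split of verb second digits from the table are encoded.

```python
# First-digit meanings as listed in the grammatical code table.
FIRST_DIGIT = {
    0: "Noun",
    1: "Adjective",
    2: "Verb",
    3: "Pronoun A",
    4: "Pronoun B",
    5: "Adverb",
    6: "Preposition and postposition",
    7: "Conjunction",
    8: "Interjection",
}


def split_code(code: str):
    """Split a code into three digits; a dot (unassigned) becomes None."""
    if len(code) != 3:
        raise ValueError(f"expected a three-character code, got {code!r}")
    return tuple(None if ch == "." else int(ch) for ch in code)


def word_class(code: str) -> str:
    """Return the word class named by the first digit, if it is assigned."""
    first = split_code(code)[0]
    return FIRST_DIGIT.get(first, "unassigned or unknown")


def is_finite_verb(code: str) -> bool:
    """Verbs with second digit 4-8 are finite forms according to the table."""
    first, second, _ = split_code(code)
    return first == 2 and second is not None and 4 <= second <= 8


print(word_class("300"))       # Pronoun A
print(is_finite_verb("287"))   # True
print(split_code("5.."))       # (5, None, None)
```

The codes 300 and 287 are the ones mentioned under the minor adjustments for 'hij' and 'hoor'; the partly assigned code "5.." is a made-up example of the dot notation.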