Commit f2f8133

initial project version
0 parents

12 files changed: 645387 additions and 0 deletions

LICENSE

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
1+
The MIT License (MIT)
2+
3+
Copyright (c) 2016 dohliam
4+
5+
Permission is hereby granted, free of charge, to any person obtaining a copy
6+
of this software and associated documentation files (the "Software"), to deal
7+
in the Software without restriction, including without limitation the rights
8+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9+
copies of the Software, and to permit persons to whom the Software is
10+
furnished to do so, subject to the following conditions:
11+
12+
The above copyright notice and this permission notice shall be included in all
13+
copies or substantial portions of the Software.
14+
15+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21+
SOFTWARE.

README.md

Lines changed: 97 additions & 0 deletions
@@ -0,0 +1,97 @@
1+
# ipa-dict - Monolingual wordlists with pronunciation information in IPA
2+
3+
This project aims to provide a series of dictionaries consisting of wordlists with accompanying phonemic pronunciation information in International Phonetic Alphabet (IPA) transcription for as many words as possible in as many languages / dialects / variants as possible.
4+
5+
The dictionary data is available in a number of human- and machine-readable [formats](#formats), in order to make it as useful as possible for various other [applications](#applications).
6+
7+
## Background
8+
9+
There is no existing central, standardized location for checking the correspondence between orthography and pronunciation in any given language.
10+
11+
Furthermore, IPA information even for large languages can be surprisingly difficult to find, and is generally not provided for each form of a word. In many languages, reference works carry pronunciation notation only for lemmas (headwords), and very little information is available on conjugations and other forms beyond the dictionary form. In highly inflected languages (e.g. French), each verb may have 40 or more conjugated forms, but pronunciation is listed only for the dictionary form.
12+
13+
In fact, many languages do not have any significant amount of IPA information available at all, even in dictionaries, and this is even more likely to be the case for language variants and non-standard varieties.
14+
15+
This project aims to resolve these problems by compiling wordlists for each language along with accompanying IPA transcription.
16+
17+
A combination of manual and semi-automatic generation has been used to compile the pronunciations. Whenever possible, pronunciations have been checked manually by consulting multiple reference works, particularly for lemmas (which are usually more easily available). Inflected forms have been added either manually or with semi-automatic guidance when multiple pronunciations can be pre-determined with some certainty.
18+
19+
## Formats
20+
21+
For convenience, the IPA data is provided here in several different formats:
22+
23+
* tab delimited
24+
* JSON
25+
* CSV
26+
* XML
27+
28+
All filenames refer to the [ISO language code](http://en.wikipedia.org/wiki/ISO_639-1) of the relevant language (e.g. `sw.json` is a JSON file containing pronunciations for Swahili).
29+
30+
### Raw data
31+
32+
The raw data in this repository is provided as a series of text files with one entry per line: each word is followed by a tab character and then its pronunciation in IPA. The tab delimited files are plain text UTF-8 encoded files with the filename suffix `.txt` in the following format:
33+
34+
[ENTRY][TAB][IPA]
35+
36+
This file format is simple, lightweight, human- and machine-readable, and is also easily convertible to other common formats. Several of those formats (e.g. JSON, XML, CSV) are provided as downloads in the [Releases](https://github.com/dohliam/ipa-dict/releases) section.
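
As a rough illustration of how the tab delimited format might be consumed, the following Python sketch reads `ENTRY<TAB>IPA` lines into a dictionary. The function name `parse_ipa_lines` and the sample entries are hypothetical and are not taken from the data files; the sketch also assumes that a later duplicate entry may simply overwrite an earlier one.

```python
import io

def parse_ipa_lines(lines):
    """Parse ENTRY<TAB>IPA lines into a dict mapping entry -> IPA field."""
    entries = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        # Split on the first tab only, so the IPA field is kept intact.
        entry, sep, ipa = line.partition("\t")
        if not sep:
            continue  # no tab found: not a valid record
        entries[entry] = ipa
    return entries

# Illustrative sample standing in for a file such as `fr.txt` (assumed name).
sample = io.StringIO("est\t/ɛst/, /ɛ/\nrun\t/ɹʌn/\n")
pronunciations = parse_ipa_lines(sample)
print(pronunciations["est"])
```

For a real file, the same function could be given a file object opened with `open(path, encoding="utf-8")`, since the raw files are UTF-8 encoded.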
37+
38+
### JSON
39+
40+
The JSON files are in the following format:
41+
42+
```json
43+
{
44+
"LANG":
45+
[{
46+
"ENTRY1":"IPA1",
47+
"ENTRY2":"IPA2",
48+
"ENTRY3":"IPA3",
49+
"ENTRY4":"IPA4"
50+
}]
51+
}
52+
```
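
Given the layout above, a lookup table can be recovered with a few lines of Python. This is only a sketch: it assumes that `LANG` stands for the language code used in the filename and that the list always wraps a single object. The helper name `load_ipa_json` is hypothetical.

```python
import json

def load_ipa_json(text, lang):
    """Return the entry -> IPA mapping stored under the given language key."""
    data = json.loads(text)
    # Assumption: the top-level key is the language code, and its value is
    # a one-element list wrapping the actual mapping.
    return data[lang][0]

# Placeholder document mirroring the documented layout, with "sw" as LANG.
sample = '{"sw": [{"ENTRY1": "IPA1", "ENTRY2": "IPA2"}]}'
entries = load_ipa_json(sample, "sw")
print(entries["ENTRY1"])
```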
53+
54+
### XML
55+
56+
XML files have been generated for all the word lists in the following format:
57+
58+
```xml
59+
<IpaEntry EntryID="1">
60+
<Item>ENTRY</Item>
61+
<Ipa>/IPA/</Ipa>
62+
</IpaEntry>
63+
```
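
The fragment above shows a single entry, not a whole file. The sketch below assumes the entries are wrapped in one root element (called `IpaDict` here purely for illustration; the real root name is not stated) and collects them with Python's standard `xml.etree.ElementTree` module. The IPA text is kept verbatim, including the surrounding slashes.

```python
import xml.etree.ElementTree as ET

def parse_ipa_xml(text):
    """Collect <IpaEntry> elements into a dict mapping Item -> Ipa text."""
    root = ET.fromstring(text)
    entries = {}
    for node in root.iter("IpaEntry"):
        item = node.findtext("Item")
        ipa = node.findtext("Ipa")
        if item is None or ipa is None:
            continue  # skip incomplete entries
        entries[item] = ipa
    return entries

# Hypothetical wrapper element around the documented entry format.
sample = """<IpaDict>
<IpaEntry EntryID="1">
<Item>ENTRY</Item>
<Ipa>/IPA/</Ipa>
</IpaEntry>
</IpaDict>"""
xml_entries = parse_ipa_xml(sample)
print(xml_entries)
```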
64+
65+
### CSV
66+
67+
Comma-separated files are available for use with spreadsheet programs and similar tools. These are similar to the raw data files, except that fields are delimited by commas rather than tabs. In most spreadsheet programs, you should be able to open these directly from the file menu.
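
Outside a spreadsheet, the CSV files could be read with Python's standard `csv` module, as in the sketch below. It assumes two columns (entry, then IPA) and standard CSV quoting for IPA fields that themselves contain commas; neither detail is confirmed here, and the sample rows are illustrative.

```python
import csv
import io

def parse_ipa_csv(stream):
    """Read two-column CSV rows into a dict mapping entry -> IPA field."""
    entries = {}
    for row in csv.reader(stream):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        entries[row[0]] = row[1]
    return entries

# Illustrative rows; the quoted field holds two pronunciations.
sample_csv = io.StringIO('est,"/ɛst/, /ɛ/"\nrun,/ɹʌn/\n')
csv_entries = parse_ipa_csv(sample_csv)
print(csv_entries["est"])
```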
68+
69+
### Other formats
70+
71+
There is also a concurrent project to convert the data into DSL format dictionary files for use with dictionary software such as ABBYY Lingvo or GoldenDict.
72+
73+
If there is another format not listed here that would be useful to you, please feel free to open an issue or PR to add it.
74+
75+
## Applications
76+
77+
This project provides an accessible source for IPA pronunciation information that other dictionary projects (e.g. Wiktionary) could draw on rather than manually adding pronunciations to each entry.
78+
79+
Apart from this, there are several ways that this data could be (and has been) applied:
80+
81+
* Providing pronunciation information for a series of learner's grammars currently being compiled by the Open Grammar Project
82+
* Cross-language comparison of common phonemes
83+
* Intra-language analysis of phoneme patterns
84+
* Automatic generation of homonym lists (a selection of these is now available for download in the releases section)
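
To give a sense of how the last item might work, the sketch below groups entries that share a pronunciation. It is not the method used to build the released lists; it is one possible approach, assuming an entry-to-IPA mapping in which multiple pronunciations are separated by commas. The function name `homophone_groups` and the sample words are illustrative.

```python
from collections import defaultdict

def homophone_groups(entries):
    """Group words by pronunciation, keeping only shared pronunciations."""
    by_ipa = defaultdict(list)
    for word, ipa_field in entries.items():
        # An entry may list several comma-separated pronunciations.
        for ipa in ipa_field.split(","):
            ipa = ipa.strip()
            if ipa:
                by_ipa[ipa].append(word)
    return {ipa: sorted(words) for ipa, words in by_ipa.items() if len(words) > 1}

# Illustrative French sample, not taken from the data files.
sample_entries = {"est": "/ɛst/, /ɛ/", "et": "/e/", "ait": "/ɛ/", "haie": "/ɛ/"}
groups = homophone_groups(sample_entries)
print(groups)
```

Inverting the same mapping without the final filter would give a pronunciation index, which could serve as a starting point for the phoneme comparisons mentioned above.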
85+
86+
## Notes
87+
88+
* Pronunciations provided are broadly phonemic, and should represent what one might expect to find in a dictionary or other popular reference work.
89+
* Some familiarity with basic IPA is assumed. However, since variation frequently exists among reference works, the transcriptions here try to maximize readability and usefulness for learners (rather than, say, linguists, who might prefer to make finer distinctions).
90+
* Pronunciation is provided where possible for each inflected form of a given lexeme, so _run_, _ran_, _runs_, and _running_ for example would each be separate entries.
91+
* The emphasis is on the correspondence between orthography and phonemic pronunciation, so separate entries are given for homonyms that are written or spelled differently.
92+
* Where multiple possible pronunciations exist for a given entry, they should all be listed (separated by commas), even if they have different senses. For example, the word _est_ has two different pronunciations in French (/ɛst/ and /ɛ/), depending on whether it is a noun or an (unrelated) verb, so the entry for _est_ lists both of these pronunciations.
93+
* Conversely, words with different orthographies are considered separate entries, even if they have the same pronunciation. This is because the lists are primarily meant to provide possible pronunciations for unique spellings rather than dictionary information for the possible spellings of unique words.
94+
95+
## License
96+
97+
MIT.
