joshdavham commited on
Commit
f2e6a0e
·
1 Parent(s): 34cc53e

add more copy

  1. app.py +81 -54
app.py CHANGED
@@ -1351,15 +1351,13 @@ st.markdown("[Code and data can be found [here](https://github.com/joshdavham/ci
1351
 
1352
  st.markdown("# What makes comprehensible input *comprehensible*?")
1353
 
1354
- st.markdown("**Comprehensible input** (or CI, for short) is a language teaching technique where teachers \
1355
- speak in a way that is understandable to their students. \
1356
- It is believed by many that CI is one of the most optimal and natural \
1357
- ways to acquire a foreign language \
1358
- ...but, what exactly is about CI that makes it comprehensible?")
1359
 
1360
- st.markdown("To answer this question, I'll be analyzing the videos on \
 
 
1361
  [cijapanese.com](https://cijapanese.com/) (CIJ), a \
1362
- video platform for learning Japanese.")
1363
 
1364
  ###
1365
  # RATE OF SPEECH
@@ -1368,7 +1366,7 @@ st.markdown("## How fast is CI?")
1368
 
1369
  st.markdown("If we measure how fast the teachers speak on CIJ, we find that \
1370
  they speak more slowly in videos meant for beginners and more quickly \
1371
- for advanced learners.")
1372
 
1373
  st.markdown("**(THESE GRAPHS ARE CLICKABLE)**")
1374
 
@@ -1379,10 +1377,13 @@ else:
1379
 
1380
  st.altair_chart(layered_chart, use_container_width=True)
1381
 
1382
- st.markdown("To put this data into perspective, native Japanese speakers \
1383
  can speak at rates of over 200 wpm, meaning that most of the videos \
1384
  on CIJ have been adapted to be a lot slower than that!")
1385
 
 
 
 
1386
  if st.checkbox('Enable zooming and panning ( ↕ / ↔️ )'):
1387
  wpm_vs_sps_chart = get_wpm_vs_sps_chart(interactive=True)
1388
  else:
@@ -1390,22 +1391,19 @@ else:
1390
 
1391
  st.altair_chart(wpm_vs_sps_chart, use_container_width=True)
1392
 
1393
- st.markdown("We can also measure the rate of speech in syllables per second (SPS) \
1394
- and compare it to words per minute.")
1395
-
1396
  ###
1397
  # STATISTICS LESSON
1398
  ###
1399
  st.markdown("## A quick statistics lesson")
1400
 
1401
- st.markdown("Before we continue this analysis, there's some basic things you should know.")
1402
 
1403
  st.markdown("### The data")
1404
 
1405
  st.markdown("The dataset we'll be analyzing comprises just under 1,000 videos. \
1406
  In particular, we'll be analyzing the subtitles of the videos.")
1407
 
1408
- st.markdown('Every video has a Level: **Complete Beginner**, **Beginner**, \
1409
  **Intermediate**, or **Advanced**.')
1410
 
1411
  st.markdown("### The statistics")
@@ -1414,7 +1412,7 @@ st.markdown("The goal of this analysis is to find features in the video data tha
1414
  to a specific pattern called an \"ordering\".")
1415
 
1416
  st.markdown("We're specifically looking for *any* statistic that can lead to an \
1417
- ordering of the levels in one of the two following orders:")
1418
 
1419
  st.markdown("> Complete Beginner < Beginner < Intermediate < Advanced")
1420
  st.markdown("or")
@@ -1423,7 +1421,7 @@ st.markdown("> Complete Beginner > Beginner > Intermediate > Advanced")
1423
  st.markdown("For example: if a statistic is small for Complete Beginner videos, but gets bigger \
1424
  for Beginner, Intermediate, then Advanced videos, it suggests \
1425
  that this is a good statistic for determining what makes a video comprehensible. \
1426
- In fact, we already saw this above when measuring the **words per minute** statistic.")
1427
 
1428
  st.markdown("Okay! Now we can continue.")
1429
 
@@ -1441,8 +1439,8 @@ else:
1441
 
1442
  st.altair_chart(sentence_length_hist, use_container_width=True)
1443
 
1444
- st.markdown("This makes sense because long sentences generally tend to be more complex and packed with information \
1445
- whereas short sentences are usually easier to understand.")
1446
 
1447
  ###
1448
  # AMOUNT OF REPETITION
@@ -1459,7 +1457,7 @@ else:
1459
  st.altair_chart(repetition_hist, use_container_width=True)
1460
 
1461
  st.markdown("If you don't catch a word the first time it's said, there's more opportunities \
1462
- in the easier videos to hear that word again.")
1463
 
1464
  ###
1465
  # HOW MANY WORDS
@@ -1467,15 +1465,15 @@ st.markdown("If you don't catch a word the first time it's said, there's more op
1467
  st.markdown("## How many words you need to know")
1468
 
1469
  st.markdown("A popular statistic in language learning circles is that you generally \
1470
- need to know around 98% of words in a given piece of content to understand it well. \
1471
- This statistic is known as 'word coverage', the percentage of words you know in a given text.")
1472
 
1473
- st.markdown("How many words do you need to know to understand 98% of the words in each level?")
1474
 
1475
- st.markdown("If we take all the words in CIJ, count them then order them from most common, to least common, \
1476
  we can calculate the word coverage you get at different vocabulary sizes. \
1477
  For example, if we learn the top 500 words from CIJ, then we'll know around 80% of the words in the \
1478
- Complete Beginner videos. And if we learn the top 4,295 words, then we'll know 98% of the words in that category.")
1479
 
1480
  if st.checkbox('Zoom in'):
1481
  word_coverage_chart = get_word_coverage_chart(zoom=True)
@@ -1484,9 +1482,9 @@ else:
1484
 
1485
  st.altair_chart(word_coverage_chart, use_container_width=True)
1486
 
1487
- st.markdown("Using the same method of calculating word coverage as before, \
1488
- we can also calculate how many of the top words you need to know \
1489
- to achieve 98% word coverage in each video.")
1490
 
1491
  if st.checkbox('Show medians', value=True, key='ne_spot'):
1492
  ne_spot_hist = get_ne_spot_hist(show_medians=True)
@@ -1503,7 +1501,7 @@ st.markdown("In general, easier videos require smaller vocabulary sizes to under
1503
  ###
1504
  st.markdown("## Word rareness")
1505
 
1506
- st.markdown("More advanced videos tend to use rare/uncommon words more often than easier videos.")
1507
 
1508
  if st.checkbox('Show medians', value=True, key='tfplr'):
1509
  # tfplr stands for "twenty fifth percentile log rank"
@@ -1515,24 +1513,23 @@ st.altair_chart(tfplr_hist, use_container_width=True)
1515
 
1516
  st.markdown("How common a word is, is known as its 'rank'. The most common word \
1517
  in a text would be rank 1 and the fifth most common would be rank 5. \
1518
- A word with a low rank is a commonly used word (e.g., 'it', 'walk', 'up') whereas a word with a high rank \
1519
- is an uncommon or 'rare' word (e.g., 'esoteric', 'gauche', 'gallant').")
 
1520
 
1521
- st.markdown("The words in the videos were compared to the ranks of words generated from a frequency list made from over 4,000 Japanese Netflix \
1522
- TV episodes and movies. Duplicate ranks in the videos were removed, scaled with a log \
1523
- function then used to compute the 25th percentile. This was necessary due \
1524
- to power-law nature of word frequency distributions.")
1525
 
1526
- st.markdown("(It's okay ff the above didn't quite make sense to you - just know that the above graph \
1527
- demonstrates that easier videos tend to use more common words whereas \
1528
- advanced videos tend to use more rare words!)")
1529
 
1530
  ###
1531
  # GRAMMAR
1532
  ###
1533
  st.markdown("## Grammar")
1534
 
1535
- st.markdown("Easier videos tend to use less [subordinating conjunctions](https://universaldependencies.org/ja/pos/SCONJ.html) than harder videos.")
1536
 
1537
  if st.checkbox('Show medians', value=True, key='sconj'):
1538
  sconj_hist = get_sconj_hist(show_medians=True)
@@ -1550,13 +1547,13 @@ st.markdown(
1550
  ###
1551
  # WORD ORIGIN
1552
  ###
1553
- st.markdown("## What type of word")
1554
 
1555
  st.markdown("There are three main categories of words in Japanese:")
1556
  st.markdown("(1) Wago (和語), (2) Kango (漢語) and (3) Gairaigo (外来語)")
1557
  st.markdown("Wago are native Japanese words, Kango are Chinese words and Gairaigo are foreign words.")
1558
 
1559
- st.markdown("Harder videos tend to use more Kango than easier videos")
1560
 
1561
  if st.checkbox('Show medians', value=True, key='kango'):
1562
  kango_hist = get_kango_hist(show_medians=True)
@@ -1565,7 +1562,7 @@ else:
1565
 
1566
  st.altair_chart(kango_hist, use_container_width=True)
1567
 
1568
- st.markdown("In Japanese, Kango are somewhat analogous to French words in English. \
1569
  These words tend to be more technical or sophisticated than other words.")
1570
 
1571
  st.markdown("We also notice orderings when counting the percentage of Wago and Gairaigo as well.")
@@ -1579,20 +1576,21 @@ st.markdown(
1579
  ###
1580
  st.markdown("## Which factors matter the most?")
1581
 
1582
- st.markdown("We've just found a number of statistics that lead to orderings in the data \
1583
  but which statistics matter the most?")
1584
 
1585
  st.markdown("To answer this, we can look at a correlation heatmap between each of the variables \
1586
- and observe which statistics correlate the most strongly with the video's level.")
 
1587
 
1588
  render_vanilla_heatmap()
1589
 
1590
  st.markdown("In case you're not familiar with stuff like this, numbers close to 1 or -1 \
1591
- represent a high level or correlation and numbers close to 0 represent a low level of correlation. \
1592
  Positive numbers represent a positive relationship between the variables and negative numbers represent a \
1593
  reverse relationship between the variables.")
1594
 
1595
- st.markdown("Using a statistics rule of thumb and removing all variables that have correlations \
1596
  weaker than 0.3 (and more than -0.3), we can identify the variables with the strongest correlations.")
1597
 
1598
  if st.checkbox('Flip and sort by correlation strength'):
@@ -1601,12 +1599,12 @@ else:
1601
  render_level_row_unordered()
1602
 
1603
 
1604
- st.markdown("To summarize (and simplify), this suggests that the most important factors in comprehensibility are:")
1605
 
1606
  st.markdown("1. Rate of Speech")
1607
  st.markdown("2. Sentence length")
1608
  st.markdown("3. Amount of repetition of words")
1609
- st.markdown("4. How common/rare the words are")
1610
  st.markdown("5. Amount of subordinating conjunctions")
1611
  st.markdown("6. Vocabulary size")
1612
  st.markdown("7. Amount of pronouns")
@@ -1614,23 +1612,40 @@ st.markdown("8. Amount of adverbs")
1614
  st.markdown("9. Amount of auxiliaries")
1615
  st.markdown("10. Amount of Chinese words")
1616
 
1617
- st.markdown("## Dicussion")
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1618
 
1619
  #st.markdown('')
1620
 
1621
- st.markdown("### Thanks for reading ✌️")
1622
 
1623
  st.markdown("---")
1624
 
1625
  st.markdown("#### Further discussion for hardcore nerds")
1626
 
1627
  st.markdown("- No tests of statistical significance were conducted. This was purely meant as an EDA. \
1628
- However, you can get the data from the repo linked at the top and conduct them yourself if you'd like. \
1629
  I'd recommend starting with non-parametric tests like Kruskal-Wallis and moving on to pairwise tests \
1630
- with a bonferonni correction if it's significant. Parametric tests may also be interesting.")
1631
 
1632
  st.markdown("- Technically, I computed 'moras per second' - not syllables per second. I'm aware that this \
1633
- is technically linguistically incorrect, but it still serves as close approximation and is easier \
1634
  to understand for readers unfamiliar with Japanese linguistics.")
1635
 
1636
  st.markdown("- The Mecab and Sudachi parsers (through Fugashi and Spacy) were used to analyze the transcripts. These parsers are not always 100% accurate.")
@@ -1640,9 +1655,16 @@ st.markdown("- When computing the statistics for repetition, word coverage and w
1640
  st.markdown("- Of the parsed words, while I did remove punctuation, I didn't otherwise verify that each token was an actual word. \
1641
  There is likely some amount of noise in the data such as mis-parses, etc.")
1642
 
 
 
 
1643
  st.markdown("- If you're like me, the word coverage plots also probably evoked a resemblance to Heaps' law. \
1644
  More research would need to be done, but I suspect one may be able to find a link between word coverage and Heaps' law.")
1645
 
 
 
 
 
1646
  st.markdown("- One should bear in mind that the learner levels were labelled by a small group of experts and not a large number of learners. \
1647
  In other words, the difficulty levels are not objective, but rather an approximation of difficulty / natural acquisition order.")
1648
 
@@ -1653,7 +1675,7 @@ st.markdown("1. **Audibility** - My hypothesis was that the teachers would speak
1653
  and the original transcript to katakana and compared the character error rate. I found no differences in the levels. \
1654
  Furthermore I can't tell if this moreso invalidates my original hypothesis or if whisper is just that good.")
1655
 
1656
- st.markdown("2. **Word length** - At least in English and French (the languages I know the best), longer words are generally considered harder. \
1657
  My hypothesis was that the easier videos would use shorter words while the harder videos would use bigger words. \
1658
  To test this, I parsed the transcripts and converted all words to katakana \
1659
  to get a measure of how long the words were orally. I found no differences between the levels.")
@@ -1663,4 +1685,9 @@ st.markdown("3. **Range of vocabulary** - I suspected that easier videos may lim
1663
 
1664
  st.markdown("4. **Other parts of speech** - I did test for orderings between the levels for other parts of speech such as: \
1665
  proportion of adjectives, adpositions, coordinating conjunctions, interjections, particles and proper nouns \
1666
- but ultimately didn't find any obvious orderings.")
 
 
 
 
 
 
1351
 
1352
  st.markdown("# What makes comprehensible input *comprehensible*?")
1353
 
1354
+ st.markdown("**Comprehensible input** (or CI, for short) is a language teaching method where teachers provide their students with lots of language “input” that has been adapted to a level that they can understand. It is believed by many that CI is one of the most natural and effective ways to acquire a foreign language.")
 
 
 
 
1355
 
1356
+ st.markdown("…but what exactly is it about CI that makes it so *comprehensible*?")
1357
+
1358
+ st.markdown("To answer this question, we'll be analyzing the videos on \
1359
  [cijapanese.com](https://cijapanese.com/) (CIJ), a \
1360
+ CI platform for learning Japanese.")
1361
 
1362
  ###
1363
  # RATE OF SPEECH
 
1366
 
1367
  st.markdown("If we measure how fast the teachers speak on CIJ, we find that \
1368
  they speak more slowly in videos meant for beginners and more quickly \
1369
+ in videos meant for more advanced learners.")
1370
 
1371
  st.markdown("**(THESE GRAPHS ARE CLICKABLE)**")
1372
 
 
1377
 
1378
  st.altair_chart(layered_chart, use_container_width=True)
1379
 
1380
+ st.markdown("To put the above data into perspective, native Japanese speakers \
1381
  can speak at rates of over 200 wpm, meaning that most of the videos \
1382
  on CIJ have been adapted to be a lot slower than that!")
1383
 
1384
+ st.markdown("We can also measure the rate of speech in syllables per second (SPS) \
1385
+ and compare it to words per minute.")
1386
+
1387
  if st.checkbox('Enable zooming and panning ( ↕ / ↔️ )'):
1388
  wpm_vs_sps_chart = get_wpm_vs_sps_chart(interactive=True)
1389
  else:
 
1391
 
1392
  st.altair_chart(wpm_vs_sps_chart, use_container_width=True)
1393
 
 
 
 
1394
  ###
1395
  # STATISTICS LESSON
1396
  ###
1397
  st.markdown("## A quick statistics lesson")
1398
 
1399
+ st.markdown("Before we continue the analysis, there are some basic things you should know.")
1400
 
1401
  st.markdown("### The data")
1402
 
1403
  st.markdown("The dataset we'll be analyzing comprises just under 1,000 videos. \
1404
  In particular, we'll be analyzing the subtitles of the videos.")
1405
 
1406
+ st.markdown('Also, every video has a level: **Complete Beginner**, **Beginner**, \
1407
  **Intermediate**, or **Advanced**.')
1408
 
1409
  st.markdown("### The statistics")
 
1412
  to a specific pattern called an \"ordering\".")
1413
 
1414
  st.markdown("We're specifically looking for *any* statistic that can lead to an \
1415
+ ordering of the levels in either of the two following directions:")
1416
 
1417
  st.markdown("> Complete Beginner < Beginner < Intermediate < Advanced")
1418
  st.markdown("or")
 
1421
  st.markdown("For example: if a statistic is small for Complete Beginner videos, but gets bigger \
1422
  for Beginner, Intermediate, then Advanced videos, it suggests \
1423
  that this is a good statistic for determining what makes a video comprehensible. \
1424
+ In fact, we already saw this above when measuring the [words per minute statistic](#how-fast-is-ci).")
1425
 
1426
  st.markdown("Okay! Now we can continue.")
1427
 
 
1439
 
1440
  st.altair_chart(sentence_length_hist, use_container_width=True)
1441
 
1442
+ st.markdown("This makes sense because long sentences can be more complex and packed with information \
1443
+ whereas short sentences are usually simpler.")
1444
 
1445
  ###
1446
  # AMOUNT OF REPETITION
 
1457
  st.altair_chart(repetition_hist, use_container_width=True)
1458
 
1459
  st.markdown("If you don't catch a word the first time it's said, there's more opportunities \
1460
+ in the easier videos to hear that word repeated again.")
1461
 
1462
  ###
1463
  # HOW MANY WORDS
 
1465
  st.markdown("## How many words you need to know")
1466
 
1467
  st.markdown("A popular statistic in language learning circles is that you generally \
1468
+ need to know around 98% of the words in a given piece of content in order to be able to understand it well. \
1469
+ This statistic is known as 'word coverage' - the percentage of words you know in a given text.")
1470
 
1471
+ st.markdown("How many words do you need to know in order to understand 98% of the words in each level?")
1472
 
1473
+ st.markdown("If we take all of the words in CIJ, count them and then order them from most common to least common, \
1474
  we can calculate the word coverage you get at different vocabulary sizes. \
1475
  For example, if we learn the top 500 words from CIJ, then we'll know around 80% of the words in the \
1476
+ Complete Beginner videos. And if we learn the top 4,295 words, then we'll know 98% of the words in the Complete Beginner videos.")
1477
 
1478
  if st.checkbox('Zoom in'):
1479
  word_coverage_chart = get_word_coverage_chart(zoom=True)
 
1482
 
1483
  st.altair_chart(word_coverage_chart, use_container_width=True)
1484
 
1485
+ st.markdown("Using this same method of calculating word coverage, \
1486
+ we can also calculate how many of the top words from CIJ you need to know \
1487
+ in order to achieve 98% word coverage in each video.")
1488
 
1489
  if st.checkbox('Show medians', value=True, key='ne_spot'):
1490
  ne_spot_hist = get_ne_spot_hist(show_medians=True)
 
1501
  ###
1502
  st.markdown("## Word rareness")
1503
 
1504
+ st.markdown("Harder videos use rarer words.")
1505
 
1506
  if st.checkbox('Show medians', value=True, key='tfplr'):
1507
  # tfplr stands for "twenty fifth percentile log rank"
 
1513
 
1514
  st.markdown("How common a word is, is known as its 'rank'. The most common word \
1515
  in a text would be rank 1 and the fifth most common would be rank 5. \
1516
+ A word with a low rank is a commonly used word (e.g., 'and', 'work', 'that') whereas a word with a high rank \
1517
+ is an uncommon or 'rare' word (e.g., 'esoteric', 'gauche', 'opprobrium'). Furthermore, \
1518
+ a list of word ranks is known as a 'frequency list'.")
1519
 
1520
+ st.markdown("The ranks of the words in the videos were compared with a larger, independent frequency list and then scaled with a log function \
1521
+ before computing the twenty fifth percentile. This was done to make for a better visualization.")
 
 
1522
 
1523
+ st.markdown("Note: it's okay if the above values don't quite make sense to you - just know that the above graph \
1524
+ demonstrates that easier videos tend to use common words more often whereas \
1525
+ advanced videos tend to use rare words more often.")
1526
 
1527
  ###
1528
  # GRAMMAR
1529
  ###
1530
  st.markdown("## Grammar")
1531
 
1532
+ st.markdown("Easier videos use fewer [subordinating conjunctions](https://universaldependencies.org/ja/pos/SCONJ.html) than harder videos.")
1533
 
1534
  if st.checkbox('Show medians', value=True, key='sconj'):
1535
  sconj_hist = get_sconj_hist(show_medians=True)
 
1547
  ###
1548
  # WORD ORIGIN
1549
  ###
1550
+ st.markdown("## Word origin")
1551
 
1552
  st.markdown("There are three main categories of words in Japanese:")
1553
  st.markdown("(1) Wago (和語), (2) Kango (漢語) and (3) Gairaigo (外来語)")
1554
  st.markdown("Wago are native Japanese words, Kango are Chinese words and Gairaigo are foreign words.")
1555
 
1556
+ st.markdown("Harder videos use more kango than easier videos.")
1557
 
1558
  if st.checkbox('Show medians', value=True, key='kango'):
1559
  kango_hist = get_kango_hist(show_medians=True)
 
1562
 
1563
  st.altair_chart(kango_hist, use_container_width=True)
1564
 
1565
+ st.markdown("In Japanese, kango are somewhat analogous to French words in English. \
1566
  These words tend to be more technical or sophisticated than other words.")
1567
 
1568
  st.markdown("We also notice orderings when counting the percentage of Wago and Gairaigo as well.")
 
1576
  ###
1577
  st.markdown("## Which factors matter the most?")
1578
 
1579
+ st.markdown("We've just found a number of statistics that lead to orderings in the data, \
1580
  but which statistics matter the most?")
1581
 
1582
  st.markdown("To answer this, we can look at a correlation heatmap between each of the variables \
1583
+ and observe which statistics correlate the most strongly with the video's level. \
1584
+ In particular, we'll want to look at the first row (or first column).")
1585
 
1586
  render_vanilla_heatmap()
1587
 
1588
  st.markdown("In case you're not familiar with stuff like this, numbers close to 1 or -1 \
1589
+ represent a high level of correlation while numbers close to 0 represent a low level of correlation. \
1590
  Positive numbers represent a positive relationship between the variables and negative numbers represent a \
1591
  reverse relationship between the variables.")
1592
 
1593
+ st.markdown("If we use a statistics rule of thumb and remove all the variables that have correlations \
1594
  weaker than 0.3 (and more than -0.3), we can identify the variables with the strongest correlations.")
1595
 
1596
  if st.checkbox('Flip and sort by correlation strength'):
 
1599
  render_level_row_unordered()
1600
 
1601
 
1602
+ st.markdown("To summarize (and simplify), the factors with the strongest correlations with the Level are:")
1603
 
1604
  st.markdown("1. Rate of Speech")
1605
  st.markdown("2. Sentence length")
1606
  st.markdown("3. Amount of repetition of words")
1607
+ st.markdown("4. How rare the words are")
1608
  st.markdown("5. Amount of subordinating conjunctions")
1609
  st.markdown("6. Vocabulary size")
1610
  st.markdown("7. Amount of pronouns")
 
1612
  st.markdown("9. Amount of auxiliaries")
1613
  st.markdown("10. Amount of Chinese words")
1614
 
1615
+ st.markdown("In other words, as the videos get harder, the speech gets faster, the sentences get longer, words are repeated *less* \
1616
+ and so on and so forth!")
1617
+
1618
+ st.markdown("## Discussion / Conclusion")
1619
+
1620
+ st.markdown("I find comprehensible input absolutely fascinating: \
1621
+ at any stage of the language acquisition process, the language can \
1622
+ be made into a form that anyone can understand, even without formal instruction.")
1623
+
1624
+ st.markdown("In the above analysis, we saw that there exist a number of patterns that help \
1625
+ explain what CI is made of and the various factors that change \
1626
+ when CI is targeted at new vs. experienced learners.")
1627
+
1628
+ st.markdown("The findings in this analysis are not meant to be conclusive or to tell CI educators \
1629
+ how to teach their students, but rather just to get us thinking more analytically about the factors \
1630
+ that help or hurt comprehensibility. Most of us know intuitively that slow speech is easier to understand than fast \
1631
+ speech, but how many of us think about the importance of repetition when trying to make ourselves understood? \
1632
+ I think it's interesting and important to think about these things as both language learners and educators.")
1633
 
1634
  #st.markdown('')
1635
 
1636
+ st.markdown("## Thanks for reading ✌️")
1637
 
1638
  st.markdown("---")
1639
 
1640
  st.markdown("#### Further discussion for hardcore nerds")
1641
 
1642
  st.markdown("- No tests of statistical significance were conducted. This was purely meant as an EDA. \
1643
+ However, you can get the data from the repo linked at the top and conduct tests yourself if you'd like. \
1644
  I'd recommend starting with non-parametric tests like Kruskal-Wallis and moving on to pairwise tests \
1645
+ with a Bonferroni correction if there's a significant result. Parametric tests may also be interesting.")
1646
 
1647
  st.markdown("- Technically, I computed 'moras per second' - not syllables per second. I'm aware that this \
1648
+ is technically linguistically incorrect, but it still serves as a close approximation and is easier \
1649
  to understand for readers unfamiliar with Japanese linguistics.")
1650
 
1651
  st.markdown("- The Mecab and Sudachi parsers (through Fugashi and Spacy) were used to analyze the transcripts. These parsers are not always 100% accurate.")
 
1655
  st.markdown("- Of the parsed words, while I did remove punctuation, I didn't otherwise verify that each token was an actual word. \
1656
  There is likely some amount of noise in the data such as mis-parses, etc.")
1657
 
1658
+ st.markdown("- I am slightly abusing the 98% statistic in this analysis. The original research applies \
1659
+ mainly to written text whereas the content on CIJ is mainly meant to be listened to rather than read.")
1660
+
1661
  st.markdown("- If you're like me, the word coverage plots also probably evoked a resemblance to Heaps' law. \
1662
  More research would need to be done, but I suspect one may be able to find a link between word coverage and Heaps' law.")
1663
 
1664
+ st.markdown("- The frequency list used to calculate the word ranks was created from over 4,000 Japanese TV episodes and movies on Netflix. \
1665
+ Furthermore, the 25th percentile was computed on the ranks of unique words in each video's subtitles. Getting a decent visualization for \
1666
+ something like this is actually a bit tricky due to the heavy-tailed, power-law nature of word-frequency distributions.")
1667
+
1668
  st.markdown("- One should bear in mind that the learner levels were labelled by a small group of experts and not a large number of learners. \
1669
  In other words, the difficulty levels are not objective, but rather an approximation of difficulty / natural acquisition order.")
1670
 
 
1675
  and the original transcript to katakana and compared the character error rate. I found no differences in the levels. \
1676
  Furthermore I can't tell if this moreso invalidates my original hypothesis or if whisper is just that good.")
1677
 
1678
+ st.markdown("2. **Word length** - At least in English and French (the languages I know best), longer words are generally considered harder. \
1679
  My hypothesis was that the easier videos would use shorter words while the harder videos would use bigger words. \
1680
  To test this, I parsed the transcripts and converted all words to katakana \
1681
  to get a measure of how long the words were orally. I found no differences between the levels.")
 
1685
 
1686
  st.markdown("4. **Other parts of speech** - I did test for orderings between the levels for other parts of speech such as: \
1687
  proportion of adjectives, adpositions, coordinating conjunctions, interjections, particles and proper nouns \
1688
+ but ultimately didn't find any obvious orderings.")
1689
+
1690
+ st.markdown("5. **Other word frequency metrics** - You can probably guess from reading '25th percentile log rank' that this was not the first statistic I tried. \
1691
+ I also tried computing the un-logged ranks, the mean, median, 75th percentile and non-unique (repeated) word ranks from the videos, and while some of these led to \
1692
+ orderings, they were generally not very nice to visualize. I'm certain that there's got to be a nicer statistic for representing how rare the overall vocabulary in a text is. \
1693
+ But Zipf's law makes this a challenge.")
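
Reviewer note: the copy in this commit describes word coverage, the vocabulary size needed for 98% coverage, and a "25th percentile log rank" statistic, but the code that computes them is not part of this diff. The sketch below is one possible way to compute them, not the app's actual implementation. The function names (`frequency_ranking`, `word_coverage`, `vocab_size_for_coverage`, `percentile_log_rank`) are hypothetical, tokens are assumed to be already-parsed words, and the natural log and the default interpolation of `statistics.quantiles` are assumptions, since the copy only says "a log function" and "25th percentile".

```python
import math
import statistics
from collections import Counter


def frequency_ranking(tokens):
    """Return the unique words ordered from most common to least common."""
    return [word for word, _ in Counter(tokens).most_common()]


def word_coverage(tokens, known_words):
    """Fraction of running words (tokens) that appear in the known vocabulary."""
    if not tokens:
        return 0.0
    known = set(known_words)
    return sum(1 for t in tokens if t in known) / len(tokens)


def vocab_size_for_coverage(tokens, ranking, target=0.98):
    """Smallest n such that the top-n words of `ranking` cover `target` of `tokens`.

    Returns None when the text is empty or the ranking never reaches the target.
    """
    total = len(tokens)
    if total == 0:
        return None
    counts = Counter(tokens)
    covered = 0
    for n, word in enumerate(ranking, start=1):
        covered += counts.get(word, 0)
        if covered / total >= target:
            return n
    return None


def percentile_log_rank(tokens, rank_of, q=25):
    """q-th percentile of log(rank) over the unique words of a text.

    `rank_of` maps a word to its rank in an independent frequency list
    (rank 1 = most common). Words missing from the list are skipped.
    """
    log_ranks = sorted(math.log(rank_of[w]) for w in set(tokens) if w in rank_of)
    if not log_ranks:
        return None
    if len(log_ranks) == 1:
        return log_ranks[0]
    # n=100 yields 99 cut points; index q - 1 is the q-th percentile.
    return statistics.quantiles(log_ranks, n=100)[q - 1]


# Toy usage with made-up tokens (not CIJ data).
video = ["a"] * 5 + ["b"] * 3 + ["c"] * 2
ranking = frequency_ranking(video)
print(word_coverage(video, ranking[:1]))
print(vocab_size_for_coverage(video, ranking, target=0.98))
```

Coverage here is counted over running words (repeats included), while the percentile statistic uses unique words only, which matches the distinction the notes above draw between the two.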