joshdavham committed
Commit 2c256e8 · Parent(s): c21347b

reorder functions

Files changed (1): app.py +295 -289
app.py CHANGED
@@ -20,6 +20,7 @@ st.markdown(
     """, unsafe_allow_html=True
     )
 
+# functions for loading data
 @st.cache_data
 def load_dataframes():
 
@@ -99,38 +100,7 @@ def get_word_origin_table():
 
     return styled_df
 
-
-video_df, word_coverage_df, num_video_df = load_dataframes()
-grammar_table = get_grammar_table()
-word_origin_table = get_word_origin_table()
-
-st.markdown("Note: this analysis is meant to be viewed on a computer and not a phone (sorry!)")
-
-st.markdown("[Code can be found [here](https://github.com/joshdavham/cij-analysis)]")
-
-st.markdown("# What makes comprehensible input *comprehensible*?")
-
-st.markdown("**Comprehensible input** (or CI, for short) is a language teaching technique where teachers \
-speak in a way that is understandable to their students. \
-It is believed by many that CI is one of the most optimal and natural \
-ways to acquire a foreign language \
-...but what exactly is it about CI that makes it comprehensible?")
-
-
-
-st.markdown("To answer this question, I'll be analyzing the videos on \
-[cijapanese.com](https://cijapanese.com/) (CIJ), a \
-video platform for learning Japanese.")
-
-###
-# RATE OF SPEECH
-###
-st.markdown("## How fast is CI?")
-
-st.markdown("If we measure how fast the teachers speak on CIJ, we find that \
-they speak more slowly in videos meant for beginners and more quickly \
-for advanced learners.")
-
+# functions for loading data visualizations
 @st.cache_data
 def get_wpm_chart(show_medians=False):
 
@@ -270,21 +240,6 @@ def get_wpm_chart(show_medians=False):
 
     return layered_chart
 
-
-if st.checkbox('Show medians'):
-
-    layered_chart = get_wpm_chart(show_medians=True)
-
-else:
-
-    layered_chart = get_wpm_chart(show_medians=False)
-
-st.altair_chart(layered_chart, use_container_width=True)
-
-st.markdown("To put this data into perspective, native Japanese speakers \
-tend to speak at rates of over 200 wpm, meaning that most of the videos \
-on CIJ have been adapted to be a lot slower than that!")
-
 @st.cache_data
 def get_wpm_vs_sps_chart(interactive=False):
 
@@ -368,60 +323,6 @@ def get_wpm_vs_sps_chart(interactive=False):
         return scatter_plot.interactive()
     else:
         return scatter_plot
-
-if st.checkbox('Enable zooming and panning ( ↕ / ↔️ )'):
-    wpm_vs_sps_chart = get_wpm_vs_sps_chart(interactive=True)
-else:
-    wpm_vs_sps_chart = get_wpm_vs_sps_chart(interactive=False)
-
-st.altair_chart(wpm_vs_sps_chart, use_container_width=True)
-
-st.markdown("We can also measure the rate of speech in syllables per second (SPS) \
-and compare it to words per minute.")
-
-st.markdown("(Also, FYI, most of these **graphs are \
-interactive** so please click around.)")
-
-###
-# STATISTICS LESSON
-###
-st.markdown("## A quick statistics lesson")
-
-st.markdown("Before we continue this analysis, there are some basic things you should know.")
-
-st.markdown("### The data")
-
-st.markdown("The dataset we'll be analyzing comprises just under 1,000 videos. \
-In particular, we'll be analyzing the subtitles of the videos.")
-
-st.markdown('Every video has a Level: **Complete Beginner**, **Beginner**, \
-**Intermediate**, or **Advanced**.')
-
-st.markdown("### The statistics")
-
-st.markdown("The goal of this analysis is to find features in the video data that lead \
-to a specific pattern called an \"ordering\".")
-
-st.markdown("We're specifically looking for *any* statistic that can lead to an \
-ordering of the levels in one of the two following orders:")
-
-st.markdown("> Complete Beginner < Beginner < Intermediate < Advanced")
-st.markdown("or")
-st.markdown("> Complete Beginner > Beginner > Intermediate > Advanced")
-
-st.markdown("For example: if a statistic is small for Complete Beginner videos, but gets bigger \
-for Beginner, Intermediate, then Advanced videos, it suggests \
-that this is a good statistic for determining what makes a video comprehensible. \
-In fact, we already saw this above when measuring the **words per minute** statistic.")
-
-st.markdown("Okay! Now we can continue.")
-
-###
-# SENTENCE LENGTH
-###
-st.markdown("## Sentence length")
-
-st.markdown("Videos meant for beginners tend to have shorter sentences on average.")
 
 @st.cache_data
 def get_sentence_length_hist(show_medians=False):
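The "ordering" pattern defined in the moved statistics lesson above can be checked mechanically: a statistic yields an ordering when its per-level medians are strictly monotone. A minimal sketch, not part of this commit:

    from statistics import median

    LEVELS = ["Complete Beginner", "Beginner", "Intermediate", "Advanced"]

    def has_ordering(values_by_level: dict[str, list[float]]) -> bool:
        meds = [median(values_by_level[lvl]) for lvl in LEVELS]
        rising = all(a < b for a, b in zip(meds, meds[1:]))
        falling = all(a > b for a, b in zip(meds, meds[1:]))
        return rising or falling

    # Words per minute rises with level, so this prints True (made-up numbers)
    print(has_ordering({"Complete Beginner": [62, 70], "Beginner": [85, 92],
                        "Intermediate": [110, 118], "Advanced": [150, 161]}))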
@@ -565,26 +466,6 @@ def get_sentence_length_hist(show_medians=False):
 
     return layered_chart
 
-if st.checkbox('Show medians', key='sentence_length'):
-
-    sentence_length_hist = get_sentence_length_hist(show_medians=True)
-
-else:
-
-    sentence_length_hist = get_sentence_length_hist(show_medians=False)
-
-st.altair_chart(sentence_length_hist, use_container_width=True)
-
-st.markdown("This makes sense because long sentences generally tend to be more complex and packed with information \
-whereas short sentences are usually easier to understand.")
-
-###
-# AMOUNT OF REPETITION
-###
-st.markdown("## Amount of repetition")
-
-st.markdown("Words are repeated more often in easier videos.")
-
 @st.cache_data
 def get_repetition_hist(show_medians=False):
 
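The sentence-length statistic from the lines above can be sketched as follows; splitting on '。' and counting characters per sentence is a simplification, since the app's actual tokenization is not shown in this hunk:

    def mean_sentence_length(subtitle_text: str) -> float:
        """Average sentence length, in characters, splitting on the Japanese full stop."""
        sentences = [s for s in subtitle_text.split("。") if s.strip()]
        return sum(len(s) for s in sentences) / len(sentences)

    print(mean_sentence_length("今日は晴れです。明日も晴れるといいですね。"))  # -> 9.5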
@@ -735,36 +616,6 @@ def get_repetition_hist(show_medians=False):
 
     return layered_chart
 
-if st.checkbox('Show medians', key='repetition'):
-
-    repetition_hist = get_repetition_hist(show_medians=True)
-
-else:
-
-    repetition_hist = get_repetition_hist(show_medians=False)
-
-st.altair_chart(repetition_hist, use_container_width=True)
-
-
-st.markdown("If you don't catch a word the first time it's said, there are more opportunities \
-in the easier videos to hear that word again.")
-
-###
-# HOW MANY WORDS
-###
-st.markdown("## How many words you need to know")
-
-st.markdown("A popular statistic in language learning circles is that you generally \
-need to know around 98% of words in a given piece of content to understand it well. \
-This statistic is known as 'word coverage': the percentage of words you know in a given text.")
-
-st.markdown("How many words do you need to know to understand 98% of the words in each level?")
-
-st.markdown("If we take all the words in CIJ, count them, then order them from most common to least common, \
-we can calculate the word coverage you get at different vocabulary sizes. \
-For example, if we learn the top 500 words from CIJ, then we'll know around 80% of the words in the \
-Complete Beginner videos. And if we learn the top 4,295 words, then we'll know 98% of the words in that category.")
-
 @st.cache_data
 def get_word_coverage_chart(zoom=False):
 
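The word-coverage calculation described in the moved lines can be sketched like this (a simplified illustration, not the app's code; both inputs are assumed to be pre-tokenized word lists):

    from collections import Counter

    def coverage_at(category_tokens: list[str], corpus_tokens: list[str], n: int) -> float:
        """Fraction of a category's tokens covered by the top-n most common corpus words."""
        top_n = {w for w, _ in Counter(corpus_tokens).most_common(n)}
        return sum(t in top_n for t in category_tokens) / len(category_tokens)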
@@ -901,20 +752,6 @@ def get_word_coverage_chart(zoom=False):
 
     return layered_chart
 
-if st.checkbox('Zoom in'):
-
-    word_coverage_chart = get_word_coverage_chart(zoom=True)
-
-else:
-
-    word_coverage_chart = get_word_coverage_chart(zoom=False)
-
-st.altair_chart(word_coverage_chart, use_container_width=True)
-
-st.markdown("Using the same method of calculating word coverage as before, \
-we can also calculate how many of the top words you need to know \
-to achieve 98% word coverage in each video.")
-
 @st.cache_data
 def get_ne_spot_hist(show_medians=False):
 
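The per-video version described above amounts to finding the smallest top-N vocabulary whose cumulative token share reaches 98%. A sketch under the same assumptions as before:

    import bisect
    from collections import Counter

    def vocab_size_for_coverage(tokens: list[str], target: float = 0.98) -> int:
        counts = [c for _, c in Counter(tokens).most_common()]
        total = sum(counts)
        cumulative, running = [], 0
        for c in counts:
            running += c
            cumulative.append(running / total)
        return bisect.bisect_left(cumulative, target) + 1  # smallest N reaching target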
@@ -1055,25 +892,6 @@ def get_ne_spot_hist(show_medians=False):
 
     return layered_chart
 
-if st.checkbox('Show medians', key='ne_spot'):
-
-    ne_spot_hist = get_ne_spot_hist(show_medians=True)
-
-else:
-
-    ne_spot_hist = get_ne_spot_hist(show_medians=False)
-
-st.altair_chart(ne_spot_hist, use_container_width=True)
-
-st.markdown("In general, easier videos require smaller vocabulary sizes to understand.")
-
-###
-# WORD RARENESS
-###
-st.markdown("## Word rareness")
-
-st.markdown("More advanced videos tend to use rare/uncommon words more often than easier videos.")
-
 @st.cache_data
 def get_tfplr_hist(show_medians=False):
 
@@ -1213,37 +1031,6 @@ def get_tfplr_hist(show_medians=False):
 
     return layered_chart
 
-if st.checkbox('Show medians', key='tfplr'):
-
-    tfplr_hist = get_tfplr_hist(show_medians=True)
-
-else:
-
-    tfplr_hist = get_tfplr_hist(show_medians=False)
-
-st.altair_chart(tfplr_hist, use_container_width=True)
-
-st.markdown("How common a word is, is known as its 'rank'. The most common word \
-in a text would be rank 1 and the fifth most common would be rank 5. \
-A word with a low rank is a commonly used word (e.g., 'it', 'walk', 'up') whereas a word with a high rank \
-is an uncommon or 'rare' word (e.g., 'esoteric', 'gauche', 'gallant').")
-
-st.markdown("The words in the videos were compared to the ranks of words generated from a frequency list made from over 4,000 Japanese Netflix \
-TV episodes and movies. Duplicate ranks in the videos were removed, scaled with a log \
-function, then used to compute the 25th percentile. This was necessary due \
-to the power-law nature of word frequency distributions.")
-
-st.markdown("(It's okay if the above didn't quite make sense to you - just know that the above graph \
-demonstrates that easier videos tend to use more common words whereas \
-advanced videos tend to use more rare words!)")
-
-###
-# GRAMMAR
-###
-st.markdown("## Grammar")
-
-st.markdown("Easier videos tend to use fewer [subordinating conjunctions](https://universaldependencies.org/u/pos/SCONJ.html) than harder videos.")
-
 @st.cache_data
 def get_sconj_hist(show_medians=False):
 
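The rareness statistic is spelled out in the moved lines above: look up each distinct word's frequency rank, log-scale the ranks, then take the 25th percentile. A sketch; rank_by_word stands in for the Netflix-derived frequency list, which is not part of this diff:

    import math
    import numpy as np

    def rareness_score(tokens: list[str], rank_by_word: dict[str, int]) -> float:
        distinct_ranks = {rank_by_word[t] for t in tokens if t in rank_by_word}  # dedupe
        log_ranks = [math.log(r) for r in distinct_ranks]
        return float(np.percentile(log_ranks, 25))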
@@ -1386,33 +1173,6 @@ def get_sconj_hist(show_medians=False):
 
     return layered_chart
 
-if st.checkbox('Show medians', key='sconj'):
-
-    sconj_hist = get_sconj_hist(show_medians=True)
-
-else:
-
-    sconj_hist = get_sconj_hist(show_medians=False)
-
-st.altair_chart(sconj_hist, use_container_width=True)
-
-st.markdown("We also notice differences in the use of other types of words.")
-
-st.markdown(
-    '<div class="dataframe-div">' + grammar_table.to_html() + "</div>"
-    , unsafe_allow_html=True)
-
-###
-# WORD ORIGIN
-###
-st.markdown("## What type of word")
-
-st.markdown("There are three main categories of words in Japanese:")
-st.markdown("(1) Wago (和語), (2) Kango (漢語) and (3) Gairaigo (外来語)")
-st.markdown("Wago are native Japanese words, Kango are words of Chinese origin and Gairaigo are foreign loanwords.")
-
-st.markdown("Harder videos tend to use more Kango than easier videos.")
-
 @st.cache_data
 def get_kango_hist(show_medians=False):
 
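The Kango percentage discussed in the lines above reduces to a lookup-and-count; origin_by_word is a hypothetical word-origin table (dictionaries such as UniDic tag 和/漢/外 origins), not something defined in this diff:

    def kango_percentage(tokens: list[str], origin_by_word: dict[str, str]) -> float:
        origins = [origin_by_word[t] for t in tokens if t in origin_by_word]
        return 100 * origins.count("漢") / len(origins)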
@@ -1554,63 +1314,22 @@ def get_kango_hist(show_medians=False):
 
     return layered_chart
 
-if st.checkbox('Show medians', key='kango'):
-
-    kango_hist = get_kango_hist(show_medians=True)
-
-else:
-
-    kango_hist = get_kango_hist(show_medians=False)
-
-st.altair_chart(kango_hist, use_container_width=True)
-
-st.markdown("In Japanese, Kango are somewhat analogous to French words in English. \
-These words tend to be more technical or sophisticated than other words.")
-
-st.markdown("We also notice orderings when counting the percentage of Wago and Gairaigo as well.")
-
-st.markdown(
-    '<div class="dataframe-div">' + word_origin_table.to_html() + "</div>"
-    , unsafe_allow_html=True)
-
-###
-# MOST IMPORTANT FACTORS
-###
-st.markdown("## Which factors matter the most?")
-
-st.markdown("We've just found a number of statistics that lead to orderings in the data, \
-but which statistics matter the most?")
-
-st.markdown("To answer this, we can look at a correlation heatmap between each of the variables \
-and observe which statistics correlate the most strongly with the video's level.")
-
-@st.cache_data
-def render_vanilla_heatmap():
-
-    corr_matrix = num_video_df.corr()
-
-    variable_of_interest = 'Level'
-
-    sorted_vars = corr_matrix[variable_of_interest].sort_values(ascending=False).index
-
-    sorted_corr_matrix = corr_matrix.loc[sorted_vars, sorted_vars]
+@st.cache_data
+def render_vanilla_heatmap():
+
+    corr_matrix = num_video_df.corr()
+
+    variable_of_interest = 'Level'
+
+    sorted_vars = corr_matrix[variable_of_interest].sort_values(ascending=False).index
+
+    sorted_corr_matrix = corr_matrix.loc[sorted_vars, sorted_vars]
 
     plt.figure(figsize=(10, 8))
     sns.heatmap(sorted_corr_matrix, annot=True, cmap='coolwarm', fmt=".3f")
 
     st.pyplot(plt.gcf())
 
-render_vanilla_heatmap()
-
-st.markdown("In case you're not familiar with stuff like this, numbers close to 1 or -1 \
-represent a high level of correlation and numbers close to 0 represent a low level of correlation. \
-Positive numbers represent a positive relationship between the variables and negative numbers represent a \
-reverse relationship between the variables.")
-
-st.markdown("Using a statistics rule of thumb and removing all variables that have correlations \
-weaker than 0.3 (and more than -0.3), we can identify the variables with the strongest correlations.")
-
-
 @st.cache_data
 def render_level_row_unordered():
 
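The 0.3 rule of thumb in the lines above translates directly to a filter on the correlation matrix. A sketch, assuming num_video_df as loaded by load_dataframes():

    import pandas as pd

    def strong_level_correlates(num_video_df: pd.DataFrame, threshold: float = 0.3) -> pd.Series:
        corr_with_level = num_video_df.corr()["Level"].drop("Level")
        strong = corr_with_level[corr_with_level.abs() >= threshold]
        return strong.sort_values(ascending=False)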
@@ -1651,6 +1370,293 @@ def render_level_col_ordered():
 
     st.pyplot(plt.gcf())
 
+# load the data
+video_df, word_coverage_df, num_video_df = load_dataframes()
+grammar_table = get_grammar_table()
+word_origin_table = get_word_origin_table()
+
+st.markdown("Note: this analysis is meant to be viewed on a computer and not a phone (sorry!)")
+
+st.markdown("[Code can be found [here](https://github.com/joshdavham/cij-analysis)]")
+
+st.markdown("# What makes comprehensible input *comprehensible*?")
+
+st.markdown("**Comprehensible input** (or CI, for short) is a language teaching technique where teachers \
+speak in a way that is understandable to their students. \
+It is believed by many that CI is one of the most optimal and natural \
+ways to acquire a foreign language \
+...but what exactly is it about CI that makes it comprehensible?")
+
+
+
+st.markdown("To answer this question, I'll be analyzing the videos on \
+[cijapanese.com](https://cijapanese.com/) (CIJ), a \
+video platform for learning Japanese.")
+
+###
+# RATE OF SPEECH
+###
+st.markdown("## How fast is CI?")
+
+st.markdown("If we measure how fast the teachers speak on CIJ, we find that \
+they speak more slowly in videos meant for beginners and more quickly \
+for advanced learners.")
+
+if st.checkbox('Show medians'):
+
+    layered_chart = get_wpm_chart(show_medians=True)
+
+else:
+
+    layered_chart = get_wpm_chart(show_medians=False)
+
+st.altair_chart(layered_chart, use_container_width=True)
+
+st.markdown("To put this data into perspective, native Japanese speakers \
+tend to speak at rates of over 200 wpm, meaning that most of the videos \
+on CIJ have been adapted to be a lot slower than that!")
+
+if st.checkbox('Enable zooming and panning ( ↕ / ↔️ )'):
+    wpm_vs_sps_chart = get_wpm_vs_sps_chart(interactive=True)
+else:
+    wpm_vs_sps_chart = get_wpm_vs_sps_chart(interactive=False)
+
+st.altair_chart(wpm_vs_sps_chart, use_container_width=True)
+
+st.markdown("We can also measure the rate of speech in syllables per second (SPS) \
+and compare it to words per minute.")
+
+st.markdown("(Also, FYI, most of these **graphs are \
+interactive** so please click around.)")
+
+###
+# STATISTICS LESSON
+###
+st.markdown("## A quick statistics lesson")
+
+st.markdown("Before we continue this analysis, there are some basic things you should know.")
+
+st.markdown("### The data")
+
+st.markdown("The dataset we'll be analyzing comprises just under 1,000 videos. \
+In particular, we'll be analyzing the subtitles of the videos.")
+
+st.markdown('Every video has a Level: **Complete Beginner**, **Beginner**, \
+**Intermediate**, or **Advanced**.')
+
+st.markdown("### The statistics")
+
+st.markdown("The goal of this analysis is to find features in the video data that lead \
+to a specific pattern called an \"ordering\".")
+
+st.markdown("We're specifically looking for *any* statistic that can lead to an \
+ordering of the levels in one of the two following orders:")
+
+st.markdown("> Complete Beginner < Beginner < Intermediate < Advanced")
+st.markdown("or")
+st.markdown("> Complete Beginner > Beginner > Intermediate > Advanced")
+
+st.markdown("For example: if a statistic is small for Complete Beginner videos, but gets bigger \
+for Beginner, Intermediate, then Advanced videos, it suggests \
+that this is a good statistic for determining what makes a video comprehensible. \
+In fact, we already saw this above when measuring the **words per minute** statistic.")
+
+st.markdown("Okay! Now we can continue.")
+
+###
+# SENTENCE LENGTH
+###
+st.markdown("## Sentence length")
+
+st.markdown("Videos meant for beginners tend to have shorter sentences on average.")
+
+
+if st.checkbox('Show medians', key='sentence_length'):
+
+    sentence_length_hist = get_sentence_length_hist(show_medians=True)
+
+else:
+
+    sentence_length_hist = get_sentence_length_hist(show_medians=False)
+
+st.altair_chart(sentence_length_hist, use_container_width=True)
+
+st.markdown("This makes sense because long sentences generally tend to be more complex and packed with information \
+whereas short sentences are usually easier to understand.")
+
+###
+# AMOUNT OF REPETITION
+###
+st.markdown("## Amount of repetition")
+
+st.markdown("Words are repeated more often in easier videos.")
+
+if st.checkbox('Show medians', key='repetition'):
+
+    repetition_hist = get_repetition_hist(show_medians=True)
+
+else:
+
+    repetition_hist = get_repetition_hist(show_medians=False)
+
+st.altair_chart(repetition_hist, use_container_width=True)
+
+
+st.markdown("If you don't catch a word the first time it's said, there are more opportunities \
+in the easier videos to hear that word again.")
+
+###
+# HOW MANY WORDS
+###
+st.markdown("## How many words you need to know")
+
+st.markdown("A popular statistic in language learning circles is that you generally \
+need to know around 98% of words in a given piece of content to understand it well. \
+This statistic is known as 'word coverage': the percentage of words you know in a given text.")
+
+st.markdown("How many words do you need to know to understand 98% of the words in each level?")
+
+st.markdown("If we take all the words in CIJ, count them, then order them from most common to least common, \
+we can calculate the word coverage you get at different vocabulary sizes. \
+For example, if we learn the top 500 words from CIJ, then we'll know around 80% of the words in the \
+Complete Beginner videos. And if we learn the top 4,295 words, then we'll know 98% of the words in that category.")
+
+if st.checkbox('Zoom in'):
+
+    word_coverage_chart = get_word_coverage_chart(zoom=True)
+
+else:
+
+    word_coverage_chart = get_word_coverage_chart(zoom=False)
+
+st.altair_chart(word_coverage_chart, use_container_width=True)
+
+st.markdown("Using the same method of calculating word coverage as before, \
+we can also calculate how many of the top words you need to know \
+to achieve 98% word coverage in each video.")
+
+if st.checkbox('Show medians', key='ne_spot'):
+
+    ne_spot_hist = get_ne_spot_hist(show_medians=True)
+
+else:
+
+    ne_spot_hist = get_ne_spot_hist(show_medians=False)
+
+st.altair_chart(ne_spot_hist, use_container_width=True)
+
+st.markdown("In general, easier videos require smaller vocabulary sizes to understand.")
+
+###
+# WORD RARENESS
+###
+st.markdown("## Word rareness")
+
+st.markdown("More advanced videos tend to use rare/uncommon words more often than easier videos.")
+
+if st.checkbox('Show medians', key='tfplr'):
+
+    tfplr_hist = get_tfplr_hist(show_medians=True)
+
+else:
+
+    tfplr_hist = get_tfplr_hist(show_medians=False)
+
+st.altair_chart(tfplr_hist, use_container_width=True)
+
+st.markdown("How common a word is, is known as its 'rank'. The most common word \
+in a text would be rank 1 and the fifth most common would be rank 5. \
+A word with a low rank is a commonly used word (e.g., 'it', 'walk', 'up') whereas a word with a high rank \
+is an uncommon or 'rare' word (e.g., 'esoteric', 'gauche', 'gallant').")
+
+st.markdown("The words in the videos were compared to the ranks of words generated from a frequency list made from over 4,000 Japanese Netflix \
+TV episodes and movies. Duplicate ranks in the videos were removed, scaled with a log \
+function, then used to compute the 25th percentile. This was necessary due \
+to the power-law nature of word frequency distributions.")
+
+st.markdown("(It's okay if the above didn't quite make sense to you - just know that the above graph \
+demonstrates that easier videos tend to use more common words whereas \
+advanced videos tend to use more rare words!)")
+
+###
+# GRAMMAR
+###
+st.markdown("## Grammar")
+
+st.markdown("Easier videos tend to use fewer [subordinating conjunctions](https://universaldependencies.org/u/pos/SCONJ.html) than harder videos.")
+
+if st.checkbox('Show medians', key='sconj'):
+
+    sconj_hist = get_sconj_hist(show_medians=True)
+
+else:
+
+    sconj_hist = get_sconj_hist(show_medians=False)
+
+st.altair_chart(sconj_hist, use_container_width=True)
+
+st.markdown("We also notice differences in the use of other types of words.")
+
+st.markdown(
+    '<div class="dataframe-div">' + grammar_table.to_html() + "</div>"
+    , unsafe_allow_html=True)
+
+###
+# WORD ORIGIN
+###
+st.markdown("## What type of word")
+
+st.markdown("There are three main categories of words in Japanese:")
+st.markdown("(1) Wago (和語), (2) Kango (漢語) and (3) Gairaigo (外来語)")
+st.markdown("Wago are native Japanese words, Kango are words of Chinese origin and Gairaigo are foreign loanwords.")
+
+st.markdown("Harder videos tend to use more Kango than easier videos.")
+
+
+if st.checkbox('Show medians', key='kango'):
+
+    kango_hist = get_kango_hist(show_medians=True)
+
+else:
+
+    kango_hist = get_kango_hist(show_medians=False)
+
+st.altair_chart(kango_hist, use_container_width=True)
+
+st.markdown("In Japanese, Kango are somewhat analogous to French words in English. \
+These words tend to be more technical or sophisticated than other words.")
+
+st.markdown("We also notice orderings when counting the percentage of Wago and Gairaigo as well.")
+
+st.markdown(
+    '<div class="dataframe-div">' + word_origin_table.to_html() + "</div>"
+    , unsafe_allow_html=True)
+
+###
+# MOST IMPORTANT FACTORS
+###
+st.markdown("## Which factors matter the most?")
+
+st.markdown("We've just found a number of statistics that lead to orderings in the data, \
+but which statistics matter the most?")
+
+st.markdown("To answer this, we can look at a correlation heatmap between each of the variables \
+and observe which statistics correlate the most strongly with the video's level.")
+
+
+render_vanilla_heatmap()
+
+st.markdown("In case you're not familiar with stuff like this, numbers close to 1 or -1 \
+represent a high level of correlation and numbers close to 0 represent a low level of correlation. \
+Positive numbers represent a positive relationship between the variables and negative numbers represent a \
+reverse relationship between the variables.")
+
+st.markdown("Using a statistics rule of thumb and removing all variables that have correlations \
+weaker than 0.3 (and more than -0.3), we can identify the variables with the strongest correlations.")
+
+
+
+
 if st.checkbox('Flip and sort'):
     render_level_col_ordered()
 else:
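Lastly, the subordinating-conjunction statistic behind get_sconj_hist can be sketched as a share of POS-tagged tokens; the (word, pos) input format is an assumption, since the tagging code is outside this diff:

    def sconj_percentage(tagged_tokens: list[tuple[str, str]]) -> float:
        """Percentage of tokens tagged SCONJ (Universal Dependencies POS)."""
        return 100 * sum(pos == "SCONJ" for _, pos in tagged_tokens) / len(tagged_tokens)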