NLP of Chinese Text Data (with code)
Machine Learning and Big Data—Melissa Dell and Matthew Harding
A brief introduction to NLP of Chinese text data.
OCR Tables and Parse the Output
Pros: OCR software frequently misidentifies 4/A, 8/B, 0/O, and other characters. The program created by Prof. Dell could address these issues.
Cons: In my experience with the PDF version of the Japanese census, the program could extract half of the data without manual double-checking; the other half still had to be checked by hand.
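The look-alike confusions mentioned above (4/A, 8/B, 0/O) can be partly repaired after OCR with a simple rule. The following is a minimal sketch, not Prof. Dell's program: the mapping `LETTER_TO_DIGIT` and the helper names `fix_numeric_cell` and `needs_review` are hypothetical, and the rule assumes that a table cell made up only of digits and look-alike letters was meant to be a number.

```python
# Hypothetical confusion map: letters that OCR may emit in place of digits.
# Only the pairs 4/A, 8/B, 0/O come from the notes; extend it for your data.
LETTER_TO_DIGIT = {"A": "4", "B": "8", "O": "0"}


def fix_numeric_cell(cell: str) -> str:
    """Replace look-alike letters with digits in cells that look numeric.

    A cell is treated as numeric only if it contains at least one real digit
    and every character is either a digit or a known look-alike letter.
    Anything else is returned unchanged (apart from surrounding whitespace).
    """
    stripped = cell.strip()
    if not stripped:
        return stripped
    has_digit = any(ch.isdigit() for ch in stripped)
    all_digit_like = all(ch.isdigit() or ch in LETTER_TO_DIGIT for ch in stripped)
    if not (has_digit and all_digit_like):
        return stripped
    return "".join(LETTER_TO_DIGIT.get(ch, ch) for ch in stripped)


def needs_review(cell: str) -> bool:
    """Flag cells that still mix digits and letters after the automatic fix."""
    fixed = fix_numeric_cell(cell)
    has_digit = any(ch.isdigit() for ch in fixed)
    has_alpha = any(ch.isalpha() for ch in fixed)
    return has_digit and has_alpha


if __name__ == "__main__":
    row = ["1O8", " 12B ", "Tokyo", "3x7"]
    print([fix_numeric_cell(c) for c in row])
    print([c for c in row if needs_review(c)])
```

Cells that `needs_review` flags are the ones a person would still have to check against the scanned page, which is where the manual effort described in the Cons line goes.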
Text similarity using text2vec
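To make the idea of text similarity concrete, here is a minimal sketch that does not use text2vec. It is a standard-library stand-in that compares two Chinese strings by the cosine similarity of their character bigram counts. The function names `char_ngrams` and `cosine_similarity` and the choice of bigrams are assumptions made for illustration; character n-grams are used because they avoid the need for a word segmenter.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int = 2) -> Counter:
    """Count overlapping character n-grams, ignoring whitespace."""
    chars = [ch for ch in text if not ch.isspace()]
    return Counter("".join(chars[i:i + n]) for i in range(len(chars) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    s1 = "我喜欢自然语言处理"
    s2 = "我喜欢机器学习"
    s3 = "今天天气很好"
    print(cosine_similarity(char_ngrams(s1), char_ngrams(s2)))  # shares 我喜, 喜欢
    print(cosine_similarity(char_ngrams(s1), char_ngrams(s3)))  # no shared bigrams
```

A dedicated package would normally replace the raw bigram counts with TF-IDF weights or learned embeddings; the cosine step stays the same.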
https://www.youtube.com/watch?v=BAP6l2uGAHU A practical example of Chinese text classification using an RNN.
https://github.com/huang027/LDA Latent Dirichlet Allocation Topic Model
https://github.com/jiaeyan/Jiayan#2 Jiayan, the first NLP toolkit designed for Classical Chinese. It supports lexicon construction, tokenization, POS tagging, sentence segmentation, and punctuation.