Applying a replacement map to characters in Emacs
The Problem
I have text in Cyrillic and need to replace individual characters with their transliteration. I have a tiny JSON object of the mappings:
{"в": "v","а": "a","ф": "f","ё": "yo","д": "d","ж": "zh","ы": "y","э": "e","л": "l","щ": "shch","я": "ya","й": "j","у": "u","н": "n","г": "g","с": "s","п": "p","ч": "ch","б": "b","х": "kh","е": "ye","ъ": "\"","з": "z","ю": "yu","ь": "'","ш": "sh","о": "o","к": "k","и": "i","ц": "ts","м": "m","т": "t","р": "r"}
And I have a number of files that contain lists with entries like
<p>основа</p>
<p>заставил</p>
<p>Лобзик</p>
Given those two inputs, how can I quickly add links to files named with the transliteration of these words? More generally, if I had a different JSON map, how would I change “abc” to “alphabravocharlie”?
Solution: JSON -> Hash Table -> Replacement Function
The process went pretty smoothly, with only a couple of pitfalls. Here was the winning strategy:
- Convert the JSON transliteration mappings into an Emacs hash table
  - I had to extend the JSON table to include both capitalizations of Cyrillic, and to replace spaces with underscores per my particular project needs
- apply a function leveraging (gethash json) to every char of an input string
- Make entry functions: a primary one that takes a string and outputs the transliteration, and a secondary one that takes a marked region and replaces it with the output of the first function. I only turned out to need the first because…
- Perform a regexp find-and-replace on the desired sections to change them completely
Code
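(require 'json)
(require 'cl-lib) ; provides cl-flet*
(defun tsa/transliterate-cyrillic (in-string)
"Return IN-STRING with each Cyrillic character replaced by its transliteration."
(interactive "sText to transliterate: ")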
(let* ((json-object-type 'hash-table)
(json-array-type 'list)
(json-key-type 'string)
(json (json-read-from-string "{\"в\": \"v\",\"а\": \"a\",\"ф\": \"f\",\"ё\": \"yo\",\"д\": \"d\",\"ж\": \"zh\",\"ы\": \"y\",\"э\": \"e\",\"л\": \"l\",\"щ\": \"shch\",\"я\": \"ya\",\"й\": \"j\",\"у\": \"u\",\"н\": \"n\",\"г\": \"g\",\"с\": \"s\",\"п\": \"p\",\"ч\": \"ch\",\"б\": \"b\",\"х\": \"kh\",\"е\": \"ye\",\"ъ\": \"\'\",\"з\": \"z\",\"ю\": \"yu\",\"ь\": \"'\",\"ш\": \"sh\",\"о\": \"o\",\"к\": \"k\",\"и\": \"i\",\"ц\": \"ts\",\"м\": \"m\",\"т\": \"t\",\"р\": \"r\", \"В\": \"V\",\"А\": \"A\",\"Ф\": \"F\",\"Ё\": \"Yo\",\"Д\": \"D\",\"Ж\": \"Zh\",\"Ы\": \"Y\",\"Э\": \"E\",\"Л\": \"L\",\"Щ\": \"Shch\",\"Я\": \"Ya\",\"Й\": \"J\",\"У\": \"U\",\"Н\": \"N\",\"Г\": \"G\",\"С\": \"S\",\"П\": \"P\",\"Ч\": \"Ch\",\"Б\": \"B\",\"Х\": \"Kh\",\"Е\": \"Ye\",\"Ъ\": \"\'\",\"З\": \"Z\",\"Ю\": \"Yu\",\"Ь\": \"'\",\"Ш\": \"Sh\",\"О\": \"O\",\"К\": \"K\",\"И\": \"I\",\"Ц\": \"Ts\",\" \"\: \"_\",\"М\": \"M\",\"Т\": \"T\",\"Р\": \"R\"}")))
(cl-flet* ((replace-char (x) (gethash (char-to-string x) json (char-to-string x)))
(replace-all (s) (apply #'concat (mapcar #'replace-char s))))
(replace-all in-string))))
(defun tsa/cyrillic-area (beg end)
"Replace the Cyrillic text between BEG and END with its transliteration."
(interactive "r")
(let ((in-string (buffer-substring-no-properties beg end)))
(save-excursion
(delete-region beg end)
(goto-char beg)
(insert (tsa/transliterate-cyrillic in-string)))))
;; (tsa/transliterate-cyrillic "лЛЛЛобзик") ;; "lLLLobzik"
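The same per-character lookup generalizes beyond Cyrillic, which covers the “abc” to “alphabravocharlie” version of the question. Here is a minimal sketch, assuming the mapping is given as an alist of (CHARACTER . STRING) pairs instead of parsed JSON. The name my/replace-chars is made up for illustration and is not part of the code above.

```elisp
;; Hypothetical helper, not part of the original solution.
(defun my/replace-chars (mapping string)
  "Return STRING with each character replaced according to MAPPING.
MAPPING is an alist of (CHARACTER . REPLACEMENT-STRING) pairs.
Characters without an entry are kept unchanged."
  (mapconcat (lambda (ch)
               ;; Fall back to the character itself when it is unmapped.
               (alist-get ch mapping (char-to-string ch)))
             string
             ""))

(my/replace-chars '((?a . "alpha") (?b . "bravo") (?c . "charlie")) "abc")
;; => "alphabravocharlie"
```

An alist keeps the sketch short. For a large table, the hash table used above is probably the better fit.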
replace-regexp with elisp call
Mark this region:
<p>основа</p>
<p>заставил</p>
<p>Лобзик</p>
Replacement with elisp call:
M-x replace-regexp
<p>\(.*?\)</p>
# replace with:
<li data-audioname="\,(tsa/transliterate-cyrillic \1)">\1</li>
Result:
<li data-audioname="osnova">основа</li>
<li data-audioname="zastavil">заставил</li>
<li data-audioname="Lobzik">Лобзик</li>
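Since there are a number of files to process, the same replacement can also be done from Lisp instead of through the interactive prompt. The following is a sketch under stated assumptions, not the method used above: my/paragraphs-to-audio-items is a hypothetical name, and the transliteration function is passed in as an argument so the block stands alone.

```elisp
;; Hypothetical helper, not part of the original solution.
(defun my/paragraphs-to-audio-items (beg end translit-fn)
  "Rewrite each <p>WORD</p> between BEG and END as an <li> element.
TRANSLIT-FN is called with WORD and must return the string to use
as the data-audioname attribute."
  (save-excursion
    (goto-char beg)
    ;; A marker keeps tracking the end of the region as the text grows.
    (let ((end-marker (copy-marker end t)))
      (while (re-search-forward "<p>\\(.*?\\)</p>" end-marker t)
        (let* ((word (match-string 1))
               ;; Protect the match data in case TRANSLIT-FN searches too.
               (name (save-match-data (funcall translit-fn word))))
          ;; FIXEDCASE and LITERAL are both t: insert the text verbatim.
          (replace-match
           (format "<li data-audioname=\"%s\">%s</li>" name word)
           t t)))
      (set-marker end-marker nil))))

;; Usage, assuming the function from the Code section is loaded:
;; (my/paragraphs-to-audio-items (point-min) (point-max)
;;                               #'tsa/transliterate-cyrillic)
```

Calling this inside a loop over file buffers would avoid repeating the M-x replace-regexp prompt for every file.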
Gotchas
- It turned out, contrary to my given JSON, that my project distinguishes between uppercase and lowercase Cyrillic (hint: M-x upcase makes a very quick fix to add an uppercase section to the JSON), and links to multiword audio files with underscores for spaces.
- As it turns out, Emacs has a function standard-display-cyrillic-translit which appears to do almost what I need; however, it asks for versions of transliteration, and it also just modifies the display without allowing me to save the non-Cyrillic in certain places.
- Emacs has a function replace-region-contents that sounds like it is exactly what I want, but actually, not. It wants to replace a section with a buffer of something else, which I wasn’t prepared to figure out.
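The uppercase half of the table from the first gotcha can also be generated rather than typed. A minimal sketch, assuming the lowercase mappings are already in a hash table with string keys, like the one json-read-from-string produces in the Code section. my/add-capitalized-entries is a hypothetical helper, not something Emacs provides.

```elisp
;; Hypothetical helper, not part of the original solution.
(defun my/add-capitalized-entries (table)
  "Add an uppercase counterpart for every lowercase key in TABLE.
TABLE is a hash table mapping one-character strings to strings.
Existing entries are never overwritten.  Return TABLE."
  (let (additions)
    ;; Collect first: adding entries while `maphash' runs is unsafe.
    (maphash (lambda (key value)
               (let ((upper (upcase key)))
                 ;; Keys without a distinct uppercase form (space) are skipped.
                 (unless (string= upper key)
                   (push (cons upper (capitalize value)) additions))))
             table)
    (dolist (pair additions)
      (unless (gethash (car pair) table)
        (puthash (car pair) (cdr pair) table)))
    table))

(let ((table (make-hash-table :test 'equal)))
  (puthash "ж" "zh" table)
  (puthash " " "_" table)
  (my/add-capitalized-entries table)
  (gethash "Ж" table))
;; => "Zh"
```

Using capitalize on the value gives “Zh” rather than “ZH”, which matches the capitalized entries in the table above.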
Resources
- Emacs Stack Exchange answer on JSON parsing from Wasamasa: https://emacs.stackexchange.com/a/27409/17004
- Reddit mapping suggestion from daddyfreddy: https://www.reddit.com/r/emacs/comments/ksras4/applying_a_replacement_map_to_characters_in_emacs/gijgqiu?utm_source=share&utm_medium=web2x&context=3
- Insight from ergoemacs about working with strings and regions: http://ergoemacs.org/emacs/elisp_command_working_on_string_or_region.html