
I'm trying to split Arabic text into individual words. Here's sample code:

var str = "المادة 1 يولد جميع الناس أحرارًا متساوين في الكرامة والحقوق. وقد وهبوا عقلاً وضميرًا وعليهم أن يعامل بعضهم بعضًا بروح الإخاء.";
var strWithHashtag = "المادة 1 يولد جميع الناس أحرارًا متساوين في الكرامة والحقوق. وقد وهبوا عقلاً وضميرًا وعليهم أن #يعامل بعضهم بعضًا بروح الإخاء.";
var substrings = strWithHashtag.Split(' ');

The text is copied from https://r12a.github.io/scripts/arabic/, and it's the first paragraph under sample (arabic). I have two questions:

  • Why is the period sign placed at the end of str even though it appears as the first character on the web page?
  • When I split the string into individual words, يعامل# becomes #يعامل. How can I keep the original position of the # sign? Eventually, I need to extract hashtags from RTL languages, and so I need # to appear as the first character of the RTL hashtag.
user246392
  • RTL text mixed with LTR is very hard to reason about by looking at it (unless you are fluent in at least one RTL and one LTR language)… You may want to clarify what exactly you *need* to see displayed and how you want that text to be represented in a string. Reading up on direction marks (https://en.wikipedia.org/wiki/Left-to-right_mark) and on bidi in general (https://en.wikipedia.org/wiki/Bidirectional_text) could help clarify what exactly you want to achieve. – Alexei Levenkov Feb 25 '20 at 01:27
  • I thought my question was clear. You could paste the code in a console app and see each element in `substrings`. One of the elements will show #يعامل, but I want it to show the # sign as it was part of the original string (i.e. `يعامل#`) – user246392 Feb 25 '20 at 01:30
  • Check https://www.w3.org/International/questions/qa-html-dir –  Feb 25 '20 at 01:39
  • Related https://stackoverflow.com/questions/3601760/html-arabic-support –  Feb 25 '20 at 01:45
  • My Arabic is *very basic*, but, by my thinking, the "." is the first character in the `str` string. Arabic reads from right to left. – Flydog57 Feb 25 '20 at 02:41

1 Answer


Why is the period sign placed at the end of str even though it appears as the first character on the web page?

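Because the IDE (such as Visual Studio) lays the line out left-to-right (LTR). The string itself is stored in logical order, i.e. the order in which the text is read, so the period really is the last character of str. In an LTR view that trailing period is drawn at the right end of the line; in a right-to-left (RTL) view, such as the web page, the same character is drawn at the left end. So don't worry about how it looks inside the code: as long as it displays correctly on an RTL page, it is in the correct position.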
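One way to check this without relying on how an editor or console draws the text is to inspect the characters by index. The following is a minimal sketch, not part of the original code: the class name `LogicalOrderDemo`, the helper `LastChar`, and the shortened sample (the last two words of str) are made up for illustration.

```csharp
using System;

public static class LogicalOrderDemo
{
    // Assumption: a shortened sample, the last two words of str from the question.
    public const string Sample = "بروح الإخاء.";

    // Returns the last UTF-16 code unit in logical (storage) order.
    public static char LastChar(string s) => s[s.Length - 1];

    public static void Main()
    {
        // Print code points instead of glyphs so the result does not
        // depend on how the console lays out bidirectional text.
        Console.WriteLine($"First: U+{(int)Sample[0]:X4}");
        Console.WriteLine($"Last:  U+{(int)LastChar(Sample):X4}");
    }
}
```

The last code point printed should be U+002E (the period), whichever side of the line the editor happens to draw it on.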

When I split the string into individual words, يعامل# becomes #يعامل. How can I keep the original position of the # sign? Eventually, I need to extract hashtags from RTL languages, and so I need # to appear as the first character of the RTL hashtag.

Same thing here. The correct one is #يعامل: the # is stored as the first character of the hashtag, and Split does not move it. If you switch to an RTL view, the # is drawn as the first character from the right, which is the correct position in Arabic.
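For the hashtag extraction mentioned in the question, a rough sketch along these lines may be enough. It assumes a hashtag is a # followed by one or more word characters (`\w` in .NET matches Unicode letters, so Arabic letters are covered); the class name `HashtagDemo`, the method `ExtractHashtags`, and the shortened sample text are hypothetical and not from the original code.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HashtagDemo
{
    // Assumption: a hashtag is '#' followed by one or more word characters.
    private static readonly Regex HashtagPattern = new Regex(@"#\w+");

    // Collects every match in logical (storage) order.
    public static List<string> ExtractHashtags(string text)
    {
        var tags = new List<string>();
        foreach (Match m in HashtagPattern.Matches(text))
        {
            tags.Add(m.Value);
        }
        return tags;
    }

    public static void Main()
    {
        // Assumption: a shortened piece of strWithHashtag from the question.
        var text = "وعليهم أن #يعامل بعضهم بعضًا";
        foreach (var tag in ExtractHashtags(text))
        {
            // Check index 0 rather than trusting how the console draws the tag.
            Console.WriteLine($"{tag.Length} chars, starts with '#': {tag[0] == '#'}");
        }
    }
}
```

Testing `tag[0] == '#'` (or `tag.StartsWith("#")`) works on the stored order, so it behaves the same for RTL and LTR hashtags.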

iSR5