User talk:JL-Bot

Not sure this is configured correctly


On this page:

It looks like a couple of topics are repeated, and one (the DYK section) isn't listed at all. How can I fix this? --evrik (talk) 16:25, 7 June 2022 (UTC)

@Evrik: Do you have examples of things that are repeated/not listed? Headbomb {t · c · p · b} 22:30, 7 June 2022 (UTC)[reply]
For the DYK section, there was an extra equals sign on the end of the line. I removed that and re-ran. The DYK section is now present. I'm not seeing any duplicate topics. Let me know what you are questioning and I will take a look at it. -- JLaTondre (talk) 01:06, 8 June 2022 (UTC)[reply]
The bot currently outputs by category. There is overlap in the categories (for example, I would assume all "Main page featured lists" are "Featured lists"), but they are not the same (not all featured content appears on the main page). So what is the ask here?
  1. When outputting a featured type (ex. "Featured lists"), provide an option to add a Wikipedia icon if it appeared on the main page?
    1. It would then be up to specifier to only include the one category type and the option; or
    2. The bot would only display the larger category if the option is set even if both categories are specified
  2. Provide an option to consolidate by page type (article, list, picture, sound) where it would show a different icon for each recognized type? So for example, the new section could be "Recognized articles" and you would get a different icon and date (where applicable) for each of "Featured article", "Former featured article", "DYK", etc.
-- JLaTondre (talk) 11:55, 8 June 2022 (UTC)[reply]

Well, TFAs should be either current or former FAs. So a condensed option would, IMO, require that both FA and FFA are covered. And then the TFAs could be 'merged' into FA/FFA. Likewise for TFLs, which would be merged into both FL and FFL sections. So if you have

 |content-featured-articles
 |content-former-featured-articles
 |content-mainpage-featured
 

the output would be as is, but if you had something like

|content-featured-articles
|content-former-featured-articles
|content-mainpage-featured=condensed (or something equivalent)

then the output would be merged as above (or similar, depending on whether or not icons were desired) Headbomb {t · c · p · b} 12:38, 8 June 2022 (UTC)[reply]

@JLaTondre: if you run the bot on Portal:Scouting/Recognized_content I can do a full mockup. Headbomb {t · c · p · b} 01:55, 12 June 2022 (UTC)[reply]
Done. -- JLaTondre (talk) 15:40, 12 June 2022 (UTC)[reply]
Mockup. POTD icons will be supported once Template_talk:Icon#POTD_support is enacted. Headbomb {t · c · p · b} 16:32, 12 June 2022 (UTC)[reply]
@JLaTondre and Headbomb: Thank you both! --evrik (talk) 02:33, 13 June 2022 (UTC)[reply]

FM captions


Is there a reason why the "caption" option actually displays the media's title rather than its caption? The titles are so rarely helpful, while the captions would definitely be! MeegsC (talk) 17:38, 7 August 2022 (UTC)[reply]

Primarily performance, but also the lack of standard formats. Captions are on the pages that use the images and it would add significant time to go pull them. The task already takes most of a day to run. Images can also appear on multiple pages with significantly different captions. Captions are not always in a standard format, which makes pulling them from the page text problematic. -- JLaTondre (talk) 12:00, 9 August 2022 (UTC)
Could we not use the captions that are in the picture file, and default to the title only if the picture file doesn't have an English caption? Most file captions are better than the title! MeegsC (talk) 13:14, 9 August 2022 (UTC)[reply]
The description field? Yes, that might work. I will look into it. -- JLaTondre (talk) 16:36, 13 August 2022 (UTC)[reply]

Highlight journal= from different character set


If, for example, you have |journal=Аcta Вaltico‑Slavica, where А and В come from the Cyrillic alphabet and the others from the Latin alphabet, it would be useful in the compilation to highlight this sort of thing, i.e. when an entry has characters from two different alphabets. If it's from a single alphabet, no highlighting is needed.


Journal1 | Type2 | Target1 | Type2 | Citations | Articles | Citations/article | Search
Аcta Вaltico‑Slavica | ? | Acta Baltico-Slavica | ? | 1 | 1 | 1.000 | Wikipedia (J·M·T), Google (J·M·T)

In general, there could be a color scheme like

  • Red = Latin
  • Orange = Arabic
  • DarkKhaki = Chinese
  • Green = Cyrillic
  • Blue = Greek
  • Indigo = Hebrew
  • Violet = Japanese
  • DeepSkyBlue = Other1
  • MediumPurple = Other2 (only used when Other1 is already used)
  • DeepPink = Other3 (only used when Other2 is already used, might not be needed)

Would this be difficult to implement? Headbomb {t · c · p · b} 05:30, 27 August 2025 (UTC)[reply]

Yes, that is doable. Perl, which is what that part of the citation processing is written in, makes it easy to check language scripts. Perl can recognize all the ones listed at perlunicode#Scripts (all the ones you are requesting are on that list). For Chinese, it would really be detecting for Han script - which in my understanding is used for several Eastern languages. It will probably be a couple of weeks before I can complete it. -- JLaTondre (talk) 23:14, 27 August 2025 (UTC)[reply]
Yes, if it's the Han alphabet, then that's the character set that should be highlighted. The point is to detect names that have multiple character sets in them, which should be rare, and usually limited to cases like |journal=The Journal of Things = Το ημερολόγιο των πραγμάτων.
It's probably simpler to collect them and have them all reported on their own WP:JCW/Multiscript subpage, with that highlighting only in effect on that page.
Headbomb {t · c · p · b} 00:59, 28 August 2025 (UTC)[reply]
A separate page is easier. I can have a separate script for that vs. integrating into the regular output. -- JLaTondre (talk) 23:37, 28 August 2025 (UTC)[reply]
Should it report cases where there is a language template? For example, what should it do with Sidirotrohia ({{langx|el|Σιδηροτροχιά}}) which will produce Sidirotrohia (Σιδηροτροχιά) (after the change discussed below)? There are also cases where people enter titles in multiple languages without the use of a template? Should it only report a mismatch when it happens within a single word? -- JLaTondre (talk) 23:53, 28 August 2025 (UTC)[reply]
If there's a language template, that can be ignored IMO. I suppose to start, mismatches could happen across multiple words, this way it could catch things like Acta Whatever А. Headbomb {t · c · p · b} 00:12, 29 August 2025 (UTC)
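For anyone following along, here is a minimal sketch of the multi-script check discussed above. It is written in Python rather than the bot's Perl and is not the bot's code. It assumes the first word of a character's Unicode name (LATIN, CYRILLIC, GREEK, HANGUL, CJK, ...) is a good enough stand-in for its script, which is only an approximation of the script properties Perl exposes. The function names `scripts_in` and `is_multiscript` are made up for illustration.

```python
import unicodedata


def scripts_in(text):
    """Return the set of (approximate) scripts used by the letters in text.

    Assumption: the first word of the Unicode character name stands in
    for the script. Non-letters (spaces, digits, punctuation) are ignored.
    """
    found = set()
    for ch in text:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if name:
            found.add(name.split()[0])
    return found


def is_multiscript(text):
    """True when letters from more than one script appear in text."""
    return len(scripts_in(text)) > 1


if __name__ == "__main__":
    # \u0410 and \u0412 are the Cyrillic look-alikes of Latin A and B.
    print(is_multiscript("\u0410cta \u0412altico-Slavica"))
    print(is_multiscript("Acta Baltico-Slavica"))
```

This version flags any mix of scripts, including across separate words, so a string like Acta Whatever А would be caught.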

I have the code to detect multiple scripts completed. It is returning 2,325 cases in the last dump. The majority are of the format of a single non-Latin script followed by Latin script (or vice versa). For example:

  • 한국한문학연구 (Korean Literature Research)
  • 한국전자통신학회 논문지 = the Journal of the Korea Institute of Electronic Communication Sciences
  • 한국언어문화 [Journal of Korean Language and Culture]
  • ЕтноАнтропоЗум / EthnoAnthropoZoom
  • Езиков свят - Orbis Linguarum
  • Military History Studies (军事历史研究)
  • Linguistic Sciences 语言科学
  • Acta Historica: Труды По Историческим И Обществоведческим Наукам
  • 7iber | حبر

Should I exclude cases where it is a single non-Latin script + space or punctuation + a Latin script (and the reverse order)? It seems like these are valid cases and you are more interested in ones like Artanіya which has a Cyrillic і in the middle of Latin characters? -- JLaTondre (talk) 14:06, 13 September 2025 (UTC)[reply]

I think Script 1 [Separator] Script 2 can be excluded without much loss, so long as each script has 3+ letters in it. This way it excludes Езиков свят - Orbis Linguarum, but includes Acta Whatever А. Headbomb {t · c · p · b} 15:06, 13 September 2025 (UTC)
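The exclusion rule above could be sketched roughly as follows. This is an illustrative Python sketch, not the bot's Perl code. It again assumes the first word of the Unicode character name approximates the script, and it treats any non-letter (space, punctuation, digit) as a separator. The names `script_runs` and `should_report` are hypothetical.

```python
import unicodedata


def script_of(ch):
    """Approximate script of a letter: first word of its Unicode name."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def script_runs(text):
    """Group the letters of text into runs of the same script.

    Each run is [script, letter_count, separated], where separated says
    whether a non-letter sat between this run and the previous letter.
    """
    runs = []
    gap = True  # nothing precedes the first letter
    for ch in text:
        if not ch.isalpha():
            gap = True
            continue
        script = script_of(ch)
        if runs and runs[-1][0] == script:
            runs[-1][1] += 1
        else:
            runs.append([script, 1, gap])
        gap = False
    return runs


def should_report(text, min_letters=3):
    """Report multi-script titles unless they are Script 1 [sep] Script 2
    with at least min_letters letters in each script."""
    runs = script_runs(text)
    if len(runs) < 2:
        return False  # single script, nothing to report
    if len(runs) == 2:
        (_, first_count, _), (_, second_count, separated) = runs
        if separated and min(first_count, second_count) >= min_letters:
            return False  # looks like a title plus its translation
    return True
```

With this rule, Езиков свят - Orbis Linguarum is skipped, while Acta Whatever А (a single Cyrillic letter) and Artanіya (a Cyrillic і inside a Latin word) are still reported.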

Updated Citation Field Extraction & Template Processing


@Headbomb I have been updating how the |journal= | |magazine= field is extracted from the citation templates. Currently, the bot is using a regex to extract the field, but this occasionally gives a bad result due to the complexity of pattern matching against templates within templates, comments, nowiki markup, and everything else people can embed in a citation template. It doesn't happen often, but there are cases that end up in Invalid titles that are due to parsing errors. The new method will use a tokenizer to split the citation templates into parts and pull out the |journal= | |magazine= field. I have found it to be more reliable.
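To illustrate why a tokenizer copes better than a regex here, below is a minimal Python sketch of the general idea: walk the template text, track nesting depth of {{...}} and [[...]], and only split on pipes at depth zero. It is not the bot's implementation (which is in Perl), it ignores comments and nowiki markup for brevity, and the names `split_top_level` and `get_field` are hypothetical.

```python
def split_top_level(body):
    """Split template body on '|' characters that are not nested inside
    {{...}} or [[...]]."""
    parts, current, depth, i = [], [], 0, 0
    while i < len(body):
        two = body[i:i + 2]
        if two in ("{{", "[["):
            depth += 1
            current.append(two)
            i += 2
        elif two in ("}}", "]]"):
            depth = max(depth - 1, 0)
            current.append(two)
            i += 2
        elif body[i] == "|" and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    parts.append("".join(current))
    return parts


def get_field(template, names=("journal", "magazine")):
    """Return the value of the first matching named parameter, or None."""
    text = template.strip()
    if not (text.startswith("{{") and text.endswith("}}")):
        return None
    # Skip the first part, which is the template name.
    for part in split_top_level(text[2:-2])[1:]:
        key, sep, value = part.partition("=")
        if sep and key.strip() in names:
            return value.strip()
    return None
```

For example, `get_field("{{cite journal |title=X |journal=Studime Historike ({{langx|en|Historical Studies}}) |year=2001}}")` keeps the nested langx template intact instead of cutting the value at its inner pipes.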

As part of this change, I needed to tweak how the template expansion works. That led down a rabbit hole of checking all the template expansions that have been implemented and validating them.

I came up with some items that could use your input.

  • There are a handful of cases where |journal=. is used (a lone period). The old method would miss these, whereas the new method will return a period. Do you want these to be reported or dropped? On the article output, this causes an extra period when the template is expanded. For example, at Arleen McCarty Hynes#References, you can see a ". ." on 20, 34, 35. I wasn't sure if this is something you wanted to clean up.
  • The current processing removes ({{langx}}) from the end of a citation, but not other language templates. So |journal=Studime Historike ({{langx|en|Historical Studies}}) becomes Studime Historike, but |journal=Hubei Wenshi Ziliao ({{lang|zh-hans|湖北文史资料}}) becomes Hubei Wenshi Ziliao (湖北文史资料). My assumption is that any language templates in parentheses (or brackets) at the end of a citation should be removed and only the non-parenthesized (non-bracketed) section returned. Is this correct?

It will be a bit before this new version is ready to go live. I will do it separately from the language script detection request above so that it's easier to see any unintended effects of either change. JLaTondre (talk) 00:27, 28 August 2025 (UTC)[reply]

  • |journal=. should be reported yeah. No special processing needed, though I suppose they could also be added to WP:JCW/Dots. It definitely needs cleaning up.
  • Ideally, I think the best thing is to report what is rendered, so |journal=Studime Historike ({{langx|en|Historical Studies}}) can be treated as |journal=Studime Historike (Historical Studies) and |journal=Hubei Wenshi Ziliao ({{lang|zh-hans|湖北文史资料}}) can be treated as |journal=Hubei Wenshi Ziliao (湖北文史资料)
Headbomb {t · c · p · b} 00:47, 28 August 2025 (UTC)[reply]
I will make sure the lone periods end up on Dots.
I will remove the special handling of langx in parentheses. I did look at using the API to expand templates so that I wouldn't have to hard code processing them. This would have simplified things as I would not have to add handling for new ones when they pop up in dumps or make updates if there is a change in how they operate. However, it would cause additional items to appear in the output that I don't believe you want. Some examples:
  • {{ill|Die Sprache|de}} produces Die Sprache [de] where I believe only Die Sprache should be output
  • {{nihongo|BOMB|[[:ja:BOMB|BOMB]]}} produces BOMB (BOMB) where I believe only BOMB should be output
But let me know if that is incorrect. -- JLaTondre (talk) 00:06, 29 August 2025 (UTC)[reply]
You believe correctly. Headbomb {t · c · p · b} 00:13, 29 August 2025 (UTC)[reply]
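The hard-coded handling agreed on above could look something like the following Python sketch. It is only an illustration of the agreed outputs, not the bot's Perl code. The table `KEEP_PARAM` and the function `render_title` are hypothetical, only simple non-nested templates are handled, and named parameters inside the template (which would shift positional counting) are not accounted for.

```python
import re

# Hypothetical table: template name -> 1-based positional parameter to keep.
KEEP_PARAM = {"lang": 2, "langx": 2, "ill": 1, "nihongo": 1}

# Matches a template with no nested templates inside it.
TEMPLATE = re.compile(r"\{\{([^{}|]+)\|([^{}]*)\}\}")


def _expand(match):
    """Replace a known template with the one parameter worth keeping."""
    name = match.group(1).strip().lower()
    params = match.group(2).split("|")
    index = KEEP_PARAM.get(name)
    if index is None or index > len(params):
        return match.group(0)  # unknown template: leave untouched
    return params[index - 1].strip()


def render_title(value):
    """Return value with known language/link templates reduced to text."""
    return TEMPLATE.sub(_expand, value)
```

Under these assumptions, `render_title("Hubei Wenshi Ziliao ({{lang|zh-hans|湖北文史资料}})")` gives Hubei Wenshi Ziliao (湖北文史资料), and `render_title("{{ill|Die Sprache|de}}")` gives Die Sprache, matching the outputs described above.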

WP:PACKERS Good Articles


There are 202 Good Articles in Category:GA-Class Green Bay Packers articles, yet JL-Bot only says there are 192. Any reason for the discrepancy, Headbomb? « Gonzo fan2007 (talk) @ 19:42, 29 August 2025 (UTC)

Investigating... the guilties are
Headbomb {t · c · p · b} 19:47, 29 August 2025 (UTC)[reply]
@Gonzo fan2007: those all seem to have been promoted to GA status in the last week. You just need to wait until the next scheduled run, usually on Saturdays. Headbomb {t · c · p · b} 19:50, 29 August 2025 (UTC)[reply]
Dohhhh!!! Sorry, noob mistake. Thanks for the quick reply Headbomb. « Gonzo fan2007 (talk) @ 19:58, 29 August 2025 (UTC)[reply]

Could you add..


Unbalanced brackets? Like any entry like Quart. J. Math (Oxford.

Thinking <>, [], (), {}, ‹›, «», ⟨⟩, ≪≫

Thinking they could be added as a second line to WP:JCW/Invalid. Headbomb {t · c · p · b} 15:07, 9 September 2025 (UTC)[reply]

I will do after the above items. -- JLaTondre (talk) 22:27, 9 September 2025 (UTC)[reply]
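A minimal sketch of what an unbalanced-bracket check could look like, in Python rather than the bot's Perl, using the pairs listed in the request. It is a plain stack-based matcher; the name `has_unbalanced_brackets` is made up, and treating < and > as brackets will also flag titles that use them as ordinary symbols.

```python
# Opening bracket -> expected closing bracket, per the list in the request.
PAIRS = {
    "<": ">", "[": "]", "(": ")", "{": "}",
    "\u2039": "\u203a",  # ‹ ›
    "\u00ab": "\u00bb",  # « »
    "\u27e8": "\u27e9",  # ⟨ ⟩
    "\u226a": "\u226b",  # ≪ ≫
}
CLOSERS = set(PAIRS.values())


def has_unbalanced_brackets(text):
    """True if any bracket is unclosed, unopened, or closed out of order."""
    expected = []  # stack of closing brackets we are waiting for
    for ch in text:
        if ch in PAIRS:
            expected.append(PAIRS[ch])
        elif ch in CLOSERS:
            if not expected or expected.pop() != ch:
                return True
    return bool(expected)
```

With this check, Quart. J. Math (Oxford. is flagged because the opening parenthesis is never closed, while Quart. J. Math (Oxford) passes.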