41 changes: 41 additions & 0 deletions docs/supported_publishers.md
@@ -1713,6 +1713,47 @@
</table>


## ID-Publishers

<table class="publishers id">
<thead>
<tr>
<th>Class&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;</th>
<th>Name&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;</th>
<th>URL&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;</th>
<th>Languages</th>
<th>Missing&#160;Attributes</th>
<th>Deprecated&#160;Attributes</th>
<th>Additional&#160;Attributes&#160;&#160;&#160;&#160;</th>
</tr>
</thead>
<tbody>
<tr>
<td>
<code>MediaIndonesia</code>
</td>
<td>
<div>Media Indonesia</div>
</td>
<td>
<a href="https://www.mediaindonesia.com/">
<span>www.mediaindonesia.com</span>
</a>
</td>
<td>
<code>id</code>
</td>
<td>
<code>images</code>
<code>topics</code>
</td>
<td>&#160;</td>
<td>&#160;</td>
</tr>
</tbody>
</table>


## IL-Publishers

<table class="publishers il">
2 changes: 2 additions & 0 deletions src/fundus/publishers/__init__.py
@@ -14,6 +14,7 @@
from fundus.publishers.es import ES
from fundus.publishers.fr import FR
from fundus.publishers.gl import GL
from fundus.publishers.id import ID
from fundus.publishers.il import IL
from fundus.publishers.ind import IND
from fundus.publishers.isl import ISL
@@ -83,6 +84,7 @@ class PublisherCollection(metaclass=PublisherCollectionMeta):
    es = ES
    fr = FR
    gl = GL
    id = ID
    il = IL
    ind = IND
    isl = ISL
15 changes: 15 additions & 0 deletions src/fundus/publishers/id/__init__.py
@@ -0,0 +1,15 @@
from fundus.publishers.base_objects import Publisher, PublisherGroup
from fundus.publishers.id.media_indonesia import MediaIndonesiaParser
from fundus.scraping.filter import inverse, regex_filter
from fundus.scraping.url import Sitemap


class ID(metaclass=PublisherGroup):
    default_language = "id"

    MediaIndonesia = Publisher(
        name="Media Indonesia",
        domain="https://www.mediaindonesia.com/",
        parser=MediaIndonesiaParser,
        sources=[Sitemap("https://mediaindonesia.com/sitemap.xml")],
    )
41 changes: 41 additions & 0 deletions src/fundus/publishers/id/media_indonesia.py
@@ -0,0 +1,41 @@
import datetime
from typing import List, Optional

from lxml.cssselect import CSSSelector

from fundus.parser import BaseParser, ParserProxy
from fundus.parser.base_parser import attribute
from fundus.parser.data import ArticleBody
from fundus.parser.utility import (
    extract_article_body_with_selector,
    generic_author_parsing,
    generic_date_parsing,
    generic_topic_parsing,
    image_extraction,
)


class MediaIndonesiaParser(ParserProxy):
    class V1(BaseParser):
Review comment (Collaborator):
The `images` attribute seems to be missing.

Review comment (Collaborator):
The `topics` attribute also seems to be missing. Topics can be read from the `keywords` meta attribute.
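
As a rough illustration of the review comment about topics, the sketch below reads the `keywords` meta tag from a page head and splits it into a topic list. It uses only the Python standard library as a stand-in. In the parser itself the keywords would presumably be passed to `generic_topic_parsing`, which the diff already imports. The names `MetaKeywords`, `parse_topics`, and `SAMPLE_HEAD`, as well as the sample keyword values, are hypothetical and not taken from the PR or from the real page.

```python
from html.parser import HTMLParser
from typing import List, Optional


class MetaKeywords(HTMLParser):
    """Collects the content attribute of <meta name="keywords" ...>."""

    def __init__(self) -> None:
        super().__init__()
        self.keywords: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attributes = dict(attrs)
            if attributes.get("name") == "keywords":
                self.keywords = attributes.get("content")


def parse_topics(html: str) -> List[str]:
    """Return the comma-separated keywords as a list of stripped topics."""
    parser = MetaKeywords()
    parser.feed(html)
    if not parser.keywords:
        return []
    return [topic.strip() for topic in parser.keywords.split(",") if topic.strip()]


# Hypothetical head section; the keyword values are made up for illustration.
SAMPLE_HEAD = '<head><meta name="keywords" content="KPK, Raja Ampat, nikel"></head>'
topics = parse_topics(SAMPLE_HEAD)
print(topics)
```

A page without a `keywords` meta tag yields an empty list here, which matches the `List[str]` return type used for `authors` in this parser.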

        _paragraph_selector = CSSSelector("div.article")
Review comment (Collaborator):
This paragraph selector matches the whole article container, so the entire article is returned as one big paragraph. Consider something like `div.article > p`, and make sure it stays consistent with the subheadlines of this article. I would recommend switching to XPaths; I personally prefer them for these more subtle cases.
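
The difference the reviewer describes can be sketched with a small, self-contained example. It uses the standard library's `xml.etree.ElementTree`, whose limited XPath support is enough here, as a stand-in for the lxml selectors used in the parser. The markup in `SAMPLE` is a hypothetical simplification and not the real Media Indonesia page; `text_of` is a helper introduced only for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page structure differs.
SAMPLE = """
<html><body>
  <div class="article">
    <p>First paragraph.</p>
    <div class="related">Baca juga : Related story</div>
    <p>Second paragraph.</p>
  </div>
</body></html>
"""


def text_of(node):
    # Join all text inside a node and normalize whitespace.
    return " ".join("".join(node.itertext()).split())


root = ET.fromstring(SAMPLE)

# Counterpart of the CSS selector "div.article": one node holding everything.
whole_article = [text_of(n) for n in root.findall(".//div[@class='article']")]

# Counterpart of "div.article > p": one node per direct <p> child.
paragraphs = [text_of(n) for n in root.findall(".//div[@class='article']/p")]

print(len(whole_article), whole_article)
print(len(paragraphs), paragraphs)
```

Selecting the container yields a single string that also contains the related-story teaser, while selecting the direct `<p>` children yields one entry per paragraph and leaves the teaser out. This is the same effect visible in the test data of this PR, where the body is a single paragraph that includes the "Baca juga" links.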

        _subheadline_selector = CSSSelector("div.article > h2")
Review comment (Collaborator):
This article uses a different formatting for its subheadlines.


        @attribute
        def title(self) -> Optional[str]:
            return self.precomputed.ld.bf_search("headline")

        @attribute
        def body(self) -> Optional[ArticleBody]:
            return extract_article_body_with_selector(
                self.precomputed.doc,
                subheadline_selector=self._subheadline_selector,
                paragraph_selector=self._paragraph_selector,
            )

        @attribute
        def authors(self) -> List[str]:
            return generic_author_parsing(self.precomputed.ld.bf_search("author"))

        @attribute
        def publishing_date(self) -> Optional[datetime.datetime]:
            return generic_date_parsing(self.precomputed.ld.bf_search("datePublished"))
20 changes: 20 additions & 0 deletions tests/resources/parser/test_data/id/MediaIndonesia.json
@@ -0,0 +1,20 @@
{
"V1": {
"authors": [
"Andhika"
],
"body": {
"summary": [],
"sections": [
{
"headline": [],
"paragraphs": [
"Kondisi Tambang di Raja Ampat(Auriga Nusantara) Komisi Pemberantasan Korupsi (KPK) mengaku hingga saat ini belum menemukan surat keputusan (SK) resmi terkait pencabutan empat izin usaha pertambangan (IUP) nikel di Raja Ampat, Papua Barat Daya. Padahal, pemerintah sudah mengumumkan pencabutan izin pada Juni 2025. Kepala Satuan Tugas Koordinasi dan Supervisi KPK Wilayah V, Dian Patria, mengatakan pihaknya masih mencari kejelasan mengenai dokumen resmi pencabutan tersebut. “Dicabut di Istana Negara bulan Juni, tapi terus terang sampai detik ini kami belum pernah melihat SK pencabutannya,” ujar Dian di Gedung Merah Putih KPK, Jakarta, Selasa (21/10). Dian menjelaskan bahwa tim KPK telah menelusuri ke sejumlah kementerian, termasuk Kementerian ESDM dan Kementerian Investasi/Badan Koordinasi Penanaman Modal (BKPM), namun belum memperoleh dokumen yang dimaksud.Baca juga : Hasil Penyelidikan Tambang Nikel Raja Ampat Segera Dirilis “Kami tanya ke Minerba, jawabnya di BKPM. Kami tanya ke BKPM, katanya belum ada surat dari Minerba. Setelah dicek ulang, katanya surat sudah masuk dan sedang diproses,” paparnya. Ia pun mempertanyakan keseriusan pemerintah dalam menindaklanjuti pencabutan empat IUP tambang nikel Raja Ampat yang sempat diumumkan secara publik. “Apakah pemerintah benar-benar serius mencabut empat IUP di Raja Ampat yang diumumkan di Istana Negara? Karena sampai sekarang tidak ada dokumennya sama sekali,” tegas Dian.Baca juga : Pengamat Soroti Peredaran Gambar AI Raja Ampat, Partisipasi Publik yang Sehat Harus Dilandasi Fakta Meski demikian, KPK memastikan tidak ada aktivitas pertambangan di empat lokasi tersebut berdasarkan hasil pemantauan lapangan. Empat perusahaan yang izin usahanya dicabut adalah PT Anugerah Surya Pratama, PT Nurham, PT Mulia Raymond Perkasa, dan PT Kawei Sejahtera Mining. Pencabutan dilakukan karena perusahaan-perusahaan itu terbukti melakukan pelanggaran lingkungan di kawasan geowisata dan geopark Raja Ampat. 
Sebelumnya, Menteri ESDM menyebut langkah pencabutan IUP tambang Raja Ampat merupakan bagian dari upaya menjaga kawasan geopark Raja Ampat agar tidak rusak akibat aktivitas tambang, sekaligus memastikan pengelolaan sumber daya alam tetap berkelanjutan. Cek berita dan artikel yg lain di Google News dan dan ikuti WhatsApp channel mediaindonesia.com Editor : Andhika"
]
}
]
},
"publishing_date": "2025-10-22 06:59:00+07:00",
"title": "KPK tak Temukan Surat Pencabutan IUP Nikel Raja Ampat Apa Iya Betul Dicabut"
}
}
Binary file not shown.
6 changes: 6 additions & 0 deletions tests/resources/parser/test_data/id/meta.info
@@ -0,0 +1,6 @@
{
"MediaIndonesia_2025_10_22.html.gz": {
"url": "https://mediaindonesia.com/ekonomi/822959/kpk-tak-temukan-surat-pencabutan-iup-nikel-raja-ampat-apa-iya-betul-dicabut",
"crawl_date": "2025-10-22 02:09:31.055285"
}
}