README.md: 73 additions & 1 deletion
@@ -2,6 +2,18 @@

A Python library for interfacing with the Rewind.ai SQLite database.

## Changelog

### 2025-07-04 - Voice Export & Training Data Features
- **NEW**: `--export-own-voice` CLI option for exporting the user's voice transcripts organized by day
- **NEW**: `--speech-source` filter to separate the user's voice (`me`) from other speakers (`others`)
- **NEW**: Multi-format export support: text, JSON, and audio file export
- **NEW**: `--export-format audio` with `--audio-export-dir` for exporting actual M4A audio files
- **NEW**: `my-words.sh` script for generating word clouds from your voice data
- **ENHANCED**: RewindDB core library now supports speech source filtering
- **USE CASE**: Perfect for collecting clean voice training data for LLM fine-tuning
- **FILTER**: Text exports contain only the user's voice (no other speakers); audio exports contain full conversations

## Project Overview

RewindDB is a Python library that provides a convenient interface to the Rewind.ai SQLite database. Rewind.ai is a personal memory assistant that captures audio transcripts and screen OCR data in real-time. This project allows you to programmatically access and search through this data, making it possible to retrieve past conversations, find specific information mentioned in meetings, or analyze screen content from previous work sessions.
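
As a minimal sketch of what that looks like in code (the `RewindDB()` constructor call is illustrative and its configuration details are not shown in this excerpt; the query methods and result fields match the library usage documented later in this README):

```python
import rewinddb

# illustrative assumption: the library exposes a RewindDB class constructable with defaults
db = rewinddb.RewindDB()
try:
    # transcribed words from the last hour, each with its text and absolute timestamp
    words = db.get_audio_transcripts_relative(hours=1)
    for w in words[:10]:
        print(w['absolute_time'], w['word'])
finally:
    db.close()
```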
@@ -89,7 +101,9 @@ python mcp_stdio.py --env-file /path/to/custom/.env

### transcript_cli.py

Retrieve audio transcripts from the Rewind.ai database.
Retrieve audio transcripts from the Rewind.ai database with advanced voice filtering and export capabilities.

#### Basic Transcript Retrieval

```bash
# get transcripts from the last hour
@@ -108,6 +122,44 @@ python transcript_cli.py --relative "7 days" --debug
python transcript_cli.py --relative "1 hour" --env-file /path/to/custom/.env
```

#### Voice Source Filtering

```bash
# filter for only your own voice
python transcript_cli.py --relative "1 hour" --speech-source me

# filter for other speakers only
python transcript_cli.py --relative "1 day" --speech-source others

# filter works with any time range
python transcript_cli.py --from "2025-07-01" --to "2025-07-02" --speech-source me
```

#### Voice Export for Training Data 🎙️

**Perfect for collecting clean voice training data for LLM fine-tuning**

```bash
# export your voice transcripts organized by day (text format)
python transcript_cli.py --export-own-voice "2025-01-01 to 2025-07-04"

# export as JSON with metadata
python transcript_cli.py --export-own-voice "2025-01-01 to 2025-07-04" --export-format json --save-to my_voice.json

# export actual audio files organized by day
python transcript_cli.py --export-own-voice "2025-01-01 to 2025-07-04" --export-format audio --audio-export-dir ./my_voice_audio

# generate a word cloud from your voice data (requires the wordcloud command)
./my-words.sh  # automatically uses the last 6 months of your voice data
```

**Key Features:**
- **Clean Training Data**: Text exports contain only YOUR voice, with other speakers filtered out
- **Audio Export**: M4A files organized by day with transcript summaries
- **Multiple Formats**: Text (readable), JSON (structured), Audio (original files)
- **Day Organization**: Perfect for chronological training data or analysis (see the sketch after this list)
- **Word Cloud**: Quick visualization of your most-used words with `my-words.sh`
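
As a sketch of how the day organization can feed a training pipeline, the snippet below writes one plain-text file per day via `get_own_voice_transcripts_by_day`; the helper name and output directory are illustrative, and `db` stands for an already-open RewindDB connection:

```python
import os
import datetime

def export_daily_training_text(db, start: datetime.datetime, end: datetime.datetime,
                               out_dir: str = "voice_training") -> None:
    """write one text file per day containing only your own transcribed words."""
    os.makedirs(out_dir, exist_ok=True)
    # returns {'YYYY-MM-DD': [word dicts], ...}, already filtered to speechSource = 'me'
    by_day = db.get_own_voice_transcripts_by_day(start, end)
    for date_str, words in sorted(by_day.items()):
        text = ' '.join(w['word'] for w in words)
        with open(os.path.join(out_dir, f"{date_str}.txt"), "w") as fh:
            fh.write(text)
```

Each file then holds one day of your own speech, ready for downstream cleaning or tokenization.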

### search_cli.py

Search for keywords across both audio transcripts and screen OCR data.
@@ -352,6 +404,18 @@ from datetime import datetime
start_time = datetime(2023, 5, 11, 13, 0, 0) # 1:00 PM
end_time = datetime(2023, 5, 11, 17, 0, 0) # 5:00 PM
transcripts = db.get_audio_transcripts_absolute(start_time, end_time)

# filter by speech source for voice training data
user_only = db.get_audio_transcripts_relative(hours=1, speech_source='me')
others_only = db.get_audio_transcripts_relative(hours=1, speech_source='others')

# get voice data organized by day for training
transcripts_by_day = db.get_own_voice_transcripts_by_day(start_time, end_time)
for date, transcripts in transcripts_by_day.items():
print(f"{date}: {len(transcripts)} words")
words = [t['word'] for t in transcripts]
text = ' '.join(words)
print(f"Sample: {text[:100]}...")
```

### Retrieving Screen OCR Data
Expand Down Expand Up @@ -415,6 +479,14 @@ Audio snippets are stored on disk at:
#### Transcript Words
Individual words extracted from audio recordings through speech recognition. Each word in the `transcript_word` table includes information about when it occurred within the audio recording (timeOffset), its position in the full text (fullTextOffset), and its duration. Transcript words are linked to their source audio recording.

**Key Fields:**
- `speechSource`: Identifies the speaker; `'me'` for the user's voice, `'others'` for other speakers
- `word`: The transcribed word text
- `timeOffset`: Timing within the audio segment (milliseconds)
- `duration`: Length of the spoken word (milliseconds)

This speaker identification enables clean voice training data export by filtering to only the user's spoken words.
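
As an illustrative sketch of filtering on this field through the Python API (the helper names are hypothetical, and `db` is an already-open RewindDB connection):

```python
from collections import Counter

def speaker_breakdown(db, hours: int = 1) -> Counter:
    """count transcribed words per speech source over the last `hours` hours."""
    rows = db.get_audio_transcripts_relative(hours=hours)
    # each result dict includes a speech_source field ('me' or 'others')
    return Counter(row['speech_source'] for row in rows)

def own_words_text(db, hours: int = 1) -> str:
    """return only your own spoken words as a single string."""
    rows = db.get_audio_transcripts_relative(hours=hours, speech_source='me')
    return ' '.join(row['word'] for row in rows)
```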

#### Frames
Screenshots captured by Rewind.ai at regular intervals as you use your computer. Each frame in the `frame` table includes a timestamp (createdAt) and is linked to the application segment it belongs to. Frames are the visual equivalent of audio recordings, capturing what was on your screen at specific moments.
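
As a rough sketch of working with this table directly (the database path and the storage format of `createdAt` are assumptions here; the library's audio queries handle both epoch-millisecond and ISO-string timestamps, so both are allowed for), counting frames per day could look like this:

```python
import sqlite3
import datetime

def frames_per_day(db_path: str) -> dict:
    """count captured frames per calendar day; the createdAt format is assumed."""
    conn = sqlite3.connect(db_path)
    counts = {}
    for (created_at,) in conn.execute("SELECT createdAt FROM frame"):
        if isinstance(created_at, (int, float)):
            # epoch milliseconds
            day = datetime.datetime.fromtimestamp(created_at / 1000).date().isoformat()
        else:
            # ISO-8601 string, e.g. '2025-07-04T12:34:56.000'
            day = str(created_at)[:10]
        counts[day] = counts.get(day, 0) + 1
    conn.close()
    return counts
```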

rewinddb/core.py: 69 additions & 11 deletions
@@ -92,7 +92,8 @@ def close(self) -> None:
self.conn.close()

def get_audio_transcripts_absolute(self, start_time: datetime.datetime,
end_time: datetime.datetime) -> typing.List[dict]:
end_time: datetime.datetime,
speech_source: typing.Optional[str] = None) -> typing.List[dict]:
"""retrieve audio transcripts within an absolute time range.

queries the audio and transcript_word tables to get transcribed words
@@ -101,6 +102,7 @@ def get_audio_transcripts_absolute(self, start_time: datetime.datetime,
args:
start_time: the start datetime to query from
end_time: the end datetime to query to
speech_source: optional filter for speech source ('me' for user voice, 'others' for other speakers)

returns:
a list of dictionaries containing transcript data
@@ -118,26 +120,36 @@ def get_audio_transcripts_absolute(self, start_time: datetime.datetime,
start_timestamp = int(start_time.timestamp() * 1000) # convert to milliseconds
end_timestamp = int(end_time.timestamp() * 1000) # convert to milliseconds

query = """
# Build the WHERE clause based on speech_source filter
where_clause = "a.startTime + tw.timeOffset BETWEEN ? AND ?"
params = [start_timestamp, end_timestamp]

if speech_source:
where_clause += " AND tw.speechSource = ?"
params.append(speech_source)

query = f"""
SELECT
a.id as audio_id,
a.startTime as start_time,
a.duration,
tw.id as word_id,
tw.word,
tw.timeOffset as time_offset,
tw.duration
tw.duration,
tw.speechSource as speech_source,
a.path as audio_path
FROM
audio a
JOIN
transcript_word tw ON a.segmentId = tw.segmentId
WHERE
a.startTime + tw.timeOffset BETWEEN ? AND ?
{where_clause}
ORDER BY
a.startTime, tw.timeOffset
"""

self.cursor.execute(query, (start_timestamp, end_timestamp))
self.cursor.execute(query, params)
rows = self.cursor.fetchall()

# If no results, try with string-formatted timestamps
@@ -146,26 +158,36 @@ def get_audio_transcripts_absolute(self, start_time: datetime.datetime,
start_timestamp = start_time.strftime("%Y-%m-%dT%H:%M:%S.000")
end_timestamp = end_time.strftime("%Y-%m-%dT%H:%M:%S.999")

query = """
# Build the WHERE clause for string format
where_clause = "a.startTime BETWEEN ? AND ?"
params = [start_timestamp, end_timestamp]

if speech_source:
where_clause += " AND tw.speechSource = ?"
params.append(speech_source)

query = f"""
SELECT
a.id as audio_id,
a.startTime as start_time,
a.duration,
tw.id as word_id,
tw.word,
tw.timeOffset as time_offset,
tw.duration
tw.duration,
tw.speechSource as speech_source,
a.path as audio_path
FROM
audio a
JOIN
transcript_word tw ON a.segmentId = tw.segmentId
WHERE
a.startTime BETWEEN ? AND ?
{where_clause}
ORDER BY
a.startTime, tw.timeOffset
"""

self.cursor.execute(query, (start_timestamp, end_timestamp))
self.cursor.execute(query, params)
rows = self.cursor.fetchall()

results = []
@@ -199,6 +221,8 @@ def get_audio_transcripts_absolute(self, start_time: datetime.datetime,
'word': row[4],
'time_offset': row[5],
'duration': row[6], # using duration instead of confidence
'speech_source': row[7] if len(row) > 7 else None,
'audio_path': row[8] if len(row) > 8 else None,
'absolute_time': absolute_time
})

@@ -209,7 +233,8 @@ def get_audio_transcripts_absolute(self, start_time: datetime.datetime,
return []

def get_audio_transcripts_relative(self, days: int = 0, hours: int = 0,
minutes: int = 0, seconds: int = 0) -> typing.List[dict]:
minutes: int = 0, seconds: int = 0,
speech_source: typing.Optional[str] = None) -> typing.List[dict]:
"""retrieve audio transcripts from a relative time period.

queries audio transcripts from a time period relative to now.
@@ -219,6 +244,7 @@ def get_audio_transcripts_relative(self, days: int = 0, hours: int = 0,
hours: number of hours to look back
minutes: number of minutes to look back
seconds: number of seconds to look back
speech_source: optional filter for speech source ('me' for user voice, 'others' for other speakers)

returns:
a list of dictionaries containing transcript data
@@ -229,7 +255,39 @@ def get_audio_transcripts_relative(self, days: int = 0, hours: int = 0,
delta = datetime.timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)
start_time = now - delta

return self.get_audio_transcripts_absolute(start_time, now)
return self.get_audio_transcripts_absolute(start_time, now, speech_source)

    def get_own_voice_transcripts_by_day(self, start_time: datetime.datetime,
                                         end_time: datetime.datetime) -> typing.Dict[str, typing.List[dict]]:
        """retrieve user's own voice transcripts organized by day.

        queries audio transcripts for user's own voice only (speechSource = 'me')
        and organizes them by day for voice training data export.

        args:
            start_time: the start datetime to query from
            end_time: the end datetime to query to

        returns:
            a dictionary with dates as keys and lists of transcript dictionaries as values
        """

        # Get all own voice transcripts
        transcripts = self.get_audio_transcripts_absolute(start_time, end_time, speech_source='me')

        # Group by day
        transcripts_by_day = {}
        for transcript in transcripts:
            # Get the date in local time
            local_time = transcript['absolute_time'].astimezone()
            date_str = local_time.date().isoformat()

            if date_str not in transcripts_by_day:
                transcripts_by_day[date_str] = []

            transcripts_by_day[date_str].append(transcript)

        return transcripts_by_day

def get_screen_ocr_absolute(self, start_time: datetime.datetime,
end_time: datetime.datetime) -> typing.List[dict]: