INTRO
This notebook is an introductory exploration of the E5 coral urol-e5/timeseries_molecular (GitHub repo) multi-species data using a tensor decomposition approach, using workflow-stdm (GitHub Repo), a tensor decomposition package Steven created using an AI agent.
I misunderstood what the goal was, until well after I went through the trouble of running this.
The primary goal was simply to run the package and see what types of files/visualizations it generated. It successfully generates the expected output files; primarily numpy binary files for each factor:
gene_factors.npyrecontstructed_tensory.npyspecies_factorys.npysummary.jsontime_factors.npy
I think Steven would’ve liked if this package had also output some simple tables of the results that were human-readable and could be used to easy glance at the results, instead of all these numpy binary files.
Not realizing this at the time, I ran the package and worked heavily to assess/process the output data into something understandable. In the notebook below, I added steps to improve the input data formatting, process a range of ranks, compare those ranks, and select the “best” rank. Additionally, I added a number of visualizations to help better understand the results. However, like the default settings of the package, I did not add any steps to output results in any sort of table formats.
Overall, this analysis is likely similar to that of using barnacle (GitHub Repo) for this multi-omics analysis that I ran on 20251008 (notebook entry). Both greatly differ than the analysis I messed around with using MOFA2 (notebook entry).
The contents below are from knitted markdown 13.00-multiomics-stdm.Rmd (commit 84515d3).
1 BACKGROUND
This notebook performs tensor decomposition analysis on multi-species gene expression timeseries data using Steven’s workflow-stdm (GitHub repo) tensor decomposition project.
The analysis identifies optimal tensor rank through reconstruction error analysis and provides biological interpretations of gene expression patterns across species and time.
1.1 Analysis Overview
The notebook conducts the following analyses:
- Data Transformation: Converts wide-format gene expression data to long format required by the stdm package
- Rank Comparison: Tests tensor decomposition across multiple ranks (2-15) to identify optimal number of components
- Tensor Decomposition: Performs PARAFAC decomposition to identify gene, species, and temporal patterns
- Visualization: Creates comprehensive plots to interpret biological patterns in the decomposition results
1.2 Input Files
- Primary Input:
/home/shared/16TB_HDD_01/sam/gitrepos/sr320/workflow-stdm/input-data/vst_counts_matrix.csv- Wide-format VST-normalized gene expression data
- Columns: gene IDs and sample names (format: ACR-139-TP1, POR-216-TP2, etc.)
- Sample names encode: Species (ACR/POR/POC) - Individual ID - Timepoint (TP1-TP4)
1.3 Expected Output Structure
../output/13.00-multiomics-stdm/
├── vst_counts_matrix_long_format.csv # Transformed input data
├── YYYYMMDD_HHMMSS/ # Main timestamped results directory
│ ├── rank_analysis.png # Rank selection visualization
│ ├── rank_selection_summary.json # Optimal rank determination
│ ├── optimal_tensor_decomposition_visualization_rank_X.png # Comprehensive plots
│ ├── plot_01_component_strengths_optimal_rank_X.png # Individual visualizations
│ ├── plot_02_temporal_heatmap_optimal_rank_X.png
│ ├── plot_03_species_heatmap_optimal_rank_X.png
│ ├── plot_04_temporal_trajectories_optimal_rank_X.png
│ ├── plot_05_species_clustering_optimal_rank_X.png
│ └── rank_comparison/ # Rank comparison results
│ ├── rank_comparison_results.json # Summary of all ranks tested
│ ├── rank_02/ # Results for each rank
│ │ └── YYYYMMDD_HHMMSS/ # Timestamped stdm output
│ │ ├── summary.json # Decomposition metrics
│ │ ├── gene_factors.npy # Gene factor matrix
│ │ ├── species_factors.npy # Species factor matrix
│ │ ├── time_factors.npy # Time factor matrix
│ │ ├── reconstructed_tensor.npy # Reconstructed data
│ │ └── rank_02_comprehensive_visualization.png
│ ├── rank_03/
│ │ └── YYYYMMDD_HHMMSS/
│ │ └── [same structure as rank_02]
│ └── ... # Additional ranks (up to rank_15)
2 SETUP
2.1 Libraries
2.2 Set R variables
# OUTPUT DIRECTORY
output_dir <- "../output/13.00-multiomics-stdm"
#INPUT FILE(S)
# PYTHON ENVIRONMENT - Use the project's .venv environment
python_path <- "/home/shared/16TB_HDD_01/sam/gitrepos/urol-e5/timeseries_molecular/.venv/bin/python"2.3 Load project Python environment
The project uses a virtual environment (.venv) with all required packages including stdm. If this is successful, the Python path should show the .venv environment.
# Use the project's .venv Python environment directly
library(reticulate)
# Set Python path to the project's .venv
use_python(python_path, required = TRUE)
# Show final configuration
py_config()## python: /home/shared/16TB_HDD_01/sam/gitrepos/urol-e5/timeseries_molecular/.venv/bin/python
## libpython: /usr/lib/python3.12/config-3.12-x86_64-linux-gnu/libpython3.12.so
## pythonhome: /home/shared/16TB_HDD_01/sam/gitrepos/urol-e5/timeseries_molecular/.venv:/home/shared/16TB_HDD_01/sam/gitrepos/urol-e5/timeseries_molecular/.venv
## version: 3.12.3 (main, Aug 14 2025, 17:47:21) [GCC 13.3.0]
## numpy: /home/shared/16TB_HDD_01/sam/gitrepos/urol-e5/timeseries_molecular/.venv/lib/python3.12/site-packages/numpy
## numpy_version: 2.3.3
##
## NOTE: Python version was forced by use_python() function
2.4 Test Python Environment
Let’s verify the Python environment has all required libraries:
import sys
print(f"Python executable: {sys.executable}")
# Test problematic imports
try:
import PIL
print("✓ PIL imported successfully")
except ImportError as e:
print(f"✗ PIL import failed: {e}")
try:
import matplotlib.pyplot as plt
print("✓ matplotlib.pyplot imported successfully")
except ImportError as e:
print(f"✗ matplotlib.pyplot import failed: {e}")
try:
import pandas as pd
import numpy as np
print("✓ pandas and numpy imported successfully")
except ImportError as e:
print(f"✗ pandas/numpy import failed: {e}")
print("Environment test completed.")## Python executable: /home/shared/16TB_HDD_01/sam/gitrepos/urol-e5/timeseries_molecular/.venv/bin/python
## ✓ PIL imported successfully
## ✓ matplotlib.pyplot imported successfully
## ✓ pandas and numpy imported successfully
## Environment test completed.
3 STDM ANALYSIS
3.1 Transform data to stdm format
The stdm package expects data in long format with columns: gene, species, timepoint, expression. Our current data is in wide format, so we need to transform it first.
import pandas as pd
import numpy as np
import os
import re
# Get output directory from R environment
output_dir = r.output_dir
# Create output directory if it doesn't exist
os.makedirs(output_dir, exist_ok=True)
# Read the wide-format data
print("Reading wide-format data...")
input_file = "/home/shared/16TB_HDD_01/sam/gitrepos/sr320/workflow-stdm/input-data/vst_counts_matrix.csv"
df_wide = pd.read_csv(input_file)
print(f"Original data shape: {df_wide.shape}")
print(f"Columns: {list(df_wide.columns[:5])}... (showing first 5)")
# Create long-format data
print("Transforming to long format...")
data_long = []
# Get gene column (first column)
gene_col = df_wide.columns[0] # Should be 'group_id'
sample_cols = df_wide.columns[1:] # All sample columns
for _, row in df_wide.iterrows():
gene = row[gene_col]
for sample_col in sample_cols:
expression_value = row[sample_col]
# Parse sample name to extract species and timepoint
# Expected format: ACR-139-TP1, POR-216-TP2, POC-201-TP3, etc.
match = re.match(r'([A-Z]+)-(\d+)-(TP\d+)', sample_col)
if match:
species_prefix = match.group(1) # ACR, POR, POC (species designation)
individual_id = match.group(2) # 139, 216, 201 (individual/sample ID)
timepoint = match.group(3) # TP1, TP2, TP3, TP4
# Create individual identifier (species-individual combination)
species = f"{species_prefix}-{individual_id}"
data_long.append({
'gene': gene,
'species': species,
'timepoint': timepoint,
'expression': expression_value
})
else:
print(f"Warning: Could not parse sample name: {sample_col}")
# Create DataFrame
df_long = pd.DataFrame(data_long)
print(f"Transformed data shape: {df_long.shape}")
print(f"Number of unique genes: {df_long['gene'].nunique()}")
print(f"Number of unique species: {df_long['species'].nunique()}")
print(f"Number of unique timepoints: {df_long['timepoint'].nunique()}")
# Show sample of transformed data
print("\nSample of transformed data:")
print(df_long.head(10))
# Save transformed data
output_file = os.path.join(output_dir, "vst_counts_matrix_long_format.csv")
df_long.to_csv(output_file, index=False)
print(f"\nTransformed data saved to: {output_file}")## Reading wide-format data...
## Original data shape: (9800, 118)
## Columns: ['group_id', 'ACR-139-TP1', 'ACR-139-TP2', 'ACR-139-TP3', 'ACR-139-TP4']... (showing first 5)
## Transforming to long format...
## Transformed data shape: (1146600, 4)
## Number of unique genes: 9800
## Number of unique species: 30
## Number of unique timepoints: 4
##
## Sample of transformed data:
## gene species timepoint expression
## 0 OG_00686 ACR-139 TP1 5.971679
## 1 OG_00686 ACR-139 TP2 6.206077
## 2 OG_00686 ACR-139 TP3 6.379771
## 3 OG_00686 ACR-139 TP4 5.882078
## 4 OG_00686 ACR-145 TP1 5.414779
## 5 OG_00686 ACR-145 TP2 7.260837
## 6 OG_00686 ACR-145 TP3 6.630233
## 7 OG_00686 ACR-145 TP4 5.533403
## 8 OG_00686 ACR-150 TP1 5.922190
## 9 OG_00686 ACR-150 TP2 6.834550
##
## Transformed data saved to: ../output/13.00-multiomics-stdm/vst_counts_matrix_long_format.csv
3.2 Rank comparison analysis
First, let’s run tensor decomposition across multiple ranks to find the optimal number of components.
import subprocess
import os
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import glob
from datetime import datetime
# Get output directory from R environment
output_dir = r.output_dir
# Define ranks to test (2-15 should cover the useful range)
ranks_to_test = list(range(2, 16))
# Use the transformed data file
input_file = os.path.join(output_dir, "vst_counts_matrix_long_format.csv")
print(f"Running stdm decomposition across {len(ranks_to_test)} different ranks...")
print(f"Ranks to test: {ranks_to_test}")
print(f"Input file: {input_file}")
# Create main timestamped directory for all results
main_timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
main_results_dir = os.path.join(output_dir, main_timestamp)
os.makedirs(main_results_dir, exist_ok=True)
# Create a subdirectory for rank comparison within main results
comparison_dir = os.path.join(main_results_dir, "rank_comparison")
os.makedirs(comparison_dir, exist_ok=True)
# Dictionary to store results
rank_results = {}
# Run decomposition for each rank
for rank in ranks_to_test:
print(f"\n{'='*50}")
print(f"Processing rank {rank}...")
print(f"{'='*50}")
# Create rank-specific subdirectory
rank_subdir = os.path.join(comparison_dir, f"rank_{rank:02d}")
os.makedirs(rank_subdir, exist_ok=True)
# Run stdm decomposition
try:
result = subprocess.run([
"stdm", "run",
"--input", input_file,
"--output", rank_subdir,
"--rank", str(rank),
"--method", "parafac",
"--sparsity-threshold", "0.01",
"--no-log-transform",
"--standardize",
"--verbose"
], capture_output=True, text=True, check=True)
# Find the results directory (should be timestamped)
result_dirs = glob.glob(os.path.join(rank_subdir, "20*"))
if result_dirs:
latest_result_dir = max(result_dirs)
# Load summary statistics
summary_file = os.path.join(latest_result_dir, "summary.json")
if os.path.exists(summary_file):
with open(summary_file, 'r') as f:
summary = json.load(f)
rank_results[rank] = {
'reconstruction_error': summary['reconstruction_error'],
'rank': summary['rank'],
'method': summary['method'],
'result_dir': latest_result_dir
}
print(f" Rank {rank}: Reconstruction error = {summary['reconstruction_error']:.6f}")
else:
print(f" Warning: No summary file found for rank {rank}")
rank_results[rank] = {'reconstruction_error': np.nan, 'rank': rank, 'result_dir': None}
else:
print(f" Warning: No results directory found for rank {rank}")
rank_results[rank] = {'reconstruction_error': np.nan, 'rank': rank, 'result_dir': None}
except subprocess.CalledProcessError as e:
print(f" Error running decomposition for rank {rank}: {e}")
rank_results[rank] = {'reconstruction_error': np.nan, 'rank': rank, 'result_dir': None}
# Save rank comparison results
comparison_results_file = os.path.join(comparison_dir, "rank_comparison_results.json")
with open(comparison_results_file, 'w') as f:
json.dump(rank_results, f, indent=2)
print(f"\n{'='*60}")
print("RANK COMPARISON COMPLETED")
print(f"{'='*60}")
print(f"Main results directory: {main_results_dir}")
print(f"Rank comparison data: {comparison_dir}")
print(f"Comparison results file: {comparison_results_file}")
# Display summary table
print(f"\nSUMMARY TABLE:")
print(f"{'Rank':<6} {'Reconstruction Error':<20} {'Status'}")
print(f"{'-'*40}")
valid_results = {}
for rank in ranks_to_test:
if rank in rank_results and not np.isnan(rank_results[rank]['reconstruction_error']):
error = rank_results[rank]['reconstruction_error']
valid_results[rank] = error
print(f"{rank:<6} {error:<20.6f} {'Success'}")
else:
print(f"{rank:<6} {'N/A':<20} {'Failed'}")
if valid_results:
best_rank = min(valid_results.keys(), key=lambda k: valid_results[k])
print(f"\nBest rank by reconstruction error: {best_rank} (error: {valid_results[best_rank]:.6f})")
else:
print(f"\nNo successful decompositions found!")## Running stdm decomposition across 14 different ranks...
## Ranks to test: [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
## Input file: ../output/13.00-multiomics-stdm/vst_counts_matrix_long_format.csv
##
## ==================================================
## Processing rank 2...
## ==================================================
## Rank 2: Reconstruction error = 0.643202
##
## ==================================================
## Processing rank 3...
## ==================================================
## Rank 3: Reconstruction error = 0.544191
##
## ==================================================
## Processing rank 4...
## ==================================================
## Rank 4: Reconstruction error = 0.470061
##
## ==================================================
## Processing rank 5...
## ==================================================
## Rank 5: Reconstruction error = 0.451016
##
## ==================================================
## Processing rank 6...
## ==================================================
## Rank 6: Reconstruction error = 0.434595
##
## ==================================================
## Processing rank 7...
## ==================================================
## Rank 7: Reconstruction error = 0.423404
##
## ==================================================
## Processing rank 8...
## ==================================================
## Rank 8: Reconstruction error = 0.410814
##
## ==================================================
## Processing rank 9...
## ==================================================
## Rank 9: Reconstruction error = 0.408422
##
## ==================================================
## Processing rank 10...
## ==================================================
## Rank 10: Reconstruction error = 0.395594
##
## ==================================================
## Processing rank 11...
## ==================================================
## Rank 11: Reconstruction error = 0.385213
##
## ==================================================
## Processing rank 12...
## ==================================================
## Rank 12: Reconstruction error = 0.384416
##
## ==================================================
## Processing rank 13...
## ==================================================
## Rank 13: Reconstruction error = 0.372751
##
## ==================================================
## Processing rank 14...
## ==================================================
## Rank 14: Reconstruction error = 0.364498
##
## ==================================================
## Processing rank 15...
## ==================================================
## Rank 15: Reconstruction error = 0.360396
##
## ============================================================
## RANK COMPARISON COMPLETED
## ============================================================
## Main results directory: ../output/13.00-multiomics-stdm/20251015_190157
## Rank comparison data: ../output/13.00-multiomics-stdm/20251015_190157/rank_comparison
## Comparison results file: ../output/13.00-multiomics-stdm/20251015_190157/rank_comparison/rank_comparison_results.json
##
## SUMMARY TABLE:
## Rank Reconstruction Error Status
## ----------------------------------------
## 2 0.643202 Success
## 3 0.544191 Success
## 4 0.470061 Success
## 5 0.451016 Success
## 6 0.434595 Success
## 7 0.423404 Success
## 8 0.410814 Success
## 9 0.408422 Success
## 10 0.395594 Success
## 11 0.385213 Success
## 12 0.384416 Success
## 13 0.372751 Success
## 14 0.364498 Success
## 15 0.360396 Success
##
## Best rank by reconstruction error: 15 (error: 0.360396)
3.3 Analyze rank comparison results
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import json
import os
import glob
# Set up plotting
matplotlib.use('Agg') # Use non-interactive backend
plt.ioff() # Turn off interactive mode
# Get output directory and find the main results directory
output_dir = r.output_dir
# Find the most recent main results directory (timestamped)
main_result_dirs = glob.glob(os.path.join(output_dir, "20*"))
if main_result_dirs:
main_results_dir = max(main_result_dirs)
comparison_dir = os.path.join(main_results_dir, "rank_comparison")
else:
# Fallback to old structure if new structure doesn't exist yet
comparison_dir = os.path.join(output_dir, "rank_comparison")
main_results_dir = output_dir
# Load rank comparison results
comparison_results_file = os.path.join(comparison_dir, "rank_comparison_results.json")
if os.path.exists(comparison_results_file):
with open(comparison_results_file, 'r') as f:
rank_results = json.load(f)
# Convert to DataFrame for easier analysis
results_data = []
for rank_str, data in rank_results.items():
rank = int(rank_str)
if not np.isnan(data['reconstruction_error']):
results_data.append({
'rank': rank,
'reconstruction_error': data['reconstruction_error'],
'result_dir': data['result_dir']
})
if results_data:
df_results = pd.DataFrame(results_data)
df_results = df_results.sort_values('rank')
print("="*60)
print("RANK SELECTION ANALYSIS")
print("="*60)
# Calculate additional metrics
df_results['error_change'] = df_results['reconstruction_error'].diff()
df_results['error_change_pct'] = df_results['reconstruction_error'].pct_change() * 100
# Find elbow point using second derivative
if len(df_results) >= 3:
# Calculate second derivative (acceleration of error decrease)
df_results['second_derivative'] = df_results['error_change'].diff()
# Find elbow (where second derivative is maximum - greatest change in slope)
elbow_idx = df_results['second_derivative'].idxmax()
elbow_rank = df_results.loc[elbow_idx, 'rank'] if not pd.isna(elbow_idx) else None
else:
elbow_rank = None
# Create comprehensive visualization
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
# 1. Scree plot - Reconstruction error vs rank
axes[0, 0].plot(df_results['rank'], df_results['reconstruction_error'],
'bo-', linewidth=2, markersize=8)
axes[0, 0].set_xlabel('Rank (Number of Components)')
axes[0, 0].set_ylabel('Reconstruction Error')
axes[0, 0].set_title('Scree Plot: Reconstruction Error vs Rank')
axes[0, 0].grid(True, alpha=0.3)
# Highlight elbow point
if elbow_rank:
elbow_error = df_results[df_results['rank'] == elbow_rank]['reconstruction_error'].iloc[0]
axes[0, 0].plot(elbow_rank, elbow_error, 'ro', markersize=12,
label=f'Elbow Point (Rank {elbow_rank})')
axes[0, 0].legend()
# 2. Error reduction rate
axes[0, 1].plot(df_results['rank'][1:], df_results['error_change'][1:],
'go-', linewidth=2, markersize=6)
axes[0, 1].set_xlabel('Rank')
axes[0, 1].set_ylabel('Error Change from Previous Rank')
axes[0, 1].set_title('Error Reduction Rate')
axes[0, 1].grid(True, alpha=0.3)
axes[0, 1].axhline(y=0, color='r', linestyle='--', alpha=0.5)
# 3. Percentage change in error
axes[1, 0].plot(df_results['rank'][1:], df_results['error_change_pct'][1:],
'mo-', linewidth=2, markersize=6)
axes[1, 0].set_xlabel('Rank')
axes[1, 0].set_ylabel('Error Change (%)')
axes[1, 0].set_title('Percentage Error Reduction')
axes[1, 0].grid(True, alpha=0.3)
axes[1, 0].axhline(y=0, color='r', linestyle='--', alpha=0.5)
# 4. Second derivative (elbow detection)
if len(df_results) >= 3:
axes[1, 1].plot(df_results['rank'][2:], df_results['second_derivative'][2:],
'co-', linewidth=2, markersize=6)
axes[1, 1].set_xlabel('Rank')
axes[1, 1].set_ylabel('Second Derivative')
axes[1, 1].set_title('Elbow Detection (Second Derivative)')
axes[1, 1].grid(True, alpha=0.3)
axes[1, 1].axhline(y=0, color='r', linestyle='--', alpha=0.5)
if elbow_rank and elbow_rank >= 3:
elbow_second_deriv = df_results[df_results['rank'] == elbow_rank]['second_derivative'].iloc[0]
axes[1, 1].plot(elbow_rank, elbow_second_deriv, 'ro', markersize=12,
label=f'Elbow Point (Rank {elbow_rank})')
axes[1, 1].legend()
else:
axes[1, 1].text(0.5, 0.5, 'Not enough data points\nfor elbow detection',
transform=axes[1, 1].transAxes, ha='center', va='center')
axes[1, 1].set_title('Elbow Detection (Insufficient Data)')
plt.tight_layout()
# Save rank analysis plot
rank_analysis_plot = os.path.join(main_results_dir, "rank_analysis.png")
plt.savefig(rank_analysis_plot, dpi=300, bbox_inches='tight')
print(f"Rank analysis plot saved to: {rank_analysis_plot}")
plt.show()
# Print detailed analysis
print(f"\nDETAILED RANK ANALYSIS:")
print(f"{'Rank':<6} {'Error':<12} {'Change':<12} {'Change %':<12} {'2nd Deriv':<12}")
print(f"{'-'*60}")
for _, row in df_results.iterrows():
rank = int(row['rank'])
error = row['reconstruction_error']
change = row['error_change'] if not pd.isna(row['error_change']) else 0
change_pct = row['error_change_pct'] if not pd.isna(row['error_change_pct']) else 0
second_deriv = row['second_derivative'] if not pd.isna(row['second_derivative']) else 0
print(f"{rank:<6} {error:<12.6f} {change:<12.6f} {change_pct:<12.2f} {second_deriv:<12.6f}")
# Recommendations
print(f"\nRECOMMENDATIONS:")
# Best by reconstruction error
best_error_rank = df_results.loc[df_results['reconstruction_error'].idxmin(), 'rank']
best_error = df_results['reconstruction_error'].min()
print(f" Best reconstruction error: Rank {int(best_error_rank)} (error: {best_error:.6f})")
# Elbow point recommendation
if elbow_rank:
elbow_error = df_results[df_results['rank'] == elbow_rank]['reconstruction_error'].iloc[0]
print(f" Elbow point (best trade-off): Rank {int(elbow_rank)} (error: {elbow_error:.6f})")
# Conservative recommendation (where improvement becomes marginal)
# Find where error reduction becomes < 5% of previous rank
marginal_ranks = df_results[df_results['error_change_pct'] > -5]['rank']
if len(marginal_ranks) > 0:
conservative_rank = marginal_ranks.iloc[0]
conservative_error = df_results[df_results['rank'] == conservative_rank]['reconstruction_error'].iloc[0]
print(f" Conservative choice (< 5% improvement): Rank {int(conservative_rank)} (error: {conservative_error:.6f})")
print(f"\n SUGGESTED RANK: {int(elbow_rank) if elbow_rank else int(best_error_rank)}")
print(f" RATIONALE: {'Elbow point provides best complexity/accuracy trade-off' if elbow_rank else 'Best reconstruction error'}")
# Save summary
summary_data = {
'best_rank_by_error': int(best_error_rank),
'best_error': float(best_error),
'elbow_rank': int(elbow_rank) if elbow_rank else None,
'suggested_rank': int(elbow_rank) if elbow_rank else int(best_error_rank),
'total_ranks_tested': len(df_results),
'analysis_date': datetime.now().isoformat()
}
summary_file = os.path.join(main_results_dir, "rank_selection_summary.json")
with open(summary_file, 'w') as f:
json.dump(summary_data, f, indent=2)
print(f"\nRank selection summary saved to: {summary_file}")
# Store main results directory for subsequent analyses
globals()['main_results_dir'] = main_results_dir
else:
print("No valid results found in rank comparison!")
else:
print(f"Rank comparison results file not found: {comparison_results_file}")
# Suppress output
None3.4 Analyze optimal decomposition results
import numpy as np
import pandas as pd
import json
import matplotlib.pyplot as plt
import glob
import os
# Get output directory from R environment
output_dir = r.output_dir
# Find the main results directory
if 'main_results_dir' in globals():
main_results_dir = globals()['main_results_dir']
else:
# Fallback to find the most recent main results directory
main_result_dirs = glob.glob(os.path.join(output_dir, "20*"))
if main_result_dirs:
main_results_dir = max(main_result_dirs)
else:
main_results_dir = output_dir
# Load rank selection summary to get optimal rank
summary_file = os.path.join(main_results_dir, "rank_selection_summary.json")
if os.path.exists(summary_file):
with open(summary_file, 'r') as f:
rank_summary = json.load(f)
optimal_rank = rank_summary['suggested_rank']
print(f"Analyzing optimal rank {optimal_rank} results from rank comparison data")
# Find the optimal rank data in rank_comparison directory
comparison_dir = os.path.join(main_results_dir, "rank_comparison")
optimal_rank_dir = os.path.join(comparison_dir, f"rank_{optimal_rank:02d}")
if os.path.exists(optimal_rank_dir):
# Find the timestamped results within the rank directory
result_dirs = glob.glob(os.path.join(optimal_rank_dir, "20*"))
if result_dirs:
latest_dir = max(result_dirs)
print(f"Analyzing optimal results from: {latest_dir}")
# Load summary
with open(os.path.join(latest_dir, "summary.json")) as f:
summary = json.load(f)
print("\n" + "="*60)
print("OPTIMAL DECOMPOSITION QUALITY ASSESSMENT")
print("="*60)
# Print key metrics
print(f"Method: {summary['method'].upper()}")
print(f"Rank: {summary['rank']} (data-driven selection)")
print(f"Reconstruction Error: {summary['reconstruction_error']:.6f}")
# Quality interpretation
error = summary['reconstruction_error']
if error < 0.1:
quality = "Excellent (may be overfitting)"
elif error < 0.3:
quality = "Good"
elif error < 0.5:
quality = "Acceptable"
else:
quality = "Poor - consider increasing rank"
print(f"Quality Assessment: {quality}")
# Load factor matrices
gene_factors = np.load(os.path.join(latest_dir, "gene_factors.npy"))
species_factors = np.load(os.path.join(latest_dir, "species_factors.npy"))
time_factors = np.load(os.path.join(latest_dir, "time_factors.npy"))
print(f"\nFactor Matrix Shapes:")
print(f" Gene factors: {gene_factors.shape}")
print(f" Species factors: {species_factors.shape}")
print(f" Time factors: {time_factors.shape}")
print(f"\nConvergence Check:")
print(f" All factors finite: {np.all(np.isfinite(gene_factors)) and np.all(np.isfinite(species_factors)) and np.all(np.isfinite(time_factors))}")
print(f" No NaN values: {not (np.any(np.isnan(gene_factors)) or np.any(np.isnan(species_factors)) or np.any(np.isnan(time_factors)))}")
# Show component strengths
print(f"\nComponent Analysis:")
component_strengths = []
for i in range(summary['rank']):
gene_norm = np.linalg.norm(gene_factors[:, i])
species_norm = np.linalg.norm(species_factors[:, i])
time_norm = np.linalg.norm(time_factors[:, i])
total_strength = gene_norm * species_norm * time_norm
component_strengths.append(total_strength)
# Sort components by strength
sorted_components = np.argsort(component_strengths)[::-1]
n_top = min(5, len(sorted_components))
print(f" Top {n_top} strongest components: {sorted_components[:n_top] + 1}") # +1 for 1-based indexing
# Find most variable timepoints
print(f"\nTemporal Patterns:")
time_variance = np.var(time_factors, axis=0)
most_temporal = np.argsort(time_variance)[::-1][:n_top]
print(f" Components with strongest temporal patterns: {most_temporal + 1}")
# Species diversity
print(f"\nSpecies Patterns:")
species_variance = np.var(species_factors, axis=0)
most_species_specific = np.argsort(species_variance)[::-1][:n_top]
print(f" Components with strongest species differences: {most_species_specific + 1}")
print(f"\n" + "="*60)
print("CONCLUSION: The optimal decomposition successfully converged!")
print(f"Results saved in: {latest_dir}")
print("="*60)
# Store latest_dir for subsequent analyses (this is now the rank_comparison data)
globals()['optimal_results_dir'] = latest_dir
else:
print(f"No timestamped results found in rank_{optimal_rank:02d} directory!")
else:
print(f"Optimal rank {optimal_rank} directory not found in rank comparison!")
else:
print("No rank selection summary found!")## Analyzing optimal rank 5 results from rank comparison data
## Analyzing optimal results from: ../output/13.00-multiomics-stdm/20251015_190157/rank_comparison/rank_05/20251015_190351
##
## ============================================================
## OPTIMAL DECOMPOSITION QUALITY ASSESSMENT
## ============================================================
## Method: PARAFAC
## Rank: 5 (data-driven selection)
## Reconstruction Error: 0.451016
## Quality Assessment: Acceptable
##
## Factor Matrix Shapes:
## Gene factors: (9800, 5)
## Species factors: (30, 5)
## Time factors: (4, 5)
##
## Convergence Check:
## All factors finite: True
## No NaN values: True
##
## Component Analysis:
## Top 5 strongest components: [3 5 2 1 4]
##
## Temporal Patterns:
## Components with strongest temporal patterns: [5 4 3 2 1]
##
## Species Patterns:
## Components with strongest species differences: [1 2 3 4 5]
##
## ============================================================
## CONCLUSION: The optimal decomposition successfully converged!
## Results saved in: ../output/13.00-multiomics-stdm/20251015_190157/rank_comparison/rank_05/20251015_190351
## ============================================================
4 VISUALIZATIONS
4.1 Generate visualizations for all tested ranks
This section creates comprehensive visualizations for each rank tested, allowing comparison across different numbers of components.
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg') # Must be before importing pyplot
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
import glob
import os
import json
# Set up plotting style
plt.style.use('default')
sns.set_palette("husl")
plt.ioff() # Turn off interactive mode
# Get output directory and find the main results directory
output_dir = r.output_dir
# Find the most recent main results directory (timestamped)
if 'main_results_dir' in globals():
main_results_dir = globals()['main_results_dir']
else:
main_result_dirs = glob.glob(os.path.join(output_dir, "20*"))
if main_result_dirs:
main_results_dir = max(main_result_dirs)
else:
main_results_dir = output_dir
comparison_dir = os.path.join(main_results_dir, "rank_comparison")
# Load rank comparison results
comparison_results_file = os.path.join(comparison_dir, "rank_comparison_results.json")
if os.path.exists(comparison_results_file):
with open(comparison_results_file, 'r') as f:
rank_results = json.load(f)
# Load the transformed data
data_file = os.path.join(output_dir, "vst_counts_matrix_long_format.csv")
df_long = pd.read_csv(data_file)
species_names = sorted(df_long['species'].unique())
timepoint_names = sorted(df_long['timepoint'].unique())
species_types = [name.split('-')[0] for name in species_names]
unique_types = list(set(species_types))
print("="*60)
print("GENERATING VISUALIZATIONS FOR ALL TESTED RANKS")
print("="*60)
successful_ranks = []
# Process each rank that completed successfully
for rank_str, data in rank_results.items():
rank = int(rank_str)
result_dir = data.get('result_dir')
if result_dir and not np.isnan(data['reconstruction_error']):
successful_ranks.append(rank)
print(f"\nProcessing visualizations for rank {rank}...")
try:
# Load factor matrices
gene_factors = np.load(os.path.join(result_dir, "gene_factors.npy"))
species_factors = np.load(os.path.join(result_dir, "species_factors.npy"))
time_factors = np.load(os.path.join(result_dir, "time_factors.npy"))
# Calculate component strengths
component_strengths = []
for i in range(gene_factors.shape[1]):
gene_norm = np.linalg.norm(gene_factors[:, i])
species_norm = np.linalg.norm(species_factors[:, i])
time_norm = np.linalg.norm(time_factors[:, i])
total_strength = gene_norm * species_norm * time_norm
component_strengths.append(total_strength)
# Sort components by strength
sorted_components = np.argsort(component_strengths)[::-1]
n_top_display = min(5, rank)
top_components = sorted_components[:n_top_display]
# Create comprehensive visualization for this rank
fig = plt.figure(figsize=(16, 12))
# 1. Component strengths
plt.subplot(3, 3, 1)
plt.bar(range(1, len(component_strengths) + 1),
[component_strengths[i] for i in sorted_components])
plt.xlabel('Component (ranked by strength)')
plt.ylabel('Component Strength')
plt.title(f'Component Strengths (Rank {rank})')
if rank <= 10:
plt.xticks(range(1, rank + 1))
# 2. Temporal patterns heatmap
plt.subplot(3, 3, 2)
time_data = time_factors[:, top_components]
sns.heatmap(time_data.T,
xticklabels=timepoint_names,
yticklabels=[f'C{i+1}' for i in top_components],
cmap='RdBu_r', center=0, annot=True, fmt='.2f')
plt.title(f'Top {n_top_display} Components: Temporal')
plt.xlabel('Timepoint')
# 3. Species clustering PCA
plt.subplot(3, 3, 3)
if rank >= 2:
n_pca_components = min(rank, 10)
species_subset = species_factors[:, sorted_components[:n_pca_components]]
pca = PCA(n_components=2)
species_pca = pca.fit_transform(species_subset)
colors = ['red', 'blue', 'green', 'orange', 'purple'][:len(unique_types)]
species_type_map = {stype: i for i, stype in enumerate(unique_types)}
for i, (x, y) in enumerate(species_pca):
stype = species_types[i]
plt.scatter(x, y, c=colors[species_type_map[stype]],
alpha=0.7, s=40,
label=stype if i == species_types.index(stype) else "")
plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%})')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%})')
plt.title(f'Species PCA (Rank {rank})')
plt.legend()
plt.grid(True, alpha=0.3)
# 4. Temporal trajectories
plt.subplot(3, 3, 4)
for i, comp_idx in enumerate(top_components):
plt.plot(range(1, len(timepoint_names) + 1),
time_factors[:, comp_idx],
marker='o', linewidth=2, label=f'C{comp_idx + 1}')
plt.xlabel('Timepoint')
plt.ylabel('Factor Value')
plt.title(f'Temporal Trajectories (Rank {rank})')
plt.xticks(range(1, len(timepoint_names) + 1), timepoint_names)
plt.legend()
plt.grid(True, alpha=0.3)
# 5. Component correlations
plt.subplot(3, 3, 5)
gene_corr = np.corrcoef(gene_factors.T)
sns.heatmap(gene_corr, cmap='RdBu_r', center=0,
xticklabels=range(1, rank + 1),
yticklabels=range(1, rank + 1),
square=True)
plt.title(f'Component Correlations (Rank {rank})')
# 6. Species type comparison
plt.subplot(3, 3, 6)
species_by_type = {}
for i, name in enumerate(species_names):
stype = name.split('-')[0]
if stype not in species_by_type:
species_by_type[stype] = []
species_by_type[stype].append(i)
type_means = {}
for stype, indices in species_by_type.items():
type_means[stype] = np.mean(species_factors[indices, :], axis=0)
x_pos = np.arange(len(top_components))
width = 0.8 / len(type_means)
for i, (stype, means) in enumerate(type_means.items()):
plt.bar(x_pos + i * width, means[top_components],
width, label=stype, alpha=0.8)
plt.xlabel('Component')
plt.ylabel('Mean Factor Value')
plt.title(f'Species Comparison (Rank {rank})')
plt.xticks(x_pos + width * (len(type_means) - 1) / 2,
[f'C{i+1}' for i in top_components])
plt.legend()
# 7. Temporal variance
plt.subplot(3, 3, 7)
time_variances = np.var(time_factors, axis=0)
sorted_time_vars = np.argsort(time_variances)[::-1]
plt.bar(range(1, len(time_variances) + 1),
[time_variances[i] for i in sorted_time_vars])
plt.xlabel('Component (by temporal variance)')
plt.ylabel('Temporal Variance')
plt.title(f'Temporal Variability (Rank {rank})')
if rank <= 10:
plt.xticks(range(1, rank + 1))
# 8. Species variance
plt.subplot(3, 3, 8)
species_variances = np.var(species_factors, axis=0)
sorted_species_vars = np.argsort(species_variances)[::-1]
plt.bar(range(1, len(species_variances) + 1),
[species_variances[i] for i in sorted_species_vars])
plt.xlabel('Component (by species variance)')
plt.ylabel('Species Variance')
plt.title(f'Species Variability (Rank {rank})')
if rank <= 10:
plt.xticks(range(1, rank + 1))
# 9. Gene factor distribution
plt.subplot(3, 3, 9)
gene_variances = np.var(gene_factors, axis=1)
plt.hist(gene_variances, bins=30, alpha=0.7, edgecolor='black')
plt.xlabel('Gene Factor Variance')
plt.ylabel('Number of Genes')
plt.title(f'Gene Variability (Rank {rank})')
plt.grid(True, alpha=0.3)
plt.tight_layout()
# Save rank-specific visualization in the main results directory for easy access
rank_plot_file = os.path.join(main_results_dir, f"rank_{rank:02d}_comprehensive_visualization.png")
plt.savefig(rank_plot_file, dpi=300, bbox_inches='tight')
print(f" Saved: {rank_plot_file}")
# Also save in the rank-specific subdirectory
rank_subdir_plot = os.path.join(result_dir, f"rank_{rank:02d}_comprehensive_visualization.png")
plt.savefig(rank_subdir_plot, dpi=300, bbox_inches='tight')
plt.close() # Close the figure to free memory instead of showing all plots
except Exception as e:
print(f" Error processing rank {rank}: {e}")
print(f"\nSuccessfully generated visualizations for ranks: {successful_ranks}")
else:
print("Rank comparison results file not found!")
# Suppress output
None4.2 Visualize optimal decomposition results
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import glob
import os
import json
# Set up plotting style
matplotlib.use('Agg') # Use non-interactive backend
plt.style.use('default')
sns.set_palette("husl")
plt.ioff() # Turn off interactive mode
# Get output directory from R environment
output_dir = r.output_dir
# Find the main results directory
if 'main_results_dir' in globals():
main_results_dir = globals()['main_results_dir']
else:
main_result_dirs = glob.glob(os.path.join(output_dir, "20*"))
if main_result_dirs:
main_results_dir = max(main_result_dirs)
else:
main_results_dir = output_dir
# Load rank selection summary to get optimal rank
summary_file = os.path.join(main_results_dir, "rank_selection_summary.json")
if os.path.exists(summary_file):
with open(summary_file, 'r') as f:
rank_summary = json.load(f)
optimal_rank = rank_summary['suggested_rank']
print(f"Visualizing optimal rank {optimal_rank} results from rank comparison data")
# Find the optimal rank data in rank_comparison directory
comparison_dir = os.path.join(main_results_dir, "rank_comparison")
optimal_rank_dir = os.path.join(comparison_dir, f"rank_{optimal_rank:02d}")
if os.path.exists(optimal_rank_dir):
# Find the timestamped results within the rank directory
result_dirs = glob.glob(os.path.join(optimal_rank_dir, "20*"))
if result_dirs:
latest_dir = max(result_dirs)
print(f"Using optimal analysis data from: {latest_dir}")
else:
print(f"No timestamped results found in rank_{optimal_rank:02d} directory!")
latest_dir = None
else:
print(f"Optimal rank {optimal_rank} directory not found in rank comparison!")
latest_dir = None
else:
print("No rank selection summary found!")
latest_dir = None
if latest_dir:
print(f"Visualizing results from: {latest_dir}")
# Load all data
gene_factors = np.load(os.path.join(latest_dir, "gene_factors.npy"))
species_factors = np.load(os.path.join(latest_dir, "species_factors.npy"))
time_factors = np.load(os.path.join(latest_dir, "time_factors.npy"))
# Load the transformed data to get species names
data_file = os.path.join(output_dir, "vst_counts_matrix_long_format.csv")
df_long = pd.read_csv(data_file)
# Get unique species and timepoint names
species_names = sorted(df_long['species'].unique())
timepoint_names = sorted(df_long['timepoint'].unique())
# Calculate component strengths
component_strengths = []
for i in range(gene_factors.shape[1]):
gene_norm = np.linalg.norm(gene_factors[:, i])
species_norm = np.linalg.norm(species_factors[:, i])
time_norm = np.linalg.norm(time_factors[:, i])
total_strength = gene_norm * species_norm * time_norm
component_strengths.append(total_strength)
# Sort components by strength
sorted_components = np.argsort(component_strengths)[::-1]
# Determine how many top components to show (max 5, or all if fewer)
n_components = gene_factors.shape[1]
n_top_display = min(5, n_components)
# Create figure with multiple subplots
fig = plt.figure(figsize=(20, 24))
# 1. Component strengths bar plot
plt.subplot(4, 3, 1)
plt.bar(range(1, len(component_strengths) + 1),
[component_strengths[i] for i in sorted_components])
plt.xlabel('Component (ranked by strength)')
plt.ylabel('Component Strength')
plt.title('Component Strengths (Gene × Species × Time norms)')
if n_components <= 10:
plt.xticks(range(1, n_components + 1))
else:
plt.xticks(range(1, n_components + 1, max(1, n_components // 10)))
# 2. Time factors heatmap for top components
plt.subplot(4, 3, 2)
top_components = sorted_components[:n_top_display]
time_data = time_factors[:, top_components]
sns.heatmap(time_data.T,
xticklabels=timepoint_names,
yticklabels=[f'Comp {i+1}' for i in top_components],
cmap='RdBu_r', center=0, annot=True, fmt='.3f')
plt.title(f'Top {n_top_display} Components: Temporal Patterns')
plt.xlabel('Timepoint')
plt.ylabel('Component')
# 3. Species factors heatmap for top components
plt.subplot(4, 3, 3)
species_data = species_factors[:, top_components]
# Create species type labels for better visualization
species_types = [name.split('-')[0] for name in species_names]
unique_types = list(set(species_types))
type_colors = plt.cm.Set3(np.linspace(0, 1, len(unique_types)))
sns.heatmap(species_data.T,
xticklabels=[f"{name[:8]}..." if len(name) > 8 else name for name in species_names],
yticklabels=[f'Comp {i+1}' for i in top_components],
cmap='RdBu_r', center=0)
plt.title(f'Top {n_top_display} Components: Species Patterns')
plt.xlabel('Individual')
plt.ylabel('Component')
plt.xticks(rotation=90)
# 4. Temporal patterns for strongest components
plt.subplot(4, 3, 4)
for i, comp_idx in enumerate(top_components):
plt.plot(range(1, len(timepoint_names) + 1),
time_factors[:, comp_idx],
marker='o', linewidth=2, label=f'Component {comp_idx + 1}')
plt.xlabel('Timepoint')
plt.ylabel('Factor Value')
plt.title(f'Temporal Trajectories (Top {n_top_display} Components)')
plt.xticks(range(1, len(timepoint_names) + 1), timepoint_names)
plt.legend()
plt.grid(True, alpha=0.3)
# 5. Species clustering based on factor loadings
plt.subplot(4, 3, 5)
# Use all components for clustering if <= 10, otherwise top 10
n_cluster_components = min(10, n_components)
species_subset = species_factors[:, sorted_components[:n_cluster_components]]
# PCA for visualization
if species_subset.shape[1] >= 2:
pca = PCA(n_components=2)
species_pca = pca.fit_transform(species_subset)
# Color by species type
colors = ['red', 'blue', 'green', 'orange', 'purple', 'brown'][:len(unique_types)]
species_type_map = {stype: i for i, stype in enumerate(unique_types)}
for i, (x, y) in enumerate(species_pca):
stype = species_types[i]
plt.scatter(x, y, c=colors[species_type_map[stype]],
alpha=0.7, s=60, label=stype if i == species_types.index(stype) else "")
plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)')
plt.title('Species Clustering (PCA of Factor Loadings)')
plt.legend()
plt.grid(True, alpha=0.3)
else:
plt.text(0.5, 0.5, 'Not enough components\nfor PCA visualization',
transform=plt.gca().transAxes, ha='center', va='center')
plt.title('Species Clustering (Insufficient Components)')
# 6. Gene factor distribution
plt.subplot(4, 3, 6)
gene_variances = np.var(gene_factors, axis=1)
plt.hist(gene_variances, bins=50, alpha=0.7, edgecolor='black')
plt.xlabel('Gene Factor Variance')
plt.ylabel('Number of Genes')
plt.title('Distribution of Gene Factor Variability')
plt.grid(True, alpha=0.3)
# 7. Component correlation matrix
plt.subplot(4, 3, 7)
# Calculate correlations between components across genes
gene_corr = np.corrcoef(gene_factors.T)
sns.heatmap(gene_corr, cmap='RdBu_r', center=0,
xticklabels=range(1, gene_factors.shape[1] + 1),
yticklabels=range(1, gene_factors.shape[1] + 1))
plt.title('Component Correlations (Gene Space)')
plt.xlabel('Component')
plt.ylabel('Component')
# 8. Species type comparison
plt.subplot(4, 3, 8)
species_by_type = {}
for i, name in enumerate(species_names):
stype = name.split('-')[0]
if stype not in species_by_type:
species_by_type[stype] = []
species_by_type[stype].append(i)
# Calculate mean factor values per species type for top components
type_means = {}
for stype, indices in species_by_type.items():
type_means[stype] = np.mean(species_factors[indices, :], axis=0)
# Plot for top components
x_pos = np.arange(len(top_components))
width = 0.8 / len(type_means) # Adjust width based on number of species types
for i, (stype, means) in enumerate(type_means.items()):
plt.bar(x_pos + i * width, means[top_components],
width, label=stype, alpha=0.8)
plt.xlabel('Component')
plt.ylabel('Mean Factor Value')
plt.title(f'Species Type Comparison (Top {n_top_display} Components)')
plt.xticks(x_pos + width * (len(type_means) - 1) / 2,
[f'Comp {i+1}' for i in top_components])
plt.legend()
plt.grid(True, alpha=0.3)
# 9. Temporal variance per component
plt.subplot(4, 3, 9)
time_variances = np.var(time_factors, axis=0)
sorted_time_vars = np.argsort(time_variances)[::-1]
plt.bar(range(1, len(time_variances) + 1),
[time_variances[i] for i in sorted_time_vars])
plt.xlabel('Component (ranked by temporal variance)')
plt.ylabel('Temporal Variance')
plt.title('Components by Temporal Variability')
if n_components <= 10:
plt.xticks(range(1, n_components + 1))
else:
plt.xticks(range(1, n_components + 1, max(1, n_components // 10)))
# 10. Species variance per component
plt.subplot(4, 3, 10)
species_variances = np.var(species_factors, axis=0)
sorted_species_vars = np.argsort(species_variances)[::-1]
plt.bar(range(1, len(species_variances) + 1),
[species_variances[i] for i in sorted_species_vars])
plt.xlabel('Component (ranked by species variance)')
plt.ylabel('Species Variance')
plt.title('Components by Species Variability')
if n_components <= 10:
plt.xticks(range(1, n_components + 1))
else:
plt.xticks(range(1, n_components + 1, max(1, n_components // 10)))
# 11. Top genes per component heatmap
plt.subplot(4, 3, 11)
n_top_genes = 10
top_gene_indices = []
# Get top genes for each of the top components
for comp_idx in top_components:
top_genes = np.argsort(np.abs(gene_factors[:, comp_idx]))[-n_top_genes:]
top_gene_indices.extend(top_genes)
top_gene_indices = list(set(top_gene_indices)) # Remove duplicates
top_gene_data = gene_factors[top_gene_indices, :][:, top_components]
sns.heatmap(top_gene_data,
cmap='RdBu_r', center=0,
yticklabels=[f'Gene {i}' for i in top_gene_indices],
xticklabels=[f'Comp {i+1}' for i in top_components])
plt.title(f'Top {len(top_gene_indices)} Genes × Top {n_top_display} Components')
plt.xlabel('Component')
plt.ylabel('Gene')
# 12. Reconstruction quality assessment
plt.subplot(4, 3, 12)
# Load reconstructed tensor
reconstructed = np.load(os.path.join(latest_dir, "reconstructed_tensor.npy"))
# Create original tensor from the long format data
# Pivot to get the original tensor structure
original_pivot = df_long.pivot_table(values='expression',
index='gene',
columns=['species', 'timepoint'],
fill_value=0)
print(f"Debug: Original pivot shape: {original_pivot.shape}")
print(f"Debug: Reconstructed tensor shape: {reconstructed.shape}")
print(f"Debug: Expected shape: ({gene_factors.shape[0]}, {species_factors.shape[0]}, {time_factors.shape[0]})")
# Check if we can safely reshape
expected_size = gene_factors.shape[0] * species_factors.shape[0] * time_factors.shape[0]
actual_size = original_pivot.size
if actual_size == expected_size:
# Reshape the pivot table to match tensor format
n_genes, n_species, n_timepoints = gene_factors.shape[0], species_factors.shape[0], time_factors.shape[0]
original_reshaped = original_pivot.values.reshape(n_genes, n_species, n_timepoints)
# Calculate per-gene reconstruction error
gene_errors = np.mean(np.abs(original_reshaped - reconstructed), axis=(1, 2))
plt.hist(gene_errors, bins=50, alpha=0.7, edgecolor='black')
plt.xlabel('Mean Absolute Error per Gene')
plt.ylabel('Number of Genes')
plt.title('Reconstruction Error Distribution')
plt.axvline(np.mean(gene_errors), color='red', linestyle='--',
label=f'Mean: {np.mean(gene_errors):.3f}')
plt.legend()
plt.grid(True, alpha=0.3)
else:
# If reshape fails, show a simpler quality metric
# Calculate overall reconstruction quality using Frobenius norm
original_flat = original_pivot.values.flatten()
reconstructed_flat = reconstructed.flatten()
# Ensure both arrays are the same size by taking the minimum
min_size = min(len(original_flat), len(reconstructed_flat))
original_flat = original_flat[:min_size]
reconstructed_flat = reconstructed_flat[:min_size]
residuals = original_flat - reconstructed_flat
plt.hist(residuals, bins=50, alpha=0.7, edgecolor='black')
plt.xlabel('Reconstruction Residuals')
plt.ylabel('Frequency')
plt.title('Reconstruction Residuals Distribution')
plt.axvline(np.mean(residuals), color='red', linestyle='--',
label=f'Mean: {np.mean(residuals):.3f}')
plt.axvline(0, color='black', linestyle=':', alpha=0.5, label='Perfect fit')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
# Save the plot in main results directory
plot_file = os.path.join(main_results_dir, f"optimal_tensor_decomposition_visualization_rank_{optimal_rank}.png")
plt.savefig(plot_file, dpi=300, bbox_inches='tight')
print(f"\nOptimal visualization saved to: {plot_file}")
plt.show()
# Print interpretation guide
print("\n" + "="*60)
print("OPTIMAL VISUALIZATION INTERPRETATION GUIDE")
print("="*60)
print("1. Component Strengths: Shows which components capture most variation")
print("2. Temporal Patterns: How each component changes over time")
print("3. Species Patterns: How each individual loads on components")
print("4. Temporal Trajectories: Line plots of time patterns")
print("5. Species Clustering: PCA showing individual similarities")
print("6. Gene Variability: Distribution of gene response diversity")
print("7. Component Correlations: Independence of components")
print("8. Species Type Comparison: Average patterns by species type")
print("9. Temporal Variance: Which components show most time variation")
print("10. Species Variance: Which components show most individual variation")
print("11. Top Genes Heatmap: Key genes driving each component")
print("12. Reconstruction Quality: How well the optimal model fits the data")
else:
print("No optimal rank results found!")
# Suppress output
None4.3 Individual Plot 1: Component Strengths
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import glob
import os
import json
# Set up plotting
matplotlib.use('Agg') # Use non-interactive backend
plt.ioff() # Turn off interactive mode
# Get output directory from R environment
output_dir = r.output_dir
# Look for main results directory
if 'main_results_dir' in globals():
main_results_dir = globals()['main_results_dir']
else:
main_result_dirs = glob.glob(os.path.join(output_dir, "20*"))
if main_result_dirs:
main_results_dir = max(main_result_dirs)
else:
main_results_dir = None
# Get the optimal rank from the rank selection summary
if main_results_dir:
summary_file = os.path.join(main_results_dir, "rank_selection_summary.json")
if os.path.exists(summary_file):
with open(summary_file, 'r') as f:
rank_summary = json.load(f)
optimal_rank = rank_summary['suggested_rank']
print(f"Using optimal rank {optimal_rank} (from rank selection analysis)")
else:
print("No rank selection summary found! Cannot determine optimal rank.")
optimal_rank = None
main_results_dir = None
else:
print("No main results directory found!")
optimal_rank = None
# Look for the optimal rank data in rank_comparison directory
latest_dir = None
if main_results_dir and optimal_rank:
comparison_dir = os.path.join(main_results_dir, "rank_comparison")
optimal_rank_dir = os.path.join(comparison_dir, f"rank_{optimal_rank:02d}")
if os.path.exists(optimal_rank_dir):
# Find the timestamped results within the rank directory
result_dirs = glob.glob(os.path.join(optimal_rank_dir, "20*"))
if result_dirs:
latest_dir = max(result_dirs)
print(f"Using optimal rank {optimal_rank} data from: {latest_dir}")
else:
print(f"No timestamped results found in rank_{optimal_rank:02d} directory!")
else:
print(f"Optimal rank {optimal_rank} directory not found!")
if latest_dir:
# Load data
gene_factors = np.load(os.path.join(latest_dir, "gene_factors.npy"))
species_factors = np.load(os.path.join(latest_dir, "species_factors.npy"))
time_factors = np.load(os.path.join(latest_dir, "time_factors.npy"))
# Calculate component strengths
component_strengths = []
for i in range(gene_factors.shape[1]):
gene_norm = np.linalg.norm(gene_factors[:, i])
species_norm = np.linalg.norm(species_factors[:, i])
time_norm = np.linalg.norm(time_factors[:, i])
total_strength = gene_norm * species_norm * time_norm
component_strengths.append(total_strength)
# Sort components by strength
sorted_components = np.argsort(component_strengths)[::-1]
# Create plot
plt.figure(figsize=(12, 6))
plt.bar(range(1, len(component_strengths) + 1),
[component_strengths[i] for i in sorted_components])
plt.xlabel('Component (ranked by strength)')
plt.ylabel('Component Strength')
plt.title(f'Component Strengths (Optimal Rank {optimal_rank})')
# Set x-ticks to show all components if <= 10, otherwise show every 2nd or 5th
n_components = len(component_strengths)
if n_components <= 10:
plt.xticks(range(1, n_components + 1))
elif n_components <= 20:
plt.xticks(range(1, n_components + 1, 2))
else:
plt.xticks(range(1, n_components + 1, 5))
plt.grid(True, alpha=0.3)
# Save individual plot in main results directory
if main_results_dir:
plot_file = os.path.join(main_results_dir, f"plot_01_component_strengths_optimal_rank_{optimal_rank}.png")
plt.savefig(plot_file, dpi=300, bbox_inches='tight')
print(f"Component strengths plot saved to: {plot_file}")
plt.show()
else:
print(f"Could not load data for optimal rank {optimal_rank if optimal_rank else 'unknown'}")
# Suppress output
None4.4 Individual Plot 2: Temporal Patterns Heatmap
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import glob
import os
import json
# Set up plotting
matplotlib.use('Agg') # Use non-interactive backend
plt.ioff() # Turn off interactive mode
# Get output directory from R environment
output_dir = r.output_dir
# Look for main results directory
if 'main_results_dir' in globals():
main_results_dir = globals()['main_results_dir']
else:
main_result_dirs = glob.glob(os.path.join(output_dir, "20*"))
if main_result_dirs:
main_results_dir = max(main_result_dirs)
else:
main_results_dir = None
# Get the optimal rank from the rank selection summary
if main_results_dir:
summary_file = os.path.join(main_results_dir, "rank_selection_summary.json")
if os.path.exists(summary_file):
with open(summary_file, 'r') as f:
rank_summary = json.load(f)
optimal_rank = rank_summary['suggested_rank']
print(f"Using optimal rank {optimal_rank} (from rank selection analysis)")
else:
print("No rank selection summary found! Cannot determine optimal rank.")
optimal_rank = None
main_results_dir = None
else:
print("No main results directory found!")
optimal_rank = None
# Look for the optimal rank data in rank_comparison directory
latest_dir = None
if main_results_dir and optimal_rank:
comparison_dir = os.path.join(main_results_dir, "rank_comparison")
optimal_rank_dir = os.path.join(comparison_dir, f"rank_{optimal_rank:02d}")
if os.path.exists(optimal_rank_dir):
# Find the timestamped results within the rank directory
result_dirs = glob.glob(os.path.join(optimal_rank_dir, "20*"))
if result_dirs:
latest_dir = max(result_dirs)
print(f"Using optimal rank {optimal_rank} data from: {latest_dir}")
else:
print(f"No timestamped results found in rank_{optimal_rank:02d} directory!")
else:
print(f"Optimal rank {optimal_rank} directory not found!")
if latest_dir:
# Load data
gene_factors = np.load(os.path.join(latest_dir, "gene_factors.npy"))
species_factors = np.load(os.path.join(latest_dir, "species_factors.npy"))
time_factors = np.load(os.path.join(latest_dir, "time_factors.npy"))
# Load the transformed data to get timepoint names
data_file = os.path.join(output_dir, "vst_counts_matrix_long_format.csv")
df_long = pd.read_csv(data_file)
timepoint_names = sorted(df_long['timepoint'].unique())
# Calculate component strengths and get top 5
component_strengths = []
for i in range(gene_factors.shape[1]):
gene_norm = np.linalg.norm(gene_factors[:, i])
species_norm = np.linalg.norm(species_factors[:, i])
time_norm = np.linalg.norm(time_factors[:, i])
total_strength = gene_norm * species_norm * time_norm
component_strengths.append(total_strength)
sorted_components = np.argsort(component_strengths)[::-1]
top_5_components = sorted_components[:min(5, optimal_rank)] # Use min to handle ranks < 5
# Create heatmap
plt.figure(figsize=(10, 6))
time_data = time_factors[:, top_5_components]
# Add a labeled colorbar so readers know what the colors represent
sns.heatmap(time_data.T,
xticklabels=timepoint_names,
yticklabels=[f'Component {i+1}' for i in top_5_components],
cmap='RdBu_r', center=0, annot=True, fmt='.3f',
cbar=True, cbar_kws={'label': 'Time factor value (loading)'} ,
linewidths=0.5)
plt.title(f'Top {len(top_5_components)} Components: Temporal Patterns (Optimal Rank {optimal_rank})')
plt.xlabel('Timepoint')
plt.ylabel('Component')
# Save individual plot in main results directory
if main_results_dir:
plot_file = os.path.join(main_results_dir, f"plot_02_temporal_heatmap_optimal_rank_{optimal_rank}.png")
plt.savefig(plot_file, dpi=300, bbox_inches='tight')
print(f"Temporal patterns heatmap saved to: {plot_file}")
plt.show()
else:
print(f"Could not load data for optimal rank {optimal_rank if optimal_rank else 'unknown'}")
# Suppress output
None4.5 Individual Plot 3: Species Patterns Heatmap
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import glob
import os
import json
# Set up plotting
matplotlib.use('Agg') # Use non-interactive backend
plt.ioff() # Turn off interactive mode
# Get output directory from R environment
output_dir = r.output_dir
# Look for main results directory
if 'main_results_dir' in globals():
main_results_dir = globals()['main_results_dir']
else:
main_result_dirs = glob.glob(os.path.join(output_dir, "20*"))
if main_result_dirs:
main_results_dir = max(main_result_dirs)
else:
main_results_dir = None
# Get the optimal rank from the rank selection summary
if main_results_dir:
summary_file = os.path.join(main_results_dir, "rank_selection_summary.json")
if os.path.exists(summary_file):
with open(summary_file, 'r') as f:
rank_summary = json.load(f)
optimal_rank = rank_summary['suggested_rank']
print(f"Using optimal rank {optimal_rank} (from rank selection analysis)")
else:
print("No rank selection summary found! Cannot determine optimal rank.")
optimal_rank = None
main_results_dir = None
else:
print("No main results directory found!")
optimal_rank = None
# Look for the optimal rank data in rank_comparison directory
latest_dir = None
if main_results_dir and optimal_rank:
comparison_dir = os.path.join(main_results_dir, "rank_comparison")
optimal_rank_dir = os.path.join(comparison_dir, f"rank_{optimal_rank:02d}")
if os.path.exists(optimal_rank_dir):
# Find the timestamped results within the rank directory
result_dirs = glob.glob(os.path.join(optimal_rank_dir, "20*"))
if result_dirs:
latest_dir = max(result_dirs)
print(f"Using optimal rank {optimal_rank} data from: {latest_dir}")
else:
print(f"No timestamped results found in rank_{optimal_rank:02d} directory!")
else:
print(f"Optimal rank {optimal_rank} directory not found!")
if latest_dir:
# Load data
gene_factors = np.load(os.path.join(latest_dir, "gene_factors.npy"))
species_factors = np.load(os.path.join(latest_dir, "species_factors.npy"))
time_factors = np.load(os.path.join(latest_dir, "time_factors.npy"))
# Load the transformed data to get species names
data_file = os.path.join(output_dir, "vst_counts_matrix_long_format.csv")
df_long = pd.read_csv(data_file)
species_names = sorted(df_long['species'].unique())
# Calculate component strengths and get top 5
component_strengths = []
for i in range(gene_factors.shape[1]):
gene_norm = np.linalg.norm(gene_factors[:, i])
species_norm = np.linalg.norm(species_factors[:, i])
time_norm = np.linalg.norm(time_factors[:, i])
total_strength = gene_norm * species_norm * time_norm
component_strengths.append(total_strength)
sorted_components = np.argsort(component_strengths)[::-1]
top_5_components = sorted_components[:min(5, optimal_rank)] # Use min to handle ranks < 5
# Create heatmap
plt.figure(figsize=(16, 6))
species_data = species_factors[:, top_5_components]
# Add a labeled colorbar so readers know what the colors represent
sns.heatmap(species_data.T,
xticklabels=[f"{name[:8]}..." if len(name) > 8 else name for name in species_names],
yticklabels=[f'Component {i+1}' for i in top_5_components],
cmap='RdBu_r', center=0,
cbar=True, cbar_kws={'label': 'Species factor value (loading)'},
linewidths=0.5)
plt.title(f'Top {len(top_5_components)} Components: Species Patterns (Optimal Rank {optimal_rank})')
plt.xlabel('Individual')
plt.ylabel('Component')
plt.xticks(rotation=90)
# Save individual plot in main results directory
if main_results_dir:
plot_file = os.path.join(main_results_dir, f"plot_03_species_heatmap_optimal_rank_{optimal_rank}.png")
plt.savefig(plot_file, dpi=300, bbox_inches='tight')
print(f"Species patterns heatmap saved to: {plot_file}")
plt.show()
else:
print(f"Could not load data for optimal rank {optimal_rank if optimal_rank else 'unknown'}")
# Suppress output
None4.6 Individual Plot 4: Temporal Trajectories
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import glob
import os
import json
# Set up plotting
matplotlib.use('Agg') # Use non-interactive backend
plt.ioff() # Turn off interactive mode
# Get output directory from R environment
output_dir = r.output_dir
# Look for main results directory
if 'main_results_dir' in globals():
main_results_dir = globals()['main_results_dir']
else:
main_result_dirs = glob.glob(os.path.join(output_dir, "20*"))
if main_result_dirs:
main_results_dir = max(main_result_dirs)
else:
main_results_dir = None
# Get the optimal rank from the rank selection summary
if main_results_dir:
summary_file = os.path.join(main_results_dir, "rank_selection_summary.json")
if os.path.exists(summary_file):
with open(summary_file, 'r') as f:
rank_summary = json.load(f)
optimal_rank = rank_summary['suggested_rank']
print(f"Using optimal rank {optimal_rank} (from rank selection analysis)")
else:
print("No rank selection summary found! Cannot determine optimal rank.")
optimal_rank = None
main_results_dir = None
else:
print("No main results directory found!")
optimal_rank = None
# Look for the optimal rank data in rank_comparison directory
latest_dir = None
if main_results_dir and optimal_rank:
comparison_dir = os.path.join(main_results_dir, "rank_comparison")
optimal_rank_dir = os.path.join(comparison_dir, f"rank_{optimal_rank:02d}")
if os.path.exists(optimal_rank_dir):
# Find the timestamped results within the rank directory
result_dirs = glob.glob(os.path.join(optimal_rank_dir, "20*"))
if result_dirs:
latest_dir = max(result_dirs)
print(f"Using optimal rank {optimal_rank} data from: {latest_dir}")
else:
print(f"No timestamped results found in rank_{optimal_rank:02d} directory!")
else:
print(f"Optimal rank {optimal_rank} directory not found!")
if latest_dir:
# Load data
gene_factors = np.load(os.path.join(latest_dir, "gene_factors.npy"))
species_factors = np.load(os.path.join(latest_dir, "species_factors.npy"))
time_factors = np.load(os.path.join(latest_dir, "time_factors.npy"))
# Load the transformed data to get timepoint names
data_file = os.path.join(output_dir, "vst_counts_matrix_long_format.csv")
df_long = pd.read_csv(data_file)
timepoint_names = sorted(df_long['timepoint'].unique())
# Calculate component strengths and get top 5
component_strengths = []
for i in range(gene_factors.shape[1]):
gene_norm = np.linalg.norm(gene_factors[:, i])
species_norm = np.linalg.norm(species_factors[:, i])
time_norm = np.linalg.norm(time_factors[:, i])
total_strength = gene_norm * species_norm * time_norm
component_strengths.append(total_strength)
sorted_components = np.argsort(component_strengths)[::-1]
top_5_components = sorted_components[:min(5, optimal_rank)] # Use min to handle ranks < 5
# Create line plot
plt.figure(figsize=(12, 6))
for i, comp_idx in enumerate(top_5_components):
plt.plot(range(1, len(timepoint_names) + 1),
time_factors[:, comp_idx],
marker='o', linewidth=2, label=f'Component {comp_idx + 1}')
plt.xlabel('Timepoint')
plt.ylabel('Factor Value')
plt.title(f'Temporal Trajectories (Top {len(top_5_components)} Components, Optimal Rank {optimal_rank})')
plt.xticks(range(1, len(timepoint_names) + 1), timepoint_names)
plt.legend()
plt.grid(True, alpha=0.3)
# Save individual plot in main results directory
if main_results_dir:
plot_file = os.path.join(main_results_dir, f"plot_04_temporal_trajectories_optimal_rank_{optimal_rank}.png")
plt.savefig(plot_file, dpi=300, bbox_inches='tight')
print(f"Temporal trajectories plot saved to: {plot_file}")
plt.show()
else:
print(f"Could not load data for optimal rank {optimal_rank if optimal_rank else 'unknown'}")
# Suppress output
None4.7 Individual Plot 5: Species Clustering PCA
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import glob
import os
import json
# Set up plotting
matplotlib.use('Agg') # Use non-interactive backend
plt.ioff() # Turn off interactive mode
# Get output directory from R environment
output_dir = r.output_dir
# Look for main results directory
if 'main_results_dir' in globals():
main_results_dir = globals()['main_results_dir']
else:
main_result_dirs = glob.glob(os.path.join(output_dir, "20*"))
if main_result_dirs:
main_results_dir = max(main_result_dirs)
else:
main_results_dir = None
# Get the optimal rank from the rank selection summary
if main_results_dir:
summary_file = os.path.join(main_results_dir, "rank_selection_summary.json")
if os.path.exists(summary_file):
with open(summary_file, 'r') as f:
rank_summary = json.load(f)
optimal_rank = rank_summary['suggested_rank']
print(f"Using optimal rank {optimal_rank} (from rank selection analysis)")
else:
print("No rank selection summary found! Cannot determine optimal rank.")
optimal_rank = None
main_results_dir = None
else:
print("No main results directory found!")
optimal_rank = None
# Look for the optimal rank data in rank_comparison directory
latest_dir = None
if main_results_dir and optimal_rank:
comparison_dir = os.path.join(main_results_dir, "rank_comparison")
optimal_rank_dir = os.path.join(comparison_dir, f"rank_{optimal_rank:02d}")
if os.path.exists(optimal_rank_dir):
# Find the timestamped results within the rank directory
result_dirs = glob.glob(os.path.join(optimal_rank_dir, "20*"))
if result_dirs:
latest_dir = max(result_dirs)
print(f"Using optimal rank {optimal_rank} data from: {latest_dir}")
else:
print(f"No timestamped results found in rank_{optimal_rank:02d} directory!")
else:
print(f"Optimal rank {optimal_rank} directory not found!")
if latest_dir:
# Load data
gene_factors = np.load(os.path.join(latest_dir, "gene_factors.npy"))
species_factors = np.load(os.path.join(latest_dir, "species_factors.npy"))
time_factors = np.load(os.path.join(latest_dir, "time_factors.npy"))
# Load the transformed data to get species names
data_file = os.path.join(output_dir, "vst_counts_matrix_long_format.csv")
df_long = pd.read_csv(data_file)
species_names = sorted(df_long['species'].unique())
# Create species type labels (same as comprehensive visualization)
species_types = [name.split('-')[0] for name in species_names]
unique_types = list(set(species_types))
# Calculate component strengths (same as comprehensive visualization)
component_strengths = []
for i in range(gene_factors.shape[1]):
gene_norm = np.linalg.norm(gene_factors[:, i])
species_norm = np.linalg.norm(species_factors[:, i])
time_norm = np.linalg.norm(time_factors[:, i])
total_strength = gene_norm * species_norm * time_norm
component_strengths.append(total_strength)
sorted_components = np.argsort(component_strengths)[::-1]
# Use the same logic as comprehensive visualization
rank = gene_factors.shape[1] # This is the actual rank (number of components)
if rank >= 2:
n_pca_components = min(rank, 10)
species_subset = species_factors[:, sorted_components[:n_pca_components]]
pca = PCA(n_components=2)
species_pca = pca.fit_transform(species_subset)
# Create plot (same styling as comprehensive visualization)
plt.figure(figsize=(10, 8))
colors = ['red', 'blue', 'green', 'orange', 'purple'][:len(unique_types)]
species_type_map = {stype: i for i, stype in enumerate(unique_types)}
for i, (x, y) in enumerate(species_pca):
stype = species_types[i]
plt.scatter(x, y, c=colors[species_type_map[stype]],
alpha=0.7, s=40,
label=stype if i == species_types.index(stype) else "")
plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%})')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%})')
plt.title(f'Species PCA (Optimal Rank {optimal_rank})')
plt.legend()
plt.grid(True, alpha=0.3)
# Save to main results directory for consistency
if main_results_dir:
plot_file = os.path.join(main_results_dir, f"plot_05_species_clustering_optimal_rank_{optimal_rank}.png")
plt.savefig(plot_file, dpi=300, bbox_inches='tight')
print(f"Species clustering PCA plot saved to: {plot_file}")
plt.show()
else:
print(f"Optimal rank {optimal_rank} has insufficient components for PCA visualization")
else:
print(f"Could not load data for optimal rank {optimal_rank if optimal_rank else 'unknown'}")
# Suppress output
None4.8 Individual Plot 6: Gene Factor Distribution
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import glob
import os
import json
# Set up plotting
matplotlib.use('Agg') # Use non-interactive backend
plt.ioff() # Turn off interactive mode
# Get output directory from R environment
output_dir = r.output_dir
# Look for main results directory
if 'main_results_dir' in globals():
main_results_dir = globals()['main_results_dir']
else:
main_result_dirs = glob.glob(os.path.join(output_dir, "20*"))
if main_result_dirs:
main_results_dir = max(main_result_dirs)
else:
main_results_dir = None
# Get the optimal rank from the rank selection summary
if main_results_dir:
summary_file = os.path.join(main_results_dir, "rank_selection_summary.json")
if os.path.exists(summary_file):
with open(summary_file, 'r') as f:
rank_summary = json.load(f)
optimal_rank = rank_summary['suggested_rank']
print(f"Using optimal rank {optimal_rank} (from rank selection analysis)")
else:
print("No rank selection summary found! Cannot determine optimal rank.")
optimal_rank = None
main_results_dir = None
else:
print("No main results directory found!")
optimal_rank = None
# Look for the optimal rank data in rank_comparison directory
latest_dir = None
if main_results_dir and optimal_rank:
comparison_dir = os.path.join(main_results_dir, "rank_comparison")
optimal_rank_dir = os.path.join(comparison_dir, f"rank_{optimal_rank:02d}")
if os.path.exists(optimal_rank_dir):
# Find the timestamped results within the rank directory
result_dirs = glob.glob(os.path.join(optimal_rank_dir, "20*"))
if result_dirs:
latest_dir = max(result_dirs)
print(f"Using optimal rank {optimal_rank} data from: {latest_dir}")
else:
print(f"No timestamped results found in rank_{optimal_rank:02d} directory!")
else:
print(f"Optimal rank {optimal_rank} directory not found!")
if latest_dir:
# Load data
gene_factors = np.load(os.path.join(latest_dir, "gene_factors.npy"))
# Calculate gene variances
gene_variances = np.var(gene_factors, axis=1)
# Create histogram
plt.figure(figsize=(10, 6))
plt.hist(gene_variances, bins=50, alpha=0.7, edgecolor='black')
plt.xlabel('Gene Factor Variance')
plt.ylabel('Number of Genes')
plt.title(f'Distribution of Gene Factor Variability (Optimal Rank {optimal_rank})')
plt.grid(True, alpha=0.3)
# Add statistics
plt.axvline(np.mean(gene_variances), color='red', linestyle='--',
label=f'Mean: {np.mean(gene_variances):.3f}')
plt.axvline(np.median(gene_variances), color='orange', linestyle='--',
label=f'Median: {np.median(gene_variances):.3f}')
plt.legend()
# Save individual plot in main results directory
if main_results_dir:
plot_file = os.path.join(main_results_dir, f"plot_06_gene_distribution_optimal_rank_{optimal_rank}.png")
plt.savefig(plot_file, dpi=300, bbox_inches='tight')
print(f"Gene factor distribution plot saved to: {plot_file}")
plt.show()
else:
print(f"Could not load data for optimal rank {optimal_rank if optimal_rank else 'unknown'}")
# Suppress output
None4.9 Individual Plot 7: Component Correlations
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import glob
import os
import json
# Set up plotting
matplotlib.use('Agg') # Use non-interactive backend
plt.ioff() # Turn off interactive mode
# Get output directory from R environment
output_dir = r.output_dir
# Look for main results directory
if 'main_results_dir' in globals():
main_results_dir = globals()['main_results_dir']
else:
main_result_dirs = glob.glob(os.path.join(output_dir, "20*"))
if main_result_dirs:
main_results_dir = max(main_result_dirs)
else:
main_results_dir = None
# Get the optimal rank from the rank selection summary
if main_results_dir:
summary_file = os.path.join(main_results_dir, "rank_selection_summary.json")
if os.path.exists(summary_file):
with open(summary_file, 'r') as f:
rank_summary = json.load(f)
optimal_rank = rank_summary['suggested_rank']
print(f"Using optimal rank {optimal_rank} (from rank selection analysis)")
else:
print("No rank selection summary found! Cannot determine optimal rank.")
optimal_rank = None
main_results_dir = None
else:
print("No main results directory found!")
optimal_rank = None
# Look for the optimal rank data in rank_comparison directory
latest_dir = None
if main_results_dir and optimal_rank:
comparison_dir = os.path.join(main_results_dir, "rank_comparison")
optimal_rank_dir = os.path.join(comparison_dir, f"rank_{optimal_rank:02d}")
if os.path.exists(optimal_rank_dir):
# Find the timestamped results within the rank directory
result_dirs = glob.glob(os.path.join(optimal_rank_dir, "20*"))
if result_dirs:
latest_dir = max(result_dirs)
print(f"Using optimal rank {optimal_rank} data from: {latest_dir}")
else:
print(f"No timestamped results found in rank_{optimal_rank:02d} directory!")
else:
print(f"Optimal rank {optimal_rank} directory not found!")
if latest_dir:
# Load data
gene_factors = np.load(os.path.join(latest_dir, "gene_factors.npy"))
# Calculate correlations between components across genes
gene_corr = np.corrcoef(gene_factors.T)
# Create heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(gene_corr, cmap='RdBu_r', center=0,
xticklabels=range(1, gene_factors.shape[1] + 1),
yticklabels=range(1, gene_factors.shape[1] + 1),
square=True,
cbar_kws={'label': 'Correlation coefficient'})
plt.title(f'Component Correlations (Gene Space) - Optimal Rank {optimal_rank}')
plt.xlabel('Component')
plt.ylabel('Component')
# Save individual plot in main results directory
if main_results_dir:
plot_file = os.path.join(main_results_dir, f"plot_07_component_correlations_optimal_rank_{optimal_rank}.png")
plt.savefig(plot_file, dpi=300, bbox_inches='tight')
print(f"Component correlations plot saved to: {plot_file}")
plt.show()
else:
print(f"Could not load data for optimal rank {optimal_rank if optimal_rank else 'unknown'}")
# Suppress output
None4.10 Individual Plot 8: Species Type Comparison
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import glob
import os
import json
# Set up plotting
matplotlib.use('Agg') # Use non-interactive backend
plt.ioff() # Turn off interactive mode
# Get output directory from R environment
output_dir = r.output_dir
# Look for main results directory
if 'main_results_dir' in globals():
main_results_dir = globals()['main_results_dir']
else:
main_result_dirs = glob.glob(os.path.join(output_dir, "20*"))
if main_result_dirs:
main_results_dir = max(main_result_dirs)
else:
main_results_dir = None
# Get the optimal rank from the rank selection summary
if main_results_dir:
summary_file = os.path.join(main_results_dir, "rank_selection_summary.json")
if os.path.exists(summary_file):
with open(summary_file, 'r') as f:
rank_summary = json.load(f)
optimal_rank = rank_summary['suggested_rank']
print(f"Using optimal rank {optimal_rank} (from rank selection analysis)")
else:
print("No rank selection summary found! Cannot determine optimal rank.")
optimal_rank = None
main_results_dir = None
else:
print("No main results directory found!")
optimal_rank = None
# Look for the optimal rank data in rank_comparison directory
latest_dir = None
if main_results_dir and optimal_rank:
comparison_dir = os.path.join(main_results_dir, "rank_comparison")
optimal_rank_dir = os.path.join(comparison_dir, f"rank_{optimal_rank:02d}")
if os.path.exists(optimal_rank_dir):
# Find the timestamped results within the rank directory
result_dirs = glob.glob(os.path.join(optimal_rank_dir, "20*"))
if result_dirs:
latest_dir = max(result_dirs)
print(f"Using optimal rank {optimal_rank} data from: {latest_dir}")
else:
print(f"No timestamped results found in rank_{optimal_rank:02d} directory!")
else:
print(f"Optimal rank {optimal_rank} directory not found!")
if latest_dir:
# Load data
gene_factors = np.load(os.path.join(latest_dir, "gene_factors.npy"))
species_factors = np.load(os.path.join(latest_dir, "species_factors.npy"))
time_factors = np.load(os.path.join(latest_dir, "time_factors.npy"))
# Load the transformed data to get species names
data_file = os.path.join(output_dir, "vst_counts_matrix_long_format.csv")
df_long = pd.read_csv(data_file)
species_names = sorted(df_long['species'].unique())
# Calculate component strengths and get top 5
component_strengths = []
for i in range(gene_factors.shape[1]):
gene_norm = np.linalg.norm(gene_factors[:, i])
species_norm = np.linalg.norm(species_factors[:, i])
time_norm = np.linalg.norm(time_factors[:, i])
total_strength = gene_norm * species_norm * time_norm
component_strengths.append(total_strength)
sorted_components = np.argsort(component_strengths)[::-1]
top_5_components = sorted_components[:min(5, optimal_rank)] # Use min to handle ranks < 5
# Group species by type
species_by_type = {}
for i, name in enumerate(species_names):
stype = name.split('-')[0]
if stype not in species_by_type:
species_by_type[stype] = []
species_by_type[stype].append(i)
# Calculate mean factor values per species type for top components
type_means = {}
for stype, indices in species_by_type.items():
type_means[stype] = np.mean(species_factors[indices, :], axis=0)
# Create bar plot
plt.figure(figsize=(12, 6))
x_pos = np.arange(len(top_5_components))
width = 0.8 / len(type_means) # Adjust width based on number of species types
for i, (stype, means) in enumerate(type_means.items()):
plt.bar(x_pos + i * width, means[top_5_components],
width, label=stype, alpha=0.8)
plt.xlabel('Component')
plt.ylabel('Mean Factor Value')
plt.title(f'Species Type Comparison (Top {len(top_5_components)} Components) - Optimal Rank {optimal_rank}')
plt.xticks(x_pos + width * (len(type_means) - 1) / 2,
[f'Comp {i+1}' for i in top_5_components])
plt.legend()
plt.grid(True, alpha=0.3)
# Save individual plot in main results directory
if main_results_dir:
plot_file = os.path.join(main_results_dir, f"plot_08_species_comparison_optimal_rank_{optimal_rank}.png")
plt.savefig(plot_file, dpi=300, bbox_inches='tight')
print(f"Species type comparison plot saved to: {plot_file}")
plt.show()
else:
print(f"Could not load data for optimal rank {optimal_rank if optimal_rank else 'unknown'}")
# Suppress output
None4.11 Individual Plot 9: Temporal Variance per Component
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import glob
import os
import json
# Set up plotting
matplotlib.use('Agg') # Use non-interactive backend
plt.ioff() # Turn off interactive mode
# Get output directory from R environment
output_dir = r.output_dir
# Look for main results directory
if 'main_results_dir' in globals():
main_results_dir = globals()['main_results_dir']
else:
main_result_dirs = glob.glob(os.path.join(output_dir, "20*"))
if main_result_dirs:
main_results_dir = max(main_result_dirs)
else:
main_results_dir = None
# Get the optimal rank from the rank selection summary
if main_results_dir:
summary_file = os.path.join(main_results_dir, "rank_selection_summary.json")
if os.path.exists(summary_file):
with open(summary_file, 'r') as f:
rank_summary = json.load(f)
optimal_rank = rank_summary['suggested_rank']
print(f"Using optimal rank {optimal_rank} (from rank selection analysis)")
else:
print("No rank selection summary found! Cannot determine optimal rank.")
optimal_rank = None
main_results_dir = None
else:
print("No main results directory found!")
optimal_rank = None
# Look for the optimal rank data in rank_comparison directory
latest_dir = None
if main_results_dir and optimal_rank:
comparison_dir = os.path.join(main_results_dir, "rank_comparison")
optimal_rank_dir = os.path.join(comparison_dir, f"rank_{optimal_rank:02d}")
if os.path.exists(optimal_rank_dir):
# Find the timestamped results within the rank directory
result_dirs = glob.glob(os.path.join(optimal_rank_dir, "20*"))
if result_dirs:
latest_dir = max(result_dirs)
print(f"Using optimal rank {optimal_rank} data from: {latest_dir}")
else:
print(f"No timestamped results found in rank_{optimal_rank:02d} directory!")
else:
print(f"Optimal rank {optimal_rank} directory not found!")
if latest_dir:
# Load data
time_factors = np.load(os.path.join(latest_dir, "time_factors.npy"))
# Calculate temporal variances
time_variances = np.var(time_factors, axis=0)
sorted_time_vars = np.argsort(time_variances)[::-1]
# Create bar plot
plt.figure(figsize=(12, 6))
plt.bar(range(1, len(time_variances) + 1),
[time_variances[i] for i in sorted_time_vars])
plt.xlabel('Component (ranked by temporal variance)')
plt.ylabel('Temporal Variance')
plt.title(f'Components by Temporal Variability - Optimal Rank {optimal_rank}')
# Set x-ticks to show all components if <= 10, otherwise show every 2nd or 5th
n_components = len(time_variances)
if n_components <= 10:
plt.xticks(range(1, n_components + 1))
elif n_components <= 20:
plt.xticks(range(1, n_components + 1, 2))
else:
plt.xticks(range(1, n_components + 1, 5))
plt.grid(True, alpha=0.3)
# Save individual plot in main results directory
if main_results_dir:
plot_file = os.path.join(main_results_dir, f"plot_09_temporal_variance_optimal_rank_{optimal_rank}.png")
plt.savefig(plot_file, dpi=300, bbox_inches='tight')
print(f"Temporal variance plot saved to: {plot_file}")
plt.show()
else:
print(f"Could not load data for optimal rank {optimal_rank if optimal_rank else 'unknown'}")
# Suppress output
None4.12 Individual Plot 10: Species Variance per Component
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import glob
import os
import json
# Set up plotting
matplotlib.use('Agg') # Use non-interactive backend
plt.ioff() # Turn off interactive mode
# Get output directory from R environment
output_dir = r.output_dir
# Look for main results directory
if 'main_results_dir' in globals():
main_results_dir = globals()['main_results_dir']
else:
main_result_dirs = glob.glob(os.path.join(output_dir, "20*"))
if main_result_dirs:
main_results_dir = max(main_result_dirs)
else:
main_results_dir = None
# Get the optimal rank from the rank selection summary
if main_results_dir:
summary_file = os.path.join(main_results_dir, "rank_selection_summary.json")
if os.path.exists(summary_file):
with open(summary_file, 'r') as f:
rank_summary = json.load(f)
optimal_rank = rank_summary['suggested_rank']
print(f"Using optimal rank {optimal_rank} (from rank selection analysis)")
else:
print("No rank selection summary found! Cannot determine optimal rank.")
optimal_rank = None
main_results_dir = None
else:
print("No main results directory found!")
optimal_rank = None
# Look for the optimal rank data in rank_comparison directory
latest_dir = None
if main_results_dir and optimal_rank:
comparison_dir = os.path.join(main_results_dir, "rank_comparison")
optimal_rank_dir = os.path.join(comparison_dir, f"rank_{optimal_rank:02d}")
if os.path.exists(optimal_rank_dir):
# Find the timestamped results within the rank directory
result_dirs = glob.glob(os.path.join(optimal_rank_dir, "20*"))
if result_dirs:
latest_dir = max(result_dirs)
print(f"Using optimal rank {optimal_rank} data from: {latest_dir}")
else:
print(f"No timestamped results found in rank_{optimal_rank:02d} directory!")
else:
print(f"Optimal rank {optimal_rank} directory not found!")
if latest_dir:
# Load data
species_factors = np.load(os.path.join(latest_dir, "species_factors.npy"))
# Calculate species variances
species_variances = np.var(species_factors, axis=0)
sorted_species_vars = np.argsort(species_variances)[::-1]
# Create bar plot
plt.figure(figsize=(12, 6))
plt.bar(range(1, len(species_variances) + 1),
[species_variances[i] for i in sorted_species_vars])
plt.xlabel('Component (ranked by species variance)')
plt.ylabel('Species Variance')
plt.title(f'Components by Species Variability - Optimal Rank {optimal_rank}')
# Set x-ticks to show all components if <= 10, otherwise show every 2nd or 5th
n_components = len(species_variances)
if n_components <= 10:
plt.xticks(range(1, n_components + 1))
elif n_components <= 20:
plt.xticks(range(1, n_components + 1, 2))
else:
plt.xticks(range(1, n_components + 1, 5))
plt.grid(True, alpha=0.3)
# Save individual plot in main results directory
if main_results_dir:
plot_file = os.path.join(main_results_dir, f"plot_10_species_variance_optimal_rank_{optimal_rank}.png")
plt.savefig(plot_file, dpi=300, bbox_inches='tight')
print(f"Species variance plot saved to: {plot_file}")
plt.show()
else:
print(f"Could not load data for optimal rank {optimal_rank if optimal_rank else 'unknown'}")
# Suppress output
None4.13 Individual Plot 11: Top Genes Heatmap
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import glob
import os
import json
# Set up plotting
matplotlib.use('Agg') # Use non-interactive backend
plt.ioff() # Turn off interactive mode
# Get output directory from R environment
output_dir = r.output_dir
# Look for main results directory
if 'main_results_dir' in globals():
main_results_dir = globals()['main_results_dir']
else:
main_result_dirs = glob.glob(os.path.join(output_dir, "20*"))
if main_result_dirs:
main_results_dir = max(main_result_dirs)
else:
main_results_dir = None
# Get the optimal rank from the rank selection summary
if main_results_dir:
summary_file = os.path.join(main_results_dir, "rank_selection_summary.json")
if os.path.exists(summary_file):
with open(summary_file, 'r') as f:
rank_summary = json.load(f)
optimal_rank = rank_summary['suggested_rank']
print(f"Using optimal rank {optimal_rank} (from rank selection analysis)")
else:
print("No rank selection summary found! Cannot determine optimal rank.")
optimal_rank = None
main_results_dir = None
else:
print("No main results directory found!")
optimal_rank = None
# Look for the optimal rank data in rank_comparison directory
latest_dir = None
if main_results_dir and optimal_rank:
comparison_dir = os.path.join(main_results_dir, "rank_comparison")
optimal_rank_dir = os.path.join(comparison_dir, f"rank_{optimal_rank:02d}")
if os.path.exists(optimal_rank_dir):
# Find the timestamped results within the rank directory
result_dirs = glob.glob(os.path.join(optimal_rank_dir, "20*"))
if result_dirs:
latest_dir = max(result_dirs)
print(f"Using optimal rank {optimal_rank} data from: {latest_dir}")
else:
print(f"No timestamped results found in rank_{optimal_rank:02d} directory!")
else:
print(f"Optimal rank {optimal_rank} directory not found!")
if latest_dir:
# Load data
gene_factors = np.load(os.path.join(latest_dir, "gene_factors.npy"))
species_factors = np.load(os.path.join(latest_dir, "species_factors.npy"))
time_factors = np.load(os.path.join(latest_dir, "time_factors.npy"))
# Load the transformed data to get actual gene names
data_file = os.path.join(output_dir, "vst_counts_matrix_long_format.csv")
df_long = pd.read_csv(data_file)
gene_names = sorted(df_long['gene'].unique())
# Calculate component strengths and get top components
component_strengths = []
for i in range(gene_factors.shape[1]):
gene_norm = np.linalg.norm(gene_factors[:, i])
species_norm = np.linalg.norm(species_factors[:, i])
time_norm = np.linalg.norm(time_factors[:, i])
total_strength = gene_norm * species_norm * time_norm
component_strengths.append(total_strength)
sorted_components = np.argsort(component_strengths)[::-1]
top_components = sorted_components[:min(5, optimal_rank)] # Use min to handle ranks < 5
# Get top genes for each of the top components
n_top_genes = 10
top_gene_indices = []
for comp_idx in top_components:
top_genes = np.argsort(np.abs(gene_factors[:, comp_idx]))[-n_top_genes:]
top_gene_indices.extend(top_genes)
top_gene_indices = list(set(top_gene_indices)) # Remove duplicates
top_gene_data = gene_factors[top_gene_indices, :][:, top_components]
# Get actual gene names for the selected indices
top_gene_names = [gene_names[i] for i in top_gene_indices]
# Create heatmap
plt.figure(figsize=(10, 12))
sns.heatmap(top_gene_data,
cmap='RdBu_r', center=0,
yticklabels=top_gene_names,
xticklabels=[f'Comp {i+1}' for i in top_components],
cbar_kws={'label': 'Gene factor loading'})
plt.title(f'Top {len(top_gene_indices)} Genes × Top {len(top_components)} Components - Optimal Rank {optimal_rank}')
plt.xlabel('Component')
plt.ylabel('Gene')
# Save individual plot in main results directory
if main_results_dir:
plot_file = os.path.join(main_results_dir, f"plot_11_top_genes_optimal_rank_{optimal_rank}.png")
plt.savefig(plot_file, dpi=300, bbox_inches='tight')
print(f"Top genes heatmap saved to: {plot_file}")
plt.show()
else:
print(f"Could not load data for optimal rank {optimal_rank if optimal_rank else 'unknown'}")
# Suppress output
None4.14 Individual Plot 12: Reconstruction Quality
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import glob
import os
import json
# Set up plotting
matplotlib.use('Agg') # Use non-interactive backend
plt.ioff() # Turn off interactive mode
# Get output directory from R environment
output_dir = r.output_dir
# Look for main results directory
if 'main_results_dir' in globals():
main_results_dir = globals()['main_results_dir']
else:
main_result_dirs = glob.glob(os.path.join(output_dir, "20*"))
if main_result_dirs:
main_results_dir = max(main_result_dirs)
else:
main_results_dir = None
# Get the optimal rank from the rank selection summary
if main_results_dir:
summary_file = os.path.join(main_results_dir, "rank_selection_summary.json")
if os.path.exists(summary_file):
with open(summary_file, 'r') as f:
rank_summary = json.load(f)
optimal_rank = rank_summary['suggested_rank']
print(f"Using optimal rank {optimal_rank} (from rank selection analysis)")
else:
print("No rank selection summary found! Cannot determine optimal rank.")
optimal_rank = None
main_results_dir = None
else:
print("No main results directory found!")
optimal_rank = None
# Look for the optimal rank data in rank_comparison directory
latest_dir = None
if main_results_dir and optimal_rank:
comparison_dir = os.path.join(main_results_dir, "rank_comparison")
optimal_rank_dir = os.path.join(comparison_dir, f"rank_{optimal_rank:02d}")
if os.path.exists(optimal_rank_dir):
# Find the timestamped results within the rank directory
result_dirs = glob.glob(os.path.join(optimal_rank_dir, "20*"))
if result_dirs:
latest_dir = max(result_dirs)
print(f"Using optimal rank {optimal_rank} data from: {latest_dir}")
else:
print(f"No timestamped results found in rank_{optimal_rank:02d} directory!")
else:
print(f"Optimal rank {optimal_rank} directory not found!")
if latest_dir:
# Load data
gene_factors = np.load(os.path.join(latest_dir, "gene_factors.npy"))
species_factors = np.load(os.path.join(latest_dir, "species_factors.npy"))
time_factors = np.load(os.path.join(latest_dir, "time_factors.npy"))
reconstructed = np.load(os.path.join(latest_dir, "reconstructed_tensor.npy"))
# Load the transformed data
data_file = os.path.join(output_dir, "vst_counts_matrix_long_format.csv")
df_long = pd.read_csv(data_file)
# Create original tensor from the long format data
original_pivot = df_long.pivot_table(values='expression',
index='gene',
columns=['species', 'timepoint'],
fill_value=0)
# Check if we can safely reshape
expected_size = gene_factors.shape[0] * species_factors.shape[0] * time_factors.shape[0]
actual_size = original_pivot.size
plt.figure(figsize=(10, 6))
if actual_size == expected_size:
# Reshape the pivot table to match tensor format
n_genes, n_species, n_timepoints = gene_factors.shape[0], species_factors.shape[0], time_factors.shape[0]
original_reshaped = original_pivot.values.reshape(n_genes, n_species, n_timepoints)
# Calculate per-gene reconstruction error
gene_errors = np.mean(np.abs(original_reshaped - reconstructed), axis=(1, 2))
plt.hist(gene_errors, bins=50, alpha=0.7, edgecolor='black')
plt.xlabel('Mean Absolute Error per Gene')
plt.ylabel('Number of Genes')
plt.title(f'Reconstruction Error Distribution - Optimal Rank {optimal_rank}')
plt.axvline(np.mean(gene_errors), color='red', linestyle='--',
label=f'Mean: {np.mean(gene_errors):.3f}')
plt.legend()
plt.grid(True, alpha=0.3)
else:
# If reshape fails, show a simpler quality metric
original_flat = original_pivot.values.flatten()
reconstructed_flat = reconstructed.flatten()
# Ensure both arrays are the same size by taking the minimum
min_size = min(len(original_flat), len(reconstructed_flat))
original_flat = original_flat[:min_size]
reconstructed_flat = reconstructed_flat[:min_size]
residuals = original_flat - reconstructed_flat
plt.hist(residuals, bins=50, alpha=0.7, edgecolor='black')
plt.xlabel('Reconstruction Residuals')
plt.ylabel('Frequency')
plt.title(f'Reconstruction Residuals Distribution - Optimal Rank {optimal_rank}')
plt.axvline(np.mean(residuals), color='red', linestyle='--',
label=f'Mean: {np.mean(residuals):.3f}')
plt.axvline(0, color='black', linestyle=':', alpha=0.5, label='Perfect fit')
plt.legend()
plt.grid(True, alpha=0.3)
# Save individual plot in main results directory
if main_results_dir:
plot_file = os.path.join(main_results_dir, f"plot_12_reconstruction_quality_optimal_rank_{optimal_rank}.png")
plt.savefig(plot_file, dpi=300, bbox_inches='tight')
print(f"Reconstruction quality plot saved to: {plot_file}")
plt.show()
else:
print(f"Could not load data for optimal rank {optimal_rank if optimal_rank else 'unknown'}")
# Suppress output
None4.15 Interpret species clustering patterns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import glob
import os
# Get output directory from R environment
output_dir = r.output_dir
# Look for main results directory and optimal rank data (consistent with other plots)
latest_dir = None
# Look for main results directory
if 'main_results_dir' in globals():
main_results_dir = globals()['main_results_dir']
else:
main_result_dirs = glob.glob(os.path.join(output_dir, "20*"))
if main_result_dirs:
main_results_dir = max(main_result_dirs)
else:
main_results_dir = None
# Get the optimal rank from the rank selection summary (same as other plots)
if main_results_dir:
summary_file = os.path.join(main_results_dir, "rank_selection_summary.json")
if os.path.exists(summary_file):
with open(summary_file, 'r') as f:
rank_summary = json.load(f)
optimal_rank = rank_summary['suggested_rank']
print(f"Using optimal rank {optimal_rank} (from rank selection analysis)")
# Look for the optimal rank data in rank_comparison directory
comparison_dir = os.path.join(main_results_dir, "rank_comparison")
optimal_rank_dir = os.path.join(comparison_dir, f"rank_{optimal_rank:02d}")
if os.path.exists(optimal_rank_dir):
# Find the timestamped results within the rank directory
result_dirs = glob.glob(os.path.join(optimal_rank_dir, "20*"))
if result_dirs:
latest_dir = max(result_dirs)
print(f"Using optimal rank {optimal_rank} data from: {latest_dir}")
else:
print(f"No timestamped results found in rank_{optimal_rank:02d} directory!")
else:
print(f"Optimal rank {optimal_rank} directory not found!")
else:
print("No rank selection summary found! Cannot determine optimal rank.")
main_results_dir = None
else:
print("No main results directory found!")
if latest_dir:
# Load data
gene_factors = np.load(os.path.join(latest_dir, "gene_factors.npy"))
species_factors = np.load(os.path.join(latest_dir, "species_factors.npy"))
time_factors = np.load(os.path.join(latest_dir, "time_factors.npy"))
# Load the transformed data
data_file = os.path.join(output_dir, "vst_counts_matrix_long_format.csv")
df_long = pd.read_csv(data_file)
species_names = sorted(df_long['species'].unique())
species_types = [name.split('-')[0] for name in species_names]
unique_types = sorted(list(set(species_types)))
print("="*60)
print("SPECIES CLUSTERING INTERPRETATION")
print("="*60)
# 1. Quantify species separation
print("\n1. SPECIES SEPARATION ANALYSIS:")
# Group individuals by species type
species_groups = {}
for i, name in enumerate(species_names):
stype = name.split('-')[0]
if stype not in species_groups:
species_groups[stype] = []
species_groups[stype].append(i)
# Calculate within-species vs between-species distances
from scipy.spatial.distance import pdist, squareform
# Use all components for distance calculation
distances = squareform(pdist(species_factors, metric='euclidean'))
within_species_distances = []
between_species_distances = []
for i, species_i in enumerate(species_names):
for j, species_j in enumerate(species_names):
if i < j: # Avoid duplicates
type_i = species_i.split('-')[0]
type_j = species_j.split('-')[0]
if type_i == type_j:
within_species_distances.append(distances[i, j])
else:
between_species_distances.append(distances[i, j])
print(f" Within-species distances: {np.mean(within_species_distances):.3f} ± {np.std(within_species_distances):.3f}")
print(f" Between-species distances: {np.mean(between_species_distances):.3f} ± {np.std(between_species_distances):.3f}")
# Statistical test
t_stat, p_value = stats.ttest_ind(between_species_distances, within_species_distances)
print(f" T-test p-value: {p_value:.3e}")
if p_value < 0.05:
print(" → Species ARE significantly separated (but effect might be small)")
else:
print(" → Species are NOT significantly separated")
# 2. Variance decomposition
print("\n2. VARIANCE DECOMPOSITION:")
# Calculate total variance and partition it
total_var = np.var(species_factors, axis=0)
# Calculate within-species variance
within_var = np.zeros(species_factors.shape[1])
for stype, indices in species_groups.items():
if len(indices) > 1: # Need at least 2 individuals
group_var = np.var(species_factors[indices, :], axis=0)
within_var += group_var * len(indices) / len(species_names)
# Between-species variance
between_var = total_var - within_var
# Calculate percentages
within_percent = (np.mean(within_var) / np.mean(total_var)) * 100
between_percent = (np.mean(between_var) / np.mean(total_var)) * 100
print(f" Within-species variance: {within_percent:.1f}%")
print(f" Between-species variance: {between_percent:.1f}%")
# 3. Components that best separate species
print("\n3. COMPONENTS WITH STRONGEST SPECIES EFFECTS:")
# Calculate F-statistics for each component
f_stats = []
for comp in range(species_factors.shape[1]):
groups = [species_factors[indices, comp] for indices in species_groups.values()]
f_stat, p_val = stats.f_oneway(*groups)
f_stats.append((comp, f_stat, p_val))
# Sort by F-statistic
f_stats.sort(key=lambda x: x[1], reverse=True)
print(" Top 5 components by species separation:")
for i, (comp, f_stat, p_val) in enumerate(f_stats[:5]):
significance = "***" if p_val < 0.001 else "**" if p_val < 0.01 else "*" if p_val < 0.05 else ""
print(f" Component {comp+1}: F={f_stat:.2f}, p={p_val:.3e} {significance}")
# 4. Create detailed visualization
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
# Distance distributions
axes[0, 0].hist(within_species_distances, alpha=0.7, label='Within-species', bins=20)
axes[0, 0].hist(between_species_distances, alpha=0.7, label='Between-species', bins=20)
axes[0, 0].set_xlabel('Euclidean Distance')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Distance Distributions')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)
# Variance decomposition pie chart
axes[0, 1].pie([within_percent, between_percent],
labels=[f'Within-species\n({within_percent:.1f}%)',
f'Between-species\n({between_percent:.1f}%)'],
autopct='%1.1f%%', startangle=90)
axes[0, 1].set_title('Variance Decomposition')
# Box plots for top separating component
if f_stats:
top_comp = f_stats[0][0]
species_data = []
species_labels = []
for stype, indices in species_groups.items():
species_data.extend(species_factors[indices, top_comp])
species_labels.extend([stype] * len(indices))
df_comp = pd.DataFrame({'Species': species_labels, 'Factor_Value': species_data})
sns.boxplot(data=df_comp, x='Species', y='Factor_Value', ax=axes[1, 0])
axes[1, 0].set_title(f'Component {top_comp+1} by Species\n(Best Separating Component)')
axes[1, 0].grid(True, alpha=0.3)
# Individual count per species
species_counts = [len(indices) for indices in species_groups.values()]
axes[1, 1].bar(species_groups.keys(), species_counts)
axes[1, 1].set_xlabel('Species')
axes[1, 1].set_ylabel('Number of Individuals')
axes[1, 1].set_title('Sample Size per Species')
axes[1, 1].grid(True, alpha=0.3)
plt.tight_layout()
# Save the interpretation plot in the main results directory (consistent with other plots)
if main_results_dir:
interp_plot_file = os.path.join(main_results_dir, f"species_clustering_interpretation_optimal_rank_{optimal_rank}.png")
else:
interp_plot_file = os.path.join(latest_dir, "species_clustering_interpretation.png")
plt.savefig(interp_plot_file, dpi=300, bbox_inches='tight')
print(f"\nInterpretation plot saved to: {interp_plot_file}")
plt.show()
# 5. Biological interpretation
print("\n4. BIOLOGICAL INTERPRETATION:")
if between_percent < 20:
print(" → STRONG INDIVIDUAL VARIATION: Gene expression varies more between")
print(" individuals than between species. This suggests:")
print(" • Individual phenotypic plasticity is high")
print(" • Species have similar fundamental gene regulatory networks")
print(" • Environmental or developmental factors affect individuals")
print(" more than species identity")
elif between_percent < 40:
print(" → MODERATE SPECIES EFFECTS: Some species differences exist but")
print(" individual variation is still dominant. This suggests:")
print(" • Species have some distinct expression patterns")
print(" • Individual variation remains important")
print(" • Mixed evolutionary/ecological pressures")
else:
print(" → STRONG SPECIES EFFECTS: Clear species-specific expression")
print(" patterns exist. This suggests:")
print(" • Species have distinct gene regulatory strategies")
print(" • Strong evolutionary divergence")
print(" • Species-specific adaptations")
print(f"\n Sample sizes: {dict(zip(species_groups.keys(), [len(indices) for indices in species_groups.values()]))}")
if any(len(indices) < 5 for indices in species_groups.values()):
print(" ⚠️ WARNING: Some species have small sample sizes, which can")
print(" affect the ability to detect species-specific patterns.")
else:
print("No results directories found!")## Using optimal rank 5 (from rank selection analysis)
## Using optimal rank 5 data from: ../output/13.00-multiomics-stdm/20251015_190157/rank_comparison/rank_05/20251015_190351
## ============================================================
## SPECIES CLUSTERING INTERPRETATION
## ============================================================
##
## 1. SPECIES SEPARATION ANALYSIS:
## Within-species distances: 1.007 ± 1.225
## Between-species distances: 3.390 ± 0.860
## T-test p-value: 4.930e-78
## → Species ARE significantly separated (but effect might be small)
##
## 2. VARIANCE DECOMPOSITION:
## Within-species variance: 25.4%
## Between-species variance: 74.6%
##
## 3. COMPONENTS WITH STRONGEST SPECIES EFFECTS:
## Top 5 components by species separation:
## Component 1: F=2220.60, p=1.113e-30 ***
## Component 2: F=1124.54, p=1.003e-26 ***
## Component 4: F=39.46, p=9.691e-09 ***
## Component 5: F=3.78, p=3.575e-02 *
## Component 3: F=0.80, p=4.590e-01
## (array([39., 32., 22., 11., 5., 4., 3., 1., 0., 0., 0., 0., 0.,
## 0., 7., 2., 0., 0., 0., 9.]), array([0.10695279, 0.33056048, 0.55416817, 0.77777586, 1.00138356,
## 1.22499125, 1.44859894, 1.67220664, 1.89581433, 2.11942202,
## 2.34302971, 2.56663741, 2.7902451 , 3.01385279, 3.23746049,
## 3.46106818, 3.68467587, 3.90828356, 4.13189126, 4.35549895,
## 4.57910664]), <BarContainer object of 20 artists>)
## (array([10., 32., 45., 41., 19., 11., 16., 15., 17., 17., 19., 15., 6.,
## 3., 3., 3., 4., 12., 7., 5.]), array([2.21788468, 2.38773948, 2.55759428, 2.72744908, 2.89730388,
## 3.06715869, 3.23701349, 3.40686829, 3.57672309, 3.74657789,
## 3.91643269, 4.08628749, 4.25614229, 4.4259971 , 4.5958519 ,
## 4.7657067 , 4.9355615 , 5.1054163 , 5.2752711 , 5.4451259 ,
## 5.6149807 ]), <BarContainer object of 20 artists>)
## Text(0.5, 0, 'Euclidean Distance')
## Text(0, 0.5, 'Frequency')
## Text(0.5, 1.0, 'Distance Distributions')
## <matplotlib.legend.Legend object at 0x7c874132a690>
## ([<matplotlib.patches.Wedge object at 0x7c874136ab70>, <matplotlib.patches.Wedge object at 0x7c874208a300>], [Text(-0.7877726729698648, 0.7677331670065548, 'Within-species\n(25.4%)'), Text(0.7877729231290416, -0.7677329103175957, 'Between-species\n(74.6%)')], [Text(-0.42969418525628983, 0.41876354563993895, '25.4%'), Text(0.42969432170674987, -0.4187634056277794, '74.6%')])
## Text(0.5, 1.0, 'Variance Decomposition')
## <Axes: xlabel='Species', ylabel='Factor_Value'>
## Text(0.5, 1.0, 'Component 1 by Species\n(Best Separating Component)')
## <BarContainer object of 3 artists>
## Text(0.5, 0, 'Species')
## Text(0, 0.5, 'Number of Individuals')
## Text(0.5, 1.0, 'Sample Size per Species')
##
## Interpretation plot saved to: ../output/13.00-multiomics-stdm/20251015_190157/species_clustering_interpretation_optimal_rank_5.png
##
## 4. BIOLOGICAL INTERPRETATION:
## → STRONG SPECIES EFFECTS: Clear species-specific expression
## patterns exist. This suggests:
## • Species have distinct gene regulatory strategies
## • Strong evolutionary divergence
## • Species-specific adaptations
##
## Sample sizes: {'ACR': 10, 'POC': 10, 'POR': 10}
