Merge branch 'structure'

added to code readme
added german version
2024-10-18 14:08:09 +02:00 · 2024-10-18 14:07:53 +02:00 · 2024-10-17 16:31:53 +02:00
3 changed files with 513 additions and 28 deletions
--- a/README.md
+++ b/README.md
@@ -1,6 +1,8 @@
-# Data Analysis Project Structure Guide
+# Data Analysis Structure Guide
-Welcome to the Data Analysis Project Structure Guide! This guide is designed to
+**German version**: [README_DE.md](README_DE.md)
 Welcome to our Project Structure Guide! This guide is designed to
 help new students understand how to structure data analysis projects in Python
 effectively. By following these best practices, you'll create projects that are
 organized, maintainable, and reproducible.
--- a/README_DE.md
+++ b/README_DE.md
@@ -0,0 +1,223 @@
 # Leitfaden zur Datenanalyse-Struktur
 Willkommen zu diesem Leitfaden zur Projektstruktur! Dieser Leitfaden soll dir helfen zu verstehen, wie du Datenanalyseprojekte in Python effektiv strukturierst.
 ## Inhaltsverzeichnis
 1. [Projektstruktur](#projektstruktur)
   - [Trennung von Daten und Code](#trennung-von-daten-und-code)
   - [Trennung von Abbildungen und Code](#trennung-von-abbildungen-und-code)
 2. [Versionskontrolle mit Git](#versionskontrolle-mit-git)
   - [Grundlegende Git-Befehle](#grundlegende-git-befehle)
 3. [Best Practices für Datenanalyseprojekte](#best-practices-für-datenanalyseprojekte)
 4. [Zusätzliche Ressourcen](#zusätzliche-ressourcen)
 5. [Fazit](#fazit)
 ---
 ## Projektstruktur
 Eine gut organisierte Projektstruktur ist entscheidend für Zusammenarbeit und Skalierbarkeit. Hier ist ein empfohlenes Verzeichnislayout:
 ```
 project-name/
 ├── data/
 │   ├── model_weights/          # Trainierte Modellgewichte
 │   ├── raw/                    # Originale, unveränderte Datensätze
 │   └── processed/              # Bereinigte oder transformierte Daten
 ├── code/
 │   └── my_python_program.py    # Python-Skripte
 ├── figures/
 ├── docs/
 │   ├── my_fancy_latex_thesis/  # LaTeX-Dateien für die Abschlussarbeit (empfohlen)
 │   ├── my_presentation.pptx    # Präsentationsfolien
 │   └── my_thesis.docx          # Word-Dokument (nicht empfohlen)
 ├── .gitignore                  # Dateien und Verzeichnisse, die von Git ignoriert werden sollen
 ├── README.md                   # Projektübersicht und Anleitungen
 └── requirements.txt            # Python-Abhängigkeiten
 ```
 ### Trennung von Daten und Code
 - **Datenverzeichnis (`data/`)**: Speichere hier alle deine Datensätze.
  - `raw/`: Originale, unveränderte Datensätze.
  - `processed/`: Daten, die bereinigt oder transformiert wurden.
 - **Quellcode-Verzeichnis (`code/`)**: Enthält alle Codeskripte und Module.
 **Vorteile:**
 - **Organisation**: Hält Daten getrennt vom Code, was die Verwaltung erleichtert.
 - **Reproduzierbarkeit**: Klare Trennung stellt sicher, dass Datenverarbeitungsschritte dokumentiert und wiederholbar sind.
 - **Zusammenarbeit**: Du und alle Mitwirkenden könnt die verschiedenen Komponenten des Projekts leicht finden und verstehen.
 ### Trennung von Abbildungen und Code
 - **Abbildungsverzeichnis (`figures/`)**: Speichere hier alle generierten Plots, Bilder und Visualisierungen.
 **Vorteile:**
 - **Klarheit**: Trennt Ausgaben vom Code und reduziert Unordnung.
 - **Versionskontrolle**: Einfachere Nachverfolgung von Änderungen im Code ohne große Binärdateien wie Bilder.
 - **Präsentation**: Vereinfacht das Erstellen von Berichten oder Präsentationen, indem alle Abbildungen an einem Ort gesammelt sind.
 ---
 ## Versionskontrolle mit Git
 Git ist ein leistungsstarkes Versionskontrollsystem, das dir hilft, Änderungen zu verfolgen, mit anderen zusammenzuarbeiten und die Historie deines Projekts zu verwalten. Aber was ist Versionskontrolle? Hast du jemals Dateien wie `project_final_v2.py` oder `project_final_final.py` erstellt? Versionskontrolle löst dieses Problem, indem sie Änderungen verfolgt und dir ermöglicht, zu früheren Versionen zurückzukehren. Als Bonus hast du auch ein Backup deines Projekts, falls etwas schiefgeht.
 ### Grundlegende Git-Befehle
 - **Ein Repository initialisieren**
  ```bash
  git init
  ```
 - **Remote-Repository hinzufügen (GitHub, Gitea)**
  ```bash
  git remote add origin <repository-url>
  ```
 - **Ein Repository klonen**
  ```bash
  git clone <repository-url>
  ```
 - **Status prüfen**
  ```bash
  git status
  ```
 - **Änderungen hinzufügen**
  ```bash
  git add <dateiname>
  # Oder alle Änderungen hinzufügen
  git add .
  ```
 - **Änderungen committen**
  ```bash
  git commit -m "Commit-Nachricht"
  ```
 - **Zum Remote-Repository pushen**
  ```bash
  git push origin main
  ```
 - **Vom Remote-Repository pullen**
  ```bash
  git pull origin main
  ```
 #### Erweiterte Git-Befehle
 - **Einen neuen Branch erstellen**
  ```bash
  git branch <branch-name>
  ```
 - **Zwischen Branches wechseln**
  ```bash
  git checkout <branch-name>
  ```
 - **Branches zusammenführen**
  ```bash
  git merge <branch-name>
  ```
 - **Commit-Historie anzeigen**
  ```bash
  git log
  ```
 **Tipps:**
 - **Oft committen**: Regelmäßige Commits erleichtern das Nachverfolgen von Änderungen.
 - **Aussagekräftige Nachrichten**: Verwende beschreibende Commit-Nachrichten für besseres Verständnis.
 - **Verwende `.gitignore`**: Schließe Dateien und Verzeichnisse aus, die nicht verfolgt werden sollten (z. B. große Datendateien, virtuelle Umgebungen).
 ---
 ## Best Practices für Datenanalyseprojekte
 1. **Verwende virtuelle Umgebungen**
   - Nutze `venv`, `conda` oder `pyenv`, um projektspezifische Abhängigkeiten zu verwalten.
   - Dokumentiere Abhängigkeiten in `requirements.txt` oder verwende `poetry` für das Paketmanagement.
 2. **Dokumentiere deine Arbeit**
   - Pflege eine klare und informative `README.md`.
   - Verwende Docstrings und Kommentare in deinem Code.
   - Führe ein Changelog für bedeutende Updates.
 3. **Schreibe modularen Code**
   - Unterteile Code in Funktionen und Klassen.
   - Nutze Code wieder, um Duplikate zu vermeiden.
 4. **Befolge Codierungsstandards**
   - Halte dich an die PEP 8-Richtlinien für Python-Code.
   - Verwende Linter wie `flake8` oder Formatter wie `black` oder `ruff`, um die Codequalität zu gewährleisten.
 5. **Automatisiere die Datenverarbeitung**
   - Schreibe Skripte, um die Datenbereinigung und -vorverarbeitung zu automatisieren.
   - Stelle sicher, dass Skripte von Anfang bis Ende ausgeführt werden können, um Ergebnisse zu reproduzieren.
 6. **Teste deinen Code**
   - Implementiere Unit-Tests mit Frameworks wie `unittest` oder `pytest`.
   - Halte Tests im Verzeichnis `tests/`.
 7. **Gehe sorgfältig mit Daten um**
   - Committe keine Daten in die Versionskontrolle.
 8. **Versioniere Daten und Modelle**
   - Speichere Modellversionen mit Zeitstempeln oder eindeutigen Kennungen.
 9. **Sichere regelmäßig**
   - Pushe Änderungen häufig in ein Remote-Repository.
   - Erwäge zusätzliche Backups für kritische Daten.
 10. **Arbeite effektiv zusammen**
    - Verwende Branches für neue Funktionen oder Experimente.
    - Führe Änderungen mit Pull Requests und Code Reviews zusammen.
 ---
 ## Zusätzliche Ressourcen
 - **Git-Dokumentation**: [git-scm.com/docs](https://git-scm.com/docs)
 - **PEP 8 Style Guide**: [python.org/dev/peps/pep-0008](https://www.python.org/dev/peps/pep-0008/)
 - **Python Virtual Environments**:
  - [`venv` Modul](https://docs.python.org/3/library/venv.html)
  - [Anaconda Distribution](https://www.anaconda.com/products/distribution)
  - [`pyenv` Virtuelle Umgebungen](https://github.com/pyenv/pyenv)
 ---
 ## Fazit
 Die effektive Strukturierung deiner Datenanalyseprojekte ist der erste Schritt zu erfolgreicher und reproduzierbarer Forschung. Indem du Daten, Code und Abbildungen trennst, Versionskontrolle verwendest und bewährte Methoden befolgst, legst du ein starkes Fundament für deine Arbeit und die Zusammenarbeit mit anderen.
 Viel Spaß beim Programmieren!
--- a/code/README.md
+++ b/code/README.md
@@ -1,41 +1,301 @@
-# Structure of a script
+# Writing a Good Python Script: A Primer
-1. Initially you should specify which packages you use in the scripts
+This primer will guide you through best practices to write effective and clean
 Python scripts. Whether you're working on a data processing pipeline, a machine
 learning model, or a simple utility script, following these guidelines will
 help you create maintainable and readable code.
-~~~python
+## 1. Use a Declarative and Meaningful Script Name
 import pathlib      # Packages that are provided from python
 Choose a script name that clearly describes its purpose. This makes it easier
 for others (and yourself) to understand what the script does without reading
 the code.
 **Examples:**
 - `data_cleaning.py` instead of `script1.py`
 - `generate_report.py` instead of `run.py`
 ## 2. Start with a Short Explanation (Docstring)
 At the beginning of your script, include a docstring that briefly explains what the script does. This helps users quickly grasp the script's functionality.
 ```python
 """
 This script loads raw data, cleans it by removing null values and duplicates,
 and saves the processed data to a new file.
 """
 ```
 ## 3. Import All Required Packages at the Beginning
 List all your imports at the top of the script. This makes dependencies clear and simplifies maintenance.
 ```python
 import sys                # Packages that are provided by Python
 from pathlib import Path
 import numpy as np        # Third-party packages, listed in requirements.txt
 import pandas as pd
 import my_module          # Modules that are written by yourself
 ```
-import myscript     # Scripts from your Project/Pipeline
+## 4. Encapsulate Code in Functions and Classes
 ~~~
 Organize your code by wrapping functionality within functions or classes. This
 promotes code reuse, testing, and readability. Ideally, functions should do one
 thing and do it well. Classes can be used for more complex logic or when you need
 to maintain state. Clean functions and classes contain type hints and docstrings
 to explain their purpose and inputs/outputs.
-2. Next your code for the specific problem that you are trying to solve, all written code should be containded in a function/classes
+**Examples of Functions:**
   It should contain a main function with is calling all individual function to solve the problem.
-~~~python
+```python
-def load_data(path):
+def load_data(file_path: str) -> pd.DataFrame:
-  with open(path, "r") as f:
+    """Loads data from a CSV file.
    f.read()
  return f
-def main(path):
+    Parameters:
-  load_data(path)
+    ----------
-~~~
+    file_path : str
        Path to the CSV file.
-3. If the script is a standalone script, it can be run by calling python myscript.py it should contain...
+    Returns:
    -------
    pd.DataFrame
        Loaded data as a DataFrame.
    """
    return pd.read_csv(file_path)
-~~~python
+def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    """Cleans the DataFrame by removing null values and duplicates.
    Parameters:
    ----------
    df : pd.DataFrame
        Input DataFrame.
    Returns:
    -------
    pd.DataFrame
        Cleaned DataFrame.
    """
    df = df.dropna()
    df = df.drop_duplicates()
    return df
 def save_data(df: pd.DataFrame, output_path: str) -> None:
    """Saves the DataFrame to a CSV file.
    Parameters:
    ----------
    df : pd.DataFrame
        DataFrame to save.
    output_path : str
        Path to save the CSV file.
    """
    df.to_csv(output_path, index=False)
 ```
 **Example of a Class:**
 ```python
 class DataProcessor:
    """A class for processing data."""
    def __init__(self, file_path):
        self.data = self.load_data(file_path)
    def load_data(self, file_path):
        return pd.read_csv(file_path)
    def clean_data(self):
        self.data.dropna(inplace=True)
        self.data.drop_duplicates(inplace=True)
    def save_data(self, output_path):
        self.data.to_csv(output_path, index=False)
 ```
 ## 5. Define a `main()` Function
 Create a `main()` function that serves as the entry point of your script. This
 function should orchestrate the flow of your program.
 ```python
 def main():
    """Main function that orchestrates the data processing."""
    input_file = 'data/raw/data.csv'
    output_file = 'data/processed/clean_data.csv'
    # Using functions
    data = load_data(input_file)
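    # Use a new name: assigning to clean_data here would shadow the function
    # and raise UnboundLocalError.
    cleaned_data = clean_data(data)
    save_data(cleaned_data, output_file)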
    # Or using a class
    # processor = DataProcessor(input_file)
    # processor.clean_data()
    # processor.save_data(output_file)
    print("Data processing complete.")
 ```
 ## 6. Use the `if __name__ == "__main__":` Statement
 This common Python idiom checks whether the script is being run as the main
 program. It ensures that `main()` is only called when the script is executed
 directly. If you instead call `main()` unguarded at module level, it also runs
 whenever the module, or any part of it, is imported by another script.
 So at the end of your script, add:
 ```python
 if __name__ == "__main__":
-  path = "../data/README.md"
+    main()
-  main(path)
+```
 ~~~
-# Tips and tricks
+This checks if the script is being run as the main program and calls `main()` accordingly.
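To see the guard at work, the following sketch writes a tiny module to a temporary file and imports it. The module name `guarded_demo`, the `calls` list, and the helper `import_from_source` are made up for this illustration. The standard library's `importlib` is used only so the example fits in a single file.

```python
import importlib.util
import tempfile
from pathlib import Path

# Source of a small module whose main() is protected by the guard.
MODULE_SOURCE = '''
calls = []

def main():
    calls.append("main ran")

if __name__ == "__main__":
    main()
'''


def import_from_source(source: str, module_name: str):
    """Write source to a temporary file and import it under module_name."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / f"{module_name}.py"
        path.write_text(source)
        spec = importlib.util.spec_from_file_location(module_name, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)  # runs the module body, as `import` does
    return module


guarded = import_from_source(MODULE_SOURCE, "guarded_demo")

# On import, __name__ is the module name, so the guarded main() did not run.
print(guarded.__name__, guarded.calls)  # prints: guarded_demo []
```

Running the same source directly with `python guarded_demo.py` sets `__name__` to `"__main__"`, so `main()` is called.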
- Plotting scripts should be named the same as the output figure for easier backtracking
+
- Plotting scripts should start with plot, so that one can create a bash script for that executes all plot* scripts
+## Putting It All Together
- If you use a directory for managing specific task, in python it is called a module, you neeed a __init__.py file in the directory more in [packagehowto](https://whale.am28.uni-tuebingen.de/git/pweygoldt/packagehowto)
+
 Here's how your script might look when you combine all these best practices:
 ```python
 """
 This script loads raw data, cleans it by removing null values and duplicates, and saves the processed data to a new file.
 """
 import pandas as pd
 def load_data(file_path: str) -> pd.DataFrame:
    """Loads data from a CSV file.
    Parameters:
    ----------
    file_path : str
        Path to the CSV file.
    Returns:
    -------
    pd.DataFrame
        Loaded data as a DataFrame.
    """
    return pd.read_csv(file_path)
 def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    """Cleans the DataFrame by removing null values and duplicates.
    Parameters:
    ----------
    df : pd.DataFrame
        Input DataFrame.
    Returns:
    -------
    pd.DataFrame
        Cleaned DataFrame.
    """
    df = df.dropna()
    df = df.drop_duplicates()
    return df
 def save_data(df: pd.DataFrame, output_path: str) -> None:
    """Saves the DataFrame to a CSV file.
    Parameters:
    ----------
    df : pd.DataFrame
        DataFrame to save.
    output_path : str
        Path to save the CSV file.
    """
    df.to_csv(output_path, index=False)
 def main():
    """Main function that orchestrates the data processing."""
    input_file = 'data/raw/data.csv'
    output_file = 'data/processed/clean_data.csv'
    data = load_data(input_file)
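    # Use a new name: assigning to clean_data here would shadow the function
    # and raise UnboundLocalError.
    cleaned_data = clean_data(data)
    save_data(cleaned_data, output_file)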
    print("Data processing complete.")
 if __name__ == "__main__":
    main()
 ```
 ## Additional Tips
 - **Comment Your Code:** Use comments to explain non-obvious parts of your code. However, strive to write code that is self-explanatory.
 - **Follow PEP 8 Guidelines:** Adhere to the [PEP 8](https://www.python.org/dev/peps/pep-0008/) style guide for Python code to improve readability. To make this easy, use an auto-formatter like `black` or `ruff`.
 - **Use Meaningful Variable, Function, and Class Names:** Choose names that convey their purpose. Avoid single-letter variable names except for simple iterators. Instead of `x` and `y`, use names such as `time` and `signal`.
 - **Handle Exceptions:** Use try-except blocks to handle potential errors gracefully.
  ```python
  try:
      data = load_data(input_file)
  except FileNotFoundError:
      print(f"Error: The file {input_file} was not found.")
      sys.exit(1)
  ```
 - **Use Logging Instead of Print Statements:** For larger scripts, consider using the `logging` module for better control over logging levels and outputs.
  ```python
  import logging
  logging.basicConfig(level=logging.INFO)
  logging.info("Data processing complete.")
  ```
 - **Parameterize Your Scripts:** Use command-line arguments or a configuration file to make your script more flexible.
  ```python
  import argparse
  def parse_arguments():
      parser = argparse.ArgumentParser(description="Process and clean data.")
      parser.add_argument('--input', required=True, help='Input file path')
      parser.add_argument('--output', required=True, help='Output file path')
      return parser.parse_args()
  def main():
      args = parse_arguments()
      data = load_data(args.input)
      # A new name avoids shadowing the clean_data() function.
      cleaned_data = clean_data(data)
      save_data(cleaned_data, args.output)
  ```
 - **Make Your Code Modular:** Break down your script into multiple files or
   modules for better organization and reusability. For example, move data
   processing functions that are used in multiple scripts to a separate module
   called `data_processing.py`.
 - **Coding a Figure:** If you are coding a figure, you can follow our [coding
   a figure guide](https://github.com/bendalab/plottools/blob/master/docs/guide.md).
   Applying the same principles to your figure code will make it easier to
   modify and reuse.
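The exception handling, logging, and command-line tips above can be combined in one entry point. The following is a minimal sketch, not the guide's reference implementation. To stay free of third-party dependencies it cleans plain text lines instead of a pandas DataFrame. The helper `clean_lines`, the `argv` parameter, and the file names in the demo are assumptions made for illustration.

```python
import argparse
import logging
import tempfile
from pathlib import Path

logger = logging.getLogger(__name__)


def parse_arguments(argv=None) -> argparse.Namespace:
    """Parse --input and --output; argv=None falls back to sys.argv."""
    parser = argparse.ArgumentParser(description="Process and clean data.")
    parser.add_argument("--input", required=True, help="Input file path")
    parser.add_argument("--output", required=True, help="Output file path")
    return parser.parse_args(argv)


def clean_lines(lines: list[str]) -> list[str]:
    """Drop blank lines and repeated lines, keeping the first occurrence."""
    seen = set()
    cleaned = []
    for line in lines:
        if line.strip() and line not in seen:
            seen.add(line)
            cleaned.append(line)
    return cleaned


def main(argv=None) -> int:
    """Return 0 on success and 1 if the input file is missing."""
    logging.basicConfig(level=logging.INFO)
    args = parse_arguments(argv)
    try:
        lines = Path(args.input).read_text().splitlines()
    except FileNotFoundError:
        logger.error("The file %s was not found.", args.input)
        return 1
    cleaned = clean_lines(lines)
    Path(args.output).write_text("\n".join(cleaned) + "\n")
    logger.info("Wrote %d of %d lines to %s", len(cleaned), len(lines), args.output)
    return 0


# Demo with temporary files (hypothetical names) instead of real project data.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "raw.csv"
    target = Path(tmp) / "clean.csv"
    source.write_text("time,signal\n0,1\n\n0,1\n1,2\n")
    ok_code = main(["--input", str(source), "--output", str(target)])
    missing_code = main(["--input", str(Path(tmp) / "missing.csv"), "--output", str(target)])
    print(ok_code, missing_code)
```

Returning an exit code from `main()` rather than calling `sys.exit()` inside it keeps the function easy to call from tests. A script would typically end with `sys.exit(main())` inside the `if __name__ == "__main__":` guard.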
 ## Conclusion
 By following these best practices, you'll create Python scripts that are:
 - **Readable:** Clear structure and naming make your code easy to understand.
 - **Maintainable:** Encapsulation and modularity simplify updates and debugging.
 - **Reusable:** Functions and classes can be imported and used in other scripts.
 - **Robust:** Error handling ensures your script can handle unexpected situations gracefully.
 Remember, good coding practices not only make your life easier but also help
 others who may work with your code in the future. The effort you put into
 writing clean and effective scripts will pay off in the long run.
 Happy coding!
Author	SHA1	Message	Date
weygoldt	602d9bb422	Merge branch 'structure'	2024-10-18 14:08:09 +02:00
weygoldt	092ea67f40	added to code readme	2024-10-18 14:07:53 +02:00
weygoldt	9ab39d61c3	added german version	2024-10-17 16:31:53 +02:00