## Basic tool info

Model name: wiro/rag-chat-website
Model description: Instantly retrieve and analyze content from any website URL. Select your LLM model, fetch key information from the page, and generate context-aware responses with ease!
Model cover: https://cdn.wiro.ai/uploads/models/wiro-rag-chat-website-cover.webp

Model categories:
- llm
- persistent
- tool
- chat
- rag
- website

Model tags:
- conversational
- text-generation-inference
- gemma2-chat
- text generation
- wiro-chat
- question answer
- chat
- llama2-chat
- llama3-chat
- rag
- ask file
- ask document
- document chat
- pdf chat

Run Task Endpoint (POST): https://api.wiro.ai/v1/Run/wiro/rag-chat-website
Get Task Detail Endpoint (POST): https://api.wiro.ai/v1/Task/Detail

## Model Inputs:

- name: selectedModel
  label: select-model
  help: select-model-help
  type: select
  default: "617"
  options:
    - value: "728"
      label: meta-llama/Llama-3.2-3B-Instruct
      description: The Llama 3.2 collection of multilingual large language models (LLMs) is a collection of pretrained and instruction-tuned generative models in 1B and 3B sizes (text in/text out). The Llama 3.2 instruction-tuned text only models are optimized for multilingual dialogue use cases, including agentic retrieval and summarization tasks.
      triggerwords: []
      generatesettings: []
      modelfolderid: 136706
      image: https://cdn.wiro.ai/uploads/models/meta-llama-Llama-3.2-3B-Instruct-cover.jpg
      computingtime: 1 second
      readme:






Model Information


The Llama 3.2 collection of multilingual large language models (LLMs) is a collection of pretrained and instruction-tuned generative models in 1B and 3B sizes (text in/text out). The Llama 3.2 instruction-tuned text only models are optimized for multilingual dialogue use cases, including agentic retrieval and summarization tasks. They outperform many of the available open source and closed chat models on common industry benchmarks.


Model Developer: Meta


Model Architecture: Llama 3.2 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety.


































































| | Training Data | Params | Input modalities | Output modalities | Context Length | GQA | Shared Embeddings | Token count | Knowledge cutoff |
|---|---|---|---|---|---|---|---|---|---|
| Llama 3.2 (text only) | A new mix of publicly available online data. | 1B (1.23B) | Multilingual Text | Multilingual Text and code | 128k | Yes | Yes | Up to 9T tokens | December 2023 |
| | | 3B (3.21B) | Multilingual Text | Multilingual Text and code | | | | | |
| Llama 3.2 Quantized (text only) | A new mix of publicly available online data. | 1B (1.23B) | Multilingual Text | Multilingual Text and code | 8k | Yes | Yes | Up to 9T tokens | December 2023 |
| | | 3B (3.21B) | Multilingual Text | Multilingual Text and code | | | | | |


Supported Languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai are officially supported. Llama 3.2 has been trained on a broader collection of languages than these 8 supported languages. Developers may fine-tune Llama 3.2 models for languages beyond these supported languages, provided they comply with the Llama 3.2 Community License and the Acceptable Use Policy. Developers are always expected to ensure that their deployments, including those that involve additional languages, are completed safely and responsibly.


Llama 3.2 Model Family: Token counts refer to pretraining data only. All model versions use Grouped-Query Attention (GQA) for improved inference scalability.
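To make the scalability benefit of GQA concrete, the sketch below estimates the key/value cache size of a decoder with and without shared KV heads. The layer count, head counts, and head dimension are illustrative assumptions, not figures taken from this model card; substitute the values from the model's configuration file.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Estimate the KV-cache size for one sequence.

    Each layer stores one key and one value vector per KV head per token.
    bytes_per_value=2 corresponds to bf16/fp16 storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical configuration (assumed for illustration only).
NUM_LAYERS = 28
NUM_QUERY_HEADS = 24
NUM_KV_HEADS = 8      # GQA: several query heads share one KV head
HEAD_DIM = 128
SEQ_LEN = 128 * 1024  # the 128k context length listed in the table

# Without GQA, every query head would keep its own keys and values.
mha_bytes = kv_cache_bytes(NUM_LAYERS, NUM_QUERY_HEADS, HEAD_DIM, SEQ_LEN)
gqa_bytes = kv_cache_bytes(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, SEQ_LEN)

print(f"MHA cache: {mha_bytes / 2**30:.1f} GiB")
print(f"GQA cache: {gqa_bytes / 2**30:.1f} GiB")
print(f"Reduction: {mha_bytes / gqa_bytes:.1f}x")
```

Under these assumed numbers the cache shrinks in proportion to the ratio of query heads to KV heads, which is why GQA matters most at long context lengths.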


Model Release Date: Sept 25, 2024


Status: This is a static model trained on an offline dataset. Future versions may be released that improve model capabilities and safety.


License: Use of Llama 3.2 is governed by the Llama 3.2 Community License (a custom, commercial license agreement).


Feedback: Instructions on how to provide feedback or comments on the model can be found in the Llama Models README. For more technical information about generation parameters and recipes for how to use Llama 3.2 in applications, please go here.







Intended Use


Intended Use Cases: Llama 3.2 is intended for commercial and research use in multiple languages. Instruction tuned text only models are intended for assistant-like chat and agentic applications like knowledge retrieval and summarization, mobile AI powered writing assistants and query and prompt rewriting. Pretrained models can be adapted for a variety of additional natural language generation tasks. Similarly, quantized models can be adapted for a variety of on-device use-cases with limited compute resources.


Out of Scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by the Acceptable Use Policy and Llama 3.2 Community License. Use in languages beyond those explicitly referenced as supported in this model card.







How to use


This repository contains two versions of Llama-3.2-3B-Instruct, for use with transformers and with the original llama codebase.







Use with transformers


With transformers >= 4.43.0, you can run conversational inference using the Transformers pipeline abstraction or by leveraging the Auto classes with the generate() function.


Make sure to update your transformers installation via `pip install --upgrade transformers`.


    import torch
    from transformers import pipeline

    model_id = "meta-llama/Llama-3.2-3B-Instruct"

    # Build a text-generation pipeline with bfloat16 weights and automatic device placement.
    pipe = pipeline(
        "text-generation",
        model=model_id,
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )

    # Chat-style input: a system instruction followed by the user turn.
    messages = [
        {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
        {"role": "user", "content": "Who are you?"},
    ]
    outputs = pipe(
        messages,
        max_new_tokens=256,
    )
    # The last entry of generated_text is the assistant's reply.
    print(outputs[0]["generated_text"][-1])

Note: You can also find detailed recipes on how to use the model locally, with torch.compile(), assisted generations, quantised and more at huggingface-llama-recipes







Use with llama


Please follow the instructions in the repository.


To download the original checkpoints, see the example command below, which uses huggingface-cli:


    huggingface-cli download meta-llama/Llama-3.2-3B-Instruct --include "original/*" --local-dir Llama-3.2-3B-Instruct






Hardware and Software


Training Factors: We used custom training libraries, Meta's custom built GPU cluster, and production infrastructure for pretraining. Fine-tuning, quantization, annotation, and evaluation were also performed on production infrastructure.


Training Energy Use: Training utilized a cumulative 916k GPU hours of computation on H100-80GB (TDP of 700W) type hardware, per the table below. Training time is the total GPU time required for training each model and power consumption is the peak power capacity per GPU device used, adjusted for power usage efficiency.


Training Greenhouse Gas Emissions: Estimated total location-based greenhouse gas emissions were 240 tons CO2eq for training. Since 2020, Meta has maintained net zero greenhouse gas emissions in its global operations and matched 100% of its electricity use with renewable energy; therefore, the total market-based greenhouse gas emissions for training were 0 tons CO2eq.


| | Training Time (GPU hours) | Logit Generation Time (GPU Hours) | Training Power Consumption (W) | Training Location-Based Greenhouse Gas Emissions (tons CO2eq) | Training Market-Based Greenhouse Gas Emissions (tons CO2eq) |
|---|---|---|---|---|---|
| Llama 3.2 1B | 370k | - | 700 | 107 | 0 |
| Llama 3.2 3B | 460k | - | 700 | 133 | 0 |
| Llama 3.2 1B SpinQuant | 1.7 | 0 | 700 | Negligible** | 0 |
| Llama 3.2 3B SpinQuant | 2.4 | 0 | 700 | Negligible** | 0 |
| Llama 3.2 1B QLora | 1.3k | 0 | 700 | 0.381 | 0 |
| Llama 3.2 3B QLora | 1.6k | 0 | 700 | 0.461 | 0 |
| Total | 833k | 86k | | 240 | 0 |


** The location-based CO2e emissions of Llama 3.2 1B SpinQuant and Llama 3.2 3B SpinQuant are less than 0.001 metric tonnes each. This is due to the minimal training GPU hours that are required.


The methodology used to determine training energy use and greenhouse gas emissions can be found here. Since Meta is openly releasing these models, the training energy use and greenhouse gas emissions will not be incurred by others.
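As a rough cross-check of the figures above, the sketch below sums the per-model training hours and converts the quoted cumulative GPU hours into device energy at full TDP. It is an upper-bound estimate under the assumption that every GPU hour draws 700 W; the power-usage-efficiency adjustment mentioned in the text is not stated numerically, so it is left out.

```python
# Figures copied from the table and paragraph above ("k" values are rounded).
training_gpu_hours = {
    "Llama 3.2 1B": 370_000,
    "Llama 3.2 3B": 460_000,
    "Llama 3.2 1B SpinQuant": 1.7,
    "Llama 3.2 3B SpinQuant": 2.4,
    "Llama 3.2 1B QLora": 1_300,
    "Llama 3.2 3B QLora": 1_600,
}
CUMULATIVE_GPU_HOURS = 916_000  # cumulative figure quoted in the text
GPU_TDP_WATTS = 700             # H100-80GB TDP quoted in the text

total_training_hours = sum(training_gpu_hours.values())

# Device energy if every GPU hour ran at full TDP (no efficiency adjustment).
energy_mwh = CUMULATIVE_GPU_HOURS * GPU_TDP_WATTS / 1_000_000

print(f"Summed training time: {total_training_hours / 1000:.0f}k GPU hours")
print(f"Energy at TDP: {energy_mwh:.1f} MWh")
```

The summed training time reproduces the 833k total in the table; the energy figure is only an order-of-magnitude guide, since real power draw varies during training.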







Training Data


Overview: Llama 3.2 was pretrained on up to 9 trillion tokens of data from publicly available sources. For the 1B and 3B Llama 3.2 models, we incorporated logits from the Llama 3.1 8B and 70B models into the pretraining stage of the model development, where outputs (logits) from these larger models were used as token-level targets. Knowledge distillation was used after pruning to recover performance. In post-training we used a similar recipe as Llama 3.1 and produced final chat models by doing several rounds of alignment on top of the pre-trained model. Each round involved Supervised Fine-Tuning (SFT), Rejection Sampling (RS), and Direct Preference Optimization (DPO).
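The idea of using a larger model's logits as token-level targets can be sketched as a distillation loss for a single token position. The card does not say which loss was used; KL divergence between the teacher and student distributions is a common choice and is assumed here, and all logit values are made up.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities (numerically stable)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) for one token position."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )


teacher = [2.0, 1.0, 0.1]    # made-up teacher logits over a 3-token vocabulary
matching = [2.0, 1.0, 0.1]   # student that agrees with the teacher
diverging = [0.1, 1.0, 2.0]  # student that ranks the tokens the other way

print(distillation_loss(matching, teacher))
print(distillation_loss(diverging, teacher))
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, so minimizing it pulls the smaller model toward the larger model's per-token predictions.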


Data Freshness: The pretraining data has a cutoff of December 2023.







Quantization







Quantization Scheme


We designed the current quantization scheme with PyTorch’s ExecuTorch inference framework and Arm CPU backend in mind, taking into account metrics including model quality, prefill/decoding speed, and memory footprint. Our quantization scheme involves three parts:








Quantization-Aware Training and LoRA


The quantization-aware training (QAT) with low-rank adaptation (LoRA) models went through only post-training stages, using the same data as the full precision models. To initialize QAT, we utilize BF16 Llama 3.2 model checkpoints obtained after supervised fine-tuning (SFT) and perform an additional full round of SFT training with QAT. We then freeze the backbone of the QAT model and perform another round of SFT with LoRA adaptors applied to all layers within the transformer block. Meanwhile, the LoRA adaptors' weights and activations are maintained in BF16. Because our approach is similar to QLoRA of Dettmers et al. (2023) (i.e., quantization followed by LoRA adapters), we refer to this method as QLoRA. Finally, we fine-tune the resulting model (both backbone and LoRA adaptors) using direct preference optimization (DPO).
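The frozen-backbone-plus-adaptor step can be illustrated with a minimal LoRA forward pass. This is a generic sketch of low-rank adaptation on toy matrices, not Meta's training code: the backbone weight here is an ordinary float matrix rather than a quantized one, and the sizes, scaling factor, and values are all assumptions.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(x, frozen_weight, lora_a, lora_b, alpha, rank):
    """y = W x + (alpha / rank) * B (A x), with W frozen and A, B trainable."""
    base = matvec(frozen_weight, x)
    low_rank = matvec(lora_b, matvec(lora_a, x))
    scale = alpha / rank
    return [b + scale * d for b, d in zip(base, low_rank)]


# Toy sizes: 3 outputs, 4 inputs, rank 2.
W = [[0.5, -1.0, 0.25, 0.0],
     [1.0, 0.0, -0.5, 2.0],
     [0.0, 0.75, 1.0, -1.0]]
A = [[0.1, 0.2, 0.3, 0.4],
     [0.0, -0.1, 0.0, 0.1]]                        # rank x inputs
B_zero = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]      # outputs x rank, zero-initialized
B_trained = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # pretend values after training

x = [1.0, 2.0, 3.0, 4.0]
print(lora_forward(x, W, A, B_zero, alpha=4, rank=2))
print(lora_forward(x, W, A, B_trained, alpha=4, rank=2))
```

With B initialized to zero the adapted layer reproduces the frozen layer exactly, so training starts from the backbone's behavior and only the small A and B matrices need gradients.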







SpinQuant


SpinQuant was applied, together with generative post-training quantization (GPTQ). For the SpinQuant rotation matrix fine-tuning, we optimized for 100 iterations, using 800 samples with sequence-length 2048 from the WikiText 2 dataset. For GPTQ, we used 128 samples from the same dataset with the same sequence-length.
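The intuition behind rotation-based quantization can be shown with a toy example: multiplying activations by an orthogonal matrix and counter-rotating the weights leaves the layer output unchanged, while spreading an outlier across channels. SpinQuant learns its rotation matrices; the fixed 45-degree rotation and 2x2 sizes below are assumptions chosen purely for illustration.

```python
import math


def rotation(theta):
    """2x2 orthogonal rotation matrix."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def transpose(m):
    return [list(col) for col in zip(*m)]


def vecmat(v, m):
    """Row vector times matrix."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]


def matmul(a, b):
    return [vecmat(row, b) for row in a]


x = [10.0, 0.0]                 # activation with one outlier channel
W = [[0.5, -1.0], [2.0, 0.25]]  # toy weight matrix
R = rotation(math.pi / 4)

x_rot = vecmat(x, R)             # rotated activation
W_rot = matmul(transpose(R), W)  # counter-rotated weight

y_plain = vecmat(x, W)
y_rot = vecmat(x_rot, W_rot)

print("plain output:  ", y_plain)
print("rotated output:", y_rot)
print("max |x| before:", max(abs(v) for v in x))
print("max |x| after: ", max(abs(v) for v in x_rot))
```

Because the rotated activation has a smaller maximum magnitude, a fixed number of quantization levels covers it with a finer step, which is the effect the rotation is meant to achieve.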







Benchmarks - English Text


In this section, we report the results for Llama 3.2 models on standard automatic benchmarks. For all these evaluations, we used our internal evaluations library.







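Base Pretrained Models


| Category | Benchmark | # Shots | Metric | Llama 3.2 1B | Llama 3.2 3B | Llama 3.1 8B |
|---|---|---|---|---|---|---|
| General | MMLU | 5 | macro_avg/acc_char | 32.2 | 58 | 66.7 |
| | AGIEval English | 3-5 | average/acc_char | 23.3 | 39.2 | 47.8 |
| | ARC-Challenge | 25 | acc_char | 32.8 | 69.1 | 79.7 |
| Reading comprehension | SQuAD | 1 | em | 49.2 | 67.7 | 77 |
| | QuAC (F1) | 1 | f1 | 37.9 | 42.9 | 44.9 |
| | DROP (F1) | 3 | f1 | 28.0 | 45.2 | 59.5 |
| Long Context | Needle in Haystack | 0 | em | 96.8 | 1 | 1 |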
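The `em` and `f1` metrics in the table are typically computed per example as exact match and token-overlap F1. The sketch below shows the usual definitions with a simplified normalization (lowercasing and whitespace splitting); the results above come from an internal evaluations library, so its exact normalization rules are unknown and may differ.

```python
from collections import Counter


def normalize(text):
    """Lowercase and split on whitespace (a simplified normalization)."""
    return text.lower().split()


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall."""
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


prediction = "the Tower of London"
reference = "Tower of London"
print(exact_match(prediction, reference))
print(round(token_f1(prediction, reference), 3))
```

Here the extra leading word makes exact match fail while F1 still gives partial credit, which is why reading-comprehension benchmarks often report both.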







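Instruction Tuned Models


| Capability | Benchmark | # Shots | Metric | Llama 3.2 1B bf16 | Llama 3.2 1B Vanilla PTQ** | Llama 3.2 1B Spin Quant | Llama 3.2 1B QLoRA | Llama 3.2 3B bf16 | Llama 3.2 3B Vanilla PTQ** | Llama 3.2 3B Spin Quant | Llama 3.2 3B QLoRA | Llama 3.1 8B |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| General | MMLU | 5 | macro_avg/acc | 49.3 | 43.3 | 47.3 | 49.0 | 63.4 | 60.5 | 62 | 62.4 | 69.4 |
| Re-writing | Open-rewrite eval | 0 | micro_avg/rougeL | 41.6 | 39.2 | 40.9 | 41.2 | 40.1 | 40.3 | 40.8 | 40.7 | 40.9 |
| Summarization | TLDR9+ (test) | 1 | rougeL | 16.8 | 14.9 | 16.7 | 16.8 | 19.0 | 19.1 | 19.2 | 19.1 | 17.2 |
| Instruction following | IFEval | 0 | Avg(Prompt/Instruction acc Loose/Strict) | 59.5 | 51.5 | 58.4 | 55.6 | 77.4 | 73.9 | 73.5 | 75.9 | 80.4 |
| Math | GSM8K (CoT) | 8 | em_maj1@1 | 44.4 | 33.1 | 40.6 | 46.5 | 77.7 | 72.9 | 75.7 | 77.9 | 84.5 |
| | MATH (CoT) | 0 | final_em | 30.6 | 20.5 | 25.3 | 31.0 | 48.0 | 44.2 | 45.3 | 49.2 | 51.9 |
| Reasoning | ARC-C | 0 | acc | 59.4 | 54.3 | 57 | 60.7 | 78.6 | 75.6 | 77.6 | 77.6 | 83.4 |
| | GPQA | 0 | acc | 27.2 | 25.9 | 26.3 | 25.9 | 32.8 | 32.8 | 31.7 | 33.9 | 32.8 |
| | Hellaswag | 0 | acc | 41.2 | 38.1 | 41.3 | 41.5 | 69.8 | 66.3 | 68 | 66.3 | 78.7 |
| Tool Use | BFCL V2 | 0 | acc | 25.7 | 14.3 | 15.9 | 23.7 | 67.0 | 53.4 | 60.1 | 63.5 | 67.1 |
| | Nexus | 0 | macro_avg/acc | 13.5 | 5.2 | 9.6 | 12.5 | 34.3 | 32.4 | 31.5 | 30.1 | 38.5 |
| Long Context | InfiniteBench/En.QA | 0 | longbook_qa/f1 | 20.3 | N/A | N/A | N/A | 19.8 | N/A | N/A | N/A | 27.3 |
| | InfiniteBench/En.MC | 0 | longbook_choice/acc | 38.0 | N/A | N/A | N/A | 63.3 | N/A | N/A | N/A | 72.2 |
| | NIH/Multi-needle | 0 | recall | 75.0 | N/A | N/A | N/A | 84.7 | N/A | N/A | N/A | 98.8 |
| Multilingual | MGSM (CoT) | 0 | em | 24.5 | 13.7 | 18.2 | 24.4 | 58.2 | 48.9 | 54.3 | 56.8 | 68.9 |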


**for comparison purposes only. Model not released.
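One way to read the quantized columns is as a share of the bf16 score retained. The short sketch below does that arithmetic for a few rows copied from the table; the choice of rows and the `retention` helper are illustrative, not part of the original evaluation.

```python
# Scores copied from the table above: (bf16, Spin Quant, QLoRA).
scores = {
    ("1B", "MMLU"): (49.3, 47.3, 49.0),
    ("1B", "GSM8K (CoT)"): (44.4, 40.6, 46.5),
    ("3B", "MMLU"): (63.4, 62.0, 62.4),
    ("3B", "GSM8K (CoT)"): (77.7, 75.7, 77.9),
}


def retention(quantized, baseline):
    """Quantized score as a percentage of the bf16 score."""
    return 100.0 * quantized / baseline


for (size, benchmark), (bf16, spinquant, qlora) in scores.items():
    print(
        f"{size} {benchmark}: "
        f"SpinQuant {retention(spinquant, bf16):.1f}%, "
        f"QLoRA {retention(qlora, bf16):.1f}%"
    )
```

A value above 100% means the quantized variant scored higher than bf16 on that benchmark, as happens for QLoRA on GSM8K in both model sizes.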







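Multilingual Benchmarks


| Category | Benchmark | Language | Llama 3.2 1B | Llama 3.2 1B Vanilla PTQ** | Llama 3.2 1B Spin Quant | Llama 3.2 1B QLoRA | Llama 3.2 3B | Llama 3.2 3B Vanilla PTQ** | Llama 3.2 3B Spin Quant | Llama 3.2 3B QLoRA | Llama 3.1 8B |
|---|---|---|---|---|---|---|---|---|---|---|---|
| General | MMLU (5-shot, macro_avg/acc) | Portuguese | 39.8 | 34.9 | 38.9 | 40.2 | 54.5 | 50.9 | 53.3 | 53.4 | 62.1 |
| | | Spanish | 41.5 | 36.0 | 39.8 | 41.8 | 55.1 | 51.9 | 53.6 | 53.6 | 62.5 |
| | | Italian | 39.8 | 34.9 | 38.1 | 40.6 | 53.8 | 49.9 | 52.1 | 51.7 | 61.6 |
| | | German | 39.2 | 34.9 | 37.5 | 39.6 | 53.3 | 50.0 | 52.2 | 51.3 | 60.6 |
| | | French | 40.5 | 34.8 | 39.2 | 40.8 | 54.6 | 51.2 | 53.3 | 53.3 | 62.3 |
| | | Hindi | 33.5 | 30.0 | 32.1 | 34.0 | 43.3 | 40.4 | 42.0 | 42.1 | 50.9 |
| | | Thai | 34.7 | 31.2 | 32.4 | 34.9 | 44.5 | 41.3 | 44.0 | 42.2 | 50.3 |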


**for comparison purposes only. Model not released.







Inference time


In the table below, we compare the performance metrics of different quantization methods (SpinQuant and QAT + LoRA) with the BF16 baseline. The evaluation was done using the ExecuTorch framework as the inference engine, with the ARM CPU as a backend, on an Android OnePlus 12 device.


| Category | Decode (tokens/sec) | Time-to-first-token (sec) | Prefill (tokens/sec) | Model size (PTE file size in MB) | Memory size (RSS in MB) |
|---|---|---|---|---|---|
| 1B BF16 (baseline) | 19.2 | 1.0 | 60.3 | 2358 | 3,185 |
| 1B SpinQuant | 50.2 (2.6x) | 0.3 (-76.9%) | 260.5 (4.3x) | 1083 (-54.1%) | 1,921 (-39.7%) |
| 1B QLoRA | 45.8 (2.4x) | 0.3 (-76.0%) | 252.0 (4.2x) | 1127 (-52.2%) | 2,255 (-29.2%) |
| 3B BF16 (baseline) | 7.6 | 3.0 | 21.2 | 6129 | 7,419 |
| 3B SpinQuant | 19.7 (2.6x) | 0.7 (-76.4%) | 89.7 (4.2x) | 2435 (-60.3%) | 3,726 (-49.8%) |
| 3B QLoRA | 18.5 (2.4x) | 0.7 (-76.1%) | 88.8 (4.2x) | 2529 (-58.7%) | 4,060 (-45.3%) |


(*) The performance measurement is done using an adb binary-based approach.
(**) It is measured on an Android OnePlus 12 device.
(***) Time-to-first-token (TTFT) is measured with prompt length=64
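The bracketed ratios and percentages in the table can be recomputed from the raw columns, which is a useful sanity check when comparing against your own measurements. The sketch below does this for decode speed, model size, and memory; the helper names are illustrative.

```python
# Rows copied from the table above:
# (decode tokens/sec, model size in MB, memory RSS in MB)
rows = {
    "1B BF16": (19.2, 2358, 3185),
    "1B SpinQuant": (50.2, 1083, 1921),
    "1B QLoRA": (45.8, 1127, 2255),
    "3B BF16": (7.6, 6129, 7419),
    "3B SpinQuant": (19.7, 2435, 3726),
    "3B QLoRA": (18.5, 2529, 4060),
}


def speedup(value, baseline):
    """Throughput ratio, e.g. 2.6 means 2.6x the baseline rate."""
    return value / baseline


def percent_change(value, baseline):
    """Signed change relative to the baseline, in percent."""
    return 100.0 * (value - baseline) / baseline


for size in ("1B", "3B"):
    base_decode, base_size, base_rss = rows[f"{size} BF16"]
    for method in ("SpinQuant", "QLoRA"):
        decode, file_mb, rss_mb = rows[f"{size} {method}"]
        print(
            f"{size} {method}: decode {speedup(decode, base_decode):.1f}x, "
            f"size {percent_change(file_mb, base_size):.1f}%, "
            f"memory {percent_change(rss_mb, base_rss):.1f}%"
        )
```

The time-to-first-token percentages are left out on purpose: the table shows those values rounded to one decimal place, so the published reductions cannot be reproduced exactly from the displayed numbers.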










Responsibility & Safety


As part of our Responsible release approach, we followed a three-pronged strategy for managing trust & safety risks:



  1. Enable developers to deploy helpful, safe and flexible experiences for their target audience and for the use cases supported by Llama

  2. Protect developers against adversarial users aiming to exploit Llama capabilities to potentially cause harm

  3. Provide protections for the community to help prevent the misuse of our models







Responsible Deployment


Approach: Llama is a foundational technology designed to be used in a variety of use cases. Examples of how Meta’s Llama models have been responsibly deployed can be found in our Community Stories webpage. Our approach is to build the most helpful models, enabling the world to benefit from the technology’s power, by aligning our model safety for generic use cases and addressing a standard set of harms. Developers are then in the driver’s seat to tailor safety for their use cases, defining their own policies and deploying the models with the necessary safeguards in their Llama systems. Llama 3.2 was developed following the best practices outlined in our Responsible Use Guide.







Llama 3.2 Instruct


Objective: Our main objectives for conducting safety fine-tuning are to provide the research community with a valuable resource for studying the robustness of safety fine-tuning, as well as to offer developers a readily available, safe, and powerful model for various applications to reduce the developer workload to deploy safe AI systems. We implemented the same set of safety mitigations as in Llama 3, and you can learn more about these in the Llama 3 paper.


Fine-Tuning Data: We employ a multi-faceted approach to data collection, combining human-generated data from our vendors with synthetic data to mitigate potential safety risks. We’ve developed many large language model (LLM)-based classifiers that enable us to thoughtfully select high-quality prompts and responses, enhancing data quality control.


Refusals and Tone: Building on the work we started with Llama 3, we put a great emphasis on model refusals to benign prompts as well as refusal tone. We included both borderline and adversarial prompts in our safety data strategy, and modified our safety data responses to follow tone guidelines.







Llama 3.2 Systems


Safety as a System: Large language models, including Llama 3.2, are not designed to be deployed in isolation but instead should be deployed as part of an overall AI system with additional safety guardrails as required. Developers are expected to deploy system safeguards when building agentic systems. Safeguards are key to achieving the right helpfulness-safety alignment as well as mitigating safety and security risks inherent to the system and any integration of the model or system with external tools. As part of our responsible release approach, we provide the community with safeguards that developers should deploy with Llama models or other LLMs, including Llama Guard, Prompt Guard and Code Shield. All our reference implementation demos contain these safeguards by default so developers can benefit from system-level safety out-of-the-box.
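The system-level pattern described here, checking both the prompt and the response around the model call, can be sketched as a small wrapper. Everything in this example is hypothetical: the keyword check merely marks where a real classifier such as Llama Guard would be called, and `echo_model` stands in for the actual LLM.

```python
REFUSAL = "Sorry, I can't help with that request."

# Hypothetical stand-in for a real safety classifier such as Llama Guard.
# A keyword list is NOT an adequate safeguard; it only marks where the
# classifier call would go.
BLOCKED_TERMS = ("build a bomb", "steal credentials")


def is_unsafe(text):
    """Placeholder safety check: flag text containing a blocked term."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKED_TERMS)


def guarded_chat(user_prompt, generate):
    """Check the prompt, call the model, then check the response."""
    if is_unsafe(user_prompt):
        return REFUSAL
    response = generate(user_prompt)
    if is_unsafe(response):
        return REFUSAL
    return response


def echo_model(prompt):
    """Placeholder for the actual LLM call."""
    return f"You asked: {prompt}"


print(guarded_chat("What is the Tower of London?", echo_model))
print(guarded_chat("How do I steal credentials?", echo_model))
```

Passing the model in as a function keeps the guardrail independent of any particular inference backend, so the same wrapper could sit in front of a local pipeline or a hosted endpoint.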







New Capabilities and Use Cases


Technological Advancement: Llama releases usually introduce new capabilities that require specific considerations in addition to the best practices that generally apply across all Generative AI use cases. For prior release capabilities also supported by Llama 3.2, see Llama 3.1 Model Card, as the same considerations apply here as well.


Constrained Environments: Llama 3.2 1B and 3B models are expected to be deployed in highly constrained environments, such as mobile devices. LLM Systems using smaller models will have a different alignment profile and safety/helpfulness tradeoff than more complex, larger systems. Developers should ensure the safety of their system meets the requirements of their use case. We recommend using lighter system safeguards for such use cases, like Llama Guard 3-1B or its mobile-optimized version.







Evaluations


Scaled Evaluations: We built dedicated, adversarial evaluation datasets and evaluated systems composed of Llama models and Purple Llama safeguards to filter input prompts and output responses. It is important to evaluate applications in context, and we recommend building a dedicated evaluation dataset for your use case.


Red Teaming: We conducted recurring red teaming exercises with the goal of discovering risks via adversarial prompting and we used the learnings to improve our benchmarks and safety tuning datasets. We partnered early with subject-matter experts in critical risk areas to understand the nature of these real-world harms and how such models may lead to unintended harm for society. Based on these conversations, we derived a set of adversarial goals for the red team to attempt to achieve, such as extracting harmful information or reprogramming the model to act in a potentially harmful capacity. The red team consisted of experts in cybersecurity, adversarial machine learning, responsible AI, and integrity in addition to multilingual content specialists with background in integrity issues in specific geographic markets.







Critical Risks


In addition to our safety work above, we took extra care on measuring and/or mitigating the following critical risk areas:


1. CBRNE (Chemical, Biological, Radiological, Nuclear, and Explosive Weapons): Llama 3.2 1B and 3B models are smaller and less capable derivatives of Llama 3.1. For Llama 3.1 70B and 405B, to assess risks related to proliferation of chemical and biological weapons, we performed uplift testing designed to assess whether use of Llama 3.1 models could meaningfully increase the capabilities of malicious actors to plan or carry out attacks using these types of weapons and have determined that such testing also applies to the smaller 1B and 3B models.


2. Child Safety: Child Safety risk assessments were conducted using a team of experts, to assess the model’s capability to produce outputs that could result in Child Safety risks and inform on any necessary and appropriate risk mitigations via fine tuning. We leveraged those expert red teaming sessions to expand the coverage of our evaluation benchmarks through Llama 3 model development. For Llama 3, we conducted new in-depth sessions using objective based methodologies to assess the model risks along multiple attack vectors including the additional languages Llama 3 is trained on. We also partnered with content specialists to perform red teaming exercises assessing potentially violating content while taking account of market specific nuances or experiences.


3. Cyber Attacks: For Llama 3.1 405B, our cyber attack uplift study investigated whether LLMs can enhance human capabilities in hacking tasks, both in terms of skill level and speed.
Our attack automation study focused on evaluating the capabilities of LLMs when used as autonomous agents in cyber offensive operations, specifically in the context of ransomware attacks. This evaluation was distinct from previous studies that considered LLMs as interactive assistants. The primary objective was to assess whether these models could effectively function as independent agents in executing complex cyber-attacks without human intervention. Because Llama 3.2’s 1B and 3B models are smaller and less capable models than Llama 3.1 405B, we broadly believe that the testing conducted for the 405B model also applies to Llama 3.2 models.







Community


Industry Partnerships: Generative AI safety requires expertise and tooling, and we believe in the strength of the open community to accelerate its progress. We are active members of open consortiums, including the AI Alliance, Partnership on AI and MLCommons, actively contributing to safety standardization and transparency. We encourage the community to adopt taxonomies like the MLCommons Proof of Concept evaluation to facilitate collaboration and transparency on safety and content evaluations. Our Purple Llama tools are open sourced for the community to use and widely distributed across ecosystem partners including cloud service providers. We encourage community contributions to our Github repository.


Grants: We also set up the Llama Impact Grants program to identify and support the most compelling applications of Meta’s Llama model for societal benefit across three categories: education, climate and open innovation. The 20 finalists from the hundreds of applications can be found here.


Reporting: Finally, we put in place a set of resources including an output reporting mechanism and bug bounty program to continuously improve the Llama technology with the help of the community.







Ethical Considerations and Limitations


Values: The core values of Llama 3.2 are openness, inclusivity and helpfulness. It is meant to serve everyone, and to work for a wide range of use cases. It is thus designed to be accessible to people across many different backgrounds, experiences and perspectives. Llama 3.2 addresses users and their needs as they are, without inserting unnecessary judgment or normativity, while reflecting the understanding that even content that may appear problematic in some cases can serve valuable purposes in others. It respects the dignity and autonomy of all users, especially in terms of the values of free thought and expression that power innovation and progress.


Testing: Llama 3.2 is a new technology, and like any new technology, there are risks associated with its use. Testing conducted to date has not covered, nor could it cover, all scenarios. For these reasons, as with all LLMs, Llama 3.2’s potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate, biased or other objectionable responses to user prompts. Therefore, before deploying any applications of Llama 3.2 models, developers should perform safety testing and tuning tailored to their specific applications of the model. Please refer to available resources including our Responsible Use Guide, Trust and Safety solutions, and other resources to learn more about responsible development.


categories: ["model","llm","persistent","chat","checkpoint-folder","llama3-chat","bf16","rag"] filename: docker arguments: run --gpus device=~>GPU:DEVICENO<~ -t ~>DOCKERMOUNT<~ --rm harbor_llm_docker:latest python3 /wiro/llm/general_llm_chat.py --model_id="{{params_modelID}}" --prompt="{{params_prompt}}" --instructions="{{params_system_prompt}}" --temperature={{params_temperature}} --top_k={{params_top_k}} --top_p={{params_top_p}} --max_length={{params_max_tokens}} --min_length={{params_min_tokens}} --max_new_tokens={{params_max_new_tokens}} --min_new_tokens={{params_min_new_tokens}} --repetition_penalty={{params_repetition_penalty}} --length_penalty={{params_length_penalty}} --stop_strings="{{params_stop_sequences}}" --chat_id="{{params_session_id}}" --user_id="{{params_user_id}}" --model_path="~>FOLDER:136706<~" --model_type="llama3" --seed="{{params_seed}}" --chat --stream_output --cache_dir="~>FOLDER:123374<~" --output="~>FOLDEROUTPUT<~" --precision="{{params_precision}}" {{params_quantization}} {{params_do_sample}} parameters: [{"title":"","subtitle":"","items":[{"advanced":false,"type":"textarea","class":"is-12","required":true,"rows":"4","id":"prompt","placeholder":"prompt","label":"prompt","defaultvalue":"What are some interesting historical events that took place near the Tower of London, and how could they inspire a fictional story?","value":"What are some interesting historical events that took place near the Tower of London, and how could they inspire a fictional story?","minvalue":"","maxvalue":"","incrementby":"","optionsLoad":"","options":[],"note":"Prompt to send to the model.","quick":true},{"advanced":true,"type":"text","class":"is-12","required":false,"rows":"0","id":"user_id","placeholder":"user_id","label":"user_id","defaultvalue":"","value":"","minvalue":"","maxvalue":"","incrementby":"","optionsLoad":"","options":[],"note":"You can leave it blank. The user_id parameter is a unique identifier for the user. 
It is used to store and retrieve the chat history specific to that user. You should provide a value that uniquely identifies the user across different sessions. For example, it can be the user’s email address, username, or a system-generated ID.","quick":true},{"advanced":true,"type":"text","class":"is-12","required":false,"rows":"0","id":"session_id","placeholder":"session_id","label":"session_id","defaultvalue":"","value":"","minvalue":"","maxvalue":"","incrementby":"","optionsLoad":"","options":[],"note":"You can leave it blank. The session_id parameter represents a specific session for a user. It allows you to manage multiple sessions for the same user. If you want to maintain separate chat histories for different sessions of the same user, use a unique session_id for each session. If not specified or kept the same, the system will treat all interactions as part of the same session.","quick":true},{"advanced":true,"type":"textarea","class":"is-12","required":false,"rows":"4","id":"system_prompt","placeholder":"system_prompt","label":"system_prompt","defaultvalue":"You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. \nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.","value":"You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. 
\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.","minvalue":"","maxvalue":"","incrementby":"","optionsLoad":"","options":[],"note":"System prompt to send to the model. This is prepended to the prompt and helps guide system behavior.","quick":false},{"advanced":true,"type":"float","class":"is-12","required":false,"rows":"0","id":"temperature","placeholder":"temperature","label":"temperature","defaultvalue":"0.7","value":"0.7","minvalue":"0","maxvalue":"5","incrementby":"0.01","optionsLoad":"","options":[],"note":"Adjusts randomness of outputs, greater than 1 is random and 0 is deterministic, 0.75 is a good starting value.","quick":false},{"advanced":true,"type":"float","class":"is-12","required":false,"rows":"0","id":"top_p","placeholder":"top_p","label":"top_p","defaultvalue":"0.95","value":"0.95","minvalue":"0","maxvalue":"1","incrementby":"0.01","optionsLoad":"","options":[],"note":"When decoding text, samples from the top p percentage of most likely tokens; lower to ignore less likely tokens.","quick":false},{"advanced":true,"type":"number","class":"is-12","required":false,"rows":"0","id":"top_k","placeholder":"top_k","label":"top_k","defaultvalue":"0","value":"0","minvalue":"0","maxvalue":"100","incrementby":"1","optionsLoad":"","options":[],"note":"When decoding text, samples from the top k most likely tokens; lower to ignore less likely tokens.","quick":false},{"advanced":true,"type":"float","class":"is-12","required":false,"rows":"0","id":"repetition_penalty","placeholder":"repetition_penalty","label":"repetition_penalty","defaultvalue":"1.0","value":"1.0","minvalue":"0","maxvalue":"5","incrementby":"0.01","optionsLoad":"","options":[],"note":"It is a hyperparameter used to reduce the likelihood of the model generating repetitive text by applying a penalty to previously generated tokens, encouraging 
more diverse and coherent output.","quick":false},{"advanced":true,"type":"float","class":"is-12","required":false,"rows":"0","id":"length_penalty","placeholder":"length_penalty","label":"length_penalty","defaultvalue":"1","value":"1","minvalue":"0","maxvalue":"5","incrementby":"0.01","optionsLoad":"","options":[],"note":"A parameter that controls how long the outputs are. If < 1, the model will tend to generate shorter outputs, and > 1 will tend to generate longer outputs.","quick":false},{"advanced":true,"type":"number","class":"is-12","required":false,"rows":"0","id":"max_tokens","placeholder":"max_tokens","label":"max_tokens","defaultvalue":"0","value":"0","minvalue":"0","maxvalue":"800000","incrementby":"1","optionsLoad":"","options":[],"note":"Maximum number of tokens to generate. A word is generally 2-3 tokens.","quick":false},{"advanced":true,"type":"number","class":"is-12","required":false,"rows":"0","id":"min_tokens","placeholder":"min_tokens","label":"min_tokens","defaultvalue":"0","value":"0","minvalue":"0","maxvalue":"2048","incrementby":"1","optionsLoad":"","options":[],"note":"Minimum number of tokens to generate. To disable, set to -1. A word is generally 2-3 tokens.","quick":false},{"advanced":true,"type":"number","class":"is-12","required":false,"rows":"0","id":"max_new_tokens","placeholder":"max_new_tokens","label":"max_new_tokens","defaultvalue":"0","value":"0","minvalue":"0","maxvalue":"4096","incrementby":"1","optionsLoad":"","options":[],"note":"This parameter has been renamed to max_tokens. max_new_tokens only exists for backwards compatibility purposes. We recommend you use max_tokens instead. 
Both may not be specified.","quick":false},{"advanced":true,"type":"number","class":"is-12","required":false,"rows":"0","id":"min_new_tokens","placeholder":"min_new_tokens","label":"min_new_tokens","defaultvalue":"-1","value":"-1","minvalue":"-1","maxvalue":"2048","incrementby":"1","optionsLoad":"","options":[],"note":"This parameter has been renamed to min_tokens. min_new_tokens only exists for backwards compatibility purposes. We recommend you use min_tokens instead. Both may not be specified.","quick":false},{"advanced":true,"type":"text","class":"is-12","required":false,"rows":"0","id":"stop_sequences","placeholder":"stop_sequences","label":"stop_sequences","defaultvalue":"","value":"","minvalue":"","maxvalue":"","incrementby":"","optionsLoad":"","options":[],"note":"A semicolon-separated list of sequences to stop generation at. For example, ';' will stop generation at the first instance of 'end' or ''.","quick":false},{"advanced":true,"type":"text","class":"is-12","required":false,"rows":"0","id":"seed","placeholder":"seed","label":"seed","defaultvalue":"123456","value":"123456","minvalue":"0","maxvalue":"9999999","incrementby":"1","optionsLoad":"","options":[],"note":"seed-help"},{"advanced":true,"type":"checkbox","class":"is-12","required":false,"rows":"0","id":"quantization","placeholder":"quantization","label":"quantization","defaultvalue":"--quantization","value":"--quantization","minvalue":"","maxvalue":"","incrementby":"","optionsLoad":"","options":[],"note":"Quantization is a technique that reduces the precision of model weights (e.g., from FP32 to INT8) to decrease memory usage and improve inference speed. When enabled (true), the model uses less VRAM, making it suitable for resource-constrained environments, but might slightly affect output quality. 
When disabled (false), the model runs at full precision, ensuring maximum accuracy but requiring more GPU memory and running slower."},{"advanced":true,"type":"checkbox","class":"is-12","required":false,"rows":"0","id":"do_sample","placeholder":"do_sample","label":"do_sample","defaultvalue":"--do_sample","value":"--do_sample","minvalue":"","maxvalue":"","incrementby":"","optionsLoad":"","options":[],"note":"The do_sample parameter controls whether the model generates text deterministically or with randomness. For precise tasks like translations or code generation, set do_sample = false to ensure consistent and predictable outputs. For creative tasks like storytelling or poetry, set do_sample = true to allow the model to produce diverse and imaginative results."}]}] samples: ["https://cdn.wiro.ai/uploads/models/meta-llama-Llama-3.2-3B-Instruct-sample-1.txt"] time: 1737558684 marketplace: 1 trainallowed: 0 cps: 0.000000000000 ownercpt: 0.000000000000 onlymembers: 0 tags: ["text generation","transformers","safetensors","pytorch","8 languages","llama","facebook","meta","llama-3","conversational","text-generation-inference","inference endpoints","llama3.2"] averagepoint: 0.00 commentcount: 0 ratedusercount: 0 priority: 0 sourceurl: https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct sourcedownloadurl: https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct license: [] licenseupdate: 1777302336 cleanslugowner: meta-llama cleanslugproject: llama-3-2-3b-instruct noindex: 0 modifiedtime: 0 - value: "1525" label: Qwen/Qwen3-30B-A3B description: Qwen3-30B-A3B model. triggerwords: [] generatesettings: [] modelfolderid: 726161 image: https://cdn.wiro.ai/uploads/models/Qwen-Qwen3-30B-A3B-cover.png computingtime: 1 second readme: Qwen3-30B-A3B: part of the latest generation of Qwen large language models, this mixture-of-experts (MoE) model, with about 30B total parameters of which about 3B are activated per token, offers advanced reasoning, strong instruction-following, enhanced agent capabilities, and broad multilingual support. 
Built on extensive training, Qwen3-30B-A3B delivers high performance across diverse tasks, making it well-suited for complex problem-solving, content creation, and interactive AI applications. categories: ["model","llm","persistent","chat","checkpoint-folder","bf16","rag","qwen3","llm-reasoning"] filename: docker arguments: run --gpus device=~>GPU:DEVICENO<~ -t ~>DOCKERMOUNT<~ --rm harbor_llm_docker:latest python3 /wiro/llm/general_llm_chat.py --model_id="{{params_modelID}}" --prompt="{{params_prompt}}" --instructions="{{params_system_prompt}} You must always structure your output in the following format: Write your hidden reasoning and step-by-step thoughts here. Do not skip this block.Write only the final concise user-facing answer here.Never omit or reorder these tags. Always open and close both and tags correctly, even if the reasoning or the answer is very short." --temperature={{params_temperature}} --top_k={{params_top_k}} --top_p={{params_top_p}} --max_length={{params_max_tokens}} --min_length={{params_min_tokens}} --max_new_tokens={{params_max_new_tokens}} --min_new_tokens={{params_min_new_tokens}} --repetition_penalty={{params_repetition_penalty}} --length_penalty={{params_length_penalty}} --stop_strings="{{params_stop_sequences}}" --chat_id="{{params_session_id}}" --user_id="{{params_user_id}}" --model_path="~>FOLDER:726161<~" --model_type="qwen3-reasoning" --seed="{{params_seed}}" --chat --stream_output --cache_dir="~>FOLDER:123374<~" --output="~>FOLDEROUTPUT<~" --precision="{{params_precision}}" {{params_quantization}} {{params_do_sample}} parameters: [{"title":"","subtitle":"","items":[{"advanced":false,"type":"textarea","class":"is-12","required":true,"rows":"4","id":"prompt","placeholder":"prompt","label":"prompt","defaultvalue":"Can you say something interesting about the Eiffel Tower?","value":"Can you say something interesting about the Eiffel Tower?","minvalue":"","maxvalue":"","incrementby":"","optionsLoad":"","options":[],"note":"Prompt to send to 
the model.","quick":true},{"advanced":true,"type":"text","class":"is-12","required":false,"rows":"0","id":"user_id","placeholder":"User ID","label":"User ID","defaultvalue":"","value":"","minvalue":"","maxvalue":"","incrementby":"","optionsLoad":"","options":[],"note":"You can leave it blank. The user_id parameter is a unique identifier for the user. It is used to store and retrieve the chat history specific to that user. You should provide a value that uniquely identifies the user across different sessions. For example, it can be the user’s email address, username, or a system-generated ID.","quick":true},{"advanced":true,"type":"text","class":"is-12","required":false,"rows":"0","id":"session_id","placeholder":"Session ID","label":"Session ID","defaultvalue":"","value":"","minvalue":"","maxvalue":"","incrementby":"","optionsLoad":"","options":[],"note":"You can leave it blank. The session_id parameter represents a specific session for a user. It allows you to manage multiple sessions for the same user. If you want to maintain separate chat histories for different sessions of the same user, use a unique session_id for each session. If not specified or kept the same, the system will treat all interactions as part of the same session.","quick":true},{"advanced":true,"type":"textarea","class":"is-12","required":false,"rows":"4","id":"system_prompt","placeholder":"System Prompt","label":"System Prompt","defaultvalue":"You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. \nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. 
If you don't know the answer to a question, please don't share false information.","value":"You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. \nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.","minvalue":"","maxvalue":"","incrementby":"","optionsLoad":"","options":[],"note":"System prompt to send to the model. This is prepended to the prompt and helps guide system behavior.","quick":false},{"advanced":true,"type":"float","class":"is-12","required":false,"rows":"0","id":"temperature","placeholder":"Temperature","label":"Temperature","defaultvalue":"0.7","value":"0.7","minvalue":"0","maxvalue":"5","incrementby":"0.01","optionsLoad":"","options":[],"note":"Adjusts randomness of outputs, greater than 1 is random and 0 is deterministic, 0.75 is a good starting value.","quick":false},{"advanced":true,"type":"float","class":"is-12","required":false,"rows":"0","id":"top_p","placeholder":"Top P","label":"Top P","defaultvalue":"0.95","value":"0.95","minvalue":"0","maxvalue":"1","incrementby":"0.01","optionsLoad":"","options":[],"note":"When decoding text, samples from the top p percentage of most likely tokens; lower to ignore less likely tokens.","quick":false},{"advanced":true,"type":"number","class":"is-12","required":false,"rows":"0","id":"top_k","placeholder":"Top K","label":"Top K","defaultvalue":"0","value":"0","minvalue":"0","maxvalue":"100","incrementby":"1","optionsLoad":"","options":[],"note":"When decoding text, samples from the top k most likely tokens; lower to ignore less likely 
tokens.","quick":false},{"advanced":true,"type":"float","class":"is-12","required":false,"rows":"0","id":"repetition_penalty","placeholder":"Repetition Penalty","label":"Repetition Penalty","defaultvalue":"1.0","value":"1.0","minvalue":"0","maxvalue":"5","incrementby":"0.01","optionsLoad":"","options":[],"note":"It is a hyperparameter used to reduce the likelihood of the model generating repetitive text by applying a penalty to previously generated tokens, encouraging more diverse and coherent output.","quick":false},{"advanced":true,"type":"float","class":"is-12","required":false,"rows":"0","id":"length_penalty","placeholder":"Length Penalty","label":"Length Penalty","defaultvalue":"1","value":"1","minvalue":"0","maxvalue":"5","incrementby":"0.01","optionsLoad":"","options":[],"note":"A parameter that controls how long the outputs are. If < 1, the model will tend to generate shorter outputs, and > 1 will tend to generate longer outputs.","quick":false},{"advanced":true,"type":"number","class":"is-12","required":false,"rows":"0","id":"max_tokens","placeholder":"Max Tokens","label":"Max Tokens","defaultvalue":"0","value":"0","minvalue":"0","maxvalue":"800000","incrementby":"1","optionsLoad":"","options":[],"note":"Maximum number of tokens to generate. A word is generally 2-3 tokens.","quick":false},{"advanced":true,"type":"number","class":"is-12","required":false,"rows":"0","id":"min_tokens","placeholder":"Min Tokens","label":"Min Tokens","defaultvalue":"0","value":"0","minvalue":"0","maxvalue":"2048","incrementby":"1","optionsLoad":"","options":[],"note":"Minimum number of tokens to generate. To disable, set to -1. 
A word is generally 2-3 tokens.","quick":false},{"advanced":true,"type":"number","class":"is-12","required":false,"rows":"0","id":"max_new_tokens","placeholder":"Max New Tokens","label":"Max New Tokens","defaultvalue":"0","value":"0","minvalue":"0","maxvalue":"4096","incrementby":"1","optionsLoad":"","options":[],"note":"This parameter has been renamed to max_tokens. max_new_tokens only exists for backwards compatibility purposes. We recommend you use max_tokens instead. Both may not be specified.","quick":false},{"advanced":true,"type":"number","class":"is-12","required":false,"rows":"0","id":"min_new_tokens","placeholder":"Min New Tokens","label":"Min New Tokens","defaultvalue":"-1","value":"-1","minvalue":"-1","maxvalue":"2048","incrementby":"1","optionsLoad":"","options":[],"note":"This parameter has been renamed to min_tokens. min_new_tokens only exists for backwards compatibility purposes. We recommend you use min_tokens instead. Both may not be specified.","quick":false},{"advanced":true,"type":"text","class":"is-12","required":false,"rows":"0","id":"stop_sequences","placeholder":"Stop Sequences","label":"Stop Sequences","defaultvalue":"","value":"","minvalue":"","maxvalue":"","incrementby":"","optionsLoad":"","options":[],"note":"A semicolon-separated list of sequences to stop generation at. 
For example, ';' will stop generation at the first instance of 'end' or ''.","quick":false},{"advanced":true,"type":"text","class":"is-12","required":false,"rows":"0","id":"seed","placeholder":"Seed","label":"Seed","defaultvalue":"123456","value":"123456","minvalue":"0","maxvalue":"9999999","incrementby":"1","optionsLoad":"","options":[],"note":"seed-help"},{"advanced":true,"type":"checkbox","class":"is-12","required":false,"rows":"0","id":"quantization","placeholder":"Quantization","label":"Quantization","defaultvalue":"--quantization","value":"--quantization","minvalue":"","maxvalue":"","incrementby":"","optionsLoad":"","options":[],"note":"Quantization is a technique that reduces the precision of model weights (e.g., from FP32 to INT8) to decrease memory usage and improve inference speed. When enabled (true), the model uses less VRAM, making it suitable for resource-constrained environments, but might slightly affect output quality. When disabled (false), the model runs at full precision, ensuring maximum accuracy but requiring more GPU memory and running slower."},{"advanced":true,"type":"checkbox","class":"is-12","required":false,"rows":"0","id":"do_sample","placeholder":"Do Sample","label":"Do Sample","defaultvalue":"--do_sample","value":"--do_sample","minvalue":"","maxvalue":"","incrementby":"","optionsLoad":"","options":[],"note":"The do_sample parameter controls whether the model generates text deterministically or with randomness. For precise tasks like translations or code generation, set do_sample = false to ensure consistent and predictable outputs. 
For creative tasks like storytelling or poetry, set do_sample = true to allow the model to produce diverse and imaginative results."}]}] samples: ["https://cdn.wiro.ai/uploads/models/Qwen-Qwen3-30B-A3B-sample-1.txt"] time: 1754291196 marketplace: 1 trainallowed: 0 cps: 0.000000000000 ownercpt: 0.000000000000 onlymembers: 0 tags: ["text generation","transformers","safetensors","english","chat","conversational","text-generation-inference","inference endpoints","apache-2.0"] averagepoint: 0.00 commentcount: 0 ratedusercount: 0 priority: 0 sourceurl: git clone https://huggingface.co/Qwen/Qwen3-30B-A3B sourcedownloadurl: git clone https://huggingface.co/Qwen/Qwen3-30B-A3B license: [] licenseupdate: 1752017945 cleanslugowner: qwen cleanslugproject: qwen3-30b-a3b llm_thinking_regex: {"start": "","end": "","regex": "(?:)?(.*?)<\\/think>"} noindex: 0 modifiedtime: 0 - value: "740" label: deepseek-ai/DeepSeek-R1-Distill-Qwen-7B description: DeepSeek-R1-Distill-Qwen-7B is a distilled version of the DeepSeek-R1 model with 7 billion parameters, designed to provide high-quality language understanding while optimizing efficiency. Leveraging advanced knowledge distillation techniques, it retains the core capabilities of larger models with improved speed and lower resource consumption. This model is well-suited for tasks requiring robust natural language processing while maintaining cost-effective deployment. triggerwords: [] generatesettings: [] modelfolderid: 137702 image: https://cdn.wiro.ai/uploads/models/deepseek-ai-DeepSeek-R1-Distill-Qwen-7B-cover.jpg computingtime: 1 second readme:






DeepSeek-R1

1. Introduction


We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1.
DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrated remarkable performance on reasoning.
With RL, DeepSeek-R1-Zero naturally emerged with numerous powerful and interesting reasoning behaviors.
However, DeepSeek-R1-Zero encounters challenges such as endless repetition, poor readability, and language mixing. To address these issues and further enhance reasoning performance,
we introduce DeepSeek-R1, which incorporates cold-start data before RL.
DeepSeek-R1 achieves performance comparable to OpenAI-o1 across math, code, and reasoning tasks.
To support the research community, we have open-sourced DeepSeek-R1-Zero, DeepSeek-R1, and six dense models distilled from DeepSeek-R1 based on Llama and Qwen. DeepSeek-R1-Distill-Qwen-32B outperforms OpenAI-o1-mini across various benchmarks, achieving new state-of-the-art results for dense models.


NOTE: Before running DeepSeek-R1 series models locally, we kindly recommend reviewing the Usage Recommendations section.











2. Model Summary




Post-Training: Large-Scale Reinforcement Learning on the Base Model



  • We directly apply reinforcement learning (RL) to the base model without relying on supervised fine-tuning (SFT) as a preliminary step. This approach allows the model to explore chain-of-thought (CoT) for solving complex problems, resulting in the development of DeepSeek-R1-Zero. DeepSeek-R1-Zero demonstrates capabilities such as self-verification, reflection, and generating long CoTs, marking a significant milestone for the research community. Notably, it is the first open research to validate that reasoning capabilities of LLMs can be incentivized purely through RL, without the need for SFT. This breakthrough paves the way for future advancements in this area.



  • We introduce our pipeline to develop DeepSeek-R1. The pipeline incorporates two RL stages aimed at discovering improved reasoning patterns and aligning with human preferences, as well as two SFT stages that serve as the seed for the model's reasoning and non-reasoning capabilities.
    We believe the pipeline will benefit the industry by creating better models.






Distillation: Smaller Models Can Be Powerful Too



  • We demonstrate that the reasoning patterns of larger models can be distilled into smaller models, resulting in better performance compared to the reasoning patterns discovered through RL on small models. The open-source DeepSeek-R1, as well as its API, will help the research community distill better small models in the future.

  • Using the reasoning data generated by DeepSeek-R1, we fine-tuned several dense models that are widely used in the research community. The evaluation results demonstrate that the distilled smaller dense models perform exceptionally well on benchmarks. We open-source distilled 1.5B, 7B, 8B, 14B, 32B, and 70B checkpoints based on Qwen2.5 and Llama3 series to the community.







3. Model Downloads







DeepSeek-R1 Models

| Model | #Total Params | #Activated Params | Context Length | Download |
|---|---|---|---|---|
| DeepSeek-R1-Zero | 671B | 37B | 128K | 🤗 HuggingFace |
| DeepSeek-R1 | 671B | 37B | 128K | 🤗 HuggingFace |




DeepSeek-R1-Zero and DeepSeek-R1 are trained based on DeepSeek-V3-Base.
For more details regarding the model architecture, please refer to the DeepSeek-V3 repository.







DeepSeek-R1-Distill Models

| Model | Base Model | Download |
|---|---|---|
| DeepSeek-R1-Distill-Qwen-1.5B | Qwen2.5-Math-1.5B | 🤗 HuggingFace |
| DeepSeek-R1-Distill-Qwen-7B | Qwen2.5-Math-7B | 🤗 HuggingFace |
| DeepSeek-R1-Distill-Llama-8B | Llama-3.1-8B | 🤗 HuggingFace |
| DeepSeek-R1-Distill-Qwen-14B | Qwen2.5-14B | 🤗 HuggingFace |
| DeepSeek-R1-Distill-Qwen-32B | Qwen2.5-32B | 🤗 HuggingFace |
| DeepSeek-R1-Distill-Llama-70B | Llama-3.3-70B-Instruct | 🤗 HuggingFace |




DeepSeek-R1-Distill models are fine-tuned from open-source models, using samples generated by DeepSeek-R1.
We slightly changed their configs and tokenizers, so please use our settings to run these models.







4. Evaluation Results







DeepSeek-R1-Evaluation


For all our models, the maximum generation length is set to 32,768 tokens. For benchmarks requiring sampling, we use a temperature of 0.6, a top-p value of 0.95, and generate 64 responses per query to estimate pass@1.

| Category | Benchmark (Metric) | Claude-3.5-Sonnet-1022 | GPT-4o 0513 | DeepSeek V3 | OpenAI o1-mini | OpenAI o1-1217 | DeepSeek R1 |
|---|---|---|---|---|---|---|---|
| | Architecture | - | - | MoE | - | - | MoE |
| | # Activated Params | - | - | 37B | - | - | 37B |
| | # Total Params | - | - | 671B | - | - | 671B |
| English | MMLU (Pass@1) | 88.3 | 87.2 | 88.5 | 85.2 | 91.8 | 90.8 |
| | MMLU-Redux (EM) | 88.9 | 88.0 | 89.1 | 86.7 | - | 92.9 |
| | MMLU-Pro (EM) | 78.0 | 72.6 | 75.9 | 80.3 | - | 84.0 |
| | DROP (3-shot F1) | 88.3 | 83.7 | 91.6 | 83.9 | 90.2 | 92.2 |
| | IF-Eval (Prompt Strict) | 86.5 | 84.3 | 86.1 | 84.8 | - | 83.3 |
| | GPQA-Diamond (Pass@1) | 65.0 | 49.9 | 59.1 | 60.0 | 75.7 | 71.5 |
| | SimpleQA (Correct) | 28.4 | 38.2 | 24.9 | 7.0 | 47.0 | 30.1 |
| | FRAMES (Acc.) | 72.5 | 80.5 | 73.3 | 76.9 | - | 82.5 |
| | AlpacaEval2.0 (LC-winrate) | 52.0 | 51.1 | 70.0 | 57.8 | - | 87.6 |
| | ArenaHard (GPT-4-1106) | 85.2 | 80.4 | 85.5 | 92.0 | - | 92.3 |
| Code | LiveCodeBench (Pass@1-COT) | 33.8 | 34.2 | - | 53.8 | 63.4 | 65.9 |
| | Codeforces (Percentile) | 20.3 | 23.6 | 58.7 | 93.4 | 96.6 | 96.3 |
| | Codeforces (Rating) | 717 | 759 | 1134 | 1820 | 2061 | 2029 |
| | SWE Verified (Resolved) | 50.8 | 38.8 | 42.0 | 41.6 | 48.9 | 49.2 |
| | Aider-Polyglot (Acc.) | 45.3 | 16.0 | 49.6 | 32.9 | 61.7 | 53.3 |
| Math | AIME 2024 (Pass@1) | 16.0 | 9.3 | 39.2 | 63.6 | 79.2 | 79.8 |
| | MATH-500 (Pass@1) | 78.3 | 74.6 | 90.2 | 90.0 | 96.4 | 97.3 |
| | CNMO 2024 (Pass@1) | 13.1 | 10.8 | 43.2 | 67.6 | - | 78.8 |
| Chinese | CLUEWSC (EM) | 85.4 | 87.9 | 90.9 | 89.9 | - | 92.8 |
| | C-Eval (EM) | 76.7 | 76.0 | 86.5 | 68.9 | - | 91.8 |
| | C-SimpleQA (Correct) | 55.4 | 58.7 | 68.0 | 40.3 | - | 63.7 |
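
The evaluation setup states that pass@1 is estimated from 64 sampled responses per query. A minimal sketch of that estimate follows, assuming it is the plain average of per-sample correctness for each query, then averaged over queries. The function names and the grading flags are hypothetical and do not come from the DeepSeek evaluation code.

```python
from statistics import mean


def pass_at_1(correct_flags):
    """Estimate pass@1 for one query as the fraction of sampled responses that are correct."""
    if not correct_flags:
        raise ValueError("need at least one sampled response")
    return sum(1 for ok in correct_flags if ok) / len(correct_flags)


def benchmark_pass_at_1(per_query_flags):
    """Average the per-query estimates over all queries and report a percentage."""
    return 100.0 * mean(pass_at_1(flags) for flags in per_query_flags)


# Hypothetical grading results: 64 samples per query, True means "graded correct".
query_a = [True] * 48 + [False] * 16  # 48 of 64 correct
query_b = [True] * 16 + [False] * 48  # 16 of 64 correct
score = benchmark_pass_at_1([query_a, query_b])
print(f"pass@1 = {score:.1f}")
```

Averaging over many samples reduces the variance that a single sampled response at temperature 0.6 would introduce.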










Distilled Model Evaluation

| Model | AIME 2024 pass@1 | AIME 2024 cons@64 | MATH-500 pass@1 | GPQA Diamond pass@1 | LiveCodeBench pass@1 | CodeForces rating |
|---|---|---|---|---|---|---|
| GPT-4o-0513 | 9.3 | 13.4 | 74.6 | 49.9 | 32.9 | 759 |
| Claude-3.5-Sonnet-1022 | 16.0 | 26.7 | 78.3 | 65.0 | 38.9 | 717 |
| o1-mini | 63.6 | 80.0 | 90.0 | 60.0 | 53.8 | 1820 |
| QwQ-32B-Preview | 44.0 | 60.0 | 90.6 | 54.5 | 41.9 | 1316 |
| DeepSeek-R1-Distill-Qwen-1.5B | 28.9 | 52.7 | 83.9 | 33.8 | 16.9 | 954 |
| DeepSeek-R1-Distill-Qwen-7B | 55.5 | 83.3 | 92.8 | 49.1 | 37.6 | 1189 |
| DeepSeek-R1-Distill-Qwen-14B | 69.7 | 80.0 | 93.9 | 59.1 | 53.1 | 1481 |
| DeepSeek-R1-Distill-Qwen-32B | 72.6 | 83.3 | 94.3 | 62.1 | 57.2 | 1691 |
| DeepSeek-R1-Distill-Llama-8B | 50.4 | 80.0 | 89.1 | 49.0 | 39.6 | 1205 |
| DeepSeek-R1-Distill-Llama-70B | 70.0 | 86.7 | 94.5 | 65.2 | 57.5 | 1633 |
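
The cons@64 column is not defined in this README. The sketch below assumes it means consensus (majority-vote) accuracy over 64 sampled answers per query: the most frequent final answer is compared with the reference. All names and the sample data are hypothetical.

```python
from collections import Counter


def consensus_answer(answers):
    """Return the most frequent answer among the samples (ties go to the one seen first)."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    return Counter(answers).most_common(1)[0][0]


def cons_at_k(per_query_answers, references):
    """Percentage of queries whose majority-vote answer matches the reference answer."""
    if len(per_query_answers) != len(references):
        raise ValueError("one reference answer is required per query")
    hits = sum(
        consensus_answer(answers) == ref
        for answers, ref in zip(per_query_answers, references)
    )
    return 100.0 * hits / len(references)


# Hypothetical final answers extracted from 64 samples for each of two queries.
samples = [
    ["42"] * 40 + ["41"] * 24,  # majority answer "42" matches the reference
    ["7"] * 20 + ["8"] * 44,    # majority answer "8" does not match the reference "7"
]
cons_score = cons_at_k(samples, ["42", "7"])
print(f"cons@64 = {cons_score:.1f}")
```

Under this reading, cons@64 can exceed pass@1 whenever the correct answer is the most common one even though many individual samples are wrong.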










5. Chat Website & API Platform


You can chat with DeepSeek-R1 on DeepSeek's official website, chat.deepseek.com, by switching on the "DeepThink" button.


We also provide an OpenAI-compatible API on the DeepSeek Platform: platform.deepseek.com







6. How to Run Locally







DeepSeek-R1 Models


Please visit the DeepSeek-V3 repository for more information about running DeepSeek-R1 locally.







DeepSeek-R1-Distill Models


DeepSeek-R1-Distill models can be utilized in the same manner as Qwen or Llama models.


For instance, you can easily start a service using vLLM:


vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --tensor-parallel-size 2 --max-model-len 32768 --enforce-eager

You can also easily start a service using SGLang:


python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --trust-remote-code --tp 2
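
Once one of these servers is running, it can be queried over HTTP. The following is a minimal sketch that assumes an OpenAI-style `/v1/chat/completions` route and a base URL of `http://localhost:8000/v1`; the host and port are assumptions, so adjust them to wherever your server actually listens. The helper names are hypothetical, and only the Python standard library is used.

```python
import json
import urllib.request


def build_chat_request(prompt,
                       base_url="http://localhost:8000/v1",  # assumed address
                       model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"):
    """Build a POST request for an OpenAI-style chat completions route."""
    payload = {
        "model": model,
        # User message only: the README advises against a system prompt.
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,  # recommended value from this README
        "top_p": 0.95,       # value used in the README's evaluation setup
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, **kwargs):
    """Send the request and return the first choice's text (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Building the request does not touch the network; calling ask() does.
request = build_chat_request("How many primes are below 20?")
print(request.full_url)
```

Keeping request construction separate from sending makes the payload easy to inspect before any server is involved.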






Usage Recommendations


We recommend adhering to the following configurations when utilizing the DeepSeek-R1 series models, including benchmarking, to achieve the expected performance:



  1. Set the temperature within the range of 0.5-0.7 (0.6 is recommended) to prevent endless repetitions or incoherent outputs.

  2. Avoid adding a system prompt; all instructions should be contained within the user prompt.

  3. For mathematical problems, it is advisable to include a directive in your prompt such as: "Please reason step by step, and put your final answer within \boxed{}."

  4. When evaluating model performance, it is recommended to conduct multiple tests and average the results.
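
The recommendations above can be sketched as a small prompt-building helper: a user-only message list, the math directive appended when needed, a guard on the temperature range, and extraction of the final boxed answer. All function names are hypothetical illustrations, not part of any DeepSeek tooling.

```python
MATH_DIRECTIVE = "Please reason step by step, and put your final answer within \\boxed{}."


def build_messages(question, is_math=False):
    """Return a chat message list with no system message, as the README advises."""
    content = f"{question}\n{MATH_DIRECTIVE}" if is_math else question
    return [{"role": "user", "content": content}]


def check_temperature(temperature=0.6):
    """Reject temperatures outside the recommended 0.5-0.7 range."""
    if not 0.5 <= temperature <= 0.7:
        raise ValueError("temperature should be within 0.5-0.7 (0.6 is recommended)")
    return temperature


def extract_boxed(text):
    """Return the contents of the last boxed expression in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    collected = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:  # matching close of the boxed expression
                return "".join(collected)
        collected.append(ch)
    return None  # unbalanced braces


messages = build_messages("What is 1/4 + 1/4?", is_math=True)
answer = extract_boxed("Adding the fractions gives \\boxed{\\frac{1}{2}}.")
print(messages[0]["content"])
print(answer)
```

The brace counter is used instead of a simple pattern match so that nested LaTeX such as a fraction inside the box is returned whole.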







7. License


This code repository and the model weights are licensed under the MIT License.
The DeepSeek-R1 series supports commercial use and allows any modifications and derivative works, including, but not limited to, distillation for training other LLMs. Please note that:



  • DeepSeek-R1-Distill-Qwen-1.5B, DeepSeek-R1-Distill-Qwen-7B, DeepSeek-R1-Distill-Qwen-14B and DeepSeek-R1-Distill-Qwen-32B are derived from the Qwen-2.5 series, which is originally licensed under the Apache 2.0 License, and are now fine-tuned with 800k samples curated with DeepSeek-R1.

  • DeepSeek-R1-Distill-Llama-8B is derived from Llama3.1-8B-Base, which is originally licensed under the llama3.1 license.

  • DeepSeek-R1-Distill-Llama-70B is derived from Llama3.3-70B-Instruct, which is originally licensed under the llama3.3 license.







8. Citation


@misc{deepseekai2025deepseekr1incentivizingreasoningcapability,
title={DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning},
author={DeepSeek-AI and Daya Guo and Dejian Yang and Haowei Zhang and Junxiao Song and Ruoyu Zhang and Runxin Xu and Qihao Zhu and Shirong Ma and Peiyi Wang and Xiao Bi and Xiaokang Zhang and Xingkai Yu and Yu Wu and Z. F. Wu and Zhibin Gou and Zhihong Shao and Zhuoshu Li and Ziyi Gao and Aixin Liu and Bing Xue and Bingxuan Wang and Bochao Wu and Bei Feng and Chengda Lu and Chenggang Zhao and Chengqi Deng and Chenyu Zhang and Chong Ruan and Damai Dai and Deli Chen and Dongjie Ji and Erhang Li and Fangyun Lin and Fucong Dai and Fuli Luo and Guangbo Hao and Guanting Chen and Guowei Li and H. Zhang and Han Bao and Hanwei Xu and Haocheng Wang and Honghui Ding and Huajian Xin and Huazuo Gao and Hui Qu and Hui Li and Jianzhong Guo and Jiashi Li and Jiawei Wang and Jingchang Chen and Jingyang Yuan and Junjie Qiu and Junlong Li and J. L. Cai and Jiaqi Ni and Jian Liang and Jin Chen and Kai Dong and Kai Hu and Kaige Gao and Kang Guan and Kexin Huang and Kuai Yu and Lean Wang and Lecong Zhang and Liang Zhao and Litong Wang and Liyue Zhang and Lei Xu and Leyi Xia and Mingchuan Zhang and Minghua Zhang and Minghui Tang and Meng Li and Miaojun Wang and Mingming Li and Ning Tian and Panpan Huang and Peng Zhang and Qiancheng Wang and Qinyu Chen and Qiushi Du and Ruiqi Ge and Ruisong Zhang and Ruizhe Pan and Runji Wang and R. J. Chen and R. L. Jin and Ruyi Chen and Shanghao Lu and Shangyan Zhou and Shanhuang Chen and Shengfeng Ye and Shiyu Wang and Shuiping Yu and Shunfeng Zhou and Shuting Pan and S. S. Li and Shuang Zhou and Shaoqing Wu and Shengfeng Ye and Tao Yun and Tian Pei and Tianyu Sun and T. Wang and Wangding Zeng and Wanjia Zhao and Wen Liu and Wenfeng Liang and Wenjun Gao and Wenqin Yu and Wentao Zhang and W. L. Xiao and Wei An and Xiaodong Liu and Xiaohan Wang and Xiaokang Chen and Xiaotao Nie and Xin Cheng and Xin Liu and Xin Xie and Xingchao Liu and Xinyu Yang and Xinyuan Li and Xuecheng Su and Xuheng Lin and X. Q. 
Li and Xiangyue Jin and Xiaojin Shen and Xiaosha Chen and Xiaowen Sun and Xiaoxiang Wang and Xinnan Song and Xinyi Zhou and Xianzu Wang and Xinxia Shan and Y. K. Li and Y. Q. Wang and Y. X. Wei and Yang Zhang and Yanhong Xu and Yao Li and Yao Zhao and Yaofeng Sun and Yaohui Wang and Yi Yu and Yichao Zhang and Yifan Shi and Yiliang Xiong and Ying He and Yishi Piao and Yisong Wang and Yixuan Tan and Yiyang Ma and Yiyuan Liu and Yongqiang Guo and Yuan Ou and Yuduan Wang and Yue Gong and Yuheng Zou and Yujia He and Yunfan Xiong and Yuxiang Luo and Yuxiang You and Yuxuan Liu and Yuyang Zhou and Y. X. Zhu and Yanhong Xu and Yanping Huang and Yaohui Li and Yi Zheng and Yuchen Zhu and Yunxian Ma and Ying Tang and Yukun Zha and Yuting Yan and Z. Z. Ren and Zehui Ren and Zhangli Sha and Zhe Fu and Zhean Xu and Zhenda Xie and Zhengyan Zhang and Zhewen Hao and Zhicheng Ma and Zhigang Yan and Zhiyu Wu and Zihui Gu and Zijia Zhu and Zijun Liu and Zilin Li and Ziwei Xie and Ziyang Song and Zizheng Pan and Zhen Huang and Zhipeng Xu and Zhongyu Zhang and Zhen Zhang},
year={2025},
eprint={2501.12948},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2501.12948},
}






9. Contact


If you have any questions, please raise an issue or contact us at service@deepseek.com.


categories: ["model","llm","persistent","chat","checkpoint-folder","bf16","deepseek-chat","rag","llm-reasoning"] filename: docker arguments: run --gpus device=~>GPU:DEVICENO<~ -t ~>DOCKERMOUNT<~ --rm harbor_llm_docker:latest python3 /wiro/llm/general_llm_chat.py --model_id="{{params_modelID}}" --prompt="{{params_prompt}}" --instructions="{{params_system_prompt}} You must always structure your output in the following format: Write your hidden reasoning and step-by-step thoughts here. Do not skip this block.Write only the final concise user-facing answer here.Never omit or reorder these tags. Always open and close both and tags correctly, even if the reasoning or the answer is very short." --temperature={{params_temperature}} --top_k={{params_top_k}} --top_p={{params_top_p}} --max_length={{params_max_tokens}} --min_length={{params_min_tokens}} --max_new_tokens={{params_max_new_tokens}} --min_new_tokens={{params_min_new_tokens}} --repetition_penalty={{params_repetition_penalty}} --length_penalty={{params_length_penalty}} --stop_strings="{{params_stop_sequences}}" --chat_id="{{params_session_id}}" --user_id="{{params_user_id}}" --model_path="~>FOLDER:137702<~" --model_type="deepseek-qwen-reasoning" --seed="{{params_seed}}" --chat --stream_output --cache_dir="~>FOLDER:123374<~" --output="~>FOLDEROUTPUT<~" --precision="{{params_precision}}" {{params_quantization}} {{params_do_sample}} parameters: [{"title":"","subtitle":"","items":[{"advanced":false,"type":"textarea","class":"is-12","required":true,"rows":"4","id":"prompt","placeholder":"prompt","label":"prompt","defaultvalue":"What are some interesting historical events that took place near the Tower of London, and how could they inspire a fictional story?","value":"What are some interesting historical events that took place near the Tower of London, and how could they inspire a fictional story?","minvalue":"","maxvalue":"","incrementby":"","optionsLoad":"","options":[],"note":"Prompt to send to the 
model.","quick":true},{"advanced":true,"type":"text","class":"is-12","required":false,"rows":"0","id":"user_id","placeholder":"user_id","label":"user_id","defaultvalue":"","value":"","minvalue":"","maxvalue":"","incrementby":"","optionsLoad":"","options":[],"note":"You can leave it blank. The user_id parameter is a unique identifier for the user. It is used to store and retrieve the chat history specific to that user. You should provide a value that uniquely identifies the user across different sessions. For example, it can be the user’s email address, username, or a system-generated ID.","quick":true},{"advanced":true,"type":"text","class":"is-12","required":false,"rows":"0","id":"session_id","placeholder":"session_id","label":"session_id","defaultvalue":"","value":"","minvalue":"","maxvalue":"","incrementby":"","optionsLoad":"","options":[],"note":"You can leave it blank. The session_id parameter represents a specific session for a user. It allows you to manage multiple sessions for the same user. If you want to maintain separate chat histories for different sessions of the same user, use a unique session_id for each session. If not specified or kept the same, the system will treat all interactions as part of the same session.","quick":true},{"advanced":true,"type":"textarea","class":"is-12","required":false,"rows":"4","id":"system_prompt","placeholder":"system_prompt","label":"system_prompt","defaultvalue":"You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. \nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. 
If you don't know the answer to a question, please don't share false information.","value":"You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. \nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.","minvalue":"","maxvalue":"","incrementby":"","optionsLoad":"","options":[],"note":"System prompt to send to the model. This is prepended to the prompt and helps guide system behavior.","quick":false},{"advanced":true,"type":"float","class":"is-12","required":false,"rows":"0","id":"temperature","placeholder":"temperature","label":"temperature","defaultvalue":"0.7","value":"0.7","minvalue":"0","maxvalue":"5","incrementby":"0.01","optionsLoad":"","options":[],"note":"Adjusts randomness of outputs, greater than 1 is random and 0 is deterministic, 0.75 is a good starting value.","quick":false},{"advanced":true,"type":"float","class":"is-12","required":false,"rows":"0","id":"top_p","placeholder":"top_p","label":"top_p","defaultvalue":"0.95","value":"0.95","minvalue":"0","maxvalue":"1","incrementby":"0.01","optionsLoad":"","options":[],"note":"When decoding text, samples from the top p percentage of most likely tokens; lower to ignore less likely tokens.","quick":false},{"advanced":true,"type":"number","class":"is-12","required":false,"rows":"0","id":"top_k","placeholder":"top_k","label":"top_k","defaultvalue":"0","value":"0","minvalue":"0","maxvalue":"100","incrementby":"1","optionsLoad":"","options":[],"note":"When decoding text, samples from the top k most likely tokens; lower to ignore less likely 
tokens.","quick":false},{"advanced":true,"type":"float","class":"is-12","required":false,"rows":"0","id":"repetition_penalty","placeholder":"repetition_penalty","label":"repetition_penalty","defaultvalue":"1.0","value":"1.0","minvalue":"0","maxvalue":"5","incrementby":"0.01","optionsLoad":"","options":[],"note":"It is a hyperparameter used to reduce the likelihood of the model generating repetitive text by applying a penalty to previously generated tokens, encouraging more diverse and coherent output.","quick":false},{"advanced":true,"type":"float","class":"is-12","required":false,"rows":"0","id":"length_penalty","placeholder":"length_penalty","label":"length_penalty","defaultvalue":"1","value":"1","minvalue":"0","maxvalue":"5","incrementby":"0.01","optionsLoad":"","options":[],"note":"A parameter that controls how long the outputs are. If < 1, the model will tend to generate shorter outputs, and > 1 will tend to generate longer outputs.","quick":false},{"advanced":true,"type":"number","class":"is-12","required":false,"rows":"0","id":"max_tokens","placeholder":"max_tokens","label":"max_tokens","defaultvalue":"0","value":"0","minvalue":"0","maxvalue":"800000","incrementby":"1","optionsLoad":"","options":[],"note":"Maximum number of tokens to generate. A word is generally 2-3 tokens.","quick":false},{"advanced":true,"type":"number","class":"is-12","required":false,"rows":"0","id":"min_tokens","placeholder":"min_tokens","label":"min_tokens","defaultvalue":"0","value":"0","minvalue":"0","maxvalue":"2048","incrementby":"1","optionsLoad":"","options":[],"note":"Minimum number of tokens to generate. To disable, set to -1. 
A word is generally 2-3 tokens.","quick":false},{"advanced":true,"type":"number","class":"is-12","required":false,"rows":"0","id":"max_new_tokens","placeholder":"max_new_tokens","label":"max_new_tokens","defaultvalue":"0","value":"0","minvalue":"0","maxvalue":"4096","incrementby":"1","optionsLoad":"","options":[],"note":"This parameter has been renamed to max_tokens. max_new_tokens only exists for backwards compatibility purposes. We recommend you use max_tokens instead. Both may not be specified.","quick":false},{"advanced":true,"type":"number","class":"is-12","required":false,"rows":"0","id":"min_new_tokens","placeholder":"min_new_tokens","label":"min_new_tokens","defaultvalue":"-1","value":"-1","minvalue":"-1","maxvalue":"2048","incrementby":"1","optionsLoad":"","options":[],"note":"This parameter has been renamed to min_tokens. min_new_tokens only exists for backwards compatibility purposes. We recommend you use min_tokens instead. Both may not be specified.","quick":false},{"advanced":true,"type":"text","class":"is-12","required":false,"rows":"0","id":"stop_sequences","placeholder":"stop_sequences","label":"stop_sequences","defaultvalue":"","value":"","minvalue":"","maxvalue":"","incrementby":"","optionsLoad":"","options":[],"note":"A semicolon-separated list of sequences to stop generation at. 
For example, ';' will stop generation at the first instance of 'end' or ''.","quick":false},{"advanced":true,"type":"text","class":"is-12","required":false,"rows":"0","id":"seed","placeholder":"seed","label":"seed","defaultvalue":"123456","value":"123456","minvalue":"0","maxvalue":"9999999","incrementby":"1","optionsLoad":"","options":[],"note":"seed-help"},{"advanced":true,"type":"checkbox","class":"is-12","required":false,"rows":"0","id":"quantization","placeholder":"quantization","label":"quantization","defaultvalue":"--quantization","value":"--quantization","minvalue":"","maxvalue":"","incrementby":"","optionsLoad":"","options":[],"note":"Quantization is a technique that reduces the precision of model weights (e.g., from FP32 to INT8) to decrease memory usage and improve inference speed. When enabled (true), the model uses less VRAM, making it suitable for resource-constrained environments, but might slightly affect output quality. When disabled (false), the model runs at full precision, ensuring maximum accuracy but requiring more GPU memory and running slower."},{"advanced":true,"type":"checkbox","class":"is-12","required":false,"rows":"0","id":"do_sample","placeholder":"do_sample","label":"do_sample","defaultvalue":"--do_sample","value":"--do_sample","minvalue":"","maxvalue":"","incrementby":"","optionsLoad":"","options":[],"note":"The do_sample parameter controls whether the model generates text deterministically or with randomness. For precise tasks like translations or code generation, set do_sample = false to ensure consistent and predictable outputs. 
For creative tasks like storytelling or poetry, set do_sample = true to allow the model to produce diverse and imaginative results."}]}] time: 1737840724 marketplace: 1 trainallowed: 0 cps: 0.000000000000 ownercpt: 0.000000000000 onlymembers: 0 tags: ["text generation","transformers","safetensors","qwen2","conversational","text-generation-inference","inference endpoints","mit"] averagepoint: 0.00 commentcount: 0 ratedusercount: 0 priority: 0 sourceurl: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B sourcedownloadurl: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B license: [] licenseupdate: 1777648844 cleanslugowner: deepseek-ai cleanslugproject: deepseek-r1-distill-qwen-7b llm_thinking_regex: {"start": "","end": "","regex": "(?:)?(.*?)<\\/think>"} noindex: 0 modifiedtime: 0 - value: "1527" label: Qwen/Qwen3-30B-A3B-Thinking-2507 description: Qwen3-30B-A3B-Thinking-2507 model. triggerwords: [] generatesettings: [] modelfolderid: 699601 image: https://cdn.wiro.ai/uploads/models/Qwen-Qwen3-30B-A3B-Thinking-2507-cover.png computingtime: 1 second readme: Qwen3-30B-A3B-Thinking-2507 — An advanced version of the Qwen3-30B-A3B model, optimized for deeper and more accurate reasoning. It delivers significantly improved performance in logical reasoning, mathematics, science, coding, and academic benchmarks, while also enhancing general capabilities such as instruction following, tool usage, text generation, and alignment with human preferences. The model supports an extended 256K context window for superior long-context understanding. 
categories: ["model","llm","persistent","chat","checkpoint-folder","bf16","rag","qwen3","qwen3-reasoning","llm-reasoning"] filename: docker arguments: run --gpus device=~>GPU:DEVICENO<~ -t ~>DOCKERMOUNT<~ --rm harbor_llm_docker:latest python3 /wiro/llm/general_llm_chat.py --model_id="{{params_modelID}}" --prompt="{{params_prompt}}" --instructions="{{params_system_prompt}} You must always structure your output in the following format: Write your hidden reasoning and step-by-step thoughts here. Do not skip this block.Write only the final concise user-facing answer here.Never omit or reorder these tags. Always open and close both and tags correctly, even if the reasoning or the answer is very short." --temperature={{params_temperature}} --top_k={{params_top_k}} --top_p={{params_top_p}} --max_length={{params_max_tokens}} --min_length={{params_min_tokens}} --max_new_tokens={{params_max_new_tokens}} --min_new_tokens={{params_min_new_tokens}} --repetition_penalty={{params_repetition_penalty}} --length_penalty={{params_length_penalty}} --stop_strings="{{params_stop_sequences}}" --chat_id="{{params_session_id}}" --user_id="{{params_user_id}}" --model_path="~>FOLDER:699601<~" --model_type="qwen3-reasoning" --seed="{{params_seed}}" --chat --stream_output --cache_dir="~>FOLDER:123374<~" --output="~>FOLDEROUTPUT<~" --precision="{{params_precision}}" {{params_quantization}} {{params_do_sample}} parameters: [{"title":"","subtitle":"","items":[{"advanced":false,"type":"textarea","class":"is-12","required":true,"rows":"4","id":"prompt","placeholder":"prompt","label":"prompt","defaultvalue":"Can you say something interesting about Turkey?","value":"Can you say something interesting about Turkey?","minvalue":"","maxvalue":"","incrementby":"","optionsLoad":"","options":[],"note":"Prompt to send to the model.","quick":true},{"advanced":true,"type":"text","class":"is-12","required":false,"rows":"0","id":"user_id","placeholder":"User ID","label":"User 
ID","defaultvalue":"","value":"","minvalue":"","maxvalue":"","incrementby":"","optionsLoad":"","options":[],"note":"You can leave it blank. The user_id parameter is a unique identifier for the user. It is used to store and retrieve the chat history specific to that user. You should provide a value that uniquely identifies the user across different sessions. For example, it can be the user’s email address, username, or a system-generated ID.","quick":true},{"advanced":true,"type":"text","class":"is-12","required":false,"rows":"0","id":"session_id","placeholder":"Session ID","label":"Session ID","defaultvalue":"","value":"","minvalue":"","maxvalue":"","incrementby":"","optionsLoad":"","options":[],"note":"You can leave it blank. The session_id parameter represents a specific session for a user. It allows you to manage multiple sessions for the same user. If you want to maintain separate chat histories for different sessions of the same user, use a unique session_id for each session. If not specified or kept the same, the system will treat all interactions as part of the same session.","quick":true},{"advanced":true,"type":"textarea","class":"is-12","required":false,"rows":"4","id":"system_prompt","placeholder":"System Prompt","label":"System Prompt","defaultvalue":"You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. \nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.","value":"You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. 
Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. \nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.","minvalue":"","maxvalue":"","incrementby":"","optionsLoad":"","options":[],"note":"System prompt to send to the model. This is prepended to the prompt and helps guide system behavior.","quick":false},{"advanced":true,"type":"float","class":"is-12","required":false,"rows":"0","id":"temperature","placeholder":"Temperature","label":"Temperature","defaultvalue":"0.7","value":"0.7","minvalue":"0","maxvalue":"5","incrementby":"0.01","optionsLoad":"","options":[],"note":"Adjusts randomness of outputs, greater than 1 is random and 0 is deterministic, 0.75 is a good starting value.","quick":false},{"advanced":true,"type":"float","class":"is-12","required":false,"rows":"0","id":"top_p","placeholder":"Top P","label":"Top P","defaultvalue":"0.95","value":"0.95","minvalue":"0","maxvalue":"1","incrementby":"0.01","optionsLoad":"","options":[],"note":"When decoding text, samples from the top p percentage of most likely tokens; lower to ignore less likely tokens.","quick":false},{"advanced":true,"type":"number","class":"is-12","required":false,"rows":"0","id":"top_k","placeholder":"Top K","label":"Top K","defaultvalue":"0","value":"0","minvalue":"0","maxvalue":"100","incrementby":"1","optionsLoad":"","options":[],"note":"When decoding text, samples from the top k most likely tokens; lower to ignore less likely tokens.","quick":false},{"advanced":true,"type":"float","class":"is-12","required":false,"rows":"0","id":"repetition_penalty","placeholder":"Repetition Penalty","label":"Repetition 
Penalty","defaultvalue":"1.0","value":"1.0","minvalue":"0","maxvalue":"5","incrementby":"0.01","optionsLoad":"","options":[],"note":"It is a hyperparameter used to reduce the likelihood of the model generating repetitive text by applying a penalty to previously generated tokens, encouraging more diverse and coherent output.","quick":false},{"advanced":true,"type":"float","class":"is-12","required":false,"rows":"0","id":"length_penalty","placeholder":"Length Penalty","label":"Length Penalty","defaultvalue":"1","value":"1","minvalue":"0","maxvalue":"5","incrementby":"0.01","optionsLoad":"","options":[],"note":"A parameter that controls how long the outputs are. If < 1, the model will tend to generate shorter outputs, and > 1 will tend to generate longer outputs.","quick":false},{"advanced":true,"type":"number","class":"is-12","required":false,"rows":"0","id":"max_tokens","placeholder":"Max Tokens","label":"Max Tokens","defaultvalue":"0","value":"0","minvalue":"0","maxvalue":"800000","incrementby":"1","optionsLoad":"","options":[],"note":"Maximum number of tokens to generate. A word is generally 2-3 tokens.","quick":false},{"advanced":true,"type":"number","class":"is-12","required":false,"rows":"0","id":"min_tokens","placeholder":"Min Tokens","label":"Min Tokens","defaultvalue":"0","value":"0","minvalue":"0","maxvalue":"2048","incrementby":"1","optionsLoad":"","options":[],"note":"Minimum number of tokens to generate. To disable, set to -1. A word is generally 2-3 tokens.","quick":false},{"advanced":true,"type":"number","class":"is-12","required":false,"rows":"0","id":"max_new_tokens","placeholder":"Max New Tokens","label":"Max New Tokens","defaultvalue":"0","value":"0","minvalue":"0","maxvalue":"4096","incrementby":"1","optionsLoad":"","options":[],"note":"This parameter has been renamed to max_tokens. max_new_tokens only exists for backwards compatibility purposes. We recommend you use max_tokens instead. 
Both may not be specified.","quick":false},{"advanced":true,"type":"number","class":"is-12","required":false,"rows":"0","id":"min_new_tokens","placeholder":"Min New Tokens","label":"Min New Tokens","defaultvalue":"-1","value":"-1","minvalue":"-1","maxvalue":"2048","incrementby":"1","optionsLoad":"","options":[],"note":"This parameter has been renamed to min_tokens. min_new_tokens only exists for backwards compatibility purposes. We recommend you use min_tokens instead. Both may not be specified.","quick":false},{"advanced":true,"type":"text","class":"is-12","required":false,"rows":"0","id":"stop_sequences","placeholder":"Stop Sequences","label":"Stop Sequences","defaultvalue":"","value":"","minvalue":"","maxvalue":"","incrementby":"","optionsLoad":"","options":[],"note":"A semicolon-separated list of sequences to stop generation at. For example, ';' will stop generation at the first instance of 'end' or ''.","quick":false},{"advanced":true,"type":"text","class":"is-12","required":false,"rows":"0","id":"seed","placeholder":"Seed","label":"Seed","defaultvalue":"123456","value":"123456","minvalue":"0","maxvalue":"9999999","incrementby":"1","optionsLoad":"","options":[],"note":"seed-help"},{"advanced":true,"type":"checkbox","class":"is-12","required":false,"rows":"0","id":"quantization","placeholder":"Quantization","label":"Quantization","defaultvalue":"--quantization","value":"--quantization","minvalue":"","maxvalue":"","incrementby":"","optionsLoad":"","options":[],"note":"Quantization is a technique that reduces the precision of model weights (e.g., from FP32 to INT8) to decrease memory usage and improve inference speed. When enabled (true), the model uses less VRAM, making it suitable for resource-constrained environments, but might slightly affect output quality. 
When disabled (false), the model runs at full precision, ensuring maximum accuracy but requiring more GPU memory and running slower."},{"advanced":true,"type":"checkbox","class":"is-12","required":false,"rows":"0","id":"do_sample","placeholder":"Do Sample","label":"Do Sample","defaultvalue":"--do_sample","value":"--do_sample","minvalue":"","maxvalue":"","incrementby":"","optionsLoad":"","options":[],"note":"The do_sample parameter controls whether the model generates text deterministically or with randomness. For precise tasks like translations or code generation, set do_sample = false to ensure consistent and predictable outputs. For creative tasks like storytelling or poetry, set do_sample = true to allow the model to produce diverse and imaginative results."}]}] samples: ["https://cdn.wiro.ai/uploads/models/Qwen-Qwen3-30B-A3B-Thinking-2507-sample-1.txt"] time: 1754290535 marketplace: 1 trainallowed: 0 cps: 0.000000000000 ownercpt: 0.000000000000 onlymembers: 0 tags: ["text generation","transformers","safetensors","english","chat","conversational","text-generation-inference","inference endpoints","apache-2.0"] averagepoint: 0.00 commentcount: 0 ratedusercount: 0 priority: 0 sourceurl: https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507 sourcedownloadurl: https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507 license: [] licenseupdate: 1777456285 cleanslugowner: qwen cleanslugproject: qwen3-30b-a3b-thinking-2507 llm_thinking_regex: {"start": "","end": "","regex": "(?:)?(.*?)<\\/think>"} noindex: 0 modifiedtime: 0 - value: "686" label: microsoft/phi-4 description: PHI-4 by Microsoft is an advanced large language model designed for precise natural language understanding and generation. It offers robust performance in complex conversational AI and NLP tasks, making it well-suited for enterprise and research applications. 
triggerwords: [] generatesettings: [] modelfolderid: 133325 image: https://cdn.wiro.ai/uploads/models/microsoft-phi-4-cover.jpg computingtime: 1 second readme:






Phi-4 Model Card


Phi-4 Technical Report







Model Summary

Developers: Microsoft Research
Description: phi-4 is a state-of-the-art open model built upon a blend of synthetic datasets, data from filtered public domain websites, and acquired academic books and Q&A datasets. The goal of this approach was to ensure that small capable models were trained with data focused on high quality and advanced reasoning. phi-4 underwent a rigorous enhancement and alignment process, incorporating both supervised fine-tuning and direct preference optimization to ensure precise instruction adherence and robust safety measures.
Architecture: 14B parameters, dense decoder-only Transformer model
Inputs: Text, best suited for prompts in the chat format
Context length: 16K tokens
GPUs: 1920 H100-80G
Training time: 21 days
Training data: 9.8T tokens
Outputs: Generated text in response to input
Dates: October 2024 – November 2024
Status: Static model trained on an offline dataset with cutoff dates of June 2024 and earlier for publicly available data
Release date: December 12, 2024
License: MIT







Intended Use

Primary Use Cases: Our model is designed to accelerate research on language models, for use as a building block for generative AI powered features. It is intended for general-purpose AI systems and applications (primarily in English) that require:

1. Memory/compute constrained environments.
2. Latency bound scenarios.
3. Reasoning and logic.

Out-of-Scope Use Cases: Our models are not specifically designed or evaluated for all downstream purposes; thus:

1. Developers should consider common limitations of language models as they select use cases, and evaluate and mitigate for accuracy, safety, and fairness before using within a specific downstream use case, particularly for high-risk scenarios.
2. Developers should be aware of and adhere to applicable laws or regulations (including privacy, trade compliance laws, etc.) that are relevant to their use case, including the model’s focus on English.
3. Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the license the model is released under.







Data Overview







Training Datasets


Our training data is an extension of the data used for Phi-3 and includes a wide variety of sources from:



  1. Publicly available documents filtered rigorously for quality, selected high-quality educational data, and code.



  2. Newly created synthetic, “textbook-like” data for the purpose of teaching math, coding, common sense reasoning, general knowledge of the world (science, daily activities, theory of mind, etc.).



  3. Acquired academic books and Q&A datasets.



  4. High quality chat format supervised data covering various topics to reflect human preferences on different aspects such as instruct-following, truthfulness, honesty and helpfulness.




Multilingual data constitutes about 8% of our overall data. We are focusing on the quality of data that could potentially improve the reasoning ability for the model, and we filter the publicly available documents to contain the correct level of knowledge.







Benchmark datasets


We evaluated phi-4 using OpenAI’s SimpleEval and our own internal benchmarks to understand the model’s capabilities, more specifically:



  • MMLU: Popular aggregated dataset for multitask language understanding.



  • MATH: Challenging competition math problems.



  • GPQA: Complex, graduate-level science questions.



  • DROP: Complex comprehension and reasoning.



  • MGSM: Multi-lingual grade-school math.



  • HumanEval: Functional code generation.



  • SimpleQA: Factual responses.









Safety







Approach


phi-4 has adopted a robust safety post-training approach. This approach leverages a variety of both open-source and in-house generated synthetic datasets. The overall technique employed to do the safety alignment is a combination of SFT (Supervised Fine-Tuning) and iterative DPO (Direct Preference Optimization), including publicly available datasets focusing on helpfulness and harmlessness as well as various questions and answers targeted to multiple safety categories.







Safety Evaluation and Red-Teaming


Prior to release, phi-4 followed a multi-faceted evaluation approach. Quantitative evaluation was conducted with multiple open-source safety benchmarks and in-house tools utilizing adversarial conversation simulation. For qualitative safety evaluation, we collaborated with the independent AI Red Team (AIRT) at Microsoft to assess safety risks posed by phi-4 in both average and adversarial user scenarios. In the average user scenario, AIRT emulated typical single-turn and multi-turn interactions to identify potentially risky behaviors. The adversarial user scenario tested a wide range of techniques aimed at intentionally subverting the model’s safety training including jailbreaks, encoding-based attacks, multi-turn attacks, and adversarial suffix attacks.


Please refer to the technical report for more details on safety alignment.







Model Quality


To understand the capabilities, we compare phi-4 with a set of models over OpenAI’s SimpleEval benchmark.


The table below gives a high-level overview of model quality on representative benchmarks; higher numbers indicate better performance:

| Category | Benchmark | phi-4 (14B) | phi-3 (14B) | Qwen 2.5 (14B instruct) | GPT-4o-mini | Llama-3.3 (70B instruct) | Qwen 2.5 (72B instruct) | GPT-4o |
|---|---|---|---|---|---|---|---|---|
| Popular Aggregated Benchmark | MMLU | 84.8 | 77.9 | 79.9 | 81.8 | 86.3 | 85.3 | 88.1 |
| Science | GPQA | 56.1 | 31.2 | 42.9 | 40.9 | 49.1 | 49.0 | 50.6 |
| Math | MGSM | 80.6 | 53.5 | 79.6 | 86.5 | 89.1 | 87.3 | 90.4 |
| Math | MATH | 80.4 | 44.6 | 75.6 | 73.0 | 66.3* | 80.0 | 74.6 |
| Code Generation | HumanEval | 82.6 | 67.8 | 72.1 | 86.2 | 78.9* | 80.4 | 90.6 |
| Factual Knowledge | SimpleQA | 3.0 | 7.6 | 5.4 | 9.9 | 20.9 | 10.2 | 39.4 |
| Reasoning | DROP | 75.5 | 68.3 | 85.5 | 79.3 | 90.2 | 76.7 | 80.9 |


* These scores are lower than those reported by Meta, perhaps because simple-evals has a strict formatting requirement that Llama models have particular trouble following. We use the simple-evals framework because it is reproducible, but Meta reports 77 for MATH and 88 for HumanEval on Llama-3.3-70B.
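To make comparisons across the benchmark scores easier to reproduce, here is a small sketch that stores the reported numbers and answers two simple questions: on which benchmarks one model scores higher than another, and where a model ranks on a given benchmark. The helper names `leads_over` and `rank_of` are hypothetical and exist only for this illustration; the numbers are copied from the model card, and the two asterisked Llama-3.3 values are stored as plain floats.

```python
from typing import Dict, List

# Column order of the scores, as reported in the model card.
MODELS: List[str] = [
    "phi-4 (14B)", "phi-3 (14B)", "Qwen 2.5 (14B instruct)", "GPT-4o-mini",
    "Llama-3.3 (70B instruct)", "Qwen 2.5 (72B instruct)", "GPT-4o",
]

# Scores per benchmark, one value per model in MODELS order.
# Llama-3.3 MATH (66.3) and HumanEval (78.9) carry an asterisk in the source.
SCORES: Dict[str, List[float]] = {
    "MMLU": [84.8, 77.9, 79.9, 81.8, 86.3, 85.3, 88.1],
    "GPQA": [56.1, 31.2, 42.9, 40.9, 49.1, 49.0, 50.6],
    "MGSM": [80.6, 53.5, 79.6, 86.5, 89.1, 87.3, 90.4],
    "MATH": [80.4, 44.6, 75.6, 73.0, 66.3, 80.0, 74.6],
    "HumanEval": [82.6, 67.8, 72.1, 86.2, 78.9, 80.4, 90.6],
    "SimpleQA": [3.0, 7.6, 5.4, 9.9, 20.9, 10.2, 39.4],
    "DROP": [75.5, 68.3, 85.5, 79.3, 90.2, 76.7, 80.9],
}


def leads_over(model: str, other: str) -> List[str]:
    """Benchmarks (in table order) where `model` scores strictly higher than `other`."""
    i, j = MODELS.index(model), MODELS.index(other)
    return [name for name, row in SCORES.items() if row[i] > row[j]]


def rank_of(model: str, benchmark: str) -> int:
    """1-based rank of `model` on `benchmark`; rank 1 is the highest score."""
    row = SCORES[benchmark]
    own = row[MODELS.index(model)]
    return 1 + sum(1 for value in row if value > own)


if __name__ == "__main__":
    print(leads_over("phi-4 (14B)", "GPT-4o-mini"))
    print(rank_of("phi-4 (14B)", "GPQA"))
```

Keeping the scores in one dictionary keyed by benchmark mirrors the table layout, so a transcription error is easy to spot against the source.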







Usage







Input Formats


Given the nature of the training data, phi-4 is best suited for prompts using the chat format as follows:


<|im_start|>system<|im_sep|>
You are a medieval knight and must provide explanations to modern people.<|im_end|>
<|im_start|>user<|im_sep|>
How should I explain the Internet?<|im_end|>
<|im_start|>assistant<|im_sep|>
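
The chat layout above can be assembled programmatically. The following is a minimal sketch, assuming the delimiters and line breaks appear exactly as printed in the example; `build_phi4_prompt` is a hypothetical helper, not part of any library, and in practice the tokenizer's own chat template is the authoritative source for whitespace handling.

```python
from typing import Dict, List

# Special tokens copied from the chat-format example above.
IM_START = "<|im_start|>"
IM_SEP = "<|im_sep|>"
IM_END = "<|im_end|>"


def build_phi4_prompt(messages: List[Dict[str, str]]) -> str:
    """Render role/content messages in the layout shown in the example.

    Assumption: each turn is "<|im_start|>{role}<|im_sep|>", a newline,
    the content, then "<|im_end|>" and a newline, as printed above.
    """
    parts = []
    for message in messages:
        parts.append(
            f"{IM_START}{message['role']}{IM_SEP}\n"
            f"{message['content']}{IM_END}\n"
        )
    # Leave an open assistant turn so generation continues from here.
    parts.append(f"{IM_START}assistant{IM_SEP}")
    return "".join(parts)


example_prompt = build_phi4_prompt([
    {"role": "system", "content": "You are a medieval knight and must provide explanations to modern people."},
    {"role": "user", "content": "How should I explain the Internet?"},
])

if __name__ == "__main__":
    print(example_prompt)
```

When using a library that applies the chat template for you, pass the list of role/content messages directly and skip manual string assembly.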






With transformers


import transformers

pipeline = transformers.pipeline(
    "text-generation",
    model="microsoft/phi-4",
    model_kwargs={"torch_dtype": "auto"},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a medieval knight and must provide explanations to modern people."},
    {"role": "user", "content": "How should I explain the Internet?"},
]

outputs = pipeline(messages, max_new_tokens=128)
print(outputs[0]["generated_text"][-1])
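
The final `print` in the snippet above shows the whole last message rather than just its text. As a minimal sketch, assuming the pipeline returns the conversation as a list of role/content dictionaries under `generated_text` (which is what the indexing above implies), a small helper can pull out only the reply. `last_assistant_reply` is a hypothetical name, and the sample data below is mocked so the block runs without downloading the model.

```python
from typing import Any, Dict, List


def last_assistant_reply(outputs: List[Dict[str, Any]]) -> str:
    """Return the text of the final assistant message from chat-style pipeline output."""
    conversation = outputs[0]["generated_text"]
    last_message = conversation[-1]
    if last_message.get("role") != "assistant":
        raise ValueError("The last message is not an assistant reply.")
    return last_message["content"]


# Mocked output with the assumed shape; a real run would come from pipeline(...).
mock_outputs = [
    {
        "generated_text": [
            {"role": "system", "content": "You are a medieval knight."},
            {"role": "user", "content": "How should I explain the Internet?"},
            {"role": "assistant", "content": "Imagine a vast library, good traveler."},
        ]
    }
]

if __name__ == "__main__":
    print(last_assistant_reply(mock_outputs))
```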






Responsible AI Considerations


Like other language models, phi-4 can potentially behave in ways that are unfair, unreliable, or offensive. Some of the limiting behaviors to be aware of include:



  • Quality of Service: The model is trained primarily on English text. Languages other than English will experience worse performance. English language varieties with less representation in the training data might experience worse performance than standard American English. phi-4 is not intended to support multilingual use.



  • Representation of Harms & Perpetuation of Stereotypes: These models can over- or under-represent groups of people, erase representation of some groups, or reinforce demeaning or negative stereotypes. Despite safety post-training, these limitations may still be present due to differing levels of representation of different groups or prevalence of examples of negative stereotypes in training data that reflect real-world patterns and societal biases.



  • Inappropriate or Offensive Content: These models may produce other types of inappropriate or offensive content, which may make it inappropriate to deploy for sensitive contexts without additional mitigations that are specific to the use case.



  • Information Reliability: Language models can generate nonsensical content or fabricate content that might sound reasonable but is inaccurate or outdated.



  • Limited Scope for Code: The majority of phi-4 training data is based on Python and uses common packages such as typing, math, random, collections, datetime, and itertools. If the model generates Python scripts that utilize other packages, or scripts in other languages, we strongly recommend that users manually verify all API uses.




Developers should apply responsible AI best practices and are responsible for ensuring that a specific use case complies with relevant laws and regulations (e.g. privacy, trade, etc.). Using safety services like Azure AI Content Safety that have advanced guardrails is highly recommended. Important areas for consideration include:



  • Allocation: Models may not be suitable for scenarios that could have consequential impact on legal status or the allocation of resources or life opportunities (e.g., housing, employment, credit) without further assessments and additional debiasing techniques.



  • High-Risk Scenarios: Developers should assess the suitability of using models in high-risk scenarios where unfair, unreliable, or offensive outputs might be extremely costly or lead to harm. This includes providing advice in sensitive or expert domains where accuracy and reliability are critical (e.g., legal or health advice). Additional safeguards should be implemented at the application level according to the deployment context.



  • Misinformation: Models may produce inaccurate information. Developers should follow transparency best practices and inform end-users they are interacting with an AI system. At the application level, developers can build feedback mechanisms and pipelines to ground responses in use-case specific, contextual information, a technique known as Retrieval Augmented Generation (RAG).



  • Generation of Harmful Content: Developers should assess outputs for their context and use available safety classifiers or custom solutions appropriate for their use case.



  • Misuse: Other forms of misuse such as fraud, spam, or malware production may be possible, and developers should ensure that their applications do not violate applicable laws and regulations.
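
The Misinformation item above mentions grounding responses in contextual information via Retrieval Augmented Generation (RAG). As a rough illustration only, the sketch below picks the text chunk that shares the most words with a question and places it in the prompt. The word-overlap scoring, the function names, and the prompt wording are all assumptions made for this example; they are not the retrieval method of any particular service, and real systems typically use embedding-based search.

```python
import re
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_chunks(question: str, chunks: List[str], k: int = 1) -> List[str]:
    """Return the k chunks sharing the most words with the question (toy scoring)."""
    question_words = tokenize(question)
    ranked = sorted(
        chunks,
        key=lambda chunk: len(question_words & tokenize(chunk)),
        reverse=True,
    )
    return ranked[:k]


def grounded_prompt(question: str, chunks: List[str], k: int = 1) -> str:
    """Build a prompt that asks the model to answer from retrieved context only."""
    context = "\n".join(f"- {chunk}" for chunk in top_chunks(question, chunks, k))
    return (
        "Answer using only the context below. "
        "If the context is insufficient, say so.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )


page_chunks = [
    "phi-4 has a context length of 16K tokens.",
    "phi-4 was released on December 12, 2024.",
    "The Tower of London is in England.",
]

if __name__ == "__main__":
    print(grounded_prompt("What is the context length of phi-4?", page_chunks))
```

Plain word overlap counts common words such as "the" and "of", so it can rank unrelated chunks surprisingly high; that weakness is one reason production retrieval relies on better similarity measures.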




categories: ["model","llm","persistent","chat","checkpoint-folder","phi4-chat","bf16","rag"] filename: docker arguments: run --gpus device=~>GPU:DEVICENO<~ -t ~>DOCKERMOUNT<~ --rm harbor_llm_docker:latest python3 /wiro/llm/general_llm_chat.py --model_id="{{params_modelID}}" --prompt="{{params_prompt}}" --instructions="{{params_system_prompt}}" --temperature={{params_temperature}} --top_k={{params_top_k}} --top_p={{params_top_p}} --max_length={{params_max_tokens}} --min_length={{params_min_tokens}} --max_new_tokens={{params_max_new_tokens}} --min_new_tokens={{params_min_new_tokens}} --repetition_penalty={{params_repetition_penalty}} --length_penalty={{params_length_penalty}} --stop_strings="{{params_stop_sequences}}" --chat_id="{{params_session_id}}" --user_id="{{params_user_id}}" --model_path="~>FOLDER:133325<~" --model_type="phi4" --seed="{{params_seed}}" --chat --stream_output --cache_dir="~>FOLDER:123374<~" --output="~>FOLDEROUTPUT<~" --precision="{{params_precision}}" {{params_quantization}} {{params_do_sample}} parameters: [{"title":"","subtitle":"","items":[{"advanced":false,"type":"textarea","class":"is-12","required":true,"rows":"4","id":"prompt","placeholder":"prompt","label":"prompt","defaultvalue":"What are some interesting historical events that took place near the Tower of London, and how could they inspire a fictional story?","value":"What are some interesting historical events that took place near the Tower of London, and how could they inspire a fictional story?","minvalue":"","maxvalue":"","incrementby":"","optionsLoad":"","options":[],"note":"Prompt to send to the model.","quick":true},{"advanced":true,"type":"text","class":"is-12","required":false,"rows":"0","id":"user_id","placeholder":"user_id","label":"user_id","defaultvalue":"","value":"","minvalue":"","maxvalue":"","incrementby":"","optionsLoad":"","options":[],"note":"You can leave it blank. The user_id parameter is a unique identifier for the user. 
It is used to store and retrieve the chat history specific to that user. You should provide a value that uniquely identifies the user across different sessions. For example, it can be the user’s email address, username, or a system-generated ID.","quick":true},{"advanced":true,"type":"text","class":"is-12","required":false,"rows":"0","id":"session_id","placeholder":"session_id","label":"session_id","defaultvalue":"","value":"","minvalue":"","maxvalue":"","incrementby":"","optionsLoad":"","options":[],"note":"You can leave it blank. The session_id parameter represents a specific session for a user. It allows you to manage multiple sessions for the same user. If you want to maintain separate chat histories for different sessions of the same user, use a unique session_id for each session. If not specified or kept the same, the system will treat all interactions as part of the same session.","quick":true},{"advanced":true,"type":"textarea","class":"is-12","required":false,"rows":"4","id":"system_prompt","placeholder":"system_prompt","label":"system_prompt","defaultvalue":"You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. \nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.","value":"You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. 
\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.","minvalue":"","maxvalue":"","incrementby":"","optionsLoad":"","options":[],"note":"System prompt to send to the model. This is prepended to the prompt and helps guide system behavior.","quick":false},{"advanced":true,"type":"float","class":"is-12","required":false,"rows":"0","id":"temperature","placeholder":"temperature","label":"temperature","defaultvalue":"0.7","value":"0.7","minvalue":"0","maxvalue":"5","incrementby":"0.01","optionsLoad":"","options":[],"note":"Adjusts randomness of outputs, greater than 1 is random and 0 is deterministic, 0.75 is a good starting value.","quick":false},{"advanced":true,"type":"float","class":"is-12","required":false,"rows":"0","id":"top_p","placeholder":"top_p","label":"top_p","defaultvalue":"0.95","value":"0.95","minvalue":"0","maxvalue":"1","incrementby":"0.01","optionsLoad":"","options":[],"note":"When decoding text, samples from the top p percentage of most likely tokens; lower to ignore less likely tokens.","quick":false},{"advanced":true,"type":"number","class":"is-12","required":false,"rows":"0","id":"top_k","placeholder":"top_k","label":"top_k","defaultvalue":"0","value":"0","minvalue":"0","maxvalue":"100","incrementby":"1","optionsLoad":"","options":[],"note":"When decoding text, samples from the top k most likely tokens; lower to ignore less likely tokens.","quick":false},{"advanced":true,"type":"float","class":"is-12","required":false,"rows":"0","id":"repetition_penalty","placeholder":"repetition_penalty","label":"repetition_penalty","defaultvalue":"1.0","value":"1.0","minvalue":"0","maxvalue":"5","incrementby":"0.01","optionsLoad":"","options":[],"note":"It is a hyperparameter used to reduce the likelihood of the model generating repetitive text by applying a penalty to previously generated tokens, encouraging 
more diverse and coherent output.","quick":false},{"advanced":true,"type":"float","class":"is-12","required":false,"rows":"0","id":"length_penalty","placeholder":"length_penalty","label":"length_penalty","defaultvalue":"1","value":"1","minvalue":"0","maxvalue":"5","incrementby":"0.01","optionsLoad":"","options":[],"note":"A parameter that controls how long the outputs are. If < 1, the model will tend to generate shorter outputs, and > 1 will tend to generate longer outputs.","quick":false},{"advanced":true,"type":"number","class":"is-12","required":false,"rows":"0","id":"max_tokens","placeholder":"max_tokens","label":"max_tokens","defaultvalue":"0","value":"0","minvalue":"0","maxvalue":"800000","incrementby":"1","optionsLoad":"","options":[],"note":"Maximum number of tokens to generate. A word is generally 2-3 tokens.","quick":false},{"advanced":true,"type":"number","class":"is-12","required":false,"rows":"0","id":"min_tokens","placeholder":"min_tokens","label":"min_tokens","defaultvalue":"0","value":"0","minvalue":"0","maxvalue":"2048","incrementby":"1","optionsLoad":"","options":[],"note":"Minimum number of tokens to generate. To disable, set to -1. A word is generally 2-3 tokens.","quick":false},{"advanced":true,"type":"number","class":"is-12","required":false,"rows":"0","id":"max_new_tokens","placeholder":"max_new_tokens","label":"max_new_tokens","defaultvalue":"0","value":"0","minvalue":"0","maxvalue":"4096","incrementby":"1","optionsLoad":"","options":[],"note":"This parameter has been renamed to max_tokens. max_new_tokens only exists for backwards compatibility purposes. We recommend you use max_tokens instead. 
Both may not be specified.","quick":false},{"advanced":true,"type":"number","class":"is-12","required":false,"rows":"0","id":"min_new_tokens","placeholder":"min_new_tokens","label":"min_new_tokens","defaultvalue":"-1","value":"-1","minvalue":"-1","maxvalue":"2048","incrementby":"1","optionsLoad":"","options":[],"note":"This parameter has been renamed to min_tokens. min_new_tokens only exists for backwards compatibility purposes. We recommend you use min_tokens instead. Both may not be specified.","quick":false},{"advanced":true,"type":"text","class":"is-12","required":false,"rows":"0","id":"stop_sequences","placeholder":"stop_sequences","label":"stop_sequences","defaultvalue":"","value":"","minvalue":"","maxvalue":"","incrementby":"","optionsLoad":"","options":[],"note":"A semicolon-separated list of sequences to stop generation at. For example, ';' will stop generation at the first instance of 'end' or ''.","quick":false},{"advanced":true,"type":"text","class":"is-12","required":false,"rows":"0","id":"seed","placeholder":"seed","label":"seed","defaultvalue":"123456","value":"123456","minvalue":"0","maxvalue":"9999999","incrementby":"1","optionsLoad":"","options":[],"note":"seed-help"},{"advanced":true,"type":"checkbox","class":"is-12","required":false,"rows":"0","id":"quantization","placeholder":"quantization","label":"quantization","defaultvalue":"--quantization","value":"--quantization","minvalue":"","maxvalue":"","incrementby":"","optionsLoad":"","options":[],"note":"Quantization is a technique that reduces the precision of model weights (e.g., from FP32 to INT8) to decrease memory usage and improve inference speed. When enabled (true), the model uses less VRAM, making it suitable for resource-constrained environments, but might slightly affect output quality. 
When disabled (false), the model runs at full precision, ensuring maximum accuracy but requiring more GPU memory and running slower."},{"advanced":true,"type":"checkbox","class":"is-12","required":false,"rows":"0","id":"do_sample","placeholder":"do_sample","label":"do_sample","defaultvalue":"--do_sample","value":"--do_sample","minvalue":"","maxvalue":"","incrementby":"","optionsLoad":"","options":[],"note":"The do_sample parameter controls whether the model generates text deterministically or with randomness. For precise tasks like translations or code generation, set do_sample = false to ensure consistent and predictable outputs. For creative tasks like storytelling or poetry, set do_sample = true to allow the model to produce diverse and imaginative results."}]}] samples: ["https://cdn.wiro.ai/uploads/models/microsoft-phi-4-sample-1.txt"] time: 1736807201 marketplace: 1 trainallowed: 0 cps: 0.000000000000 ownercpt: 0.000000000000 onlymembers: 0 tags: ["text generation","transformers","safetensors","english","phi3","phi","nlp","math","code","chat","conversational","custom_code","text-generation-inference","inference endpoints","mit"] averagepoint: 0.00 commentcount: 0 ratedusercount: 0 priority: 0 sourceurl: https://huggingface.co/microsoft/phi-4 sourcedownloadurl: https://huggingface.co/microsoft/phi-4 license: [] licenseupdate: 1778399195 cleanslugowner: microsoft cleanslugproject: phi-4 noindex: 0 modifiedtime: 0 - name: selectedModelPrivate label: select-model-private help: select-model-private-help type: select default: "" options: - name: websiteUrl label: Website URL help: Enter a website URL type: text default: https://www.wikipedia.org/wiki/Football - name: prompt label: prompt help: Prompt to send to the model. type: textarea default: When was the first football game played? - name: user_id label: user_id help: You can leave it blank. The user_id parameter is a unique identifier for the user. 
It is used to store and retrieve the chat history specific to that user. You should provide a value that uniquely identifies the user across different sessions. For example, it can be the user’s email address, username, or a system-generated ID. type: text default: - name: session_id label: session_id help: You can leave it blank. The session_id parameter represents a specific session for a user. It allows you to manage multiple sessions for the same user. If you want to maintain separate chat histories for different sessions of the same user, use a unique session_id for each session. If not specified or kept the same, the system will treat all interactions as part of the same session. type: text default: - name: system_prompt label: system_prompt help: System prompt to send to the model. This is prepended to the prompt and helps guide system behavior. type: textarea default: You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information. - name: temperature label: temperature help: Adjusts randomness of outputs, greater than 1 is random and 0 is deterministic, 0.75 is a good starting value. type: float default: 0.7 - name: top_p label: top_p help: When decoding text, samples from the top p percentage of most likely tokens; lower to ignore less likely tokens. type: float default: 0.90 - name: top_k label: top_k help: When decoding text, samples from the top k most likely tokens; lower to ignore less likely tokens. 
type: number default: 50 - name: chunk_size label: chunk_size help: Defines the size (number of tokens) of each chunk of text that is retrieved and processed by the model during the retrieval phase. Larger values may improve context, but can increase processing time and memory usage. type: number default: 256 - name: chunk_overlap label: chunk_overlap help: Specifies how many tokens from the previous chunk should overlap with the next chunk to ensure continuity and avoid missing important context. Higher overlap ensures smoother transitions but may result in redundant processing. type: number default: 25 - name: similarity_top_k label: similarity_top_k help: Determines the number of most similar documents or chunks to retrieve based on their similarity scores. A higher value increases the amount of context provided but may also introduce irrelevant information. type: number default: 5 - name: context_window label: context_window help: Use 0 to set max limit of the model. Specifies the maximum number of tokens a language model can process at once, including both the input query and retrieved chunks, ensuring the model operates within its token limit. type: number default: 0 - name: max_new_tokens label: max_new_tokens help: Use 0 to set dynamic response limit. Specifies the maximum number of tokens that the language model is allowed to generate in response to a query. This parameter controls the length of the model’s output, helping to prevent overly long or incomplete responses. type: number default: 0 - name: seed label: seed help: seed-help type: text default: 123456 - name: quantization label: quantization help: Quantization is a technique that reduces the precision of model weights (e.g., from FP32 to INT8) to decrease memory usage and improve inference speed. When enabled (true), the model uses less VRAM, making it suitable for resource-constrained environments, but might slightly affect output quality. 
When disabled (false), the model runs at full precision, ensuring maximum accuracy but requiring more GPU memory and running slower. type: checkbox default: --quantization - name: do_sample label: do_sample help: The do_sample parameter controls whether the model generates text deterministically or with randomness. For precise tasks like translations or code generation, set do_sample = false to ensure consistent and predictable outputs. For creative tasks like storytelling or poetry, set do_sample = true to allow the model to produce diverse and imaginative results. type: checkbox default: --do_sample

## Integration Header Prepare

```bash
# Sign up on the Wiro dashboard and create a project
export YOUR_API_KEY="{{useSelectedProjectAPIKey}}";
export YOUR_API_SECRET="XXXXXXXXX";

# Unix time or any random integer value
export NONCE=$(date +%s);

# HMAC-SHA256 of (YOUR_API_SECRET + NONCE), keyed with YOUR_API_KEY.
# awk keeps only the hex digest: openssl typically prefixes its output with a label such as "(stdin)= ".
export SIGNATURE="$(echo -n "${YOUR_API_SECRET}${NONCE}" | openssl dgst -sha256 -hmac "${YOUR_API_KEY}" | awk '{print $NF}')";
```

## Run Command - Make HTTP Post Request

```bash
curl -X POST "https://api.wiro.ai/v1/Run/wiro/rag-chat-website" -H "Content-Type: multipart/form-data" -H "x-api-key: ${YOUR_API_KEY}" -H "x-nonce: ${NONCE}" -H "x-signature: ${SIGNATURE}" -d '{ "selectedModel": "", "selectedModelPrivate": "", "websiteUrl": "https://www.wikipedia.org/wiki/Football", "prompt": "When was the first football game played?", "user_id": "", "session_id": "", "system_prompt": "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. \nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. 
If you don't know the answer to a question, please don't share false information.", "temperature": "0.7", "top_p": "0.90", "top_k": 50, "chunk_size": 256, "chunk_overlap": 25, "similarity_top_k": 5, "context_window": 0, "max_new_tokens": 0, "seed": "123456", "quantization": "--quantization", "do_sample": "--do_sample", "callbackUrl": "You can provide a callback URL; Wiro will send a POST request to it when the task is completed." }'; ``` ## Run Command - Response ```json { "errors": [], "taskid": "2221", "socketaccesstoken": "eDcCm5yyUfIvMFspTwww49OUfgXkQt", "result": true } ``` ## Get Task Detail - Make HTTP Post Request with Task ID ```bash curl -X POST "https://api.wiro.ai/v1/Task/Detail" -H "Content-Type: multipart/form-data" -H "x-api-key: ${YOUR_API_KEY}" -H "x-nonce: ${NONCE}" -H "x-signature: ${SIGNATURE}" -d '{ "taskid": "2221" }'; ``` ## Get Task Detail - Make HTTP Post Request with Socket Access Token ```bash curl -X POST "https://api.wiro.ai/v1/Task/Detail" -H "Content-Type: multipart/form-data" -H "x-api-key: ${YOUR_API_KEY}" -H "x-nonce: ${NONCE}" -H "x-signature: ${SIGNATURE}" -d '{ "tasktoken": "eDcCm5yyUfIvMFspTwww49OUfgXkQt" }'; ``` ## Get Task Detail - Response ```json { "total": "1", "errors": [], "tasklist": [ { "id": "2221", "uuid": "15bce51f-442f-4f44-a71d-13c6374a62bd", "socketaccesstoken": "eDcCm5yyUfIvMFspTwww49OUfgXkQt", "parameters": {}, "debugoutput": "", "debugerror": "", "starttime": "1734513809", "endtime": "1734513813", "elapsedseconds": "6.0000", "status": "task_postprocess_end", "createtime": "1734513807", "canceltime": "0", "assigntime": "1734513807", "accepttime": "1734513807", "preprocessstarttime": "1734513807", "preprocessendtime": "1734513807", "postprocessstarttime": "1734513813", "postprocessendtime": "1734513814", "outputs": [ { "id": "6bc392c93856dfce3a7d1b4261e15af3", "name": "0.png", "contenttype": "image/png", "parentid": "6c1833f39da71e6175bf292b18779baf", "uuid": "15bce51f-442f-4f44-a71d-13c6374a62bd", "size": 
"202472", "addedtime": "1734513812", "modifiedtime": "1734513812", "accesskey": "dFKlMApaSgMeHKsJyaDeKrefcHahUK", "url": "https://cdn1.wiro.ai/6a6af820-c5050aee-40bd7b83-a2e186c6-7f61f7da-3894e49c-fc0eeb66-9b500fe2/0.png" } ], "size": "202472" } ], "result": true } ``` ## Kill Task - Make HTTP Post Request with Task ID ```bash curl -X POST "https://api.wiro.ai/v1/Task/Kill" -H "Content-Type: multipart/form-data" -H "x-api-key: ${YOUR_API_KEY}" -H "x-nonce: ${NONCE}" -H "x-signature: ${SIGNATURE}" -d '{ "taskid": "534574" }'; ``` ## Kill Task - Make HTTP Post Request with Socket Access Token ```bash curl -X POST "https://api.wiro.ai/v1/Task/Kill" -H "Content-Type: multipart/form-data" -H "x-api-key: ${YOUR_API_KEY}" -H "x-nonce: ${NONCE}" -H "x-signature: ${SIGNATURE}" -d '{ "socketaccesstoken": "ZpYote30on42O4jjHXNiKmrWAZqbRE" }'; ``` ## Kill Task - Response ```json { "errors": [], "tasklist": [ { "id": "534574", "uuid": "15bce51f-442f-4f44-a71d-13c6374a62bd", "name": "", "socketaccesstoken": "ZpYote30on42O4jjHXNiKmrWAZqbRE", "parameters": { "inputImage": "https://api.wiro.ai/v1/File/mCmUXgZLG1FNjjjwmbtPFr2LVJA112/inputImage-6060136.png" }, "debugoutput": "", "debugerror": "", "starttime": "1734513809", "endtime": "1734513813", "elapsedseconds": "6.0000", "status": "task_cancel", "cps": "0.000585000000", "totalcost": "0.003510000000", "guestid": null, "projectid": "699", "modelid": "598", "description": "", "basemodelid": "0", "runtype": "model", "modelfolderid": "", "modelfileid": "", "callbackurl": "", "marketplaceid": null, "createtime": "1734513807", "canceltime": "0", "assigntime": "1734513807", "accepttime": "1734513807", "preprocessstarttime": "1734513807", "preprocessendtime": "1734513807", "postprocessstarttime": "1734513813", "postprocessendtime": "1734513814", "pexit": "0", "categories": "["tool","image-to-image","quick-showcase","compare-landscape"]", "outputs": [ { "id": "6bc392c93856dfce3a7d1b4261e15af3", "name": "0.png", "contenttype": "image/png", 
"parentid": "6c1833f39da71e6175bf292b18779baf", "uuid": "15bce51f-442f-4f44-a71d-13c6374a62bd", "size": "202472", "addedtime": "1734513812", "modifiedtime": "1734513812", "accesskey": "dFKlMApaSgMeHKsJyaDeKrefcHahUK", "foldercount": "0", "filecount": "0", "ispublic": 0, "expiretime": null, "url": "https://cdn1.wiro.ai/6a6af820-c5050aee-40bd7b83-a2e186c6-7f61f7da-3894e49c-fc0eeb66-9b500fe2/0.png" } ], "size": "202472" } ], "result": true } ``` ## Cancel Task - Make HTTP Post Request (For tasks on queue) ```bash curl -X POST "https://api.wiro.ai/v1/Task/Cancel" -H "Content-Type: multipart/form-data" -H "x-api-key: ${YOUR_API_KEY}" -H "x-nonce: ${NONCE}" -H "x-signature: ${SIGNATURE}" -d '{ "taskid": "634574" }'; ``` ## Cancel Task - Response ```json { "errors": [], "tasklist": [ { "id": "634574", "uuid": "15bce51f-442f-4f44-a71d-13c6374a62bd", "name": "", "socketaccesstoken": "ZpYote30on42O4jjHXNiKmrWAZqbRE", "parameters": { "inputImage": "https://api.wiro.ai/v1/File/mCmUXgZLG1FNjjjwmbtPFr2LVJA112/inputImage-6060136.png" }, "debugoutput": "", "debugerror": "", "starttime": "1734513809", "endtime": "1734513813", "elapsedseconds": "6.0000", "status": "task_cancel", "cps": "0.000585000000", "totalcost": "0.003510000000", "guestid": null, "projectid": "699", "modelid": "598", "description": "", "basemodelid": "0", "runtype": "model", "modelfolderid": "", "modelfileid": "", "callbackurl": "", "marketplaceid": null, "createtime": "1734513807", "canceltime": "0", "assigntime": "1734513807", "accepttime": "1734513807", "preprocessstarttime": "1734513807", "preprocessendtime": "1734513807", "postprocessstarttime": "1734513813", "postprocessendtime": "1734513814", "pexit": "0", "categories": "["tool","image-to-image","quick-showcase","compare-landscape"]", "outputs": [ { "id": "6bc392c93856dfce3a7d1b4261e15af3", "name": "0.png", "contenttype": "image/png", "parentid": "6c1833f39da71e6175bf292b18779baf", "uuid": "15bce51f-442f-4f44-a71d-13c6374a62bd", "size": "202472", 
"addedtime": "1734513812", "modifiedtime": "1734513812", "accesskey": "dFKlMApaSgMeHKsJyaDeKrefcHahUK", "foldercount": "0", "filecount": "0", "ispublic": 0, "expiretime": null, "url": "https://cdn1.wiro.ai/6a6af820-c5050aee-40bd7b83-a2e186c6-7f61f7da-3894e49c-fc0eeb66-9b500fe2/0.png" } ], "size": "202472" } ], "result": true } ``` ## Task Status Information This section defines the possible task status values returned by the API when polling for task completion. ### Completed Task Statuses (Polling can stop) These indicate that the task has reached a terminal state — either success or failure. Once any of these is received, polling should stop. - task_postprocess_end : Task completed successfully and post-processing is done. - task_cancel : Task was cancelled by the user or system. ### Running Task Statuses (Continue polling) These statuses indicate that the task is still in progress. Polling should continue if one of these is returned. - task_queue : Task is waiting in the queue. - task_accept : Task has been accepted for processing. - task_assign : Task is being assigned to a worker. - task_preprocess_start : Preprocessing is starting. - task_preprocess_end : Preprocessing is complete. - task_start : Task execution has started. - task_output : Output is being generated.